CN111949317B - Instruction processing method and device and related product - Google Patents

Instruction processing method and device and related product

Info

Publication number
CN111949317B
CN111949317B (application CN201910412708.0A)
Authority
CN
China
Prior art keywords
target
instruction
tensor
assigned
target tensor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910412708.0A
Other languages
Chinese (zh)
Other versions
CN111949317A (en
Inventor
Name not disclosed at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910412708.0A priority Critical patent/CN111949317B/en
Priority to PCT/CN2020/088248 priority patent/WO2020233387A1/en
Publication of CN111949317A publication Critical patent/CN111949317A/en
Application granted granted Critical
Publication of CN111949317B publication Critical patent/CN111949317B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G06F9/30029 Logical and Boolean instructions, e.g. XOR, NOT
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure relates to an instruction processing method, an instruction processing device, and related products. The machine learning arithmetic device comprises one or more instruction processing devices, which acquire data to be operated on and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface. When the machine learning arithmetic device includes a plurality of instruction processing devices, these devices can be connected to one another in a specific configuration to transfer data: they are interconnected through a Peripheral Component Interconnect Express (PCIE) bus and transmit data through it; they share a single control system or have their own control systems; they share a memory or have their own memories; and their interconnection may follow any interconnection topology. The instruction processing method, instruction processing device, and related products provided by the embodiments of the disclosure have a wide application range and a high instruction processing efficiency and speed.

Description

Instruction processing method and device and related product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an instruction processing method and apparatus for implementing memory assignment, and a related product.
Background
With the continuous development of science and technology, machine learning algorithms such as neural network algorithms have come into wide use and perform well in fields such as image recognition, speech recognition, and natural language processing. The widespread use of big-data operations and machine learning algorithms places greater demands on data storage and computation. How to assign values to, or initialize, the memory space of a computing device has therefore become a research focus.
Disclosure of Invention
In view of this, the present disclosure provides an instruction processing method and apparatus for implementing memory assignment, and a related product.
According to a first aspect of the present disclosure, there is provided an instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the compiled memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction, and obtaining a storage address of a target tensor, the number of elements to be assigned in the target tensor and a target value according to the operation code and the operation domain; the operation domain comprises a storage address of a target tensor, the number of elements to be assigned in the target tensor and the target value; the storage space pointed by the storage address of the target tensor is an on-chip storage space of the instruction processing device;
and the processing module is used for setting the elements to be assigned in the target tensor to the target value according to the storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value.
According to a second aspect of the present disclosure, there is provided a machine learning arithmetic device, the device including:
one or more instruction processing devices according to the first aspect, configured to acquire data to be operated on and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of instruction processing devices, the instruction processing devices can be connected through a specific structure and transmit data;
the command processing devices are interconnected through a PCIE bus which is a bus for interconnecting fast external equipment and transmit data so as to support operation of machine learning in a larger scale; a plurality of instruction processing devices share the same control system or own respective control systems; the instruction processing devices share a memory or own respective memories; the interconnection mode of the plurality of instruction processing devices is any interconnection topology.
According to a third aspect of the present disclosure, there is provided a combined processing apparatus, the apparatus comprising:
the machine learning arithmetic device, the universal interconnect interface, and the other processing device according to the second aspect;
and the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
According to a fourth aspect of the present disclosure, there is provided a machine learning chip including the machine learning arithmetic device of the second aspect or the combined processing device of the third aspect.
According to a fifth aspect of the present disclosure, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.
According to a sixth aspect of the present disclosure, a board card is provided, which includes the machine learning chip packaging structure of the fifth aspect.
According to a seventh aspect of the present disclosure, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board of the sixth aspect.
According to an eighth aspect of the present disclosure, there is provided an instruction processing method, the method comprising:
analyzing the compiled memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction, and obtaining a storage address of a target tensor, the number of elements to be assigned in the target tensor and a target value according to the operation code and the operation domain; the operation domain comprises a storage address of a target tensor, the number of elements to be assigned in the target tensor and the target value; the storage space pointed by the storage address of the target tensor is an on-chip storage space of the instruction processing device;
and setting, according to the storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value, the elements to be assigned in the target tensor to the target value.
According to a ninth aspect of the present disclosure, there is provided a computer-readable storage medium having stored therein a computer program, the computer program being executable by one or more processors to implement the steps of the instruction processing method described above.
The device comprises a control module and a processing module. The control module parses the compiled memory assignment instruction to obtain the operation code and the operation domain of the instruction, and from these obtains the storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value, where the storage space to which the storage address of the target tensor points is an on-chip storage space of the instruction processing device. The processing module then sets the elements to be assigned in the target tensor to the target value according to the storage address, the number of elements to be assigned, and the target value, thereby realizing assignment of the on-chip storage space of the instruction processing device. The instruction processing method, device, and related products therefore process the memory assignment instruction with high efficiency and speed.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a block diagram of an instruction processing apparatus according to an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of memory modules in an instruction processing apparatus according to an embodiment of the present disclosure;
FIGS. 3a-3e show block diagrams of an instruction processing apparatus according to an embodiment of the present disclosure;
FIGS. 4a and 4b show block diagrams of a combined processing device according to an embodiment of the present disclosure;
FIG. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure;
FIG. 6 shows a flow diagram of an instruction processing method according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
The application provides an instruction processing device for implementing and executing a memory assignment instruction. The instruction processing device can execute various instructions, including the memory assignment instruction. The memory assignment instruction may be configured to set the values in a specified interval of an on-chip storage space of the instruction processing device to a target value. Optionally, the memory assignment instruction may include an operation code and an operation domain. The operation code may be used to indicate what operation the instruction performs; in this embodiment of the application, the operation code of the memory assignment instruction indicates that the instruction sets values in a specified storage space, and may further include an identifier of the storage to which the storage address of the target tensor belongs. That is, the memory assignment instruction in this application sets values in a specified storage space of a specific storage. The operation domain of the memory assignment instruction may describe the object information the instruction acts on; specifically, it may indicate target tensor information and a target value, where the target tensor information may include a storage address of the target tensor and the number of elements to be assigned in the target tensor. For example, the operation domain may include three operands: the storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value. The storage space to which the storage address of the target tensor points can be an on-chip storage space of the instruction processing device.
It should be understood that the instruction format of the memory assignment instruction, including the operation code and the positions of the operation code and the operation domain within the instruction, may be set as required by those skilled in the art, and the disclosure is not limited thereto.
As shown in fig. 1 and fig. 2, an embodiment of the present application provides an instruction processing apparatus, which may include a control module 11, a processing module 12, and a storage module 13; optionally, the control module 11 and the processing module 12 may be integrated into a single processor. As shown in fig. 2, the processing module 12 may include at least one computing core (computing cores 11 to 1Q, computing cores 21 to 2Q, computing cores P1 to PQ), and more than one computing core may form a computing core cluster. A computing core is a basic element of the device for performing computation, and may include at least one on-chip storage, an arithmetic unit or module for performing data operations, and the like. In this embodiment, a computing core may also be configured to execute the memory assignment instruction. Further optionally, the processor may be an artificial intelligence processor; the specific structure and workflow of the control module 11 and the processing module 12 of the artificial intelligence processor are described below.
The storage module may include on-chip memory and off-chip memory. Specifically, as shown in fig. 2, the storage module may be connected to the processor, and each computing core of the processor may have its own private on-chip storage. Optionally, the on-chip memory may be a neuron memory for storing scalar data or vector data, such as a neuron random access memory (NRAM). The off-chip memory may be a DDR memory (Double Data Rate SDRAM). One part of the memory space of the DDR serves as a general-purpose memory shared by the computing cores, which may be abbreviated as GDRAM; another part of the DDR memory space may serve as memory private to each computing core, which may be abbreviated as LDRAM.
The control module 11 is configured to parse the compiled memory assignment instruction to obtain the operation code and the operation domain of the memory assignment instruction, and to obtain, according to the operation code and the operation domain, the storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value. The operation domain includes the storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value; further, the storage space to which the storage address of the target tensor points may be an on-chip storage space of the instruction processing device. Optionally, that storage space is the on-chip NRAM of the instruction processing apparatus. Of course, in other embodiments, the storage space of the target tensor can also be an off-chip storage space, such as an LDRAM or a GDRAM.
The processing module 12 is configured to set the elements to be assigned in the target tensor to the target value according to the storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value. Specifically, the processing module 12 may set the values of the first N elements in the storage space to which the storage address of the target tensor points to the target value, where N is the number of elements to be assigned in the target tensor, thereby implementing assignment of the on-chip storage space of the instruction processing device.
Optionally, the target tensor can be neural network data, such as neuron data or weight data of a neural network. Here a tensor refers to data with zero or more dimensions: 0-dimensional tensor data is scalar data, 1-dimensional tensor data is vector data, and 2-dimensional tensor data may be matrix data. That is to say, the memory assignment instruction in the embodiment of the present application can reassign scalar data as well as some of the elements of higher-dimensional tensor data.
Optionally, the target value may be 0. When the target value is 0, the processing module 12 sets the values of the elements to be assigned in the storage space pointed to by the storage address of the target tensor to 0, which is equivalent to performing an initialization operation on the on-chip storage space of the instruction processing device. In other embodiments, the target value may be any chosen value.
The following illustrates an implementation manner of the compiled memory assignment instruction according to the embodiment of the present application:
Nramset.s16 [%r0], 1, 128
where Nramset may represent the operation code of the memory assignment instruction, s16[%r0] represents the storage address of the target tensor, 128 represents the number of elements to be assigned in the target tensor, and 1 represents the target value. The semantics of the above memory assignment instruction are to set the values of the first 128 elements of the tensor data with starting address s16[%r0] in the NRAM to 1.
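As a host-side illustration only (not the patent's hardware implementation), the semantics of the example instruction above can be sketched in C. The function name and the use of int16_t for the 16-bit elements are assumptions:

```c
#include <stdint.h>

/* Sketch of the semantics of the example above: starting at the storage
 * address of the target tensor, set the first elem_num 16-bit elements
 * to the target value. Names and types are illustrative assumptions. */
static void nramset_s16_emulated(int16_t *dst, int32_t elem_num, int16_t value) {
    for (int32_t i = 0; i < elem_num; ++i) {
        dst[i] = value;  /* write the target value element by element */
    }
}
```

Calling this with elem_num = 128 and value = 1 on a zero-initialized buffer leaves elements 0 to 127 equal to 1 and all later elements untouched, matching the instruction semantics described above.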
In this embodiment of the application, the instruction processing device can directly parse the memory assignment instruction to obtain the operation code and the operation domain, and the processing module 12 can set the elements to be assigned in the target tensor in the on-chip storage space to the target value according to the operation code, the operation domain, and related information. The instruction processing device of this embodiment performs the assignment by writing the corresponding storage space directly. Compared with the prior-art approach of first assigning values to data in a specified storage space in off-chip storage and then carrying the newly assigned data to the on-chip storage space, the memory assignment process is simpler and the instruction processing efficiency is higher.
Optionally, the storage address of the target tensor can be the starting address of the storage space where the target tensor is located. The processing module 12 may include a data read-write circuit, which sets, starting from that starting address, as many elements as the number of elements to be assigned in the target tensor to the target value; in other words, the data read-write circuit sets the first N elements of the target tensor, where N is the number of elements to be assigned, to the target value. Optionally, the elements to be assigned may be set to the target value in units of a certain number of bytes.
Optionally, each time it sets a batch of elements to be assigned, the data read-write circuit of the processing module 12 may update the starting address of the target tensor by a preset address offset, the updated starting address being equal to the sum of the current starting address and the address offset. Optionally, the address offset may be a default offset associated with the operation code of the memory assignment instruction; the default offset may be determined in bytes and, specifically, may be a multiple of 64 bytes. Of course, in other embodiments the address offset may also be a multiple of 8, 16, 32, or 128 bytes, and so on; these values are given only by way of illustration and do not limit the address offset in the present application.
Of course, the address offset may also be determined according to the storage space occupied by the element to be assigned, and the address offset may be a multiple of the storage space occupied by the element to be assigned. For example, if the memory space occupied by the element to be assigned is 64 bytes, the address offset is a multiple of 64 bytes.
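The address-update scheme described above can be sketched on the host as follows. The 64-byte step, the chunked write loop, and all names are illustrative assumptions, not the patent's circuit design:

```c
#include <stdint.h>

/* Sketch: write elem_num 16-bit elements in fixed-size steps, advancing
 * the starting address of the target tensor by the preset address offset
 * (here 64 bytes, one of the multiples mentioned above) after each step. */
static void fill_by_offset(int16_t *base, int32_t elem_num, int16_t value) {
    const uint32_t offset_bytes = 64;                               /* preset offset   */
    const int32_t step = (int32_t)(offset_bytes / sizeof(int16_t)); /* 32 elements     */
    uintptr_t addr = (uintptr_t)base;
    int32_t done = 0;
    while (done < elem_num) {
        int16_t *p = (int16_t *)addr;
        int32_t n = (elem_num - done < step) ? (elem_num - done) : step;
        for (int32_t i = 0; i < n; ++i)
            p[i] = value;       /* write one batch of elements          */
        addr += offset_bytes;   /* updated start = current start + offset */
        done += step;
    }
}
```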
Optionally, the number of elements to be assigned may also be expressed as a number of bytes. The number of bytes corresponding to the storage space of the target tensor is an integer multiple of the number of bytes corresponding to the number of elements to be assigned in the target tensor, and the storage space occupied by the elements to be assigned may be a multiple of 64 bytes. Optionally, the number of elements to be assigned is a constant whose data type is int; specifically, it may be int32 (a 4-byte integer), though in other embodiments it may be another data type.
Optionally, the storage address of the target tensor and the target value have the same data type, which is one of half (a floating point data type), int16 (2-byte signed integer), uint16 (2-byte unsigned integer), int32 (4-byte signed integer), and uint32 (4-byte unsigned integer).
Optionally, the operation code includes a data type, which may be half (a floating point data type), short, or int. Further optionally, the data type in the operation code corresponds to the data type of the storage address of the target tensor and of the target value in the operation domain. Specifically, when the data type in the operation code is half, the storage address of the target tensor and the target value are of type half. When the data type in the operation code is short, they are of type int16 (2-byte signed integer) or uint16 (2-byte unsigned integer). When the data type in the operation code is int, they are of type int32 (4-byte signed integer) or uint32 (4-byte unsigned integer). Further optionally, since the memory assignment instruction in the present application performs assignment on a specified storage space such as the on-chip NRAM, the operation code may further include an identifier of the storage to which the on-chip storage space belongs. For example, the operation code of the memory assignment instruction may be nramset_half, nramset_short, or nramset_int, among others.
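The opcode/data-type correspondence just described can be summarized as a small lookup. The helper function itself is an illustrative assumption, not part of the instruction set; it only encodes the element widths stated above:

```c
#include <string.h>
#include <stddef.h>

/* Sketch of the opcode / data-type correspondence described above:
 * nramset_half operates on 2-byte half values, nramset_short on 2-byte
 * signed or unsigned integers, and nramset_int on 4-byte signed or
 * unsigned integers. Returns the element width in bytes, 0 if unknown. */
static size_t element_width_for_opcode(const char *opcode) {
    if (strcmp(opcode, "nramset_half") == 0)  return 2; /* half           */
    if (strcmp(opcode, "nramset_short") == 0) return 2; /* int16 / uint16 */
    if (strcmp(opcode, "nramset_int") == 0)   return 4; /* int32 / uint32 */
    return 0;                                           /* unknown opcode */
}
```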
Further alternatively, the memory assignment instruction may be encapsulated as a function. For example, when the data type in the opcode is half, the function can be expressed as:
nramset_half(half *dst, int32_t elem_num, half value);
where nramset_half represents the operation code, half *dst represents the storage address of the target tensor, int32_t elem_num represents the number of elements to be assigned in the target tensor, and half value represents the target value. The number of elements to be assigned in the target tensor is an integer, and the data types of the storage address of the target tensor and of the target value are consistent with the data type in the operation code.
For another example, when the data type in the opcode is short, the function can be expressed as:
nramset_short(int16_t *dst, int32_t elem_num, int16_t value); or
nramset_short(uint16_t *dst, int32_t elem_num, uint16_t value).
where nramset_short represents the operation code, int16_t *dst or uint16_t *dst represents the storage address of the target tensor, int32_t elem_num represents the number of elements to be assigned in the target tensor, and int16_t value or uint16_t value represents the target value. The number of elements to be assigned in the target tensor is an integer, and the data types of the storage address of the target tensor and of the target value correspond to the data type in the operation code.
For another example, when the data type in the opcode is int, the function can be expressed as:
nramset_int(int32_t *dst, int32_t elem_num, int32_t value); or
nramset_int(uint32_t *dst, int32_t elem_num, uint32_t value).
where nramset_int represents the operation code, int32_t *dst or uint32_t *dst represents the storage address of the target tensor, int32_t elem_num represents the number of elements to be assigned in the target tensor, and int32_t value or uint32_t value represents the target value. The number of elements to be assigned in the target tensor is an integer, and the data types of the storage address of the target tensor and of the target value correspond to the data type in the operation code.
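As a host-side illustration of the function encapsulations listed above, the typed variants could share one element-wise fill. Standard C has no half type, and these names and bodies are assumptions for illustration, not the patent's actual implementation:

```c
#include <stdint.h>
#include <string.h>

/* Generic element-wise fill: copy one element's bytes into each of the
 * first elem_num slots of dst. Illustrative sketch only. */
static void fill_elems(void *dst, int32_t elem_num,
                       const void *value, size_t elem_size) {
    unsigned char *p = (unsigned char *)dst;
    for (int32_t i = 0; i < elem_num; ++i)
        memcpy(p + (size_t)i * elem_size, value, elem_size); /* one element */
}

/* Typed wrappers mirroring the signatures above (half omitted, since
 * standard C lacks a half type). */
static void nramset_short_emul(int16_t *dst, int32_t elem_num, int16_t value) {
    fill_elems(dst, elem_num, &value, sizeof value);
}

static void nramset_int_emul(int32_t *dst, int32_t elem_num, int32_t value) {
    fill_elems(dst, elem_num, &value, sizeof value);
}
```

The wrappers make the type correspondence described above explicit: the pointer type of dst and the type of value always match the data type implied by the operation code.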
In the above embodiment, the compiled memory assignment instruction is a hardware instruction that the processor can execute, and the instruction processing apparatus can directly process this compiled hardware instruction to assign values to the target tensor in the specified on-chip storage space. In an alternative embodiment, the memory assignment instruction obtained by the control module is an uncompiled software instruction that hardware cannot execute directly; the control module must first have the uncompiled memory assignment instruction compiled, and only after obtaining the compiled memory assignment instruction can it be parsed. The processing module then performs the assignment operation according to the compiled memory assignment instruction.
Specifically, the instruction processing device further includes a compiler configured to compile the obtained uncompiled memory assignment instruction into a compiled memory assignment instruction. Optionally, the compiled memory assignment instruction may be a binary instruction that the artificial intelligence processor can execute. The control module may receive this binary instruction and perform parsing operations such as decoding on it, so as to obtain hardware instructions executable by at least one processing module. The processing module may then perform the assignment operation on the specified storage space according to the parsed memory assignment instruction; the specific memory assignment process is consistent with the implementation described in the foregoing embodiment and is not repeated here.
Optionally, the compiler may translate the memory assignment instruction into an intermediate code instruction, and assemble the intermediate code instruction to obtain a machine-executable binary instruction; the compiled memory assignment instruction may be this binary instruction. Alternatively, the compiler may be provided separately from the control module and the processing module described above: the control module and the processing module are integrated on the same artificial intelligence processor, while the compiler runs on a general-purpose processor (e.g., a CPU) connected to the artificial intelligence processor.
In an alternative embodiment, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112 and a queue storage sub-module 113, as shown in figs. 3a-3e. The instruction storage submodule 111 is configured to store the compiled memory assignment instruction. The instruction processing sub-module 112 is configured to parse the compiled memory assignment instruction to obtain the operation code and the operation domain of the memory assignment instruction. The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes multiple to-be-executed instructions sequentially arranged according to an execution order, and the multiple to-be-executed instructions may include the compiled memory assignment instruction. In this implementation, the to-be-executed instructions may further include calculation instructions related or unrelated to the memory assignment instruction, which is not limited by this disclosure. According to the embodiment of the application, the execution order of the multiple to-be-executed instructions can be arranged according to their receiving time, priority level, and the like to obtain the instruction queue, so that the instructions can be executed sequentially according to the queue. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
As a further alternative, as shown in figs. 3a-3e, the control module 11 may include a dependency relationship processing sub-module 114. The dependency relationship processing submodule 114 is configured to, when it is determined that a first to-be-executed instruction in the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before it, cache the first to-be-executed instruction in the instruction storage submodule 111, and, after the zeroth to-be-executed instruction has been executed, extract the first to-be-executed instruction from the instruction storage submodule 111 and send it to the processing module 12. The first to-be-executed instruction is associated with the zeroth to-be-executed instruction before it when the first storage address interval, which stores the data required by the first to-be-executed instruction, has an overlapping area with the zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction. Conversely, the two instructions have no association relationship when the first storage address interval and the zeroth storage address interval have no overlapping area. In this way, according to the dependency relationship between the two instructions, the later first to-be-executed instruction is executed only after the earlier zeroth to-be-executed instruction has finished, which guarantees the accuracy of the result.
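The address-interval overlap test described above can be sketched as follows. This is a minimal illustration under the assumption that storage address intervals are half-open byte ranges [start, end); the function and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the dependency test: a later (first) instruction depends on
 * an earlier (zeroth) instruction when their storage address intervals
 * overlap. Intervals are assumed half-open: [start, end). */
static bool has_dependency(uint64_t first_start, uint64_t first_end,
                           uint64_t zeroth_start, uint64_t zeroth_end)
{
    /* Two half-open intervals overlap iff each starts before the other ends. */
    return first_start < zeroth_end && zeroth_start < first_end;
}
```

With this test, [0, 16) and [8, 24) overlap (the first instruction must wait), while [0, 16) and [16, 32) do not (the instructions may proceed independently).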
Each compute core may include a master processing submodule and a plurality of slave processing submodules. As shown in fig. 3a, the processing module 12 may include a master processing sub-module 121 and a plurality of slave processing sub-modules 122. The control module 11 is further configured to analyze the compiled instruction to obtain a plurality of operation instructions, and send the data to be migrated and the plurality of operation instructions to the main processing sub-module 121.
The main processing sub-module 121 is configured to perform preamble processing on the data to be migrated, and perform data transmission and multiple operation instruction transmission with the multiple slave processing sub-modules 122.
And the plurality of slave processing sub-modules 122 are configured to execute intermediate operations in parallel according to the data and the operation instructions transmitted from the master processing sub-module 121 to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing sub-module 121.
The main processing sub-module 121 is further configured to perform subsequent processing on the plurality of intermediate results to obtain the processed data to be migrated, and store the processed data to be migrated at the target address.
It should be noted that, a person skilled in the art may set a connection manner between the master processing sub-module and the multiple slave processing sub-modules according to actual needs to implement configuration setting of the processing module, for example, the configuration of the processing module may be an "H" type configuration, an array type configuration, a tree type configuration, and the like, which is not limited by the present disclosure.
FIG. 3b shows a block diagram of an instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3b, the processing module 12 may further include one or more branch processing sub-modules 123, where the branch processing sub-module 123 is configured to forward data and/or operation instructions between the master processing sub-module 121 and the slave processing sub-module 122. Wherein, the main processing sub-module 121 is connected with one or more branch processing sub-modules 123. Therefore, the main processing sub-module, the branch processing sub-module and the slave processing sub-module in the processing module are connected by adopting an H-shaped structure, and data and/or operation instructions are forwarded by the branch processing sub-module, so that the resource occupation of the main processing sub-module is saved, and the instruction processing speed is further improved.
FIG. 3c shows a block diagram of an instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3c, a plurality of slave processing sub-modules 122 are distributed in an array.
Each slave processing sub-module 122 is connected to the other adjacent slave processing sub-modules 122, and the master processing sub-module 121 is connected to k of the plurality of slave processing sub-modules 122, the k slave processing sub-modules 122 being: the n slave processing sub-modules 122 in row 1, the n slave processing sub-modules 122 in row m, and the m slave processing sub-modules 122 in column 1.
As shown in fig. 3c, the k slave processing sub-modules include only the n slave processing sub-modules in row 1, the n slave processing sub-modules in row m, and the m slave processing sub-modules in column 1; that is, the k slave processing sub-modules are the slave processing sub-modules directly connected to the master processing sub-module. The k slave processing sub-modules are used for forwarding data and instructions between the master processing sub-module and the remaining slave processing sub-modules. Distributing the plurality of slave processing sub-modules in an array in this way can increase the speed at which the master processing sub-module sends data and/or operation instructions to the slave processing sub-modules, and thus increase the instruction processing speed.
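The count of directly connected slave sub-modules can be illustrated for a concrete m x n array. This sketch counts each sub-module once as a set union of row 1, row m, and column 1; the text does not state how the shared corner modules are counted, so the union counting is an assumption.

```c
/* Illustrative count of the k slave processing sub-modules directly
 * connected to the master in an m x n array: those in row 1, row m,
 * or column 1, each counted once (set union; the two corner modules
 * of column 1 also belong to rows 1 and m). */
static int directly_connected(int m, int n)
{
    int k = 0;
    for (int row = 1; row <= m; ++row)
        for (int col = 1; col <= n; ++col)
            if (row == 1 || row == m || col == 1)
                ++k;
    return k;
}
```

For a 4 x 5 array, for example, this gives k = 5 + 5 + 4 - 2 = 12 directly connected slave sub-modules.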
FIG. 3d shows a block diagram of an instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3d, the processing module may further include a tree sub-module 124. The tree submodule 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master processing submodule 121, and the plurality of branch ports 402 are connected to the plurality of slave processing submodules 122, respectively. The tree sub-module 124 has a transceiving function, and is used for forwarding data and/or operation instructions between the master processing sub-module 121 and the slave processing sub-module 122. Therefore, the processing modules are connected in a tree-shaped structure under the action of the tree-shaped sub-modules, and the speed of sending data and/or operation instructions to the slave processing sub-modules by the main processing sub-module can be increased by utilizing the forwarding function of the tree-shaped sub-modules, so that the processing speed of the instructions is increased.
In one possible implementation, the tree submodule 124 is an optional component of the apparatus and may include at least one level of nodes. The nodes are line structures with a forwarding function and do not themselves have an operation function. The lowest-level nodes are connected to the slave processing sub-modules to forward data and/or operation instructions between the master processing sub-module 121 and the slave processing sub-modules 122. In particular, if the tree submodule has zero levels of nodes, the apparatus does not require the tree submodule.
In one possible implementation, the tree submodule 124 may include a plurality of nodes of an n-ary tree structure, and the plurality of nodes of the n-ary tree structure may have a plurality of layers. For example, fig. 3e shows a block diagram of an instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3e, the n-ary tree structure may be a binary tree structure, with the tree sub-modules comprising 2 levels of nodes 01. The lowest node 01 is connected with the slave processing submodule 122 to forward data and/or operation instructions between the master processing submodule 121 and the slave processing submodule 122. In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. The number of n in the n-ary tree structure and the number of layers of nodes in the n-ary tree structure may be set by those skilled in the art as needed, and the disclosure is not limited thereto.
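The fan-out of such a tree sub-module can be sketched numerically. Assuming a full n-ary tree (an assumption for illustration; the text does not require the tree to be full), L levels of nodes yield n^L lowest-level nodes, each connectable to a slave processing sub-module.

```c
/* Number of lowest-level nodes in a full n-ary tree with `levels`
 * levels of nodes: n^levels. "Full tree" is an illustrative assumption. */
static int lowest_level_nodes(int n, int levels)
{
    int count = 1;
    for (int i = 0; i < levels; ++i)
        count *= n;  /* each level multiplies the fan-out by n */
    return count;
}
```

Under this assumption, the binary tree of fig. 3e with 2 levels of nodes would connect 2^2 = 4 slave processing sub-modules, and a ternary tree with 2 levels would connect 9.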
In the embodiments provided in the present disclosure, it should be understood that the disclosed system and apparatus may be implemented in other ways. For example, the above-described embodiments of systems and apparatuses are merely illustrative, and for example, a division of a device, an apparatus, and a module is merely a logical division, and an actual implementation may have another division, for example, a plurality of modules may be combined or integrated into another system or apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices, apparatuses or modules, and may be an electrical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present disclosure may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a form of hardware or a form of a software program module.
The integrated modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, which can store program codes.
The present disclosure provides a machine learning arithmetic device, which may include one or more of the above-described instruction processing devices, and is configured to acquire data to be migrated and control information from other processing devices and execute a specified machine learning arithmetic operation. The machine learning arithmetic device can obtain the memory assignment instruction from other machine learning arithmetic devices or non-machine-learning arithmetic devices, and transmit the execution result to peripheral equipment (also called other processing devices) through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one instruction processing device is included, the instruction processing devices can be linked and transmit data through a specific structure, for example, interconnected and transmitting data through a PCIE bus, so as to support larger-scale neural network operations. In this case, the devices may share the same control system or have separate control systems, and may share memory or have a separate memory for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 4a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in fig. 4a, the combined processing device includes the machine learning arithmetic device, the universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.
Other processing devices include one or more types of general-purpose/special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing data transport and basic control such as starting and stopping the machine learning arithmetic device; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.
And the universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; control instructions can be obtained from other processing devices and written into a control cache on a machine learning arithmetic device chip; the data in the storage module of the machine learning arithmetic device can also be read and transmitted to other processing devices.
Fig. 4b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in fig. 4b, the combined processing device may further include a storage device, and the storage device is connected to the machine learning operation device and the other processing device respectively. The storage device is used for storing data stored in the machine learning arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device can serve as the SOC (system on chip) of equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
The present disclosure provides a machine learning chip, which includes the above machine learning arithmetic device or combined processing device.
The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in fig. 5, the board includes the above machine learning chip package structure or the above machine learning chip. The board may include, in addition to the machine learning chip 389, other kits including, but not limited to: memory device 390, interface device 391 and control device 392.
Memory device 390 is coupled via a bus to a machine learning chip 389 (or a machine learning chip within a machine learning chip package structure) for storing data. Memory device 390 may include multiple sets of memory cells 393. Each group 393 of memory cells is coupled to a machine learning chip 389 via a bus. It is understood that each group 393 of memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM.
In one embodiment, memory device 390 may include 4 groups 393 of memory cells. Each group of memory cells 393 may include multiple DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, in which 64 bits are used for data transmission and 8 bits for ECC checking. It is understood that when DDR4-3200 chips are used in each group of memory cells 393, the theoretical bandwidth of data transfer can reach 25600 MB/s.
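The 25600 MB/s figure follows directly from the quoted configuration: DDR4-3200 performs 3200 megatransfers per second, and the 64-bit data path moves 8 bytes per transfer. A one-line check:

```c
/* Theoretical DDR bandwidth in MB/s: megatransfers per second times
 * bytes moved per transfer (bus width in bits divided by 8). */
static int ddr_bandwidth_mb_s(int megatransfers_per_s, int bus_bits)
{
    return megatransfers_per_s * (bus_bits / 8);
}
```

For DDR4-3200 on the 64-bit data portion, this gives 3200 * 8 = 25600 MB/s per group, matching the figure above.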
In one embodiment, each group 393 includes a plurality of double data rate synchronous dynamic random access memories (DDR SDRAM) arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389, for controlling the data transfer and data storage of each memory unit 393.
Interface device 391 is electrically coupled to machine learning chip 389 (or a machine learning chip within a machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface: the data to be processed is transmitted from the server to the machine learning chip 389 through the standard PCIE interface, so as to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface; the disclosure does not limit the specific form of the other interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is still transmitted back to the external device (e.g., the server) by the interface device.
The control device 392 is electrically connected to the machine learning chip 389 and is used to monitor its state. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a single-chip microcomputer (MCU). The machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may thus carry multiple loads; it can therefore be in different operating states such as heavy load and light load. The control device can regulate the working states of the plurality of processing chips, processing cores, and/or processing circuits in the machine learning chip.
The present disclosure provides an electronic device, which includes the above machine learning chip or board card.
The electronic device may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an aircraft, a ship, and/or a vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
FIG. 6 shows a flow diagram of an instruction processing method according to an embodiment of the present disclosure. As shown in fig. 6, the method can be applied to the above-described instruction processing apparatus. The instruction processing method comprises the following operations:
s600, analyzing the compiled memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction, and obtaining a storage address of a target tensor, the number of elements to be assigned in the target tensor and a target value according to the operation code and the operation domain; the operation domain comprises a storage address of a target tensor, the number of elements to be assigned in the target tensor and the target value; a storage space pointed by the storage address of the target tensor is an on-chip storage space of the instruction processing device;
s610, according to the storage address of the target tensor, the number of the elements to be assigned in the target tensor and the target value, taking the target value as the value of the elements to be assigned in the target tensor.
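Steps S600 and S610 can be sketched as a minimal software model. The struct layout and field names below are illustrative assumptions, not the actual instruction encoding: a decoded memory assignment instruction is assumed to carry the operation domain (storage address of the target tensor, number of elements to be assigned, and target value), and executing it performs the assignment of S610.

```c
#include <stdint.h>

/* Hypothetical decoded form of a memory assignment instruction's
 * operation domain (S600). Field names are illustrative assumptions. */
typedef struct {
    int32_t *dst;      /* storage address of the target tensor (on-chip) */
    int32_t elem_num;  /* number of elements to be assigned */
    int32_t value;     /* target value */
} mem_assign_op;

/* S610: take the target value as the value of each element to be
 * assigned in the target tensor. */
static void execute_mem_assign(const mem_assign_op *op)
{
    for (int32_t i = 0; i < op->elem_num; ++i)
        op->dst[i] = op->value;
}
```

A decoder (not shown) would fill mem_assign_op from the parsed operation code and operation domain before calling execute_mem_assign.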
Optionally, the storage space pointed to by the storage address of the target tensor is an on-chip NRAM, and the NRAM is used for storing scalar or tensor data.
Optionally, the number of the elements to be assigned is a constant of data type int.
Optionally, the data stored at the storage address of the target tensor and the target value have the same data type, which is one of half, int16, uint16, int32, and uint32.
Optionally, the operation code includes a data type;
when the data type identifier in the operation code is half, the data type of the target tensor and the target value is half;
when the data type identifier in the operation code is short, the data type of the target tensor and the target value is int16 or uint16;
and when the data type identifier in the operation code is int, the data type of the target tensor and the target value is int32 or uint32.
Optionally, the operation code further includes an identifier of a memory to which the on-chip memory space belongs.
Optionally, the method further includes:
and compiling the obtained memory assignment instruction to obtain a compiled memory assignment instruction.
It should be noted that for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules are not necessarily required for the disclosure. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, read-Only memories (ROMs), random Access Memories (RAMs), magnetic or optical disks, and the like.
In one embodiment, the present application further provides a computer-readable storage medium, in which a computer program is stored; the computer program, when executed by one or more processors, implements the steps of the method. In particular, the computer program, when executed by one or more processors, performs the steps of:
analyzing the compiled memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction, and obtaining a storage address of a target tensor, the number of elements to be assigned in the target tensor and a target value according to the operation code and the operation domain; the operation domain comprises a storage address of a target tensor, the number of elements to be assigned in the target tensor and the target value; the storage space pointed by the storage address of the target tensor is an on-chip storage space of the instruction processing device;
and according to the storage address of the target tensor, the number of the elements to be assigned in the target tensor and the target value, taking the target value as the value of the elements to be assigned in the target tensor.
The specific implementation of each step in the above embodiment is basically the same as the implementation process of the step in the above method. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The foregoing may be better understood in light of the following clauses:
clause 1: an instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the compiled memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction, and obtaining a storage address of a target tensor, the number of elements to be assigned in the target tensor and a target value according to the operation code and the operation domain; the operation domain comprises a storage address of a target tensor, the number of elements to be assigned in the target tensor and the target value; the storage space pointed by the storage address of the target tensor is an on-chip storage space of the instruction processing device;
and the processing module is used for taking the target value as the value of the element to be assigned in the target tensor according to the storage address of the target tensor, the number of the element to be assigned in the target tensor and the target value.
Clause 2: the apparatus according to clause 1, wherein the processing module includes a data read-write circuit, and the data read-write circuit is configured to set, as the target value, values of first N elements in the target tensor according to a storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value, where N represents the number of elements to be assigned.
Clause 3: the apparatus of clause 1 or 2, wherein the memory address of the target tensor points to a memory space that is an NRAM disposed on-chip, the NRAM for storing scalar or tensor data.
Clause 4: the apparatus according to any one of clauses 1-3, wherein the number of the elements to be assigned is a constant of data type int.
Clause 5: the apparatus according to any one of clauses 1-4, wherein the data stored at the storage address of the target tensor and the target value have the same data type, which is one of half, int16, uint16, int32, and uint32.
Clause 6: the apparatus of any of clauses 1-5, wherein the opcode includes a data type;
when the data type identifier in the operation code is half, the data type of the target tensor and the target value is half;
when the data type identifier in the operation code is short, the data type of the target tensor and the target value is int16 or uint16;
and when the data type identifier in the operation code is int, the data type of the target tensor and the target value is int32 or uint32.
Clause 7: the apparatus of any of clauses 1-6, wherein the operation code further includes an identifier of the memory to which the on-chip storage space belongs.
Clause 8: the apparatus of any of clauses 1-7, wherein the control module comprises:
the instruction storage submodule is used for storing the memory assignment instruction;
the instruction processing submodule is used for analyzing the memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction;
and the queue storage submodule is used for storing an instruction queue, and the instruction queue comprises a plurality of memory assignment instructions which are sequentially arranged according to an execution sequence.
Clause 9: a method of instruction processing, the method comprising:
analyzing the compiled memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction, and obtaining a storage address of a target tensor, the number of elements to be assigned in the target tensor and a target value according to the operation code and the operation domain; the operation domain comprises a storage address of a target tensor, the number of elements to be assigned in the target tensor and the target value; the storage space pointed by the storage address of the target tensor is an on-chip storage space of the instruction processing device;
and according to the storage address of the target tensor, the number of the elements to be assigned in the target tensor and the target value, taking the target value as the value of the elements to be assigned in the target tensor.
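As a rough illustration of the method of clause 9 (not the patented implementation), the semantics of such a memory assignment instruction — parse the operation domain into a storage address, an element count, and a target value, then write the target value into the first N elements of the target tensor — might be sketched as follows; the instruction encoding and field names here are assumptions:

```python
# Hypothetical sketch of a memory assignment ("memset"-like) instruction.
# The encoding and field names are illustrative assumptions, not the
# layout defined in this application.

def parse_instruction(instr):
    """Split an instruction into its operation code and operation domain."""
    opcode = instr["opcode"]                 # e.g. "memset.int"
    addr, num_elements, target_value = instr["domain"]
    return opcode, addr, num_elements, target_value

def execute_memset(memory, instr):
    """Set the first N elements at the target address to the target value."""
    _, addr, n, value = parse_instruction(instr)
    for i in range(n):                       # assign element by element
        memory[addr + i] = value

# Example: on-chip storage modeled as a flat list of elements.
nram = [0] * 8
execute_memset(nram, {"opcode": "memset.int", "domain": (2, 3, 7)})
# nram is now [0, 0, 7, 7, 7, 0, 0, 0]
```

In hardware this assignment would be performed by the data read/write circuit over the on-chip storage space; the element-by-element loop above only models the observable result.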
Clause 10: the method of clause 9, wherein the storage space pointed to by the storage address of the target tensor is an NRAM disposed on-chip, the NRAM being configured to store scalar or tensor data.
Clause 11: the method according to any one of clauses 9-10, wherein the number of the elements to be assigned is a constant, and the data type of the elements to be assigned is int.
Clause 12: the method according to any one of clauses 9-11, wherein the data stored at the storage address of the target tensor and the target value have the same data type, and that data type is one of half, int16, uint16, int32, and uint32.
Clause 13: the method of any of clauses 9-12, wherein the operation code comprises a data type identifier;
when the data type identifier in the operation code is half, the data type of the target tensor stored at the storage address and of the target value is half;
when the data type identifier in the operation code is short, the data type of the target tensor stored at the storage address and of the target value is int16 or uint16;
and when the data type identifier in the operation code is int, the data type of the target tensor stored at the storage address and of the target value is int32 or uint32.
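The correspondence in clause 13 between the data-type identifier carried in the operation code and the permitted element types can be written down as a small lookup table. This sketch is illustrative only; the identifier strings and function names are assumptions:

```python
# Illustrative mapping from the data-type identifier in the operation
# code to the element types it permits (per clause 13). Names are
# assumptions, not the encoding defined in this application.
TYPE_MAP = {
    "half":  ("half",),            # 16-bit floating point
    "short": ("int16", "uint16"),  # 16-bit signed/unsigned integer
    "int":   ("int32", "uint32"),  # 32-bit signed/unsigned integer
}

def allowed_types(type_identifier):
    """Return the element types allowed for a given data-type identifier."""
    try:
        return TYPE_MAP[type_identifier]
    except KeyError:
        raise ValueError(f"unknown data-type identifier: {type_identifier}")
```

For instance, an operation code carrying the identifier short would admit a target tensor and target value of type int16 or uint16.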
Clause 14: the method of any of clauses 9-13, wherein the operation code further comprises an identification of a memory to which the on-chip storage space belongs.
Clause 15: the method of any of clauses 9-14, further comprising:
storing the memory assignment instruction;
analyzing the memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction;
and storing an instruction queue, wherein the instruction queue comprises a plurality of memory assignment instructions which are sequentially arranged according to an execution sequence.
Clause 16: a computer-readable storage medium having stored thereon computer instructions which, when executed by one or more processing devices, perform the steps of the method of any one of clauses 9-15.
The foregoing detailed description of the embodiments of the present application illustrates the principles and implementations of the present application; the above description of the embodiments is provided only to help understand the method and the core concept of the present application. Meanwhile, a person skilled in the art may, in accordance with the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (9)

1. An instruction processing apparatus, characterized in that the apparatus comprises:
the control module is used for analyzing the compiled memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction, and obtaining a storage address of a target tensor, the number of elements to be assigned in the target tensor and a target value according to the operation code and the operation domain; the operation domain comprises a storage address of a target tensor, the number of elements to be assigned in the target tensor and the target value; the storage space pointed by the storage address of the target tensor is an on-chip storage space of the instruction processing device; the operation code also comprises an identifier of a memory to which the on-chip storage space belongs;
the processing module comprises at least one computing core and is used for taking the target value as the value of the element to be assigned in the target tensor according to the storage address of the target tensor, the number of the element to be assigned in the target tensor and the target value; and
the storage module comprises an on-chip storage space, wherein each computing core is provided with corresponding on-chip storage;
the control module and the processing module are integrated into a processor, and the processor is an artificial intelligence processor.
2. The apparatus of claim 1, wherein the processing module includes a data read/write circuit, and the data read/write circuit is configured to set values of first N elements in the target tensor to be the target value according to a storage address of the target tensor, the number of elements to be assigned in the target tensor, and the target value, where N represents the number of elements to be assigned.
3. The apparatus of claim 1, wherein the memory address of the target tensor points to a memory space that is an on-chip NRAM, the NRAM for storing scalar or tensor data.
4. The apparatus according to claim 1, wherein the number of the elements to be assigned is a constant, and the data type of the elements to be assigned is int.
5. The apparatus according to claim 1, wherein the data stored at the storage address of the target tensor and the target value have the same data type, and that data type is one of half, int16, uint16, int32, and uint32.
6. The apparatus of claim 5, wherein the operation code comprises a data type identifier;
when the data type identifier in the operation code is half, the data type of the target tensor stored at the storage address and of the target value is half;
when the data type identifier in the operation code is short, the data type of the target tensor stored at the storage address and of the target value is int16 or uint16;
and when the data type identifier in the operation code is int, the data type of the target tensor stored at the storage address and of the target value is int32 or uint32.
7. The apparatus of any of claims 1-6, wherein the control module comprises:
the instruction storage submodule is used for storing the compiled memory assignment instruction;
the instruction processing submodule is used for analyzing the compiled memory assignment instruction to obtain an operation code and an operation domain of the compiled memory assignment instruction;
and the queue storage submodule is used for storing an instruction queue, and the instruction queue comprises a plurality of compiled memory assignment instructions which are sequentially arranged according to an execution sequence.
8. A method of instruction processing, the method comprising:
the control module analyzes the compiled memory assignment instruction to obtain an operation code and an operation domain of the memory assignment instruction, and obtains a storage address of a target tensor, the number of elements to be assigned in the target tensor and a target value according to the operation code and the operation domain; the operation domain comprises a storage address of a target tensor, the number of elements to be assigned in the target tensor and the target value; the storage space pointed by the storage address of the target tensor is an on-chip storage space of the instruction processing device; the operation code also comprises an identifier of a memory to which the on-chip storage space belongs;
the processing module takes the target value as the value of the element to be assigned in the target tensor according to the storage address of the target tensor, the number of the element to be assigned in the target tensor and the target value;
the processing module comprises at least one computing core, and each computing core is provided with corresponding on-chip storage; the control module and the processing module are integrated into a processor, and the processor is an artificial intelligence processor.
9. A computer-readable storage medium having stored thereon computer instructions which, when executed by one or more processing devices, perform the steps of the method recited in claim 8.
CN201910412708.0A 2019-05-17 2019-05-17 Instruction processing method and device and related product Active CN111949317B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910412708.0A CN111949317B (en) 2019-05-17 2019-05-17 Instruction processing method and device and related product
PCT/CN2020/088248 WO2020233387A1 (en) 2019-05-17 2020-04-30 Command processing method and apparatus, and related products

Publications (2)

Publication Number Publication Date
CN111949317A CN111949317A (en) 2020-11-17
CN111949317B true CN111949317B (en) 2023-04-07

Family

ID=73336100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910412708.0A Active CN111949317B (en) 2019-05-17 2019-05-17 Instruction processing method and device and related product

Country Status (1)

Country Link
CN (1) CN111949317B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949318A (en) * 2019-05-17 2020-11-17 上海寒武纪信息科技有限公司 Instruction processing method and device and related product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653476A (en) * 2014-11-12 2016-06-08 华为技术有限公司 Communication method between data processor and memory equipment, and related device
CN109117415A (en) * 2017-06-26 2019-01-01 上海寒武纪信息科技有限公司 Data-sharing systems and its data sharing method
CN109670586A (en) * 2018-12-29 2019-04-23 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN109684087A (en) * 2018-12-17 2019-04-26 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN109740747A (en) * 2018-12-29 2019-05-10 北京中科寒武纪科技有限公司 Operation method, device and Related product

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7219112B2 (en) * 2001-11-20 2007-05-15 Ip-First, Llc Microprocessor with instruction translator for translating an instruction for storing random data bytes
KR102210997B1 (en) * 2014-03-12 2021-02-02 삼성전자주식회사 Method and apparatus for processing VLIW instruction and method and apparatus for generating instruction for processing VLIW instruction
CN105867847B (en) * 2016-03-28 2018-11-30 龙芯中科技术有限公司 Access control method, apparatus and system
CN109685201B (en) * 2018-12-14 2020-10-30 安徽寒武纪信息科技有限公司 Operation method, device and related product
CN109740730B (en) * 2018-12-14 2020-10-23 安徽寒武纪信息科技有限公司 Operation method, device and related product

Similar Documents

Publication Publication Date Title
CN110096309B (en) Operation method, operation device, computer equipment and storage medium
CN110119807B (en) Operation method, operation device, computer equipment and storage medium
CN111949317B (en) Instruction processing method and device and related product
CN111079909A (en) Operation method, system and related product
CN111966399A (en) Instruction processing method and device and related product
CN111949318A (en) Instruction processing method and device and related product
CN111353595A (en) Operation method, device and related product
CN111338694B (en) Operation method, device, computer equipment and storage medium
CN111290789B (en) Operation method, operation device, computer equipment and storage medium
CN111078285B (en) Operation method, system and related product
CN111339060B (en) Operation method, device, computer equipment and storage medium
CN111079915B (en) Operation method, device and related product
CN111078125B (en) Operation method, device and related product
CN111079914B (en) Operation method, system and related product
CN111966398A (en) Instruction processing method and device and related product
CN112346707A (en) Instruction processing method and device and related product
CN112394990A (en) Floating point to half precision floating point instruction processing device and method and related products
CN112394986A (en) Device and method for processing half-precision floating point to floating point instruction and related products
CN112346784A (en) Instruction processing method and device and related product
CN112394902A (en) Device and method for processing half-precision floating point to floating point instruction and related products
CN111045729A (en) Operation method, device and related product
CN112394993A (en) Half-precision floating point to short shaping instruction processing device and method and related product
CN112394989A (en) Unsigned to half-precision floating point instruction processing device, method and related product
CN111047027A (en) Operation method, device and related product
CN111047028A (en) Operation method, device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant