CN111353125B - Operation method, operation device, computer equipment and storage medium - Google Patents

Operation method, operation device, computer equipment and storage medium

Info

Publication number
CN111353125B
CN111353125B CN201910645052.7A
Authority
CN
China
Prior art keywords
vector
operated
instruction
data
executed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910645052.7A
Other languages
Chinese (zh)
Other versions
CN111353125A (en)
Inventor
Inventor not announced (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to PCT/CN2019/110167 priority Critical patent/WO2020073925A1/en
Publication of CN111353125A publication Critical patent/CN111353125A/en
Application granted granted Critical
Publication of CN111353125B publication Critical patent/CN111353125B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The present disclosure relates to an operation method, an operation device, computer equipment, and a storage medium. The combined processing device involved comprises a machine learning arithmetic device, a universal interconnection interface, and other processing devices; the machine learning arithmetic device interacts with the other processing devices to jointly complete a calculation operation specified by the user. The combined processing device further comprises a storage device, connected to the machine learning arithmetic device and the other processing devices respectively, for storing data of the machine learning arithmetic device and the other processing devices. The operation method, operation device, computer equipment, and storage medium provided by the embodiments of the present disclosure have a wide application range, high operation processing efficiency, and high processing speed.

Description

Operation method, operation device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a loop vector instruction, a computer device, and a storage medium.
Background
With the continuous development of science and technology, machine learning, and neural network algorithms in particular, is used ever more widely, and has been applied successfully in fields such as image recognition, speech recognition, and natural language processing. However, as the complexity of neural network algorithms keeps increasing, the types and number of data operations involved keep growing as well. In the related art, the efficiency and speed of performing vector-related operations on vector data are low.
Disclosure of Invention
In view of the above, the present disclosure provides a loop vector instruction processing method, apparatus, computer device and storage medium to improve efficiency and speed of performing vector correlation operations on vector data.
According to a first aspect of the present disclosure, there is provided a loop vector instruction processing apparatus, the apparatus comprising:
a control module, configured to parse the obtained loop vector instruction to obtain an operation code and an operation domain of the loop vector instruction, to obtain, according to the operation code and the operation domain, a first vector to be operated on, a second vector to be operated on, and a target address required for executing the loop vector instruction, and to determine the vector operation type of the loop vector instruction; and
an operation module, configured to split the first vector to be operated on into a plurality of split vectors according to the second vector to be operated on, to perform, according to the vector operation type, a vector operation between each split vector and the second vector to be operated on to obtain an operation result, and to store the operation result at the target address,
wherein the operation code indicates that the operation the loop vector instruction performs on data is a loop vector operation, and the operation domain comprises the address of the first vector to be operated on, the address of the second vector to be operated on, and the target address.
According to a second aspect of the present disclosure, there is provided a machine learning arithmetic device, the device including:
one or more of the loop vector instruction processing devices of the first aspect, configured to obtain vectors to be operated on and control information from other processing devices, execute the specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of the circular vector instruction processing devices, the plurality of the circular vector instruction processing devices can be connected through a specific structure and transmit data;
the plurality of loop vector instruction processing devices may be interconnected through a Peripheral Component Interconnect Express (PCIe) bus and transmit data over it to support larger-scale machine learning operations; the plurality of loop vector instruction processing devices may share the same control system or have their own respective control systems; they may share a memory or have their own memories; and the interconnection mode of the plurality of loop vector instruction processing devices may be any interconnection topology.
According to a third aspect of the present disclosure, there is provided a combined processing apparatus, the apparatus comprising:
the machine learning arithmetic device, the universal interconnect interface, and the other processing device according to the second aspect;
and the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
According to a fourth aspect of the present disclosure, there is provided a machine learning chip, which includes the machine learning arithmetic device of the second aspect or the combined processing device of the third aspect.
According to a fifth aspect of the present disclosure, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.
According to a sixth aspect of the present disclosure, a board card is provided, which includes the machine learning chip packaging structure of the fifth aspect.
According to a seventh aspect of the present disclosure, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board of the sixth aspect.
According to an eighth aspect of the present disclosure, there is provided a loop vector instruction processing method, which is applied to a loop vector instruction processing apparatus, the method including:
parsing the obtained loop vector instruction to obtain an operation code and an operation domain of the loop vector instruction, obtaining, according to the operation code and the operation domain, a first vector to be operated on, a second vector to be operated on, and a target address required for executing the loop vector instruction, and determining the vector operation type of the loop vector instruction; and
splitting the first vector to be operated on into a plurality of split vectors according to the second vector to be operated on, performing, according to the vector operation type, a vector operation between each split vector and the second vector to be operated on to obtain an operation result, and storing the operation result at the target address,
wherein the operation code indicates that the operation the loop vector instruction performs on data is a loop vector operation, and the operation domain comprises the address of the first vector to be operated on, the address of the second vector to be operated on, and the target address.
According to a ninth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described loop vector instruction processing method.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The device comprises a control module and an operation module. The control module is configured to parse the obtained loop vector instruction to obtain an operation code and an operation domain of the instruction, to obtain, according to the operation code and the operation domain, the first vector to be operated on, the second vector to be operated on, and the target address required for executing the instruction, and to determine the vector operation type of the instruction. The operation module is configured to split the first vector to be operated on into a plurality of split vectors according to the second vector to be operated on, to perform a vector operation between each split vector and the second vector to be operated on according to the vector operation type to obtain an operation result, and to store the operation result at the target address. The loop vector instruction processing method, device, computer equipment, and storage medium provided by the embodiments of the present disclosure have a wide application range, and provide high processing efficiency and speed both for loop vector instructions and for the operations they perform.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 illustrates a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure.
Figs. 2a to 2f illustrate block diagrams of a loop vector instruction processing apparatus according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating an application scenario of a loop vector instruction processing apparatus according to an embodiment of the present disclosure.
Fig. 4a, 4b show block diagrams of a combined processing device according to an embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
FIG. 6 shows a flow diagram of a method of loop vector instruction processing according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "zero," "first," "second," and the like in the claims, the description, and the drawings of the present disclosure are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Due to the wide use of neural network algorithms, the computing power of computer hardware keeps improving, and the types and number of data operations involved in practical applications keep increasing as well. Programming languages come in many varieties, and at the present stage there is no single instruction, applicable across programming languages, for performing the loop operation of vectors; to implement the loop operation of vectors in different language environments, technicians therefore have to customize one or more instructions for each programming-language environment, which makes vector operations slow and inefficient. The present disclosure provides a loop vector instruction processing method, apparatus, computer device, and storage medium that can realize the loop vector operation with only one instruction, significantly improving the efficiency and speed of performing loop vector operations.
Fig. 1 illustrates a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a control module 11 and an operation module 12.
The control module 11 is configured to parse the obtained loop vector instruction to obtain the operation code and the operation domain of the loop vector instruction, to obtain, according to the operation code and the operation domain, the first vector to be operated on, the second vector to be operated on, and the target address required for executing the loop vector instruction, and to determine the vector operation type of the loop vector instruction. The operation code indicates that the operation the loop vector instruction performs on data is a loop vector operation, and the operation domain comprises the address of the first vector to be operated on, the address of the second vector to be operated on, and the target address.
The operation module 12 is configured to split the first vector to be operated on into a plurality of split vectors according to the second vector to be operated on, to perform a vector operation between each split vector and the second vector to be operated on according to the vector operation type to obtain an operation result, and to store the operation result at the target address.
In this embodiment, the loop vector operation divides a vector with a larger data size into a plurality of split vectors, each with the same data size as another vector with a smaller data size, and then performs the operation of the corresponding vector operation type between each split vector and the other vector to obtain the operation result. The vector operation type indicates the kind of arithmetic or logical operation performed between a split vector and the second vector to be operated on, such as a vector addition operation. The vector operation type can be set by those skilled in the art according to actual needs, and the present disclosure does not limit this.
In this embodiment, performing the vector operation between each split vector and the second vector to be operated on according to the vector operation type yields a plurality of split operation results, one per split vector; these split operation results are stored at the target address as the operation result of the loop vector instruction, that is, as the result of the vector operation between the first vector to be operated on and the second vector to be operated on. The data size of the first vector to be operated on may be an integer multiple of the data size of the second vector to be operated on, which guarantees that every split vector can undergo the vector operation with the second vector to be operated on.
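The splitting and per-slice operation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and the use of Python lists in place of device memory are assumptions.

```python
import operator

def loop_vector_op(first, second, op=operator.add):
    """Sketch of the loop vector operation: split `first` into slices
    the same length as `second`, apply the elementwise operation between
    each slice and `second`, and concatenate the slice results."""
    n = len(second)
    if len(first) % n != 0:
        raise ValueError("first vector length must be an integer "
                         "multiple of second vector length")
    result = []
    for start in range(0, len(first), n):
        chunk = first[start:start + n]          # one split vector
        result.extend(op(a, b) for a, b in zip(chunk, second))
    return result

print(loop_vector_op([1, 2, 3, 4, 5, 6], [10, 20, 30]))
# prints: [11, 22, 33, 14, 25, 36]
```

Passing `op=operator.mul` instead would perform a loop vector multiplication with the same splitting scheme.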
In this embodiment, the control module may obtain the first to-be-operated vector and the second to-be-operated vector from the first to-be-operated vector address and the second to-be-operated vector address, respectively. The control module may obtain instructions and data through a data input output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction or field (usually denoted by a code) specified in the computer program to perform an operation; it is an instruction identifier that tells the device executing the instruction which instruction specifically needs to be executed. The operation domain may be the source of all data required for executing the corresponding instruction, including parameters such as the first vector to be operated on, the second vector to be operated on, the vector operation type, the corresponding operation method, and so on. A loop vector instruction must comprise an operation code and an operation domain, wherein the operation domain comprises at least the address of the first vector to be operated on, the address of the second vector to be operated on, and the target address.
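The opcode-plus-operation-domain structure can be sketched as below. The mnemonic `LVEC`, the field names, and the textual encoding are all hypothetical, since the patent deliberately leaves the concrete instruction format open.

```python
from dataclasses import dataclass

@dataclass
class LoopVectorInstruction:
    opcode: str      # e.g. "LVEC" marks a loop vector operation (assumed mnemonic)
    src1_addr: int   # address of the first vector to be operated on
    src2_addr: int   # address of the second vector to be operated on
    dst_addr: int    # target address for the operation result
    op_type: str     # vector operation type, e.g. "add"

def parse(fields):
    """Decode one instruction from a list of textual fields (sketch)."""
    opcode, src1, src2, dst, op_type = fields
    if opcode != "LVEC":
        raise ValueError("not a loop vector instruction")
    # int(x, 0) accepts "0x..." hex literals as well as plain decimals
    return LoopVectorInstruction(opcode, int(src1, 0), int(src2, 0),
                                 int(dst, 0), op_type)

inst = parse(["LVEC", "0x100", "0x200", "0x300", "add"])
print(inst.src1_addr, inst.dst_addr, inst.op_type)
# prints: 256 768 add
```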
It should be understood that the instruction format of the loop vector instruction and the included opcode and operation domain may be set as desired by those skilled in the art, and the disclosure is not limited thereto.
In this embodiment, the apparatus may include one or more control modules and one or more operation modules, and the numbers of control modules and operation modules may be set according to actual needs, which is not limited by the present disclosure. When the apparatus includes one control module, that control module may receive the loop vector instruction and control one or more operation modules to perform the loop vector operation. When the apparatus includes a plurality of control modules, each control module may receive its own loop vector instruction and control its corresponding one or more operation modules to perform the loop vector operation.
The loop vector instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module parses the obtained loop vector instruction to obtain its operation code and operation domain, obtains, according to the operation code and the operation domain, the first vector to be operated on, the second vector to be operated on, and the target address required for executing the instruction, and determines the vector operation type of the instruction. The operation module splits the first vector to be operated on into a plurality of split vectors according to the second vector to be operated on, performs a vector operation between each split vector and the second vector to be operated on according to the vector operation type to obtain an operation result, and stores the operation result at the target address. The device has a wide application range and processes both loop vector instructions and their operations with high efficiency and speed.
Figure 2a illustrates a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2a, the operation module 12 may include a plurality of vector operators 120. The plurality of vector operators 120 are for performing vector operations corresponding to vector operation types.
In this implementation, the vector operator may include an adder, a divider, a multiplier, a comparator, and the like capable of performing arithmetic operations, logical operations, and the like on the vector. The type and number of vector operators may be set according to the requirements of the size of the data amount of the vector operation to be performed, the type of vector operation, the processing speed and efficiency of the vector operation, and the like, which is not limited by the present disclosure.
Figure 2b illustrates a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2b, the operation module 12 may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122. The main operation sub-module 121 may include a plurality of vector operators (not shown in the drawings).
The main operation sub-module 121 is configured to perform a vector operation by using a plurality of vector operators to obtain an operation result, and store the operation result in a target address.
In one possible implementation, as shown in fig. 2b, the operation module 12 may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122, and the slave operation sub-modules 122 may include a plurality of vector operators (not shown in the figure). The slave operation submodule 122 is configured to perform corresponding vector operations in parallel by using a plurality of included vector operators to obtain operation results, store the operation results in corresponding sub-cache spaces, and send the operation results to the master operation submodule 121. The main operation sub-module 121 is further configured to receive an operation result and store the operation result in a target address.
In this implementation, the control module may determine to execute the currently received vector instruction through the master operation submodule or the plurality of slave operation submodules according to the vector operation type, the task amount of the operation, and the like. For example, when the vector operation type is determined to be a vector addition operation, the main operation submodule may be controlled to perform an operation. When the vector operation type is determined to be the vector multiplication operation, a plurality of slave operation submodules can be controlled to perform operation.
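The control module's choice between the master submodule and the slave submodules can be sketched as a dispatch table. The specific mapping below (addition to the master, multiplication to the slaves) just mirrors the example in the paragraph above; any real mapping would depend on operation type and task size.

```python
# Illustrative mapping only: which vector operation types run where is a
# design choice, not something fixed by the text.
MASTER_OP_TYPES = {"add", "sub"}   # handled by the master operation submodule
SLAVE_OP_TYPES = {"mul", "div"}    # fanned out to the slave operation submodules

def dispatch(op_type):
    """Return which submodule(s) the control module routes the instruction to."""
    if op_type in MASTER_OP_TYPES:
        return "master"
    if op_type in SLAVE_OP_TYPES:
        return "slaves"
    raise ValueError(f"unknown vector operation type: {op_type}")

print(dispatch("add"), dispatch("mul"))
# prints: master slaves
```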
In a possible implementation manner, the control module 11 is further configured to analyze the obtained calculation instruction to obtain an operation domain and an operation code of the calculation instruction, and obtain data to be operated, which is required for executing the calculation instruction, according to the operation domain and the operation code. The operation module 12 is further configured to perform an operation on the data to be operated according to the calculation instruction to obtain a calculation result of the calculation instruction. The operation module may include a plurality of operators for performing operations corresponding to operation types of the calculation instructions.
In this implementation, the calculation instruction may be other instructions for performing arithmetic operations, logical operations, and the like on data such as scalars, vectors, matrices, tensors, and the like, and those skilled in the art may set the calculation instruction according to actual needs, which is not limited by the present disclosure.
In this implementation, the arithmetic unit may include an adder, a divider, a multiplier, a comparator, and the like, which are capable of performing arithmetic operations, logical operations, and the like on data. The type and number of the arithmetic units may be set according to the requirements of the size of the data amount of the arithmetic operation to be performed, the type of the arithmetic operation, the processing speed and efficiency of the arithmetic operation on the data, and the like, which is not limited by the present disclosure.
In a possible implementation manner, the control module 11 is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the data to be operated and the plurality of operation instructions to the main operation sub-module 121.
The master operation sub-module 121 is configured to perform preamble processing on data to be operated, and transmit data and operation instructions with the plurality of slave operation sub-modules 122.
The slave operation submodule 122 is configured to execute intermediate operations in parallel according to the data and the operation instructions transmitted by the master operation sub-module 121 to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master operation sub-module 121.
The main operation sub-module 121 is further configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction, and store the calculation result in the corresponding address.
In this implementation, when the computation instruction is an operation performed on scalar or vector data, the apparatus may control the main operation sub-module to perform an operation corresponding to the computation instruction by using an operator therein. When the calculation instruction is to perform an operation on data having a dimension greater than or equal to 2, such as a matrix, a tensor, or the like, the device may control the slave operation submodule to perform an operation corresponding to the calculation instruction by using an operator therein.
It should be noted that, a person skilled in the art may set the connection manner between the master operation submodule and the plurality of slave operation submodules according to actual needs to implement the configuration setting of the operation module, for example, the configuration of the operation module may be an "H" configuration, an array configuration, a tree configuration, and the like, which is not limited in the present disclosure.
Figure 2c illustrates a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2c, the operation module 12 may further include one or more branch operation sub-modules 123, and the branch operation sub-module 123 is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. The main operation sub-module 121 is connected to one or more branch operation sub-modules 123. Therefore, the main operation sub-module, the branch operation sub-module and the slave operation sub-module in the operation module are connected by adopting an H-shaped structure, and data and/or operation instructions are forwarded by the branch operation sub-module, so that the resource occupation of the main operation sub-module is saved, and the instruction processing speed is further improved.
Figure 2d illustrates a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 2d, a plurality of slave operation sub-modules 122 are distributed in an array.
Each slave operation submodule 122 is connected to another adjacent slave operation submodule 122, the master operation submodule 121 is connected to k slave operation submodules 122 of the plurality of slave operation submodules 122, and the k slave operation submodules 122 are: n slave operator sub-modules 122 of row 1, n slave operator sub-modules 122 of row m, and m slave operator sub-modules 122 of column 1.
As shown in fig. 2d, the k slave operator modules include only the n slave operator modules in the 1 st row, the n slave operator modules in the m th row, and the m slave operator modules in the 1 st column, that is, the k slave operator modules are slave operator modules directly connected to the master operator module among the plurality of slave operator modules. The k slave operation submodules are used for forwarding data and instructions between the master operation submodules and the plurality of slave operation submodules. Therefore, the plurality of slave operation sub-modules are distributed in an array, the speed of sending data and/or operation instructions to the slave operation sub-modules by the master operation sub-module can be increased, and the instruction processing speed is further increased.
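The set of k directly connected slave submodules in an m x n array can be enumerated as follows; this is only a sketch of the topology described above, with 1-indexed (row, column) positions as an assumed representation.

```python
def directly_connected(m, n):
    """Positions (row, col) of the k slave submodules wired directly to the
    master: row 1, row m, and column 1 of an m x n array (1-indexed)."""
    k = set()
    for col in range(1, n + 1):
        k.add((1, col))   # n slave submodules in row 1
        k.add((m, col))   # n slave submodules in row m
    for row in range(1, m + 1):
        k.add((row, 1))   # m slave submodules in column 1
    return k

# For a 3 x 4 array, the corners of column 1 are shared with rows 1 and m,
# so the set holds 2*4 + 3 - 2 = 9 distinct modules.
print(len(directly_connected(3, 4)))
# prints: 9
```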
Figure 2e illustrates a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2e, the operation module may further include a tree sub-module 124. The tree submodule 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master operation submodule 121, and the plurality of branch ports 402 are connected to the plurality of slave operation submodules 122, respectively. The tree sub-module 124 has a transceiving function, and is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. Therefore, the operation modules are connected in a tree-shaped structure under the action of the tree-shaped sub-modules, and the speed of sending data and/or operation instructions from the main operation sub-module to the auxiliary operation sub-module can be increased by utilizing the forwarding function of the tree-shaped sub-modules, so that the instruction processing speed is increased.
In one possible implementation, the tree submodule 124 may be an optional component of the apparatus and may include at least one level of nodes. The nodes are line structures with a forwarding function, and the nodes themselves have no operation function. The lowest-level nodes are connected to the slave operation sub-modules to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122. In particular, if the tree submodule has zero levels of nodes, the apparatus does not require the tree submodule.
In one possible implementation, the tree submodule 124 may include a plurality of nodes of an n-ary tree structure, and the plurality of nodes of the n-ary tree structure may have a plurality of layers.
For example, fig. 2f shows a block diagram of a loop vector instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 2f, the n-ary tree structure may be a binary tree structure with tree-type sub-modules including 2 levels of nodes 01. The lowest level node 01 is connected with the slave operation sub-module 122 to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122.
In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. The number of n in the n-ary tree structure and the number of layers of nodes in the n-ary tree structure may be set by those skilled in the art as needed, and the disclosure is not limited thereto.
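As a rough sizing aid, the sketch below (a hypothetical helper, not defined in the disclosure) computes the minimum number of node levels an n-ary tree sub-module needs so that its lowest level can fan out to a given number of slave operation sub-modules:

```python
def tree_levels(num_slaves, n):
    """Minimum number of node levels an n-ary tree needs so that the lowest
    level provides at least num_slaves connection points (n >= 2)."""
    levels = 0
    leaves = 1
    while leaves < num_slaves:
        leaves *= n   # each level multiplies the fan-out by n
        levels += 1
    return levels
```

For the binary tree of fig. 2f, `tree_levels(4, 2)` gives the 2 levels of nodes shown.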
In one possible implementation, the operation domain may also include a vector operation type.
The control module 11 may be further configured to determine a vector operation type according to the operation domain.
In one possible implementation, the vector operation type may include at least one of: a vector multiplication operation, a vector addition operation, a vector subtraction operation, a store-specified-value operation satisfying an operation condition, a bitwise AND operation, a bitwise OR operation, and a bitwise XOR operation. The operation condition may include any one of the following: bitwise equal, bitwise unequal, bitwise less than, bitwise greater than or equal to, bitwise greater than, bitwise less than or equal to. The specified value may be a numerical value such as 0 or 1, and the present disclosure does not limit this.
The store-a-specified-value-if-bitwise-equal operation may be: judging whether the corresponding bits of the segmentation vector and the second vector to be operated are equal, and storing the specified value when they are equal; when the corresponding bits are not equal, storing the value of the segmentation vector or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The store-a-specified-value-if-bitwise-unequal operation may be: judging whether the corresponding bits of the segmentation vector and the second vector to be operated are equal, and storing the specified value when they are not equal; when the corresponding bits are equal, storing the value of the segmentation vector or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The store-a-specified-value-if-bitwise-less-than operation may be: judging the size relationship of the corresponding bits of the segmentation vector and the second vector to be operated, and storing the specified value when the value of the segmentation vector at the corresponding bit is smaller than that of the second vector to be operated; otherwise, storing the value of the segmentation vector or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The store-a-specified-value-if-bitwise-greater-than-or-equal operation may be: judging the size relationship of the corresponding bits of the segmentation vector and the second vector to be operated, and storing the specified value when the value of the segmentation vector at the corresponding bit is greater than or equal to that of the second vector to be operated; otherwise, storing the value of the segmentation vector or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The store-a-specified-value-if-bitwise-greater-than operation may be: judging the size relationship of the corresponding bits of the segmentation vector and the second vector to be operated, and storing the specified value when the value of the segmentation vector at the corresponding bit is greater than that of the second vector to be operated; otherwise, storing the value of the segmentation vector or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The store-a-specified-value-if-bitwise-less-than-or-equal operation may be: judging the size relationship of the corresponding bits of the segmentation vector and the second vector to be operated, and storing the specified value when the value of the segmentation vector at the corresponding bit is less than or equal to that of the second vector to be operated; otherwise, storing the value of the segmentation vector or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.
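The six store-specified-value operations can be summarized in one Python sketch. The function and the `CONDITIONS` table are illustrative names, not part of the disclosure; here the branch that does not satisfy the condition stores a value different from the specified value (0 by default), which is one of the alternatives the text allows:

```python
import operator

# Hypothetical mapping from operation-condition codes (following the
# eq/ne/lt/ge/gt/le naming used for the operation domain codes) to
# Python comparison functions.
CONDITIONS = {
    "eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
    "ge": operator.ge, "gt": operator.gt, "le": operator.le,
}

def compare_store(seg, second, condition, specified=1, otherwise=0):
    """For each corresponding position of the segmentation vector `seg` and
    the second vector to be operated `second`, store `specified` when the
    condition holds, and `otherwise` (a value different from the specified
    value) when it does not."""
    cmp = CONDITIONS[condition]
    return [specified if cmp(a, b) else otherwise for a, b in zip(seg, second)]
```

For example, `compare_store([1, 2, 3], [1, 0, 3], "eq")` marks the positions where the two vectors agree.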
In this implementation, different operation domain codes may be set for different vector operation types to distinguish the operation types. For example, the code of the "vector multiplication operation" may be set to "mult", the code of the "vector addition operation" to "add", the code of the "vector subtraction operation" to "sub", the code of the "bitwise AND operation" to "and", the code of the "bitwise OR operation" to "or", the code of the "bitwise XOR operation" to "xor", the code of the "store a specified value 1 if bitwise equal" operation to "eq", the code of the "store a specified value 1 if bitwise unequal" operation to "ne", the code of the "store a specified value 1 if bitwise less than" operation to "lt", the code of the "store a specified value 1 if bitwise greater than or equal to" operation to "ge", the code of the "store a specified value 1 if bitwise greater than" operation to "gt", and the code of the "store a specified value 1 if bitwise less than or equal to" operation to "le".
The operation types and the corresponding codes thereof can be set by those skilled in the art according to actual needs, and the disclosure does not limit this.
In one possible implementation, the operation field may further include a first input quantity and a second input quantity. The control module 11 may be further configured to determine a first input quantity and a second input quantity according to the operation domain, obtain a first to-be-operated vector with a data quantity being the first input quantity from the first to-be-operated vector address, and obtain a second to-be-operated vector with a data quantity being the second input quantity from the second to-be-operated vector address.
In one possible implementation, the segmenting the first to-be-operated vector into n segmented vectors according to the second to-be-operated vector may include: and determining the segmentation data volume of each segmentation vector according to the second input volume, and segmenting the first vector to be operated into n segmentation vectors according to the segmentation data volume.
In this implementation, the second input quantity may be determined as a sliced data quantity for each sliced vector.
In this implementation, the first input quantity and the second input quantity may be parameters characterizing data quantities of the first vector to be operated on and the second vector to be operated on, such as vector length, width, and the like.
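A minimal sketch of the segmentation step described above, assuming (as in the application example of fig. 3) that the first input quantity is an integer multiple of the second input quantity; the function name is illustrative:

```python
def segment_vector(first, second_size):
    """Split the first vector to be operated into segmentation vectors whose
    data amount equals the second input quantity (second_size)."""
    assert len(first) % second_size == 0  # assumed: evenly divisible
    return [first[i:i + second_size]
            for i in range(0, len(first), second_size)]
```

With a first input quantity of 64 and a second input quantity of 16, this yields the 4 segmentation vectors of the application example.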
In one possible implementation, a default first input amount and second input amount may be set. When the first input quantity and the second input quantity cannot be determined according to the operation domain, the default first input quantity and the default second input quantity can be determined as the first input quantity and the second input quantity of the current loop vector instruction, and a first to-be-operated vector with the data quantity as the default first input quantity and a second to-be-operated vector with the default second input quantity are obtained from the to-be-operated vector address.
In one possible implementation, as shown in fig. 2 a-2 f, the apparatus may further include a storage module 13. The storage module 13 is configured to store a first vector to be operated and a second vector to be operated.
In this implementation, the storage module may include one or more of a cache and a register, and the cache may include a temporary cache and may further include at least one NRAM (Neuron Random Access Memory). And the cache is used for storing the data to be operated, the first vector to be operated and the second vector to be operated. And the register is used for storing scalar data in the data to be operated.
In one possible implementation, the cache may include a neuron cache. The neuron buffer, i.e., the neuron random access memory, may be configured to store neuron data in data to be operated on, where the neuron data may include neuron vector data.
In a possible implementation manner, the apparatus may further include a direct memory access module for reading or storing data from the storage module.
In one possible implementation, as shown in fig. 2 a-2 f, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112, and a queue storage sub-module 113.
The instruction storage submodule 111 is arranged to store loop vector instructions.
The instruction processing sub-module 112 is configured to parse the circular vector instruction to obtain an operation code and an operation domain of the circular vector instruction.
The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes multiple instructions to be executed that are sequentially arranged according to an execution order, and the instructions to be executed may include a circular vector instruction.
In this implementation, the instructions to be executed may also include computation instructions related or unrelated to vector operations, which are not limited by this disclosure. The execution sequence of the multiple instructions to be executed can be arranged according to the receiving time, the priority level and the like of the instructions to be executed to obtain an instruction queue, so that the multiple instructions to be executed can be sequentially executed according to the instruction queue.
In one possible implementation, as shown in fig. 2 a-2 f, the control module 11 may further include a dependency processing sub-module 114.
The dependency relationship processing submodule 114 is configured to, when it is determined that a first to-be-executed instruction in the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage submodule 111, and after the zeroth to-be-executed instruction is executed, extract the first to-be-executed instruction from the instruction storage submodule 111 and send the first to-be-executed instruction to the operation module 12.
The first to-be-executed instruction is determined to be associated with the zeroth to-be-executed instruction before it when the first storage address interval storing the data required by the first to-be-executed instruction and the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction have an overlapped area. Conversely, when the first storage address interval and the zeroth storage address interval have no overlapped area, there is no association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction.
By the method, according to the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction, the subsequent first to-be-executed instruction is executed after the execution of the previous zeroth to-be-executed instruction is finished, and the accuracy of the operation result is ensured.
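The overlap test for storage address intervals can be sketched as follows; treating the interval endpoints as inclusive is an illustrative choice not fixed by the text:

```python
def has_dependency(first_interval, zeroth_interval):
    """Return True when the first storage address interval and the zeroth
    storage address interval have an overlapped area, i.e. the first
    to-be-executed instruction is associated with the zeroth one.
    Intervals are (start, end) pairs with start <= end, both inclusive."""
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    # Two intervals overlap iff each starts no later than the other ends.
    return f_start <= z_end and z_start <= f_end
```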
In one possible implementation, the instruction format of the loop vector instruction may be:
opcode dst src0 src1 src0_size src1_size type.cycle
wherein opcode is the operation code of the loop vector instruction, and dst, src0, src1, src0_size, src1_size and type constitute its operation domain. dst is the target address. src0 is the first to-be-operated vector address. src1 is the second to-be-operated vector address. type is the vector operation type. src0_size is the first input quantity. src1_size is the second input quantity. type.cycle is the code of the vector operation type, such as mult.cycle, add.cycle, sub.cycle, eq.cycle, ne.cycle, lt.cycle, ge.cycle, gt.cycle, le.cycle, and.cycle, or.cycle or xor.cycle.
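A hypothetical parse of this instruction format, mirroring what the instruction processing sub-module is described as doing; the function name, field order and dictionary layout are illustrative assumptions:

```python
def parse_loop_vector_instruction(text):
    """Split a loop vector instruction of the form
    'opcode dst src0 src1 src0_size src1_size type.cycle'
    into its operation code and operation domain fields."""
    opcode, dst, src0, src1, src0_size, src1_size, op_type = text.split()
    return {
        "opcode": opcode,
        "dst": int(dst),              # target address
        "src0": int(src0),            # first to-be-operated vector address
        "src1": int(src1),            # second to-be-operated vector address
        "src0_size": int(src0_size),  # first input quantity
        "src1_size": int(src1_size),  # second input quantity
        "type": op_type,              # vector operation type code
    }
```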
In one possible implementation, the instruction format of the loop vector instruction may also be:
type.cycle dst src0 src1 src0_size src1_size
in one possible implementation, the instruction format of the loop vector instruction for the "vector multiply operation" may be set to: cycle dst src0src1src0_ size src1_ size. It represents: a first to-be-operated vector of src0_ size is obtained from the first to-be-operated address src0, and a second to-be-operated vector of src1_ size is obtained from the second to-be-operated address src 1. And splitting the first to-be-operated vector into a plurality of split vectors, wherein the data volume of each split vector is the same as the src1_ size. And respectively multiplying each segmentation vector and the second vector to be operated to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for the "vector add operation" may be set to: cycle dst src0src1src0_ size src1_ size. It represents: a first to-be-operated vector of src0_ size is obtained from the first to-be-operated address src0, and a second to-be-operated vector of src1_ size is obtained from the second to-be-operated address src 1. And splitting the first to-be-operated vector into a plurality of split vectors, wherein the data volume of each split vector is the same as the src1_ size. And respectively carrying out addition operation on each segmentation vector and the second vector to be operated to obtain an operation result. And stores the operation result into the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for the "vector subtraction operation" may be set to: sub.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. A subtraction operation is performed on each segmentation vector and the second to-be-operated vector to obtain an operation result. And the operation result is stored into the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for the "bitwise AND operation" may be set to: and.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. A bitwise AND operation is performed on each segmentation vector and the second to-be-operated vector to obtain an operation result. And the operation result is stored into the target address dst.

In one possible implementation, the instruction format of the loop vector instruction for the "bitwise OR operation" may be set to: or.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. A bitwise OR operation is performed on each segmentation vector and the second to-be-operated vector to obtain an operation result. And the operation result is stored into the target address dst.

In one possible implementation, the instruction format of the loop vector instruction for the "bitwise XOR operation" may be set to: xor.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. A bitwise XOR operation is performed on each segmentation vector and the second to-be-operated vector to obtain an operation result. And the operation result is stored into the target address dst.

In one possible implementation, the instruction format of the loop vector instruction for the "store a specified value 1 if bitwise equal" operation may be set to: eq.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. Whether the corresponding bits of each segmentation vector and the second to-be-operated vector are equal is judged, and the specified value 1 is stored when they are equal, so as to obtain an operation result. And the operation result is stored into the target address dst.

In one possible implementation, the instruction format of the loop vector instruction for the "store a specified value 1 if bitwise unequal" operation may be set to: ne.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. Whether the corresponding bits of each segmentation vector and the second to-be-operated vector are equal is judged, and the specified value 1 is stored when they are not equal, so as to obtain an operation result. And the operation result is stored into the target address dst.

In one possible implementation, the instruction format of the loop vector instruction for the "store a specified value 1 if bitwise less than" operation may be set to: lt.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. The size relationship of the corresponding bits of each segmentation vector and the second to-be-operated vector is judged, and the specified value 1 is stored when the value of the segmentation vector at the corresponding bit is smaller than that of the second to-be-operated vector, so as to obtain an operation result. And the operation result is stored into the target address dst.

In one possible implementation, the instruction format of the loop vector instruction for the "store a specified value 1 if bitwise greater than or equal to" operation may be set to: ge.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. The size relationship of the corresponding bits of each segmentation vector and the second to-be-operated vector is judged, and the specified value 1 is stored when the value of the segmentation vector at the corresponding bit is greater than or equal to that of the second to-be-operated vector, so as to obtain an operation result. And the operation result is stored into the target address dst.

In one possible implementation, the instruction format of the loop vector instruction for the "store a specified value 1 if bitwise greater than" operation may be set to: gt.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. The size relationship of the corresponding bits of each segmentation vector and the second to-be-operated vector is judged, and the specified value 1 is stored when the value of the segmentation vector at the corresponding bit is greater than that of the second to-be-operated vector, so as to obtain an operation result. And the operation result is stored into the target address dst.

In one possible implementation, the instruction format of the loop vector instruction for the "store a specified value 1 if bitwise less than or equal to" operation may be set to: le.cycle dst src0 src1 src0_size src1_size. It represents: a first to-be-operated vector with a data amount of src0_size is obtained from the first to-be-operated vector address src0, and a second to-be-operated vector with a data amount of src1_size is obtained from the second to-be-operated vector address src1. The first to-be-operated vector is split into a plurality of segmentation vectors, and the data amount of each segmentation vector is src1_size. The size relationship of the corresponding bits of each segmentation vector and the second to-be-operated vector is judged, and the specified value 1 is stored when the value of the segmentation vector at the corresponding bit is less than or equal to that of the second to-be-operated vector, so as to obtain an operation result. And the operation result is stored into the target address dst.
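The arithmetic and bitwise variants above differ only in the per-element operation applied between each segmentation vector and the second to-be-operated vector, which the following sketch makes explicit. The `OPS` dispatch table and the function name are assumptions for illustration, not the disclosed hardware:

```python
import operator

# Hypothetical dispatch from type codes to element-wise Python operations;
# the comparison codes (eq/ne/lt/ge/gt/le) would be handled as in the
# store-specified-value operations described earlier.
OPS = {
    "mult": operator.mul, "add": operator.add, "sub": operator.sub,
    "and": operator.and_, "or": operator.or_, "xor": operator.xor,
}

def execute_loop_vector(op, first, second):
    """Apply the operation of type `op` between each segmentation vector of
    `first` (segment size = len(second)) and `second`, returning the
    concatenated operation result."""
    fn = OPS[op]
    size = len(second)
    out = []
    for i in range(0, len(first), size):
        out.extend(fn(a, b) for a, b in zip(first[i:i + size], second))
    return out
```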
It should be understood that the positions of the operation code and the operation domain in the instruction format of the loop vector instruction may be set by those skilled in the art as needed, and the disclosure does not limit this.
In one possible implementation manner, the apparatus may be disposed in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and an embedded Neural Network Processor (NPU).
It should be noted that, although the above embodiments are described as examples of the loop vector instruction processing apparatus, those skilled in the art can understand that the disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.
Application example
An application example according to the embodiment of the present disclosure is given below in conjunction with "performing vector operations with a loop vector instruction processing apparatus" as an exemplary application scenario, to facilitate understanding of the flow of the loop vector instruction processing apparatus. It is understood by those skilled in the art that the following application example is merely for the purpose of facilitating understanding of the embodiments of the present disclosure and should not be construed as limiting the embodiments of the present disclosure.
Fig. 3 is a schematic diagram illustrating an application scenario of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the loop vector instruction processing apparatus processes the loop vector instruction as follows:
The control module 11 parses the obtained loop vector instruction 1 (for example, loop vector instruction 1 is "opcode 500 101 102 add.cycle 64 16"), and obtains the operation code and operation domain of loop vector instruction 1. The operation code of loop vector instruction 1 is opcode, the target address is 500, the first to-be-operated vector address is 101, and the second to-be-operated vector address is 102. The vector operation type is add. The first input quantity is 64. The second input quantity is 16. The control module 11 obtains a first to-be-operated vector with a data quantity of the first input quantity 64 from the first to-be-operated vector address 101, and obtains a second to-be-operated vector with a data quantity of the second input quantity 16 from the second to-be-operated vector address 102.
The operation module 12 divides the first to-be-operated vector into 4 divided vectors, such as divided vector 1, divided vector 2, divided vector 3, and divided vector 4 shown in fig. 3, where the data amount of each divided vector is 16. And performing addition operation on each of the division vectors and the second vector to be operated to obtain corresponding division operation results, such as a division operation result 1, a division operation result 2, a division operation result 3, and a division operation result 4 shown in fig. 3. And the division operation result 1, the division operation result 2, the division operation result 3 and the division operation result 4 are used as the operation result 1 of the circular vector instruction 1, and the operation result 1 is stored in the target address 500.
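The processing of loop vector instruction 1 can be reproduced with a small illustrative sketch (plain Python, not the actual apparatus): with a first to-be-operated vector of 64 elements and a second of 16, it produces the four segmentation operation results concatenated as operation result 1. The function name is an assumption:

```python
def execute_add_cycle(first, second, seg_size):
    """Split `first` into segmentation vectors of seg_size elements, add
    `second` to each, and concatenate the segmentation operation results."""
    result = []
    for i in range(0, len(first), seg_size):
        seg = first[i:i + seg_size]
        result.extend(a + b for a, b in zip(seg, second))
    return result
```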
The loop vector instruction 1 may be "opcode 500 101 102 add.cycle 64 16" or "add.cycle 500 101 102 64 16", and the processing procedures of loop vector instructions of different instruction formats are similar and will not be described again.
The working process of the above modules can refer to the above related description.
Thus, the loop vector instruction processing device can efficiently and quickly process the loop vector instruction.
The present disclosure provides a machine learning arithmetic device, which may include one or more of the above-described loop vector instruction processing devices, and is configured to acquire the first to-be-operated vector, the second to-be-operated vector and control information from other processing devices and execute a specified machine learning operation. The machine learning arithmetic device can obtain the loop vector instruction from other machine learning arithmetic devices or non-machine-learning arithmetic devices, and transmit the execution result to peripheral equipment (also called other processing devices) through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces and servers. When more than one loop vector instruction processing device is included, the loop vector instruction processing devices can be linked and transmit data through a specific structure, for example, interconnected through a PCIE bus, so as to support larger-scale neural network operations. In this case, the devices may share the same control system or have separate control systems, and may share memory or each accelerator may have its own memory. In addition, the interconnection mode can be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 4a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in fig. 4a, the combined processing device includes the machine learning arithmetic device, the universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.
The other processing devices include one or more types of general-purpose/special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs) and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing basic control such as data transfer and the starting and stopping of the machine learning arithmetic device; the other processing devices may also cooperate with the machine learning arithmetic device to complete an operation task.
The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device acquires required input data from the other processing devices and writes the input data into a storage device on the machine learning arithmetic device; it can obtain control instructions from the other processing devices and write them into a control cache on the machine learning arithmetic device chip; it can also read the data in the storage module of the machine learning arithmetic device and transmit the data to the other processing devices.
Fig. 4b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in fig. 4b, the combined processing device may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing devices respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing devices, and is particularly suitable for data to be operated that cannot be entirely held in the internal storage of the machine learning arithmetic device or the other processing devices.
The combined processing device can be used as an SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle or video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card or a Wi-Fi interface.
The present disclosure provides a machine learning chip, which includes the above machine learning arithmetic device or combined processing device.
The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in fig. 5, the board card includes the above-mentioned machine learning chip package structure or the above-mentioned machine learning chip. In addition to the machine learning chip 389, the board card may include other components, including but not limited to: a memory device 390, an interface device 391 and a control device 392.
The memory device 390 is connected to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure) via a bus and is used for storing data. The memory device 390 may include multiple groups of memory cells 393. Each group of memory cells 393 is connected to the machine learning chip 389 via a bus. It is understood that each group of memory cells 393 may be DDR SDRAM (double data rate synchronous dynamic random access memory).
DDR can double the transfer rate of SDRAM without increasing the clock frequency, because it allows data to be transferred on both the rising and falling edges of the clock pulse; its speed is therefore twice that of standard SDRAM.
In one embodiment, the memory device 390 may include 4 groups of memory cells 393. Each group of memory cells 393 may include a plurality of DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It is appreciated that when DDR4-3200 chips are used in each group of memory cells 393, the theoretical bandwidth of data transfer may reach 25600 MB/s.
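The quoted figure can be checked with a short calculation (a sketch only; the 64-bit value is the data portion of the 72-bit controller, excluding the 8 ECC bits):

```python
# DDR4-3200 performs 3200 mega-transfers per second (MT/s).
# Each transfer moves the width of the data bus: 64 bits = 8 bytes
# (the remaining 8 bits of the 72-bit controller carry ECC only).
transfers_per_second_millions = 3200   # MT/s for DDR4-3200
data_bus_bits = 64                     # data portion of the 72-bit controller
bytes_per_transfer = data_bus_bits // 8

theoretical_bandwidth_mb_s = transfers_per_second_millions * bytes_per_transfer
print(theoretical_bandwidth_mb_s)  # 25600 (MB/s), matching the figure above
```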
In one embodiment, each group of memory cells 393 includes a plurality of double data rate synchronous dynamic random access memory units arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389, for controlling the data transfer and data storage of each group of memory cells 393.
The interface device 391 is electrically connected to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface: the data to be processed is transmitted to the machine learning chip 389 by a server through the standard PCIE interface, thereby implementing data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface device can implement the data transfer function. In addition, the calculation result of the machine learning chip is transmitted back to the external device (e.g., a server) by the interface device.
The control device 392 is electrically connected to the machine learning chip 389 and is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a single-chip microcomputer (MCU). For example, the machine learning chip 389 may include multiple processing chips, multiple processing cores or multiple processing circuits, and may drive multiple loads. Therefore, the machine learning chip 389 can be in different working states such as heavy load and light load. The control device can regulate the working states of the processing chips, processing cores and/or processing circuits in the machine learning chip.
The present disclosure provides an electronic device, which includes the above machine learning chip or board card.
The electronic device may include a data processing apparatus, a computer device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camcorder, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an aircraft, a ship, and/or a vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
FIG. 6 shows a flow diagram of a loop vector instruction processing method according to an embodiment of the present disclosure. The method can be applied to a computer device or the like that includes a memory and a processor, wherein the memory is used for storing data used in executing the method, and the processor is used for executing the relevant processing and operation steps, such as steps S51 and S52. As shown in fig. 6, the method is applied to the above-mentioned loop vector instruction processing apparatus and includes step S51 and step S52.
In step S51, the control module is used to parse the obtained loop vector instruction to obtain an operation code and an operation domain of the loop vector instruction, obtain a first vector to be operated, a second vector to be operated and a target address required for executing the loop vector instruction according to the operation code and the operation domain, and determine the vector operation type of the loop vector instruction. The operation code is used for indicating that the operation performed on data by the loop vector instruction is a loop vector operation, and the operation domain includes a first to-be-operated vector address, a second to-be-operated vector address and the target address.
In step S52, the operation module is used to split the first to-be-operated vector into a plurality of split vectors according to the second to-be-operated vector, perform a vector operation on each split vector and the second to-be-operated vector according to the vector operation type to obtain an operation result, and store the operation result in the target address.
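The splitting-and-looping behaviour of steps S51 and S52 can be sketched in software as follows (a minimal illustration, not the hardware implementation; all names are hypothetical, and the vector operation is fixed to elementwise multiplication for concreteness):

```python
from operator import mul

def loop_vector_op(first_vec, second_vec, op=mul):
    """Split first_vec into chunks the size of second_vec and apply
    the elementwise operation op between each chunk and second_vec."""
    n = len(second_vec)  # the split data quantity of each split vector
    result = []
    for start in range(0, len(first_vec), n):
        chunk = first_vec[start:start + n]
        # zip truncates second_vec when the final chunk is shorter
        result.extend(op(a, b) for a, b in zip(chunk, second_vec))
    return result  # step S52 would store this at the target address

print(loop_vector_op([1, 2, 3, 4, 5, 6], [10, 20, 30]))
# [10, 40, 90, 40, 100, 180]
```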
In a possible implementation manner, performing a vector operation on each split vector and the second vector to be operated according to the vector operation type may include: executing the vector operation corresponding to the vector operation type by using a plurality of vector operators in the operation module.
In one possible implementation, the operation module may include a master operation submodule and a plurality of slave operation submodules, and the master operation submodule may include a plurality of vector operators. In this case, step S52 may include: executing the vector operation corresponding to the vector operation type by using the plurality of vector operators in the master operation submodule to obtain an operation result, and storing the operation result in the target address.
In one possible implementation, the operation module includes a master operation submodule and a plurality of slave operation submodules, and each slave operation submodule includes a plurality of vector operators. In this case, step S52 may include: using the plurality of vector operators in each slave operation submodule to perform the corresponding vector operations in parallel to obtain operation results, storing the operation results in the corresponding sub-cache spaces, and sending the operation results to the master operation submodule; and receiving the operation results with the master operation submodule and storing them in the target address.
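The master/slave division of labour described above can be mimicked in software with a thread pool standing in for the slave operation submodules (purely illustrative; the real device uses dedicated vector operators and sub-cache spaces, and all names here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add

def parallel_loop_vector_op(first_vec, second_vec, op=add):
    """One 'slave' worker per split vector; the 'master' gathers results."""
    n = len(second_vec)
    chunks = [first_vec[i:i + n] for i in range(0, len(first_vec), n)]

    def slave(chunk):  # stands in for one slave operation submodule
        return [op(a, b) for a, b in zip(chunk, second_vec)]

    # the master operation submodule collects the per-chunk results in order
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(slave, chunks))
    return [x for part in partials for x in part]  # stored at target address

print(parallel_loop_vector_op([1, 2, 3, 4], [10, 20]))
# [11, 22, 13, 24]
```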
In one possible implementation, the operation domain may also include the vector operation type. Determining the vector operation type of the loop vector instruction may include: when the vector operation type is included in the operation domain, determining the vector operation type according to the operation domain.
In one possible implementation, the operation domain may further include a first input quantity and a second input quantity. Obtaining the first to-be-operated vector, the second to-be-operated vector and the target address required for executing the loop vector instruction according to the operation code and the operation domain may further include: determining the first input quantity and the second input quantity according to the operation domain, acquiring a first to-be-operated vector whose data quantity is the first input quantity from the first to-be-operated vector address, and acquiring a second to-be-operated vector whose data quantity is the second input quantity from the second to-be-operated vector address. Splitting the first to-be-operated vector into a plurality of split vectors according to the second to-be-operated vector may include: determining the split data quantity of each split vector according to the second input quantity, and splitting the first to-be-operated vector into a plurality of split vectors according to the split data quantity.
In one possible implementation, the operation code is further used for indicating the vector operation type, and determining the vector operation type of the loop vector instruction may include: when the operation code is used for indicating the vector operation type, determining the vector operation type according to the operation code.
In one possible implementation, the vector operation type may include at least one of: a vector multiplication operation, a vector addition operation, a vector summation operation, an operation of storing a specified value when an operation condition is met, a bitwise AND operation, a bitwise OR operation and a bitwise XOR operation. The operation condition may include any one of the following: bitwise equal, bitwise unequal, bitwise less than, bitwise greater than or equal to, bitwise greater than, bitwise less than or equal to.
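The "operation of storing a specified value when an operation condition is met" can be read as an elementwise compare-and-select. The sketch below assumes, hypothetically, that the specified value is stored wherever the condition between corresponding elements holds and a fill value of 0 otherwise; the source does not fix these details:

```python
from operator import gt, eq

def store_if(chunk, second_vec, cond, specified_value, fill=0):
    """For each element pair (a, b), store specified_value where
    cond(a, b) holds, and fill elsewhere (assumed semantics)."""
    return [specified_value if cond(a, b) else fill
            for a, b in zip(chunk, second_vec)]

print(store_if([3, 1, 5], [2, 2, 2], gt, 1))  # [1, 0, 1]: a > b at indices 0, 2
print(store_if([4, 2, 2], [4, 0, 2], eq, 9))  # [9, 0, 9]: a == b at indices 0, 2
```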
In one possible implementation, the method may further include: storing a first vector to be operated and a second vector to be operated by a storage module of the device, wherein the storage module comprises at least one of a register and a cache,
the cache is used for storing data to be operated, a first vector to be operated and a second vector to be operated, and comprises at least one neuron cache NRAM;
the register is used for storing scalar data in the data to be operated;
the neuron buffer is used for storing neuron data in data to be operated, and the neuron data comprises neuron vector data.
In a possible implementation manner, parsing the obtained loop vector instruction to obtain an operation code and an operation domain of the loop vector instruction may include:
storing the loop vector instruction;
parsing the loop vector instruction to obtain the operation code and the operation domain of the loop vector instruction;
and storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are sequentially arranged in an execution order, and the plurality of instructions to be executed may include the loop vector instruction.
In one possible implementation, the method may further include: when it is determined that a first to-be-executed instruction in the plurality of to-be-executed instructions has a dependency on a zeroth to-be-executed instruction before it, caching the first to-be-executed instruction, and controlling execution of the first to-be-executed instruction only after it is determined that the zeroth to-be-executed instruction has finished executing. The dependency between the first to-be-executed instruction and the zeroth to-be-executed instruction before it may include: a first storage address interval for storing the data required by the first to-be-executed instruction and a zeroth storage address interval for storing the data required by the zeroth to-be-executed instruction have an overlapping area.
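The dependency test above reduces to an interval-overlap check. A sketch using half-open address intervals [start, end) (the names and the half-open convention are illustrative assumptions):

```python
def intervals_overlap(start0, end0, start1, end1):
    """True if half-open address intervals [start0, end0) and
    [start1, end1) share at least one address."""
    return start0 < end1 and start1 < end0

def must_wait(first_instr, zeroth_instr):
    """The first instruction must be cached until the zeroth completes
    whenever their required storage address intervals overlap."""
    return intervals_overlap(*first_instr, *zeroth_instr)

print(must_wait((0x100, 0x200), (0x180, 0x280)))  # True  (regions overlap)
print(must_wait((0x100, 0x200), (0x200, 0x300)))  # False (merely adjacent)
```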
It should be noted that, although the above embodiments are described taking the loop vector instruction processing method as an example, those skilled in the art can understand that the disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or the actual application scenario, as long as the technical scheme of the disclosure is satisfied.
The loop vector instruction processing method provided by the embodiments of the disclosure has a wide application range, high vector processing efficiency and high processing speed.
The present disclosure also provides a non-transitory computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the above-described loop vector instruction processing method.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
It should be further noted that, although the steps in the flowchart of fig. 6 are shown in sequence as indicated by the arrows, the steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least a portion of the steps in fig. 6 may include multiple sub-steps or multiple stages that are not necessarily performed at the same moment but may be performed at different times, and the order of performing the sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It should be understood that the above-described apparatus embodiments are merely exemplary, and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.
In addition, unless otherwise specified, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.
If the integrated unit/module is implemented in hardware, the hardware may be a digital circuit, an analog circuit, etc. Physical implementations of the hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the storage module may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and the like.
The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The foregoing may be better understood in light of the following clauses:
clause a1, a loop vector instruction processing device, the device comprising:
the control module is used for parsing the obtained loop vector instruction to obtain an operation code and an operation domain of the loop vector instruction, obtaining a first vector to be operated, a second vector to be operated and a target address required for executing the loop vector instruction according to the operation code and the operation domain, and determining the vector operation type of the loop vector instruction;
the operation module is used for splitting the first vector to be operated into a plurality of split vectors according to the second vector to be operated, performing a vector operation on each split vector and the second vector to be operated according to the vector operation type to obtain an operation result, and storing the operation result in the target address,
wherein the operation code is used for indicating that the operation performed on data by the loop vector instruction is a loop vector operation, and the operation domain includes a first to-be-operated vector address, a second to-be-operated vector address and the target address.
Clause a2, the apparatus of clause a1, the computing module comprising:
a plurality of vector operators for performing vector operations corresponding to the vector operation types.
Clause A3, the apparatus of clause a2, the arithmetic module comprising a master arithmetic sub-module and a plurality of slave arithmetic sub-modules, the master arithmetic sub-module comprising the plurality of vector operators,
and the main operation sub-module is used for executing the vector operation by utilizing the plurality of vector operators to obtain an operation result and storing the operation result into the target address.
Clause a4, the apparatus of clause a2, the arithmetic module comprising a master arithmetic sub-module and a plurality of slave arithmetic sub-modules, the slave arithmetic sub-modules comprising the plurality of vector operators,
the slave operation submodule is used for executing corresponding vector operation in parallel by utilizing a plurality of vector operators to obtain an operation result, storing the operation result into a corresponding sub-cache space, and sending the operation result to the master operation submodule;
and the main operation sub-module is also used for receiving the operation result and storing the operation result into the target address.
Clause a5, the device of clause a1, the operation domain further comprising a vector operation type,
the control module is further configured to determine a vector operation type according to the operation domain when the operation domain includes the vector operation type.
Clause a6, the apparatus of clause a1, the operational field further comprising a first input quantity and a second input quantity,
the control module is further configured to determine the first input quantity and the second input quantity according to the operation domain, obtain a first to-be-operated vector with a data quantity of the first input quantity from the first to-be-operated vector address, and obtain a second to-be-operated vector with a data quantity of the second input quantity from the second to-be-operated vector address,
the method for segmenting the first vector to be operated into a plurality of segmented vectors according to the second vector to be operated comprises the following steps:
and determining the segmentation data volume of each segmentation vector according to the second input volume, and segmenting the first vector to be operated into a plurality of segmentation vectors according to the segmentation data volume.
Clause a7, the device of clause a1, the opcode also being for indicating the vector operation type,
the control module is further configured to determine the vector operation type according to the operation code when the operation code is used to indicate the vector operation type.
Clause A8, the device of clause a1, the type of vector operations comprising at least one of:
a vector multiplication operation, a vector addition operation, a vector summation operation, an operation of storing a specified value when an operation condition is met, a bitwise AND operation, a bitwise OR operation, a bitwise XOR operation,
wherein the operation condition includes any one of the following: bitwise equal, bitwise unequal, bitwise less than, bitwise greater than or equal to, bitwise greater than, bitwise less than or equal to.
Clause a9, the apparatus of clause a1, further comprising:
a storage module for storing the first vector to be operated and the second vector to be operated,
wherein the storage module comprises at least one of a register and a cache,
the cache is used for storing data to be operated, the first vector to be operated and the second vector to be operated, and comprises at least one neuron cache NRAM;
the register is used for storing scalar data in the data to be operated;
the neuron cache is used for storing neuron data in the data to be operated, wherein the neuron data comprises neuron vector data.
Clause a10, the apparatus of clause a1, the control module comprising:
the instruction storage submodule is used for storing the loop vector instruction;
the instruction processing submodule is used for parsing the loop vector instruction to obtain the operation code and the operation domain of the loop vector instruction;
and the queue storage submodule is used for storing an instruction queue, wherein the instruction queue includes a plurality of instructions to be executed that are sequentially arranged in an execution order, and the plurality of instructions to be executed include the loop vector instruction.
Clause a11, the apparatus of clause a10, the control module further comprising:
the dependency relationship processing submodule is used for caching a first instruction to be executed in the instruction storage submodule when the fact that the first instruction to be executed in the plurality of instructions to be executed is associated with a zeroth instruction to be executed before the first instruction to be executed is determined, extracting the first instruction to be executed from the instruction storage submodule after the zeroth instruction to be executed is executed, and sending the first instruction to be executed to the operation module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
Clause a12, a machine learning computing device, the device comprising:
one or more loop vector instruction processing apparatuses as recited in any of clauses A1 to A11, configured to obtain vectors to be operated and control information from other processing apparatuses, perform a specified machine learning operation, and transmit the execution result to the other processing apparatuses via an I/O interface;
when the machine learning arithmetic device includes a plurality of the loop vector instruction processing devices, the plurality of loop vector instruction processing devices can be connected and transmit data through a specific structure;
the plurality of loop vector instruction processing devices are interconnected through a Peripheral Component Interconnect Express (PCIE) bus and transmit data so as to support larger-scale machine learning operations; the plurality of loop vector instruction processing devices share the same control system or have their own respective control systems; the plurality of loop vector instruction processing devices share a memory or have their own memories; and the interconnection mode of the plurality of loop vector instruction processing devices is any interconnection topology.
Clause a13, a combination processing device, comprising:
the machine learning computing device, universal interconnect interface, and other processing device of clause a 12;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user,
wherein the combination processing apparatus further comprises: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
Clause a14, a machine learning chip, the machine learning chip comprising:
the machine learning computing device of clause a12 or the combination processing device of clause a 13.
Clause a15, an electronic device, comprising:
the machine learning chip of clause a 14.
Clause a16, a card, comprising: a memory device, an interface device and a control device and a machine learning chip as described in clause a 14;
wherein the machine learning chip is connected with the storage device, the control device and the interface device respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the machine learning chip and external equipment;
and the control device is used for monitoring the state of the machine learning chip.
Clause a17, a loop vector instruction processing method, the method being applied to a loop vector instruction processing apparatus comprising a control module and an operation module, the method comprising:
parsing the obtained loop vector instruction by using the control module to obtain an operation code and an operation domain of the loop vector instruction, obtaining a first vector to be operated, a second vector to be operated and a target address required for executing the loop vector instruction according to the operation code and the operation domain, and determining the vector operation type of the loop vector instruction;
splitting, by the operation module, the first vector to be operated into a plurality of split vectors according to the second vector to be operated, performing a vector operation on each split vector and the second vector to be operated according to the vector operation type to obtain an operation result, and storing the operation result in the target address,
wherein the operation code is used for indicating that the operation performed on data by the loop vector instruction is a loop vector operation, and the operation domain includes a first to-be-operated vector address, a second to-be-operated vector address and the target address.
Clause a18, the method of clause a17, wherein performing a vector operation on each split vector and the second vector to be operated according to the vector operation type includes:
and utilizing a plurality of vector operators in the operation module to execute vector operation corresponding to the vector operation type.
Clause a19, the method of clause a18, the calculation module comprising a master calculation sub-module and a plurality of slave calculation sub-modules, the master calculation sub-module comprising the plurality of vector calculators,
the method for dividing a first vector to be operated into a plurality of divided vectors according to a second vector to be operated, performing vector operation on each divided vector and the second vector to be operated according to the vector operation type to obtain an operation result, and storing the operation result into a target address includes:
and executing vector operation corresponding to the vector operation type by utilizing the plurality of vector operators in the main operation sub-module to obtain an operation result, and storing the operation result into the target address.
Clause a20, the method of clause a18, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, each slave operation submodule comprising the plurality of vector operators,
wherein slicing the first vector to be operated into a plurality of sliced vectors according to the second vector to be operated, performing a vector operation on each sliced vector and the second vector to be operated according to the vector operation type to obtain an operation result, and storing the operation result into the target address comprises:
performing, in parallel, the corresponding vector operations by using the plurality of vector operators contained in each slave operation submodule to obtain operation results, storing the operation results into corresponding sub-cache spaces, and sending the operation results to the master operation submodule; and
receiving the operation results by using the master operation submodule, and storing the operation results into the target address.
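A rough sketch of the master/slave arrangement in clause a20, with Python threads standing in for slave operation submodules; the submodule count, scheduling, and sub-cache layout are assumptions, since the text does not specify them:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sliced_op(sliced_vectors, second_vec, op, num_slaves=4):
    # Each "slave" computes the elementwise operation between one sliced
    # vector and the second vector to be operated.
    def slave_compute(sliced):
        return [op(a, b) for a, b in zip(sliced, second_vec)]

    # pool.map preserves slice order, mimicking the master submodule
    # gathering the partial results from the sub-cache spaces in order.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partial_results = list(pool.map(slave_compute, sliced_vectors))

    # The master operation submodule concatenates the partial results
    # before storing them into the target address.
    result = []
    for part in partial_results:
        result.extend(part)
    return result
```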
Clause a21, the method of clause a17, the operation domain further comprising a vector operation type,
wherein determining a vector operation type for a loop vector instruction comprises:
and when the operation domain comprises a vector operation type, determining the vector operation type according to the operation domain.
Clause a22, the method of clause a17, wherein the operation domain further comprises a first input quantity and a second input quantity,
wherein obtaining the first vector to be operated, the second vector to be operated, and the target address required for executing the loop vector instruction according to the operation code and the operation domain further comprises:
determining the first input quantity and the second input quantity according to the operation domain, obtaining from the first to-be-operated vector address a first vector to be operated whose data quantity is the first input quantity, and obtaining from the second to-be-operated vector address a second vector to be operated whose data quantity is the second input quantity,
wherein slicing the first vector to be operated into a plurality of sliced vectors according to the second vector to be operated comprises:
determining the sliced data quantity of each sliced vector according to the second input quantity, and slicing the first vector to be operated into a plurality of sliced vectors according to the sliced data quantity.
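The slicing rule in clauses a17 and a22 (the sliced data quantity equals the second input quantity) can be sketched in plain Python; the helper names are illustrative, not from the patent:

```python
def slice_first_vector(first_vec, second_input_quantity):
    """Slice the first vector to be operated into sliced vectors whose
    data quantity equals the second input quantity."""
    n = second_input_quantity
    return [first_vec[i:i + n] for i in range(0, len(first_vec), n)]

def loop_vector_operation(first_vec, second_vec, op):
    """Perform the vector operation between each sliced vector and the
    second vector to be operated, concatenating the operation results."""
    result = []
    for sliced in slice_first_vector(first_vec, len(second_vec)):
        result.extend(op(a, b) for a, b in zip(sliced, second_vec))
    return result
```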
Clause a23, the method of clause a17, the opcode further being for indicating the vector operation type,
wherein determining a vector operation type for a loop vector instruction comprises:
and when the operation code is used for indicating the vector operation type, determining the vector operation type according to the operation code.
Clause a24, the method of clause a17, the type of vector operations comprising at least one of:
a vector multiplication operation, a vector addition operation, a vector summation operation, an operation of storing a specified value when an operation condition is satisfied, a bitwise AND operation, a bitwise OR operation, and a bitwise XOR operation,
wherein the operation condition comprises any one of: bitwise equal to, bitwise not equal to, bitwise less than, bitwise greater than or equal to, bitwise greater than, and bitwise less than or equal to.
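One way to read the operation list in clause a24 is as a dispatch table. The string keys and the convention of emitting 0 where the condition fails are assumptions for illustration; the patent does not define an encoding:

```python
import operator

VECTOR_OPS = {
    "mul": operator.mul,   # vector multiplication operation
    "add": operator.add,   # vector addition operation
    "and": operator.and_,  # bitwise AND operation
    "or":  operator.or_,   # bitwise OR operation
    "xor": operator.xor,   # bitwise XOR operation
}

CONDITIONS = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "ge": operator.ge,
    "gt": operator.gt, "le": operator.le,
}

def store_specified_value(cond, specified, a, b):
    """Operation of storing a specified value when the operation condition
    holds for a pair of elements (0 is an assumed 'not satisfied' filler)."""
    return specified if CONDITIONS[cond](a, b) else 0
```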
Clause a25, the method of clause a17, the method further comprising:
storing, with a storage module of the device, the first vector to be operated on and the second vector to be operated on,
wherein the storage module comprises at least one of a register and a cache,
the cache is used for storing data to be operated, the first vector to be operated and the second vector to be operated, and comprises at least one neuron cache NRAM;
the register is used for storing scalar data in the data to be operated;
the neuron cache is used for storing neuron data in the data to be operated, wherein the neuron data comprises neuron vector data.
Clause a26, the method of clause a17, wherein parsing the obtained loop vector instruction to obtain the operation code and the operation domain of the loop vector instruction comprises:
storing the loop vector instruction;
parsing the loop vector instruction to obtain the operation code and the operation domain of the loop vector instruction; and
storing an instruction queue, wherein the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution order, and the plurality of instructions to be executed comprise the loop vector instruction.
Clause a27, the method of clause a26, the method further comprising:
when determining that the first to-be-executed instruction in the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after determining that the zeroth to-be-executed instruction is completely executed, controlling to execute the first to-be-executed instruction,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
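The association test in clause a27 is an address-interval overlap check. Assuming half-open [start, end) intervals (the patent does not state whether the bounds are inclusive), it can be written as:

```python
def intervals_overlap(first_interval, zeroth_interval):
    """Return True when the first storage address interval (data required by
    the first instruction to be executed) overlaps the zeroth storage address
    interval, i.e. the two instructions have an association relationship and
    the first instruction must wait for the zeroth to finish."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start < zeroth_end and zeroth_start < first_end
```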
Clause a28, a non-transitory computer readable storage medium having computer program instructions stored thereon that, when executed by a processor, implement the method of any one of clauses a17 to a27.
The embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application. Meanwhile, a person skilled in the art may make changes to the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (28)

1. A loop vector instruction processing apparatus, the apparatus comprising:
the control module is used for parsing the obtained loop vector instruction to obtain an operation code and an operation domain of the loop vector instruction, obtaining a first vector to be operated, a second vector to be operated, and a target address required for executing the loop vector instruction according to the operation code and the operation domain, and determining a vector operation type of the loop vector instruction;
the operation module is used for receiving the first vector to be operated, the second vector to be operated, and the vector operation type obtained by the control module through parsing the loop vector instruction, slicing the first vector to be operated into a plurality of sliced vectors according to the second vector to be operated, performing a vector operation on each sliced vector and the second vector to be operated respectively according to the vector operation type to obtain an operation result, and storing the operation result into the target address,
wherein the operation code is used for indicating that the operation performed on data by the loop vector instruction is a loop vector operation, and the operation domain comprises the source of the first vector to be operated, the source of the second vector to be operated, and the target address.
2. The apparatus of claim 1, wherein the operation module comprises:
a plurality of vector operators for performing vector operations corresponding to the vector operation type.
3. The apparatus of claim 2, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, the master operation submodule comprising the plurality of vector operators,
and the master operation submodule is used for performing the vector operation by using the plurality of vector operators to obtain an operation result, and storing the operation result into the target address.
4. The apparatus of claim 2, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, each slave operation submodule comprising the plurality of vector operators,
wherein each slave operation submodule is used for performing, in parallel, the corresponding vector operations by using the plurality of vector operators to obtain operation results, storing the operation results into corresponding sub-cache spaces, and sending the operation results to the master operation submodule;
and the master operation submodule is further used for receiving the operation results and storing the operation results into the target address.
5. The apparatus of claim 1, wherein the operation domain further comprises a vector operation type,
the control module is further configured to determine a vector operation type according to the operation domain when the operation domain includes the vector operation type.
6. The apparatus of claim 1, wherein the operation domain further comprises a first input quantity and a second input quantity,
the control module is further configured to determine the first input quantity and the second input quantity according to the operation domain, obtain from the first to-be-operated vector address a first vector to be operated whose data quantity is the first input quantity, and obtain from the second to-be-operated vector address a second vector to be operated whose data quantity is the second input quantity,
wherein slicing the first vector to be operated into a plurality of sliced vectors according to the second vector to be operated comprises:
determining the sliced data quantity of each sliced vector according to the second input quantity, and slicing the first vector to be operated into a plurality of sliced vectors according to the sliced data quantity.
7. The apparatus of claim 1, wherein the opcode is further configured to indicate the type of vector operation,
the control module is further configured to determine the vector operation type according to the operation code when the operation code is used to indicate the vector operation type.
8. The apparatus of claim 1, wherein the type of vector operation comprises at least one of:
a vector multiplication operation, a vector addition operation, a vector summation operation, an operation of storing a specified value when an operation condition is satisfied, a bitwise AND operation, a bitwise OR operation, and a bitwise XOR operation,
wherein the operation condition comprises any one of: bitwise equal to, bitwise not equal to, bitwise less than, bitwise greater than or equal to, bitwise greater than, and bitwise less than or equal to.
9. The apparatus of claim 1, further comprising:
a storage module for storing the first vector to be operated and the second vector to be operated,
wherein the storage module comprises at least one of a register and a cache,
the cache is used for storing data to be operated, the first vector to be operated and the second vector to be operated, and comprises at least one neuron cache NRAM;
the register is used for storing scalar data in the data to be operated;
the neuron cache is used for storing neuron data in the data to be operated, wherein the neuron data comprises neuron vector data.
10. The apparatus of claim 1, wherein the control module comprises:
the instruction storage submodule is used for storing the loop vector instruction;
the instruction processing submodule is used for parsing the loop vector instruction to obtain the operation code and the operation domain of the loop vector instruction;
and the queue storage submodule is used for storing an instruction queue, wherein the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution order, and the plurality of instructions to be executed comprise the loop vector instruction.
11. The apparatus of claim 10, wherein the control module further comprises:
the dependency relationship processing submodule is used for caching a first instruction to be executed in the instruction storage submodule when the fact that the first instruction to be executed in the plurality of instructions to be executed is associated with a zeroth instruction to be executed before the first instruction to be executed is determined, extracting the first instruction to be executed from the instruction storage submodule after the zeroth instruction to be executed is executed, and sending the first instruction to be executed to the operation module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
12. A machine learning arithmetic device, the device comprising:
one or more loop vector instruction processing apparatus as claimed in any one of claims 1 to 11, configured to obtain a vector to be operated and control information from another processing apparatus, perform a specified machine learning operation, and transmit the execution result to the other processing apparatus via an I/O interface;
when the machine learning arithmetic device comprises a plurality of the loop vector instruction processing apparatuses, the plurality of loop vector instruction processing apparatuses can be connected through a specific structure and transmit data;
the plurality of loop vector instruction processing apparatuses are interconnected through a Peripheral Component Interconnect Express (PCIe) bus and transmit data so as to support larger-scale machine learning operations; the plurality of loop vector instruction processing apparatuses share the same control system or have their own respective control systems; the plurality of loop vector instruction processing apparatuses share a memory or have their own respective memories; and the interconnection mode of the plurality of loop vector instruction processing apparatuses is any interconnection topology.
13. A combined processing apparatus, characterized in that the combined processing apparatus comprises:
the machine learning computing device, universal interconnect interface, and other processing device of claim 12;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user,
wherein the combination processing apparatus further comprises: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
14. A machine learning chip, the machine learning chip comprising:
a machine learning computation apparatus according to claim 12 or a combined processing apparatus according to claim 13.
15. An electronic device, characterized in that the electronic device comprises:
the machine learning chip of claim 14.
16. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the machine learning chip of claim 14;
wherein the machine learning chip is connected with the storage device, the control device and the interface device respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the machine learning chip and external equipment;
and the control device is used for monitoring the state of the machine learning chip.
17. A method for processing a loop vector instruction, the method being applied to a loop vector instruction processing apparatus comprising a control module and an operation module, the method comprising:
parsing the obtained loop vector instruction by using the control module to obtain an operation code and an operation domain of the loop vector instruction, obtaining a first vector to be operated, a second vector to be operated, and a target address required for executing the loop vector instruction according to the operation code and the operation domain, and determining a vector operation type of the loop vector instruction;
receiving, by the operation module, the first vector to be operated, the second vector to be operated, and the vector operation type obtained by the control module through parsing the loop vector instruction, slicing the first vector to be operated into a plurality of sliced vectors according to the second vector to be operated, performing a vector operation on each sliced vector and the second vector to be operated respectively according to the vector operation type to obtain an operation result, and storing the operation result into the target address,
wherein the operation code is used for indicating that the operation performed on data by the loop vector instruction is a loop vector operation, and the operation domain comprises the source of the first vector to be operated, the source of the second vector to be operated, and the target address.
18. The method of claim 17, wherein performing a vector operation on each sliced vector and the second vector to be operated according to the vector operation type comprises:
performing the vector operation corresponding to the vector operation type by using a plurality of vector operators in the operation module.
19. The method of claim 18, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, the master operation submodule comprising the plurality of vector operators,
wherein slicing the first vector to be operated into a plurality of sliced vectors according to the second vector to be operated, performing a vector operation on each sliced vector and the second vector to be operated according to the vector operation type to obtain an operation result, and storing the operation result into the target address comprises:
performing the vector operation corresponding to the vector operation type by using the plurality of vector operators in the master operation submodule to obtain the operation result, and storing the operation result into the target address.
20. The method of claim 18, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, each slave operation submodule comprising the plurality of vector operators,
wherein slicing the first vector to be operated into a plurality of sliced vectors according to the second vector to be operated, performing a vector operation on each sliced vector and the second vector to be operated according to the vector operation type to obtain an operation result, and storing the operation result into the target address comprises:
performing, in parallel, the corresponding vector operations by using the plurality of vector operators contained in each slave operation submodule to obtain operation results, storing the operation results into corresponding sub-cache spaces, and sending the operation results to the master operation submodule; and
receiving the operation results by using the master operation submodule, and storing the operation results into the target address.
21. The method of claim 17, wherein the operation domain further comprises a vector operation type,
wherein determining a vector operation type for a loop vector instruction comprises:
and when the operation domain comprises a vector operation type, determining the vector operation type according to the operation domain.
22. The method of claim 17, wherein the operation domain further comprises a first input quantity and a second input quantity,
wherein obtaining the first vector to be operated, the second vector to be operated, and the target address required for executing the loop vector instruction according to the operation code and the operation domain further comprises:
determining the first input quantity and the second input quantity according to the operation domain, obtaining from the first to-be-operated vector address a first vector to be operated whose data quantity is the first input quantity, and obtaining from the second to-be-operated vector address a second vector to be operated whose data quantity is the second input quantity,
wherein slicing the first vector to be operated into a plurality of sliced vectors according to the second vector to be operated comprises:
determining the sliced data quantity of each sliced vector according to the second input quantity, and slicing the first vector to be operated into a plurality of sliced vectors according to the sliced data quantity.
23. The method of claim 17, wherein the opcode is further configured to indicate the type of vector operation,
wherein determining a vector operation type for a loop vector instruction comprises:
and when the operation code is used for indicating the vector operation type, determining the vector operation type according to the operation code.
24. The method of claim 17, wherein the vector operation type comprises at least one of:
a vector multiplication operation, a vector addition operation, a vector summation operation, an operation of storing a specified value when an operation condition is satisfied, a bitwise AND operation, a bitwise OR operation, and a bitwise XOR operation,
wherein the operation condition comprises any one of: bitwise equal to, bitwise not equal to, bitwise less than, bitwise greater than or equal to, bitwise greater than, and bitwise less than or equal to.
25. The method of claim 17, further comprising:
storing, with a storage module of the device, the first vector to be operated on and the second vector to be operated on,
wherein the storage module comprises at least one of a register and a cache,
the cache is used for storing data to be operated, the first vector to be operated and the second vector to be operated, and comprises at least one neuron cache NRAM;
the register is used for storing scalar data in the data to be operated;
the neuron cache is used for storing neuron data in the data to be operated, wherein the neuron data comprises neuron vector data.
26. The method of claim 17, wherein parsing the fetched loop vector instruction to obtain the operation code and the operation domain of the loop vector instruction comprises:
storing the loop vector instruction;
parsing the loop vector instruction to obtain the operation code and the operation domain of the loop vector instruction; and
storing an instruction queue, wherein the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution order, and the plurality of instructions to be executed comprise the loop vector instruction.
27. The method of claim 26, further comprising:
when determining that the first to-be-executed instruction in the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after determining that the zeroth to-be-executed instruction is completely executed, controlling to execute the first to-be-executed instruction,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
28. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 17 to 27.
CN201910645052.7A 2018-10-09 2019-07-17 Operation method, operation device, computer equipment and storage medium Active CN111353125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/110167 WO2020073925A1 (en) 2018-10-09 2019-10-09 Operation method and apparatus, computer device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811563666 2018-12-20
CN2018115636662 2018-12-20

Publications (2)

Publication Number Publication Date
CN111353125A CN111353125A (en) 2020-06-30
CN111353125B true CN111353125B (en) 2022-04-22

Family

ID=71193916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910645052.7A Active CN111353125B (en) 2018-10-09 2019-07-17 Operation method, operation device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111353125B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315566A (en) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing vector circulant shift operation
CN107957977A (en) * 2017-12-15 2018-04-24 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product
CN108108190A (en) * 2017-12-15 2018-06-01 北京中科寒武纪科技有限公司 A kind of computational methods and Related product
CN108874444A (en) * 2017-10-30 2018-11-23 上海寒武纪信息科技有限公司 Machine learning processor and the method for executing Outer Product of Vectors instruction using processor
CN108959180A (en) * 2018-06-15 2018-12-07 北京探境科技有限公司 A kind of data processing method and system
CN109032669A (en) * 2018-02-05 2018-12-18 上海寒武纪信息科技有限公司 Processing with Neural Network device and its method for executing the instruction of vector minimum value



Similar Documents

Publication Publication Date Title
CN110096309B (en) Operation method, operation device, computer equipment and storage medium
CN110096310B (en) Operation method, operation device, computer equipment and storage medium
CN110119807B (en) Operation method, operation device, computer equipment and storage medium
CN111047005A (en) Operation method, operation device, computer equipment and storage medium
CN111353124A (en) Operation method, operation device, computer equipment and storage medium
CN111353125B (en) Operation method, operation device, computer equipment and storage medium
CN111061507A (en) Operation method, operation device, computer equipment and storage medium
CN111047030A (en) Operation method, operation device, computer equipment and storage medium
CN111340202B (en) Operation method, device and related product
CN111290788B (en) Operation method, operation device, computer equipment and storage medium
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN112395008A (en) Operation method, operation device, computer equipment and storage medium
CN111124497B (en) Operation method, operation device, computer equipment and storage medium
CN111338694B (en) Operation method, device, computer equipment and storage medium
CN111290789B (en) Operation method, operation device, computer equipment and storage medium
CN111353595A (en) Operation method, device and related product
CN111339060B (en) Operation method, device, computer equipment and storage medium
CN111275197B (en) Operation method, device, computer equipment and storage medium
CN112395009A (en) Operation method, operation device, computer equipment and storage medium
CN112396169B (en) Operation method, device, computer equipment and storage medium
CN112395003A (en) Operation method, device and related product
CN111062483A (en) Operation method, operation device, computer equipment and storage medium
CN112395002B (en) Operation method, device, computer equipment and storage medium
CN112395007A (en) Operation method, operation device, computer equipment and storage medium
CN112395001A (en) Operation method, operation device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant