CN111381872A

CN111381872A - Operation method, device and related product

Info

Publication number: CN111381872A
Application number: CN201811621262.4A
Authority: CN
Inventors: Not disclosed (inventor name withheld)
Original assignee: Shanghai Cambricon Information Technology Co Ltd
Current assignee: Shanghai Cambricon Information Technology Co Ltd
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2020-07-07

Abstract

The disclosure relates to an operation method, an operation device and a related product. The machine learning arithmetic device comprises one or more instruction processing devices, and is configured to acquire data to be processed and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface. When the machine learning arithmetic device includes a plurality of instruction processing devices, these devices can be connected to one another through a specific structure to transfer data. The instruction processing devices are interconnected, and transmit data, through a Peripheral Component Interconnect Express (PCIe) bus; they either share one control system or each have their own control system; they either share a memory or each have their own memory; and they may be interconnected in any topology. The operation method, the operation device and the related products provided by the embodiments of the disclosure have a wide application range, high instruction processing efficiency and high instruction processing speed.

Description

Operation method, device and related product

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a vector operation instruction processing method, an apparatus, and a related product.

Background

With the continuous development of science and technology, machine learning, and neural network algorithms in particular, is used more and more widely. It has been applied successfully in fields such as image recognition, speech recognition and natural language processing. However, as neural network algorithms grow more complex, the types and the number of data operations involved keep increasing. In the related art, the efficiency and speed of processing vector data are low.

Disclosure of Invention

In view of the above, the present disclosure provides a method, an apparatus, and a related product for processing vector operation instructions, so as to improve efficiency and speed of processing vector data.

According to a first aspect of the present disclosure, there is provided a vector operation instruction processing apparatus, the apparatus comprising:

a control module, configured to parse a received vector operation instruction to obtain an operation code and an operation domain of the vector operation instruction, determine, according to the operation code and the operation domain, data to be processed and a target address required for executing the vector operation instruction, and determine a data processing type corresponding to the vector operation instruction; and

a processing module, configured to process the data to be processed according to the data processing type to obtain processed data, and store the processed data at the target address,

wherein the operation code indicates that the processing performed on data by the vector operation instruction at least comprises vector operation processing,

the data processing type comprises an initial data type of the data to be processed, a target data type of the processed data, and an operation type, the initial data type or the target data type being a floating point data type, and

the operation domain comprises an address of the data to be processed and the target address.

According to a second aspect of the present disclosure, there is provided a machine learning arithmetic device, the device including:

one or more vector operation instruction processing devices according to the first aspect, configured to acquire data to be processed and control information from other processing devices, execute a specified machine learning operation, and transmit an execution result to the other processing devices through an I/O interface;

when the machine learning arithmetic device comprises a plurality of vector operation instruction processing devices, the vector operation instruction processing devices can be connected through a specific structure and transmit data;

the vector operation instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus so as to support larger-scale machine learning operations; the vector operation instruction processing devices share one control system or have their own respective control systems; the vector operation instruction processing devices share a memory or have their own respective memories; and the vector operation instruction processing devices may be interconnected in any interconnection topology.

According to a third aspect of the present disclosure, there is provided a combined processing apparatus, the apparatus comprising:

the machine learning arithmetic device according to the second aspect, a universal interconnect interface, and other processing devices;

and the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user.

According to a fourth aspect of the present disclosure, there is provided a machine learning chip including the machine learning arithmetic device of the second aspect or the combined processing apparatus of the third aspect.

According to a fifth aspect of the present disclosure, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.

According to a sixth aspect of the present disclosure, a board card is provided, which includes the machine learning chip package structure of the fifth aspect.

According to a seventh aspect of the present disclosure, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board card of the sixth aspect.

According to an eighth aspect of the present disclosure, there is provided a vector operation instruction processing method, which is applied to a vector operation instruction processing apparatus, the method including:

analyzing a received vector operation instruction to obtain an operation code and an operation domain of the vector operation instruction, determining data to be processed and a target address required by executing the vector operation instruction according to the operation code and the operation domain, and determining a data processing type corresponding to the vector operation instruction;

processing the data to be processed according to the data processing type to obtain processed data, and storing the processed data at the target address.

In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a camcorder, a projector, a watch, a headset, a mobile storage device, a wearable device, a means of transport, a household appliance, and/or a medical device.

In some embodiments, the means of transport comprises an aircraft, a ship, and/or a road vehicle; the household appliance comprises a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove and/or a range hood; the medical device comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.

In the vector operation instruction processing method, apparatus and related product provided by the embodiments of the disclosure, the apparatus comprises a control module and a processing module. The control module is configured to parse the received vector operation instruction, obtain an operation code and an operation domain of the vector operation instruction, determine, according to the operation code and the operation domain, the data to be processed and the target address required for executing the vector operation instruction, and determine a data processing type corresponding to the vector operation instruction. The processing module is configured to process the data to be processed according to the data processing type to obtain the processed data, and store the processed data at the target address. The method, apparatus and related product have a wide application range and process vector operation instructions with high efficiency and speed, and can therefore improve the efficiency and speed of processing vector data.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

Fig. 1 shows a block diagram of a vector operation instruction processing apparatus according to an embodiment of the present disclosure.

Fig. 2 illustrates a block diagram of a vector operation instruction processing apparatus according to an embodiment of the present disclosure.

Figs. 3a and 3b show block diagrams of a combined processing device according to an embodiment of the present disclosure.

Fig. 4 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.

Fig. 5 shows a flowchart of a vector operation instruction processing method according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Fig. 1 shows a block diagram of a vector operation instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a control module 11 and a processing module 12.

The control module 11 is configured to parse the received vector operation instruction, obtain an operation code and an operation domain of the vector operation instruction, determine, according to the operation code and the operation domain, to-be-processed data and a target address required for executing the vector operation instruction, and determine a data processing type corresponding to the vector operation instruction. The data processing type comprises an initial data type of data to be processed, a target data type of the processed data and an operation type. The initial data type or the target data type is a floating point data type. The operation code is used for indicating that the processing required by the vector operation instruction to the data at least comprises vector operation processing. The operation domain comprises a data address to be processed and a target address.

The processing module 12 is configured to process the data to be processed according to the data processing type to obtain processed data, and store the processed data at the target address.

In this embodiment, the control module may obtain the data to be processed from the address of the data to be processed. The address of the data to be processed may be, for example, the starting address at which the data to be processed is stored. There may be one or more pieces of data to be processed. When there are several, the operation domain may include several addresses of data to be processed, so that the control module can obtain each required piece of data from its respective address.

In this embodiment, the control module may obtain the instruction and the data to be processed through a data input/output unit, where the data input/output unit may be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction, or a field (usually indicated by a code), that specifies the operation to be performed; it is an instruction sequence number that informs the device executing the instruction which instruction specifically needs to be executed. The operation domain may be the source of all data required for executing the corresponding instruction, where such data includes the data to be processed, the data processing type and the corresponding operation method, or the addresses at which the data processing type, the data to be processed, the corresponding operation method and the like are stored. An instruction must comprise an operation code and an operation domain, and the operation domain comprises at least the address of the data to be processed and the target address.

It should be understood that the instruction format of the vector operation instruction and the contained opcode and operation domain may be set as desired by those skilled in the art, and the disclosure is not limited thereto.
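Because the instruction format is left open, the following Python sketch only illustrates the split into an operation code and an operation domain. The textual layout "<opcode> <target address> <source address> ...", the hexadecimal addresses, the mnemonic VADD and the names VectorInstruction and parse_instruction are all assumptions made for illustration and are not defined by the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VectorInstruction:
    opcode: str            # tells the device which instruction to execute
    src_addrs: List[int]   # addresses of the data to be processed
    dst_addr: int          # target address for the processed data


def parse_instruction(text: str) -> VectorInstruction:
    """Parse a hypothetical textual instruction.

    Assumed layout: "<opcode> <dst> <src> [<src> ...]", addresses in hex.
    """
    parts = text.split()
    if len(parts) < 3:
        # The operation domain must hold at least one source address
        # and the target address.
        raise ValueError("need an opcode, a target address and a source address")
    opcode, dst, *srcs = parts
    return VectorInstruction(opcode, [int(s, 16) for s in srcs], int(dst, 16))


instruction = parse_instruction("VADD 0x300 0x100 0x200")
```

Here the operation domain carries two addresses of data to be processed and one target address, matching the minimum content required above.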

In this embodiment, the apparatus may include one or more control modules and one or more processing modules, and the number of the control modules and the number of the processing modules may be set according to actual needs, which is not limited by this disclosure.

The vector operation instruction processing device provided by the embodiment of the disclosure comprises a control module and a processing module. The control module is used for analyzing the received vector operation instruction, obtaining an operation code and an operation domain of the vector operation instruction, determining data to be processed and a target address required by the vector operation instruction according to the operation code and the operation domain, and determining a data processing type corresponding to the vector operation instruction. The processing module is used for processing the data to be processed according to the data processing type to obtain the processed data and storing the processed data into the target address. The vector operation instruction processing device provided by the embodiment of the disclosure has a wide application range, high processing efficiency and high processing speed for vector operation instructions, and can improve the processing efficiency and speed of vector data.

In one possible implementation, the operation domain may also include a data processing type. The control module 11 may be further configured to determine a data processing type corresponding to the vector operation instruction according to the operation domain when the operation domain includes the data processing type.

In one possible implementation, the operation code may also be used to indicate the type of data processing. The control module 11 may be further configured to determine a data processing type corresponding to the vector operation instruction according to the operation code when the operation code indicates the data processing type.

In one possible implementation, different operation domain codes and/or operation code codes may be set for different data processing types, which the present disclosure does not limit.

In one possible implementation, a default data processing type may be set in advance. When the data processing type of the current vector operation instruction cannot be determined from the operation domain and the operation code of the vector operation instruction, the control module may use the default data processing type as the data processing type of the current vector operation instruction, so that the processing module can process the data to be processed according to the default data processing type.
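The three sources of the data processing type (operation domain, operation code, preset default) can be sketched as a simple lookup chain. The disclosure does not state which source takes precedence when several are present, so the order used here is an assumption, as are the dictionary keys, the type names and the function name resolve_processing_type.

```python
from typing import Dict, Optional

# Hypothetical preset default data processing type.
DEFAULT_PROCESSING_TYPE = {"initial": "fix8", "target": "float16", "op": "add"}


def resolve_processing_type(domain: Dict[str, object],
                            opcode: str,
                            opcode_types: Optional[Dict[str, dict]] = None) -> dict:
    """Return the data processing type for one instruction.

    Assumed precedence: operation domain, then operation code, then default.
    """
    if "processing_type" in domain:
        return domain["processing_type"]        # carried in the operation domain
    if opcode_types and opcode in opcode_types:
        return opcode_types[opcode]             # implied by the operation code
    return DEFAULT_PROCESSING_TYPE              # neither source specifies it
```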

In one possible implementation, the operation domain may further include an input amount. The control module is further configured to, when the operation domain includes the input amount, acquire the data to be processed corresponding to the input amount from the address of the data to be processed.

In this implementation, the control module may obtain the data to be processed with the data amount as the input amount from the address of the data to be processed. The input amount may be information capable of characterizing the length, width, etc. of the data amount. When the input amount is not included in the operation domain, the control module may directly acquire all data in the to-be-processed data address as the to-be-processed data. The data to be processed with the data volume as the default input volume may also be acquired from the data address to be processed according to the preset default input volume, which is not limited by the present disclosure.
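A minimal sketch of the three fetch cases just described (explicit input amount, preset default input amount, or all data at the address). The memory model, a dictionary mapping an address to the list of elements stored there, and the name fetch_operand are assumptions for illustration; the input amount is treated as a plain element count.

```python
from typing import Dict, List, Optional


def fetch_operand(memory: Dict[int, List[float]],
                  addr: int,
                  input_amount: Optional[int] = None,
                  default_amount: Optional[int] = None) -> List[float]:
    """Fetch the data to be processed stored at `addr`.

    An input amount in the operation domain wins; otherwise a preset default
    amount is used if one exists; otherwise all data at the address is taken.
    """
    block = memory[addr]
    amount = input_amount if input_amount is not None else default_amount
    return list(block) if amount is None else list(block[:amount])


memory = {0x100: [1.0, 2.0, 3.0, 4.0]}
first_two = fetch_operand(memory, 0x100, input_amount=2)
everything = fetch_operand(memory, 0x100)
```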

In one possible implementation, the initial data type may include any one of a fixed point number data type and a floating point number data type, and the target data type may include a floating point number data type. Alternatively, the initial data type may include a floating point number data type, and the target data type may include any one of a fixed point number data type and a floating point number data type. This enables conversion between different data types.

In this implementation, the data of the fixed-point number data type may be data expressed in a fixed-point number expression manner. The fixed point number may be 8 bits, 16 bits, 32 bits, etc. The data of the floating-point data type may be data represented in a floating-point representation. The floating point number may be 8 bits, 16 bits, 32 bits, etc.

In one possible implementation, data of the floating point data type is represented in binary. The floating point number may be 8 bits, 16 bits, 32 bits, etc. The floating point number includes a sign bit, an exponent bit, and significand bits. The floating point number may or may not have a sign bit.

Take an 8-bit binary floating point number as an example, with the bits numbered from 0 from right to left (from low to high). When the floating point number has no sign bit, the exponent bit may be the leftmost bit, i.e., the 7th bit, or any other of the 8 bits. When the floating point number has a sign bit, the sign bit is 1 bit, the exponent bit is 1 bit, and the significand is 6 bits. The sign bit and the exponent bit may be located at any non-overlapping positions among the 8 bits of the floating point number. The present disclosure is not limited thereto.

For example, with the bits of the floating point number numbered from 0 from right to left, an 8-bit binary floating point number X is written as X₇X₆X₅X₄X₃X₂X₁X₀, where X₇ is the sign bit, X₆ is the exponent bit, and X₅X₄X₃X₂X₁X₀ are the significand bits.

In one possible implementation, the value of the floating point number can then be shown as the following equation (1):

$$\pm m \cdot \mathrm{base}^{\,p+e+1} = \pm 1.d \cdot \mathrm{base}^{\,2p+e+1} \qquad \text{formula (1)}$$

Here ± is the sign of the floating point number, m is its significand, and base is the base, usually 2. e is the exponent of the floating point number, p is the position of the highest nonzero bit in the significand (counted from 0), and d is the fractional part of the significand once it is normalized to the form 1.d.

For example, assuming that the floating point number is "01010101", the sign bit is 0, the exponent is e = 1, the significand is m = 010101 and its highest nonzero bit is at position p = 4, so the value of the floating point number is $010101_2 \times 2^{4+1+1} = 1.0101_2 \times 2^{2 \times 4+1+1}$. With this floating point representation, a wider range of values can be expressed at the same bit width, which improves the precision of vector data operations.
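The decoding in formula (1) can be checked with a short Python sketch. It assumes the layout of the example above (sign bit X₇, exponent bit X₆, six significand bits X₅..X₀) and base 2. The function name decode_float8 and the handling of an all-zero significand (returned as 0) are assumptions, since the disclosure does not cover that case.

```python
def decode_float8(bits: str, base: int = 2) -> int:
    """Decode an 8-bit string "X7X6X5X4X3X2X1X0" using formula (1).

    Assumed layout: X7 = sign bit, X6 = exponent bit, X5..X0 = significand.
    """
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected a string of exactly 8 binary digits")
    sign = -1 if bits[0] == "1" else 1
    e = int(bits[1])               # exponent
    m = int(bits[2:], 2)           # significand as an integer
    if m == 0:
        return 0                   # assumption: zero significand encodes zero
    p = m.bit_length() - 1         # position of the highest nonzero bit
    return sign * m * base ** (p + e + 1)


value = decode_float8("01010101")  # m = 21, p = 4, e = 1
```

For "01010101" this evaluates 21 × 2⁶ = 1344, which equals 1.0101₂ × 2¹⁰ = 1.3125 × 1024, so both sides of formula (1) agree.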

In one possible implementation, the operation domain may further include a processing parameter, and the processing parameter may include any one of an offset address and an operation parameter.

The control module 11 may be further configured to, when the operation domain includes the offset address, obtain the data to be processed according to the address of the data to be processed and the offset address.

The processing module 12 may be further configured to, when the operation domain includes the processing parameter, process the data to be processed according to the data processing type and the processing parameter, so as to obtain processed data.

In this implementation, the processing parameter may be a parameter related to acquiring and processing the data to be processed, for example, an offset address used when acquiring the data to be processed, or an operation parameter used when performing a data operation or processing. The content included in the processing parameter can be set by those skilled in the art according to actual needs, and the disclosure does not limit this.

In a possible implementation manner, processing data to be processed according to a data processing type to obtain processed data may include: when the initial data type is different from the target data type, performing data type conversion processing on the data to be processed of the initial data type to obtain converted data of the target data type; and performing operation processing on the converted data according to the operation type to obtain an operation result, and determining the operation result as the processed data.

In the implementation mode, the data to be processed is converted into the converted data of the target data type, so that the processing process of subsequent operation can be simplified, and the speed and the efficiency of vector data processing are improved.

In a possible implementation manner, processing data to be processed according to a data processing type to obtain processed data may include: when the initial data type is the same as the target data type, the data to be processed can be directly operated according to the operation type to obtain an operation result, and the operation result is determined as the processed data. In this way, the process of data processing is simplified.
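The two cases above (convert first when the types differ, operate directly when they match) can be sketched as follows. The function name process, the type labels, the lookup tables and the example conversion (an 8-bit fixed point value with 4 fractional bits) are illustrative assumptions, not part of the disclosure.

```python
from typing import Callable, Dict, List, Tuple

Converter = Callable[[float], float]
Operator = Callable[[List[float]], List[float]]


def process(data: List[float], initial_type: str, target_type: str, op_type: str,
            converters: Dict[Tuple[str, str], Converter],
            operators: Dict[str, Operator]) -> List[float]:
    """Convert to the target data type if needed, then apply the operation."""
    if initial_type != target_type:
        convert = converters[(initial_type, target_type)]
        data = [convert(x) for x in data]   # converted data of the target type
    return operators[op_type](data)         # the result is the processed data


# Hypothetical tables: fix8 values carry 4 fractional bits; "double" scales by 2.
converters = {("fix8", "float16"): lambda raw: raw / 16.0}
operators = {"double": lambda vec: [2 * x for x in vec]}

converted_then_doubled = process([16, 32], "fix8", "float16", "double",
                                 converters, operators)
```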

In a possible implementation manner, the initial data type and/or the target data type of the vector operation instruction may be determined according to an operation domain or an operation code of the vector operation instruction, may also be determined according to a preset second default initial data type and a second default target data type of the vector operation instruction, and may also be determined according to the operation domain or the operation code of the vector operation instruction, and the preset second default initial data type and the second default target data type, which is not limited by this disclosure.

Wherein the second default initial data type and the second default target data type may be preset. The control module may determine the second default initial data type and/or the second default target data type as the initial data type and/or the target data type of the current vector operation instruction when the initial data type and/or the target data type cannot be determined according to the vector operation instruction. For example, the control module may determine the second default target data type as the target data type of the vector operation instruction 1 when only the initial data type may be determined according to the vector operation instruction 1. When only the target data type can be determined by the control module according to the vector operation instruction 1, the second default initial data type can be determined as the initial data type of the vector operation instruction 1. When the initial data type and the target data type cannot be obtained according to the vector operation instruction 1, the control module may determine the second default initial data type and the second default target data type as the initial data type and the target data type of the vector operation instruction 1, respectively.

In one possible implementation, the data processing type may further include an initial number of bits and a target number of bits. Performing data type conversion processing on the data to be processed of the initial data type to obtain the converted data of the target data type may include: performing data type conversion processing on the data to be processed having the initial number of bits and the initial data type, to obtain converted data having the target number of bits and the target data type.

In a possible implementation manner, the initial number of bits and the target number of bits may be determined according to an operation domain or an operation code of the instruction, or may be determined according to a default initial number of bits and a default target number of bits of the instruction, or may be determined according to the operation domain or the operation code of the instruction, and the default initial number of bits and the default target number of bits that are set in advance, which is not limited by the present disclosure.

A default initial number of bits and a default target number of bits can be preset. When the initial number of bits and/or the target number of bits cannot be determined from the vector operation instruction, the control module may use the default initial number of bits and/or the default target number of bits as the initial number of bits and/or the target number of bits of the current vector operation instruction. For example, when only the initial number of bits can be determined from vector operation instruction 1, the control module may use the default target number of bits as the target number of bits of vector operation instruction 1. When only the target number of bits can be determined from vector operation instruction 1, the control module may use the default initial number of bits as the initial number of bits of vector operation instruction 1. When neither the initial number of bits nor the target number of bits can be obtained from vector operation instruction 1, the control module may use the default initial number of bits and the default target number of bits as the initial number of bits and the target number of bits of vector operation instruction 1, respectively.
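The per-field fallback described above amounts to filling in whichever value the instruction leaves unspecified. In this sketch, None stands for "cannot be determined from the instruction"; the default values 8 and 16 and the name resolve_bits are assumptions chosen only for illustration.

```python
from typing import Optional, Tuple

DEFAULT_INITIAL_BITS = 8    # hypothetical preset default
DEFAULT_TARGET_BITS = 16    # hypothetical preset default


def resolve_bits(initial_bits: Optional[int],
                 target_bits: Optional[int]) -> Tuple[int, int]:
    """Fill in any bit width the instruction does not specify."""
    initial = DEFAULT_INITIAL_BITS if initial_bits is None else initial_bits
    target = DEFAULT_TARGET_BITS if target_bits is None else target_bits
    return initial, target
```

The same pattern would apply to the default initial and target data types.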

In one possible implementation, the initial number of bits and the target number of bits may be 8 bits, 16 bits, 32 bits, etc. For example, the data to be processed may be converted into converted data in any of the following ways: from the 8-bit floating point data type to the 16-bit fixed point data type; from the 16-bit fixed point data type to the 16-bit floating point data type; from the 16-bit floating point data type to the 8-bit fixed point data type; from the 8-bit fixed point data type to the 16-bit floating point data type; from the 16-bit fixed point data type to the 8-bit floating point data type; from the 16-bit floating point data type to the 16-bit floating point data type; from the 16-bit floating point data type to the 8-bit floating point data type; and from the 8-bit floating point data type to the 16-bit floating point data type.

It should be understood that the content included in the data processing type indicated in the vector operation instruction, and the code of the initial data type, the target data type, the initial bit number, the target bit number and the operation type in the vector operation instruction may be set by those skilled in the art according to actual needs, and the present disclosure does not limit this.

In one possible implementation, the vector operation instruction may be an instruction that performs an arithmetic operation, a logical operation, or the like on a vector. The operation may be performed between a vector and a vector, a vector and a scalar, or a vector and a matrix, and may include addition, subtraction, multiplication, comparison, evaluating a function on the vector, and the like. The function may be a logarithmic function, an exponential function, a power function, and the like. For example, the vector operation instruction may include at least one of a vector-add-vector operation instruction, a vector-add-scalar operation instruction, a vector dot product operation instruction, a vector outer product operation instruction, a vector-multiply-matrix operation instruction, a vector-multiply-vector operation instruction, a vector-multiply-scalar operation instruction, a vector maximum operation instruction, a vector minimum operation instruction, a vector logarithm operation instruction, and a vector exponent operation instruction. The vector-add-vector operation instruction may add the plurality of first data to be processed at the first address of data to be processed to the plurality of second data to be processed at the second address of data to be processed, in one-to-one correspondence, to obtain a plurality of addition results, and determine the addition results as the processed data. For example, if the first data to be processed of the vector-add-vector operation instruction are a1, a2 and a3 and the second data to be processed are b1, b2 and b3, then the processed data are a1+b1, a2+b2 and a3+b3.
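A few of the listed operations can be sketched in Python to make the element-wise semantics concrete. The function names and the mnemonics in VECTOR_OPS are assumptions; the disclosure does not fix any instruction names or encodings.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def vector_add(a: Vector, b: Vector) -> Vector:
    """Vector-add-vector: add elements in one-to-one correspondence."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def vector_add_scalar(a: Vector, s: float) -> Vector:
    """Vector-add-scalar: add the same scalar to every element."""
    return [x + s for x in a]


def vector_dot(a: Vector, b: Vector) -> float:
    """Vector dot product: sum of element-wise products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def vector_log(a: Vector) -> Vector:
    """Vector logarithm: natural logarithm of every element."""
    return [math.log(x) for x in a]


# Hypothetical mnemonic-to-operation table.
VECTOR_OPS: Dict[str, Callable] = {
    "VADD": vector_add,
    "VADDS": vector_add_scalar,
    "VDOT": vector_dot,
    "VLOG": vector_log,
}

sums = VECTOR_OPS["VADD"]([1, 2, 3], [10, 20, 30])  # a1+b1, a2+b2, a3+b3
```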

Fig. 2 illustrates a block diagram of a vector operation instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2, the processing module 12 may include at least one operator 121, where the operator 121 is configured to perform an operation corresponding to an operation type. The operator may include an adder, a multiplier, a divider, an activation operator, etc., which is not limited by this disclosure.

In one possible implementation, as shown in fig. 2, the apparatus may further include a storage module 13. The storage module 13 is used for storing data to be processed.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratch pad cache. The data to be processed may be stored in the memory, cache and/or register of the storage module as needed, which is not limited by the present disclosure.

In a possible implementation manner, the apparatus may further include a direct memory access module for reading or storing data from the storage module.

In one possible implementation, as shown in fig. 2, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112, and a queue storage sub-module 113.

The instruction storage submodule 111 is used for storing a vector operation instruction.

The instruction processing sub-module 112 is configured to parse the vector operation instruction to obtain an operation code and an operation field of the vector operation instruction.

The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged sequentially according to an execution order, and the plurality of instructions to be executed may include the vector operation instruction. The plurality of instructions to be executed may also include other computation instructions related to the vector operation instruction.

In this implementation manner, the execution order of the multiple instructions to be executed may be arranged according to the receiving time, the priority level, and the like of the instructions to be executed to obtain an instruction queue, so that the multiple instructions to be executed are sequentially executed according to the instruction queue.
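One way to picture this ordering is to sort the pending instructions by priority and then by receiving time. The sketch below assumes that a smaller `priority` value means "more urgent" and that ties are broken by arrival order; both conventions, and the `PendingInstruction` type, are hypothetical and not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingInstruction:
    text: str          # the instruction itself
    receive_time: int  # hypothetical arrival timestamp
    priority: int      # hypothetical: smaller value = executed earlier


def build_instruction_queue(pending: List[PendingInstruction]) -> List[PendingInstruction]:
    """Return the instructions arranged in execution order."""
    return sorted(pending, key=lambda ins: (ins.priority, ins.receive_time))


if __name__ == "__main__":
    pending = [
        PendingInstruction("B", receive_time=2, priority=1),
        PendingInstruction("A", receive_time=1, priority=1),
        PendingInstruction("C", receive_time=3, priority=0),
    ]
    print([ins.text for ins in build_instruction_queue(pending)])
```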

In one possible implementation, as shown in fig. 2, the control module 11 may further include a dependency processing sub-module 114.

The dependency relationship processing submodule 114 is configured to, when it is determined that a first to-be-executed instruction in the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage submodule 111, and after the zeroth to-be-executed instruction is executed, extract the first to-be-executed instruction from the instruction storage submodule 111 and send the first to-be-executed instruction to the processing module 12. The first to-be-executed instruction and the zeroth to-be-executed instruction are instructions in the plurality of to-be-executed instructions.

The first to-be-executed instruction has an association relationship with the zeroth to-be-executed instruction before it when a first storage address interval storing the data required by the first to-be-executed instruction overlaps a zeroth storage address interval storing the data required by the zeroth to-be-executed instruction. Conversely, the first to-be-executed instruction and the zeroth to-be-executed instruction have no association relationship when the first storage address interval and the zeroth storage address interval do not overlap.

In this way, according to the dependency relationships among the instructions to be executed, a later instruction is executed only after the earlier instruction on which it depends has finished, which ensures the accuracy of the operation result.
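The overlap test that defines an association relationship can be sketched as follows. The sketch assumes half-open address intervals `(start, end)` and a hypothetical helper `blocking_predecessors` that lists the earlier instructions a given instruction would have to wait for; neither convention comes from the disclosure.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed half-open: [start, end)


def intervals_overlap(first: Interval, zeroth: Interval) -> bool:
    """True when the two storage address intervals share at least one address."""
    return first[0] < zeroth[1] and zeroth[0] < first[1]


def blocking_predecessors(queue: List[Interval], index: int) -> List[int]:
    """Indices of earlier instructions whose address interval overlaps queue[index]."""
    return [i for i in range(index) if intervals_overlap(queue[index], queue[i])]


if __name__ == "__main__":
    # Address intervals used by three queued instructions (illustrative values).
    queue = [(0, 16), (8, 24), (32, 48)]
    print(blocking_predecessors(queue, 1))  # waits for instruction 0
    print(blocking_predecessors(queue, 2))  # no overlap, nothing to wait for
```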

In this embodiment, different codes or identifiers of the operation codes may be set for different vector operation instructions to distinguish the different vector operation instructions.

In one possible implementation, the instruction format of the vector operation instruction may be:

FY,IN,OUT,size,type4,type1.type2,a.b,pa

where FY is the operation code, and IN, OUT, size, type4, type1.type2, a.b and pa are operation domains. FY indicates that the instruction is a vector operation instruction. In type1.type2, type1 represents the initial data type and type2 represents the target data type. In a.b, a indicates the initial number of bits and b indicates the target number of bits. type4 represents the operation type. IN denotes the address of the data to be processed. OUT denotes the target address. size represents the input amount. pa is a processing parameter; when there are multiple processing parameters, a plurality of positions pa0, pa1, …, pan may be set in the instruction to indicate the different processing parameters, or the plurality of processing parameters may be expressed in the form pa0.pa1.….pan. type1.type2, a.b, size and pa may be absent.

FY,IN,OUT,size,type4,pa

where FY is the operation code, and IN, OUT, size, type4 and pa are operation domains. FY indicates that the instruction is a vector operation instruction. type4 represents the operation type. IN denotes the address of the data to be processed. OUT denotes the target address. size represents the input amount. pa is a processing parameter; when there are multiple processing parameters, a plurality of positions pa0, pa1, …, pan may be set in the instruction to indicate the different processing parameters, or the plurality of processing parameters may be expressed in the form pa0.pa1.….pan. size and pa may be absent.

type4,IN,OUT,size,pa

where type4 is the operation code, and IN, OUT, size and pa are operation domains. type4 indicates that the instruction is a vector operation instruction and indicates the operation type of the vector operation instruction. IN denotes the address of the data to be processed. OUT denotes the target address. size represents the input amount. pa is a processing parameter; when there are multiple processing parameters, a plurality of positions pa0, pa1, …, pan may be set in the instruction to indicate the different processing parameters, or the plurality of processing parameters may be expressed in the form pa0.pa1.….pan. size and pa may be absent.

FY,IN,OUT,size,type5,pa

where FY is the operation code, and IN, OUT, size, type5 and pa are operation domains. FY indicates that the instruction is a vector operation instruction. type5 represents a data processing type that specifies the initial number of bits, the initial data type, the target number of bits, the target data type and the operation type. IN denotes the address of the data to be processed. OUT denotes the target address. size represents the input amount. pa is a processing parameter; when there are multiple processing parameters, a plurality of positions pa0, pa1, …, pan may be set in the instruction to indicate the different processing parameters, or the plurality of processing parameters may be expressed in the form pa0.pa1.….pan. size and pa may be absent.

type5,IN,OUT,size,pa

where type5 is the operation code, and IN, OUT, size and pa are operation domains. type5 indicates that the instruction is a vector operation instruction and indicates the data processing type, which specifies the initial number of bits, the initial data type, the target number of bits, the target data type and the operation type. IN denotes the address of the data to be processed. OUT denotes the target address. size represents the input amount. pa is a processing parameter; when there are multiple processing parameters, a plurality of positions pa0, pa1, …, pan may be set in the instruction to indicate the different processing parameters, or the plurality of processing parameters may be expressed in the form pa0.pa1.….pan. size and pa may be absent.
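To make the field layout concrete, the sketch below parses a textual instruction in the fullest single-input format, FY,IN,OUT,size,type4,type1.type2,a.b,pa, into its operation code and operation domains. The comma-separated text encoding, the `VectorInstruction` type, the sample values and the choice to require every field except pa are assumptions for illustration; the disclosure allows some fields to be absent and does not fix a concrete encoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VectorInstruction:
    opcode: str          # FY
    in_addr: int         # IN: address of the data to be processed
    out_addr: int        # OUT: target address
    size: int            # input amount
    op_type: str         # type4: operation type
    initial_type: str    # type1
    target_type: str     # type2
    initial_bits: int    # a
    target_bits: int     # b
    params: List[str]    # pa0, pa1, ... (may be empty)


def parse_vector_instruction(text: str) -> VectorInstruction:
    """Split the instruction text into operation code and operation domains."""
    fields = [field.strip() for field in text.split(",")]
    if len(fields) < 7:
        raise ValueError("expected FY,IN,OUT,size,type4,type1.type2,a.b[,pa...]")
    opcode, in_addr, out_addr, size, op_type, types, bits = fields[:7]
    initial_type, target_type = types.split(".")
    initial_bits, target_bits = bits.split(".")
    return VectorInstruction(
        opcode=opcode,
        in_addr=int(in_addr, 0),   # base 0 accepts both "256" and "0x100"
        out_addr=int(out_addr, 0),
        size=int(size),
        op_type=op_type,
        initial_type=initial_type,
        target_type=target_type,
        initial_bits=int(initial_bits),
        target_bits=int(target_bits),
        params=fields[7:],
    )


if __name__ == "__main__":
    print(parse_vector_instruction("FY,0x100,0x200,8,add,int.float,16.32,pa0,pa1"))
```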

When there are a plurality of data to be processed, the vector operation instruction may include a plurality of to-be-processed data addresses. For example, the instruction format of the vector operation instruction may be any one of the following:

FY,IN1,IN2,OUT,size,type4,type1.type2,a.b,pa

FY,IN1,IN2,OUT,size,type4,pa

type4,IN1,IN2,OUT,size,pa

FY,IN1,IN2,OUT,size,type5,pa

type5,IN1,IN2,OUT,size,pa

where IN1 is the first to-be-processed data address and IN2 is the second to-be-processed data address.

Tables 1 and 2 below are examples of different vector operation instructions provided by embodiments of the present disclosure. The codes or identifiers of the operation codes of the vector operation instructions and the positions of the different parameters in the operation domains can be set by those skilled in the art according to actual needs, which is not limited by the present disclosure.

Table 1 vector operation instruction example 1

Table 2 vector operation instruction example 2

It should be understood that the operation code of the vector operation instruction, and the positions of the operation code and the operation domain in the instruction format, may be set as desired by one skilled in the art, and the disclosure is not limited thereto.

In one possible implementation manner, the apparatus may be disposed in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and an embedded Neural Network Processor (NPU).

It should be noted that, although the vector operation instruction processing apparatus has been described above by taking the above-described embodiment as an example, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.

The present disclosure provides a machine learning arithmetic device, which may include one or more of the above vector operation instruction processing devices, and is configured to acquire data to be processed and control information from other processing devices and execute a specified machine learning operation. The machine learning arithmetic device can obtain a vector operation instruction from other machine learning arithmetic devices or non-machine learning arithmetic devices, and transmit an execution result to peripheral equipment (also called other processing devices) through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces and servers. When more than one vector operation instruction processing device is included, the vector operation instruction processing devices can be connected and transmit data through a specific structure, for example, they can be interconnected and transmit data through a PCIE bus, so as to support larger-scale neural network operations. In this case, the devices may share the same control system or have separate control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection mode can be any interconnection topology.

The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.

Fig. 3a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in fig. 3a, the combined processing device includes the machine learning arithmetic device, the universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.

Other processing devices include one or more types of general-purpose/special-purpose processors such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, perform data transfer, and complete basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to perform computing tasks.

The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; it can obtain control instructions from other processing devices and write them into a control cache on the machine learning arithmetic device chip; it can also read the data in the storage module of the machine learning arithmetic device and transmit the data to other processing devices.

Fig. 3b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3b, the combined processing device may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing devices respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing devices, and is particularly suitable for data to be computed that cannot be held entirely in the internal storage of the machine learning arithmetic device or the other processing devices.

The combined processing device can be used as an SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle or video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card or a WiFi interface.

The present disclosure provides a machine learning chip, which includes the above machine learning arithmetic device or combined processing device.

The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.

Fig. 4 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in fig. 4, the board card includes the above-mentioned machine learning chip package structure or the above-mentioned machine learning chip. In addition to the machine learning chip 389, the board card may include other supporting components, including but not limited to: a memory device 390, an interface device 391 and a control device 392.

The memory device 390 is coupled to the machine learning chip 389 (or a machine learning chip within a machine learning chip package structure) via a bus for storing data. The memory device 390 may include multiple groups of memory cells 393. Each group of memory cells 393 is coupled to the machine learning chip 389 via a bus. It is understood that each group of memory cells 393 may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).

DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read out on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.

In one embodiment, the memory device 390 may include four groups of memory cells 393. Each group of memory cells 393 may include a plurality of DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of memory cells 393, the theoretical data transfer bandwidth may reach 25600 MB/s.
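The 25600 MB/s figure follows from the transfer rate and the data width: DDR4-3200 performs 3200 million transfers per second, and each transfer on a 64-bit data path moves 8 bytes. A minimal check of that arithmetic, with a hypothetical helper name, is shown below; the 8 ECC bits are excluded because they carry no payload data.

```python
def ddr_bandwidth_mb_per_s(mega_transfers_per_s: int, data_bits: int) -> float:
    """Theoretical bandwidth in MB/s = transfers per second (in millions) x bytes per transfer."""
    return mega_transfers_per_s * data_bits / 8


if __name__ == "__main__":
    # DDR4-3200 on the 64 data bits of one 72-bit controller.
    print(ddr_bandwidth_mb_per_s(3200, 64))
```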

In one embodiment, each group of memory cells 393 includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389 for controlling data transfer and data storage of each group of memory cells 393.

The interface device 391 is electrically coupled to the machine learning chip 389 (or a machine learning chip within a machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface, and the data to be processed is transmitted from the server to the machine learning chip 389 through the standard PCIE interface, so as to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface; the disclosure does not limit the specific form of the other interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is transmitted back to the external device (e.g., a server) by the interface device.
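The 16000 MB/s figure corresponds to the raw signalling rate of a PCIE 3.0 x16 link. The sketch below reproduces it and, as an aside not stated in the text, also shows the slightly lower payload rate once the 128b/130b line encoding used by PCIE 3.0 is taken into account. The function name and the return convention are hypothetical.

```python
from typing import Tuple


def pcie_bandwidth_mb_per_s(
    gigatransfers_per_s: float,
    lanes: int,
    payload_bits: int = 128,
    encoded_bits: int = 130,
) -> Tuple[float, float]:
    """Return (raw, after line encoding) one-direction bandwidth in MB/s."""
    raw = gigatransfers_per_s * 1000 * lanes / 8       # 1 bit per transfer per lane
    effective = raw * payload_bits / encoded_bits      # 128b/130b encoding overhead
    return raw, effective


if __name__ == "__main__":
    # PCIE 3.0 signals at 8 GT/s per lane; x16 means 16 lanes.
    raw, effective = pcie_bandwidth_mb_per_s(8.0, 16)
    print(raw, round(effective, 1))
```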

The control device 392 is electrically connected to the machine learning chip 389. The control device 392 is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a microcontroller unit (MCU). For example, the machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the machine learning chip 389 can be in different operating states such as multi-load and light load. The control device can regulate the working states of the processing chips, processing cores and/or processing circuits in the machine learning chip.

The present disclosure provides an electronic device, which includes the above machine learning chip or board card.

The electronic device may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle may include an aircraft, a ship, and/or a land vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.

FIG. 5 is a flowchart illustrating a method of processing a vector operation instruction according to an embodiment of the present disclosure. As shown in fig. 5, the method is applied to the above vector operation instruction processing apparatus, and includes step S51 and step S52.

In step S51, the received vector operation instruction is parsed to obtain an operation code and an operation field of the vector operation instruction, and the data to be processed and the target address required for executing the vector operation instruction are determined according to the operation code and the operation field, and the data processing type corresponding to the vector operation instruction is determined. The operation code is used for indicating that the processing required by the vector operation instruction to the data at least comprises vector operation processing. The data processing type comprises an initial data type of data to be processed, a target data type of the processed data and an operation type, and the initial data type or the target data type is a floating point data type. The operation domain comprises a data address to be processed and a target address.

In step S52, the data to be processed is processed according to the data processing type to obtain processed data, and the processed data is stored in the target address.
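Steps S51 and S52 can be pictured with a small software model. The sketch below assumes a comma-separated text instruction in the form type4,IN,OUT,size,pa, a flat Python list standing in for memory, integer input data converted to floating point, and two made-up operation types (`add_scalar`, `mul_scalar`). All of these are illustrative assumptions rather than the disclosed implementation.

```python
from typing import Callable, Dict, List

# Hypothetical operation types, each taking one element and one processing parameter.
OPERATIONS: Dict[str, Callable[[float, float], float]] = {
    "add_scalar": lambda x, p: x + p,
    "mul_scalar": lambda x, p: x * p,
}


def step_s51(instruction: str) -> dict:
    """S51: parse the instruction into operation code and operation domain."""
    op_type, in_addr, out_addr, size, pa = [f.strip() for f in instruction.split(",")]
    if op_type not in OPERATIONS:
        raise ValueError(f"unknown operation type: {op_type}")
    return {
        "op_type": op_type,
        "in_addr": int(in_addr),
        "out_addr": int(out_addr),
        "size": int(size),
        "pa": float(pa),
    }


def step_s52(decoded: dict, memory: List[float]) -> List[float]:
    """S52: process the data and store the processed data at the target address."""
    start, size = decoded["in_addr"], decoded["size"]
    data = memory[start:start + size]
    operation = OPERATIONS[decoded["op_type"]]
    # Initial data type int, target data type float: convert, then operate.
    processed = [operation(float(x), decoded["pa"]) for x in data]
    out = decoded["out_addr"]
    memory[out:out + len(processed)] = processed
    return processed


if __name__ == "__main__":
    memory = [1, 2, 3, 4, 0, 0, 0, 0]
    step_s52(step_s51("mul_scalar,0,4,4,2"), memory)
    print(memory)
```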

In one possible implementation, the operation domain may also include a data processing type. Determining the data processing type corresponding to the vector operation instruction may include: when the operation domain includes a data processing type, the data processing type corresponding to the vector operation instruction is determined according to the operation domain.

In one possible implementation, the operation code may also be used to indicate the type of data processing. Determining the data processing type corresponding to the vector operation instruction may include: when the opcode is used to indicate a data processing type, the data processing type corresponding to the vector operation instruction is determined from the opcode.

In one possible implementation, the operation field may further include an input quantity. Determining the to-be-processed data and the target address required for executing the vector operation instruction according to the operation code and the operation domain may include: when the input amount is included in the operation field, the data to be processed corresponding to the input amount is acquired from the data address to be processed.

In one possible implementation, the operation domain may further include a processing parameter, and the processing parameter includes any one of an offset address and a processing parameter. Determining the to-be-processed data and the target address required for executing the vector operation instruction according to the operation code and the operation domain may include: when the operation domain includes the offset address, acquiring the data to be processed according to the address of the data to be processed and the offset address.

Processing the data to be processed according to the data processing type to obtain the processed data may include: when the operation domain includes the processing parameter, processing the data to be processed according to the data processing type and the processing parameter to obtain the processed data.
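How the input amount and an offset address might combine when fetching operands can be sketched as below. The sketch assumes that the offset is simply added to the to-be-processed data address and that `size` counts elements; the disclosure does not state either convention, and `fetch_operands` is a hypothetical name.

```python
from typing import List, Sequence


def fetch_operands(
    memory: Sequence[float], data_address: int, size: int, offset: int = 0
) -> List[float]:
    """Read `size` elements starting at data_address + offset."""
    start = data_address + offset
    if start < 0 or start + size > len(memory):
        raise IndexError("requested range lies outside the storage module")
    return list(memory[start:start + size])


if __name__ == "__main__":
    memory = list(range(100, 116))  # 16 illustrative stored values
    print(fetch_operands(memory, data_address=4, size=3, offset=2))
```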

In a possible implementation manner, processing data to be processed according to a data processing type to obtain processed data may include:

when the initial data type is different from the target data type, performing data type conversion processing on the data to be processed of the initial data type to obtain converted data of the target data type;

and performing operation processing on the converted data according to the operation type to obtain an operation result, and determining the operation result as the processed data.
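The convert-then-operate sequence can be illustrated with Python's standard `struct` module, which can round a value to 16-, 32- or 64-bit IEEE 754 floating point. This is a sketch under stated assumptions: data types are written as hypothetical `(name, bits)` pairs, only conversion to a floating point target is modelled, and the result of the operation is not re-rounded to the target width.

```python
import struct
from typing import Callable, List, Sequence, Tuple

# struct format characters for IEEE 754 half, single and double precision.
_FLOAT_FORMATS = {16: "e", 32: "f", 64: "d"}

DataType = Tuple[str, int]  # hypothetical encoding: (type name, number of bits)


def convert_to_float(values: Sequence[float], target_bits: int) -> List[float]:
    """Round each value to the nearest floating point number of the given width."""
    fmt = _FLOAT_FORMATS[target_bits]
    return [struct.unpack(fmt, struct.pack(fmt, float(v)))[0] for v in values]


def process(
    values: Sequence[float],
    initial: DataType,
    target: DataType,
    operation: Callable[[float], float],
) -> List[float]:
    """Convert the data when the types differ, then apply the operation."""
    if initial != target:
        if target[0] != "float":
            raise ValueError("this sketch only converts to floating point")
        values = convert_to_float(values, target[1])
    return [operation(v) for v in values]


if __name__ == "__main__":
    print(process([1, 2, 3], ("int", 32), ("float", 16), lambda v: v * 0.5))
```

When the initial and target data types are equal, the conversion step is skipped and the operation is applied to the data as stored.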

In one possible implementation, the method may further include: and executing an operation corresponding to the operation type by using at least one operator.

In one possible implementation, the method may further include: and storing the data to be processed.

In a possible implementation manner, parsing the received vector operation instruction to obtain an opcode and an operation field of the vector operation instruction may include:

storing a vector operation instruction;

analyzing the vector operation instruction to obtain an operation code and an operation domain of the vector operation instruction;

storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed may include the vector operation instruction.

In one possible implementation, the method may further include:

when determining that the first to-be-executed instruction in the plurality of to-be-executed instructions has an association relation with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after determining that the execution of the zeroth to-be-executed instruction is finished, controlling the execution of the first to-be-executed instruction,

wherein the first to-be-executed instruction having an association relationship with the zeroth to-be-executed instruction before the first to-be-executed instruction includes:

the first storage address interval for storing the data required by the first to-be-executed instruction and the zeroth storage address interval for storing the data required by the zeroth to-be-executed instruction have an overlapped area.

It should be noted that, although the vector operation instruction processing method is described above by taking the above-mentioned embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.

The vector operation instruction processing method provided by the embodiment of the disclosure has the advantages of wide application range, high processing efficiency and high processing speed of the vector operation instruction, and can improve the processing efficiency and speed of vector data.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present disclosure, it should be understood that the disclosed system and apparatus may be implemented in other ways. For example, the above-described embodiments of systems and apparatuses are merely illustrative, and for example, a division of a device, an apparatus, and a module is merely a logical division, and an actual implementation may have another division, for example, a plurality of modules may be combined or integrated into another system or apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices, apparatuses or modules, and may be an electrical or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present disclosure may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a form of hardware or a form of a software program module.

The integrated modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program codes.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash memory disks, Read-Only Memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. An apparatus for processing a vector operation instruction, the apparatus comprising:

2. The apparatus of claim 1, wherein the operation domain further comprises the data processing type,

the control module is further configured to determine a data processing type corresponding to the vector operation instruction according to the operation domain when the operation domain includes the data processing type.

3. The apparatus of claim 1, wherein the opcode is further configured to indicate the type of data processing,

the control module is further configured to determine a data processing type corresponding to the vector operation instruction according to the operation code when the operation code is used to indicate the data processing type.

4. The apparatus of claim 1, wherein the operational field further comprises an input quantity,

the control module is further used for acquiring the data to be processed corresponding to the input quantity from the data address to be processed when the input quantity is included in the operation domain.

5. The apparatus of claim 1, wherein the operation domain further comprises a processing parameter, the processing parameter comprising any of an offset address and a processing parameter,

wherein the control module is further configured to obtain the data to be processed according to the address of the data to be processed and the offset address when the operation domain includes the offset address,

and the processing module is further configured to process the data to be processed according to the data processing type and the processing parameter when the operation domain includes the processing parameter, so as to obtain processed data.

6. The apparatus according to claim 1, wherein processing the data to be processed according to the data processing type to obtain processed data comprises:

7. The apparatus of claim 6, wherein the data processing type further comprises an initial number of bits and a target number of bits,

wherein performing data type conversion on the data to be processed of the initial data type to obtain converted data of the target data type comprises:

performing data type conversion on the data to be processed having the initial number of bits and the initial data type, to obtain converted data having the target number of bits and the target data type.
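A conversion that changes both the data type and the number of bits, as in claim 7, can be sketched in software with Python's standard `struct` module. The `FORMATS` table and the function `convert_vector` are hypothetical. The choice of little-endian IEEE 754 floats and two's-complement integers, and truncation toward zero when converting to an integer type, are assumptions made for illustration, not details taken from the claims.

```python
import struct

# Hypothetical mapping from (data type, number of bits) to struct format codes.
FORMATS = {
    ("float", 16): "e", ("float", 32): "f", ("float", 64): "d",
    ("int", 8): "b", ("int", 16): "h", ("int", 32): "i",
}


def convert_vector(values, initial, target):
    """Convert each element from the initial (type, bits) to the target (type, bits)."""
    src = "<" + FORMATS[initial]
    dst = "<" + FORMATS[target]
    converted = []
    for value in values:
        # Round-trip through the initial format so the value is representable there.
        (value,) = struct.unpack(src, struct.pack(src, value))
        # Change the data type; int() truncates toward zero (an assumption).
        value = int(value) if target[0] == "int" else float(value)
        # Narrow or widen to the target number of bits.
        (value,) = struct.unpack(dst, struct.pack(dst, value))
        converted.append(value)
    return converted
```

Values that do not fit the target integer width raise `struct.error` in this sketch; saturation or wrapping behaviour would be a separate design decision.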

8. The apparatus of claim 1, wherein the processing module comprises:

at least one operator for performing an operation corresponding to the operation type.

9. The apparatus of claim 1,

the device further comprises: a storage module for storing the data to be processed,

wherein the control module comprises:

the instruction storage submodule is used for storing the vector operation instruction;

the instruction processing submodule is used for analyzing the vector operation instruction to obtain an operation code and an operation domain of the vector operation instruction;

a queue storage submodule, configured to store an instruction queue, where the instruction queue includes multiple instructions to be executed that are sequentially arranged according to an execution order, where the multiple instructions to be executed include the vector operation instruction,

wherein, the control module further comprises:

the dependency relationship processing submodule is configured to cache a first to-be-executed instruction in the instruction storage submodule when it is determined that an association relationship exists between the first to-be-executed instruction and a zeroth to-be-executed instruction preceding the first to-be-executed instruction, and, after the zeroth to-be-executed instruction has been executed, to extract the first to-be-executed instruction from the instruction storage submodule and send it to the processing module,

wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:

a first storage address interval storing the data required by the first to-be-executed instruction and a zeroth storage address interval storing the data required by the zeroth to-be-executed instruction have an overlapping area.
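The overlap test and the cache-then-release behaviour of claim 9 can be modelled roughly as follows. The class names `Pending` and `DependencyStage`, the half-open address intervals, and the rule that a cached instruction also blocks later overlapping instructions are all assumptions of this sketch rather than details stated in the claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pending:
    name: str
    start: int  # first address used (inclusive)
    end: int    # one past the last address used (exclusive)


def overlaps(first: Pending, zeroth: Pending) -> bool:
    """True when the two storage address intervals share at least one address."""
    return first.start < zeroth.end and zeroth.start < first.end


class DependencyStage:
    def __init__(self):
        self.in_flight = []  # sent to the processing module, not yet finished
        self.cached = []     # held back because of an association relationship

    def submit(self, inst: Pending) -> bool:
        """Issue inst if independent of earlier instructions; otherwise cache it."""
        if any(overlaps(inst, earlier) for earlier in self.in_flight + self.cached):
            self.cached.append(inst)
            return False
        self.in_flight.append(inst)
        return True

    def complete(self, inst: Pending) -> list:
        """Mark inst finished and release cached instructions that are now free."""
        self.in_flight.remove(inst)
        released, still_cached = [], []
        for waiting in self.cached:
            if any(overlaps(waiting, b) for b in self.in_flight + still_cached):
                still_cached.append(waiting)
            else:
                self.in_flight.append(waiting)
                released.append(waiting)
        self.cached = still_cached
        return released
```

Submitting an instruction over addresses 0 to 7 followed by one over addresses 4 to 11 caches the second; calling `complete` on the first then releases it.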

10. A machine learning arithmetic device, the device comprising:

one or more vector operation instruction processing devices according to any one of claims 1 to 9, configured to obtain data to be processed and control information from other processing devices, perform a specified machine learning operation, and transmit an execution result to the other processing devices through an I/O interface;

11. A combined processing apparatus, characterized in that the combined processing apparatus comprises:

the machine learning arithmetic device of claim 10, a universal interconnect interface, and other processing devices;

the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user,

wherein the combined processing apparatus further comprises: a storage device connected to the machine learning arithmetic device and the other processing devices, respectively, for storing data of the machine learning arithmetic device and the other processing devices.

12. A machine learning chip, the machine learning chip comprising:

the machine learning arithmetic device according to claim 10 or the combined processing apparatus according to claim 11.

13. An electronic device, characterized in that the electronic device comprises:

the machine learning chip of claim 12.

14. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the machine learning chip according to claim 12;

wherein the machine learning chip is connected with the storage device, the control device and the interface device respectively;

the storage device is used for storing data;

the interface device is used for realizing data transmission between the machine learning chip and external equipment;

and the control device is used for monitoring the state of the machine learning chip.

15. A vector operation instruction processing method, applied to a vector operation instruction processing apparatus, the method comprising:

16. The method of claim 15, wherein the operation domain further comprises the data processing type,

wherein determining a data processing type corresponding to the vector operation instruction comprises:

and when the operation domain comprises the data processing type, determining the data processing type corresponding to the vector operation instruction according to the operation domain.

17. The method of claim 15, wherein the operation code is further configured to indicate the data processing type,

wherein determining a data processing type corresponding to the vector operation instruction comprises: when the operation code is used to indicate the data processing type, determining the data processing type corresponding to the vector operation instruction according to the operation code.

18. The method of claim 15, wherein the operation domain further comprises an input quantity,

wherein determining, according to the operation code and the operation domain, the data to be processed and the target address required for executing the vector operation instruction comprises:

when the operation domain includes the input quantity, acquiring the data to be processed corresponding to the input quantity from the address of the data to be processed.

19. The method of claim 15, wherein the operation domain further comprises a processing parameter, the processing parameter comprising any of an offset address and a processing parameter,

wherein determining, according to the operation code and the operation domain, the data to be processed and the target address required for executing the vector operation instruction comprises: when the operation domain includes the offset address, acquiring the data to be processed according to the address of the data to be processed and the offset address,

wherein processing the data to be processed according to the data processing type to obtain processed data comprises: when the operation domain includes the processing parameter, processing the data to be processed according to the data processing type and the processing parameter to obtain processed data.

20. The method of claim 15, wherein processing the data to be processed according to the data processing type to obtain processed data comprises:

21. The method of claim 20, wherein the data processing type further comprises an initial number of bits and a target number of bits,

22. The method of claim 15, further comprising:

and executing the operation corresponding to the operation type by using at least one operator.

23. The method of claim 15,

the method further comprises: storing the data to be processed,

wherein parsing the received vector operation instruction to obtain the operation code and the operation domain of the vector operation instruction comprises:

storing the vector operation instruction;

storing an instruction queue, the instruction queue comprising a plurality of instructions to be executed arranged in sequence in an execution order, the plurality of instructions to be executed comprising the vector operation instruction,

wherein the method further comprises:

when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, caching the first to-be-executed instruction, and, after determining that the zeroth to-be-executed instruction has been executed, controlling execution of the first to-be-executed instruction,