CN112394997A - Eight-bit shaping to half-precision floating point instruction processing device and method and related product


Info

Publication number
CN112394997A
Authority
CN
China
Prior art keywords
precision floating
bit shaping
floating point
point instruction
tensor
Prior art date
Legal status
Pending
Application number
CN201910743220.6A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910743220.6A
Publication of CN112394997A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30025 Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The application relates to an eight-bit shaping to half-precision floating point instruction processing device and method and a related product, which acquire data to be operated on and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to other processing devices through an I/O interface. The eight-bit shaping to half-precision floating point instruction processing device and method and the related products provided by the embodiments of the application have a wide application range and offer high processing efficiency and speed for the eight-bit shaping to half-precision floating point instruction.

Description

Eight-bit shaping to half-precision floating point instruction processing device and method and related product
Technical Field
The present application relates to the field of computer technologies, and in particular, to an apparatus and a method for processing an eight-bit shaping (i.e., eight-bit integer, int8) to half-precision floating point (fp16) instruction, and a related product.
Background
With the continuous development of science and technology, machine learning, and especially neural network algorithms, are increasingly widely used, and have been applied with good results in fields such as image recognition, speech recognition, and natural language processing. As the complexity of neural network algorithms grows, the demand for data type conversion of data such as tensors keeps increasing. However, existing eight-bit shaping to half-precision floating point instructions and related technologies cannot efficiently support flexible conversion of data from eight-bit shaping to half-precision floating point, and their execution efficiency and speed are low.
Disclosure of Invention
In view of the above, the present application provides an apparatus, a method and a related product for processing an eight-bit shaping to half-precision floating point instruction, so as to improve the efficiency and speed of processing the eight-bit shaping to half-precision floating point instruction.
According to a first aspect of the present application, there is provided an eight-bit shaping to half precision floating point instruction processing apparatus, the apparatus comprising:
the control module is used for compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
and the execution module is used for extracting the eight-bit shaping tensor from the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor and storing the half-precision floating point tensor in the destination address.
According to a second aspect of the present application, there is provided a machine learning arithmetic device including:
one or more eight-bit shaping to half-precision floating-point instruction processing devices according to the first aspect, configured to acquire data to be operated on and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of the eight-bit shaping to half-precision floating point instruction processing devices, the plurality of the eight-bit shaping to half-precision floating point instruction processing devices can be connected through a specific structure and transmit data;
the plurality of eight-bit shaping to half-precision floating point instruction processing devices are interconnected through a Peripheral Component Interconnect Express (PCIe) bus and transmit data so as to support larger-scale machine learning operations; the plurality of eight-bit shaping to half-precision floating point instruction processing devices may share the same control system or have their own respective control systems; they may share a memory or have their own respective memories; and their interconnection may follow any interconnection topology.
According to a third aspect of the present application, there is provided a combined processing apparatus including:
a machine learning arithmetic device, a general interconnect interface, and other processing devices as described in the second aspect above;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user,
wherein the combination processing apparatus further comprises: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
According to a fourth aspect of the present application, there is provided a machine learning chip including the machine learning arithmetic device of the second aspect or the combined processing device of the third aspect.
According to a fifth aspect of the present application, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.
According to a sixth aspect of the present application, a board card is provided, which includes the machine learning chip packaging structure of the fifth aspect.
According to a seventh aspect of the present application, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board of the sixth aspect.
According to an eighth aspect of the present application, there is provided an eight-bit shaping to half-precision floating point instruction processing method applied to an eight-bit shaping to half-precision floating point instruction processing apparatus, the method comprising:
compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
and extracting the eight-bit shaping tensor from the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The device for processing the eight-bit shaping to half-precision floating point instruction provided by this application includes a control module and an execution module. The control module is used for compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and parsing the compiled instruction to obtain the source address and destination address in its operation domain; the execution module is used for extracting the eight-bit shaping tensor from the source address, converting it into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address, so that the converted tensor can be used more quickly for subsequent operations. The eight-bit shaping to half-precision floating point instruction processing device provided by the embodiments of the application has a wide application range and offers high processing efficiency and speed for the eight-bit shaping to half-precision floating point instruction.
Other features and aspects of the present application will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
FIG. 1 shows a block diagram of an eight-bit shaping to half precision floating point instruction processing apparatus according to an embodiment of the present application.
Fig. 1a and 1b show block diagrams of an eight-bit shaping to half-precision floating-point instruction processing apparatus according to an embodiment of the present application.
Figs. 2a to 2e show block diagrams of an eight-bit shaping to half-precision floating point instruction processing apparatus according to an embodiment of the present application.
FIG. 3 is a diagram illustrating an application scenario of an eight-bit shaping to half-precision floating point instruction processing apparatus according to an embodiment of the present application.
Fig. 4a and 4b show block diagrams of a combined processing device according to an embodiment of the present application.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present application.
FIG. 6 shows a flow diagram of a method of processing an eight-bit shaping to half-precision floating point instruction according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
FIG. 1 shows a block diagram of an eight-bit shaping to half precision floating point instruction processing apparatus according to an embodiment of the present application. As shown in fig. 1, the apparatus includes a control module 11 and an execution module 12.
The control module 11 is configured to compile the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyze the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
and the execution module 12 is configured to extract an eight-bit shaping tensor from the source address, convert the eight-bit shaping tensor into a half-precision floating-point tensor, and store the half-precision floating-point tensor in the destination address.
In a possible implementation manner, the eight-bit shaping to half-precision floating point instruction acquired by the control module is an uncompiled software instruction that hardware cannot execute directly, so the control module first compiles it. Once the compiled eight-bit shaping to half-precision floating point instruction is obtained, it can be parsed. The compiled eight-bit shaping to half-precision floating point instruction is a hardware instruction that hardware can execute directly.
In this embodiment, the instruction may include an operation code (opcode) and an operation domain. An instruction is formed from the opcode and the operation domain according to a preset order and format. The opcode indicates the operation the instruction is to perform; it may be expressed as characters, codes, or numbers, which this application does not limit. The operation domain may include the parameters of the data required to execute the instruction (e.g., source, type, address) and any other parameters the instruction requires.
In one possible implementation, the opcode of the eight-bit shaping to half-precision floating point instruction may be used to indicate the conversion of eight-bit shaping type data (a tensor) to half-precision floating point type data (a tensor). The operation domain of the eight-bit shaping to half-precision floating point instruction may include a source address and a destination address. The source address is the storage address of the tensor to be converted, whose data type is the eight-bit shaping type; the destination address is the storage address of the converted tensor, whose data type is the half-precision floating point type. The data type of each element of the eight-bit shaping tensor is the eight-bit shaping type, that is, each element is an 8-bit integer. The data type of each element of the half-precision floating point tensor is the half-precision floating point type, that is, each element is a 16-bit half-precision floating point number.
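Purely as an illustration (the patent specifies no binary encoding, field names, or field widths), such an instruction's opcode and operation domain could be modeled in C as:

#include <stdint.h>

/* Hypothetical encoding of the instruction's opcode and operation
 * domain; all field names and widths here are assumptions, since the
 * patent fixes neither a binary layout nor field sizes. */
typedef struct {
    uint32_t opcode;   /* operation code, e.g. a code for "int82half"  */
    uint64_t src_addr; /* source address: the eight-bit shaping tensor */
    uint64_t dst_addr; /* destination address: the fp16 result tensor  */
} int82half_insn_t;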
In a possible implementation manner, when the eight-bit shaping to half-precision floating-point instruction processing apparatus is located on a chip (including a chip on which a general-purpose processor and/or an artificial intelligence processor is located), the source address may be an address of an on-chip memory (hereinafter, an on-chip address) or of an off-chip memory (hereinafter, an off-chip address), and the destination address may likewise be an on-chip or off-chip address. It will be appreciated that the conversion efficiency of the eight-bit shaping to half-precision floating point instruction processing apparatus is highest when both the source address and the destination address are on-chip addresses.
In a possible implementation manner, the control module parses the operation domain of the eight-bit shaping to half-precision floating point instruction to obtain the source address and destination address, and then sends them to the execution module. The execution module extracts the eight-bit shaping tensor to be converted from the source address, converts the extracted tensor into a half-precision floating-point tensor, and stores the converted half-precision floating-point tensor at the destination address. The execution module can perform the conversion with a conventional data type conversion method; this application does not limit the conversion method.
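A minimal C sketch of the conversion the execution module performs, assuming the element count is known and a compiler with native _Float16 support (e.g. recent GCC or Clang); the device's actual hardware datapath is not described in the text:

#include <stddef.h>
#include <stdint.h>

/* Convert n int8 elements at src into fp16 elements at dst. Every
 * int8 value (-128..127) is exactly representable in fp16, so this
 * conversion is lossless. */
static void int8_to_half(_Float16 *dst, const int8_t *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = (_Float16)src[i];
}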
It should be understood that the instruction format of the eight-bit shaping to half-precision floating-point instruction and the included opcode and operation domain may be set by one skilled in the art as desired, and the application is not limited in this respect.
In this embodiment, the apparatus may include one or more control modules and one or more execution modules, and their numbers can be set according to actual needs. Any one of the control modules parses the obtained eight-bit shaping to half-precision floating point instruction to obtain the source address and destination address in its operation domain; any one of the execution modules (or an execution module designated by the control module) extracts the eight-bit shaping tensor from the source address, converts it into a half-precision floating-point tensor, and stores the half-precision floating-point tensor at the destination address. This is not limited by the present application.
The device for processing the eight-bit shaping to half-precision floating point instruction provided by this embodiment includes a control module and an execution module. The control module is used for compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and parsing the compiled instruction to obtain the source address and destination address in its operation domain; the execution module is used for extracting the eight-bit shaping tensor from the source address, converting it into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address, so that the converted tensor can be used more quickly for subsequent operations. The eight-bit shaping to half-precision floating point instruction processing device provided by the embodiments of the application has a wide application range and offers high processing efficiency and speed for the eight-bit shaping to half-precision floating point instruction.
In a possible implementation manner, the control module is further configured to obtain the number of elements in the operation domain of the eight-bit shaping to half-precision floating point instruction; the execution module is further configured to extract the elements to be converted in the eight-bit shaping tensor from the source address according to the number of elements, convert them into half-precision floating point elements, and store the converted elements at the destination address.
In one possible implementation, the eight-bit shaping tensor to be converted may include a plurality of elements, and an eight-bit shaping to half-precision floating point instruction may be used to convert only a portion of those elements. Which portion is to be converted can be determined according to the needs of the operation performed after the conversion, or according to the processing efficiency of the eight-bit shaping to half-precision floating point instruction processing apparatus. For example, when the processing efficiency of the apparatus is low, fewer elements are designated as elements to be converted, and when the efficiency is high, more elements are.
The operation domain of the eight-bit shaping to half-precision floating point instruction may also include the number of elements to be converted in the eight-bit shaping tensor. After the control module parses the instruction, it obtains the number of elements from the operation domain; the execution module then extracts that many elements to be converted from the source address. Since the number of elements to be converted is given by the parsed operation domain, the execution module can convert them into half-precision floating point elements and store them at the destination address, completing the conversion of part of the elements of the eight-bit shaping tensor.
In one possible implementation, in the convolution operation of a neural network the input neuron data may be an eight-bit shaping tensor, and during convolution with the convolution kernel, partial elements of the input neuron data are convolved with the kernel in sequence. The eight-bit shaping input neuron data may first be converted into half-precision floating point input neuron data and then convolved with the convolution kernel. In this case, the number of elements in the operation domain of the eight-bit shaping to half-precision floating point instruction determines which part of the input neuron data forms the elements to be converted, i.e., the elements corresponding to the current position of the convolution kernel; after these elements are converted to half precision, they are convolved with the convolution kernel.
In this embodiment, the control module is further configured to obtain the number of elements in the operation domain of the eight-bit shaping to half-precision floating point instruction; the execution module is further configured to extract the elements to be converted from the source address according to the number of elements, convert them into half-precision floating point elements, and store the converted elements at the destination address. The element-number operand allows the eight-bit shaping to half-precision floating point instruction to convert only part of the elements of the tensor to be converted, making data type conversion more flexible and more efficient.
In a possible implementation manner, the control module is further configured to obtain the number of conversions in the operation domain of the eight-bit shaping to half-precision floating point instruction; the execution module is further configured to extract the elements to be converted in the eight-bit shaping tensor from the source address according to the number of elements, convert them into half-precision floating point elements, and store the converted elements at the destination address, repeating these steps as many times as the number of conversions, with the elements extracted in different passes not overlapping.
In a possible implementation manner, the data type conversion of the whole tensor can be completed by converting the elements of the eight-bit shaping tensor in multiple passes according to the number of conversions in the operation domain, converting part of the elements in each pass. For example, if the number of conversions is N and the eight-bit shaping tensor has M elements, then M/N elements are converted per pass. With the number of conversions, the eight-bit shaping to half-precision floating point instruction can convert part of the elements of the tensor to be converted at a time, making data type conversion more flexible and more efficient.
In one possible implementation, the product of the number of elements and the number of conversions is equal to the total number of elements in the tensor.
In one possible implementation, the operation domain of an eight-bit shaping to half-precision floating point instruction may include both the number of elements and the number of conversions. The instruction then extracts, in each pass, the elements to be converted according to the number of elements, and completes the conversion of the whole eight-bit shaping tensor after the number of passes given by the number of conversions. Because both operands are present, the instruction need not compute how many elements to extract per pass or how many passes to execute, which improves the execution efficiency of the eight-bit shaping to half-precision floating point instruction.
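A C sketch of the two operands working together, under the stated assumption that the number of elements times the number of conversions equals the tensor's total element count:

#include <stddef.h>
#include <stdint.h>

/* Convert num_of_section non-overlapping passes of num_of_ele int8
 * elements each; M = num_of_ele * num_of_section elements in total. */
static void int8_to_half_sections(_Float16 *dst, const int8_t *src,
                                  size_t num_of_ele, size_t num_of_section) {
    for (size_t s = 0; s < num_of_section; ++s) {
        size_t base = s * num_of_ele; /* passes never overlap */
        for (size_t i = 0; i < num_of_ele; ++i)
            dst[base + i] = (_Float16)src[base + i];
    }
}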
In a possible implementation manner, the control module is further configured to obtain the extraction step in the operation domain of the eight-bit shaping to half-precision floating point instruction; the execution module is further configured to extract the elements to be converted in the eight-bit shaping tensor from the source address according to the number of elements and the extraction step, convert them into half-precision floating point elements, and store the converted elements at the destination address.
In one possible implementation, the operation domain of an eight-bit shaping to half-precision floating point instruction may include an extraction step. The extraction step is the interval between the groups of elements extracted in successive passes when the eight-bit shaping tensor is converted in multiple passes. With the extraction step, data type conversion of some, but not all, elements of the eight-bit shaping tensor can be achieved.
In a possible implementation manner, when the operation domain of the eight-bit shaping to half-precision floating point instruction includes both the number of elements and the extraction step, each extraction of elements to be converted after the first (which is made according to the number of elements) is offset from the previous extraction by the extraction step.
In one possible implementation, the extraction step may be expressed as a number of rows, a number of columns, a number of interval elements, and the like. This is not limited in this application.
In this embodiment, the extraction step in the operation domain of the eight-bit shaping to half-precision floating point instruction enables data type conversion of part of the eight-bit shaping tensor, and can improve the conversion flexibility of the eight-bit shaping to half-precision floating point instruction.
In a possible implementation manner, the control module is further configured to obtain the storage step in the operation domain of the eight-bit shaping to half-precision floating point instruction; the execution module is further configured to extract the elements to be converted in the eight-bit shaping tensor from the source address according to the number of elements, convert them into half-precision floating point elements, and store the converted elements at the destination address according to the storage step.
In one possible implementation, the operation domain of the eight-bit shaping to half-precision floating point instruction may further include a storage step. The storage step can be used to store the converted half-precision floating point elements at the destination address at fixed intervals.
In this embodiment, the storage step size in the operation domain of the eight-bit shaping to half-precision floating point instruction realizes discontinuous storage of the converted half-precision floating point type element on the destination address, and can improve the execution flexibility of the eight-bit shaping to half-precision floating point instruction.
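A C sketch combining the extraction step described above with the storage step; stride units are assumed to be elements, since the text leaves the unit (rows, columns, elements) open:

#include <stddef.h>
#include <stdint.h>

/* Each pass reads num_of_ele elements starting src_stride elements
 * past the previous pass's start, and writes the converted group
 * dst_stride elements past the previous write. A stride larger than
 * num_of_ele leaves unconverted gaps (source side) or unwritten gaps
 * (destination side) between groups. */
static void int8_to_half_strided(_Float16 *dst, const int8_t *src,
                                 size_t num_of_ele, size_t num_of_section,
                                 size_t src_stride, size_t dst_stride) {
    for (size_t s = 0; s < num_of_section; ++s)
        for (size_t i = 0; i < num_of_ele; ++i)
            dst[s * dst_stride + i] = (_Float16)src[s * src_stride + i];
}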
In one possible implementation, the execution module includes a plurality of execution sub-modules,
the control module is further configured to compile the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, analyze the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, determine a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and allocate each source sub-address and each destination sub-address to an execution sub-module;
and the target execution submodule is used for extracting an eight-bit shaping tensor from the corresponding source subaddress, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor in the corresponding target subaddress, and the target execution submodule is any one of the execution submodules.
In one possible implementation, the execution module may include a plurality of execution submodules. The control module may divide the source address and the destination address in the operational domain into a plurality of source sub-addresses and a plurality of said destination sub-addresses. The number of source sub-addresses may be less than or equal to the number of execution sub-modules, and the number of destination sub-addresses may also be less than or equal to the number of execution sub-modules. When the number of the source sub-addresses and the number of the destination sub-addresses are smaller than the number of the execution sub-modules, part of the execution sub-modules can be in an idle state and do not participate in data type conversion. When the number of source and destination sub-addresses equals the number of execution sub-modules, all execution sub-modules participate in the data type conversion.
In a possible implementation manner, any execution submodule participating in the data type conversion extracts the eight-bit shaping tensor to be converted from the source subaddress allocated to it by the control module, performs the data type conversion to obtain a half-precision floating point tensor, and stores that tensor at the corresponding destination subaddress.
In this embodiment, the execution module includes a plurality of execution submodules, the control module may determine a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and allocate each of the source sub-addresses and each of the destination sub-addresses to the execution submodules, and the execution submodules may extract an eight-bit shaping tensor according to the corresponding source sub-address, perform data type conversion to obtain a half-precision floating-point tensor, and store the half-precision floating-point tensor in the corresponding destination sub-address. The multiple execution sub-modules can realize parallel data type conversion, and the conversion efficiency of the eight-bit shaping tensor is improved.
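To make the address splitting concrete, the following C sketch runs the dispatch just described serially; the lane count, even chunking, and remainder-to-last-lane rule are all assumptions, since the patent only requires that the sub-address count not exceed the number of execution submodules:

#include <stddef.h>
#include <stdint.h>

/* Cut one conversion of n elements into one contiguous sub-range per
 * execution submodule ("lane"). On the device each lane's inner loop
 * would run on its own submodule in parallel. */
static void int8_to_half_split(_Float16 *dst, const int8_t *src,
                               size_t n, size_t num_lanes) {
    size_t chunk = n / num_lanes;
    for (size_t lane = 0; lane < num_lanes; ++lane) {
        size_t begin = lane * chunk;
        size_t end = (lane + 1 == num_lanes) ? n : begin + chunk;
        for (size_t i = begin; i < end; ++i)
            dst[i] = (_Float16)src[i];
    }
}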
In one possible implementation, the execution module comprises a master execution submodule and a plurality of slave execution submodules,
the control module is further configured to compile the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyze the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the main execution submodule is used for determining a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and distributing each source sub-address and each destination sub-address to a slave execution submodule;
the target slave execution submodule is used for extracting an eight-bit shaping tensor from the corresponding source subaddress, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor in the corresponding target subaddress, and the target slave execution submodule is any slave execution submodule.
In one possible implementation, the execution module may include one or more master execution sub-modules and a plurality of slave execution sub-modules, where one master execution sub-module may connect the plurality of slave execution sub-modules. The main execution sub-module is connected with the control module and used for receiving a source address and a destination address sent by the control module. The main execution submodule can divide the source address and the destination address into a plurality of source sub-addresses and a plurality of destination sub-addresses. The number of source sub-addresses or destination sub-addresses may be equal to or less than the number of slave execution sub-modules connected to the master execution sub-module. The master execution submodule may determine the slave execution submodule to perform the translation, and divide the source address and the destination address according to the number of the slave execution submodules determined to perform the translation.
In a possible implementation manner, the main execution submodule may divide only the source address, obtaining a plurality of source subaddresses. Each slave execution submodule extracts an eight-bit shaping tensor from its corresponding source subaddress and converts it; the half-precision floating point tensor obtained by the conversion is sent to the master execution submodule, which writes it to the destination address in a unified manner.
In the embodiment, the plurality of slave execution submodules can perform data type conversion in parallel, so that the conversion efficiency of the eight-bit shaping tensor is improved. The setting of the main execution submodule and the auxiliary execution submodule can also improve the execution efficiency of the execution module.
In one possible implementation, the apparatus further includes: and the storage module is used for storing the eight-bit shaping tensor and/or the half-precision floating-point tensor.
In a possible implementation, the eight-bit shaping to half-precision floating-point instruction processing apparatus may further include a storage module configured to store the eight-bit shaping tensor and/or the half-precision floating-point tensor. The converted half-precision floating point tensor and/or the eight-bit shaping tensor to be converted are stored locally: data to be converted can be transferred to the local storage module in advance, so the apparatus can fetch it directly from local storage when a conversion is needed. The converted data can likewise be stored locally, without being limited by the I/O bandwidth to an external storage device, which can improve the processing efficiency of the eight-bit shaping to half-precision floating point instruction processing apparatus.
In a possible implementation manner, the control module is further configured to obtain a point position in an operation domain of the eight-bit shaping to half-precision floating-point instruction;
the execution module is further configured to extract an element to be converted in the eight-bit shaping tensor at the source address, convert the element to be converted into a half-precision floating-point element, quantize according to the point position, and store the quantized element in the destination address.
In one possible implementation, the operation domain of the eight-bit shaping to half precision floating point instruction may also include a point location. The converted half-precision floating-point type elements may be quantized according to the point position. The position of the decimal point in the quantized half-precision floating-point type element may be determined from the point position. The relationship between the tensor (or element) before conversion, the tensor (or element) after conversion, and the point position can be expressed by the following formula (1):
dst & src/2^ (pos) equation (1)
Where det denotes the tensor (or element) after conversion, src denotes the tensor (or element) before conversion, and pos denotes the point position.
In this embodiment, the operation domain of the eight-bit shaping to half-precision floating-point instruction may further include a point position, and the converted half-precision floating point elements may be quantized using the point position. The compiled eight-bit shaping to half-precision floating point instruction thus combines data type conversion and quantization, which improves the execution efficiency of the instruction.
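A C sketch of conversion combined with point-position quantization per equation (1); the float-mediated ldexpf route is an assumption, not the device's datapath:

#include <math.h>
#include <stdint.h>

/* Convert one int8 element and scale it by 2^-pos, as in equation (1):
 * dst = src / 2^pos. For example, src = 64 with pos = 6 yields 1.0.
 * For moderate pos values the result is exact, since int8 magnitudes
 * scaled by powers of two fit within fp16 precision. */
static _Float16 int8_to_half_quant(int8_t src, int pos) {
    return (_Float16)ldexpf((float)src, -pos);
}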
In one possible implementation, the number of elements is the number of elements in any dimension of the tensor.
In a possible implementation manner, when the tensor is two-dimensional, the number of elements may be the number of elements in the X dimension of the two-dimensional tensor, or the number of elements in the Y dimension. For example, when a two-dimensional tensor is stored in memory, the X dimension of the tensor can correspond to a row in memory and the Y dimension to a column. When the number of elements refers to the X dimension, it is the number of elements along the row direction of the tensor in memory. Assuming a row holds 200 elements, then when the number of elements is 1000, 1000/200 = 5 rows of data need to be extracted. The number of dimensions of the tensor is not limited in this application; tensors of other dimensionalities can be handled by reference to the two-dimensional description, which is not repeated here.
In this embodiment, the number of elements is the number of elements in any dimension of the tensor, so the eight-bit shaping to half-precision floating point instruction can specify exactly how many elements to extract and extract them along the chosen dimension. This can improve the extraction efficiency of the data to be converted during execution, and thus the execution efficiency of the eight-bit shaping to half-precision floating point instruction.
In a possible implementation manner, the number of bits of the extraction step and the storage step is a multiple of the number of bits of any dimension in the tensor.
In one possible implementation, the number of bits of the extraction step and the storage step is a multiple of the number of bits of a dimension of the tensor. For example, if the eight-bit shaping tensor stored in memory is a two-dimensional tensor, the number of bits of the extraction step and the storage step can be integer multiples of the number of bits of a row in memory, so that the eight-bit shaping to half-precision floating point instruction processing apparatus extracts and stores whole rows when extracting and storing data.
In this embodiment, the number of bits of the extraction step and the storage step is a multiple of the number of bits of any dimension in the tensor, so that the need of calculating the extraction position or the storage position of data in the execution process can be avoided, and the execution efficiency of processing the eight-bit shaping to half-precision floating point instruction can be improved.
In a possible implementation manner, the number of bits of the storage step is greater than the number of bits of the half-precision floating-point type element to be stored.
In one possible implementation, the storage step may be the interval between the base addresses of two consecutive stores. In that case, the number of bits of the storage step must be greater than the number of bits of the half-precision floating point elements to be stored, so that the addresses written by two consecutive stores do not overlap.
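A sketch of this constraint, assuming the storage step is measured in bits between consecutive store base addresses and that each store writes a group of num_of_ele fp16 elements (the text states the single-element case; the group form is a generalization):

#include <stdbool.h>
#include <stdint.h>

/* Each stored group occupies 16 * num_of_ele bits of fp16 payload, so
 * a storage step larger than that payload keeps two consecutive
 * stores from overlapping. Names and units here are assumptions. */
static bool store_step_ok(uint64_t dst_stride_bits, uint64_t num_of_ele) {
    return dst_stride_bits > 16 * num_of_ele;
}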
In one possible implementation, the control module is further configured to generate an assembly file according to the eight-bit shaping to half-precision floating-point instruction, and translate the assembly file into a binary file,
and the binary file is the compiled eight-bit shaping to half-precision floating point instruction.
In one possible implementation, the control module includes:
the instruction storage submodule is used for storing instructions and comprises the eight-bit shaping to half-precision floating point instruction;
the instruction processing submodule is used for analyzing the eight-bit shaping to half-precision floating point instruction to obtain an operation code and an operation domain of the eight-bit shaping to half-precision floating point instruction;
and the queue storage submodule is used for storing an instruction queue, where the instruction queue includes a plurality of instructions arranged in execution order, including the eight-bit shaping to half-precision floating point instruction.
In a possible implementation manner, the control module further includes:
a dependency processing submodule, configured to cache the eight-bit shaping to half-precision floating point instruction in the instruction storage submodule when it is determined that an association relationship exists between an eight-bit shaping to half-precision floating point instruction in the plurality of instructions and a zeroth instruction before the eight-bit shaping to half-precision floating point instruction, and extract the eight-bit shaping to half-precision floating point instruction from the instruction storage submodule after the zeroth instruction is executed and send the eight-bit shaping to half-precision floating point instruction to the execution module,
wherein the associating relationship between the eight-bit shaping to half-precision floating point instruction and a zeroth instruction before the eight-bit shaping to half-precision floating point instruction comprises:
and a first storage address interval for storing data required by the eight-bit shaping to half-precision floating point instruction and a zeroth storage address interval for storing data required by the zeroth instruction have an overlapped area.
FIG. 1a shows a block diagram of an eight-bit shaping to half precision floating point instruction processing apparatus according to an embodiment of the present application. In one possible implementation, as shown in fig. 1a, the storage module 13 is configured to store an eight-bit shaping to half-precision floating point instruction, a tensor to be converted, and a tensor after conversion. The control module 11 includes an instruction storage sub-module 111, an instruction processing sub-module 112, a dependency processing sub-module 114, and a queue storage sub-module 113. The instruction storage submodule 111 may be configured to store the extracted eight-bit shaping to half-precision floating-point instruction. The instruction processing sub-module 112 may be configured to parse the eight-bit shaping to half-precision floating point instruction to obtain an operation code and an operation domain of the eight-bit shaping to half-precision floating point instruction, and obtain parameters such as a source address and a destination address in the operation domain. The dependency processing submodule 114 may be configured to determine an association between an eight-bit shaping to half precision floating point instruction and a previous zeroth instruction. Queue storage submodule 113 may be configured to store an instruction queue comprising a plurality of instructions arranged in order of execution, including eight-bit shaping to half precision floating point instructions.
The execution module 12 may be configured to extract the eight-bit reshaped tensor at the source address, convert to the half-precision floating-point tensor, and store the half-precision floating-point tensor at the destination address.
FIG. 1b shows a block diagram of an eight-bit shaping to half precision floating point instruction processing apparatus according to an embodiment of the present application. Unlike FIG. 1a, FIG. 1b shows an eight-bit reshaping to half-precision floating point instruction processing apparatus in which the execution module 12 includes a plurality of execution submodules 120.
As shown in FIG. 1b, execution module 12 may include a plurality of execution submodules 120. The instruction processing sub-module 112 in the control module 11 may be configured to compile the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, analyze the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, determine a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and allocate each of the source sub-addresses and each of the destination sub-addresses to the execution sub-module. And the execution submodule 120 may be configured to extract an eight-bit shaping tensor at the corresponding source subaddress, convert the eight-bit shaping tensor into a half-precision floating-point tensor, and store the half-precision floating-point tensor at the corresponding destination subaddress.
FIG. 2a shows a block diagram of an eight-bit shaping to half precision floating point instruction processing apparatus according to an embodiment of the present application. In contrast to fig. 1a, fig. 2a shows an apparatus for processing an eight-bit shaping to half-precision floating-point instruction, wherein the execution module 12 comprises a master execution submodule 121 and a plurality of slave execution submodules 122. The master execution submodule 121 is configured to determine a plurality of source sub-addresses and a plurality of destination sub-addresses according to a source address and a destination address, and allocate each source sub-address and each destination sub-address to a slave execution submodule; and the slave execution submodule 122 is configured to extract an eight-bit shaping tensor from the corresponding source subaddress, convert the eight-bit shaping tensor into a half-precision floating-point tensor, and store the half-precision floating-point tensor in the corresponding destination subaddress.
It should be noted that, a person skilled in the art may set a connection manner between the master execution submodule and the plurality of slave execution submodules according to actual needs to implement the configuration setting of the execution module, for example, the configuration of the execution module may be an "H" type configuration, an array type configuration, a tree type configuration, and the like, which is not limited in this application.
FIG. 2b shows a block diagram of an eight-bit shaping to half precision floating point instruction processing apparatus according to an embodiment of the present application. In one possible implementation, as shown in fig. 2b, the execution module 12 may further include one or more branch execution sub-modules 123, where the branch execution sub-module 123 is configured to forward data and/or operation instructions between the master execution sub-module 121 and the slave execution sub-module 122. The main execution sub-module 121 is connected to one or more branch execution sub-modules 123. Therefore, the main execution sub-module, the branch execution sub-module and the slave execution sub-module in the execution module are connected by adopting an H-shaped structure, and data and/or operation instructions are forwarded by the branch execution sub-module, so that the resource occupation of the main execution sub-module is saved, and the instruction processing speed is further improved.
FIG. 2c shows a block diagram of an eight-bit shaping to half precision floating point instruction processing apparatus according to an embodiment of the present application. In one possible implementation, as shown in fig. 2c, a plurality of slave execution submodules 122 are distributed in an array.
Each slave execution submodule 122 is connected with other adjacent slave execution submodules 122, the master execution submodule 121 is connected with k slave execution submodules 122 in the plurality of slave execution submodules 122, and the k slave execution submodules 122 are: the n slave execution sub-modules 122 of the 1 st row, the n slave execution sub-modules 122 of the m th row, and the m slave execution sub-modules 122 of the 1 st column.
As shown in fig. 2c, the k slave execution sub-modules only include the n slave execution sub-modules in the 1 st row, the n slave execution sub-modules in the m th row, and the m slave execution sub-modules in the 1 st column, that is, the k slave execution sub-modules are slave execution sub-modules directly connected to the master execution sub-module from among the plurality of slave execution sub-modules. The k slave execution sub-modules are used for forwarding data and instructions between the master execution sub-module and the plurality of slave execution sub-modules. Therefore, the plurality of slave execution sub-modules are distributed in an array, the speed of sending data and/or operation instructions from the master execution sub-module to the slave execution sub-modules can be increased, and the instruction processing speed is further increased.
FIG. 2d shows a block diagram of an eight-bit shaping to half precision floating point instruction processing apparatus according to an embodiment of the present application. In one possible implementation, as shown in fig. 2d, the execution module may further include a tree sub-module 124. The tree submodule 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master execution submodule 121, and the plurality of branch ports 402 are connected to the plurality of slave execution submodules 122, respectively. The tree sub-module 124 has a transceiving function, and is configured to forward data and/or operation instructions between the master execution sub-module 121 and the slave execution sub-module 122. Therefore, the execution modules are connected in a tree-shaped structure under the action of the tree-shaped sub-modules, and the speed of sending data and/or operation instructions from the main execution sub-module to the auxiliary execution sub-module can be increased by utilizing the forwarding function of the tree-shaped sub-modules, so that the processing speed of the instructions is increased.
In one possible implementation, the tree submodule 124 may be an optional structure of the apparatus and may include at least one level of nodes. The nodes are wiring structures with a forwarding function and have no operation function of their own. The lowest-level nodes are connected with the slave execution submodules to forward data and/or operation instructions between the master execution submodule 121 and the slave execution submodules 122. In particular, if the tree submodule has zero levels of nodes, the apparatus does not require the tree submodule.
In one possible implementation, the tree submodule 124 may include a plurality of nodes of an n-ary tree structure, and the plurality of nodes of the n-ary tree structure may have a plurality of layers.
For example, FIG. 2e shows a block diagram of an eight-bit shaping to half-precision floating point instruction processing apparatus according to an embodiment of the present application. As shown in fig. 2e, the n-ary tree structure may be a binary tree structure, with the tree sub-module comprising 2 levels of nodes 01. The lowest-level nodes 01 are connected with the slave execution submodules 122 to forward data and/or operation instructions between the master execution submodule 121 and the slave execution submodules 122.
In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. Those skilled in the art may set n in the n-ary tree structure and the number of layers of nodes in the n-ary tree structure as needed, which is not limited in this application.
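As a quick sizing check, the fan-out such a tree provides can be sketched in a few lines of C; the full-tree assumption and the function name are illustrative additions, not part of the original description:

    /* For a full n-ary tree with d levels of nodes, the lowest level exposes
       n^d ports for slave execution sub-modules (e.g. the binary tree of
       fig. 2e with 2 levels would offer 4 ports under this assumption). */
    static long tree_leaf_ports(int n, int d) {
        long ports = 1;
        while (d-- > 0)
            ports *= n;
        return ports;
    }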
In one possible implementation, as shown in fig. 1, 1a, 1b, and 2 a-2 e, the apparatus may further include a storage module 13. In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratch pad cache. The eight-bit shaping tensor and the half-precision floating-point tensor can be stored in a memory, a cache and/or a register of the storage module as required, which is not limited in the present application.
In one possible implementation, the instruction format of the eight-bit shaping to half-precision floating-point instruction may be:
int82half(half *dst, int8 *src)
where int82half is the operation code of the eight-bit shaping to half-precision floating point instruction, and (half *dst, int8 *src) is the operation domain of the instruction. Here dst is the destination address, src is the source address, half indicates that the tensor stored at the destination address is a half-precision floating point tensor, and int8 indicates that the tensor stored at the source address is an eight-bit shaping tensor.
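Since every eight-bit shaping value is exactly representable in IEEE-754 binary16 (10 fraction bits, exponent bias 15), the conversion itself is exact. The following C sketch is a minimal software model of that semantics under this assumption; the helper names are invented here, and the hardware implementation may differ.

    #include <stdint.h>

    /* Build the binary16 bit pattern for an int8 value; exact for all of -128..127. */
    static uint16_t int8_to_half_bits(int8_t v) {
        if (v == 0)
            return 0x0000;                                   /* +0.0 */
        uint16_t sign = (v < 0) ? 0x8000 : 0x0000;
        uint16_t mag  = (uint16_t)(v < 0 ? -(int16_t)v : v); /* 1 .. 128 */
        int exp = 0;
        while ((mag >> (exp + 1)) != 0)
            exp++;                                           /* exp = floor(log2(mag)), 0..7 */
        uint16_t frac = (uint16_t)((mag << (10 - exp)) & 0x03FF);
        return (uint16_t)(sign | (uint16_t)((exp + 15) << 10) | frac);
    }

    /* Vector form mirroring the (half *dst, int8 *src) operation domain; this
       basic format carries no element count, so n is supplied by the caller. */
    static void int82half(uint16_t *dst, const int8_t *src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = int8_to_half_bits(src[i]);
    }

For example, int8_to_half_bits(1) yields 0x3C00 (1.0) and int8_to_half_bits(-128) yields 0xD800 (-128.0).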
In one possible implementation, the instruction format of the eight-bit shaping to half-precision floating point instruction may be:
int82half(half *dst, int8 *src, int32 NumOfEle, int32 dstStride, int32 srcStride, int32 NumOfSection)
Wherein NumOfEle is the number of elements, dstStride is the storage step size, srcStride is the extraction step size, NumOfSection is the number of conversions, and int32 indicates that the numerical type of each of these parameters is a 32-bit integer. NumOfEle, dstStride, srcStride and NumOfSection are optional parameters.
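One plausible reading of these parameters (an assumption, since the format description above does not pin down the exact addressing rule) is that NumOfSection sections are converted, each section reading NumOfEle consecutive elements, with consecutive sections spaced srcStride source elements and dstStride destination elements apart. A C sketch under that reading, reusing int8_to_half_bits from the sketch above:

    /* Strided, sectioned conversion: the element count, both strides (in
       elements), and the section count come from the operation domain. */
    static void int82half_strided(uint16_t *dst, const int8_t *src,
                                  int32_t numOfEle, int32_t dstStride,
                                  int32_t srcStride, int32_t numOfSection) {
        for (int32_t s = 0; s < numOfSection; s++) {
            const int8_t *sp = src + (int64_t)s * srcStride;
            uint16_t     *dp = dst + (int64_t)s * dstStride;
            for (int32_t i = 0; i < numOfEle; i++)
                dp[i] = int8_to_half_bits(sp[i]);
        }
    }

This reading also matches the later statement that the product of the number of elements and the number of conversions equals the total number of elements, when the sections tile the tensor exactly.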
It should be understood that the positions of the operation code and the operation domain in the instruction format of the eight-bit shaping to half-precision floating point instruction may be set by those skilled in the art as needed, and the application is not limited in this respect.
In one possible implementation manner, the apparatus may be disposed in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and an embedded Neural Network Processor (NPU).
It should be noted that, although the above embodiment has described the eight-bit shaping to half-precision floating-point instruction processing apparatus, it should be understood by those skilled in the art that the present application is not limited thereto. In fact, the user can flexibly set each module according to personal preference and/or actual application scene, as long as the technical scheme of the application is met.
Application example
An application example according to an embodiment of the present application is given below in conjunction with "data type conversion operation using an eight-bit shaping to half-precision floating point instruction processing apparatus" as an exemplary application scenario, to facilitate understanding of the flow of the eight-bit shaping to half-precision floating point instruction processing apparatus. It should be understood by those skilled in the art that the following application example is only for the purpose of facilitating understanding of the embodiments of the present application and should not be construed as limiting the embodiments of the present application.
FIG. 3 is a diagram illustrating an application scenario of an eight-bit shaping to half-precision floating point instruction processing apparatus according to an embodiment of the present application. As shown in fig. 3, the procedure of processing the eight-bit shaping to half-precision floating point instruction by the eight-bit shaping to half-precision floating point instruction processing apparatus is as follows:
in a possible implementation manner, the control module parses the obtained eight-bit shaping to half-precision floating point instruction to obtain the source address (source in the figure), the destination address (destination in the figure), the element number (NumOfEle in the figure), the conversion count (not shown in the figure), the extraction step size (srcStride in the figure), and the storage step size (dstStride in the figure) in the operation domain of the eight-bit shaping to half-precision floating point instruction. Stored at the source address is an eight-bit shaping tensor (int8 in the figure), and stored at the destination address is a half-precision floating point tensor (half in the figure). The execution module extracts the eight-bit shaping tensor from the source address, determines the elements extracted each time according to the element number and the extraction step size, converts the extracted elements into half-precision floating point elements, and stores them at the destination address according to the storage step size.
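For concreteness, a hedged driver for this flow, built on the sketches above and assumed to be compiled together with them; the operand values stand in for what the control module would parse and are made up for the example:

    #include <stdio.h>

    int main(void) {
        int8_t   src[64];
        uint16_t dst[64];
        for (int i = 0; i < 64; i++)
            src[i] = (int8_t)(i - 32);          /* sample int8 tensor */
        /* parsed operands: NumOfEle = 16, dstStride = 16,
           srcStride = 16, NumOfSection = 4, i.e. 64 elements in total */
        int82half_strided(dst, src, 16, 16, 16, 4);
        printf("dst[0] = 0x%04X (fp16 bits of %d)\n", (unsigned)dst[0], src[0]);
        return 0;
    }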
The working process of the above modules can refer to the above related description.
Therefore, the eight-bit shaping to half-precision floating point instruction processing device can efficiently and quickly process the eight-bit shaping to half-precision floating point instruction.
The application provides a machine learning arithmetic device, which may comprise one or more of the above eight-bit shaping to half-precision floating point instruction processing devices, and which is used for acquiring data to be operated on and control information from other processing devices and executing specified machine learning operations. The machine learning arithmetic device can obtain eight-bit shaping to half-precision floating point instructions from other machine learning arithmetic devices or non-machine-learning arithmetic devices, and transmit execution results to peripheral equipment (also called other processing devices) through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, wifi interfaces, and servers. When more than one eight-bit shaping to half-precision floating point instruction processing device is included, the instruction processing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale neural network operations. In this case, the devices may share the same control system or have separate control systems, and may share memory or have separate memories for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 4a shows a block diagram of a combined processing device according to an embodiment of the present application. As shown in fig. 4a, the combined processing device includes the machine learning arithmetic device, the universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.
The other processing devices include one or more types of general-purpose/special-purpose processors, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing data transfer and basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.
The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device acquires required input data from the other processing devices and writes it into a storage device on the machine learning arithmetic device; it can obtain control instructions from the other processing devices and write them into a control cache on the machine learning arithmetic device chip; and it can also read data from the storage module of the machine learning arithmetic device and transmit it to the other processing devices.
FIG. 4b shows a block diagram of a combined processing device according to an embodiment of the present application. In a possible implementation manner, as shown in fig. 4b, the combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing devices respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing devices, and is particularly suitable for data to be operated on that cannot be entirely held in the internal storage of the machine learning arithmetic device or the other processing devices.
The combined processing device can serve as an SOC (system on chip) for equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card, or wifi interface.
The application provides a machine learning chip, which comprises the machine learning arithmetic device or the combined processing device.
The application provides a machine learning chip packaging structure, and this machine learning chip packaging structure includes above-mentioned machine learning chip.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present application. As shown in fig. 5, the board card includes the above-mentioned machine learning chip package structure or the above-mentioned machine learning chip. In addition to the machine learning chip 389, the board card may include other supporting components, including but not limited to: a memory device 390, an interface device 391 and a control device 392.
The memory device 390 is connected to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure) via a bus, and is used for storing data. The memory device 390 may include multiple groups of memory cells 393. Each group of memory cells 393 is connected to the machine learning chip 389 via a bus. It can be understood that each group of memory cells 393 may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read out on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
In one embodiment, the memory device 390 may include 4 groups of memory cells 393, and each group of memory cells 393 may include a plurality of DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, where in each 72-bit DDR4 controller 64 bits are used for data transmission and 8 bits are used for ECC checking.
In one embodiment, each group of memory cells 393 includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389 for controlling the data transfer and data storage of each group of memory cells 393.
The interface device 391 is electrically connected to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface, and the data to be processed is transmitted to the machine learning chip 389 by the server through the standard PCIE interface to implement the data transfer. In another embodiment, the interface device 391 may also be another interface; the present application does not limit the specific form of the other interface, as long as the interface device can realize the transfer function. In addition, the calculation result of the machine learning chip is still transmitted back to the external device (e.g., the server) by the interface device.
The control device 392 is electrically connected to the machine learning chip 389 and is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a single-chip microcomputer (MCU). The machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads, so it can be in different working states such as heavy load and light load. The control device can regulate the working states of the processing chips, processing cores and/or processing circuits in the machine learning chip.
The application provides an electronic device, and the electronic device comprises the machine learning chip or the board card.
The electronic device may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an aircraft, a ship, and/or a vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
FIG. 6 shows a flow diagram of a method of processing an eight-bit shaping to half-precision floating point instruction according to an embodiment of the present application. As shown in fig. 6, the method is applied to the eight-bit shaping to half-precision floating point instruction processing apparatus, and includes steps S51 and S52.
In step S51, compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
In step S52, extracting the eight-bit shaping tensor from the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address.
In a possible implementation manner, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction further includes:
obtaining the number of elements in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor according to the element number, converting the elements to be converted into half-precision floating point elements, and storing them at the destination address.
In a possible implementation manner, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction further includes:
obtaining the conversion times in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor according to the element number, converting the elements to be converted into half-precision floating point elements and storing them at the destination address, the execution module repeatedly executing these steps according to the number of conversions, wherein the elements to be converted extracted each time do not overlap.
In a possible implementation manner, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction further includes:
obtaining an extraction step length in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor according to the element number and the extraction step size, converting the elements to be converted into half-precision floating point elements, and storing them at the destination address.
In a possible implementation manner, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction further includes:
obtaining a storage step length in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor according to the element number, converting the elements to be converted into half-precision floating point elements, and storing them at the destination address according to the storage step size.
In one possible implementation, the method is applied to an eight-bit shaping to half-precision floating point instruction processing apparatus, the eight-bit shaping to half-precision floating point instruction processing apparatus includes a control module and an execution module, the execution module includes a plurality of execution sub-modules, and the method further includes:
the control module determines a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and distributes each source sub-address and each destination sub-address to an execution sub-module;
and the target execution submodule extracts the eight-bit shaping tensor from the corresponding source subaddress, converts the eight-bit shaping tensor into a half-precision floating point tensor, and stores the half-precision floating point tensor at the corresponding target subaddress, the target execution submodule being an arbitrary one of the execution submodules.
In one possible implementation, the method is applied to an eight-bit shaping to half-precision floating point instruction processing apparatus, the eight-bit shaping to half-precision floating point instruction processing apparatus includes a control module and an execution module, the execution module includes a master execution submodule and a plurality of slave execution submodules, and the method further includes:
determining, by the master execution submodule, a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and distributing each source sub-address and each destination sub-address to a slave execution submodule;
and the target slave execution submodule extracts an eight-bit shaping tensor from the corresponding source subaddress, converts the eight-bit shaping tensor into a half-precision floating point tensor and stores the half-precision floating point tensor in the corresponding target subaddress, and the target slave execution submodule is any slave execution submodule.
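A minimal C sketch of this split, assuming a simple contiguous even-chunk policy (the policy and the names are illustrative; the text above does not specify how the sub-addresses are derived). It reuses int82half from the earlier sketch:

    /* The master carves the element range into contiguous chunks, one per
       slave; each (source sub-address, target sub-address) pair is modeled
       here as a (src + begin, dst + begin) offset pair. */
    static void master_distribute(uint16_t *dst, const int8_t *src,
                                  int total, int numSlaves) {
        int chunk = (total + numSlaves - 1) / numSlaves;  /* ceil(total / numSlaves) */
        for (int s = 0; s < numSlaves; s++) {
            int begin = s * chunk;
            if (begin >= total)
                break;
            int len = (begin + chunk <= total) ? chunk : total - begin;
            int82half(dst + begin, src + begin, len);     /* work of slave s */
        }
    }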
In one possible implementation, the method further includes:
storing the eight-bit reshaped tensor and/or the half-precision floating-point tensor.
In a possible implementation manner, the parsing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction further includes:
obtaining the point position in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor, converting the elements to be converted into half-precision floating point elements, quantizing them according to the point position, and storing the quantized elements at the destination address.
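One plausible reading of the point position (an assumption on our part) is that the int8 element is a fixed-point number whose real value is v * 2^point. Because v has at most 8 significant bits, the scaled value is still exactly representable in binary16 while the exponent stays in the normal range, so the sketch below simply shifts the biased exponent produced by int8_to_half_bits from the earlier sketch:

    /* Convert an int8 fixed-point element with the given point position to
       binary16 bits; assumes the shifted biased exponent stays in 1..30. */
    static uint16_t int8_to_half_at_point(int8_t v, int point) {
        uint16_t h = int8_to_half_bits(v);
        if ((h & 0x7FFF) == 0)
            return h;                           /* zero is unaffected */
        int e = ((h >> 10) & 0x1F) + point;     /* biased exponent += point */
        return (uint16_t)((h & 0x83FF) | (uint16_t)(e << 10));
    }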
In one possible implementation, the number of elements is the number of elements in any dimension of the tensor.
In one possible implementation, the product of the number of elements and the number of conversions is equal to the total number of elements in the tensor.
In a possible implementation manner, the number of bits of the extraction step and the storage step is a multiple of the number of bits of any dimension in the tensor, and the number of bits of the storage step is greater than the number of bits of the half-precision floating-point type element to be stored.
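Read as operand checks, these constraints might look like the following sketch (an assumption: all quantities are measured in bits, and an fp16 element occupies 16 bits):

    /* Returns nonzero when the stride constraints stated above hold. */
    static int strides_valid(int dimBits, int srcStrideBits, int dstStrideBits) {
        return (srcStrideBits % dimBits == 0) &&
               (dstStrideBits % dimBits == 0) &&
               (dstStrideBits > 16);            /* fp16 element is 16 bits */
    }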
In one possible implementation, the method further includes: and generating an assembly file according to the eight-bit shaping to half-precision floating point instruction, and translating the assembly file into a binary file, wherein the binary file is the compiled eight-bit shaping to half-precision floating point instruction.
In a possible implementation manner, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction includes:
storing an instruction, the stored instruction including the eight-bit shaping to half-precision floating point instruction;
analyzing the eight-bit shaping to half-precision floating point instruction to obtain an operation code and an operation domain of the eight-bit shaping to half-precision floating point instruction;
and storing an instruction queue, wherein the instruction queue comprises a plurality of instructions which are sequentially arranged according to an execution sequence, and the instructions comprise eight-bit shaping to half-precision floating point instructions.
In a possible implementation manner, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction further includes:
when the eight-bit shaping to half-precision floating point instruction in the plurality of instructions is determined to be associated with a zeroth instruction before the eight-bit shaping to half-precision floating point instruction, caching the eight-bit shaping to half-precision floating point instruction in the instruction storage submodule, after the zeroth instruction is executed, extracting the eight-bit shaping to half-precision floating point instruction from the instruction storage submodule and sending the eight-bit shaping to half-precision floating point instruction to the execution module,
wherein the associating relationship between the eight-bit shaping to half-precision floating point instruction and a zeroth instruction before the eight-bit shaping to half-precision floating point instruction comprises:
and a first storage address interval storing the data required by the eight-bit shaping to half-precision floating point instruction has an overlapping area with a zeroth storage address interval storing the data required by the zeroth instruction.
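The association test reduces to an interval-intersection check; a minimal model in C (the half-open interval convention is our choice, not the patent's):

    #include <stdint.h>

    /* Two instructions are associated when their storage address
       intervals [begin, end) intersect. */
    typedef struct { uintptr_t begin, end; } AddrInterval;

    static int intervals_overlap(AddrInterval a, AddrInterval b) {
        return a.begin < b.end && b.begin < a.end;
    }

When the intervals overlap, the later instruction is held in the instruction storage submodule until the earlier one completes, exactly as described above.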
It should be noted that, although the eight-bit shaping to half-precision floating-point instruction processing method is described above by taking the above embodiment as an example, those skilled in the art can understand that the present application should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or actual application scene, as long as the technical scheme of the application is met.
The method for processing the eight-bit shaping to half-precision floating point instruction provided by the embodiment of the application has the advantages of wide application range, high processing efficiency and high processing speed for the eight-bit shaping to half-precision floating point instruction.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed system and apparatus can be implemented in other ways. For example, the above-described embodiments of systems and apparatuses are merely illustrative, and for example, a division of a device, an apparatus, and a module is merely a logical division, and an actual implementation may have another division, for example, a plurality of modules may be combined or integrated into another system or apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices, apparatuses or modules, and may be an electrical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a form of hardware or a form of a software program module.
The integrated modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing may be better understood in light of the following clauses:
a1, an eight-bit shaping to half precision floating point instruction processing apparatus, the apparatus comprising:
the control module is used for compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
and the execution module is used for extracting the eight-bit shaping tensor from the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor and storing the half-precision floating point tensor in the destination address.
A2, the apparatus according to clause A1, the control module further configured to obtain a number of elements in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the execution module is further configured to extract an element to be converted in the eight-bit shaping tensor at the source address according to the element number, convert the element to be converted into a half-precision floating-point element, and store the element to be converted in the destination address.
A3, the apparatus according to clause A2,
the control module is further configured to obtain the number of times of conversion in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the execution module is further configured to extract, at the source address, the element to be converted in the eight-bit shaping tensor according to the element number, convert the element to be converted into a half-precision floating point element and store the converted element at the destination address, and to repeatedly execute the above steps according to the number of conversions, wherein the elements extracted each time do not overlap.
A4, the device according to clause A2, the control module is further configured to obtain an extraction step size in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the execution module is further configured to extract an element to be converted in the eight-bit shaping tensor at the source address according to the element number and the extraction step size, convert the element to be converted into a half-precision floating-point element, and store the element to be converted in the destination address.
A5, the apparatus according to any of clauses A2 to A4, the control module further configured to obtain a storage step size in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the execution module is further configured to extract an element to be converted in the eight-bit shaping tensor at the source address according to the element number, convert the element to be converted into a half-precision floating-point element, and store the element to be converted in the destination address according to the storage step length.
A6, the apparatus of clause A1, the execution module comprising a plurality of execution sub-modules,
the control module is further configured to compile the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, analyze the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, determine a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and allocate each source sub-address and each destination sub-address to an execution sub-module;
and the target execution submodule is used for extracting an eight-bit shaping tensor from the corresponding source subaddress, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor in the corresponding target subaddress, and the target execution submodule is any one of the execution submodules.
A7, the apparatus according to clause A1, the execution module comprising a master execution submodule and a plurality of slave execution submodules,
the control module is further configured to compile the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyze the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the main execution submodule is used for determining a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and distributing each source sub-address and each destination sub-address to a slave execution submodule;
the target slave execution submodule is used for extracting an eight-bit shaping tensor from the corresponding source subaddress, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor in the corresponding target subaddress, and the target slave execution submodule is any slave execution submodule.
A8, the apparatus of any of clauses A1 to A7, further comprising:
and the storage module is used for storing the eight-bit shaping tensor and/or the half-precision floating-point tensor.
A9, the apparatus according to clause A1, the control module further configured to obtain a point position in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the execution module is further configured to extract an element to be converted in the eight-bit shaping tensor at the source address, convert the element to be converted into a half-precision floating-point element, quantize according to the point position, and store the quantized element in the destination address.
A10, the apparatus according to clause A3,
the number of elements is the number of elements in any dimension of the tensor, and the product of the number of elements and the number of conversions is equal to the total number of elements in the tensor.
A11, according to the apparatus in clause A5, the number of bits of the extraction step and the storage step is a multiple of the number of bits of any dimension in the tensor, and the number of bits of the storage step is greater than the number of bits of the half-precision floating point type element to be stored.
A12, the apparatus according to clause A5, the control module is further configured to generate an assembly file according to the eight-bit shaping to half-precision floating point instruction, and translate the assembly file into a binary file, where the binary file is the compiled eight-bit shaping to half-precision floating point instruction.
A13, the apparatus of clause A1, the control module comprising:
the instruction storage submodule is used for storing instructions and comprises the eight-bit shaping to half-precision floating point instruction;
the instruction processing submodule is used for analyzing the eight-bit shaping to half-precision floating point instruction to obtain an operation code and an operation domain of the eight-bit shaping to half-precision floating point instruction;
and the queue storage submodule is used for storing an instruction queue, and the instruction queue comprises a plurality of instructions which are sequentially arranged according to an execution sequence, and comprises eight-bit shaping to half-precision floating point instructions.
A14, the apparatus of clause A11, the control module further comprising:
a dependency processing submodule, configured to cache the eight-bit shaping to half-precision floating point instruction in the instruction storage submodule when it is determined that an association relationship exists between an eight-bit shaping to half-precision floating point instruction in the plurality of instructions and a zeroth instruction before the eight-bit shaping to half-precision floating point instruction, and extract the eight-bit shaping to half-precision floating point instruction from the instruction storage submodule after the zeroth instruction is executed and send the eight-bit shaping to half-precision floating point instruction to the execution module,
wherein the associating relationship between the eight-bit shaping to half-precision floating point instruction and a zeroth instruction before the eight-bit shaping to half-precision floating point instruction comprises:
and a first storage address interval storing the data required by the eight-bit shaping to half-precision floating point instruction has an overlapping area with a zeroth storage address interval storing the data required by the zeroth instruction.
A15, a machine learning arithmetic device, the device comprising:
one or more eight-bit shaping to half-precision floating point instruction processing devices according to any one of clauses A1 to A14, configured to obtain data to be operated on and control information from other processing devices, perform a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of the eight-bit shaping to half-precision floating point instruction processing devices, the plurality of the eight-bit shaping to half-precision floating point instruction processing devices can be connected through a specific structure and transmit data;
the eight-bit shaping to half-precision floating point instruction processing devices are interconnected through a PCIE (Peripheral Component Interconnect Express) bus and transmit data so as to support larger-scale machine learning operations; a plurality of the eight-bit shaping to half-precision floating point instruction processing devices share the same control system or have their own respective control systems; the eight-bit shaping to half-precision floating point instruction processing devices share a memory or have their own memories; and the interconnection mode of the eight-bit shaping to half-precision floating point instruction processing devices is any interconnection topology.
A16, a combined processing device, comprising:
the machine learning arithmetic device of clause A15, a universal interconnection interface, and other processing devices;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user,
wherein the combination processing apparatus further comprises: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
A17, a machine learning chip, the machine learning chip includes:
the machine learning arithmetic device of clause A15 or the combined processing device of clause A16.
A18, an electronic device, the electronic device comprising:
the machine learning chip of clause A17.
A19, a board card, the board card comprising: a memory device, an interface device, a control device, and the machine learning chip of clause A17;
wherein the machine learning chip is connected with the storage device, the control device and the interface device respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the machine learning chip and external equipment;
and the control device is used for monitoring the state of the machine learning chip.
A20, an eight-bit shaping to half-precision floating point instruction processing method, which is applied to an eight-bit shaping to half-precision floating point instruction processing device, includes:
compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
and extracting the eight-bit shaping tensor from the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address.
A21, according to the method described in clause A20, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, further includes:
obtaining the number of elements in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor according to the element number, converting the elements to be converted into half-precision floating point elements, and storing them at the destination address.
A22, according to the method described in clause A21, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, further includes:
obtaining the number of conversions in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor according to the element number, converting the elements to be converted into half-precision floating point elements and storing them at the destination address, the execution module repeatedly executing these steps according to the number of conversions, wherein the elements to be converted extracted each time do not overlap.
A23, according to the method described in clause A21, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, further includes:
obtaining the extraction step size in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor according to the element number and the extraction step size, converting the elements to be converted into half-precision floating point elements, and storing them at the destination address.
A24, according to the method described in any one of clauses A21 to A23, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, further includes:
obtaining the storage step size in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor according to the element number, converting the elements to be converted into half-precision floating point elements, and storing them at the destination address according to the storage step size.
A25, the method according to clause A20, applied to an eight-bit shaping to half-precision floating point instruction processing apparatus comprising a control module and an execution module, the execution module comprising a plurality of execution submodules, the method further comprising:
the control module determines a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and distributes each source sub-address and each destination sub-address to an execution sub-module;
and the target execution submodule extracts the eight-bit shaping tensor from the corresponding source subaddress, converts the eight-bit shaping tensor into a half-precision floating point tensor, and stores the half-precision floating point tensor at the corresponding target subaddress, the target execution submodule being an arbitrary one of the execution submodules.
A26, the method according to clause A20, applied to an eight-bit shaping to half-precision floating point instruction processing apparatus comprising a control module and an execution module, the execution module comprising a master execution submodule and a plurality of slave execution submodules, the method further comprising:
determining, by the master execution submodule, a plurality of source sub-addresses and a plurality of destination sub-addresses according to the source address and the destination address, and distributing each source sub-address and each destination sub-address to a slave execution submodule;
and the target slave execution submodule extracts an eight-bit shaping tensor from the corresponding source subaddress, converts the eight-bit shaping tensor into a half-precision floating point tensor and stores the half-precision floating point tensor in the corresponding target subaddress, and the target slave execution submodule is any slave execution submodule.
A27, the method of any one of clauses A20 to A26, further comprising:
storing the eight-bit reshaped tensor and/or the half-precision floating-point tensor.
A28, according to the method described in clause A20, the parsing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction further includes:
obtaining the point position in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting the eight-bit shaping tensor at the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor at the destination address further comprises:
extracting, at the source address, the elements to be converted in the eight-bit shaping tensor, converting the elements to be converted into half-precision floating point elements, quantizing them according to the point position, and storing the quantized elements at the destination address.
A29, the method according to clause A22,
the number of elements is the number of elements in any dimension of the tensor, and the product of the number of elements and the number of conversions is equal to the total number of elements in the tensor.
A30, according to the method in clause A24, the number of bits of the extraction step and the storage step is a multiple of the number of bits of any dimension in the tensor, and the number of bits of the storage step is greater than the number of bits of the half-precision floating point type element to be stored.
A31, the method of clause A24, further comprising:
and generating an assembly file according to the eight-bit shaping to half-precision floating point instruction, and translating the assembly file into a binary file, wherein the binary file is the compiled eight-bit shaping to half-precision floating point instruction.
A32, according to the method described in clause A30, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, includes:
storing an instruction, the stored instruction including the eight-bit shaping to half-precision floating point instruction;
analyzing the eight-bit shaping to half-precision floating point instruction to obtain an operation code and an operation domain of the eight-bit shaping to half-precision floating point instruction;
and storing an instruction queue, wherein the instruction queue comprises a plurality of instructions which are sequentially arranged according to an execution sequence, and the instructions comprise eight-bit shaping to half-precision floating point instructions.
A33, according to the method described in clause A30, the compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction, further includes:
when the eight-bit shaping to half-precision floating point instruction in the plurality of instructions is determined to be associated with a zeroth instruction before the eight-bit shaping to half-precision floating point instruction, caching the eight-bit shaping to half-precision floating point instruction in the instruction storage submodule, after the zeroth instruction is executed, extracting the eight-bit shaping to half-precision floating point instruction from the instruction storage submodule and sending the eight-bit shaping to half-precision floating point instruction to the execution module,
wherein the associating relationship between the eight-bit shaping to half-precision floating point instruction and a zeroth instruction before the eight-bit shaping to half-precision floating point instruction comprises:
and a first storage address interval storing the data required by the eight-bit shaping to half-precision floating point instruction has an overlapping area with a zeroth storage address interval storing the data required by the zeroth instruction.
Having described embodiments of the present application, the foregoing description is intended to be exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or the technical improvement over technologies in the marketplace, and to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. An apparatus for processing an eight-bit shaping to half precision floating point instruction, the apparatus comprising:
the control module is used for compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
and the execution module is used for extracting the eight-bit shaping tensor from the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor and storing the half-precision floating point tensor in the destination address.
2. The apparatus of claim 1,
the control module is further configured to obtain the number of elements in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the execution module is further configured to extract, from the source address, an element to be converted in the eight-bit shaping tensor according to the number of elements, convert the element to be converted into a half-precision floating-point element, and store the half-precision floating-point element in the destination address.
3. The apparatus of claim 2,
the control module is further configured to obtain the number of times of conversion in the operation domain of the eight-bit shaping to half-precision floating point instruction;
the execution module is further configured to extract, from the source address, an element to be converted in the eight-bit shaping tensor according to the number of elements, convert the element to be converted into a half-precision floating-point element, and store the half-precision floating-point element in the destination address, and the execution module repeatedly executes the above steps according to the number of conversion times, wherein the elements to be converted extracted each time do not overlap.
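For concreteness, a bit-level sketch of the elementwise conversion in claims 1 to 3, producing IEEE 754 binary16 encodings; every int8 value is exactly representable in binary16, and the routine below is our illustration rather than the patent's circuit:

```c
#include <stdint.h>

/* Illustrative sketch: encode one int8 element as an IEEE 754 binary16
 * bit pattern (sign[15], exponent[14:10] biased by 15, fraction[9:0]).
 * Exact for all int8 inputs, since |q| <= 128 < 2^11. */
static uint16_t int8_to_binary16(int8_t q) {
    if (q == 0) return 0;                      /* +0.0 in binary16       */
    uint16_t sign = (q < 0) ? 0x8000u : 0;
    uint32_t mag  = (q < 0) ? (uint32_t)(-(int32_t)q) : (uint32_t)q;
    int msb = 0;                               /* index of leading 1 bit */
    while ((mag >> (msb + 1)) != 0) msb++;
    uint16_t exp  = (uint16_t)(msb + 15);      /* biased exponent        */
    uint16_t frac = (uint16_t)((mag << (10 - msb)) & 0x3FFu);
    return sign | (uint16_t)(exp << 10) | frac;
}
```

For example, int8_to_binary16(1) yields 0x3C00 (1.0) and int8_to_binary16(-128) yields 0xD800 (-128.0).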
4. A machine learning arithmetic device, the device comprising:
one or more eight-bit shaping to half-precision floating-point instruction processing devices according to any one of claims 1 to 3, configured to obtain data to be operated on and control information from other processing devices, perform a specified machine learning operation, and transmit the execution result to other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of the eight-bit shaping to half-precision floating point instruction processing devices, the plurality of the eight-bit shaping to half-precision floating point instruction processing devices can be connected through a specific structure and transmit data;
the eight-bit shaping to half-precision floating point instruction processing devices are interconnected through a Peripheral Component Interconnect Express (PCIe) bus and transmit data so as to support larger-scale machine learning operations; the plurality of eight-bit shaping to half-precision floating point instruction processing devices share the same control system or have their own respective control systems; the eight-bit shaping to half-precision floating point instruction processing devices share a memory or have their own respective memories; and the interconnection mode of the plurality of eight-bit shaping to half-precision floating point instruction processing devices is any interconnection topology.
5. A combined processing apparatus, characterized in that the combined processing apparatus comprises:
the machine learning arithmetic device of claim 4, a universal interconnect interface, and other processing devices;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user,
wherein the combination processing apparatus further comprises: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
6. A machine learning chip, the machine learning chip comprising:
the machine learning arithmetic device of claim 4 or the combined processing apparatus of claim 5.
7. An electronic device, characterized in that the electronic device comprises:
the machine learning chip of claim 6.
8. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the machine learning chip of claim 6;
wherein the machine learning chip is connected with the storage device, the control device and the interface device respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the machine learning chip and external equipment;
and the control device is used for monitoring the state of the machine learning chip.
9. An eight-bit shaping to half-precision floating point instruction processing method, applied to an eight-bit shaping to half-precision floating point instruction processing device, the method comprising the following steps:
compiling the obtained eight-bit shaping to half-precision floating point instruction to obtain a compiled eight-bit shaping to half-precision floating point instruction, and analyzing the compiled eight-bit shaping to half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping to half-precision floating point instruction;
and extracting an eight-bit shaping tensor from the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor in the destination address.
10. The method of claim 9, wherein compiling the obtained eight-bit shaping-to-half-precision floating point instruction to obtain a compiled eight-bit shaping-to-half-precision floating point instruction, and parsing the compiled eight-bit shaping-to-half-precision floating point instruction to obtain a source address and a destination address in an operation domain of the eight-bit shaping-to-half-precision floating point instruction, further comprises:
obtaining the number of elements in an operation domain of the eight-bit shaping to half-precision floating point instruction;
the extracting an eight-bit shaping tensor from the source address, converting the eight-bit shaping tensor into a half-precision floating point tensor, and storing the half-precision floating point tensor in the destination address further comprises:
and extracting elements to be converted in the eight-bit shaping tensor from the source address according to the number of elements, converting the elements to be converted into half-precision floating-point elements, and storing the half-precision floating-point elements in the destination address.
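Tying claims 9 and 10 together, a minimal end-to-end sketch in which an invented operation-domain struct carries the source address, destination address, and number of elements, and execution walks the tensor once (float again stands in for half precision):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative names only; the patent does not define this struct. */
typedef struct {
    const int8_t *src;  /* source address of the eight-bit tensor       */
    float        *dst;  /* destination address                          */
    uint32_t      n;    /* number of elements from the operation domain */
} op_domain;

static void execute_i8_to_half(const op_domain *op) {
    for (uint32_t i = 0; i < op->n; i++)
        op->dst[i] = (float)op->src[i];  /* convert and store each element */
}

int main(void) {
    int8_t in[3] = {-5, 0, 42};
    float out[3];
    op_domain op = { in, out, 3 };
    execute_i8_to_half(&op);
    for (int i = 0; i < 3; i++)
        printf("%g\n", out[i]);
    return 0;
}
```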
CN201910743220.6A 2019-08-13 2019-08-13 Eight-bit shaping to half-precision floating point instruction processing device and method and related product Pending CN112394997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910743220.6A CN112394997A (en) 2019-08-13 2019-08-13 Eight-bit shaping to half-precision floating point instruction processing device and method and related product


Publications (1)

Publication Number Publication Date
CN112394997A (en) 2021-02-23

Family ID

74602475

Country Status (1)

Country Link
CN (1) CN112394997A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination