CN116882475A - Training method and device applied to neural network and related products - Google Patents

Training method and device applied to neural network and related products

Info

Publication number
CN116882475A
CN116882475A CN202310947078.3A
Authority
CN
China
Prior art keywords
bit
data
format
circuit
bits
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310947078.3A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202310947078.3A priority Critical patent/CN116882475A/en
Publication of CN116882475A publication Critical patent/CN116882475A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Training method and device applied to a neural network, and related products. The invention relates to a board card comprising a memory device, an interface device, a control device, and an artificial intelligence chip, wherein the artificial intelligence chip is connected to the memory device, the control device, and the interface device respectively; the memory device is used for storing data; the interface device is used for transmitting data between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip. The board card may be used to perform artificial intelligence operations.

Description

Training method and device applied to neural network and related products
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a training method and device applied to a neural network and related products.
Background
With the continuous development of neural network technology, its fields of application have become increasingly broad, and it has been applied successfully to image recognition, speech recognition, natural language processing, and other fields. Before a neural network can be used for inference in these fields, its structural parameters must first be obtained through training. Training a neural network means feeding a sufficient number of samples into the network and adjusting its structure (mainly its weights) through a certain algorithm until the output of the network is consistent with the expected values; the trained structural parameters are then applied in the inference operations of the neural network, that is, in its actual use.
A neural network includes different computation layers, such as convolution layers and fully connected layers, each of which uses a large amount of data to perform different complex algorithms. Training is often carried out with the single-precision floating-point format FP32. Because of the complexity of neural network algorithms and the long bit width of FP32, training consumes excessive storage space, the operation process suffers high latency, and the hardware carrier operates inefficiently. How to improve training efficiency, reduce storage consumption, and raise the execution efficiency of the hardware carrier while guaranteeing training precision is an urgent problem in the technical field of neural networks.
Disclosure of Invention
In view of the above, the present invention provides a training method and device applied to a neural network and related products.
According to an aspect of the present invention, there is provided a processor for performing neural network training comprising three stages of forward operation, reverse operation, and weight update, the processor comprising: a control circuit, used for receiving and parsing an instruction, and instructing the first operation circuit to complete the neural network training operation in the FP12 format according to the parsed instruction; a first operation circuit, used for completing the forward operation, reverse operation, and weight update in the neural network training by utilizing data in the FP12 format, the first operation circuit comprising a first exponent processing circuit and a first mantissa processing circuit, wherein the processing bit width of the first exponent processing circuit is at least 8 bits and the processing bit width of the first mantissa processing circuit is at least 3 bits; and a storage circuit, used for storing the weight update value obtained after the weight update and using it as the weight for the next forward operation.
According to another aspect of the invention, the FP12 format comprises any one of the following formats: the sign bit of the FP12 format is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 3 bits; the sign bit of the FP12 format is 1 bit, the exponent bits are 5 bits, and the mantissa bits are 6 bits; the sign bit of the FP12 format is 1 bit, the exponent bits are 4 bits, and the mantissa bits are 7 bits.
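To make the three candidate layouts concrete, the following sketch (in Python, with illustrative names that are not part of the invention) enumerates the bit allocation of each FP12 variant and checks that each totals 12 bits:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FloatLayout:
    """Bit allocation of a floating-point format (names are illustrative)."""
    name: str
    sign_bits: int
    exponent_bits: int
    mantissa_bits: int

    @property
    def total_bits(self) -> int:
        return self.sign_bits + self.exponent_bits + self.mantissa_bits

# The three FP12 layouts described above (abbreviated later as 183 / 156 / 147).
FP12_LAYOUTS = [
    FloatLayout("FP12-183", 1, 8, 3),
    FloatLayout("FP12-156", 1, 5, 6),
    FloatLayout("FP12-147", 1, 4, 7),
]

for fmt in FP12_LAYOUTS:
    assert fmt.total_bits == 12
    print(fmt.name, fmt.sign_bits, fmt.exponent_bits, fmt.mantissa_bits)
```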
According to another aspect of the invention, the processor further comprises: and the data format conversion circuit is used for converting the data to be operated in the neural network training into the data in the FP12 format.
According to another aspect of the present invention, there is provided the processor, wherein the data format conversion circuit is further configured to convert the result data of the reverse operation into high-precision data, the precision of the high-precision data being higher than that of the FP12 format data; the processor further includes: a second operation circuit, used for completing the weight update in the neural network training by utilizing the high-precision data, the second operation circuit comprising a second mantissa processing circuit, wherein the processing bit width of the second mantissa processing circuit is at least 8 bits.
According to another aspect of the invention, the first mantissa processing circuit and the second mantissa processing circuit in the processor are multiplexed with each other.
According to another aspect of the present invention, the data format conversion circuit is further configured to convert data to be operated on by a nonlinear layer in the neural network into high-precision data, where the precision of the high-precision data is higher than that of the FP12 format data; the first arithmetic circuit includes: a linear layer operation circuit for completing the forward operation and the backward operation of the linear layer in the neural network by using the data in the FP12 format; the second arithmetic circuit includes: and the nonlinear layer operation circuit is used for completing the forward operation and the backward operation of the nonlinear layer in the neural network by utilizing the high-precision data.
According to another aspect of the present invention, there is provided a processor, wherein the high-precision data includes data in BF16 or FP32 format, the sign bit of the BF16 format is 1 bit, the exponent bit is 8 bits, the mantissa bit is 7 bits, the sign bit of the FP32 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 23 bits.
According to another aspect of the present invention, there is provided a processor, further comprising: the mixed precision selecting circuit is used for receiving and analyzing a first mixed precision selecting instruction, and the first mixed precision selecting instruction is used for indicating to execute FP16 mixed precision training or FP12 mixed precision training; and the scaling factor setting circuit is used for instructing the data format conversion circuit to convert the data to be operated into the first FP12 format when the first mixed precision selection instruction instructs to execute the FP12 mixed precision training, and instructing the first operation circuit to set the scaling factor of the loss function in the reverse operation to be 1 when the reverse operation is completed by using the first FP12 format, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
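As a hedged illustration of why the scaling factor can be fixed to 1 when the first FP12 format is selected: its 8-bit exponent gives the same dynamic range as FP32 or BF16, so gradients are far less likely to underflow than in FP16, where loss scaling is conventionally applied. The helper below is an assumption-laden sketch, not part of the claimed circuits:

```python
def choose_loss_scale(exponent_bits: int, fp16_style_scale: float = 1024.0) -> float:
    """Pick a loss-scaling factor for mixed-precision training (illustrative only).

    With an 8-bit exponent (first FP12 format, BF16, FP32) the dynamic range
    already covers typical gradient magnitudes, so no scaling is applied.
    Narrower exponents (e.g. FP16's 5 bits) conventionally use a scale > 1
    so that small gradients remain representable during the reverse operation.
    """
    return 1.0 if exponent_bits >= 8 else fp16_style_scale

print(choose_loss_scale(8))  # first FP12 format (1-8-3): scaling factor 1
print(choose_loss_scale(5))  # FP16-style training: a larger scale such as 1024
```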
According to another aspect of the present invention, the processor is further configured to receive and parse a second mixed precision selection instruction, where the second mixed precision selection instruction is used to instruct the execution of FP8 mixed precision training or FP12 mixed precision training; the processor further includes: a scale factor setting circuit, used for, when the second mixed precision selection instruction instructs the execution of FP12 mixed precision training, setting to 1 the scale factor used when the data format conversion circuit quantizes the data to be operated into the first FP12 format, wherein the sign bit of the first FP12 format is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 3 bits.
According to another aspect of the invention, there is provided the processor, wherein the linear layer operation circuit comprises: a linear layer forward operation circuit, used for completing the forward operation of the linear layer in the neural network training by utilizing data in the first FP12 format; and a linear layer reverse operation circuit, used for completing the reverse operation of the linear layer in the neural network training by utilizing data in the second FP12 format, wherein the sign bit of the second FP12 format is 1 bit, the exponent bits are 5 bits, and the mantissa bits are 6 bits.
The invention also provides a processor applied to multi-machine multi-card training, wherein each card in the multi-machine multi-card training comprises at least one processor as described in any one of the above, and each processor uses its own scaling factor to complete the multi-machine multi-card training.
According to another aspect of the present invention, there is provided a machine-readable medium having stored thereon an application program interface (API) executable by one or more processors, the API causing the one or more processors to perform neural network training comprising three stages of forward operation, reverse operation, and weight update, the API causing the one or more processors to perform the following:
receiving and parsing an instruction, and instructing a first operation circuit in the processor to complete the neural network training operation in the FP12 format according to the parsed instruction;
completing the forward operation, reverse operation, and weight update in the neural network training on the first operation circuit by utilizing data in the FP12 format, wherein the first operation circuit comprises a first exponent processing circuit and a first mantissa processing circuit, the processing bit width of the first exponent processing circuit is at least 8 bits, and the processing bit width of the first mantissa processing circuit is at least 3 bits;
and storing the weight update value obtained after the weight update, and using it as the weight for the next forward operation.
According to another aspect of the invention, the FP12 format comprises any one of the following formats:
the sign bit of the FP12 format is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 3 bits;
the sign bit of the FP12 format is 1 bit, the exponent bits are 5 bits, and the mantissa bits are 6 bits;
the sign bit of the FP12 format is 1 bit, the exponent bits are 4 bits, and the mantissa bits are 7 bits.
According to another aspect of the invention, the API further causes the one or more processors to:
and converting the data to be operated in the neural network training into data in an FP12 format by using a data format conversion circuit in the processor.
According to another aspect of the invention, the API further causes the one or more processors to:
converting the result data of the reverse operation into high-precision data by using the data format conversion circuit, wherein the precision of the high-precision data is higher than that of the FP12 format data;
and completing the weight update in the neural network training on a second operation circuit in the processor by using the high-precision data, wherein the second operation circuit comprises a second mantissa processing circuit, and the processing bit width of the second mantissa processing circuit is at least 8 bits.
According to another aspect of the invention, the first mantissa processing circuit and the second mantissa processing circuit in the processor are multiplexed with each other.
According to another aspect of the invention, the API further causes the one or more processors to:
converting data to be operated of a nonlinear layer in the neural network into high-precision data by utilizing the data format conversion circuit, wherein the precision of the high-precision data is higher than that of the FP12 format data;
based on a linear layer operation circuit in the first operation circuit, completing the forward operation and the backward operation of a linear layer in the neural network by utilizing the data in the FP12 format;
and based on a nonlinear layer operation circuit in the second operation circuit, completing the forward operation and the backward operation of the nonlinear layer in the neural network by using the high-precision data.
According to another aspect of the present invention, the high-precision data includes data in BF16 or FP32 format, the sign bit of the BF16 format is 1 bit, the exponent bit is 8 bits, the mantissa bit is 7 bits, the sign bit of the FP32 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 23 bits.
According to another aspect of the invention, the API further causes the one or more processors to:
Receiving and analyzing a first mixed precision selection instruction by utilizing a mixed precision selection circuit in the processor, wherein the first mixed precision selection instruction is used for indicating to execute FP16 mixed precision training or FP12 mixed precision training;
when the first mixed precision selection instruction indicates that FP12 mixed precision training is to be executed, instructing, by a scaling factor setting circuit in the processor, the data format conversion circuit to convert the data to be operated into the first FP12 format, and instructing the first operation circuit to set the scaling factor of the loss function in the reverse operation to 1 when the reverse operation is completed using the first FP12 format, wherein the sign bit of the first FP12 format is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 3 bits.
According to another aspect of the invention, the API further causes the one or more processors to:
receiving and analyzing a second mixed precision selection instruction by utilizing the mixed precision selection circuit, wherein the second mixed precision selection instruction is used for indicating to execute FP8 mixed precision training or FP12 mixed precision training;
when the second mixed precision selection instruction indicates that FP12 mixed precision training is to be executed, using a scale factor setting circuit in the processor to set to 1 the scale factor for quantizing the data to be operated into the first FP12 format, wherein the sign bit of the first FP12 format is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 3 bits.
According to another aspect of the invention, the API further causes the one or more processors to:
based on a linear layer forward operation circuit in the processor, completing forward operation of a linear layer in the neural network training by utilizing the data in the first FP12 format;
based on a linear layer reverse operation circuit in the processor, the reverse operation of the linear layer in the neural network training is completed by utilizing data in a second FP12 format, wherein sign bits in the second FP12 format are 1 bit, exponent bits are 5 bits, and mantissa bits are 6 bits.
According to another aspect of the present invention, each card in the multi-machine multi-card training includes at least one machine-readable medium as described in any one of the above, having stored thereon an application program interface (API) executable by one or more processors, the API causing the processors to perform the multi-machine multi-card training using their own scaling factors.
According to another aspect of the present invention, there is also provided an artificial intelligence chip comprising a processor as claimed in any one of the above.
According to another aspect of the invention, there is also provided an electronic device comprising the artificial intelligence chip described above.
According to another aspect of the present invention, there is also provided a system comprising: a memory; and one or more processors, wherein the memory stores an application program interface (API) as described in any one of the above.
According to another aspect of the invention, there is also provided a method responsive to an application program interface (API) as described in any one of the above.
Compared with data formats such as FP32, BF16, and FP8, the FP12 format used in the training method applied to the neural network provided by the embodiments of the invention balances data representation range and expression precision. Using FP12 to train the neural network reduces, on the premise of ensuring training precision, the storage space occupied at every level during training, lowers the latency of the operation process, and improves the operation efficiency of the processor. The operation circuits that process the FP12 format can directly reuse existing operation circuits for data formats such as FP32 and BF16, which reduces the design and manufacturing cost of the hardware carrier for neural network training and improves its computing efficiency.
Other features and aspects of the present invention will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 shows a block diagram of a board according to an embodiment of the invention;
FIG. 2 shows a block diagram of a combination processing device in the chip 101 of FIG. 1;
FIG. 3 shows a block diagram of a processor for performing neural network training, according to an embodiment of the invention;
FIG. 4 shows a schematic diagram of the FP12 data format provided by an embodiment of the present invention;
FIG. 5 shows a block diagram of a processor for performing neural network training, according to another embodiment of the invention;
FIG. 6 shows a block diagram of a processor for performing neural network training, according to another embodiment of the invention;
FIG. 7 shows a schematic diagram of the location of an application program interface API according to another embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, specification and drawings of the present invention are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present invention are taken to specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification and claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following description in order to provide a better illustration of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
In an embodiment of the present invention, Fig. 1 is a block diagram illustrating a board according to an embodiment of the present invention; the processor or machine-readable medium for performing neural network training in this embodiment may be located in the board, and in particular in the chip. As shown in Fig. 1, the board 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet intelligent processing requirements in complex fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in the cloud intelligence field; one notable characteristic of cloud intelligence applications is the large volume of input data, which places high requirements on the storage capacity and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may take different interface forms, such as a PCIe interface, according to the application scenario.
The board 10 also includes a memory device 104 for storing data, which includes one or more memory units 105. The memory device 104 is connected to the control device 106 and the chip 101 via a bus and transmits data. The control device 106 in the board 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may comprise a microcontroller unit (MCU).
Fig. 2 shows a block diagram of the combination processing apparatus in the chip 101 of fig. 1. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations, primarily implemented as a single-core smart processor or as a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively accomplish the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it into an on-chip storage device of the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into an on-chip control cache of the computing device 201. Alternatively or additionally, the interface device 202 may read data from an on-chip storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203 is a general-purpose processing device that performs basic control, including but not limited to data handling and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure. The floating-point number calculation apparatus provided by the invention may be arranged in the processing device 203.
The DRAM 204 is used to store data to be processed; it is a DDR memory, typically 16 GB or larger in size, and stores data for the computing device 201 and/or the processing device 203.
Fig. 3 shows a block diagram of a processor for performing neural network training according to an embodiment of the present invention, and as shown in fig. 3, the present invention provides a processor for performing neural network training including three stages of forward operation, reverse operation, and weight update, the processor including:
the control circuit 100 is configured to receive and parse the instruction, and instruct the first operation circuit 200 to complete the neural network training operation by using the FP12 format according to the parsed instruction;
the first operation circuit 200 is configured to complete forward operation, backward operation and weight update in the neural network training by using the FP12 format data, where the first operation circuit includes a first exponent processing circuit 210 and a first mantissa processing circuit 220, a processing bit width of the first exponent processing circuit is at least 8 bits, and a processing bit width of the first mantissa processing circuit is at least 3 bits;
the storage circuit 300 is used for storing the weight update value obtained after the weight update and using the weight update value as the weight of the next forward operation.
The neural network comprises at least one layer, each layer comprising a plurality of input nodes, a plurality of output nodes, and a plurality of connections between the input nodes and the output nodes. Each connection is assigned a numerical weight, and each weight characterizes how an input of a given node is related to the output of that node: the weight is multiplied by the input of the node to generate the output. Various optimization methods, such as stochastic gradient descent, may be used to adjust the weights and thereby alter the response of the neural network to a particular input.
In the forward operation stage of the neural network, data to be operated on (also referred to as input data or training examples), representing a pre-classified data set, is transmitted sequentially through the layers of the neural network. The data propagates through the input nodes and weighted connections of each layer, and the result data at the output nodes is obtained through a series of data operations. In the reverse operation stage, the error between the result data produced in the forward operation stage and the expected data is calculated with a loss function to obtain the result data of the reverse operation; common loss functions include the absolute-value loss function, the logarithmic loss function, the mean-squared-error loss function, and the exponential loss function. In the weight update stage, the weight data of the neural network is updated according to the result data of the reverse operation, that is, the error calculated by the loss function; the weight update value is obtained, stored, and used as the weight for the next forward operation. This process is repeated until the training result meets expectations, at which point the training of the neural network is complete. The trained neural network can then be used for inference operations (actual use), such as image classification and image recognition.
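The three training stages described above can be outlined in a short, framework-agnostic sketch (Python; the network object, its methods, and all names are placeholders assumed for illustration, not part of the invention):

```python
def train(network, dataset, loss_fn, learning_rate, num_epochs):
    """Illustrative outline of the forward / reverse / weight-update cycle."""
    for _ in range(num_epochs):
        for inputs, expected in dataset:
            outputs = network.forward(inputs)           # forward operation
            loss = loss_fn(outputs, expected)           # error against expected data
            gradients = network.backward(loss)          # reverse operation
            for name, weight in network.weights.items():
                # weight update: the new values are stored and reused
                # as the weights of the next forward operation
                network.weights[name] = weight - learning_rate * gradients[name]
    return network.weights
```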
In the training and inference process of the neural network, the amount of data to be operated on is large and the data types are varied. For the processor that serves as the hardware carrier of neural network training and inference, the longer the bit width of the data participating in the various operations, the slower the operation speed, so the supported data types directly affect the operation efficiency of the processor. Setting the processing bit width of the processor reasonably avoids idle and wasted hardware resources, guarantees processing efficiency, reduces the area of the processor or chip as much as possible, and extracts the greatest possible computing power from the processor or chip.
In an embodiment of the present invention, a processor for performing neural network training includes a controller 100, a first arithmetic circuit 200, and a storage circuit 300.
The control circuit 100 is configured to coordinate and control the operation of the first operation circuit 200 and the storage circuit 300 to complete the training and inference process of the neural network. The control circuit 100 may include an instruction fetch circuit 110 and an instruction decode circuit 120. The instruction fetch circuit 110 is configured to fetch an instruction of the instruction set architecture (ISA), and the instruction decode circuit 120 decodes the fetched instruction and sends the decoded result as control information to the first operation circuit 200 and the storage circuit 300.
The first arithmetic circuit 200 includes a first exponent processing circuit 210 and a first mantissa processing circuit 220. The first exponent processing circuit 210 processes operations of the exponent portion of the data to be operated on of the floating point type, and the first mantissa processing circuit 220 is used to process operations of the mantissa portion of the data to be operated on of the floating point type.
The storage circuit 300 is used for storing or carrying relevant data, including data to be operated, such as neurons and weight data, used in the neural network training and reasoning process, and also including intermediate results generated in the neural network training and reasoning process, including weight update values, etc. The storage circuit 300 in the embodiment of the present invention is configured to store the weight update value obtained after the weight update, and use the weight update value as the weight of the next forward operation. In each iterative operation of the neural network training, the obtained weight update value is stored in a storage circuit so as to finish the next iterative operation until the training of the neural network is finished.
In the training and inference of the neural network, the data to be operated on can be represented in two forms: fixed-point numbers and floating-point numbers.
the data format of the fixed point number is divided into a sign bit (sign) of 1 bit and a mantissa bit (mantissa) of a plurality of bits. The sign bit is used to determine the positive and negative values of the fixed point number. The mantissa bit is used to determine the value of the fixed point number, which refers to a number in which the position of the decimal point is fixed. The fixed point number is divided into a fixed point integer and a fixed point decimal point, and the decimal point is fixed, so that the decimal point is not required to be represented and the numerical value is calculated according to the appointed position. Fixed point numbers are typically expressed as pure decimal or pure integers. If the value is a pure decimal, the decimal point is preset between the sign bit and the highest position of the mantissa bit, and if the value is a pure integer, the decimal point is preset to the right of the lowest position of the mantissa bit.
The format for expressing floating-point numbers is specified in IEEE 754. Taking the 32-bit single-precision floating-point number (FP32) as an example, it is composed of a 1-bit sign bit, 8 exponent bits (exp), and 23 mantissa bits, which together represent the following value:
value = sign × mantissa × 2^(exp - 127)
The 8 exponent bits can represent values from 0 to 255; used directly, the exponent would become very large, so IEEE 754 specifies an exponent bias of 127, which shifts the representable exponent range to between -127 and 128, a more reasonable range. IEEE 754 further stipulates that a leading bit, normally 1, is implied to the left of the binary point, so the single-precision mantissa described above effectively has 24 bits.
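The relationship between the stored fields and the represented value can be checked with a simplified decoder for normal (non-denormal, non-special) FP32 values; this is only an illustration of the formula above, not part of the claimed circuits:

```python
import struct

def decode_fp32(x: float) -> float:
    """Re-derive a normal FP32 value from its sign / exponent / mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = (bits >> 31) & 0x1
    exp = (bits >> 23) & 0xFF      # stored exponent, biased by 127
    mantissa = bits & 0x7FFFFF     # 23 stored mantissa bits (implied leading 1)
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exp - 127)

print(decode_fp32(3.14159))        # matches the FP32-rounded original value
```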
Since this representation limits the range and precision of floating-point numbers, values can only be represented approximately, and rounding must be considered. In decimal, suppose two decimal places are to be kept, that is, the tenths digit and the hundredths digit. The hundredths digit is then the lowest retained digit, the first truncated digit (the thousandths digit) is the rounding digit, and all digits after the thousandths digit are called sticky digits, whose individual information is lost. In binary, if two places after the binary point are to be kept, the second bit to the right of the point is the retained bit, the third bit is the rounding bit, and all bits from the fourth onward are sticky bits. IEEE 754 therefore defines four rounding modes: round to nearest even, round toward zero, round down, and round up; the default is round to nearest even.
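A minimal sketch of round-to-nearest-even on a binary mantissa, using the retained / rounding / sticky bit view described above (the helper is an assumption for illustration; it does not handle the case where no bits are dropped):

```python
def round_mantissa_nearest_even(mantissa: int, keep_bits: int, total_bits: int) -> int:
    """Round an unsigned mantissa of `total_bits` down to `keep_bits`, ties to even."""
    drop = total_bits - keep_bits
    kept = mantissa >> drop
    round_bit = (mantissa >> (drop - 1)) & 1
    sticky = mantissa & ((1 << (drop - 1)) - 1)
    if round_bit and (sticky or (kept & 1)):   # round up above half, or on a tie with odd LSB
        kept += 1
    return kept

# 0b101101 kept to 3 bits: kept=0b101, rounding bit=1, sticky!=0 -> rounds up to 0b110
print(bin(round_mantissa_nearest_even(0b101101, 3, 6)))
```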
With the continuous development of neural network technology, more and more floating-point data formats have appeared, including at least FP32, TF32, BF16, FP16, UHP, and FP8. The sign bit of FP32 is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 23 bits. The sign bit of TF32 is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 10 bits. The sign bit of BF16 is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 7 bits. The sign bit of FP16 is 1 bit, the exponent bits are 5 bits, and the mantissa bits are 10 bits. UHP has no sign bit, 6 exponent bits, and 10 mantissa bits. FP8 has two formats: one with a 1-bit sign, 4 exponent bits, and 3 mantissa bits, and the other with a 1-bit sign, 5 exponent bits, and 2 mantissa bits.
The exponent bit width of a floating-point number determines its data representation range, and the mantissa bit width determines its representation precision. Floating-point data formats can therefore be roughly divided into two classes: the first class has longer mantissa bit widths, such as FP32, TF32, BF16, FP16, and UHP, and the second class has shorter mantissa bit widths, such as the two FP8 formats. Different types of floating-point numbers have different exponent and mantissa bit widths, different data representation ranges, and different representation precision, and can adapt to the complex requirements of various application scenarios, various software or hardware carriers in the machine learning field, or different stages of use.
The first class of floating-point numbers guarantees data representation range and precision, but requires larger resource overhead when used in a software or hardware operation carrier; in scenarios with low precision requirements, this wastes operation, storage, and transmission resources, reduces operation efficiency, and occupies a larger chip or processor area. On the premise of meeting the precision requirement, the second class of floating-point numbers can improve the operation efficiency of the software and hardware carrier and reduce chip area, but loses considerably in representation range and precision compared with the first class.
Fig. 4 shows a schematic diagram of an FP12 data format provided by an embodiment of the present invention, as shown in fig. 4, in an embodiment of the present invention, a new data format FP12 is proposed, and the data format of FP12 may be any one of the following three data formats:
1. Sign bit: 1 bit; exponent bits: 8; mantissa bits: 3 (abbreviated as the first FP12 format, or 183 format).
2. Sign bit: 1 bit; exponent bits: 5; mantissa bits: 6 (abbreviated as the second FP12 format, or 156 format).
3. Sign bit: 1 bit; exponent bits: 4; mantissa bits: 7 (abbreviated as the third FP12 format, or 147 format).
It will be appreciated that FP12 floating-point numbers in the 183 format have a mantissa bit width similar to that of the second class of floating-point numbers described above, but a greater data representation range. FP12 floating-point numbers in the 156 and 147 formats have exponent bit widths similar to those of the second class, but higher representation precision. Thus, beyond the two classes of floating-point numbers described above, the FP12 format provides upper-layer applications with more flexible data expression.
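A hedged comparison of the representable range and relative precision of the three FP12 layouts against the two FP8 formats, computed purely from the exponent and mantissa widths (the IEEE-like bias and reserved top exponent code are simplifying assumptions, not details taken from the patent):

```python
def format_stats(exponent_bits: int, mantissa_bits: int):
    """Approximate max normal magnitude and the gap just above 1.0 for a simple
    IEEE-like layout (bias = 2**(e-1) - 1; top exponent code reserved)."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exp = (2 ** exponent_bits - 2) - bias
    max_normal = (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exp
    epsilon = 2.0 ** -mantissa_bits
    return max_normal, epsilon

for name, e, m in [("FP12-183", 8, 3), ("FP12-156", 5, 6), ("FP12-147", 4, 7),
                   ("FP8-E4M3", 4, 3), ("FP8-E5M2", 5, 2)]:
    max_normal, eps = format_stats(e, m)
    print(f"{name}: max normal ~ {max_normal:.3g}, precision step near 1.0 = {eps}")
```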
In an embodiment of the present invention, the first operation circuit 200 may include a first mantissa processing circuit with a processing bit width of at least 3 bits and a first exponent processing circuit with a processing bit width of at least 8 bits. The mantissa processing circuit with a processing bit width of at least 3 bits handles operations on the 3 mantissa bits of the first FP12 format, including multiplication, addition, shift, comparison, and other logical or mathematical operations. Similarly, the first operation circuit 200 may include a first mantissa processing circuit with a processing bit width of 6 bits, 7 bits, or 8 bits to handle operations on the mantissa bits of the second or third FP12 format, and a mantissa processing circuit of 7 bits or more may be provided to process FP12 data in all three formats.
In the operation of neural network training and reasoning, vector multiplication and matrix multiply-accumulate occupy more than 90% of the operations. The circuit design overhead of multiplication operations may therefore be mainly considered in the operation circuit design of the processor. To support floating point operations, the arithmetic circuitry within the processor typically includes exponent processing circuitry and mantissa processing circuitry for handling various computations of the exponent and mantissa bits of the floating point number, including logical operations and mathematical operations. The exponent processing circuit and mantissa processing circuit may include at least one circuit structure that is more basic, such as an adder, multiplier, shifter, comparator, etc., and the present invention is not limited in this respect. To complete the multiplication, the exponent processing circuit is used to complete the addition of the exponents, and the mantissa processing circuit is used to complete the multiplication of the mantissas. Since the operation logic of the multiplication circuit is more complex than that of the addition circuit and the comparison circuit, both the operator and the data storage occupy a larger area and occupy a larger hardware cost. For multiplication circuits, the long-bit-width multiplication circuit occupies larger hardware cost than the short-bit-width multiplication circuit, for example, the hardware cost of the 10-bit-width multiplication circuit is far larger than that of the 3-bit-width multiplication circuit, and the larger hardware cost means that larger area is occupied or more energy consumption is used in operation, larger space is occupied in data storage, and larger bandwidth is occupied in data transmission.
In the embodiment of the present invention, the first operation circuit 200 uses FP12 format data to complete three stages of forward operation, backward operation and weight update in neural network training. For example, the overall process of neural network training may be accomplished using FP12 data in 183 format. Because of the above advantages of the FP12 format, the first arithmetic circuit may utilize a mantissa processing circuit having a processing bit width of 3 bits or more for completing the operation of the data to be operated in the FP12 format. Based on the advantages of the FP12 format in terms of the data expression range and the data accuracy, the processor including the first arithmetic circuit 200 can process data of a larger data expression range and has high arithmetic efficiency. The operation data in the FP12 format can be operated by using the mantissa processing circuit with at least 3 bit width provided by the embodiment of the invention, and mantissa bits with 3 bit width belong to shorter mantissa bit width, so that the data in the FP12 format occupies fewer storage resources, and the processing efficiency of the mantissa processing circuit and the processor is higher.
In an embodiment of the invention, a processor for performing neural network training includes a control circuit, a first operation circuit, and a storage circuit. The processor utilizes the data in the FP12 format provided by the invention to complete the training process of the neural network. Based on the characteristics of the data in the FP12 format, the memory space occupied by the data is smaller and the processing efficiency of the processor is higher in the neural network training process.
Fig. 5 is a block diagram of a processor for performing neural network training according to another embodiment of the present invention, and as shown in fig. 5, the processor in the embodiment of the present invention further includes: the data format conversion circuit 400 is configured to convert data to be operated in the neural network training into data in FP12 format.
In the embodiment of the present invention, the FP12 format data used by the processor when executing the neural network training may be received directly by the processor: for example, after the data to be operated on has been format-converted on the software side into FP12 data, that data may be sent to the processor, and the forward operation, the reverse operation, and the weight update stage of the neural network training are executed directly in the processor with the FP12 data.
Alternatively, in the embodiment of the invention, the data to be operated on of the neural network model may be sent to the processor by the software side and converted into FP12 format data by the processor itself, after which the forward operation, the reverse operation, and the weight update stage of the neural network training are executed.
In the embodiment of the present invention, the processor may include a data format conversion circuit 400, which may be used to pre-process input data and thereby improve the operation efficiency of the neural network processor. The preprocessing may include data format conversion, normalization, or proportional scaling. Data format conversion converts the data into floating-point or fixed-point numbers of any required format; normalization maps the input data into the interval [0,1] or [-1,1], or a smaller interval such as [0.1,0.9]; proportional scaling scales the input data by the same ratio according to the operation requirements. After the data format conversion circuit has converted, normalized, or proportionally scaled the input data, operation efficiency is improved. In the embodiment of the present invention, the data format conversion circuit 400 may be used to convert the data to be operated in the neural network training, including neuron data, weight data, and the like, from other formats into the FP12 format, for example into FP12 data in the 183 format, and transmit the converted FP12 data to the first operation circuit 200 for operation, thereby completing the neural network training. The invention does not limit the data format before conversion.
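As an illustration of the kind of conversion such a circuit performs, the sketch below quantizes an FP32 value to the 183 FP12 layout by rounding the mantissa to 3 bits with round-to-nearest-even. It is a simplified software analogue under stated assumptions (normal numbers only, no denormal, infinity, or NaN handling), not the circuit itself:

```python
import struct

def fp32_to_fp12_183(x: float) -> int:
    """Pack x into a 12-bit integer in the 1-8-3 FP12 layout (illustrative)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = (bits >> 31) & 0x1
    exp = (bits >> 23) & 0xFF
    mant = bits & 0x7FFFFF
    drop = 20                                   # 23 -> 3 mantissa bits
    kept = mant >> drop
    round_bit = (mant >> (drop - 1)) & 1
    sticky = mant & ((1 << (drop - 1)) - 1)
    if round_bit and (sticky or (kept & 1)):    # round to nearest even
        kept += 1
        if kept == 8:                           # mantissa overflow: bump the exponent
            kept = 0
            exp += 1
    return (sign << 11) | (exp << 3) | kept

def fp12_183_to_float(code: int) -> float:
    """Decode a 1-8-3 FP12 code back to a Python float (normal numbers only)."""
    sign = (code >> 11) & 0x1
    exp = (code >> 3) & 0xFF
    mant = code & 0x7
    return (-1) ** sign * (1 + mant / 8) * 2.0 ** (exp - 127)

print(fp12_183_to_float(fp32_to_fp12_183(3.14159)))   # ~3.25, the nearest 183-format value
```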
In the embodiment of the invention, the processor comprises a data format conversion circuit that can convert received data to be operated on from other formats into the FP12 format. Since the processor itself can complete the data format conversion, its range of application is broadened.
Fig. 6 is a block diagram of a processor for performing neural network training according to another embodiment of the present invention. In the processor provided by this embodiment, as shown in fig. 6, the data format conversion circuit 400 is further configured to convert the result data of the reverse operation into high-precision data, where the precision of the high-precision data is higher than that of the FP12 format data;
the processor further includes: and a second operation circuit 500 for updating the weight value in the neural network training by using the high-precision data, wherein the second operation circuit comprises a second mantissa processing circuit 520, and the processing bit width of the second mantissa processing circuit is at least 8 bits.
In neural network training, ensuring correct weight updates is an important foundation. When training is performed at lower precision, the weight update gradient obtained from the reverse operation may be so small that it underflows to 0; the product of the learning rate and the weight update gradient in the weight update stage may be so small that the update underflows to 0; or the precision of the low-precision data may be insufficient to represent the update relative to the weight, so that the weight update value becomes 0.
The embodiment of the invention provides a neural network mixed precision training method, which can use higher precision data to execute weight updating operation in the weight updating stage of the neural network training. For example, the weights of the layers of the neural network can be stored according to a data format with higher precision in the training process, when each layer of the neural network performs iterative operation, forward operation and reverse operation are performed according to the FP12 format, and in the weight updating stage, the weight updating value is obtained by calculating by using the data format with higher precision and stored in the storage circuit.
In the embodiment of the invention, data in the FP12 format is used for the forward and reverse operation stages of the neural network, while data with higher precision than FP12 is used for the weight update stage. For example, the high-precision data may include data in the BF16 or FP32 format, where the sign bit of BF16 is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 7 bits, and the sign bit of FP32 is 1 bit, the exponent bits are 8 bits, and the mantissa bits are 23 bits. Because higher-precision data is used in the weight update stage, the accuracy of the weight update is improved, the number of iterations or the total duration required for training is reduced, and training efficiency is improved.
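A hedged sketch of the master-weight pattern implied here: the forward and reverse passes consume low-precision data, while the update accumulates into a higher-precision copy so that small steps are not lost. NumPy float16/float32 stand in for FP12 and BF16/FP32, since no standard library implements FP12; all names are illustrative:

```python
import numpy as np

def mixed_precision_update(master_weights: np.ndarray,
                           low_precision_grad: np.ndarray,
                           learning_rate: float) -> np.ndarray:
    """Apply the weight update in high precision (stand-in for the second
    operation circuit); float32 plays the role of the BF16/FP32 master copy."""
    grad_hp = low_precision_grad.astype(np.float32)   # up-convert the gradient
    return master_weights - learning_rate * grad_hp   # accumulate in high precision

master = np.ones(4, dtype=np.float32)
grad = np.full(4, 1e-4, dtype=np.float16)             # float16 stands in for FP12 here
master = mixed_precision_update(master, grad, learning_rate=0.1)
print(master)   # the 1e-5 step survives because the accumulator is float32
```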
To process operations on higher precision data, the processor in the embodiment of the present invention further includes a second operation circuit 500, which includes a second exponent processing circuit 510 and a second mantissa processing circuit 520, where the second mantissa processing circuit has a processing bit width of at least 8 bits to support operations on BF16 or FP32 format data. For example, if mixed precision training is performed by using BF16 format data, a second mantissa processing circuit with a processing bit width of 8 bits may be provided in the processor, if mixed precision training is performed by using FP32 format data, a second mantissa processing circuit with a processing bit width of 24 bits may be provided, or after the second mantissa processing circuit with a processing bit width of 8 bits is provided, the second mantissa processing circuit is multiplexed three times to complete data operation in FP32 format. The invention is not limited in this regard.
The processor for executing neural network mixed precision training provided by the embodiment of the invention executes the forward and reverse operations with data in the FP12 format and completes the weight update stage with higher-precision data. It offers high processing efficiency, saves storage and computation resources, reduces the time required for neural network training, and improves the accuracy of the training result.
In the processor in the embodiment of the invention, the first mantissa processing circuit and the second mantissa processing circuit in the processor are multiplexed with each other.
In the design of a processor, the setting of a multiplication circuit directly affects various performance indexes such as the processing efficiency, the power consumption and the like of the processor. In the embodiment of the invention, two operation circuits are arranged, wherein the first operation circuit is used for processing the low-precision floating point number, the processing bit width of the mantissa processing circuit is shorter, the second operation circuit is used for processing the high-precision floating point number, and the processing bit width of the mantissa processing circuit is longer. The first operation circuit is used for processing the floating point number of the FP12 type, and the second operation circuit is used for processing the floating point number with higher precision than the FP12 type, including the floating point number of the types of FP32, TF32, BF16, FP16 and the like. The first operation circuit and the second operation circuit jointly complete FP12 mixed precision training.
In order to further save the area of the processor and improve the utilization rate of hardware, the first operation circuit and the second operation circuit can be multiplexed with each other. The mantissa processing circuit in the operation circuit can be set as a processing circuit for processing long-bit wide mantissas, and can be used as a plurality of processing circuits for processing short-bit wide mantissas to work simultaneously when processing floating-point numbers with short-bit wide mantissas, and the multiplexing of the circuits is realized at the same time. The mantissa processing circuit in the operation circuit may be set as a processing circuit for processing short-bit-width mantissas, and when the floating-point number of long-bit-width mantissas is processed, one long-bit-width mantissa is divided into a plurality of short-bit-width mantissas, which are respectively processed and integrated to obtain a processing result of the long-bit-width mantissa floating-point number. For example, the mantissa processing circuit in the arithmetic circuit may be set to process 8-bit mantissa bit width, and when FP12 data in 183 format is processed, two 183 data may be processed simultaneously. The mantissa processing circuit in the arithmetic circuit may be set to process 3-bit mantissa bit width, and when BF16 data is processed, the mantissa of BF16 data may be obtained after completion of three computations. The multiplexing of the exponent processing circuit in the operation circuit is the same as the multiplexing of the mantissa processing circuit, and will not be described again. The exponent processing circuit and the mantissa processing circuit in the operation circuit can be set to process long bit width or short bit width respectively according to design requirements, and can also be set to process long bit width or short bit width simultaneously. The invention is not limited in this regard.
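The multiplexing idea can be illustrated in software: the product of two long mantissas can be reassembled from partial products of their short-bit-width pieces, which is why one wide mantissa multiplier can be shared with narrow-format operations and vice versa. The decomposition below is a generic schoolbook split given as an assumption-based illustration, not the patent's circuit design:

```python
def split_multiply(a: int, b: int, chunk_bits: int = 3) -> int:
    """Multiply two unsigned mantissas using only chunk_bits x chunk_bits
    partial products, recombined with shifts and adds (illustrative)."""
    mask = (1 << chunk_bits) - 1
    a_chunks = [(a >> s) & mask for s in range(0, a.bit_length() or 1, chunk_bits)]
    b_chunks = [(b >> s) & mask for s in range(0, b.bit_length() or 1, chunk_bits)]
    total = 0
    for i, ac in enumerate(a_chunks):
        for j, bc in enumerate(b_chunks):
            total += (ac * bc) << (chunk_bits * (i + j))   # narrow partial products
    return total

a, b = 0b10110111, 0b11001010            # two 8-bit mantissas
assert split_multiply(a, b) == a * b     # matches one wide multiplication
print(split_multiply(a, b))
```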
When the chip or processor actually runs, the off-chip data transmission or read/write bandwidth, as well as the bandwidth between the storage media within the chip, easily becomes the bottleneck of the chip's computing power. Besides multiplexing the operation circuits, the embodiment of the invention also configures the corresponding data transmission or data read bandwidth according to the exponent bit width and mantissa bit width of the floating-point numbers to be supported by the operation circuits, and configures the corresponding data storage space to cooperate with the data operations of the operation circuits, thereby maximizing the chip's computing power.
In the embodiment of the invention, multiplexing mantissa processing circuits and exponent processing circuits of different bit widths reduces the design area of the processor or chip and saves hardware resources.
In the embodiment of the invention, the data format conversion circuit is further used for converting the data to be operated of the nonlinear layer in the neural network into high-precision data, and the precision of the high-precision data is higher than that of the FP12 format data;
the first arithmetic circuit includes: a linear layer operation circuit for completing the forward operation and the backward operation of the linear layer in the neural network by using the data in the FP12 format;
The second arithmetic circuit includes: and the nonlinear layer operation circuit is used for completing the forward operation and the backward operation of the nonlinear layer in the neural network by utilizing the high-precision data.
The layers in a neural network can be divided into linear layers and nonlinear layers; the linear layers perform linear operations, the nonlinear layers perform nonlinear operations, and together they complete the various complex operations in deep learning and obtain the expected results. For example, the linear layers include fully connected layers and the nonlinear layers include activation layers. According to the algorithmic characteristics of nonlinear and linear operations, in order to maintain the accuracy of the operation results, the nonlinear layers need to use higher operation precision than the linear layers.
In the embodiment of the invention, the second operation circuit is additionally provided, on the basis of the first operation circuit, to complete operations on data of higher precision than FP12. During training, the nonlinear layers in the neural network may use the second operation circuit for the forward operation and the backward operation, while the linear layers still use the first operation circuit. In addition, when the linear layers and nonlinear layers execute operations of different precision on the first and second operation circuits respectively, the weight update of the neural network training may be performed with precision higher than FP12, that is, completed by the second operation circuit, or performed in FP12, that is, completed by the first operation circuit. In this embodiment, the data of higher precision than FP12 include floating-point numbers such as FP32, TF32, BF16 and FP16, which the invention does not limit.
The data format conversion circuit in this embodiment is configured to perform, when performing operations on the linear layer and the nonlinear layer, conversion between the FP12 format and other precision formats according to the required precision of the data to be operated, so as to meet the calculation requirements of the linear layer and the nonlinear layer.
In this embodiment, according to the operation characteristics of the linear layer and the nonlinear layer in the neural network, the linear layer uses FP12 format data to perform operation, and the nonlinear layer uses higher precision data to perform operation, so as to complete the hybrid precision training of the neural network model. On the basis of improving the operation efficiency, the accuracy of the neural network training can be further improved.
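The following minimal NumPy sketch illustrates this split under stated assumptions: a toy quantizer keeps 3 stored mantissa bits while reusing float32's 8-bit exponent range (roughly the 1-8-3 FP12 layout), the linear layer (a matmul) consumes the quantized operands, and the nonlinear layer (softmax here) stays in float32. The function names and the choice of softmax are illustrative, not the patent's circuits.

```python
import numpy as np

def to_fp12_e8m3(x):
    """Toy quantizer: keep 3 stored mantissa bits, float32-like exponent range."""
    m, e = np.frexp(np.asarray(x, dtype=np.float32))  # x = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e)         # 4 significant bits = 1 implicit + 3 stored

def linear(x, w):
    # linear layer: operands rounded to the FP12-like grid before the matmul
    return to_fp12_e8m3(x) @ to_fp12_e8m3(w)

def softmax(z):
    # nonlinear layer: kept in full float32 precision
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 8).astype(np.float32)
y = softmax(linear(x, w))
```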
In an embodiment of the present invention, the linear layer operation circuit includes:
the linear layer forward operation circuit is used for completing forward operation of the linear layer in the neural network training by utilizing the data in the first FP12 format;
and the linear layer reverse operation circuit is used for finishing the reverse operation of the linear layer in the neural network training by utilizing the data in the second FP12 format, wherein the sign bit in the second FP12 format is 1 bit, the exponent bit is 5 bits, and the mantissa bit is 6 bits.
As described above, in the embodiment of the invention, on the basis that the nonlinear layers of the neural network perform the forward and backward operations with data of higher precision than FP12, the linear layer operation circuit includes a linear layer forward operation circuit and a linear layer backward operation circuit. Because the data precision of the second FP12 format is higher than that of the first FP12 format, in order to further improve the training precision during neural network mixed precision training, the operations of the linear layer can be completed by the linear layer forward operation circuit and the linear layer backward operation circuit in the forward and backward stages respectively. The linear layer forward operation circuit completes the forward operation of the linear layer in the neural network training with data in the first FP12 format, and the linear layer backward operation circuit completes the backward operation of the linear layer with data in the second FP12 format. The conversion between the first FP12 format, the second FP12 format and the other formats is done by the data format conversion circuit.
In the embodiment of the invention, after the forward operation and the backward operation of the linear layer are separated, the backward operation of the linear layer is finished by utilizing the data in the second FP12 format with higher precision, so that the accuracy of the neural network mixed precision training result can be further improved.
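A small numerical sketch of the range/precision trade-off between the two FP12 variants, under simplifying assumptions (round-to-nearest, subnormals flushed to zero, no inf/NaN handling); `round_to_format` and its behaviour are illustrative, not a bit-exact model of the patent's circuits.

```python
import numpy as np

def round_to_format(x, exp_bits, man_bits):
    """Round values to a toy (1 sign, exp_bits, man_bits) floating-point grid.

    Simplified model: round-to-nearest on the mantissa, clamp to the largest
    finite value, flush values below the smallest normal to zero.
    """
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    max_e = (2 ** exp_bits - 2) - bias              # largest normal exponent
    sign, mag = np.sign(x), np.abs(x)
    with np.errstate(divide="ignore"):
        e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    e = np.clip(e, 1 - bias, max_e)
    step = 2.0 ** (e - man_bits)                    # grid spacing at exponent e
    q = np.round(mag / step) * step
    q = np.minimum(q, (2 - 2.0 ** -man_bits) * 2.0 ** max_e)
    q = np.where(mag < 2.0 ** (1 - bias), 0.0, q)   # flush subnormals to zero
    return sign * q

g = 3.14159e-4                                       # a small gradient-like value
print(round_to_format(g, exp_bits=8, man_bits=3))    # first FP12 format (1-8-3): coarser
print(round_to_format(g, exp_bits=5, man_bits=6))    # second FP12 format (1-5-6): finer
```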
In an embodiment of the present invention, the processor further includes:
the mixed precision selecting circuit is used for receiving and analyzing a first mixed precision selecting instruction, and the first mixed precision selecting instruction is used for indicating to execute FP16 mixed precision training or FP12 mixed precision training;
and the scaling factor setting circuit is used for instructing the data format conversion circuit to convert the data to be operated into the first FP12 format when the first mixed precision selection instruction instructs to execute the FP12 mixed precision training, and instructing the first operation circuit to set the scaling factor of the loss function in the reverse operation to be 1 when the reverse operation is completed by using the first FP12 format, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
In the embodiment of the invention, it can be understood that, based on the first operation circuit and the second operation circuit that are mutually multiplexed or separately provided, the processor provided by the invention can execute different modes of neural network mixed precision training, such as FP12 mixed precision training and FP16 mixed precision training. The processor can determine which type of mixed precision training needs to be performed by receiving and parsing a mixed precision selection instruction. The instructions in the embodiments of the invention include ISA (Instruction Set Architecture) instructions; the ISA defines the basic functions available to the software above it and the functional targets to be implemented by the hardware below it, serves as the interface between software and hardware, and is an important part of the design of a neural network system.
The bit width of FP16 is half that of FP32; compared with training the neural network with FP32 data, the memory occupied by parameters such as weights during neural network operations is also halved, which saves a large amount of memory space and improves the operation efficiency of the processor. However, FP16 mixed precision training also has problems such as precision overflow and rounding error. Since the valid data representation range of FP16 is much smaller than that of FP32, overflow and underflow can occur when FP16 replaces high-precision data. In neural network training, the gradients of the weights need to be calculated, and because the gradients are much smaller than the weight values, underflow occurs, which can prevent the neural network from converging. To address underflow, FP16 mixed precision training introduces a scaling factor (loss scale). In the reverse operation, the loss value obtained by the forward calculation is amplified; that is, after the loss is multiplied by the scaling factor, small values that would otherwise underflow are shifted into the range that FP16 can represent, so that underflow is avoided.
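A minimal sketch of loss scaling on a least-squares problem, assuming plain NumPy and an analytic gradient; the function name and the scale value 1024 are assumptions of the example. Seeding the backward pass with loss_scale and dividing the gradient by the same factor before the update is the pattern described above; with the first FP12 format the text sets the scaling factor to 1, which reduces this to an ordinary step.

```python
import numpy as np

def train_step(w, x, y, lr=0.01, loss_scale=1024.0):
    """One loss-scaled SGD step on a least-squares loss (illustrative only).

    The backward pass is seeded with loss_scale instead of 1 so that small
    gradients survive a narrow format such as FP16; the gradient is divided
    by the same factor again before the weight update.
    """
    err = x @ w - y
    loss = 0.5 * np.mean(err ** 2)
    grad_scaled = (x.T @ err) / len(y) * loss_scale   # gradient of (loss_scale * loss)
    grad = grad_scaled / loss_scale                   # unscale before updating weights
    return w - lr * grad, loss

w = np.zeros(3)
x, y = np.random.randn(16, 3), np.random.randn(16)
w, loss = train_step(w, x, y)                  # FP16-style step with loss_scale = 1024
w, loss = train_step(w, x, y, loss_scale=1.0)  # FP12 (1-8-3) style: scaling is a no-op
```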
Because the valid data representation range of FP12 is large, especially when data in the first FP12 format are used (the sign bit is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits), underflow rarely occurs and NaN values are not easily produced during training, so FP12 mixed precision training does not require loss scaling. Therefore, when the first mixed precision selection instruction received by the mixed precision selection circuit instructs FP16 mixed precision training to be performed, the scaling factor setting circuit needs to set a corresponding loss scaling factor to match FP16 mixed precision training. When the first mixed precision selection instruction instructs FP12 mixed precision training to be performed, the scaling factor of the loss function in the reverse operation is set to 1.
In the embodiment of the invention, the processor can execute several different kinds of mixed precision training. The mixed precision selection circuit receives and parses a first mixed precision selection instruction that indicates whether FP16 mixed precision training or FP12 mixed precision training is to be performed; when the first mixed precision selection instruction instructs FP12 mixed precision training, the scaling factor setting circuit instructs the data format conversion circuit to convert the data to be operated into the first FP12 format and sets the scaling factor of the loss function in the reverse operation to 1. The FP12 data format provided by the invention reduces the amount of computation during mixed precision training, improves the computing efficiency of the processor, and improves the precision of the neural network mixed precision training.
The processor in the embodiment of the invention is further configured to receive and parse a second mixed precision selection instruction, where the second mixed precision selection instruction is used to instruct to perform FP8 mixed precision training or FP12 mixed precision training;
the processor further includes:
and the scale factor setting circuit is used for setting the scale factor to be 1 when the data format conversion circuit is instructed to quantize the data to be operated into the first FP12 format when the second mixed precision selection instruction instructs to execute FP12 mixed precision training, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
In the embodiment of the invention, it can be understood that, based on the first operation circuit and the second operation circuit that are mutually multiplexed or separately provided, the processor provided by the invention can also execute other modes of neural network mixed precision training, such as FP8 mixed precision training. The processor determines which mixed precision training to perform by receiving and parsing a mixed precision selection instruction. For example, the second mixed precision selection instruction may carry an identifier indicating whether FP8 mixed precision training or FP12 mixed precision training is to be performed. By parsing the identifier bits in the instruction, the processor either converts the data to be operated into FP12 or higher-precision data, or converts the data to be operated into FP8 or higher-precision data, and then performs the corresponding mixed precision training of the neural network. In the different modes of mixed precision training, the data format conversion circuit performs quantization of the data when converting the data to be operated.
FP8 has two formats: E4M3, with 1 sign bit, 4 exponent bits and 3 mantissa bits; and E5M2, with 1 sign bit, 5 exponent bits and 2 mantissa bits. For some upper-layer applications, the representation range of FP8 cannot cover the data, so in order to reduce the quantization error of FP8 data, a scale factor needs to be set when the data to be operated are quantized into FP8; in FP8 mixed precision training of a neural network, all vectors involved in the FP8 forward and reverse operations need to be quantized into FP8 data with scale factors, which is not described in detail here. One method of calculating the scale factor is to use the maximum value of the data over the past N iterations, or the maximum value of the data in the most recent iteration. Calculating the scale factor therefore involves a relatively complex algorithm and a large amount of computation, and additional storage space is required to store the historical data information.
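The following sketch shows, under stated assumptions, the "maximum over the past N iterations" style of scale-factor bookkeeping that FP8 quantization typically needs; the class name, history length and the E4M3 limit of 448 are assumptions of the example, not values taken from the patent. An FP12 1-8-3 path, by contrast, can simply fix the scale factor at 1.

```python
from collections import deque
import numpy as np

FP8_E4M3_MAX = 448.0   # commonly used largest finite E4M3 magnitude (an assumption here)

class DelayedScale:
    """Keep the amax of the past N iterations and derive an FP8 scale from it."""
    def __init__(self, history_len=16):
        self.amax_history = deque(maxlen=history_len)

    def update(self, tensor):
        self.amax_history.append(float(np.max(np.abs(tensor))))
        amax = max(self.amax_history)               # maximum over the stored iterations
        return FP8_E4M3_MAX / amax if amax > 0 else 1.0

scaler = DelayedScale()
for _ in range(4):
    scale = scaler.update(np.random.randn(256) * 1e-3)
# FP8 path: quantize with x * scale, dequantize with q / scale, and keep the history.
# FP12 1-8-3 path as described in the text: the scale is simply fixed to 1, no history needed.
```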
Because the valid data representation range of the FP12 provided by the invention is very large, especially when data in the first FP12 format are used (the sign bit is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits), the problems that arise when data are quantized into the FP8 format can be avoided, so FP12 mixed precision training can set the scale factor to 1. When the second mixed precision selection instruction received by the mixed precision selection circuit indicates that FP8 mixed precision training is to be performed, the scale factor setting circuit needs to set a corresponding scale factor to match the FP8 quantization. When the instruction indicates that FP12 mixed precision training is to be performed, the scale factor can be set to 1.
In the embodiment of the invention, the processor can execute several different kinds of mixed precision training. The mixed precision selection circuit receives and parses a second mixed precision selection instruction that indicates whether FP8 mixed precision training or FP12 mixed precision training is to be performed; when the second mixed precision selection instruction instructs FP12 mixed precision training, the data format conversion circuit is instructed to quantize the data to be operated into the first FP12 format and the scale factor is set to 1. The FP12 data format provided by the invention reduces the amount of computation during mixed precision training, improves the computing efficiency of the processor, saves data storage space, and improves the precision of the neural network mixed precision training.
The embodiment of the invention also provides a processor applied to multi-machine multi-card training, wherein each card in the multi-machine multi-card training comprises at least one such processor, and each processor uses its own scale factor to complete the multi-machine multi-card training.
With the development of neural network technology, neural network models are becoming larger and larger and the amount of data to be computed is extremely large. Training tasks can be split and distributed to multiple computing nodes according to a certain method for computation, and the information that needs to be summarized is aggregated according to a certain method, thereby accelerating model training. Common splitting methods include data splitting and model splitting; distributing the split data or models over multiple processors for parallel processing can greatly improve training speed, for example in the single-machine multi-card and multi-machine multi-card training modes used in practical engineering. A computing node (a node may be referred to as a machine) represents a physical node, which may be a computer, a server or another computing device, and each computing node contains several processors (referred to as cards). Multi-machine multi-card training modes include the DDP (distributed data parallel) mode, in which each card corresponds to a separate copy of the neural network model (i.e., a process). Training in the multi-machine multi-card mode may include data splitting, that is, dividing the data of one batch among different machines or cards; splitting or copying the neural network model (a process) to run in parallel on multiple computing nodes; performing distributed training after grouping the processes (group); and so on. Whatever the form of parallelism, multi-machine multi-card training must solve the problem of data consistency, so frequent data transmission among the machines and cards seriously affects the efficiency of neural network training.
In the embodiment of the invention, based on the characteristics of the proposed FP12 format, the processor can set the scale factor used when quantizing data into FP12 to 1 during FP12 mixed precision training, and different processors (cards) within a group or across computing nodes can each use their local scale factor. This avoids the synchronous data transmission generated by scale-factor-related operations, saves a large amount of data storage space, and improves the efficiency of multi-machine multi-card training.
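A conceptual sketch of the difference this makes for multi-card training, with hypothetical function names: an FP8-style recipe must synchronize per-card amax statistics before any card can pick a scale, whereas the FP12 recipe described here lets every card use a local scale factor of 1 with no exchange at all.

```python
FP8_MAX = 448.0  # assumed E4M3-style limit, as in the previous sketch

def fp8_style_scale(local_amax, all_reduce_max):
    # FP8-style recipe: all cards must agree on one scale, so per-card amax values
    # have to be exchanged (e.g. an all-reduce max) before quantization -> extra
    # inter-card traffic plus storage for the scaling statistics.
    return FP8_MAX / all_reduce_max(local_amax)

def fp12_style_scale(_local_amax=None):
    # FP12 1-8-3 recipe described above: every card fixes its scale factor to 1,
    # so no scaling statistics need to be synchronized across machines or cards.
    return 1.0

# stand-in for a collective operation; a real multi-card run would use an all-reduce
ranks_amax = [0.02, 0.5, 0.013, 0.9]
print(fp8_style_scale(ranks_amax[0], lambda _: max(ranks_amax)))
print(fp12_style_scale())
```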
FIG. 7 illustrates a schematic diagram of the location of an application program interface (API) in accordance with another embodiment of the invention, in which a machine-readable medium stores an API for execution by one or more processors. The API causes the one or more processors to perform neural network training comprising three phases of forward operation, reverse operation and weight update, and causes the one or more processors to:
receiving and analyzing an instruction, and indicating a first operation circuit in the processor to complete neural network training operation by using an FP12 format according to the analyzed instruction;
Completing forward operation, reverse operation and weight updating in the neural network training on the first operation circuit by utilizing the data in the FP12 format, wherein the first operation circuit comprises a first exponent processing circuit and a first mantissa processing circuit, the processing bit width of the first exponent processing circuit is at least 8 bits, and the processing bit width of the first mantissa processing circuit is at least 3 bits;
and storing the weight updating value obtained after the weight updating, and using the weight updating value as the weight of the next forward operation.
API stands for Application Programming Interface. APIs may be distributed or otherwise provided as part of one or more libraries, runtimes, drivers, or any other set of software or executable code. An API may also be a set of software instructions provided by an upper-layer software framework, or an interface of a neural network model supported by an upper-layer software framework; that is, an API may be the interface of any layer of software running on the hardware carrier, which the invention does not limit. The APIs, if executed, may cause one or more processors to perform various operations. Based on a user-implemented software program, one or more APIs may be used to perform operations such as device management and mathematical computation. The user can implement various operations on the processor by calling the API to complete the upper-layer application.
The API interface may send instructions to the processor and instruct the processor and the various circuits in the processor to perform the operations set out in the API. In one embodiment of the invention, a machine-readable medium may be provided on which an API is stored; the API is executed by one or more processors and causes the one or more processors to perform neural network training comprising the three phases of forward operation, reverse operation and weight update.
In the embodiment of the present invention, the operations that the API instructs the processor to perform are the same as those described above for the processor provided by the invention, and are not described in detail again.
In one embodiment of the present invention, the FP12 format comprises any one of the following formats:
the sign bit of the FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits;
the sign bit of the FP12 format is 1 bit, the exponent bit is 5 bits, and the mantissa bit is 6 bits;
the sign bit of the FP12 format is 1 bit, the exponent bit is 4 bits, and the mantissa bit is 7 bits.
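For illustration, the three layouts just listed can be decoded with the small function below, assuming IEEE-754-style conventions (biased exponent, implicit leading 1, exponent 0 treated as subnormal) and ignoring inf/NaN patterns; the function name and those conventions are assumptions of the example, not requirements stated in the text.

```python
def decode_fp12(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode a 12-bit pattern laid out as [sign | exponent | mantissa]."""
    assert 1 + exp_bits + man_bits == 12
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == 0:                                   # subnormal, IEEE-754-style assumption
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# the same 12-bit pattern decodes to very different values under each layout
for e, m in [(8, 3), (5, 6), (4, 7)]:
    print((e, m), decode_fp12(0b0110_0000_0001, e, m))
```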
In one embodiment of the invention, the API further causes the one or more processors to: and converting the data to be operated in the neural network training into data in an FP12 format by using a data format conversion circuit in the processor.
In one embodiment of the invention, the API further causes the one or more processors to:
converting the result data of the reverse operation into high-precision data by using the data format conversion circuit, wherein the precision of the high-precision data is higher than that of the FP12 format data;
and finishing weight updating in the neural network training on a second operation circuit in the processor by using the high-precision data, wherein the second operation circuit comprises a second mantissa processing circuit, and the processing bit width of the second mantissa processing circuit is at least 8 bits.
In one embodiment of the invention, the first mantissa processing circuit and the second mantissa processing circuit in the processor are multiplexed with each other.
In one embodiment of the invention, the API further causes the one or more processors to:
converting data to be operated of a nonlinear layer in the neural network into high-precision data by utilizing the data format conversion circuit, wherein the precision of the high-precision data is higher than that of the FP12 format data;
based on a linear layer operation circuit in the first operation circuit, completing the forward operation and the backward operation of a linear layer in the neural network by utilizing the data in the FP12 format;
And based on a nonlinear layer operation circuit in the second operation circuit, completing the forward operation and the backward operation of the nonlinear layer in the neural network by using the high-precision data.
In one embodiment of the present invention, the high-precision data includes BF16 or FP32 format data, wherein the sign bit of the BF16 format is 1 bit, the exponent bit is 8 bits, the mantissa bit is 7 bits, the sign bit of the FP32 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 23 bits.
In one embodiment of the invention, the API further causes the one or more processors to:
receiving and analyzing a first mixed precision selection instruction by utilizing a mixed precision selection circuit in the processor, wherein the first mixed precision selection instruction is used for indicating to execute FP16 mixed precision training or FP12 mixed precision training;
when the first mixed precision selection instruction indicates to execute FP12 mixed precision training, the data format conversion circuit is instructed by a scaling factor setting circuit in the processor to convert the data to be operated into the first FP12 format, and the first operation circuit is instructed to set the scaling factor of the loss function in the reverse operation to 1 when the reverse operation is completed by using the first FP12 format, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
In one embodiment of the invention, the API further causes the one or more processors to:
receiving and analyzing a second mixed precision selection instruction by utilizing the mixed precision selection circuit, wherein the second mixed precision selection instruction is used for indicating to execute FP8 mixed precision training or FP12 mixed precision training;
when the second mixed precision selection instruction indicates that FP12 mixed precision training is to be executed, a scale factor setting circuit in the processor is utilized to set the scale factor for quantizing the data to be operated into the first FP12 format to 1, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
In one embodiment of the invention, the API further causes the one or more processors to:
based on a linear layer forward operation circuit in the processor, completing forward operation of a linear layer in the neural network training by utilizing the data in the first FP12 format;
based on a linear layer reverse operation circuit in the processor, the reverse operation of the linear layer in the neural network training is completed by utilizing data in a second FP12 format, wherein sign bits in the second FP12 format are 1 bit, exponent bits are 5 bits, and mantissa bits are 6 bits.
In one embodiment of the present invention, applied to multi-machine multi-card training, each card in the multi-machine multi-card system comprises at least one machine-readable medium as claimed in any one of claims 12-21, having stored thereon an application program interface API, the API being executable by one or more processors and causing each processor to perform the multi-machine multi-card training using its own scale factor.
In one embodiment of the invention, there is also provided an artificial intelligence chip including a processor as described in any one of the above.
In one embodiment of the present invention, there is also provided an electronic device including the artificial intelligence chip described above.
In one embodiment of the present invention, there is also provided a system including:
a memory;
one or more processors according to the present invention,
wherein the memory stores an application program interface API as described in any one of the above.
In one embodiment of the invention, there is also provided a method responsive to an application program interface API as described in any one of the above.
The electronic device in the embodiment of the invention comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance and/or a medical device.
The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (26)

1. A processor for performing neural network training, the neural network training comprising three phases, forward operation, reverse operation, and weight update, the processor comprising:
the control circuit is used for receiving and analyzing the instruction, and indicating the first operation circuit to complete the neural network training operation by utilizing the FP12 format according to the analyzed instruction;
The first operation circuit is used for completing forward operation, reverse operation and weight updating in the neural network training by utilizing the data in the FP12 format, and comprises a first exponent processing circuit and a first mantissa processing circuit, wherein the processing bit width of the first exponent processing circuit is at least 8 bits, and the processing bit width of the first mantissa processing circuit is at least 3 bits;
and the storage circuit is used for storing the weight updating value obtained after the weight updating and using the weight updating value as the weight of the next forward operation.
2. The processor of claim 1, wherein the FP12 format comprises any one of the following formats:
the sign bit of the FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits;
the sign bit of the FP12 format is 1 bit, the exponent bit is 5 bits, and the mantissa bit is 6 bits;
the sign bit of the FP12 format is 1 bit, the exponent bit is 4 bits, and the mantissa bit is 7 bits.
3. The processor of claim 1, wherein the processor further comprises:
and the data format conversion circuit is used for converting the data to be operated in the neural network training into the data in the FP12 format.
4. The processor of claim 3, wherein
The data format conversion circuit is further used for converting the data of the result of the reverse operation into high-precision data, and the precision of the high-precision data is higher than that of the FP12 format data;
the processor further includes:
and the second operation circuit is used for finishing weight updating in the neural network training by utilizing the high-precision data and comprises a second mantissa processing circuit, and the processing bit width of the second mantissa processing circuit is at least 8 bits.
5. The processor of claim 4, wherein a first mantissa processing circuit and the second mantissa processing circuit in the processor are multiplexed with each other.
6. The processor according to claim 3 or 4, wherein,
the data format conversion circuit is further used for converting data to be operated of a nonlinear layer in the neural network into high-precision data, and the precision of the high-precision data is higher than that of the FP12 format data;
the first arithmetic circuit includes:
a linear layer operation circuit for completing the forward operation and the backward operation of the linear layer in the neural network by using the data in the FP12 format;
the second arithmetic circuit includes:
And the nonlinear layer operation circuit is used for completing the forward operation and the backward operation of the nonlinear layer in the neural network by utilizing the high-precision data.
7. The processor of any one of claims 4-6, wherein the high precision data comprises BF16 or FP32 format data, wherein the sign bit of the BF16 format is 1 bit, the exponent bit is 8 bits, the mantissa bit is 7 bits, the sign bit of the FP32 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 23 bits.
8. The processor of any one of claims 1-7, wherein the processor further comprises:
the mixed precision selecting circuit is used for receiving and analyzing a first mixed precision selecting instruction, and the first mixed precision selecting instruction is used for indicating to execute FP16 mixed precision training or FP12 mixed precision training;
and the scaling factor setting circuit is used for instructing the data format conversion circuit to convert the data to be operated into the first FP12 format when the first mixed precision selection instruction instructs to execute the FP12 mixed precision training, and instructing the first operation circuit to set the scaling factor of the loss function in the reverse operation to be 1 when the reverse operation is completed by using the first FP12 format, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
9. The processor of claim 8, wherein
the mixed precision selection circuit is further used for receiving and analyzing a second mixed precision selection instruction, and the second mixed precision selection instruction is used for indicating to execute FP8 mixed precision training or FP12 mixed precision training;
the processor further includes:
and the scale factor setting circuit is used for setting the scale factor to be 1 when the data format conversion circuit is instructed to quantize the data to be operated into the first FP12 format when the second mixed precision selection instruction instructs to execute FP12 mixed precision training, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
10. The processor of claim 8, wherein the linear layer arithmetic circuit comprises:
the linear layer forward operation circuit is used for completing forward operation of the linear layer in the neural network training by utilizing the data in the first FP12 format;
and the linear layer reverse operation circuit is used for finishing the reverse operation of the linear layer in the neural network training by utilizing the data in the second FP12 format, wherein the sign bit in the second FP12 format is 1 bit, the exponent bit is 5 bits, and the mantissa bit is 6 bits.
11. A processor for multi-machine multi-card training, wherein each card in the multi-machine multi-card comprises at least one processor according to any one of claims 1-10, and each processor uses the scaling factor of the processor to complete the multi-machine multi-card training.
12. A machine-readable medium having stored thereon an application program interface, API, the API executable by one or more processors, the API causing the one or more processors to perform a neural network training comprising three phases, a forward operation, a reverse operation, and a weight update, the API causing the one or more processors to:
receiving and analyzing an instruction, and indicating a first operation circuit in the processor to complete neural network training operation by using an FP12 format according to the analyzed instruction;
completing forward operation, reverse operation and weight updating in the neural network training on the first operation circuit by utilizing the data in the FP12 format, wherein the first operation circuit comprises a first exponent processing circuit and a first mantissa processing circuit, the processing bit width of the first exponent processing circuit is at least 8 bits, and the processing bit width of the first mantissa processing circuit is at least 3 bits;
And storing the weight updating value obtained after the weight updating, and using the weight updating value as the weight of the next forward operation.
13. The machine-readable medium of claim 12, wherein the FP12 format comprises any one of the following formats:
the sign bit of the FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits;
the sign bit of the FP12 format is 1 bit, the exponent bit is 5 bits, and the mantissa bit is 6 bits;
the sign bit of the FP12 format is 1 bit, the exponent bit is 4 bits, and the mantissa bit is 7 bits.
14. The machine-readable medium of claim 12, wherein the API further causes the one or more processors to:
and converting the data to be operated in the neural network training into data in an FP12 format by using a data format conversion circuit in the processor.
15. The machine-readable medium of claim 14, wherein the API further causes the one or more processors to:
converting the result data of the reverse operation into high-precision data by using the data format conversion circuit, wherein the precision of the high-precision data is higher than that of the FP12 format data;
and finishing weight updating in the neural network training on a second operation circuit in the processor by using the high-precision data, wherein the second operation circuit comprises a second mantissa processing circuit, and the processing bit width of the second mantissa processing circuit is at least 8 bits.
16. The machine-readable medium of claim 15, wherein
a first mantissa processing circuit and the second mantissa processing circuit in the processor are multiplexed with each other.
17. The machine-readable medium of claim 14 or 15, wherein the API further causes the one or more processors to:
converting data to be operated of a nonlinear layer in the neural network into high-precision data by utilizing the data format conversion circuit, wherein the precision of the high-precision data is higher than that of the FP12 format data;
based on a linear layer operation circuit in the first operation circuit, completing the forward operation and the backward operation of a linear layer in the neural network by utilizing the data in the FP12 format;
and based on a nonlinear layer operation circuit in the second operation circuit, completing the forward operation and the backward operation of the nonlinear layer in the neural network by using the high-precision data.
18. The machine-readable medium according to any one of claims 15-17, wherein,
the high-precision data comprises BF16 or FP32 format data, wherein the sign bit of the BF16 format is 1 bit, the exponent bit is 8 bits, the mantissa bit is 7 bits, the sign bit of the FP32 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 23 bits.
19. The machine-readable medium of any of claims 12-18, wherein the API further causes the one or more processors to:
receiving and analyzing a first mixed precision selection instruction by utilizing a mixed precision selection circuit in the processor, wherein the first mixed precision selection instruction is used for indicating to execute FP16 mixed precision training or FP12 mixed precision training;
when the first mixed precision selection instruction indicates to execute FP12 mixed precision training, the data format conversion circuit is instructed by a scaling factor setting circuit in the processor to convert the data to be operated into the first FP12 format, and the first operation circuit is instructed to set the scaling factor of the loss function in the reverse operation to 1 when the reverse operation is completed by using the first FP12 format, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
20. The machine-readable medium of claim 19, wherein the API further causes the one or more processors to:
receiving and analyzing a second mixed precision selection instruction by utilizing the mixed precision selection circuit, wherein the second mixed precision selection instruction is used for indicating to execute FP8 mixed precision training or FP12 mixed precision training;
when the second mixed precision selection instruction indicates that FP12 mixed precision training is to be executed, a scale factor setting circuit in the processor is utilized to set the scale factor for quantizing the data to be operated into the first FP12 format to 1, wherein the sign bit of the first FP12 format is 1 bit, the exponent bit is 8 bits, and the mantissa bit is 3 bits.
21. The machine-readable medium of claim 19, wherein the API further causes the one or more processors to:
based on a linear layer forward operation circuit in the processor, completing forward operation of a linear layer in the neural network training by utilizing the data in the first FP12 format;
based on a linear layer reverse operation circuit in the processor, the reverse operation of the linear layer in the neural network training is completed by utilizing data in a second FP12 format, wherein sign bits in the second FP12 format are 1 bit, exponent bits are 5 bits, and mantissa bits are 6 bits.
22. A machine-readable medium for use in multi-machine multi-card training, wherein each of the multi-machine multi-cards includes at least one machine-readable medium as claimed in any one of claims 12-21 having stored thereon an application program interface API, the API being executable by one or more processors, the API causing the processors to perform the multi-machine multi-card training using scaling factors of the processors.
23. An artificial intelligence chip, characterized in that the chip comprises a processor according to any one of claims 1-11.
24. An electronic device comprising the artificial intelligence chip of claim 23.
25. A system, comprising:
a memory;
one or more processors according to the present invention,
wherein the memory stores an application program interface API as recited in any one of claims 12-22.
26. A method responsive to an application program interface API as claimed in any one of claims 12-22.
CN202310947078.3A 2023-07-28 2023-07-28 Training method and device applied to neural network and related products Pending CN116882475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310947078.3A CN116882475A (en) 2023-07-28 2023-07-28 Training method and device applied to neural network and related products


Publications (1)

Publication Number Publication Date
CN116882475A true CN116882475A (en) 2023-10-13

Family

ID=88262065


Country Status (1)

Country Link
CN (1) CN116882475A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination