CN110163353B - Computing device and method - Google Patents

Computing device and method

Info

Publication number: CN110163353B
Application number: CN201910195535.1A
Authority: CN (China)
Prior art keywords: data, input data, processing circuit, decimal point, unit
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN110163353A
Inventor: not disclosed
Current Assignee: Shanghai Cambricon Information Technology Co Ltd
Original Assignee: Shanghai Cambricon Information Technology Co Ltd
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority claimed from: CN201810149287.2A, CN201810207915.8A, CN201880002628.1A
Publication of CN110163353A; application granted; publication of CN110163353B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/57 Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F 7/483 – G06F 7/556 or for performing logical operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means

Abstract

A computing device, comprising: a storage unit (10) for storing input data and computation instructions; a controller unit (11) for fetching the computation instructions from the storage unit (10), decoding them to obtain one or more operation instructions, and sending the operation instructions and the input data to an arithmetic unit (12); and the arithmetic unit (12) for performing calculations on the input data according to the one or more operation instructions to obtain the result of the computation instruction. Because the computing device represents the data participating in machine learning calculations as fixed-point data, the processing speed and efficiency of training operations can be improved.

Description

Computing device and method
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a computing device and method.
Background
With the continuous development of information technology and people's ever-increasing demands, the requirements for the timeliness of information are becoming higher and higher. Currently, terminals obtain and process information based on general-purpose processors.
In practice, it has been found that processing information by running a software program on a general-purpose processor is limited by the running speed of the processor; particularly when the processor load is high, information processing efficiency is low and latency is high. For the computation models used in information processing, such as training models, the amount of computation in a training operation is large, so a general-purpose processor takes a long time to complete the training operation and does so inefficiently.
Disclosure of Invention
The embodiments of the present application provide a computing device and method, which can improve the processing speed and efficiency of operations.
In a first aspect, an embodiment of the present application provides a computing apparatus, including: a storage unit, a conversion unit, an arithmetic unit, and a controller unit; the storage unit includes a cache and a register;
the controller unit is used for determining the decimal point position of first input data and the bit width of fixed-point data, the bit width of the fixed-point data being the bit width of the first input data after conversion into fixed-point data;
the arithmetic unit is used for initializing the decimal point position of the first input data, adjusting the decimal point position of the first input data, and storing the adjusted decimal point position of the first input data into the cache of the storage unit;
the controller unit is further used for acquiring the first input data and a plurality of operation instructions from the register, acquiring the adjusted decimal point position of the first input data from the cache, and transmitting the adjusted decimal point position of the first input data together with the first input data to the conversion unit;
the conversion unit is used for converting the first input data into second input data according to the adjusted decimal point position of the first input data;
wherein the arithmetic unit adjusting the decimal point position of the first input data includes: adjusting the decimal point position of the first input data upward in a single step according to the maximum absolute value of the data in the first input data; or gradually adjusting the decimal point position of the first input data upward according to the maximum absolute value of the data in the first input data; or adjusting the decimal point position of the first input data upward in a single step according to the distribution of the first input data; or gradually adjusting the decimal point position of the first input data upward according to the distribution of the first input data; or adjusting the decimal point position of the first input data downward according to the maximum absolute value of the first input data.
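The following Python fragment is a hedged, illustrative reading of the single-step, gradual, and downward adjustment strategies above; the function names, the 16-bit default width, and the overflow test against the maximum absolute value are assumptions of this sketch, not the claimed circuit:

```python
# Illustrative sketch only: decimal point position s means values are stored
# as integers scaled by 2**s, with (bitnum - 1) magnitude bits available.
import math

def required_point(data, bitnum=16):
    # Smallest s such that max|x| <= (2**(bitnum - 1) - 1) * 2**s.
    amax = max(abs(x) for x in data)
    if amax == 0:
        return 0
    return math.ceil(math.log2(amax / (2 ** (bitnum - 1) - 1)))

def adjust_up_single_step(s, data, bitnum=16):
    # Jump straight to the position demanded by the current max |value|.
    return max(s, required_point(data, bitnum))

def adjust_up_gradually(s, data, bitnum=16):
    # Raise the position by at most one per call (e.g. per training iteration).
    return s + 1 if required_point(data, bitnum) > s else s

def adjust_down(s, data, bitnum=16):
    # If the data range has shrunk, lower s to regain precision.
    return min(s, required_point(data, bitnum))
```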
In one possible embodiment, the arithmetic unit initializes a decimal point position of the first input data, including:
initializing the decimal point position of the first input data according to the maximum absolute value of the first input data; or initializing the decimal point position of the first input data according to the minimum absolute value of the first input data; or initializing the decimal point position of the first input data according to the relationship between different data types in the first input data; or initializing the decimal point position of the first input data according to an empirical constant.
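As a hedged illustration of these initialization options, the sketch below uses common fixed-point conventions; the formulas and the default constant are assumptions of the sketch rather than the exact rules used by the device:

```python
import math

def init_point_by_max_abs(data, bitnum=16):
    # Largest magnitude must fit in (bitnum - 1) bits: roughly
    # 2**(s + bitnum - 1) >= max|x|.
    amax = max(abs(x) for x in data)
    return math.ceil(math.log2(amax)) - (bitnum - 1) if amax > 0 else 0

def init_point_by_min_abs(data):
    # Smallest representable magnitude is 2**s, so anchor s to min |x| != 0.
    amin = min(abs(x) for x in data if x != 0)
    return math.floor(math.log2(amin))

def init_point_by_empirical_constant(c=-8):
    # E.g. a constant that worked well for data of the same layer and type.
    return c
```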
In a possible embodiment, the computing device is configured to perform machine learning calculations, and the controller unit is further configured to transmit the plurality of operation instructions to the arithmetic unit;
the conversion unit is further used for transmitting the second input data to the operation unit;
the operation unit is further configured to perform an operation on the second input data according to the plurality of operation instructions to obtain an operation result.
In one possible embodiment, the machine learning calculation includes an artificial neural network operation, the first input data includes input neuron data and weight data, and the calculation result is output neuron data.
In a possible embodiment, the arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits;
the main processing circuit is used for performing preamble processing on the second input data and for exchanging data and the plurality of operation instructions with the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing intermediate operations according to the second input data and the plurality of operation instructions transmitted from the main processing circuit to obtain a plurality of intermediate results, and for transmitting the plurality of intermediate results to the main processing circuit;
and the main processing circuit is used for executing subsequent processing on the plurality of intermediate results to obtain the operation result.
In one possible embodiment, the computing device further comprises: a Direct Memory Access (DMA) unit;
the cache is further used for storing the first input data; wherein the cache comprises a scratch pad cache;
the register is further used for storing scalar data in the first input data;
the DMA unit is used for reading data from the storage unit or storing data into the storage unit.
In a possible embodiment, when the first input data is fixed-point data, the arithmetic unit further includes:
a derivation unit, used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by performing operations on the first input data.
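A hedged sketch of such derivation follows; the rules shown are the standard fixed-point scale identities, which the patent does not spell out, so they are assumptions of this illustration:

```python
def derive_point_for_multiply(sa, sb):
    # (a * 2**sa) * (b * 2**sb) = (a * b) * 2**(sa + sb):
    # a product's decimal point position is the sum of the operands' positions.
    return sa + sb

def derive_point_for_add(sa, sb):
    # Addends must be aligned to a common scale first; using the coarser
    # scale keeps the sum representable.
    return max(sa, sb)

# Example: a dot-product intermediate result of neurons at s = -8 and
# weights at s = -10 carries decimal point position -18.
print(derive_point_for_multiply(-8, -10))  # -18
```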
In a possible embodiment, the arithmetic unit further includes:
a data caching unit for caching the one or more intermediate results.
In one possible embodiment, the arithmetic unit includes a tree module, the root port of which is connected with the main processing circuit and the branch ports of which are each connected with one of the plurality of slave processing circuits;
the tree module is used for forwarding data and operation instructions between the main processing circuit and the plurality of slave processing circuits; the tree module has an n-ary tree structure, where n is an integer greater than or equal to 2.
In a possible embodiment, the arithmetic unit further comprises a branch processing circuit,
the main processing circuit is specifically configured to determine that the input neurons are broadcast data and the weights are distribution data, to partition the distribution data into a plurality of data blocks, and to send at least one of the plurality of data blocks, the broadcast data, and at least one of the plurality of operation instructions to the branch processing circuit;
the branch processing circuit is used for forwarding data blocks, broadcast data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for carrying out operation on the received data blocks and the broadcast data according to the operation instruction to obtain an intermediate result and transmitting the intermediate result to the branch processing circuit;
the main processing circuit is further configured to perform subsequent processing on the intermediate results sent by the branch processing circuit to obtain the result of the calculation instruction, and to send the result of the calculation instruction to the controller unit.
In one possible embodiment, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, the master processing circuit is connected with K slave processing circuits in the plurality of slave processing circuits, and the K slave processing circuits are as follows: n slave processing circuits of row 1, n slave processing circuits of row m, and m slave processing circuits of column 1;
the K slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits;
the main processing circuit is further configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the K slave processing circuits;
the K slave processing circuits are used for forwarding data between the main processing circuit and the remaining slave processing circuits;
the plurality of slave processing circuits are used for performing operations on the received data blocks according to the operation instruction to obtain intermediate results and transmitting the intermediate results to the K slave processing circuits;
and the main processing circuit is used for processing the intermediate results sent by the K slave processing circuits to obtain the result of the calculation instruction, and sending the result of the calculation instruction to the controller unit.
In a possible embodiment, the main processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction;
or the main processing circuit is specifically configured to combine, sort, and activate the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction.
In one possible embodiment, the main processing circuit includes: one or any combination of an activation processing circuit and an addition processing circuit;
the activation processing circuit is used for executing activation operation of data in the main processing circuit;
the addition processing circuit is used for executing addition operation or accumulation operation;
the slave processing circuit includes:
the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result;
and the accumulation processing circuit is used for executing accumulation operation on the product result to obtain the intermediate result.
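As a hedged illustration of how the circuits just listed cooperate, the Python sketch below lets each slave apply its multiplication and accumulation circuits to one block of the distributed weights while the main circuit adds the intermediate results and applies the activation; the even block split and the ReLU activation are assumptions of the sketch:

```python
def slave_circuit(weight_block, neuron_block):
    # multiplication circuit + accumulation circuit -> one intermediate result
    return sum(w * x for w, x in zip(weight_block, neuron_block))

def main_circuit(weights, neurons, n_slaves=4, activation=lambda v: max(v, 0.0)):
    # The main circuit partitions the (distribution) weights, broadcasts the
    # neurons, then combines the intermediate results and activates the sum.
    step = len(weights) // n_slaves
    intermediates = [
        slave_circuit(weights[i * step:(i + 1) * step],
                      neurons[i * step:(i + 1) * step])
        for i in range(n_slaves)
    ]
    return activation(sum(intermediates))  # addition circuit + activation circuit

print(main_circuit([0.1] * 8, [1.0] * 8))  # one output neuron, about 0.8 after ReLU
```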
In a second aspect, an embodiment of the present application provides a computing method, including:
the method comprises the steps that a controller unit determines the decimal point position of first input data and the bit width of fixed point data, wherein the bit width of the fixed point data is the bit width of the first input data which is the fixed point data; the arithmetic unit initializes the position of a decimal point of the first input data and adjusts the position of the decimal point of the first input data; the conversion unit acquires the decimal point position of the adjusted first input data and converts the first input data into second input data according to the decimal point position; wherein the arithmetic unit adjusts a decimal point position of the first input data, including:
adjusting the position of the decimal point of the first input data in a single step upwards according to the maximum value of the absolute value of the data in the first input data, or;
gradually adjusting the position of a decimal point of the first input data upwards according to the maximum value of the absolute value of the data in the first input data, or; adjusting the decimal point position of the first input data in a single step upwards according to the first input data distribution, or; gradually adjusting the position of the decimal point of the first input data upwards according to the distribution of the first input data, or; and adjusting the position of the decimal point of the first input data downwards according to the maximum value of the absolute value of the first input data.
In one possible embodiment, the arithmetic unit initializes a decimal point position of the first input data, including:
initializing the decimal point position of the first input data according to the maximum absolute value of the first input data; or initializing the decimal point position of the first input data according to the minimum absolute value of the first input data; or initializing the decimal point position of the first input data according to the relationship between different data types in the first input data; or initializing the decimal point position of the first input data according to an empirical constant.
In one possible embodiment, the computing method is a method for performing machine learning computation, and the method further includes: the arithmetic unit performs operations on the second input data according to the plurality of operation instructions to obtain an operation result.
In one possible embodiment, the machine learning calculation includes an artificial neural network operation, the first input data includes input neurons and weights, and the calculation result is an output neuron.
In a possible embodiment, when the first input data is fixed-point data, the method further comprises: the arithmetic unit derives decimal point positions of one or more intermediate results according to the decimal point positions of the first input data, wherein the one or more intermediate results are obtained through calculation according to the first input data.
In a third aspect, an embodiment of the present invention provides a machine learning arithmetic device, which includes one or more computing devices according to the first aspect. The machine learning arithmetic device is used for acquiring data to be operated and control information from other processing devices, executing specified machine learning arithmetic and transmitting an execution result to other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of computing devices, the plurality of computing devices can be linked through a specific structure and transmit data;
the plurality of computing devices are interconnected through a PCIE bus and transmit data so as to support operation of larger-scale machine learning; a plurality of the computing devices share the same control system or own respective control systems; the computing devices share the memory or own the memory; the plurality of computing devices are interconnected in any interconnection topology.
In a fourth aspect, an embodiment of the present invention provides a combined processing device, which includes the machine learning arithmetic device according to the third aspect, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user. The combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing devices, respectively, and stores data of the machine learning arithmetic device and the other processing devices.
In a fifth aspect, an embodiment of the present invention provides a neural network chip, where the neural network chip includes the computing device according to the first aspect, the machine learning arithmetic device according to the third aspect, or the combined processing device according to the fourth aspect.
In a sixth aspect, an embodiment of the present invention provides a neural network chip package structure, where the neural network chip package structure includes the neural network chip described in the fifth aspect;
in a seventh aspect, an embodiment of the present invention provides a board, where the board includes a storage device, an interface device, a control device, and the neural network chip in the fifth aspect;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
and the control device is used for monitoring the state of the chip.
Further, the storage device includes a plurality of groups of storage units, each group of storage units being connected with the chip through a bus, the storage unit being DDR SDRAM;
the chip includes a DDR controller for controlling data transmission to and data storage in each storage unit;
the interface device is as follows: a standard PCIE interface.
In an eighth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes the neural network chip described in the fifth aspect, the neural network chip package structure described in the sixth aspect, or the board described in the seventh aspect.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage device, a wearable device, a means of transportation, a household appliance, and/or a medical device.
In some embodiments, the means of transportation comprises an airplane, a ship, and/or a car; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical equipment comprises a nuclear magnetic resonance instrument, a B-mode ultrasound instrument, and/or an electrocardiograph.
These and other aspects of the invention are apparent from and will be elucidated with reference to the embodiments described hereinafter.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of a data structure of fixed-point data according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of another data structure of fixed-point data according to an embodiment of the present disclosure;
FIG. 2A is a schematic diagram of another data structure of fixed-point data according to an embodiment of the present disclosure;
FIG. 2B is a schematic diagram of another data structure of fixed-point data according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 3A is a schematic block diagram of a computing device according to an embodiment of the present application;
FIG. 3B is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 3C is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 3D is a schematic structural diagram of a main processing circuit provided in an embodiment of the present application;
FIG. 3E is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 3F is a schematic structural diagram of a tree module according to an embodiment of the present disclosure;
FIG. 3G is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 3H is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 4 is a flowchart illustrating a forward operation of a single-layer artificial neural network according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of a forward operation and a reverse training of a neural network according to an embodiment of the present disclosure;
FIG. 6 is a structural diagram of a combined processing device provided in an embodiment of the present application;
FIG. 6A is a schematic block diagram of a computing device according to another embodiment of the present application;
FIG. 7 is a block diagram of another combined processing device provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a board card provided in an embodiment of the present application;
FIG. 9 is a schematic flowchart of a calculation method according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a data decimal point position determining and adjusting process according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a distributed system according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of another distributed system according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiment of the application provides a data type, wherein the data type comprises an adjustment factor, and the adjustment factor is used for indicating the value range and the precision of the data type.
The adjustment factor comprises a first scaling factor and, optionally, a second scaling factor; the first scaling factor is used to indicate the precision of the data type, and the second scaling factor is used to adjust the value range of the data type.
Optionally, the first scaling factor may be 2^(-m), 8^(-m), 10^(-m), 2, 3, 6, 9, 10, 2^m, 8^m, 10^m, or another value.
Specifically, the first scaling factor may be a decimal point position. For example, when the decimal point of binary input data INA1 is shifted m bits to the right, the resulting input data is INB1 = INA1 × 2^m; that is, INB1 is enlarged by a factor of 2^m relative to INA1. For another example, when the decimal point of decimal input data INA2 is shifted n bits to the left, the resulting input data is INB2 = INA2 / 10^n; that is, INB2 is reduced by a factor of 10^n relative to INA2, where m and n are integers.
Alternatively, the second scaling factor may be 2, 8, 10, 16, or other values.
For example, suppose the value range of the data type corresponding to the input data is 8^(-15) to 8^(16). During the operation, when an operation result is greater than the maximum value of that range, the value range of the data type is multiplied by the second scaling factor of the data type (namely 8) to obtain a new value range of 8^(-14) to 8^(17); when the operation result is smaller than the minimum value of the range, the value range of the data type is divided by the second scaling factor (8) to obtain a new value range of 8^(-16) to 8^(15).
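A hedged sketch of this range adjustment, tracking only the exponents of the range bounds, follows; the function name and the exponent bookkeeping are illustrative assumptions:

```python
def adjust_range(lo_exp, hi_exp, result, base=8):
    # Range is [base**lo_exp, base**hi_exp]; widen or narrow it by one factor
    # of `base` (the second scaling factor) when a result falls outside it.
    if result > base ** hi_exp:
        return lo_exp + 1, hi_exp + 1   # multiply the range by the factor
    if result < base ** lo_exp:
        return lo_exp - 1, hi_exp - 1   # divide the range by the factor
    return lo_exp, hi_exp

print(adjust_range(-15, 16, 8.0 ** 17))   # (-14, 17), as in the example above
print(adjust_range(-15, 16, 8.0 ** -16))  # (-16, 15)
```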
Scaling factors may be added to data in any format (e.g., floating point number, discrete data) to adjust the size and precision of the data.
It should be noted that the decimal point position mentioned throughout this application may be the first scaling factor described here; details are not repeated below.
The following describes the structure of fixed-point data. Referring to FIG. 1, FIG. 1 is a schematic diagram of a data structure of fixed-point data according to an embodiment of the present application. The signed fixed-point data shown in FIG. 1 occupies X bits and may also be referred to as X-bit fixed-point data. The X-bit fixed-point data includes a sign bit occupying 1 bit, an integer part occupying M bits, and a fractional part occupying N bits, where X - 1 = M + N. Unsigned fixed-point data includes only the M integer bits and the N fractional bits, i.e., X = M + N.
Compared with the 32-bit floating-point representation, the short-bit fixed-point representation adopted by the present invention occupies fewer bits. In addition, for data of the same layer and the same type in a network model, such as all the convolution kernels, input neurons, or offset data of the first convolution layer, a flag bit called Point Location is additionally provided to record the decimal point position of the fixed-point data. The value of this flag bit can be adjusted according to the distribution of the input data, thereby adjusting the precision and the representable range of the fixed-point data.
For example, the floating-point number 68.6875 is converted into signed 16-bit fixed-point data with decimal point position 5. In signed 16-bit fixed-point data with decimal point position 5, the integer part occupies 10 bits, the fractional part occupies 5 bits, and the sign bit occupies 1 bit. The conversion unit converts the floating-point number 68.6875 into the signed 16-bit fixed-point data 0000100010010110, as shown in FIG. 2.
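The conversion in this example can be checked with a short sketch; the rounding and saturation choices below are assumptions of the sketch, as the text above only fixes the bit layout:

```python
def float_to_fixed(value, bitnum=16, point=5):
    # `point` fractional bits: scale, round, saturate, and encode in
    # two's complement within `bitnum` bits.
    scaled = round(value * (1 << point))            # 68.6875 * 32 = 2198
    limit = (1 << (bitnum - 1)) - 1
    scaled = max(-limit, min(limit, scaled))        # saturate to the range
    return scaled & ((1 << bitnum) - 1)             # two's-complement encoding

word = float_to_fixed(68.6875)
print(format(word, "016b"))  # 0000100010010110
print(word / 32)             # 68.6875 (exactly representable at this scale)
```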
In one possible embodiment, the fixed-point data may also be represented in the manner shown in FIG. 2A. As shown in FIG. 2A, the bit number of the fixed-point data is bitnum, the decimal point position is s, and the precision of the fixed-point data is 2^s. The first bit is a sign bit indicating whether the data is positive or negative; for example, a sign bit of 0 indicates that the fixed-point data is positive, and a sign bit of 1 indicates that it is negative. The range represented by the fixed-point data is [neg, pos], where pos = (2^(bitnum-1) - 1) × 2^s and neg = -(2^(bitnum-1) - 1) × 2^s.
Here, bitnum can be any positive integer, and s can be any integer not less than s_min. Optionally, bitnum may be 8, 16, 24, 32, 64, or another value; further, s_min is -64.
In one embodiment, a plurality of fixed-point representations can be combined for data with larger values, as shown in FIG. 2B: the data is represented by a combination of three pieces of fixed-point data, namely fixed-point data 1, fixed-point data 2, and fixed-point data 3. The bit width of fixed-point data 1 is bitnum1 with decimal point position s1; the bit width of fixed-point data 2 is bitnum2 with decimal point position s2; and the bit width of fixed-point data 3 is bitnum3 with decimal point position s3, where s2 + bitnum2 - 2 = s1 - 1 and s3 + bitnum3 - 2 = s2 - 1. The range represented by the three pieces of fixed-point data is [neg, pos], where pos = (2^(bitnum1-1) - 1) × 2^s1 and neg = -(2^(bitnum1-1) - 1) × 2^s1.
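Under the chaining relations just given, each additional word extends precision immediately below the previous word's least significant bit, so the combined value can be decoded as in the following hedged sketch; the function and the signed-integer word representation are assumptions of this illustration:

```python
def decode_multiword(words, bitnums, s1):
    # words[i] is the signed integer content of fixed-point data i+1;
    # word i+1 continues just below word i, per s2 + bitnum2 - 2 = s1 - 1.
    value, s = 0.0, s1
    for w, bits in zip(words, bitnums):
        value += w * 2.0 ** s
        s = s - (bits - 2) - 1  # decimal point position of the next word
    return value

# Example: three 8-bit words with s1 = 0 cover scales from 2**6 down to 2**-14.
print(decode_multiword([100, 50, 25], [8, 8, 8], 0))  # 100.39215087890625
```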
First, a computing device as used herein is described. Referring to fig. 3, there is provided a computing device comprising: the device comprises a controller unit 11, an arithmetic unit 12 and a conversion unit 13, wherein the controller unit 11 is connected with the arithmetic unit 12, and the conversion unit 13 is connected with both the controller unit 11 and the arithmetic unit 12;
in a possible embodiment, the controller unit 11 is adapted to retrieve the first input data and to calculate the instructions.
In one embodiment, the first input data is machine learning data. Further, the machine learning data includes input neuron data and weight data, and the output neuron data is the final output result or intermediate data.
In an alternative, the first input data and the computation instruction may be obtained through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
The above computation instructions include, but are not limited to, a convolution operation instruction, a forward training instruction, or another neural network operation instruction; the specific expression of the computation instruction is not limited in the present application.
The controller unit 11 is further configured to parse the computation instruction to obtain a data conversion instruction and/or one or more operation instructions. The data conversion instruction includes an operation field and an operation code; the operation code indicates the function of the data conversion instruction, and the operation field of the data conversion instruction includes the decimal point position, a flag bit indicating the data type of the first input data, and a conversion mode identifier of the data type.
When the operation domain of the data conversion instruction is an address of a storage space, the controller unit 11 obtains the decimal point position, a flag bit indicating the data type of the first input data, and a conversion mode identifier of the data type from the storage space corresponding to the address.
The controller unit 11 transmits the operation code and operation field of the data conversion instruction and the first input data to the conversion unit 13; transmitting the plurality of operation instructions to the operation unit 12;
the converting unit 13 is configured to convert the first input data into second input data according to the operation code and the operation domain of the data conversion instruction, where the second input data is fixed-point data; and transmits the second input data to the arithmetic unit 12;
the arithmetic unit 12 is configured to perform an arithmetic operation on the second input data according to the plurality of arithmetic instructions to obtain a calculation result of the calculation instruction.
In a possible embodiment, the present application sets the operation unit 12 in a master-slave structure. For the computation instruction of a forward operation, the operation unit can split the data according to the computation instruction, so that the plurality of slave processing circuits 102 can perform parallel operations on the parts with a large amount of computation, thereby increasing the operation speed, saving operation time, and reducing power consumption. As shown in fig. 3A, the arithmetic unit 12 includes a master processing circuit 101 and a plurality of slave processing circuits 102;
the main processing circuit 101 is configured to perform preamble processing on the second input data and to exchange data and the plurality of operation instructions with the plurality of slave processing circuits 102;
the plurality of slave processing circuits 102, configured to perform an intermediate operation according to second input data and the plurality of operation instructions transmitted from the master processing circuit 101 to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit 101;
the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In one embodiment, the machine learning operation includes a deep learning operation (i.e., an artificial neural network operation), and the machine learning data (i.e., the first input data) includes input neurons and weights (i.e., neural network model data). The output neurons form the calculation result or an intermediate result of the computation instruction. The following takes the deep learning operation as an example, but it should be understood that the machine learning operation is not limited thereto.
Optionally, the computing device may further include: the storage unit 10 and a direct memory access (DMA) unit 50. The storage unit 10 may include one or any combination of a register and a cache; specifically, the cache is used for storing the computation instruction, and the register 201 is configured to store the first input data and scalars. The first input data includes input neurons, weights, and output neurons.
The cache 202 is a scratch pad cache.
The DMA unit 50 is used to read or store data from the memory unit 10.
In a possible embodiment, the register 201 stores the operation instruction, the first input data, the decimal point position, a flag bit indicating a data type of the first input data, and a conversion mode identifier of the data type; the controller unit 11 directly obtains the operation instruction, the first input data, the decimal point position, a flag bit indicating the data type of the first input data, and a conversion mode identifier of the data type from the register 201; transmitting the first input data, the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identification of the data type to the above conversion unit 13; transmitting the operation instruction to the operation unit 12;
the conversion unit 13 converts the first input data into the second input data according to the decimal point position, the flag bit indicating the data type of the first input data, and the conversion mode identifier of the data type; then transmitting the second input data to the arithmetic unit 12;
the arithmetic unit 12 performs an arithmetic operation on the second input data according to the arithmetic instruction to obtain an arithmetic result.
Optionally, the controller unit 11 includes: an instruction cache unit 110, an instruction processing unit 111, and a store queue unit 113;
the instruction cache unit 110 is configured to store the calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to analyze the computation instruction to obtain the data conversion instruction and the plurality of operation instructions, and analyze the data conversion instruction to obtain an operation code and an operation domain of the data conversion instruction;
the storage queue unit 113 is configured to store an instruction queue, where the instruction queue includes: and a plurality of operation instructions or calculation instructions to be executed according to the front and back sequence of the queue.
For example, in an alternative embodiment, the main processing circuit 101 may also include a control unit, and the control unit may include a main instruction processing unit, specifically configured to decode an instruction into a microinstruction. Of course, in another alternative, the slave processing circuit 102 may also include another control unit including a slave instruction processing unit, specifically for receiving and processing microinstructions. The micro instruction may be a next-stage instruction of the instruction, and the micro instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.
In one alternative, the structure of the computation instruction may be as shown in Table 1 below.

Operation code | Register or immediate | Register/immediate | ...

TABLE 1

The ellipsis in the table above indicates that multiple registers or immediates may be included.
In another alternative, the computation instructions may include one or more operation domains and an opcode. The computation instructions may include neural network operation instructions. Taking the neural network operation instruction as an example, as shown in Table 2, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains, and each of register number 0 through register number 4 may be the number of one or more registers.
TABLE 2 (neural network operation instruction format; the table is reproduced as an image in the original document)
The register may be an off-chip memory or, in practical applications, an on-chip memory for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1; for example, when n = 1 the data is 1-dimensional data, i.e., a vector; when n = 2 the data is 2-dimensional data, i.e., a matrix; and when n is 3 or more, the data is a multidimensional tensor.
Optionally, the controller unit 11 may further include:
a dependency processing unit 112, configured to determine whether a first operation instruction is associated with a zeroth operation instruction before the first operation instruction when there are multiple operation instructions, if the first operation instruction is associated with the zeroth operation instruction, cache the first operation instruction in the instruction cache unit 110, and after the zeroth operation instruction is completely executed, extract the first operation instruction from the instruction cache unit 110 and transmit the first operation instruction to the operation unit;
the determining whether the first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction comprises: extracting a first storage address interval of required data (such as a matrix) in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of the required matrix in the zeroth operation instruction according to the zeroth operation instruction, if the first storage address interval and the zeroth storage address interval have an overlapped area, determining that the first operation instruction and the zeroth operation instruction have an association relation, and if the first storage address interval and the zeroth storage address interval do not have an overlapped area, determining that the first operation instruction and the zeroth operation instruction do not have an association relation.
In another alternative embodiment, as shown in fig. 3B, the arithmetic unit 12 includes a master processing circuit 101, a plurality of slave processing circuits 102, and a plurality of branch processing circuits 103.
The main processing circuit 101 is specifically configured to determine that the input neurons are broadcast data and the weights are distribution data, to partition the distribution data into a plurality of data blocks, and to send at least one of the plurality of data blocks, the broadcast data, and at least one of the plurality of operation instructions to the branch processing circuit 103;
the branch processing circuit 103 is configured to forward a data block, broadcast data, and an operation instruction between the master processing circuit 101 and the plurality of slave processing circuits 102;
the slave processing circuits 102 are configured to perform an operation on the received data block and broadcast data according to the operation instruction to obtain an intermediate result, and transmit the intermediate result to the branch processing circuit 103;
the main processing circuit 101 is further configured to perform subsequent processing on the intermediate result sent from the branch processing circuit 103 to obtain a result of the arithmetic instruction, and send the result of the arithmetic instruction to the controller unit 11.
In another alternative embodiment, the arithmetic unit 12 may include a master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 3C. As shown in fig. 3C, the plurality of slave processing circuits 102 are distributed in an array; each slave processing circuit 102 is connected to the other adjacent slave processing circuits 102, and the master processing circuit 101 is connected to K slave processing circuits 102 among the plurality of slave processing circuits 102, the K slave processing circuits 102 being: the n slave processing circuits 102 in the 1st row, the n slave processing circuits 102 in the m-th row, and the m slave processing circuits 102 in the 1st column. That is, the K slave processing circuits 102 are the slave processing circuits 102 directly connected to the master processing circuit 101 among the plurality of slave processing circuits 102.
K slave processing circuits 102 for forwarding data and instructions between the master processing circuit 101 and the plurality of slave processing circuits 102;
the master processing circuit 101 is further configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the K slave processing circuits 102;
the K slave processing circuits 102 are configured to forward data between the master processing circuit 101 and the remaining slave processing circuits 102;
the plurality of slave processing circuits 102 are configured to perform operations on the received data blocks according to the operation instruction to obtain intermediate results and to transmit the intermediate results to the K slave processing circuits 102;
the main processing circuit 101 is configured to process the intermediate results sent by the K slave processing circuits 102 to obtain a result of the calculation instruction, and send the result of the calculation instruction to the controller unit 11.
Optionally, as shown in fig. 3D, the main processing circuit 101 in fig. 3A to 3C may further include: one or any combination of the activation processing circuit 1011 and the addition processing circuit 1012;
an activation processing circuit 1011 for performing an activation operation of data in the main processing circuit 101;
an addition processing circuit 1012 is used to perform addition or accumulation.
The slave processing circuit 102 includes: a multiplication processing circuit for performing a multiplication operation on the received data block to obtain a product result; a forwarding processing circuit (optional) for forwarding the received data block or the product result; and an accumulation processing circuit for performing an accumulation operation on the product result to obtain the intermediate result.
In a possible embodiment, the first input data is data whose data type does not match the operation type indicated by the operation instruction, and the second input data is data whose data type matches that operation type. The conversion unit 13 obtains the operation code and the operation field of the data conversion instruction; the operation code indicates the function of the data conversion instruction, and the operation field includes the decimal point position and the conversion mode identifier of the data type. The conversion unit 13 converts the first input data into the second input data according to the decimal point position and the conversion mode identifier of the data type.
Specifically, the conversion mode identifiers of the data types correspond one-to-one to the conversion modes of the data types. Referring to Table 3 below, Table 3 is a correspondence table between the conversion mode identifier of the data type and the conversion mode of the data type.
Conversion mode identifier of the data type | Conversion mode of the data type
00 | fixed-point data converted into fixed-point data
01 | floating-point data converted into floating-point data
10 | fixed-point data converted into floating-point data
11 | floating-point data converted into fixed-point data
TABLE 3
As shown in Table 3, when the conversion mode identifier of the data type is 00, the conversion mode is fixed-point data converted into fixed-point data; when the identifier is 01, the conversion mode is floating-point data converted into floating-point data; when the identifier is 10, the conversion mode is fixed-point data converted into floating-point data; and when the identifier is 11, the conversion mode is floating-point data converted into fixed-point data.
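Decoding this 2-bit identifier is straightforward; the following hedged sketch illustrates Table 3, with the dictionary encoding being an assumption of the sketch:

```python
CONVERSION_MODES = {
    0b00: ("fixed-point", "fixed-point"),
    0b01: ("floating-point", "floating-point"),
    0b10: ("fixed-point", "floating-point"),
    0b11: ("floating-point", "fixed-point"),
}

def decode_conversion_mode(identifier):
    src, dst = CONVERSION_MODES[identifier & 0b11]
    return f"convert {src} data to {dst} data"

print(decode_conversion_mode(0b10))  # convert fixed-point data to floating-point data
```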
Optionally, the correspondence between the conversion mode identifier of the data type and the conversion mode of the data type may also be as shown in table 4 below.
Conversion mode identifier of the data type | Conversion mode of the data type
0000 | 64-bit fixed-point data converted into 64-bit floating-point data
0001 | 32-bit fixed-point data converted into 64-bit floating-point data
0010 | 16-bit fixed-point data converted into 64-bit floating-point data
0011 | 32-bit fixed-point data converted into 32-bit floating-point data
0100 | 16-bit fixed-point data converted into 32-bit floating-point data
0101 | 16-bit fixed-point data converted into 16-bit floating-point data
0110 | 64-bit floating-point data converted into 64-bit fixed-point data
0111 | 32-bit floating-point data converted into 64-bit fixed-point data
1000 | 16-bit floating-point data converted into 64-bit fixed-point data
1001 | 32-bit floating-point data converted into 32-bit fixed-point data
1010 | 16-bit floating-point data converted into 32-bit fixed-point data
1011 | 16-bit floating-point data converted into 16-bit fixed-point data
TABLE 4
As shown in table 4, when the conversion mode of the data type is identified as 0000, the conversion mode of the data type is that 64-bit fixed point data is converted into 64-bit floating point data; when the conversion mode of the data type is marked as 0001, the conversion mode of the data type is that 32-bit fixed point data is converted into 64-bit floating point data; when the conversion mode of the data type is 0010, the conversion mode of the data type is that 16-bit fixed point data is converted into 64-bit floating point data; when the conversion mode of the data type is identified as 0011, the conversion mode of the data type is that 32-bit fixed point data is converted into 32-bit floating point data; when the conversion mode of the data type is identified as 0100, the conversion mode of the data type is that 16-bit fixed point data is converted into 32-bit floating point data; when the conversion mode of the data type is identified as 0101, the conversion mode of the data type is that 16-bit fixed point data is converted into 16-bit floating point data; when the conversion mode of the data type is 0110, the conversion mode of the data type is that 64-bit floating point data is converted into 64-bit fixed point data; when the conversion mode of the data type is 0111, the conversion mode of the data type is that 32-bit floating point data is converted into 64-bit fixed point data; when the conversion mode of the data type is marked as 1000, the conversion mode of the data type is that 16-bit floating point data is converted into 64-bit fixed point data; when the conversion mode of the data type is marked as 1001, the conversion mode of the data type is that 32-bit floating point data is converted into 32-bit fixed point data; when the conversion mode of the data type is marked as 1010, the conversion mode of the data type is that 16-bit floating point data is converted into 32-bit fixed point data; when the conversion mode of the data type is indicated as 1011, the conversion mode of the data type is that 16-bit floating point data is converted into 16-bit fixed point data.
In a possible embodiment, the controller unit 11 obtains a calculation instruction from the storage unit 10, and parses the calculation instruction to obtain one or more operation instructions, where the operation instruction may be a variable format operation instruction or a fixed format operation instruction.
The variable format operation instruction comprises an operation code and an operation field, wherein the operation code is used for indicating the function of the variable format operation instruction, and the operation field comprises a first address of first input data, the length (optional) of the first input data, a first address of output data, a decimal point position, a data type flag bit (optional) of the first input data and an operation type identifier.
When the operation instruction is a variable format operation instruction, the controller unit 11 parses the variable format operation instruction to obtain a first address of the first input data, a length of the first input data, a first address of the output data, a decimal point position, a data type flag bit of the first input data, and an operation type identifier, acquires the first input data from the storage unit 10 according to the first address of the first input data and the length of the first input data, transmits the first input data, the decimal point position, the data type flag bit of the first input data, and the operation type identifier to the conversion unit 13, and transmits the first address of the output data to the operation unit 12;
the conversion unit 13 converts the first input data into second input data according to the data type flag bit, the decimal point position, and the operation type indicated by the operation type identifier; the second input data is then transmitted to the arithmetic unit 12.
The master processing circuit 101 and the slave processing circuit 102 of the arithmetic unit 12 perform arithmetic operations on the second input data to obtain a result of the calculation instruction; the result of the calculation instruction is stored in the storage unit 10 at a position corresponding to the head address of the output data.
The operation type identifier indicates the type of the data that participates in the operation when the arithmetic unit 12 performs the operation; the types include fixed-point data, floating-point data, integer data, discrete data, and the like.
In a possible embodiment, the storage unit 10 stores a first address of the first input data, a length of the first input data, a first address of the output data, a decimal point position, a data type flag bit of the first input data, and an operation type identifier, and the controller unit 11 directly obtains the first address of the first input data, the length of the first input data, the first address of the output data, the decimal point position, the data type flag bit of the first input data, and the operation type identifier from the storage unit 10, and then performs subsequent operations according to the above process.
For example, the operation type identifier is 0 or 1. When the identifier is 1, the master processing circuit 101 and the slave processing circuit 102 of the arithmetic unit 12 perform a floating-point operation, that is, the data participating in the operation is floating-point data; when the identifier is 0, they perform a fixed-point operation, that is, the data participating in the operation is fixed-point data.
The arithmetic unit 12 can thus determine the type of the input data and the type of the operation from the data type flag bit and the operation type identifier.
Specifically, referring to table 5, table 5 is a mapping relationship table of the flag bit of the data type and the operation type identifier.
Operation type identifier | Data type flag bit | Behavior
0 | 0 | fixed-point operation, no data conversion
0 | 1 | floating-point operation, no data conversion
1 | 0 | convert fixed-point data to floating-point data, then floating-point operation
1 | 1 | convert floating-point data to fixed-point data, then fixed-point operation

TABLE 5
As shown in Table 5, when the operation type identifier is 0 and the data type flag bit is 0, the first input data is fixed-point data, and the master processing circuit 101 and the slave processing circuit 102 of the arithmetic unit 12 perform a fixed-point operation without data conversion; when the operation type identifier is 0 and the data type flag bit is 1, the first input data is floating-point data, and they perform a floating-point operation without data conversion; when the operation type identifier is 1 and the data type flag bit is 0, the first input data is fixed-point data, the conversion unit 13 converts it according to the decimal point position into second input data that is floating-point data, and the master processing circuit 101 and the slave processing circuit 102 perform the operation on the second input data; when the operation type identifier is 1 and the data type flag bit is 1, the first input data is floating-point data, the conversion unit 13 converts it according to the decimal point position into second input data that is fixed-point data, and the master processing circuit 101 and the slave processing circuit 102 perform the operation on the second input data.
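A hedged sketch of this dispatch, with toy conversion helpers standing in for the conversion unit 13 (everything here is an illustrative assumption, not the hardware design):

```python
def fixed_to_float(x_int, s):
    # interpret the fixed-point integer x_int with decimal point position s
    return x_int * (2.0 ** s)

def float_to_fixed(x, s):
    # quantize x to an integer count of 2**s steps (clamping omitted for brevity)
    return int(x / (2.0 ** s))

def prepare_operands(first_input, op_type_id, data_type_flag, s):
    # Table 5 dispatch: op_type_id 0 -> operate directly, no conversion;
    # op_type_id 1 -> convert (fixed->float if flag 0, float->fixed if flag 1).
    if op_type_id == 0:
        return first_input
    if data_type_flag == 0:
        return fixed_to_float(first_input, s)
    return float_to_fixed(first_input, s)
```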
The fixed-point data includes 64-bit fixed-point data, 32-bit fixed-point data, and 16-bit fixed-point data. The floating point data includes 64-bit floating point data, 32-bit floating point data and 16-bit floating point data. The mapping relationship between the flag bits and the operation type identifiers can be specifically referred to in table 6 below.
Operation type identifier | Data type flag bit 0 | Data type flag bit 1
0000 | 64-bit fixed-point operation, no conversion | 64-bit floating-point operation, no conversion
0001 | 32-bit fixed-point operation, no conversion | 32-bit floating-point operation, no conversion
0010 | 16-bit fixed-point operation, no conversion | 16-bit floating-point operation, no conversion
0011 | convert 64-bit fixed-point to 64-bit floating-point | convert 64-bit floating-point to 64-bit fixed-point
0100 | convert 32-bit fixed-point to 64-bit floating-point | convert 32-bit floating-point to 64-bit fixed-point
0101 | convert 16-bit fixed-point to 64-bit floating-point | convert 16-bit floating-point to 64-bit fixed-point
0110 | convert 32-bit fixed-point to 32-bit floating-point | convert 32-bit floating-point to 32-bit fixed-point
0111 | convert 16-bit fixed-point to 32-bit floating-point | convert 16-bit floating-point to 32-bit fixed-point
1000 | convert 16-bit fixed-point to 16-bit floating-point | convert 16-bit floating-point to 16-bit fixed-point
1001 | convert 64-bit fixed-point to 32-bit floating-point | convert 64-bit floating-point to 32-bit fixed-point
1010 | convert 64-bit fixed-point to 16-bit floating-point | convert 64-bit floating-point to 16-bit fixed-point
1011 | convert 32-bit fixed-point to 16-bit floating-point | convert 32-bit floating-point to 16-bit fixed-point

TABLE 6
As shown in Table 6, when the operation type identifier is 0000 and the data type flag bit is 0, the first input data is 64-bit fixed-point data, and the master processing circuit 101 and the slave processing circuit 102 of the arithmetic unit 12 perform a 64-bit fixed-point operation without data type conversion; when the operation type identifier is 0000 and the data type flag bit is 1, the first input data is 64-bit floating-point data, and a 64-bit floating-point operation is performed without data type conversion; when the operation type identifier is 0001 and the data type flag bit is 0, the first input data is 32-bit fixed-point data, and a 32-bit fixed-point operation is performed without data type conversion; when the operation type identifier is 0001 and the data type flag bit is 1, the first input data is 32-bit floating-point data, and a 32-bit floating-point operation is performed without data type conversion; when the operation type identifier is 0010 and the data type flag bit is 0, the first input data is 16-bit fixed-point data, and a 16-bit fixed-point operation is performed without data type conversion; when the operation type identifier is 0010 and the data type flag bit is 1, the first input data is 16-bit floating-point data, and a 16-bit floating-point operation is performed without data type conversion.
When the operation type identifier is 0011, the conversion unit 13 converts the first input data according to the decimal point position: with data type flag bit 0, 64-bit fixed-point first input data becomes 64-bit floating-point second input data, on which the master processing circuit 101 and the slave processing circuit 102 of the arithmetic unit 12 perform a 64-bit floating-point operation; with data type flag bit 1, 64-bit floating-point first input data becomes 64-bit fixed-point second input data, on which a 64-bit fixed-point operation is performed.
When the operation type identifier is 0100: with data type flag bit 0, 32-bit fixed-point first input data is converted into 64-bit floating-point second input data, on which a 64-bit floating-point operation is performed; with data type flag bit 1, 32-bit floating-point first input data is converted into 64-bit fixed-point second input data, on which a 64-bit fixed-point operation is performed.
When the operation type identifier is 0101: with data type flag bit 0, 16-bit fixed-point first input data is converted into 64-bit floating-point second input data, on which a 64-bit floating-point operation is performed; with data type flag bit 1, 16-bit floating-point first input data is converted into 64-bit fixed-point second input data, on which a 64-bit fixed-point operation is performed.
When the operation type identifier is 0110: with data type flag bit 0, 32-bit fixed-point first input data is converted into 32-bit floating-point second input data, on which a 32-bit floating-point operation is performed; with data type flag bit 1, 32-bit floating-point first input data is converted into 32-bit fixed-point second input data, on which a 32-bit fixed-point operation is performed.
When the operation type identifier is 0111: with data type flag bit 0, 16-bit fixed-point first input data is converted into 32-bit floating-point second input data, on which a 32-bit floating-point operation is performed; with data type flag bit 1, 16-bit floating-point first input data is converted into 32-bit fixed-point second input data, on which a 32-bit fixed-point operation is performed.
When the operation type identifier is 1000: with data type flag bit 0, 16-bit fixed-point first input data is converted into 16-bit floating-point second input data, on which a 16-bit floating-point operation is performed; with data type flag bit 1, 16-bit floating-point first input data is converted into 16-bit fixed-point second input data, on which a 16-bit fixed-point operation is performed.
When the operation type identifier is 1001: with data type flag bit 0, 64-bit fixed-point first input data is converted into 32-bit floating-point second input data, on which a 32-bit floating-point operation is performed; with data type flag bit 1, 64-bit floating-point first input data is converted into 32-bit fixed-point second input data, on which a 32-bit fixed-point operation is performed.
When the operation type identifier is 1010: with data type flag bit 0, 64-bit fixed-point first input data is converted into 16-bit floating-point second input data, on which a 16-bit floating-point operation is performed; with data type flag bit 1, 64-bit floating-point first input data is converted into 16-bit fixed-point second input data, on which a 16-bit fixed-point operation is performed.
When the operation type identifier is 1011: with data type flag bit 0, 32-bit fixed-point first input data is converted into 16-bit floating-point second input data, on which a 16-bit floating-point operation is performed; with data type flag bit 1, 32-bit floating-point first input data is converted into 16-bit fixed-point second input data, on which a 16-bit fixed-point operation is performed. In all converting cases, the conversion unit 13 performs the conversion according to the decimal point position, and the master processing circuit 101 and the slave processing circuit 102 of the arithmetic unit 12 perform the operation on the second input data.
In a possible embodiment, the operation instruction is a fixed-point format operation instruction, which includes an operation code and an operation field; the operation code indicates the function of the fixed-point format operation instruction, and the operation field includes a first address of the first input data, the length of the first input data (optional), a first address of the output data, and a decimal point position.
After the controller unit 11 obtains the fixed-point format operation instruction, it parses the instruction to obtain the first address of the first input data, the length of the first input data, the first address of the output data, and the decimal point position; the controller unit 11 then obtains the first input data from the storage unit 10 according to the first address and length of the first input data, transmits the first input data and the decimal point position to the conversion unit 13, and transmits the first address of the output data to the arithmetic unit 12. The conversion unit 13 converts the first input data into the second input data according to the decimal point position and transmits the second input data to the arithmetic unit 12; the master processing circuit 101 and the slave processing circuit 102 of the arithmetic unit 12 operate on the second input data to obtain the result of the calculation instruction and store it at the position in the storage unit 10 corresponding to the first address of the output data.
In a possible embodiment, before the arithmetic unit 12 of the computing device performs the operation of the i-th layer of the multi-layer neural network model, the controller unit 11 of the computing device obtains a configuration instruction, which includes a decimal point position and a data type participating in the operation. The controller unit 11 parses the configuration instruction to obtain the decimal point position and the data type participating in the operation, or obtains them directly from the storage unit 10. After obtaining the input data, the controller unit 11 judges whether the data type of the input data is consistent with the data type participating in the operation. If it is not, the controller unit 11 transmits the input data, the decimal point position, and the data type participating in the operation to the conversion unit 13; the conversion unit 13 performs data type conversion on the input data according to the decimal point position and the data type participating in the operation, so that the data type of the input data becomes consistent with the data type participating in the operation; the converted data is then transmitted to the arithmetic unit 12, where the master processing circuit 101 and the slave processing circuit 102 operate on the converted input data. If the data type of the input data is consistent with the data type participating in the operation, the controller unit 11 transmits the input data to the arithmetic unit 12, and the master processing circuit 101 and the slave processing circuit 102 operate directly on the input data without data type conversion.
Further, when the input data is fixed-point data and the data type participating in the operation is also fixed-point, the controller unit 11 determines whether the decimal point position of the input data is consistent with the decimal point position of the data participating in the operation. If not, the controller unit 11 transmits the input data, the decimal point position of the input data, and the decimal point position of the data participating in the operation to the conversion unit 13; the conversion unit 13 converts the input data into fixed-point data whose decimal point position is consistent with that of the data participating in the operation, and then transmits the converted data to the operation unit, where the master processing circuit 101 and the slave processing circuit 102 of the operation unit 12 operate on the converted data.
In other words, the arithmetic instruction may be replaced with the configuration instruction.
In another embodiment, the operation instruction is a matrix-multiply-matrix instruction, an accumulation instruction, an activation instruction, or the like.
In an alternative embodiment, as shown in fig. 3E, the arithmetic unit comprises: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit 101, and the branch ports of the tree module are respectively connected with one slave processing circuit 102 in the plurality of slave processing circuits 102;
the tree module has a transceiving function, as shown in fig. 3E, the tree module is a transmitting function, as shown in fig. 6A, the tree module is a receiving function.
The tree module is configured to forward data blocks, weights, and operation instructions between the master processing circuit 101 and the plurality of slave processing circuits 102.
Optionally, the tree module is an optional component of the computing device and may include at least one layer of nodes. The nodes are wire structures with a forwarding function and may not themselves have a computing function. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example the binary tree structure shown in fig. 3F, or a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment does not limit the specific value of n; the number of layers may be 2, and the slave processing circuits 102 may be connected to nodes of layers other than the penultimate layer, for example the nodes of the last layer shown in fig. 3F.
Optionally, the operation unit may carry a separate cache; as shown in fig. 3G, it may include a neuron buffer unit 63 that buffers the input neuron vector data and the output neuron value data of the slave processing circuits 102.
As shown in fig. 3H, the arithmetic unit may further include a weight buffer unit 64 for buffering the weight data required by the slave processing circuits 102 during calculation.
In an alternative embodiment, taking the fully connected operation in a neural network as an example, the process may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be the sigmoid, tanh, relu, or softmax function. Assuming a binary tree structure with 8 slave processing circuits 102, the implementation may be as follows:
the controller unit 11 acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit 10, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit 101;
the master processing circuit 101 splits the input neuron matrix x into 8 sub-matrices, then distributes the 8 sub-matrices to 8 slave processing circuits 102 via a tree module, broadcasts the weight matrix w to the 8 slave processing circuits 102,
the slave processing circuit 102 executes multiplication and accumulation operations of the 8 sub-matrices and the weight matrix w in parallel to obtain 8 intermediate results, and sends the 8 intermediate results to the master processing circuit 101;
the main processing circuit 101 arranges the 8 intermediate results to obtain the operation result of wx, performs the bias-b operation and then the activation operation on this result to obtain the final result y, and sends y to the controller unit 11, which outputs the final result y or stores it into the storage unit 10.
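As an illustration only, the flow above can be mimicked in software; the numpy sketch below assumes a column-wise split of x and a scalar bias b (the names and the split strategy are assumptions, not the hardware design):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fully_connected(x, w, b, f=sigmoid, n_slaves=8):
    sub_matrices = np.array_split(x, n_slaves, axis=1)   # master: split x into 8 sub-matrices
    intermediates = [w @ sub for sub in sub_matrices]    # slaves: multiply-accumulate in parallel
    wx = np.concatenate(intermediates, axis=1)           # master: arrange the 8 intermediate results
    return f(wx + b)                                     # master: bias b, then activation f

y = fully_connected(np.random.randn(16, 8), np.random.randn(4, 16), b=0.5)
```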
In one embodiment, the arithmetic unit 12 includes, but is not limited to: one or more multipliers of a first part; one or more adders of a second part (more specifically, the adders of the second part may form an adder tree); an activation function unit of a third part; and/or a vector processing unit of a fourth part. More specifically, the vector processing unit may handle vector operations and/or pooling operations. The first part multiplies input data 1 (in1) by input data 2 (in2) to obtain the multiplied output (out): out = in1 * in2. The second part adds the input data in1 through adders to obtain the output data (out); more specifically, when the second part is an adder tree, in1 is added step by step through the tree, where in1 is a vector of length N with N greater than 1: out = in1[1] + in1[2] + ... + in1[N]; and/or the accumulated in1 is added to the input data in2: out = in1[1] + in1[2] + ... + in1[N] + in2; or the input data in1 and in2 are added: out = in1 + in2. The third part applies an activation function (active) to the input data (in) to obtain the activation output data (out): out = active(in); the active function may be sigmoid, tanh, relu, softmax, and the like. Besides activation, the third part may implement other nonlinear functions, obtaining the output data by applying an operation (f) to the input data: out = f(in). The vector processing unit performs a pooling operation on the input data (in) to obtain the output data (out) after pooling: out = pool(in), where pool denotes the pooling operation, which includes but is not limited to mean pooling, max pooling, and median pooling; the input data in is the data in the pooling kernel associated with the output out.
The operation performed by the operation unit includes: a first part multiplying input data 1 by input data 2 to obtain multiplied data; and/or a second part performing an addition operation (more specifically, an adder tree operation that adds input data 1 step by step through an adder tree) or adding input data 1 and input data 2 to obtain output data; and/or a third part performing the activation function operation, obtaining output data by applying the activation function (active) to the input data; and/or a fourth part performing the pooling operation, out = pool(in), where pool is a pooling operation including but not limited to mean pooling, max pooling, and median pooling, and the input data in is the data in the pooling kernel associated with the output out. One or more of these parts may be freely selected and combined in different orders, thereby implementing operations of various functions; the computing units accordingly form a two-, three-, or four-stage pipeline architecture.
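A minimal software sketch of one possible combination of these parts (multiplication, adder tree, activation), assuming numpy vectors; it models the pipeline stages as function composition, not the hardware:

```python
import numpy as np

def adder_tree(v):
    # second part: add the vector step by step, pairwise, as an adder tree would
    while len(v) > 1:
        half = len(v) // 2
        v = np.concatenate([v[:half] + v[half:2 * half], v[2 * half:]])
    return v[0]

def pipeline(in1, in2):
    product = in1 * in2            # first part: out = in1 * in2
    total = adder_tree(product)    # second part: out = in1[1] + ... + in1[N]
    return np.tanh(total)          # third part: activation (tanh chosen arbitrarily)

print(pipeline(np.ones(8), np.arange(8.0)))   # multiply, reduce, activate
```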
It should be noted that the first input data is long-bit-width non-fixed-point data, e.g. 32-bit floating-point data, but may also be standard 64-bit or 16-bit floating-point data; 32 bits is used here only as a specific example. The second input data is short-bit-width fixed-point data, also called low-bit-width fixed-point data, i.e. fixed-point data represented with fewer bits than the long-bit-width non-fixed-point first input data.
In one possible embodiment, the first input data is non-fixed point data, the second input data is fixed point data, and the number of bits occupied by the first input data is greater than or equal to the number of bits occupied by the second input data. For example, the first input data is 32-bit floating point data, and the second input data is 32-bit fixed point data; for another example, the first input data is 32-bit floating point data, and the second input data is 16-bit fixed point data.
In particular, the first input data comprises different types of data for different layers of different network models. The decimal point positions of the different types of data are different, namely the accuracy of the corresponding fixed point data is different. For a fully connected layer, the first input data comprises data such as input neurons, weights, bias data and the like; in the case of convolutional layers, the first input data includes data such as convolutional kernels, input neurons, and offset data.
For example, for a fully connected layer, the decimal point locations include the decimal point locations of the input neurons, the decimal point locations of the weights, and the decimal point locations of the offset data. The positions of the decimal points of the input neurons, the positions of the decimal points of the weights and the positions of the decimal points of the offset data can be all the same or partially the same or different from each other.
In a possible embodiment, the controller unit 11 is further configured to: before acquiring first input data and a calculation instruction, determining the position of a decimal point of the first input data and the bit width of fixed point data; the bit width of the fixed point data is the bit width of the first input data converted into the fixed point data;
the operation unit 12 is further configured to initialize a decimal point position of the first input data and adjust the decimal point position of the first input data.
The fixed-point bit width of the first input data is the number of bits occupied by the first input data when represented as fixed-point data; the decimal point position is the number of bits occupied by the fractional part of the first input data when represented as fixed-point data. The decimal point position characterizes the precision of the fixed-point data. See the related description of fig. 2A.
Specifically, the first input data may be data of any type. The first input data a is converted into the second input data â according to the decimal point position s and the fixed-point bit width bitnum as follows:

â = floor(a / 2^s) * 2^s, when neg <= a <= pos;
â = pos, when a > pos;
â = neg, when a < neg;

where pos = (2^(bitnum-1) - 1) * 2^s and neg = -(2^(bitnum-1) - 1) * 2^s.
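A small sketch of this conversion rule, assuming the reconstruction above (clamp a to [neg, pos], then truncate to a multiple of 2^s):

```python
import math

def to_fixed(a, s, bitnum):
    # s: decimal point position, bitnum: fixed-point bit width (assumptions as above)
    step = 2.0 ** s
    pos = (2 ** (bitnum - 1) - 1) * step
    neg = -(2 ** (bitnum - 1) - 1) * step
    if a > pos:
        return pos
    if a < neg:
        return neg
    return math.floor(a / step) * step   # floor(a / 2^s) * 2^s

print(to_fixed(3.1415, s=-8, bitnum=16))   # -> 3.140625
```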
In one embodiment, the input neurons, weights, output neurons, input neuron derivatives, output neuron derivatives, and weight derivatives for convolutional layers and fully-connected layers are all represented using fixed-point data.
Alternatively, the bit width of the fixed-point data used by the input neurons may be 8, 16, 32, 64, or another value; further, it is 8.
Optionally, the bit width of the fixed-point data used by the weights may be 8, 16, 32, 64, or another value; further, it is 8.
Alternatively, the bit width of the fixed-point data used by the input neuron derivatives may be 8, 16, 32, 64, or another value; further, it is 16.
Alternatively, the bit width of the fixed-point data used by the output neuron derivatives may be 8, 16, 32, 64, or another value; further, it is 24.
Alternatively, the bit width of the fixed-point data used by the weight derivatives may be 8, 16, 32, 64, or another value; further, it is 24.
In an embodiment, a plurality of fixed-point representation methods may be adopted for the data a with a larger value among the data participating in the above-mentioned multi-layer network model operation, and refer to the related description of fig. 2B specifically.
Specifically, the first input data may be data of any type, and the first input data a is converted into the second input data â according to the decimal point position and the fixed-point bit width: when the first input data a satisfies neg <= a <= pos, the second input data â is represented by a combination of multiple fixed-point data as described with reference to fig. 2B; when a > pos, â is pos; when a < neg, â is neg.
Further, the arithmetic unit 12 initializes the decimal point position of the first input data by:
initializing the decimal point position of the first input data according to the maximum absolute value of the first input data; or
initializing the decimal point position of the first input data according to the minimum absolute value of the first input data; or
initializing the decimal point position of the first input data according to the relationship between different data types in the first input data; or
initializing the decimal point position of the first input data according to an empirical constant.
Specifically, the decimal point position s needs to be initialized and dynamically adjusted according to the category of the data, the neural network layer to which the data belongs, and the iteration round.
The initialization of the decimal point position s of the first input data is described below, that is, determining the decimal point position s used by the fixed-point data when the first input data is converted for the first time.
The initialization unit 1211 initializes the decimal point position s of the first input data according to one of the following: the maximum absolute value of the first input data; the minimum absolute value of the first input data; the relationship between different data types in the first input data; or an empirical constant.
Each of these initialization processes is described below.
a) The arithmetic unit 12 initializes the decimal point position s of the first input data according to the maximum absolute value of the first input data.

Specifically, the arithmetic unit 12 initializes the decimal point position s_a of the first input data by:

s_a = ceil(log2(a_max / (2^(bitnum-1) - 1)))

where a_max is the maximum absolute value of the first input data, bitnum is the bit width of the fixed-point data into which the first input data is converted, and s_a is the decimal point position of the first input data.
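A one-line sketch of this initialization, under the reconstructed formula above (a_max is assumed positive):

```python
import math

def init_point_by_max(a_max, bitnum):
    # choose s so that a_max still fits in (2^(bitnum-1) - 1) * 2^s
    return math.ceil(math.log2(a_max / (2 ** (bitnum - 1) - 1)))

print(init_point_by_max(6.2, bitnum=8))   # -> -4, since 127 * 2^-4 = 7.9375 >= 6.2
```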
The data participating in the operation can be divided, by category and network layer, into: the input neurons X^(l), output neurons Y^(l), weights W^(l), input neuron derivatives ∇X^(l), output neuron derivatives ∇Y^(l), and weight derivatives ∇W^(l) of the l-th layer. When searching for the maximum absolute value, the search may be performed by data category; by layer and category; or by layer, category, and group. The maximum absolute value of the first input data is determined as follows:
a.1) The arithmetic unit 12 finds the maximum absolute value by data category.

Specifically, the first input data comprises vectors/matrices whose elements are a_i^(l), where a^(l) may be the input neurons X^(l), output neurons Y^(l), weights W^(l), input neuron derivatives ∇X^(l), output neuron derivatives ∇Y^(l), or weight derivatives ∇W^(l). In other words, the first input data comprises input neurons, weights, output neurons, input neuron derivatives, weight derivatives, and output neuron derivatives, and the decimal point positions of the first input data comprise the decimal point position of each of these data categories; all of them are represented in matrix or vector form. The arithmetic unit 12 obtains the maximum absolute value of each category of data, a_max = max over all layers l and elements i of |a_i^(l)|, by traversing all elements in the vectors/matrices of every layer of the multi-layer network model, and determines the decimal point position s_a for converting each category of data a into fixed-point data by the formula s_a = ceil(log2(a_max / (2^(bitnum-1) - 1))).
a.2) The arithmetic unit 12 finds the maximum absolute value by layer and by data category.

Specifically, each element of the first input data vectors/matrices is a_i^(l), where a^(l) may be the input neurons X^(l), output neurons Y^(l), weights W^(l), input neuron derivatives ∇X^(l), output neuron derivatives ∇Y^(l), or weight derivatives ∇W^(l). In other words, each layer of the multi-layer network model comprises input neurons, weights, output neurons, input neuron derivatives, weight derivatives, and output neuron derivatives, and the decimal point positions of the first input data comprise the decimal point position of each of these data categories; all of them are represented as matrices/vectors. The arithmetic unit 12 obtains the maximum absolute value of each category of data in each layer, a_max^(l) = max over elements i of |a_i^(l)|, by traversing all elements in the vectors/matrices of each category of data of each layer of the multi-layer network model, and determines the decimal point position of each category of data a in the l-th layer by s_a^(l) = ceil(log2(a_max^(l) / (2^(bitnum-1) - 1))).
a.3) The arithmetic unit 12 finds the maximum absolute value by layer, by data category, and by group.

Specifically, each element of the first input data vectors/matrices is a_i^(l), where a^(l) may be the input neurons X^(l), output neurons Y^(l), weights W^(l), input neuron derivatives ∇X^(l), output neuron derivatives ∇Y^(l), or weight derivatives ∇W^(l). In other words, the data categories of each layer of the multi-layer network model include input neurons, weights, output neurons, input neuron derivatives, weight derivatives, and output neuron derivatives. The arithmetic unit 12 divides each category of data of each layer of the multi-layer network model into g groups, or groups it by any other grouping rule. It then traverses each element of every group of the g groups of data corresponding to each data category in each layer, obtains the element with the largest absolute value in that group, a_max^(l,g), and determines the decimal point position of each of the g groups corresponding to each data category in each layer by s_a^(l,g) = ceil(log2(a_max^(l,g) / (2^(bitnum-1) - 1))).

The arbitrary grouping rules include, but are not limited to, grouping by data range, grouping by data training batch, and the like.
b) The arithmetic unit 12 initializes the decimal point position s of the first input data according to the minimum absolute value of the first input data.

Specifically, the arithmetic unit 12 finds the minimum absolute value a_min of the data to be quantized and determines the fixed-point precision s by s_a = floor(log2(a_min)), where a_min is the minimum absolute value of the first input data; for obtaining a_min, see steps a.1), a.2), and a.3) above.
c) The arithmetic unit 12 initializes the decimal point position s of the first input data according to the relationship between different data types in the first input data.

Specifically, the decimal point position s_a^(l) of data type a^(l) of any layer (such as the l-th layer) of the multi-layer network model is determined by the arithmetic unit 12 from the decimal point position s_b^(l) of data type b^(l) of the l-th layer according to the formula s_a^(l) = α_b * s_b^(l) + β_b.

Here a^(l) and b^(l) may each be the input neurons X^(l), output neurons Y^(l), weights W^(l), input neuron derivatives ∇X^(l), output neuron derivatives ∇Y^(l), or weight derivatives ∇W^(l), and α_b and β_b are integer constants.
d) The arithmetic unit 12 initializes the decimal point position s of the first input data according to an empirical constant.

Specifically, the decimal point position s_a^(l) of data type a^(l) of any layer (such as the l-th layer) of the multi-layer network model can be set manually as s_a^(l) = c, where c is an integer constant, and a^(l) may be the input neurons X^(l), output neurons Y^(l), weights W^(l), input neuron derivatives ∇X^(l), output neuron derivatives ∇Y^(l), or weight derivatives ∇W^(l).
Furthermore, the initialization value of the decimal point position of the input neurons and of the output neurons can be selected in the range [-8, 8]; that of the weights in the range [-17, 8]; that of the input neuron derivatives and of the output neuron derivatives in the range [-40, -20]; and that of the weight derivatives in the range [-48, -12].
The method for dynamically adjusting the decimal point position s by the arithmetic unit 12 will be described in detail below.
The method by which the arithmetic unit 12 dynamically adjusts the decimal point position s includes adjusting s upward (s increases) and adjusting s downward (s decreases), specifically: single-step upward adjustment according to the maximum absolute value of the first input data; stepwise upward adjustment according to the maximum absolute value of the first input data; single-step upward adjustment according to the first input data distribution; stepwise upward adjustment according to the first input data distribution; and downward adjustment according to the maximum absolute value of the first input data.
a) The arithmetic unit 12 adjusts in a single step upward according to the maximum absolute value of the data in the first input data:

Assume the decimal point position before adjustment is s_old; the fixed-point data corresponding to s_old can represent the data range [neg, pos], where pos = (2^(bitnum-1) - 1) * 2^(s_old) and neg = -(2^(bitnum-1) - 1) * 2^(s_old). When the maximum absolute value a_max of the data in the first input data satisfies a_max >= pos, the adjusted decimal point position is s_new = ceil(log2(a_max / (2^(bitnum-1) - 1))); otherwise the decimal point position is not adjusted, i.e. s_new = s_old.
b) The arithmetic unit 12 adjusts stepwise upward according to the maximum absolute value of the data in the first input data:

Assume the decimal point position before adjustment is s_old; the fixed-point data corresponding to s_old can represent the data range [neg, pos], where pos = (2^(bitnum-1) - 1) * 2^(s_old) and neg = -(2^(bitnum-1) - 1) * 2^(s_old). When the maximum absolute value a_max of the data in the first input data satisfies a_max >= pos, the adjusted decimal point position is s_new = s_old + 1; otherwise the decimal point position is not adjusted, i.e. s_new = s_old.
c) The arithmetic unit 12 adjusts in a single step upward according to the first input data distribution:

Assume the decimal point position before adjustment is s_old; the fixed-point data corresponding to s_old can represent the data range [neg, pos], where pos = (2^(bitnum-1) - 1) * 2^(s_old) and neg = -(2^(bitnum-1) - 1) * 2^(s_old). Statistics of the absolute values of the first input data are computed, such as the mean a_mean and the standard deviation a_std of the absolute values, and the maximum data range is set to a_max = a_mean + n * a_std. When a_max >= pos, s_new = ceil(log2(a_max / (2^(bitnum-1) - 1))); otherwise the decimal point position is not adjusted, i.e. s_new = s_old.

Further, n may be 2 or 3.
d) The arithmetic unit 12 adjusts stepwise upward according to the first input data distribution:

Assume the decimal point position before adjustment is s_old; the fixed-point data corresponding to s_old can represent the data range [neg, pos], where pos = (2^(bitnum-1) - 1) * 2^(s_old) and neg = -(2^(bitnum-1) - 1) * 2^(s_old). Statistics of the absolute values of the first input data are computed, such as the mean a_mean and the standard deviation a_std of the absolute values, and the maximum data range is set to a_max = a_mean + n * a_std, where n may be 3. When a_max >= pos, s_new = s_old + 1; otherwise the decimal point position is not adjusted, i.e. s_new = s_old.
e) The arithmetic unit 12 adjusts downward according to the maximum absolute value of the first input data:

Assume the decimal point position before adjustment is s_old; the fixed-point data corresponding to s_old can represent the data range [neg, pos], where pos = (2^(bitnum-1) - 1) * 2^(s_old) and neg = -(2^(bitnum-1) - 1) * 2^(s_old). When the maximum absolute value a_max of the first input data satisfies a_max < 2^(s_old + bitnum - n) and s_old >= s_min, then s_new = s_old - 1, where n is an integer constant and s_min is either an integer or negative infinity.

Further, n is 3 and s_min is -64.
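The five strategies a) through e) can be summarized in one hedged sketch; the mode names and function shape are illustrative assumptions, and a_max stands for either the true maximum absolute value (strategies a, b, e) or the mean-plus-n-standard-deviations estimate (strategies c, d) described above:

```python
import math

def adjust_point(s_old, a_max, bitnum, mode, n=3, s_min=-64):
    pos = (2 ** (bitnum - 1) - 1) * 2.0 ** s_old
    if mode in ("single_step_up", "dist_single_step_up"):     # a), c)
        if a_max >= pos:
            return math.ceil(math.log2(a_max / (2 ** (bitnum - 1) - 1)))
    elif mode in ("gradual_up", "dist_gradual_up"):           # b), d)
        if a_max >= pos:
            return s_old + 1
    elif mode == "down":                                      # e)
        if a_max < 2.0 ** (s_old + bitnum - n) and s_old >= s_min:
            return s_old - 1
    return s_old   # otherwise the decimal point position is unchanged
```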
Alternatively, regarding the frequency of adjusting the decimal point position: the decimal point position of the first input data may never be adjusted; or it may be adjusted once every n first training periods (i.e. iterations), where n is a constant; or once every n second training periods (i.e. epochs), where n is a constant; or it may be adjusted once every n first or second training periods, with n then updated to alpha * n, where alpha is greater than 1; or it may be adjusted once every n first or second training periods, with n gradually decreasing as the number of training rounds increases.
Further, the positions of the decimal points of the input neurons, the positions of the decimal points of the weights and the positions of the decimal points of the output neurons are adjusted every 100 first training periods. The positions of the decimal points of the input neuron derivatives and the positions of the decimal points of the output neuron derivatives are adjusted every 20 first training periods.
It should be noted that the first training period is the time required for training a batch of samples, and the second training period is the time required for performing one training on all training samples.
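As a sketch of the adjustment-frequency policy (the class and its names are assumptions, not from the source), the variant in which the interval n grows by a factor alpha after each adjustment might look like:

```python
class AdjustSchedule:
    # adjust every n first training periods (iterations); after each
    # adjustment, stretch the interval by a factor alpha (alpha > 1).
    def __init__(self, n=100, alpha=1.0):
        self.n, self.alpha, self.next_at = n, alpha, n

    def should_adjust(self, iteration):
        if iteration >= self.next_at:
            self.n = int(self.n * self.alpha)
            self.next_at = iteration + self.n
            return True
        return False
```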
In a possible embodiment, after the controller unit 11 or the arithmetic unit 12 obtains the decimal point position of the first input data according to the above process, the decimal point position of the first input data is stored in the buffer 202 of the storage unit 10.
When the calculation instruction is an immediate addressing instruction, the main processing circuit 101 directly converts the first input data into the second input data according to the decimal point position indicated by the operation field of the calculation instruction; when the calculation instruction is a direct or indirect addressing instruction, the main processing circuit 101 obtains the decimal point position of the first input data from the storage space indicated by the operation field of the calculation instruction, and then converts the first input data into the second input data according to that decimal point position.
During the operation, an operation result (including intermediate operation results and the result of the calculation instruction) obtained by performing addition, multiplication, and/or other operations on the second input data may exceed the precision range of the current fixed-point data. The computing apparatus therefore may further include a rounding unit, and the intermediate operation results are buffered. After the operation is finished, the rounding unit performs a rounding operation on any operation result that exceeds the precision range of the fixed-point data, and the data conversion unit then converts the rounded operation result into data of the current fixed-point data type.
Specifically, the rounding unit performs a rounding operation on the intermediate operation result, which is any one of: random rounding, rounding to the nearest, rounding up, rounding down, and truncation rounding.
When the rounding unit performs the random rounding operation, it operates as follows:

y = floor(x) with probability 1 - (x - floor(x)) / epsilon;
y = floor(x) + epsilon with probability (x - floor(x)) / epsilon;

where y is the operation result after randomly rounding the pre-rounding operation result x; epsilon is the smallest positive number representable by the current fixed-point data format, i.e. 2^(-Point Location); and floor(x) denotes the largest integer multiple of epsilon less than or equal to x, i.e. the result of directly truncating x to fixed-point data (similar to rounding a decimal down). In other words, random rounding yields the directly truncated value floor(x) with probability 1 - (x - floor(x)) / epsilon, and yields floor(x) + epsilon with probability (x - floor(x)) / epsilon.
When the rounding unit performs the rounding-to-nearest operation, it operates as follows:

y = floor(x), when floor(x) <= x <= floor(x) + epsilon / 2;
y = floor(x) + epsilon, when floor(x) + epsilon / 2 < x <= floor(x) + epsilon;

where y is the operation result after rounding the pre-rounding operation result x to the nearest; epsilon is the smallest positive number representable by the current fixed-point data format, i.e. 2^(-Point Location); and floor(x) is the largest integer multiple of epsilon less than or equal to x.
When the rounding unit performs the rounding-up operation, it operates as follows:

y = ceil(x);

where y is the operation result after rounding up the pre-rounding operation result x; ceil(x) is the smallest integer multiple of epsilon greater than or equal to x; and epsilon is the smallest positive number representable by the current fixed-point data format, i.e. 2^(-Point Location).
When the rounding unit performs the rounding-down operation, it operates as follows:

y = floor(x);

where y is the operation result after rounding down the pre-rounding operation result x; floor(x) is the largest integer multiple of epsilon less than or equal to x; and epsilon is the smallest positive number representable by the current fixed-point data format, i.e. 2^(-Point Location).
When the rounding unit performs the truncation rounding operation, it operates as follows:

y = [x]

where y is the operation result after truncating the pre-rounding operation result x, and [x] denotes directly truncating the operation result x to fixed-point data.
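A hedged sketch covering the five rounding modes, following the reconstructed formulas above (epsilon = 2^(-Point Location); floor(x) realized as floor(x / eps) * eps):

```python
import math
import random

def round_fixed(x, point_location, mode):
    eps = 2.0 ** (-point_location)          # smallest representable positive number
    lo = math.floor(x / eps) * eps          # floor(x): largest multiple of eps <= x
    if mode == "random":                    # y = lo with probability 1 - (x - lo)/eps
        return lo if random.random() >= (x - lo) / eps else lo + eps
    if mode == "nearest":                   # y = lo if within eps/2, else lo + eps
        return lo if x - lo <= eps / 2 else lo + eps
    if mode == "up":                        # smallest multiple of eps >= x
        return math.ceil(x / eps) * eps
    if mode == "down":                      # largest multiple of eps <= x
        return lo
    if mode == "truncate":                  # drop the bits beyond the format
        return math.trunc(x / eps) * eps
    raise ValueError(mode)
```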
When the rounding unit obtains the rounded intermediate operation result, the operation unit 12 converts the rounded intermediate operation result into data of the current fixed point data type according to the position of the decimal point of the first input data.
In a possible embodiment, the arithmetic unit 12 does not perform truncation processing on the intermediate result of which the data type is floating point data in the one or more intermediate results.
Conventionally, the intermediate results produced by the processing circuits 102 of the operation unit 12 are truncated, because the intermediate results of multiplication, division, and similar operations exceed the storage range of the memory. In the present method, however, the intermediate results generated during the operation are not stored to memory, so intermediate results exceeding the memory storage range need not be truncated; this greatly reduces the precision loss of the intermediate results and improves the precision of the calculation result.
In a possible embodiment, the arithmetic unit 12 further includes a derivation unit, when the arithmetic unit 12 receives the decimal point position of the input data participating in the fixed-point operation, the derivation unit derives the decimal point position of the one or more intermediate results obtained in the process of performing the fixed-point operation according to the decimal point position of the input data participating in the fixed-point operation. When the intermediate result obtained by the operation of the operation subunit exceeds the range indicated by the decimal point position corresponding to the intermediate result, the derivation unit shifts the decimal point position of the intermediate result to the left by M bits, so that the precision of the intermediate result is within the precision range indicated by the decimal point position of the intermediate result, and M is an integer greater than 0.
For example, suppose the first input data includes input data I1 and input data I2 with corresponding decimal point positions P1 and P2, where P1 > P2. When the operation type indicated by the operation instruction is addition or subtraction, i.e. the operation subunit performs I1 + I2 or I1 - I2, the derivation unit derives the decimal point position of the intermediate result of the operation indicated by the operation instruction as P1; when the operation type indicated by the operation instruction is multiplication, i.e. the operation subunit performs I1 * I2, the derivation unit derives the decimal point position of the intermediate result as P1 + P2.
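A minimal sketch of this derivation rule (assuming the convention value = integer * 2^P, under which addition keeps the larger position and multiplication adds the positions):

```python
def derive_point(p1, p2, op):
    # decimal point position of the intermediate result, per the rule above
    if op in ("add", "sub"):
        return max(p1, p2)    # with P1 > P2 this is P1
    if op == "mul":
        return p1 + p2
    raise ValueError(op)
```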
In a possible embodiment, the arithmetic unit 12 further includes:
and the data caching unit is used for caching the one or more intermediate results.
In an optional embodiment, the computing apparatus further includes a data statistics unit, configured to perform statistics on input data of the same type in each layer of the multi-layer network model to obtain a position of a decimal point of each type of input data in each layer.
The data statistics unit may be part of an external device; the computing device may then acquire the decimal point positions of the data participating in the operation from the external device before performing data conversion.
Specifically, the data statistic unit includes:
the acquisition subunit is used for extracting input data of the same type in each layer of the multilayer network model;
the statistical subunit is used for counting and acquiring the distribution proportion of the input data of the same type in each layer of the multilayer network model in a preset interval;
and the analysis subunit is used for acquiring the decimal point position of the input data of the same type in each layer of the multilayer network model according to the distribution proportion.
Wherein the predetermined interval can be
Figure BDA0001995401910000251
i is 0,1,2, …, n, n is a preset positive integer, X is the number of bits occupied by the fixed point data. The above-mentioned preset interval
Figure BDA0001995401910000252
Comprising n +1 subintervals. The statistical subunit counts distribution information of the input data of the same type in each layer of the multi-layer network model in the n +1 subintervals, and acquires the first distribution proportion according to the distribution information. The first distribution ratio is p0,p1,p2,…,pnAnd the n +1 numerical values are distribution ratios of the input data of the same type in each layer of the multilayer network model on the n +1 subintervals. The analysis subunit presets an overflow rate EPL, which takes the largest i from 0,1,2, …, n, so that p isiAnd the maximum i is the decimal point position of the input data of the same type in each layer of the multilayer network model. In other words, the analysis subunit takes the decimal point position of the same type of input data in each layer of the multilayer network model as: max { i/pi≧ 1-EPL, i ∈ {0,1,2, …, n } }, i.e., p satisfying greater than or equal to 1-EPLiIn the method, the maximum subscript value i is selected as the decimal point position of the input data of the same type in each layer of the multilayer network model.
Here p_i is the ratio of the number of values of the input data of the same type in each layer of the multi-layer network model that fall in the interval [-2^(X-1-i), 2^(X-1-i) - 2^(-i)] to the total number of input data of the same type in each layer of the multi-layer network model. For example, if m2 of the m1 values of the input data of the same type in each layer of the multi-layer network model fall in that interval, then p_i = m2/m1.
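For illustration only, the statistics flow above can be sketched in Python as follows, assuming the subinterval form given above; this is one reading of the method, not the patented circuit:

```python
import numpy as np

def decimal_point_by_overflow_rate(data, X=8, n=16, epl=0.01):
    """Return the largest i such that at least (1 - EPL) of the data
    falls in [-2**(X-1-i), 2**(X-1-i) - 2**-i]."""
    data = np.asarray(data, dtype=np.float64)
    best_i = None
    for i in range(n + 1):
        lo = -2.0 ** (X - 1 - i)
        hi = 2.0 ** (X - 1 - i) - 2.0 ** (-i)
        p_i = np.mean((data >= lo) & (data <= hi))
        if p_i >= 1.0 - epl:
            best_i = i  # the intervals shrink as i grows, so keep the largest passing i
    return best_i
```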
In a feasible embodiment, in order to improve the operation efficiency, the obtaining subunit extracts part of the data of the same type in each layer of the multi-layer network model, randomly or by sampling. The decimal point position of that partial data is then obtained by the method above, and data conversion (including conversion from floating-point data to fixed-point data, conversion between fixed-point formats, and the like) is performed on that type of input data according to the decimal point position of the partial data, which improves calculation speed and efficiency while preserving precision.
Optionally, the data statistics unit may determine bit width and decimal point position of the same type of data or the same layer of data according to the median of the same type of data or the same layer of data, or determine bit width and decimal point position of the same type of data or the same layer of data according to the average of the same type of data or the same layer of data.
Optionally, when the intermediate result obtained by the arithmetic unit according to the arithmetic on the data of the same type or the data of the same layer exceeds the value range corresponding to the decimal point position and the bit width of the data of the same type or the data of the same layer, the arithmetic unit does not perform truncation processing on the intermediate result, and caches the intermediate result in the data caching unit of the arithmetic unit for use in subsequent arithmetic.
Specifically, the operation field includes a decimal point position of the input data and a conversion mode identifier of the data type. The instruction processing unit analyzes the data conversion instruction to obtain the decimal point position of the input data and the conversion mode identifier of the data type. The processing unit further comprises a data conversion unit which converts the first input data into second input data according to the decimal point position of the input data and the conversion mode identification of the data type.
It should be noted that the network model includes multiple layers, such as fully connected layers, convolutional layers, pooling layers, and input layers. Among the at least one input data, input data belonging to the same layer have the same decimal point position; that is, input data of the same layer share one decimal point position.

The input data include different types of data, for example input neurons, weights, and bias data. Input data belonging to the same type have the same decimal point position; that is, input data of the same type share one decimal point position.
For example, if the operation type indicated by the operation instruction is a fixed-point operation and the input data participating in that operation is floating-point data, the data conversion unit converts the input data from floating-point data to fixed-point data before the fixed-point operation is performed; if the operation type indicated by the operation instruction is a floating-point operation and the input data participating in that operation is fixed-point data, the data conversion unit converts the input data corresponding to the operation instruction from fixed-point data to floating-point data before the floating-point operation is performed.
For macro instructions (such as a calculation instruction and a data conversion instruction) related to the present application, the controller unit 11 may parse the macro instruction to obtain an operation field and an operation code of the macro instruction; generating a micro instruction corresponding to the macro instruction according to the operation domain and the operation code; alternatively, the controller unit 11 decodes the macro instruction to obtain the micro instruction corresponding to the macro instruction.
In one possible embodiment, a System On Chip (SOC) includes a main processor including the computing device and a coprocessor. The coprocessor acquires the decimal point position of the input data of the same type in each layer of the multilayer network model according to the method, and transmits the decimal point position of the input data of the same type in each layer of the multilayer network model to the computing device, or the computing device acquires the decimal point position of the input data of the same type in each layer of the multilayer network model from the coprocessor when the decimal point position of the input data of the same type in each layer of the multilayer network model needs to be used.
In a possible embodiment, the first input data is non-fixed point data, and the non-fixed point data includes long-bit floating point data, short-bit floating point data, integer data, discrete data, and the like.
The data types within the first input data may differ from each other. For example, the input neurons, weights and bias data may all be floating-point data; some of the input neurons, weights and bias data may be floating-point data while the rest are integer data; or the input neurons, weights and bias data may all be integer data. The computing device can realize conversion from non-fixed-point data to fixed-point data, that is, conversion from types such as long-bit floating-point data, short-bit floating-point data, integer data and discrete data into fixed-point data. The fixed-point data may be signed or unsigned.
In a possible embodiment, the first input data and the second input data are fixed-point data, and the first input data and the second input data may be both signed fixed-point data, or both unsigned fixed-point data, or one of them is unsigned fixed-point data and the other is signed fixed-point data. And the position of the decimal point of the first input data is different from the position of the decimal point of the second input data.
In one possible embodiment, the first input data is fixed-point data, and the second input data is non-fixed-point data. In other words, the above-described computing device can implement conversion of fixed-point data into non-fixed-point data.
Fig. 4 is a flowchart of a forward operation of a single-layer neural network according to an embodiment of the present invention. The flowchart describes the process of a single-layer neural network forward operation implemented using the computing device and instruction set of the present invention. For each layer, the input neuron vector is first weighted and summed to calculate an intermediate result vector of the layer. The intermediate result vector is then biased and activated to obtain an output neuron vector, which serves as the input neuron vector of the next layer.
In a specific application scenario, the computing device may be a training device. Before the neural network model training, the training device acquires training data participating in the neural network model training, wherein the training data is non-fixed point data, and the position of a decimal point of the training data is acquired according to the method. The training device converts the training data into training data expressed by fixed point data according to the decimal point position of the training data. The training device performs a forward neural network operation based on the training data expressed by the fixed-point data to obtain a neural network operation result. The training device performs random rounding operation on the neural network operation result which exceeds the data precision range represented by the decimal point position of the training data to obtain the rounded neural network operation result, and the neural network operation result is positioned in the data precision range represented by the decimal point position of the training data. According to the method, the training device obtains the neural network operation result of each layer of the multilayer neural network, namely the output neuron. The training device obtains the gradient of the output neuron according to each layer of output neuron, and carries out inverse operation according to the gradient of the output neuron to obtain the weight gradient, thereby updating the weight of the neural network model according to the weight gradient.
The training device repeatedly executes the process to achieve the purpose of training the neural network model.
It should be noted that, before performing the forward operation and the backward training, the computing device may perform data conversion on the data participating in the forward operation but not on the data participating in the backward training; or perform data conversion on the data participating in the backward training but not on the data participating in the forward operation; or perform data conversion on both the data participating in the forward operation and the data participating in the backward training. The specific data conversion process is described in the related embodiments above and is not repeated here.
The forward operation includes the multilayer neural network operation, the multilayer neural network operation includes operations such as convolution, and the convolution operation is implemented by a convolution operation instruction.
The convolution operation instruction is an instruction in the Cambricon instruction set. The Cambricon instruction set is characterized in that each instruction is composed of an operation code and operands, and the instruction set includes four types of instructions: control instructions, data transmission instructions, operation instructions (computational instructions) and logic instructions (logical instructions).
Preferably, each instruction in the instruction set has a fixed length. For example, each instruction in the instruction set may be 64 bits long.
Further, the control instructions are used for controlling the execution process. The control instructions include jump (jump) instructions and conditional branch (conditional branch) instructions.
Further, the data transmission instructions are used for completing data transmission between different storage media. The data transmission instructions include load instructions, store instructions and move instructions. The load instruction loads data from main memory into the cache, the store instruction stores data from the cache into main memory, and the move instruction moves data between caches, between a cache and a register, or between registers. The data transfer instructions support three data organization modes: matrices, vectors and scalars.
Further, the arithmetic instruction is used for completing the neural network arithmetic operation. The operation instructions include a matrix operation instruction, a vector operation instruction, and a scalar operation instruction.
Further, the matrix operation instructions perform matrix operations in the neural network, including matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, and matrix subtract matrix.
Further, the vector operation instructions perform vector operations in the neural network, including vector elementary operations, vector transcendental functions, dot products, random vector generation, and maximum/minimum of a vector. Vector elementary operations include vector addition, subtraction, multiplication and division; a vector transcendental function is a function that does not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential functions, logarithmic functions, trigonometric functions and inverse trigonometric functions.
Further, the scalar operation instructions perform scalar operations in the neural network, including scalar elementary operations and scalar transcendental functions. Scalar elementary operations include scalar addition, subtraction, multiplication and division; a scalar transcendental function is a function that does not satisfy any polynomial equation with polynomial coefficients, including but not limited to exponential functions, logarithmic functions, trigonometric functions and inverse trigonometric functions.
Further, the logic instructions are used for logic operations of the neural network. The logic instructions include vector logic operation instructions and scalar logic operation instructions.
Further, the vector logic operation instructions include vector compare, vector logical operations, and vector greater-than merge. Vector comparisons include, but are not limited to, greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to. Vector logical operations include and, or, not.
Further, the scalar logic operation instructions include scalar compare and scalar logical operations. Scalar comparisons include, but are not limited to, greater than, less than, equal to, greater than or equal to, less than or equal to, and not equal to. Scalar logical operations include and, or, not.
For the multilayer neural network, the implementation process is as follows. In the forward operation, after the artificial neural network of the previous layer has finished executing, the operation instruction of the next layer takes the output neuron calculated in the operation unit as the input neuron of the next layer for operation (or performs some operation on that output neuron before using it as the input neuron of the next layer), and at the same time the weight is replaced by the weight of the next layer. In the reverse operation, after the reverse operation of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer for operation (or performs some operation on that input neuron gradient before using it as the output neuron gradient of the next layer), and at the same time the weight is replaced by the weight of the next layer. As shown in fig. 5, the dashed arrows in fig. 5 indicate the backward operation, and the solid arrows indicate the forward operation.
In another embodiment, the operation instruction is a matrix-multiply-matrix instruction, an accumulation instruction, an activation instruction, or another calculation instruction, including forward operation instructions and backward training instructions.
The following describes a specific calculation method of the calculation apparatus shown in fig. 3A through a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be s = s(Σ w * x_i + b), that is, the weights w are multiplied by the input data x_i, the products are summed, the bias b is added, and the activation operation s(h) is performed to obtain the final output result s.
The method for executing the neural network forward operation instruction by the computing device shown in fig. 3A may specifically be:
after the conversion unit 13 performs data type conversion on the first input data, the controller unit 11 extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction, and at least one operation code from the instruction cache unit 110, and the controller unit 11 transmits the operation domain to the data access unit and sends the at least one operation code to the operation unit 12.
The controller unit 11 extracts the weight w and the offset b corresponding to the operation field from the storage unit 10 (when b is 0, the offset b does not need to be extracted), transmits the weight w and the offset b to the main processing circuit 101 of the arithmetic unit, and the controller unit 11 extracts the input data Xi from the storage unit 10 and transmits the input data Xi to the main processing circuit 101.
The main processing circuit 101 splits the input data Xi into n data blocks;
the instruction processing unit 111 of the controller unit 11 determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one opcode, sends the multiplication instruction, the offset instruction and the accumulation instruction to the master processing circuit 101, the master processing circuit 101 sends the multiplication instruction and the weight w to the plurality of slave processing circuits 102 in a broadcast manner, and distributes the n data blocks to the plurality of slave processing circuits 102 (for example, there are n slave processing circuits 102, and then each slave processing circuit 102 sends one data block); the plurality of slave processing circuits 102 are configured to perform a multiplication operation on the weight w and the received data block according to the multiplication instruction to obtain an intermediate result, and send the intermediate result to the master processing circuit 101, the master processing circuit 101 performs an accumulation operation on the intermediate result sent by the plurality of slave processing circuits 102 according to the accumulation instruction to obtain an accumulation result, performs an offset operation b on the accumulation result according to the offset instruction to obtain a final result, and sends the final result to the controller unit 11.
In addition, the order of addition and multiplication may be reversed.
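The instruction sequence above amounts to a blocked multiply-accumulate followed by a bias. A toy Python sketch of the dataflow (scalar weight for simplicity; the splitting, naming and block scheme are illustrative, not the actual circuit protocol):

```python
def forward_mac(x, w, b, n_slaves=4):
    """Master splits x into blocks; each 'slave' multiplies its block
    by the broadcast weight; the master accumulates and adds the bias."""
    blocks = [x[i::n_slaves] for i in range(n_slaves)]        # master distributes n data blocks
    partials = [sum(w * xi for xi in blk) for blk in blocks]  # slaves: multiplication instruction
    acc = sum(partials)                                       # master: accumulation instruction
    return acc + b                                            # master: offset instruction
```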
It should be noted that, the method for executing the neural network reverse training instruction by the computing apparatus is similar to the process for executing the neural network forward operation instruction by the computing apparatus, and specific reference may be made to the above description of the reverse training, and no description is given here.
According to the technical scheme, the multiplication and offset operations of the neural network are achieved through a single instruction, the neural network operation instruction, so that intermediate results of the neural network calculation do not need to be stored to or fetched from memory, reducing the storing and fetching of intermediate data. The scheme therefore has the advantages of reducing the corresponding operation steps and improving the computational efficiency of the neural network.
The application also discloses a machine learning operation device, which includes one or more of the computing devices mentioned in this application and is used for acquiring data to be operated on and control information from other processing devices, executing specified machine learning operations, and transmitting the execution results to peripheral equipment through an I/O interface. Peripheral equipment includes cameras, displays, mice, keyboards, network cards, wifi interfaces and servers. When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, to support larger-scale machine learning operations. In this case, the devices may share the same control system or have separate control systems, and may share memory or have separate memories for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The application also discloses a combined processing device which comprises the machine learning arithmetic device, the universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user. Fig. 6 is a schematic view of a combined treatment apparatus.
Other processing devices include one or more of general purpose/special purpose processors such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), machine learning processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices are used as interfaces of the machine learning arithmetic device and external data and control, and comprise data transportation to finish basic control of starting, stopping and the like of the machine learning arithmetic device; other processing devices may cooperate with the machine learning computing device to perform computing tasks.
And the universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; control instructions can be obtained from other processing devices and written into a control cache on a machine learning arithmetic device chip; the data in the storage module of the machine learning arithmetic device can also be read and transmitted to other processing devices.
Alternatively, as shown in fig. 7, the configuration may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing device, respectively. The storage device is used for storing data in the machine learning arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device can be used as an SOC system-on-chip for equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card or wifi interface.
In one possible embodiment, a distributed system is also claimed, the system comprising n1 host processors and n2 coprocessors, n1 being an integer greater than or equal to 0 and n2 being an integer greater than or equal to 1. The system may be of various types of topologies including, but not limited to, the topology shown in FIG. 3B, the topology shown in FIG. 3C, the topology shown in FIG. 11, and the topology shown in FIG. 12.
The main processor sends input data, the decimal point position of the input data, and the calculation instruction to each of the plurality of coprocessors; or the main processor sends the input data, the decimal point position of the input data, and the calculation instruction to some of the coprocessors, and those coprocessors forward them to the other coprocessors. The coprocessors include the computing device, which performs operations on the input data according to the method above and the calculation instruction to obtain an operation result;
the input data includes, but is not limited to, input neurons, weight values, bias data, and the like.
The coprocessors send the operation results directly to the main processor; or a coprocessor that is not connected to the main processor first sends its operation result to a coprocessor that is connected to the main processor, and the latter then forwards the received operation result to the main processor.
In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or the combined processing device.
In some embodiments, a chip package structure is provided, which includes the above chip.
In some embodiments, a board card is provided, which includes the above chip package structure.
In some embodiments, an electronic device is provided that includes the above board card. Referring to fig. 8, fig. 8 provides a board card that may include, in addition to the chip 389, other components including but not limited to: a memory device 390, an interface device 391 and a control device 392;
the memory device 390 is connected to the chip in the chip package structure through a bus for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units. Each group of the storage units may include a plurality of DDR4 particles (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 particles are adopted in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s (3200 MT/s × 8 bytes per transfer).
In one embodiment, each group of the memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.
The interface device is electrically connected with a chip in the chip packaging structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface. For example, the data to be processed is transmitted to the chip by the server through the standard PCIE interface, so as to implement data transfer. Preferably, when PCIE 3.0X 16 interface transmission is adopted, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface, and the present application does not limit the concrete expression of the other interface, and the interface unit may implement the switching function. In addition, the calculation result of the chip is still transmitted back to an external device (e.g., a server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include a plurality of processing chips, processing cores, or processing circuits, and may drive a plurality of loads; therefore, the chip can be in different working states such as multi-load and light-load. The control device can regulate the working states of the processing chips, processing cores and/or processing circuits in the chip.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
Referring to fig. 9, fig. 9 is a method for performing machine learning calculation according to an embodiment of the present invention, where the method includes:
s901, the computing device acquires first input data and a computing instruction.
The first input data includes input neurons and weights.
S902, the computing device analyzes the computing instruction to obtain a data conversion instruction and a plurality of operation instructions.
The data conversion instruction comprises an operation field and an operation code, wherein the operation code is used for indicating the function of the data type conversion instruction, and the operation field of the data type conversion instruction comprises a decimal point position, a flag bit used for indicating the data type of the first input data and a conversion mode of the data type.
S903, the computing device converts the first input data into second input data according to the data conversion instruction, wherein the second input data is fixed-point data.
Wherein the converting the first input data into second input data according to the data conversion instruction comprises:
analyzing the data conversion instruction to obtain the decimal point position, the flag bit for indicating the data type of the first input data and the conversion mode of the data type;
determining the data type of the first input data according to the data type zone bit of the first input data;
and converting the first input data into second input data according to the decimal point position and the conversion mode of the data type, wherein the data type of the second input data is inconsistent with the data type of the first input data.
When the first input data and the second input data are fixed point data, the position of the decimal point of the first input data is inconsistent with the position of the decimal point of the second input data.
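A schematic decode of such a data conversion instruction might look as follows; the field encodings (0/1 values) are assumptions made for illustration, and saturation to [neg, pos] is omitted for brevity:

```python
from dataclasses import dataclass
import math

@dataclass
class DataConvertInstr:
    decimal_point: int  # s
    type_flag: int      # assumed: 0 = input is floating point, 1 = input is fixed point
    mode: int           # assumed: 0 = convert to fixed point, 1 = convert to floating point

def execute_convert(instr: DataConvertInstr, data):
    s = instr.decimal_point
    if instr.mode == 0:  # to fixed point: quantize onto the 2**s grid
        return [math.floor(a / 2.0 ** s) * 2.0 ** s for a in data]
    return [v * 2.0 ** s for v in data]  # to floating point: rescale the integer payload
```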
In a possible embodiment, when the first input data is fixed-point data, the method further comprises:
and deducing the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operation according to the first input data.
S904, the computing device performs computation on the second input data according to the plurality of operation instructions to obtain the result of the computation instruction.
The operation instruction includes a forward operation instruction and a reverse training instruction, that is, during the process of executing the forward operation instruction and/or the reverse training instruction (that is, the computing device performs forward operation and/or reverse training), the computing device may convert data participating in the operation into fixed-point data according to the embodiment shown in fig. 9, and perform fixed-point operation.
It should be noted that, the above steps S901-S904 can be described in detail with reference to the related description of the embodiment shown in fig. 1-8, and will not be described here.
In a specific application scenario, the computing device converts the data participating in the operation into fixed-point data and adjusts the decimal point position of the fixed-point data. The specific process is shown in fig. 10, and the method includes:
s1001, the computing device acquires first input data.
The first input data is data participating in the mth layer operation of the multilayer network model, and the first input data is any type of data. For example, the first input data is fixed point data, floating point data, integer data or discrete data, and m is an integer greater than 0.
Wherein the mth layer of the multilayer network model is a linear layer; linear layers include but are not limited to convolutional layers and fully connected layers. The first input data includes input neurons, weights, output neurons, input neuron derivatives, weight derivatives, and output neuron derivatives.
S1002, the computing device determines the decimal point position of the first input data and the bit width of the fixed point data.
The bit width of the fixed-point data of the first input data is the number of bits occupied by the first input data when expressed as fixed-point data, and the decimal point position is the number of bits occupied by the fractional part of the first input data when expressed as fixed-point data. The decimal point position is used to characterize the precision of the fixed-point data. See the related description of fig. 2A.
Specifically, the first input data may be any type of data, and the first input data a is converted into the second input data â according to the decimal point position s and the bit width of the fixed-point data, as follows:

â = ⌊a / 2^s⌋ * 2^s, when neg ≤ a ≤ pos;
â = pos, when a > pos;
â = neg, when a < neg;

where pos = (2^(bitnum-1) - 1) * 2^s and neg = -(2^(bitnum-1) - 1) * 2^s are the upper and lower bounds representable with bit width bitnum and decimal point position s. That is, when the first input data a satisfies neg ≤ a ≤ pos, the second input data â is ⌊a / 2^s⌋ * 2^s; when the first input data a is greater than pos, the second input data â is pos; and when the first input data a is less than neg, the second input data â is neg.
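A direct Python rendering of this conversion (a sketch of the formula, not the hardware datapath):

```python
import math

def to_fixed(a: float, s: int, bitnum: int = 8) -> float:
    """Quantize a with decimal point position s: floor(a / 2**s) * 2**s,
    saturated to the representable range [neg, pos]."""
    pos = (2 ** (bitnum - 1) - 1) * 2.0 ** s
    neg = -(2 ** (bitnum - 1) - 1) * 2.0 ** s
    if a > pos:
        return pos
    if a < neg:
        return neg
    return math.floor(a / 2.0 ** s) * 2.0 ** s
```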
In one embodiment, the input neurons, weights, output neurons, input neuron derivatives, output neuron derivatives, and weight derivatives for convolutional layers and fully-connected layers are all represented using fixed-point data.
Alternatively, the bit width of the fixed-point data used by the input neurons may be 8,16, 32, 64, or other values. Further, the bit width of the fixed-point data used by the input neuron is 8.
Optionally, the bit width of the fixed-point data used by the above weight values may be 8,16, 32, 64, or other values. Further, the bit width of the fixed-point data used by the weight is 8.
Alternatively, the bit width of the fixed-point data used for the input neuron derivatives may be 8,16, 32, 64, or other values. Further, the bit width of the fixed-point data used for the input neuron derivative is 16.
Alternatively, the bit width of the fixed-point data used for the output neuron derivatives may be 8,16, 32, 64, or other values. Further, the bit width of the fixed-point data used for the output neuron derivative is 24.
Alternatively, the bit width of the fixed-point data used for the weight derivative may be 8,16, 32, 64, or other values. Further, the bit width of the fixed-point data used by the weight derivative is 24.
In an embodiment, a plurality of fixed-point representation methods may be adopted for data a with a larger value among the data participating in the multi-layer network model operation; see the related description of fig. 2B.
Specifically, the first input data may be any type of data, and the first input data a is converted into the second input data â according to the decimal point positions and the bit width of the fixed-point data, as follows: the second input data â is composed of a sum of fixed-point components, â = â_1 + â_2 + … + â_N, where each component quantizes the residual left by the preceding components, i.e., â_1 = ⌊a / 2^(s_1)⌋ * 2^(s_1), â_2 = ⌊(a − â_1) / 2^(s_2)⌋ * 2^(s_2), and so on, with s_1 > s_2 > … > s_N the decimal point positions of the components. As before, when the first input data a satisfies neg ≤ a ≤ pos, the second input data â is the sum above; when the first input data a is greater than pos, the second input data â is pos; and when the first input data a is less than neg, the second input data â is neg.
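Under the residual-decomposition reading reconstructed above (an assumed form), the multi-fixed-point representation can be sketched as follows, reusing to_fixed from the previous sketch:

```python
def to_multi_fixed(a: float, positions, bitnum: int = 8):
    """Represent a as a sum of fixed-point components, each quantizing
    the residual left by the preceding components."""
    parts, residual = [], a
    for s in sorted(positions, reverse=True):  # coarsest scale first
        q = to_fixed(residual, s, bitnum)
        parts.append(q)
        residual -= q
    return parts  # a is approximated by sum(parts)
```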
S1003, the computing device initializes the decimal point position of the first input data and adjusts the decimal point position of the first input data.
The decimal point position s needs to be initialized and dynamically adjusted according to the data type, the neural network layer the data belongs to, and the iteration round.
The initialization process of the decimal point position s of the first input data is described below, that is, determining the decimal point position s used for the fixed-point data when the first input data is converted for the first time.
Wherein the initializing of the decimal point position s of the first input data by the computing device includes: initializing the decimal point position s of the first input data according to the maximum absolute value of the first input data; initializing it according to the minimum absolute value of the first input data; initializing it according to the relationship between different data types in the first input data; or initializing it according to an empirical value constant.
The above initialization processes are described below, respectively.
a) The computing device initializes the decimal point position s of the first input data according to the maximum absolute value of the first input data, by the formula:

s_a = ⌈log2(a_max)⌉ − bitnum + 1

where a_max is the maximum absolute value of the first input data, bitnum is the bit width of the fixed-point data into which the first input data is converted, and s_a is the decimal point position of the first input data.
The data participating in the operation can be divided, by category and by network layer, into: the input neurons X^(l), output neurons Y^(l) and weights W^(l) of layer l, and the input neuron derivatives ∇X^(l), output neuron derivatives ∇Y^(l) and weight derivatives ∇W^(l) of layer l. When searching for the maximum absolute value, the search may be performed by data category; by layer and category; or by layer, category and group. The method for determining the maximum absolute value of the first input data includes:
a.1) The computing device searches for the maximum absolute value by data category.

Specifically, the first input data comprise vectors/matrices whose elements are a_i^(l), where a^(l) may be the input neurons X^(l), the output neurons Y^(l), the weights W^(l), the input neuron derivatives ∇X^(l), the output neuron derivatives ∇Y^(l) or the weight derivatives ∇W^(l). In other words, the first input data include input neurons, weights, output neurons, input neuron derivatives, weight derivatives and output neuron derivatives, and the decimal point positions of the first input data include the decimal point position of the input neurons, of the weights, of the output neurons, of the input neuron derivatives, of the weight derivatives and of the output neuron derivatives, all represented in matrix or vector form. The computing device obtains the maximum absolute value of each category of data by traversing all elements in the vectors/matrices of each layer of the multi-layer network model, namely a_max = max_(l,i) |a_i^(l)|, and determines the decimal point position s_a for converting each category of data a into fixed-point data by the formula s_a = ⌈log2(a_max)⌉ − bitnum + 1.
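A sketch of this category-wide initialization, using the formula as reconstructed above (assumes a nonzero maximum):

```python
import math

def init_position_by_max(values, bitnum: int = 8) -> int:
    """s_a = ceil(log2(a_max)) - bitnum + 1, so that the largest
    magnitude still fits within (2**(bitnum-1) - 1) * 2**s_a."""
    a_max = max(abs(v) for v in values)  # assumes a_max > 0
    return math.ceil(math.log2(a_max)) - bitnum + 1
```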
a.2) The computing device searches for the maximum absolute value by layer and by category.

Specifically, each element of the first input data vector/matrix is a_i^(l), where a^(l) may be the input neurons X^(l), the output neurons Y^(l), the weights W^(l), the input neuron derivatives ∇X^(l), the output neuron derivatives ∇Y^(l) or the weight derivatives ∇W^(l). In other words, each layer of the multi-layer network model includes input neurons, weights, output neurons, input neuron derivatives, weight derivatives and output neuron derivatives, all expressed as matrices/vectors, and the decimal point positions of the first input data include one decimal point position per data category per layer. The computing device obtains the maximum absolute value of each category of data in each layer by traversing all elements in the vector/matrix of that category in that layer, namely a_max^(l) = max_i |a_i^(l)|, and determines the decimal point position of each category of data a in layer l by the formula s_a^(l) = ⌈log2(a_max^(l))⌉ − bitnum + 1.
a.3) The computing device searches for the maximum absolute value by layer, category and group.

Specifically, each element of the first input data vector/matrix is a_i^(l), where a^(l) may be the input neurons X^(l), the output neurons Y^(l), the weights W^(l), the input neuron derivatives ∇X^(l), the output neuron derivatives ∇Y^(l) or the weight derivatives ∇W^(l). In other words, the data categories of each layer of the multi-layer network model include input neurons, weights, output neurons, input neuron derivatives, weight derivatives and output neuron derivatives. The computing device divides each category of data in each layer of the multi-layer network model into g groups, or groups it by any other grouping rule. It then traverses each element of each of the g groups of data corresponding to each category in each layer, obtains the element with the largest absolute value in that group, namely a_max^(l,k) = max_i |a_i^(l,k)| for group k, and determines the decimal point position s_a^(l,k) of each of the g groups of data corresponding to each category in each layer by the formula s_a^(l,k) = ⌈log2(a_max^(l,k))⌉ − bitnum + 1.

The arbitrary grouping rules include, but are not limited to, grouping by data range, grouping by data training batch, and the like.
b) The computing device initializes the decimal point position s of the first input data according to the minimum absolute value of the first input data.

Specifically, the computing device finds the minimum absolute value a_min of the data to be quantized and determines the fixed-point precision s by the formula s_a = ⌊log2(a_min)⌋, so that the quantization step 2^(s_a) does not exceed a_min, where a_min is the minimum absolute value of the first input data; a_min is obtained as in steps a.1), a.2) and a.3) above.
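Correspondingly, a sketch of the minimum-based initialization under the floor-log2 reading assumed above (the exact formula behind the original figure is not recoverable, so this is an assumption):

```python
import math

def init_position_by_min(values) -> int:
    """s_a = floor(log2(a_min)): the quantization step 2**s_a does not
    exceed the smallest nonzero magnitude."""
    a_min = min(abs(v) for v in values if v != 0)  # assumes a nonzero value exists
    return math.floor(math.log2(a_min))
```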
c) The computing device initializes the decimal point position s of the first input data according to the relationship between different data types in the first input data.

Specifically, the decimal point position s_a^(l) of data type a^(l) of any layer (such as layer l) of the multi-layer network model can be determined by the computing device from the decimal point position s_b^(l) of data type b^(l) of layer l according to the formula s_a^(l) = α_b * s_b^(l) + β_b, where a^(l) and b^(l) may each be the input neurons X^(l), the output neurons Y^(l), the weights W^(l), the input neuron derivatives ∇X^(l), the output neuron derivatives ∇Y^(l) or the weight derivatives ∇W^(l), and α_b and β_b are integer constants.
d) The computing device initializes the decimal point position s of the first input data according to an empirical value constant.

Specifically, the decimal point position s_a^(l) of data type a^(l) of any layer (such as layer l) of the multi-layer network model can be set manually as s_a^(l) = c, where c is an integer constant and a^(l) may be the input neurons X^(l), the output neurons Y^(l), the weights W^(l), the input neuron derivatives ∇X^(l), the output neuron derivatives ∇Y^(l) or the weight derivatives ∇W^(l). Furthermore, the initialization value of the decimal point position of the input neurons and of the output neurons can be selected in the range [−8, 8]; the initialization value of the decimal point position of the weights can be selected in the range [−17, 8]; the initialization values of the decimal point positions of the input neuron derivatives and of the output neuron derivatives can be selected in the range [−40, −20]; and the initialization value of the decimal point position of the weight derivatives can be selected in the range [−48, −12].
The method for dynamically adjusting the decimal point position s by the computing device is described in detail below.
The method by which the computing device dynamically adjusts the decimal point position s includes adjusting s upwards (increasing s) and adjusting s downwards (decreasing s), specifically: single-step upward adjustment according to the maximum absolute value of the first input data; gradual upward adjustment according to the maximum absolute value of the first input data; single-step upward adjustment according to the first input data distribution; gradual upward adjustment according to the first input data distribution; and downward adjustment according to the maximum absolute value of the first input data.
a) The computing device performs single-step upward adjustment according to the maximum absolute value of the data in the first input data:

Suppose the decimal point position before adjustment is s_old; the fixed-point data corresponding to decimal point position s_old can represent the data range [neg, pos], where pos = (2^(bitnum−1) − 1) * 2^(s_old) and neg = −(2^(bitnum−1) − 1) * 2^(s_old). When the maximum absolute value a_max of the data in the first input data satisfies a_max ≥ pos, the adjusted decimal point position is s_new = ⌈log2(a_max)⌉ − bitnum + 1; otherwise the decimal point position is not adjusted, i.e., s_new = s_old.
b) The computing device performs gradual upward adjustment according to the maximum absolute value of the data in the first input data:

Suppose the decimal point position before adjustment is s_old; the fixed-point data corresponding to decimal point position s_old can represent the data range [neg, pos], where pos = (2^(bitnum−1) − 1) * 2^(s_old) and neg = −(2^(bitnum−1) − 1) * 2^(s_old). When the maximum absolute value a_max of the data in the first input data satisfies a_max ≥ pos, the adjusted decimal point position is s_new = s_old + 1; otherwise the decimal point position is not adjusted, i.e., s_new = s_old.
c) The computing device performs single-step upward adjustment according to the first input data distribution:

Suppose the decimal point position before adjustment is s_old; the fixed-point data corresponding to decimal point position s_old can represent the data range [neg, pos], where pos = (2^(bitnum−1) − 1) * 2^(s_old) and neg = −(2^(bitnum−1) − 1) * 2^(s_old). Statistics of the absolute value of the first input data are calculated, such as the mean a_mean of the absolute values and the standard deviation a_std of the absolute values, and the maximum range of the data is set as a_max = a_mean + n * a_std. When a_max ≥ pos, the adjusted decimal point position is s_new = ⌈log2(a_max)⌉ − bitnum + 1; otherwise the decimal point position is not adjusted, i.e., s_new = s_old.

Further, n may be 2 or 3.
d) The computing device performs gradual upward adjustment according to the first input data distribution:

Suppose the decimal point position before adjustment is s_old; the fixed-point data corresponding to decimal point position s_old can represent the data range [neg, pos], where pos = (2^(bitnum−1) − 1) * 2^(s_old) and neg = −(2^(bitnum−1) − 1) * 2^(s_old). Statistics of the absolute value of the first input data are calculated, such as the mean a_mean of the absolute values and the standard deviation a_std of the absolute values, and the maximum range of the data is set as a_max = a_mean + n * a_std, where n may be 3. When a_max ≥ pos, s_new = s_old + 1; otherwise the decimal point position is not adjusted, i.e., s_new = s_old.
e) The computing device adjusts downwards according to the maximum absolute value of the first input data:

Suppose the decimal point position before adjustment is s_old; the fixed-point data corresponding to decimal point position s_old can represent the data range [neg, pos], where pos = (2^(bitnum−1) − 1) * 2^(s_old) and neg = −(2^(bitnum−1) − 1) * 2^(s_old). When the maximum absolute value a_max of the first input data satisfies a_max < 2^(s_old + (bitnum − n)) and s_old ≥ s_min, then s_new = s_old − 1, where n is an integer constant and s_min is either an integer or negative infinity.

Further, n is 3 and s_min is −64.
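The five strategies a) through e) can be collected in one sketch; the mode names are ours and the formulas follow the reconstructions above:

```python
import math

def adjust_position(values, s_old, bitnum=8, mode="a", n=3, s_min=-64):
    """Modes 'a'/'b' use the true maximum magnitude; 'c'/'d' use
    a_max = mean + n * std of the magnitudes; 'e' adjusts downwards."""
    mags = [abs(v) for v in values]
    pos = (2 ** (bitnum - 1) - 1) * 2.0 ** s_old
    if mode in ("c", "d"):
        mean = sum(mags) / len(mags)
        std = (sum((m - mean) ** 2 for m in mags) / len(mags)) ** 0.5
        a_max = mean + n * std
    else:
        a_max = max(mags)
    if mode in ("a", "c"):  # single-step upward adjustment
        return math.ceil(math.log2(a_max)) - bitnum + 1 if a_max >= pos else s_old
    if mode in ("b", "d"):  # gradual upward adjustment
        return s_old + 1 if a_max >= pos else s_old
    if mode == "e":         # downward adjustment
        return s_old - 1 if a_max < 2.0 ** (s_old + bitnum - n) and s_old >= s_min else s_old
    raise ValueError(f"unknown mode: {mode}")
```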
Alternatively, regarding the frequency of adjusting the decimal point position: the decimal point position of the first input data may never be adjusted; or adjusted once every n first training periods (i.e., iterations), where n is a constant; or adjusted once every n second training periods (i.e., epochs), where n is a constant; or adjusted once every n first training periods or n second training periods, with n then being adjusted to α * n, where α is greater than 1; or adjusted once every n first training periods or second training periods, with n gradually decreasing as the number of training rounds increases. A sketch of the growing-interval schedule is given after the definitions below.
Further, the positions of the decimal points of the input neurons, the positions of the decimal points of the weights and the positions of the decimal points of the output neurons are adjusted every 100 first training periods. The positions of the decimal points of the input neuron derivatives and the positions of the decimal points of the output neuron derivatives are adjusted every 20 first training periods.
It should be noted that the first training period is the time required for training a batch of samples, and the second training period is the time required for performing one training on all training samples.
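One reading of the growing-interval schedule mentioned above, as a Python sketch (the geometric-growth interpretation and names are assumptions):

```python
def adjustment_schedule(total_iters, n=100.0, alpha=2.0):
    """Yield the iterations at which the decimal point position is adjusted,
    with the interval n growing by a factor alpha after each adjustment."""
    t = n
    while t <= total_iters:
        yield int(t)
        n *= alpha  # n <- alpha * n after each adjustment
        t += n
```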
The initialization and adjustment of the position of the decimal point of the data according to the average value or the median of the absolute values of the data may be described in detail with reference to the initialization and adjustment of the position of the decimal point of the data according to the maximum value of the absolute values of the data, and will not be described herein.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
The embodiments of the present application have been described above in detail, and specific examples are used herein to illustrate the principles and implementations of the present application; the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (47)

1. A computing device, the computing device comprising: a storage unit, a conversion unit, an arithmetic unit, and a controller unit; the storage unit includes a cache and a register;
the controller unit is used for determining the decimal point position of the first input data and the bit width of the fixed point data; the bit width of the fixed point data is the bit width of the first input data converted into the fixed point data;
the arithmetic unit is used for initializing the position of a decimal point of the first input data and adjusting the position of the decimal point of the first input data; and storing the adjusted decimal point position of the first input data into a cache of the storage unit,
the controller unit is used for acquiring first input data and a plurality of operation instructions from the register and acquiring the decimal point position of the adjusted first input data from the cache; transmitting the decimal point position of the adjusted first input data and the first input data to the conversion unit; wherein adjusting the decimal point position of the first input data is periodic;
the conversion unit is used for converting the first input data into second input data according to the decimal point position of the adjusted first input data;
wherein the arithmetic unit adjusts a decimal point position of the first input data, including:
adjusting the decimal point position of the first input data upwards in a single step according to the maximum value of the absolute values of the data in the first input data; or
adjusting the decimal point position of the first input data upwards gradually, step by step, according to the maximum value of the absolute values of the data in the first input data; or
adjusting the decimal point position of the first input data upwards in a single step according to the distribution of the first input data; or
adjusting the decimal point position of the first input data upwards gradually, step by step, according to the distribution of the first input data; or
adjusting the decimal point position of the first input data downwards according to the maximum value of the absolute values of the first input data.
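For illustration only (this sketch is not part of the claim language): a minimal Python model of the upward adjustment alternatives just listed, assuming a fixed-point format in which a signed bitwidth-bit integer i at decimal point position s represents the value i × 2^s. The helper names and the specific fitting rule are assumptions of this sketch.

    import math

    def required_point(data, bitwidth):
        # Smallest decimal point position s whose signed range
        # (2**(bitwidth-1) - 1) * 2**s still covers max|x|;
        # assumes the data contains at least one nonzero value.
        amax = max(abs(x) for x in data)
        return math.ceil(math.log2(amax / (2 ** (bitwidth - 1) - 1)))

    def adjust_up_single_step(point, data, bitwidth):
        # Single-step variant: jump directly to the target position.
        return max(point, required_point(data, bitwidth))

    def adjust_up_stepwise(point, data, bitwidth):
        # Stepwise variant: move the point up one position per adjustment,
        # approaching the target gradually.
        return point + 1 if point < required_point(data, bitwidth) else point

A distribution-based variant would replace max|x| with a statistic such as a high percentile of |x|; a downward adjustment would lower s when max|x| leaves the upper part of the representable range unused.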
2. The apparatus of claim 1, wherein the arithmetic unit initializes a decimal point position of the first input data, comprising:
initializing the decimal point position of the first input data according to the maximum value of the absolute values of the first input data; or
initializing the decimal point position of the first input data according to the minimum value of the absolute values of the first input data; or
initializing the decimal point position of the first input data according to the relation between different data types in the first input data; or
initializing the decimal point position of the first input data according to an empirical constant.
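Again for illustration only: hedged sketches of two of the initialization alternatives above, under the same assumed format (value = integer × 2^s with a signed bitwidth-bit integer). The formulas are plausible choices, not values quoted from the claims.

    import math

    def init_point_max_abs(data, bitwidth):
        # Range-oriented choice: smallest s with
        # max|x| <= (2**(bitwidth-1) - 1) * 2**s.
        amax = max(abs(x) for x in data)
        return math.ceil(math.log2(amax / (2 ** (bitwidth - 1) - 1)))

    def init_point_min_abs(data):
        # Precision-oriented choice: keep the smallest nonzero magnitude
        # representable by one unit in the last place.
        amin = min(abs(x) for x in data if x != 0)
        return math.floor(math.log2(amin))

Initialization from an empirical constant would simply set the point to a fixed value chosen per data type; initialization from the relation between data types could, for example, derive the weight point position from the input neuron point position.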
3. The apparatus of claim 1 or 2, wherein the periodic adjusting of the decimal point position of the first input data comprises:
adjusting the decimal point position of the first input data once every n first training periods (iterations) or every n second training periods (epochs), wherein the first training period is larger than the second training period, and n is a constant.
4. The apparatus of claim 1 or 2, wherein the periodic adjusting of the decimal point position of the first input data comprises:
adjusting the decimal point position of the first input data once every n first training periods (iterations) or second training periods (epochs), and then adjusting n to alpha × n, where alpha is greater than 1; or,
adjusting the decimal point position of the first input data once every n first training periods (iterations) or second training periods (epochs), wherein n decreases gradually as the number of training rounds increases;
wherein the first training period is greater than the second training period.
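A hedged sketch of the two schedules above (function names and the demonstration constants are invented for illustration): the first stretches the interval n by a factor alpha after each adjustment, the second shrinks n as training proceeds.

    def growing_interval_schedule(total_iters, n=100, alpha=2.0):
        # Adjust once every n periods, then stretch n to alpha * n,
        # so adjustments become rarer as the data statistics stabilise.
        points, next_adjust = [], n
        while next_adjust <= total_iters:
            points.append(next_adjust)
            n = int(alpha * n)
            next_adjust += n
        return points

    def shrinking_n_schedule(total_iters, n=100, decay=10):
        # Alternative: n itself decreases with the training round count,
        # so adjustments become more frequent.
        points, t = [], 0
        while n > 0 and t + n <= total_iters:
            t += n
            points.append(t)
            n = max(1, n - decay)
        return points

For example, growing_interval_schedule(2000) yields adjustments at periods 100, 300, 700, and 1500.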
5. The apparatus of claim 1, wherein the computing apparatus is configured to perform machine learning computations,
the controller unit is further configured to transmit the plurality of operation instructions to the operation unit;
the conversion unit is further used for transmitting the second input data to the operation unit;
the operation unit is further configured to perform an operation on the second input data according to the plurality of operation instructions to obtain an operation result.
6. The apparatus of claim 2, wherein the computing apparatus is configured to perform a machine learning computation,
the controller unit is further configured to transmit the plurality of operation instructions to the operation unit;
the conversion unit is further used for transmitting the second input data to the operation unit;
the operation unit is further configured to perform an operation on the second input data according to the plurality of operation instructions to obtain an operation result.
7. The apparatus of claim 3, wherein the computing apparatus is configured to perform a machine learning computation,
the controller unit is further configured to transmit the plurality of operation instructions to the operation unit;
the conversion unit is further used for transmitting the second input data to the operation unit;
the operation unit is further configured to perform an operation on the second input data according to the plurality of operation instructions to obtain an operation result.
8. The apparatus of claim 4, wherein the computing apparatus is configured to perform machine learning calculations,
the controller unit is further configured to transmit the plurality of operation instructions to the operation unit;
the conversion unit is further used for transmitting the second input data to the operation unit;
the operation unit is further configured to perform an operation on the second input data according to the plurality of operation instructions to obtain an operation result.
9. The apparatus of any of claims 5-8, wherein the machine learning computation comprises an artificial neural network operation; the first input data comprises input neuron data and weight data; and the operation result is output neuron data.
10. The apparatus according to any one of claims 5-8, wherein said arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits;
the main processing circuit is used for performing preamble processing on the second input data and for transmitting data and the plurality of operation instructions to the plurality of slave processing circuits;
the plurality of slave processing circuits are used for executing intermediate operations according to the second input data and the plurality of operation instructions transmitted from the main processing circuit to obtain a plurality of intermediate results, and for transmitting the plurality of intermediate results to the main processing circuit;
and the main processing circuit is used for executing subsequent processing on the plurality of intermediate results to obtain the operation result.
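A toy Python model of this master/slave flow, for illustration only; the data split and the summation stand in for the claimed preamble processing, intermediate operations, and subsequent processing.

    def master_slave_run(second_input, num_slaves=4):
        # Preamble processing on the main processing circuit:
        # here, simply splitting the data into one share per slave.
        shares = [second_input[i::num_slaves] for i in range(num_slaves)]
        # Each slave processing circuit executes an intermediate
        # operation on its share.
        intermediates = [sum(share) for share in shares]
        # Subsequent processing on the main processing circuit
        # combines the intermediate results.
        return sum(intermediates)

    assert master_slave_run(list(range(16))) == sum(range(16))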
11. The apparatus of claim 9, wherein said arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits;
the main processing circuit is used for performing preamble processing on the second input data and for transmitting data and the plurality of operation instructions to the plurality of slave processing circuits;
the plurality of slave processing circuits are used for executing intermediate operations according to the second input data and the plurality of operation instructions transmitted from the main processing circuit to obtain a plurality of intermediate results, and for transmitting the plurality of intermediate results to the main processing circuit;
and the main processing circuit is used for executing subsequent processing on the plurality of intermediate results to obtain the operation result.
12. The apparatus of claim 10, wherein the computing apparatus further comprises: a Direct Memory Access (DMA) unit;
the cache is further used for storing the first input data; wherein the cache comprises a scratch pad cache;
the register is further used for storing scalar data in the first input data;
the DMA unit is used for reading data from the storage unit or storing data into the storage unit.
13. The apparatus of claim 11, wherein the computing apparatus further comprises: a Direct Memory Access (DMA) unit;
the cache is further used for storing the first input data; wherein the cache comprises a scratch pad cache;
the register is further used for storing scalar data in the first input data;
the DMA unit is used for reading data from the storage unit or storing data into the storage unit.
14. The apparatus according to any one of claims 5 to 8, wherein when the first input data is fixed-point data, the arithmetic unit further includes:
and the derivation unit is used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operation according to the first input data.
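One plausible derivation rule, shown for illustration (the claims do not fix the rule; these identities are standard for the assumed format value = integer × 2^s):

    def derive_point_of_product(s_a, s_b):
        # (a * 2**s_a) * (b * 2**s_b) = (a * b) * 2**(s_a + s_b):
        # decimal point positions add under multiplication.
        return s_a + s_b

    def derive_point_of_sum(s_a, s_b):
        # Addends must share a scale; taking the coarser of the two
        # avoids overflow at the cost of rounding the finer operand.
        return max(s_a, s_b)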
15. The apparatus according to claim 9, wherein when the first input data is fixed-point data, the arithmetic unit further comprises:
and the derivation unit is used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operation according to the first input data.
16. The apparatus according to claim 10, wherein when the first input data is fixed-point data, the arithmetic unit further comprises:
and the derivation unit is used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operation according to the first input data.
17. The apparatus according to claim 11, wherein when the first input data is fixed-point data, the arithmetic unit further comprises:
and the derivation unit is used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operation according to the first input data.
18. The apparatus according to claim 12, wherein when the first input data is fixed-point data, the arithmetic unit further comprises:
and the derivation unit is used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operation according to the first input data.
19. The apparatus according to claim 13, wherein when the first input data is fixed-point data, the arithmetic unit further comprises:
and the derivation unit is used for deriving the decimal point position of one or more intermediate results according to the decimal point position of the first input data, wherein the one or more intermediate results are obtained by operation according to the first input data.
20. The apparatus of claim 14, wherein the arithmetic unit further comprises:
a data caching unit for caching the one or more intermediate results.
21. The apparatus according to any one of claims 15-19, wherein the arithmetic unit further comprises:
a data caching unit for caching the one or more intermediate results.
22. The apparatus according to claim 10, wherein the arithmetic unit comprises a tree module, the tree module comprising a root port and a plurality of branch ports, the root port of the tree module being connected with the main processing circuit, and each branch port of the tree module being connected with one of the plurality of slave processing circuits;
the tree module is used for forwarding data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the tree module is of an n-branch tree structure, and n is an integer greater than or equal to 2.
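An illustrative sketch of the n-ary fan-out such a tree performs; the function is an assumption of this description, not circuitry from the claims.

    def broadcast_via_tree(data, num_slaves, branching=2):
        # The root port (master side) holds the data; each tree level
        # forwards it to `branching` children until every leaf
        # (slave port) has a copy.
        level = [data]
        while len(level) < num_slaves:
            level = [item for item in level for _ in range(branching)]
        return level[:num_slaves]

The same structure run in reverse models the gathering of intermediate results from the slave ports back to the root.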
23. The apparatus according to any one of claims 11-13, wherein the arithmetic unit comprises a tree module, the tree module comprising a root port and a plurality of branch ports, the root port of the tree module being connected with the main processing circuit, and each branch port of the tree module being connected with one of the plurality of slave processing circuits;
the tree module is used for forwarding data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the tree module is of an n-branch tree structure, and n is an integer greater than or equal to 2.
24. The apparatus of claim 10, wherein the arithmetic unit further comprises branch processing circuitry,
the main processing circuit is specifically configured to determine that an input neuron is broadcast data, determine that a weight is distribution data, allocate the distribution data to a plurality of data blocks, and send at least one data block of the plurality of data blocks, the broadcast data, and at least one operation instruction of the plurality of operation instructions to the branch processing circuit;
the branch processing circuit is used for forwarding data blocks, broadcast data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for carrying out operation on the received data blocks and the broadcast data according to the operation instruction to obtain an intermediate result and transmitting the intermediate result to the branch processing circuit;
the main processing circuit is further configured to perform subsequent processing on the intermediate result sent by the branch processing circuit to obtain a result of the operation instruction, and send the result of the operation instruction to the controller unit.
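A hedged sketch of this broadcast/distribution dispatch, using a matrix-vector product as the stand-in workload; the row-wise splitting scheme is an assumption of the sketch.

    def dispatch_and_compute(input_neurons, weight_rows, num_slaves=4):
        # Weights are distribution data: split into data blocks by rows.
        blocks = [weight_rows[i::num_slaves] for i in range(num_slaves)]
        intermediates = []
        for block in blocks:
            # Input neurons are broadcast data: every slave sees them all.
            partial = [sum(n * w for n, w in zip(input_neurons, row))
                       for row in block]
            intermediates.append(partial)  # one intermediate result per slave
        # The main processing circuit would combine/sort these to obtain
        # the result of the operation instruction.
        return intermediates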
25. The apparatus according to any one of claims 11-13, wherein the arithmetic unit further comprises branch processing circuitry,
the main processing circuit is specifically configured to determine that an input neuron is broadcast data, determine that a weight is distribution data, allocate the distribution data to a plurality of data blocks, and send at least one data block of the plurality of data blocks, the broadcast data, and at least one operation instruction of the plurality of operation instructions to the branch processing circuit;
the branch processing circuit is used for forwarding data blocks, broadcast data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for carrying out operation on the received data blocks and the broadcast data according to the operation instruction to obtain an intermediate result and transmitting the intermediate result to the branch processing circuit;
the main processing circuit is further configured to perform subsequent processing on the intermediate result sent by the branch processing circuit to obtain a result of the operation instruction, and send the result of the operation instruction to the controller unit.
26. The apparatus of claim 10, wherein the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, the master processing circuit is connected with K slave processing circuits in the plurality of slave processing circuits, and the K slave processing circuits are as follows: n slave processing circuits of row 1, n slave processing circuits of row m, and m slave processing circuits of column 1;
the K slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits;
the main processing circuit is further configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the K slave processing circuits;
the K slave processing circuits are used for forwarding data between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing operations on the received data blocks according to the operation instruction to obtain intermediate results, and for transmitting the intermediate results to the K slave processing circuits;
and the main processing circuit is used for processing the intermediate results sent by the K slave processing circuits to obtain the result of the operation instruction, and sending the result of the operation instruction to the controller unit.
27. The apparatus according to any one of claims 11-13, wherein the arithmetic unit further comprises branch processing circuitry,
the main processing circuit is specifically configured to determine that an input neuron is broadcast data, determine that a weight is distribution data, allocate the distribution data to a plurality of data blocks, and send at least one data block of the plurality of data blocks, the broadcast data, and at least one operation instruction of the plurality of operation instructions to the branch processing circuit;
the branch processing circuit is used for forwarding data blocks, broadcast data and operation instructions between the main processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for carrying out operation on the received data blocks and the broadcast data according to the operation instruction to obtain an intermediate result and transmitting the intermediate result to the branch processing circuit;
the main processing circuit is further configured to perform subsequent processing on the intermediate result sent by the branch processing circuit to obtain a result of the operation instruction, and send the result of the operation instruction to the controller unit.
28. The apparatus of any one of claims 22, 24 and 26,
the main processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction;
or the main processing circuit is specifically configured to perform combined sorting and activation processing on the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction.
29. The apparatus of claim 23,
the main processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction;
or the main processing circuit is specifically configured to perform combined sorting and activation processing on the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction.
30. The apparatus of claim 25,
the main processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction;
or the main processing circuit is specifically configured to perform combined sorting and activation processing on the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction.
31. The apparatus of claim 27,
the main processing circuit is specifically configured to combine and sort the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction;
or the main processing circuit is specifically configured to perform combined sorting and activation processing on the intermediate results sent by the plurality of slave processing circuits to obtain the result of the operation instruction.
32. The apparatus of any one of claims 22, 24 and 26, wherein the main processing circuit comprises: one or any combination of an activation processing circuit and an addition processing circuit;
the activation processing circuit is used for executing activation operation of data in the main processing circuit;
the addition processing circuit is used for executing addition operation or accumulation operation;
the slave processing circuit includes:
the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result;
and the accumulation processing circuit is used for executing accumulation operation on the product result to obtain an intermediate result.
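A minimal sketch of that slave datapath, with the multiplication circuit and the accumulation circuit modelled as two separate stages (for illustration only):

    def slave_multiply_accumulate(block_a, block_b):
        # Multiplication processing circuit: elementwise products.
        products = [a * b for a, b in zip(block_a, block_b)]
        # Accumulation processing circuit: reduce the products
        # to a single intermediate result.
        intermediate = 0
        for p in products:
            intermediate += p
        return intermediate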
33. The apparatus of claim 23, wherein the main processing circuit comprises: one or any combination of an activation processing circuit and an addition processing circuit;
the activation processing circuit is used for executing activation operation of data in the main processing circuit;
the addition processing circuit is used for executing addition operation or accumulation operation;
the slave processing circuit includes:
the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result;
and the accumulation processing circuit is used for executing accumulation operation on the product result to obtain an intermediate result.
34. The apparatus of claim 25, wherein the main processing circuit comprises: one or any combination of an activation processing circuit and an addition processing circuit;
the activation processing circuit is used for executing activation operation of data in the main processing circuit;
the addition processing circuit is used for executing addition operation or accumulation operation;
the slave processing circuit includes:
the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result;
and the accumulation processing circuit is used for executing accumulation operation on the product result to obtain an intermediate result.
35. The apparatus of claim 27, wherein the main processing circuit comprises: one or any combination of an activation processing circuit and an addition processing circuit;
the activation processing circuit is used for executing activation operation of data in the main processing circuit;
the addition processing circuit is used for executing addition operation or accumulation operation;
the slave processing circuit includes:
the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result;
and the accumulation processing circuit is used for executing accumulation operation on the product result to obtain an intermediate result.
36. A machine learning arithmetic device, characterized in that the machine learning arithmetic device comprises one or more computing devices according to any one of claims 5 to 35, and is configured to acquire data to be operated on and control information from other processing devices, execute a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of computing devices, the plurality of computing devices can be connected through a specific structure and transmit data;
the computing devices are interconnected through a PCIE bus of a fast peripheral equipment interconnection bus and transmit data so as to support operation of larger-scale machine learning; a plurality of the computing devices share the same control system or own respective control systems; the computing devices share the memory or own the memory; the plurality of computing devices are interconnected in any interconnection topology.
37. A combined processing apparatus, characterized in that the combined processing apparatus comprises the machine learning arithmetic apparatus according to claim 36, a universal interconnection interface, a storage apparatus and other processing apparatuses;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user;
the storage device is respectively connected with the machine learning arithmetic device and the other processing devices and is used for storing the data of the machine learning arithmetic device and the other processing devices.
38. A neural network chip comprising the machine learning computation device of claim 36 or the combined processing device of claim 37.
39. An electronic device, characterized in that the electronic device comprises a chip according to claim 38.
40. A board card, characterized in that the board card comprises: a memory device, an interface apparatus, a control device, and the neural network chip of claim 38;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
the control device is used for monitoring the state of the chip;
wherein the memory device comprises: a plurality of groups of memory units, each group of memory units being connected with the chip through a bus, and the memory units being DDR SDRAM;
the chip comprises: a DDR controller for controlling data transmission and data storage of each memory unit;
and the interface apparatus is a standard PCIE interface.
41. A method of computing, comprising:
the method comprises the steps that a controller unit determines the decimal point position of first input data and the bit width of fixed point data, wherein the bit width of the fixed point data is the bit width of the first input data converted into the fixed point data;
the arithmetic unit initializes the position of a decimal point of the first input data and adjusts the position of the decimal point of the first input data;
the conversion unit acquires the decimal point position of the adjusted first input data and converts the first input data into second input data according to the decimal point position; wherein adjusting the decimal point position of the first input data is periodic;
wherein the arithmetic unit adjusts a decimal point position of the first input data, including:
adjusting the decimal point position of the first input data upwards in a single step according to the maximum value of the absolute values of the data in the first input data; or
adjusting the decimal point position of the first input data upwards gradually, step by step, according to the maximum value of the absolute values of the data in the first input data; or
adjusting the decimal point position of the first input data upwards in a single step according to the distribution of the first input data; or
adjusting the decimal point position of the first input data upwards gradually, step by step, according to the distribution of the first input data; or
adjusting the decimal point position of the first input data downwards according to the maximum value of the absolute values of the first input data.
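For illustration, a hedged sketch of the conversion step of the method: mapping first input data (floating point) to second input data (fixed point) at a given decimal point position. The round-to-nearest and saturation choices are assumptions of the sketch.

    def float_to_fixed(x, point, bitwidth):
        # Quantise x to an integer at scale 2**point ...
        q = round(x / 2 ** point)
        # ... and saturate to the signed range of `bitwidth` bits.
        lo, hi = -(2 ** (bitwidth - 1)), 2 ** (bitwidth - 1) - 1
        return max(lo, min(hi, q))

    def fixed_to_float(q, point):
        # Inverse mapping used when reading results back.
        return q * 2 ** point

For example, float_to_fixed(0.75, point=-6, bitwidth=8) returns 48, since 48 × 2^-6 = 0.75.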
42. The method of claim 41, wherein the arithmetic unit initializes a decimal point position of the first input data, comprising:
initializing the decimal point position of the first input data according to the maximum value of the absolute values of the first input data; or
initializing the decimal point position of the first input data according to the minimum value of the absolute values of the first input data; or
initializing the decimal point position of the first input data according to the relation between different data types in the first input data; or
initializing the decimal point position of the first input data according to an empirical constant.
43. The method of claim 42, wherein the computing method is a method for performing machine learning computations, the method further comprising:
the operation unit operates the second input data according to a plurality of operation instructions to obtain an operation result.
44. The method of claim 43, wherein the machine learning computation comprises an artificial neural network operation; the first input data comprises input neurons and weights; and the operation result is an output neuron.
45. The method of claim 44, wherein when the first input data is fixed-point data, the method further comprises:
the arithmetic unit derives decimal point positions of one or more intermediate results according to the decimal point positions of the first input data, wherein the one or more intermediate results are obtained through calculation according to the first input data.
46. The method of any of claims 41 to 45, wherein the periodic adjusting of the decimal point position of the first input data comprises:
adjusting the decimal point position of the first input data once every n first training periods (iterations) or every n second training periods (epochs), wherein the first training period is larger than the second training period, and n is a constant.
47. The method of any of claims 41 to 45, wherein the periodic adjusting of the decimal point position of the first input data comprises:
adjusting the decimal point position of the first input data once every n first training periods (iterations) or second training periods (epochs), and then adjusting n to alpha × n, wherein alpha is greater than 1; or,
adjusting the decimal point position of the first input data once every n first training periods (iterations) or second training periods (epochs), wherein n decreases gradually as the number of training rounds increases;
wherein the first training period is greater than the second training period.
CN201910195535.1A 2018-02-13 2018-09-03 Computing device and method Active CN110163353B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
CN201810149287.2A CN110163350B (en) 2018-02-13 2018-02-13 Computing device and method
CN2018101492872 2018-02-13
CN2018102079158 2018-03-14
CN201810207915.8A CN110276447A (en) 2018-03-14 2018-03-14 A kind of computing device and method
CN201880002628.1A CN110383300B (en) 2018-02-13 2018-09-03 Computing device and method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201880002628.1A Division CN110383300B (en) 2018-02-13 2018-09-03 Computing device and method

Publications (2)

Publication Number Publication Date
CN110163353A CN110163353A (en) 2019-08-23
CN110163353B true CN110163353B (en) 2021-05-11

Family

ID=67638324

Family Applications (11)

Application Number Title Priority Date Filing Date
CN201910195627.XA Active CN110163357B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195820.3A Active CN110163361B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195818.6A Active CN110163359B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195599.1A Active CN110163355B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195816.7A Active CN110163358B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195600.0A Active CN110163356B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195899.XA Active CN110163363B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195819.0A Active CN110163360B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195535.1A Active CN110163353B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195898.5A Active CN110163362B (en) 2018-02-13 2018-09-03 Computing device and method
CN201910195598.7A Active CN110163354B (en) 2018-02-13 2018-09-03 Computing device and method


Country Status (1)

Country Link
CN (11) CN110163357B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3640863B1 (en) * 2018-02-13 2021-10-27 Shanghai Cambricon Information Technology Co., Ltd Computation device and method
US11704125B2 (en) 2018-02-13 2023-07-18 Cambricon (Xi'an) Semiconductor Co., Ltd. Computing device and method
CN110597756B (en) * 2019-08-26 2023-07-25 光子算数(北京)科技有限责任公司 Calculation circuit and data operation method
CN112445524A (en) * 2019-09-02 2021-03-05 中科寒武纪科技股份有限公司 Data processing method, related device and computer readable medium
CN112257870B (en) * 2019-11-08 2024-04-09 安徽寒武纪信息科技有限公司 Machine learning instruction conversion method and device, board card, main board and electronic equipment
CN110929862B (en) * 2019-11-26 2023-08-01 陈子祺 Fixed-point neural network model quantification device and method
KR20210077352A (en) * 2019-12-17 2021-06-25 에스케이하이닉스 주식회사 Data Processing System and accelerating DEVICE therefor
EP4220448A1 (en) * 2020-09-27 2023-08-02 Cambricon (Xi'an) Semiconductor Co., Ltd. Data processing device, integrated circuit chip, device, and implementation method therefor

Family Cites Families (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6650327B1 (en) * 1998-06-16 2003-11-18 Silicon Graphics, Inc. Display system having floating point rasterization and floating point framebuffering
US6834293B2 (en) * 2001-06-15 2004-12-21 Hitachi, Ltd. Vector scaling system for G.728 annex G
CN100410871C (en) * 2003-07-23 2008-08-13 联发科技股份有限公司 Digital signal processor applying skip type floating number operational method
US7432925B2 (en) * 2003-11-21 2008-10-07 International Business Machines Corporation Techniques for representing 3D scenes using fixed point data
CN1658153B (en) * 2004-02-18 2010-04-28 联发科技股份有限公司 Compound dynamic preset number representation and algorithm, and its processor structure
CN100340972C (en) * 2005-06-07 2007-10-03 北京北方烽火科技有限公司 Method for implementing logarithm computation by field programmable gate array in digital auto-gain control
JP4976798B2 (en) * 2006-09-28 2012-07-18 株式会社東芝 Two-degree-of-freedom position control method, two-degree-of-freedom position control device, and medium storage device
CN101231632A (en) * 2007-11-20 2008-07-30 西安电子科技大学 Method for processing floating-point FFT by FPGA
CN101183873B (en) * 2007-12-11 2011-09-28 广州中珩电子科技有限公司 BP neural network based embedded system data compression/decompression method
CN101510149B (en) * 2009-03-16 2011-05-04 炬力集成电路设计有限公司 Method and apparatus for processing data
CN101754039A (en) * 2009-12-22 2010-06-23 中国科学技术大学 Three-dimensional parameter decoding system for mobile devices
US9104479B2 (en) * 2011-12-07 2015-08-11 Arm Limited Apparatus and method for rounding a floating-point value to an integral floating-point value
CN102981854A (en) * 2012-11-16 2013-03-20 天津市天祥世联网络科技有限公司 Neural network optimization method based on floating number operation inline function library
CN103019647B (en) * 2012-11-28 2015-06-24 中国人民解放军国防科学技术大学 Floating-point accumulation/gradual decrease operational method with floating-point precision maintaining function
CN103455983A (en) * 2013-08-30 2013-12-18 深圳市川大智胜科技发展有限公司 Image disturbance eliminating method in embedded type video system
CN104572011B (en) * 2014-12-22 2018-07-31 上海交通大学 Universal matrix fixed-point multiplication device based on FPGA and its computational methods
US20170061279A1 (en) * 2015-01-14 2017-03-02 Intel Corporation Updating an artificial neural network using flexible fixed point representation
CN104679720A (en) * 2015-03-17 2015-06-03 成都金本华科技股份有限公司 Operation method for FFT
CN104679719B (en) * 2015-03-17 2017-11-10 成都金本华科技股份有限公司 A kind of floating-point operation method based on FPGA
CN105094744B (en) * 2015-07-28 2018-01-16 成都腾悦科技有限公司 A kind of variable floating data microprocessor
US9977116B2 (en) * 2015-10-05 2018-05-22 Analog Devices, Inc. Scaling fixed-point fast Fourier transforms in radar and sonar applications
CN106485322B (en) * 2015-10-08 2019-02-26 上海兆芯集成电路有限公司 It is performed simultaneously the neural network unit of shot and long term memory cell calculating
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN107578099B (en) * 2016-01-20 2021-06-11 中科寒武纪科技股份有限公司 Computing device and method
CN111340200A (en) * 2016-01-20 2020-06-26 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network forward operations
CN107301453B (en) * 2016-04-15 2021-04-20 中科寒武纪科技股份有限公司 Artificial neural network forward operation device and method supporting discrete data representation
CN107315566B (en) * 2016-04-26 2020-11-03 中科寒武纪科技股份有限公司 Apparatus and method for performing vector circular shift operation
CN111651201B (en) * 2016-04-26 2023-06-13 中科寒武纪科技股份有限公司 Apparatus and method for performing vector merge operation
CN111176608A (en) * 2016-04-26 2020-05-19 中科寒武纪科技股份有限公司 Apparatus and method for performing vector compare operations
CN110188870B (en) * 2016-04-27 2021-10-12 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network self-learning operation
CN111860811B (en) * 2016-04-27 2024-01-16 中科寒武纪科技股份有限公司 Device and method for executing full-connection layer forward operation of artificial neural network
CN109858623B (en) * 2016-04-28 2021-10-15 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network forward operations
CN109934331B (en) * 2016-04-29 2020-06-19 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network forward operations
CN111860812B (en) * 2016-04-29 2024-03-01 中科寒武纪科技股份有限公司 Apparatus and method for performing convolutional neural network training
CN106502626A (en) * 2016-11-03 2017-03-15 北京百度网讯科技有限公司 Data processing method and device
CN106708780A (en) * 2016-12-12 2017-05-24 中国航空工业集团公司西安航空计算技术研究所 Low complexity branch processing circuit of uniform dyeing array towards SIMT framework
CN106775599B (en) * 2017-01-09 2019-03-01 南京工业大学 The more computing unit coarseness reconfigurable systems and method of recurrent neural network
CN107292334A (en) * 2017-06-08 2017-10-24 北京深瞐科技有限公司 Image-recognizing method and device
CN107729990B (en) * 2017-07-20 2021-06-08 上海寒武纪信息科技有限公司 Apparatus and method for performing forward operations in support of discrete data representations
CN107451658B (en) * 2017-07-24 2020-12-15 杭州菲数科技有限公司 Fixed-point method and system for floating-point operation

Also Published As

Publication number Publication date
CN110163359B (en) 2020-12-11
CN110163360B (en) 2021-06-25
CN110163354B (en) 2020-10-09
CN110163357B (en) 2021-06-25
CN110163361A (en) 2019-08-23
CN110163353A (en) 2019-08-23
CN110163363B (en) 2021-05-11
CN110163363A (en) 2019-08-23
CN110163355A (en) 2019-08-23
CN110163358A (en) 2019-08-23
CN110163360A (en) 2019-08-23
CN110163361B (en) 2021-06-25
CN110163359A (en) 2019-08-23
CN110163356B (en) 2020-10-09
CN110163354A (en) 2019-08-23
CN110163362A (en) 2019-08-23
CN110163362B (en) 2020-12-11
CN110163355B (en) 2020-10-09
CN110163357A (en) 2019-08-23
CN110163358B (en) 2021-01-05
CN110163356A (en) 2019-08-23

Similar Documents

Publication Publication Date Title
CN110163353B (en) Computing device and method
US11397579B2 (en) Computing device and method
TWI827432B (en) Computing apparatus, machine learning computing apparatus, combined processing apparatus, neural network chip, electronic device, board, and computing method
CN110163350B (en) Computing device and method
CN111626413A (en) Computing device and method
CN111045728B (en) Computing device and related product
CN111047022A (en) Computing device and related product
CN111353591A (en) Computing device and related product
CN111047021A (en) Computing device and related product
CN111198714B (en) Retraining method and related product
CN111047024A (en) Computing device and related product
CN111382848A (en) Computing device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant