CN111738429A - Computing device and related product - Google Patents

Computing device and related product

Info

Publication number
CN111738429A
CN111738429A (application CN201910229823.4A)
Authority
CN
China
Prior art keywords
memory
data
address information
decoder
register
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910229823.4A
Other languages
Chinese (zh)
Other versions
CN111738429B (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201910229823.4A priority Critical patent/CN111738429B/en
Publication of CN111738429A publication Critical patent/CN111738429A/en
Application granted granted Critical
Publication of CN111738429B publication Critical patent/CN111738429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)

Abstract

The application provides a computing device and a related product. The related product comprises a neural network chip and a board card; the board card comprises a storage device, an interface device, a control device, and the neural network chip, and the neural network chip is connected to the storage device, the control device, and the interface device respectively. The storage device is used for storing data; the interface device is used for realizing data transmission between the neural network chip and external equipment; the control device is used for monitoring the state of the neural network chip. Implementing the embodiments of the application solves the problems of high data-transmission latency and high energy consumption during neural network operation and breaks the bottleneck of neural network operation, thereby meeting the actual requirements of users and improving user experience.

Description

Computing device and related product
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a computing device and a related product.
Background
With the continuous development of information technology and people's growing demands, requirements on data storage keep rising, especially during the operation of neural network algorithms.
However, in the prior art, storage devices suffer from low storage density, large area, high power consumption, and high access latency. As a result, data transmission during neural network operation is slow and energy-intensive and becomes the bottleneck of the operation, so the actual requirements of users cannot be met and user experience suffers.
Disclosure of Invention
The embodiments of the application provide a computing device and a related product, which help solve the problems of high data-transmission latency and high energy consumption during neural network operation and break the bottleneck of neural network operation, thereby meeting the actual requirements of users and improving user experience.
In a first aspect, an embodiment of the present application provides a computing apparatus, where the computing apparatus includes a storage unit and a controller unit, where the storage unit includes: a 3D decoder and a 3D memory;
the controller unit is used for sending an access instruction to the 3D decoder;
the 3D decoder is used for decoding the access instruction transmitted by the controller unit to obtain address information of data to be accessed carried by the access instruction;
the 3D decoder is further used for sending the address information to the 3D memory;
the 3D memory is used for accessing the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
In a second aspect, an embodiment of the present application provides a neural network chip, which includes the computing device according to the first aspect.
In a third aspect, an embodiment of the present application provides a board card, where the board card includes the neural network chip package structure described in the second aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, where the electronic device includes the neural network chip described in the second aspect or the board described in the third aspect.
In a fifth aspect, embodiments of the present application further provide a computing method for executing a machine learning model, where the computing method is applied to a computing device, and the computing device includes a storage unit and a controller unit, where the storage unit includes: a 3D decoder and a 3D memory; the method comprises the following steps:
the controller unit sends an access instruction to the 3D decoder;
the 3D decoder decodes the access instruction transmitted by the controller unit to obtain address information of the data to be accessed carried by the access instruction;
the 3D decoder sends the address information to the 3D memory; and
the 3D memory accesses the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a motor vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1A is a schematic structural diagram of a computing device according to an embodiment of the present application.
Fig. 1B is a structural diagram of a memory cell according to an embodiment of the present application.
Fig. 1C is a block diagram of a memory cell according to another embodiment of the present application.
Fig. 1D is a block diagram of another computing device according to an embodiment of the present disclosure.
Fig. 1E is a block diagram of another computing device provided in the embodiments of the present application.
Fig. 1F is a schematic structural diagram of another computing device provided in the embodiment of the present application.
Fig. 1G is a block diagram of another computing device provided in the embodiments of the present application.
Fig. 1H is a block diagram of another computing device according to an embodiment of the present disclosure.
Fig. 1I is a block diagram of another computing device provided in the embodiments of the present application.
Fig. 1J is a schematic structural diagram of a tree module according to an embodiment of the present application.
Fig. 1K is a block diagram of another computing device according to an embodiment of the present application.
Fig. 1L is a block diagram of another computing device according to an embodiment of the present disclosure.
Fig. 1M is a block diagram of another computing device according to an embodiment of the present disclosure.
Fig. 1N is a block diagram of another computing device according to an embodiment of the present application.
Fig. 2A is a schematic structural diagram of a computing device according to an embodiment of the present application.
Fig. 3A is a schematic structural diagram of a board card provided in the embodiment of the present application.
Fig. 3B is a schematic structural diagram of a board card provided in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
First, a computing device as used herein is described. Referring to fig. 1A, there is provided a computing device including a storage unit 10 and a controller unit 11, wherein the storage unit 10 includes: a 3D decoder 201 and a 3D memory 202;
the controller unit 11 is configured to send an access instruction to the 3D decoder.
In an alternative, when the access instruction is a read instruction, the read instruction carries the address information and the data size of the data to be accessed. In this alternative, the structure of the read instruction may be as shown in the following table:
Layer_id | Cell_id | Row_id | Col_id | Data_size
where Layer_id is the active-layer index address information, indicating the position of the target active layer in the 3D memory; Cell_id is the memory index address information, indicating the position of the target memory in the target active layer; Row_id is the row index address information, indicating the row address, in the target memory, of the target storage space of the data to be accessed; Col_id is the column index address information, indicating the column address of that target storage space in the target memory; and Data_size is the data size, indicating the size of the data to be accessed.
When the access instruction is a write instruction, the write instruction carries the address information and the data to be accessed. In an alternative, the structure of the write instruction may be as shown in the following table:
Layer_id | Cell_id | Row_id | Col_id | Data
where Layer_id, Cell_id, Row_id, and Col_id carry the same active-layer, memory, row, and column index address information as in the read instruction, and Data is the data to be accessed, i.e., the write payload.
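For illustration, the two instruction layouts can be modeled as plain records, as in the following Python sketch; the field names follow the tables above, while the class names, types, and example values are assumptions made here, not part of the patent.
```python
from dataclasses import dataclass

@dataclass
class ReadInstruction:
    # Fields mirror the read-instruction table: which active layer,
    # which memory within it, which row/column, and how much data.
    layer_id: int   # index of the target active layer in the 3D memory
    cell_id: int    # index of the target memory within that layer
    row_id: int     # row address of the target storage space
    col_id: int     # column address of the target storage space
    data_size: int  # size of the data to be read

@dataclass
class WriteInstruction:
    # Same addressing fields; the payload is the data itself.
    layer_id: int
    cell_id: int
    row_id: int
    col_id: int
    data: bytes     # data to be written

# Example: read 64 units from layer 2, memory 0, row 3, column 7.
read_insn = ReadInstruction(layer_id=2, cell_id=0, row_id=3, col_id=7, data_size=64)
```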
The 3D decoder 201 is configured to decode the access instruction transmitted by the controller unit, and obtain address information of data to be accessed carried by the access instruction.
The 3D decoder 201 is further configured to send the address information to the 3D memory.
The 3D memory 202 is configured to access the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
In the technical solution provided by the application, the storage unit is configured as a 3D memory plus a 3D decoder. This satisfies the storage demands of neural network operation and at the same time reduces the data I/O units, solving the problems of high data-transmission latency and high energy consumption during neural network operation and breaking the bottleneck of neural network operation, thereby meeting the actual requirements of users and improving user experience.
Referring to fig. 1B, when the access instruction is a read instruction, the 3D decoder 201 is specifically configured to decode the read instruction transmitted by the controller unit to obtain the address information and the data size of the data to be accessed carried by the read instruction, and to transmit the address information and the data size to the 3D memory.
The 3D memory 202 is specifically configured to read the data to be accessed, which is matched with the data size, from the 3D memory according to the address information transmitted by the 3D decoder.
Referring to fig. 1C, when the access instruction is a write instruction, the 3D decoder 201 is specifically configured to decode the write instruction transmitted by the controller unit to obtain the address information of the data to be accessed and the data to be accessed carried by the write instruction, and to send the address information and the data to be accessed to the 3D memory.
The 3D memory 202 is specifically configured to write the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
Optionally, the 3D memory 202 includes N active layers. Each of the N active layers comprises a 2D memory; the 2D memory in the i-th active layer is connected with the 2D memory in the (i+1)-th active layer, where 1 ≤ i < N and i is an integer; and each 2D memory is obtained by arranging M memory arrays of the same type, where N is a positive integer greater than 1 and M is a positive integer.
In one possible implementation of the present solution, the 2D memory in the i-th active layer and the 2D memory in the (i+1)-th active layer are connected through through-silicon vias (TSVs).
Here, i may be, for example, 1, 2, 3, 4, 5, 7, 8, 9, or the like; M may be, for example, 1, 2, 3, 4, 5, 7, 8, 9, or the like; and N may be, for example, 2, 3, 4, 5, 7, 8, 9, or the like.
The types of the memory may include, for example: dynamic random access memory, static random access memory, registers, flash memory, etc.
For example, a 3D-register may have 4 active layers, each active layer including a 2D register; adjacent 2D registers are connected by TSVs, and each 2D register is obtained by arranging M register arrays, that is, m rows and n columns of registers form a 2D register, where m × n = M and m and n are positive integers greater than or equal to 1.
For another example, a 3D-SRAM may have 4 active layers, each active layer including a 2D SRAM; adjacent 2D SRAMs are connected by TSVs, and each 2D SRAM is obtained by arranging M SRAM arrays, that is, m rows and n columns of SRAM cells form a 2D SRAM, where m × n = M and m and n are positive integers greater than or equal to 1.
For another example, a 3D-DRAM may have 4 active layers, each active layer including a 2D DRAM; adjacent 2D DRAMs are connected by TSVs, and each 2D DRAM is obtained by arranging M DRAM arrays, that is, m rows and n columns of DRAM cells form a 2D DRAM, where m × n = M and m and n are positive integers greater than or equal to 1.
In a possible implementation manner of this solution, the address information includes: active layer index address information, memory index address information, row index address information, and column index address information; when the data to be accessed is accessed in the 3D memory according to the address information transmitted by the 3D decoder, the 3D memory 202 is specifically configured to:
and accessing the data to be accessed to a target storage space in the 3D memory, wherein the target storage space is a storage space corresponding to the row index address information and the column index address information in the target memory, the target memory is a memory corresponding to the memory index address information in a target active layer, and the target active layer is an active layer corresponding to the active layer index address information in the 3D memory.
Optionally, in an embodiment of the present disclosure, the external device 13 and the 3D memory may be as shown in fig. 1D and fig. 1E. The computing apparatus may further include an external device 13, and the external storage unit of the external device 13 includes: a 3D-dynamic random access memory and a 3D-static random access memory. The data to be accessed comprises: input data and scalar data in the input data, where the input data comprises: input neuron data and weight data;
when the external storage unit is the 3D-dynamic random access memory, the 3D memory 202 includes: a 3D-SRAM and a 3D-register;
the 3D-static random access memory is used for storing the input data;
the 3D-register is used for storing the scalar data;
or, when the external storage unit is the 3D-static random access memory, the 3D memory 202 is a 3D-register;
the 3D-register is used for storing the input data;
the 3D-register is further used for storing the scalar data.
Optionally, in another embodiment of the present disclosure, as shown in fig. 1F, when the external storage unit 301 is the 3D-DRAM 3012, the 3D memory 202 enters a first operating mode, where the first operating mode includes:
the 3D-SRAM 2021 accesses the input data in the 3D-SRAM according to the address information transmitted by the 3D decoder; and
the 3D-register 2022 accesses the scalar data in the 3D-register according to the address information transmitted by the 3D decoder;
or, when the external storage unit is the 3D-SRAM 3011, the 3D memory 202 enters a second operating mode, where the second operating mode includes:
the 3D-register 2022 accesses the input data and the scalar data in the 3D-register according to the address information transmitted by the 3D decoder.
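The two operating modes reduce to a routing rule: with an external 3D-DRAM, input data is served by the on-chip 3D-SRAM and scalar data by the 3D-register; with an external 3D-SRAM, the 3D-register serves both. A minimal sketch of that rule, with function and label names assumed here:
```python
def route_access(external_storage: str, kind: str) -> str:
    """Pick which on-chip 3D memory serves an access, per the two
    operating modes. `kind` is 'input' (neuron/weight data) or 'scalar'."""
    if external_storage == "3D-DRAM":      # first operating mode
        return "3D-SRAM" if kind == "input" else "3D-register"
    elif external_storage == "3D-SRAM":    # second operating mode
        return "3D-register"               # holds both input and scalar data
    raise ValueError("unknown external storage unit")

assert route_access("3D-DRAM", "input") == "3D-SRAM"
assert route_access("3D-SRAM", "scalar") == "3D-register"
```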
In one embodiment of the present solution, the computing device is configured to perform machine learning calculations and further comprises an arithmetic unit 12 connected to the controller unit 11, the arithmetic unit 12 comprising: a master processing circuit and a plurality of slave processing circuits;
a controller unit 11, further configured to obtain input data and a calculation instruction; in an alternative, the input data and the calculation instruction may be obtained through a data input/output unit, and the data input/output unit may be one or more data I/O interfaces or I/O pins.
The above calculation instructions include, but are not limited to: a convolution operation instruction, a forward training instruction, or other neural network operation instructions; the present application does not limit the specific expression of the above calculation instruction.
The controller unit 11 is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
a master processing circuit 101, configured to perform preprocessing on the input data and to transmit data and operation instructions to and from the plurality of slave processing circuits;
a plurality of slave processing circuits 102 configured to perform an intermediate operation in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
and the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In the technical solution provided by the application, the arithmetic unit is configured as a one-master multi-slave structure. For the calculation instruction of a forward operation, the data can be split according to that instruction, so that the computation-heavy part can be operated on in parallel by the plurality of slave processing circuits, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption.
Optionally, the machine learning calculation specifically includes an artificial neural network operation, where the input data specifically includes input neuron data and weight data, and the calculation result may specifically be the result of the artificial neural network operation, namely output neuron data.
In a forward operation, after the artificial neural network operation of the previous layer is completed, the operation instruction of the next layer takes the output neurons calculated in the arithmetic unit as the input neurons of the next layer for operation (or performs some operation on those output neurons before using them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer. In a reverse operation, after the reverse operation of the previous layer is completed, the operation instruction of the next layer takes the input-neuron gradients calculated in the arithmetic unit as the output-neuron gradients of the next layer for operation (or performs some operation on those gradients before using them as the output-neuron gradients of the next layer), and at the same time replaces the weights with the weights of the next layer.
The above-described machine learning calculations may also include support vector machine operations, k-nearest neighbor (k-nn) operations, k-means (k-means) operations, principal component analysis operations, and the like. For convenience of description, the following takes artificial neural network operation as an example to illustrate a specific scheme of machine learning calculation.
For an artificial neural network operation with multilayer operation, the input neurons and output neurons of the multilayer operation do not refer to the neurons in the input layer and the output layer of the whole network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer are the output neurons. Taking a convolutional neural network as an example, let a convolutional neural network have L layers, and let K = 1, 2, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer, whose neurons are the input neurons, and the (K+1)-th layer is called the output layer, whose neurons are the output neurons. That is, except for the topmost layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
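This layer-pairing convention is what an ordinary forward-pass loop implements: each iteration's output neurons are fed in as the next iteration's input neurons. A hedged NumPy sketch, with shapes and the relu activation assumed here:
```python
import numpy as np

def forward(x, weights, biases):
    """Chain L-1 layer operations: for each adjacent pair (K, K+1),
    layer K's neurons are the inputs and layer (K+1)'s the outputs."""
    a = x
    for w, b in zip(weights, biases):
        a = np.maximum(0.0, a @ w + b)  # output neurons of this layer...
    return a                            # ...become inputs of the next

rng = np.random.default_rng(0)
ws = [rng.standard_normal((4, 3)), rng.standard_normal((3, 2))]
bs = [np.zeros(3), np.zeros(2)]
y = forward(rng.standard_normal(4), ws, bs)  # final output neurons
```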
Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
an instruction storage unit 110, configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to analyze the calculation instruction to obtain a plurality of operation instructions;
a storage queue unit 113, configured to store an instruction queue, the instruction queue comprising a plurality of operation instructions or calculation instructions to be executed in the front-to-back order of the queue.
For example, in an alternative embodiment, the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, specifically configured to decode instructions into microinstructions. Of course, in another alternative, the slave arithmetic processing circuit may also include another controller unit that includes a slave instruction processing unit, specifically for receiving and processing microinstructions. The micro instruction may be a next-stage instruction of the instruction, and the micro instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.
In one alternative, the structure of the calculation instruction may be as shown in the following table.
Operation code | Register or immediate | Register/immediate | ...
The ellipses in the above table indicate that multiple registers or immediate numbers may be included.
In another alternative, the computing instructions may include: one or more operation domains and an opcode. The computation instructions may include neural network operation instructions. Taking the neural network operation instruction as an example, as shown in table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be a number of one or more registers.
[Table 1: neural network operation instruction format — an operation code followed by register numbers 0 to 4 as operation domains; the table appears as an image in the original publication.]
The register may be an off-chip memory; in practical applications, it may also be an on-chip memory for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1: when n = 1 the data is 1-dimensional data, i.e., a vector; when n = 2 it is 2-dimensional data, i.e., a matrix; and when n ≥ 3 it is a multidimensional tensor.
Optionally, the controller unit may further include:
the dependency processing unit 108 is configured to, when there are multiple operation instructions, determine whether a first operation instruction is associated with a zeroth operation instruction preceding it; if the two are associated, cache the first operation instruction in the instruction storage unit, and after the zeroth operation instruction finishes executing, extract the first operation instruction from the instruction storage unit and transmit it to the arithmetic unit;
determining whether the first operation instruction is associated with the zeroth operation instruction preceding it comprises:
extracting, according to the first operation instruction, a first storage address interval of the data (for example, a matrix) required by that instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by that instruction; if the first storage address interval overlaps the zeroth storage address interval, determining that the first operation instruction is associated with the zeroth operation instruction, and if there is no overlap, determining that the two are not associated.
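The association test is thus an interval-intersection check on storage address ranges. A minimal sketch, assuming half-open [start, end) intervals:
```python
def has_dependency(first_interval, zeroth_interval) -> bool:
    """Return True if the first instruction's required-data address
    interval overlaps the zeroth instruction's, i.e. the instructions
    are associated and must execute in order."""
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    return f_start < z_end and z_start < f_end  # non-empty intersection

# Overlapping intervals -> associated -> the first instruction must wait.
assert has_dependency((0x100, 0x200), (0x180, 0x280)) is True
assert has_dependency((0x100, 0x200), (0x200, 0x300)) is False
```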
In another alternative embodiment, the arithmetic unit 12 may include one master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 1G and 1H. In one embodiment, as shown in fig. 1G and 1H, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the adjacent slave processing circuits, and the master processing circuit is connected with k slave processing circuits among the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column. It should be noted that the k slave processing circuits shown in fig. 1G and 1H include only these circuits; that is, the k slave processing circuits are the slave processing circuits directly connected with the master processing circuit.
The k slave processing circuits are configured to forward data and instructions between the master processing circuit and the remaining slave processing circuits.
In another embodiment, the operation instruction is, for example, a matrix-multiply-matrix instruction, an accumulation instruction, or an activation instruction.
The specific calculation method of the computing apparatus shown in fig. 1A is described below through a neural network operation instruction. For a neural network operation instruction, the formula it actually needs to execute may be s(∑ w·x_i + b): the weights w are multiplied by the input data x_i, the products are summed, the bias b is added, and the activation operation s(h) is performed to obtain the final output result s.
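As a worked instance of s(∑ w·x_i + b), here is a toy computation with assumed values and a sigmoid chosen as the activation s:
```python
import math

def neuron(x, w, b):
    """Compute s(sum_i w_i * x_i + b) with a sigmoid activation s."""
    h = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-h))

# w.x = 1*0.5 + 2*(-0.25) = 0.0; plus b = 0.0 -> s(0) = 0.5
print(neuron(x=[0.5, -0.25], w=[1.0, 2.0], b=0.0))  # 0.5
```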
In an alternative embodiment, as shown in fig. 1I, the arithmetic unit comprises: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits;
the tree module has a transceiving function, for example, as shown in fig. 1I, the tree module is a transmitting function, and as shown in fig. 2A, the tree module is a receiving function.
The tree module is configured to forward data blocks, weights, and operation instructions between the master processing circuit and the plurality of slave processing circuits.
Optionally, the tree module is an optional component of the computing device and may include at least one level of nodes. The nodes are line structures with a forwarding function, and the nodes themselves may have no computing function. If the tree module has zero levels of nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example, the binary tree structure shown in fig. 1J, or a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment does not limit the specific value of n; the number of levels may be 2, and the slave processing circuits may be connected to nodes of levels other than the penultimate level, for example, the nodes of the last level shown in fig. 1J.
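Because the nodes only forward, gathering slave results through a binary tree amounts to pairwise concatenation up the levels, with any actual combining left to the master processing circuit. A sketch, assuming a power-of-two number of slaves:
```python
def tree_gather(leaf_results):
    """Forward slave results upward through a binary tree whose nodes
    only concatenate-and-forward; the master combines them afterwards.
    Assumes len(leaf_results) is a power of two."""
    level = [[r] for r in leaf_results]
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# 4 slaves' intermediate results arrive at the root in leaf order.
assert tree_gather([10, 20, 30, 40]) == [10, 20, 30, 40]
```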
Optionally, the arithmetic unit may carry a separate cache; as shown in fig. 1K, it may include a neuron caching unit 63, which caches the input neuron vector data and the output neuron value data of the slave processing circuits.
As shown in fig. 1L, the arithmetic unit may further include a weight caching unit 64, configured to cache the weight data required by the slave processing circuits during calculation.
In an alternative embodiment, the arithmetic unit 12, as shown in fig. 1M and 1N, may include a branch processing circuit 103; the specific connection structure is shown in fig. 1M and 1N, wherein,
the main processing circuit 101 is connected to one or more branch processing circuits 103, and each branch processing circuit 103 is connected to one or more slave processing circuits 102;
the branch processing circuit 103 is configured to forward data or instructions between the main processing circuit 101 and the slave processing circuits 102.
In an alternative embodiment, taking the fully-connected operation in a neural network as an example, the process may be y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is a bias scalar, and f is an activation function, which may specifically be the sigmoid, tanh, relu, or softmax function. Assume here a binary tree structure with 8 slave processing circuits; the implementation method may be:
the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;
the main processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, splits the weight matrix w into 8 sub-matrices, distributes the 8 sub-matrices to the 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;
the slave processing circuits execute, in parallel, the multiply-and-accumulate operations of the 8 sub-matrices with the input neuron matrix x to obtain 8 intermediate results, and the 8 intermediate results are sent to the master processing circuit;
the main processing circuit orders the 8 intermediate results to obtain the wx operation result, performs the bias-b operation on that result and then the activation operation to obtain the final result y, sends the final result y to the controller unit, and the controller unit outputs the final result y or stores it in the storage unit.
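The flow just described can be mimicked in a few lines of NumPy: the weight matrix is the distribution data (split into 8 row blocks), the input x is the broadcast data, each block yields one intermediate result, and the master re-orders the results, adds the bias, and activates. The shapes, the row-wise split, and the sigmoid choice for f are assumptions:
```python
import numpy as np

def fully_connected(x, w, b, n_slaves=8):
    """Simulate the master/slave split of f(wx + b): distribute w as
    n_slaves row blocks, broadcast x, compute one intermediate result
    per slave, then let the master order, bias, and activate."""
    blocks = np.array_split(w, n_slaves, axis=0)  # distribution data
    intermediates = [blk @ x for blk in blocks]   # slaves, conceptually in parallel
    wx = np.concatenate(intermediates)            # master re-orders into wx
    return 1.0 / (1.0 + np.exp(-(wx + b)))        # bias b, then activation f

rng = np.random.default_rng(0)
x = rng.standard_normal(16)        # input neuron matrix x (a vector here)
w = rng.standard_normal((32, 16))  # weight matrix w, split into 8 sub-matrices
y = fully_connected(x, w, b=0.1)
assert y.shape == (32,)
```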
The method for executing the neural network forward operation instruction by the computing device shown in fig. 1A may specifically be:
the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the at least one operation code to the operation unit.
The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted), transmits the weight w and the offset b to the main processing circuit of the arithmetic unit, extracts the input data Xi from the storage unit, and transmits the input data Xi to the main processing circuit.
The main processing circuit determines multiplication operation according to the at least one operation code, determines input data Xi as broadcast data, determines weight data as distribution data, and splits the weight w into n data blocks;
the instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one operation code, and sends the multiplication instruction, the offset instruction and the accumulation instruction to the master processing circuit, the master processing circuit sends the multiplication instruction and the input data Xi to a plurality of slave processing circuits in a broadcasting mode, and distributes the n data blocks to the plurality of slave processing circuits (for example, if the plurality of slave processing circuits are n, each slave processing circuit sends one data block); the plurality of slave processing circuits are used for executing multiplication operation on the input data Xi and the received data block according to the multiplication instruction to obtain an intermediate result, sending the intermediate result to the master processing circuit, executing accumulation operation on the intermediate result sent by the plurality of slave processing circuits according to the accumulation instruction by the master processing circuit to obtain an accumulation result, executing offset b on the accumulation result according to the offset instruction to obtain a final result, and sending the final result to the controller unit.
In addition, the order of addition and multiplication may be reversed.
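Seen at the instruction level, the same flow is three derived instructions: MUL on each slave's data block, ACC on the master over the intermediate results, then the offset. A scalar toy sketch (the names and the scalar blocks are assumptions; the real data blocks are matrix slices):
```python
def run_forward_instruction(x, w_blocks, b):
    """Instruction-level view: slaves execute the multiplication
    instruction on their blocks, the master executes the accumulation
    instruction, and the offset instruction then adds b."""
    intermediates = [wi * x for wi in w_blocks]  # MUL, one per slave
    acc = sum(intermediates)                     # ACC on the master
    return acc + b                               # offset b -> final result

# n = 4 slave circuits, each holding one (scalar) weight block.
print(run_forward_instruction(x=2.0, w_blocks=[1.0, 0.5, -1.0, 2.0], b=0.25))  # 5.25
```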
According to the technical scheme of the application, the multiplication and offset operations of the neural network are achieved through a single instruction, the neural network operation instruction: intermediate results of the neural network calculation need not be stored or fetched, which reduces the storage and fetching of intermediate data. The scheme therefore has the advantages of reducing the corresponding operation steps and improving the calculation effect of the neural network.
The application also discloses a machine learning operation device, which comprises one or more of the computing devices mentioned in this application and is configured to acquire data to be operated on and control information from other processing devices, execute specified machine learning operations, and transmit the execution results to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example, interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In that case, the computing devices may share the same control system or have separate control systems, and may share memory or have separate memories for each accelerator. In addition, the interconnection may follow any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
In some embodiments, a chip including the above machine learning computing device is also claimed.
In some embodiments, a board card is provided, which includes the above chip package structure. Referring to fig. 3A and 3B, the two boards shown there may include, in addition to the chip 389, other supporting components, including but not limited to: a memory device 390, an interface apparatus 391, a control device 392, and an external device 394;
the external storage unit of the external device 394 may be a 3D-dynamic random access memory, or a 3D-static random access memory;
the memory device 390 is connected to the chip in the chip package structure through a bus for storing data. The memory device may include the memory cell 393 of fig. 3A or 3B. The storage unit is connected with the chip through a bus. It is understood that, when the external storage unit of the external device 394 is a 3D-dynamic random access memory, the storage unit may be a 3D decoder, a 3D-static random access memory, and a 3D-register; when the external storage unit of the external device 394 is a 3D-sram, the storage unit is a 3D decoder and a 3D-register.
The interface device is electrically connected with a chip in the chip packaging structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface. For example, the data to be processed is transmitted to the chip by the server through the standard PCIE interface, so as to implement data transfer. Preferably, when PCIE3.0X16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface, and the present application does not limit the concrete expression of the other interface, and the interface unit may implement the switching function. In addition, the calculation result of the chip is still transmitted back to an external device (e.g., a server) by the interface device.
The control device is electrically connected with the chip and is configured to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads; it can therefore be in different working states such as heavy load and light load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an aircraft, a ship, and/or a motor vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only one kind of logical-function division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by relevant hardware instructed by a program, and the program may be stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (13)

1. A computing device, comprising a storage unit and a controller unit, wherein the storage unit comprises: a 3D decoder and a 3D memory;
the controller unit is used for sending an access instruction to the 3D decoder;
the 3D decoder is used for decoding the access instruction transmitted by the controller unit to obtain address information of data to be accessed carried by the access instruction;
the 3D decoder is further used for sending the address information to the 3D memory;
the 3D memory is used for accessing the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
2. The computing device of claim 1, wherein the 3D memory comprises N active layers, each of the N active layers comprises a 2D memory, the 2D memory in the i-th active layer is connected with the 2D memory in the (i+1)-th active layer, 1 ≤ i < N, i is an integer, and each 2D memory is obtained by arranging M memory arrays of the same type, where N is a positive integer greater than 1 and M is a positive integer.
3. The computing device of claim 1 or 2, wherein the address information comprises: active layer index address information, memory index address information, row index address information, and column index address information; when the data to be accessed is accessed in the 3D memory according to the address information transmitted by the 3D decoder, the 3D memory is specifically configured to:
and accessing the data to be accessed to a target storage space in the 3D memory, wherein the target storage space is a storage space corresponding to the row index address information and the column index address information in the target memory, the target memory is a memory corresponding to the memory index address information in a target active layer, and the target active layer is an active layer corresponding to the active layer index address information in the 3D memory.
4. The computing apparatus according to claim 1, wherein the computing apparatus includes an external device, and the external storage unit of the external device includes: a 3D-dynamic random access memory and a 3D-static random access memory; the data to be accessed comprises: input data and scalar data in the input data, wherein the input data comprises: input neuron data and weight data;
when the external storage unit is the 3D-dynamic random access memory, the 3D memory includes: 3D-SRAM and 3D-register;
the 3D-static random access memory is used for storing the input data;
the 3D-register is used for storing the scalar data;
or, when the external storage unit is the 3D-static random access memory, the 3D memory is a 3D-register;
the 3D-register is used for storing the input data;
the 3D-register is further used for storing the scalar data.
5. The computing device of claim 4, wherein when the external storage unit is the 3D-DRAM, the 3D memory enters a first operating mode, and the first operating mode comprises:
the 3D-SRAM accesses the input data in the 3D-SRAM according to the address information transmitted by the 3D decoder; and
the 3D-register accesses the scalar data in the 3D-register according to the address information transmitted by the 3D decoder;
or, when the external storage unit is the 3D-static random access memory, the 3D memory enters a second operating mode, where the second operating mode includes:
the 3D-register accesses the input data and the scalar data in the 3D-register according to the address information transmitted by the 3D decoder.
6. A neural network chip, comprising a computing device as claimed in any one of claims 1 to 5.
7. An electronic device, characterized in that it comprises a chip according to claim 6.
8. A board card, wherein the board card comprises: a memory device, an interface apparatus, a control device, and the neural network chip of claim 7;
wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
and the control device is used for monitoring the state of the chip.
9. A computing method for executing a machine learning model, wherein the computing method is applied to a computing apparatus including a storage unit and a controller unit, wherein the storage unit includes: a 3D decoder and a 3D memory; the method comprises the following steps:
the controller unit sends an access instruction to the 3D decoder;
the 3D decoder decodes the access instruction transmitted by the controller unit to obtain address information of the data to be accessed carried by the access instruction;
the 3D decoder sends the address information to the 3D memory; and
the 3D memory accesses the data to be accessed in the 3D memory according to the address information transmitted by the 3D decoder.
10. The method of claim 9, wherein the 3D memory comprises N active layers, each of the N active layers comprises a 2D memory, the 2D memory in the i-th active layer is connected with the 2D memory in the (i+1)-th active layer, 1 ≤ i < N, i is an integer, and each 2D memory is obtained by arranging M memory arrays of the same type, where N is a positive integer greater than 1 and M is a positive integer.
11. The method according to claim 9 or 10, wherein the address information comprises: active-layer index address information, memory index address information, row index address information, and column index address information; when accessing the data to be accessed according to the address information transmitted by the 3D decoder, the 3D memory accesses the data to be accessed in a target storage space in the 3D memory, where the target storage space is the storage space corresponding to the row index address information and the column index address information in the target memory, the target memory is the memory corresponding to the memory index address information in the target active layer, and the target active layer is the active layer corresponding to the active-layer index address information in the 3D memory.
12. The method of claim 9, wherein the computing device comprises an external device, and the external storage unit of the external device comprises: a 3D-dynamic random access memory and a 3D-static random access memory; the data to be accessed comprises: input data and scalar data in the input data, wherein the input data comprises: input neuron data and weight data;
when the external storage unit is the 3D-dynamic random access memory, the 3D memory includes: 3D-SRAM and 3D-register;
the 3D-static random access memory is used for storing the input data;
the 3D-register is used for storing the scalar data;
or, when the external storage unit is the 3D-static random access memory, the 3D memory is a 3D-register;
the 3D-register is used for storing the input data;
the 3D-register is further used for storing the scalar data.
13. The method of claim 12, wherein, when the external storage unit is the 3D-dynamic random access memory, the 3D memory enters a first operating mode, and the first operating mode comprises:
the 3D-SRAM accesses the input data in the 3D-SRAM according to the address information transmitted by the 3D decoder; and
the 3D-register accesses the scalar data in the 3D-register according to the address information transmitted by the 3D decoder;
or, when the external storage unit is the 3D-static random access memory, the 3D memory enters a second operating mode, where the second operating mode includes:
the 3D-register accesses the input data and the scalar data in the 3D-register according to the address information transmitted by the 3D decoder.
CN201910229823.4A 2019-03-25 2019-03-25 Computing device and related product Active CN111738429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910229823.4A CN111738429B (en) 2019-03-25 2019-03-25 Computing device and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910229823.4A CN111738429B (en) 2019-03-25 2019-03-25 Computing device and related product

Publications (2)

Publication Number Publication Date
CN111738429A true CN111738429A (en) 2020-10-02
CN111738429B CN111738429B (en) 2023-10-13

Family

ID=72646330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910229823.4A Active CN111738429B (en) 2019-03-25 2019-03-25 Computing device and related product

Country Status (1)

Country Link
CN (1) CN111738429B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5448682A (en) * 1992-05-30 1995-09-05 Gold Star Electron Co., Ltd. Programmable multilayer neural network
US20130262358A1 (en) * 2010-12-08 2013-10-03 Rodolphe Heliot Electronic circuit with neuromorphic architecture
WO2015016640A1 (en) * 2013-08-02 2015-02-05 Ahn Byungik Neural network computing device, system and method
CN105404925A (en) * 2015-11-02 2016-03-16 上海新储集成电路有限公司 Three-dimensional nerve network chip
CN106991477A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of artificial neural network compression-encoding device and method
CN105789139A (en) * 2016-03-31 2016-07-20 上海新储集成电路有限公司 Method for preparing neural network chip
CN109478139A (en) * 2016-08-13 2019-03-15 英特尔公司 Device, method and system for the access synchronized in shared memory
CN108154227A (en) * 2016-12-06 2018-06-12 上海磁宇信息科技有限公司 A kind of neural network chip calculated using simulation
CN109104876A (en) * 2017-04-20 2018-12-28 上海寒武纪信息科技有限公司 A kind of arithmetic unit and Related product
CN107862380A (en) * 2017-10-19 2018-03-30 珠海格力电器股份有限公司 Artificial Neural Network Operation Circuit
US20190042250A1 (en) * 2018-06-08 2019-02-07 Intel Corporation Variable format, variable sparsity matrix multiplication instruction

Also Published As

Publication number Publication date
CN111738429B (en) 2023-10-13

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant