CN111353591A - Computing device and related product

Info

Publication number: CN111353591A
Application number: CN201811566331.6A
Authority: CN
Inventors: Inventor name withheld from publication
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2020-06-30
Also published as: CN111353598A

Abstract

The application provides a computing device and a related product. The computing device comprises a compression unit, an arithmetic unit and a controller unit. The controller unit is used for acquiring a compression request for first input data and instructing the compression unit to compress the first input data according to the compression request, wherein the first input data comprises a first weight matrix. The compression unit is used for compressing the first weight matrix into a second weight matrix. The controller unit is also used for executing the neural network calculation according to the second input data and the calculation instruction. With the method and the device, the topological structure of the neural network model can be kept unchanged during neural network compression, so that the topological structure is prevented from becoming irregular and the operation amount of the neural network is reduced.

Description

Computing device and related product

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a computing device and a related product.

Background

A neural network is a mathematical computation model that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. The network is formed by a large number of interconnected nodes (also called neurons). It uses input neuron data and weights to generate output data, simulating the way the human brain processes information, and produces a result after pattern recognition by adjusting the interconnections among its many internal nodes.

At present, neural networks are widely applied in various fields of computer vision, such as image recognition, object detection, and image segmentation. In practical applications, however, a neural network model often has a huge number of model parameters (for example, very large-scale weights), so the neural network requires a large amount of computing and storage resources. This overhead reduces the operation speed of the neural network and greatly increases the demands on hardware transmission bandwidth and on the arithmetic units. How to reduce the computation amount of the neural network while reducing the parameters of the neural network model has therefore become very important.

In the prior art, parameters of a neural network model are adjusted by a pruning method to reduce the parameters of the neural network model and reduce the calculation amount of the neural network. Taking pruning of the weights of the neural network as an example, as shown in fig. 1A, before pruning of the weights of the neural network, the topology of the neural network is regular, however, after pruning of the weights of the neural network, the original regular topology in the neural network model is likely to become irregular. How to avoid the topology in the neural network model from becoming irregular is a technical problem to be solved urgently.

Disclosure of Invention

The embodiment of the application provides a computing device and a related product, and in the process of compressing a neural network, the topological structure of a neural network model can be kept unchanged, so that the topological structure of the neural network model is prevented from being irregular, and the operation amount of the neural network is reduced.

In a first aspect, a computing device is provided for performing the machine learning computation of a machine learning model, the computing device comprising: a compression unit, an arithmetic unit, and a controller unit;

the controller unit is used for acquiring a compression request for first input data and instructing the compression unit to compress the first input data according to the compression request; wherein the first input data comprises a first weight matrix;

the compression unit is used for compressing the first weight matrix into a second weight matrix;

the controller unit is also used for acquiring second input data and a calculation instruction; the second input data comprises the second weight matrix and input neuron data;

the controller unit is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the second input data to the arithmetic unit;

the arithmetic unit is used for acquiring the operation instructions and executing neural network calculation according to the operation instructions and the second input data.

According to the embodiments of the application, the compression unit compresses the first weight matrix to obtain the second weight matrix, and the neural network computation is then executed according to the second weight matrix and the input neuron data. This solves the prior-art problem that a neural network pruning algorithm makes the topological structure of the neural network irregular, allows the neural network to be deeply compressed, reduces its computation amount, and improves the computation speed.

In a second aspect, the present application provides a machine learning arithmetic device, which includes one or more computing devices according to the first aspect. The machine learning arithmetic device is used for acquiring data to be operated on and control information from other processing devices, executing the specified machine learning operation, and transmitting the execution result to other processing devices through an I/O interface;

when the machine learning arithmetic device comprises a plurality of computing devices, the plurality of computing devices can be linked through a specific structure and transmit data;

the plurality of computing devices are interconnected through a PCIE bus and transmit data so as to support operation of larger-scale machine learning; a plurality of the computing devices share the same control system or own respective control systems; the computing devices share the memory or own the memory; the plurality of computing devices are interconnected in any interconnection topology.

In a third aspect, an embodiment of the present application provides a combined processing device, which includes the machine learning arithmetic device according to the second aspect, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user. The combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing devices, respectively, and stores data of the machine learning arithmetic device and the other processing devices.

In a fourth aspect, an embodiment of the present application provides a neural network chip, where the neural network chip includes the computing device according to the first aspect, the machine learning arithmetic device according to the second aspect, or the combined processing device according to the third aspect.

In a fifth aspect, an embodiment of the present application provides a neural network chip package structure, which includes the neural network chip described in the fourth aspect.

In a sixth aspect, an embodiment of the present application provides a board card, where the board card includes the neural network chip package structure described in the fifth aspect.

In a seventh aspect, an embodiment of the present application provides an electronic device, where the electronic device includes the neural network chip described in the fourth aspect or the board card described in the sixth aspect.

In an eighth aspect, embodiments of the present application further provide a computing method for executing a machine learning model, where the computing method is applied to a computing device, and the computing device is used for executing machine learning computation; the computing device includes: a compression unit, an arithmetic unit, and a controller unit; the method comprises the following steps:

the controller unit acquires a compression request for first input data and instructs the compression unit to compress the first input data according to the compression request; wherein the first input data comprises a first weight matrix;

and the arithmetic unit acquires the operation instructions and executes neural network calculation according to the operation instructions and the second input data.

In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

In some embodiments, the vehicle comprises an aircraft, a ship, and/or a car; the household appliance comprises a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and/or a range hood; the medical device comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1A is a schematic diagram illustrating an operation of pruning a neural network according to an embodiment of the present disclosure;

FIG. 1B is a schematic diagram of a computing device according to an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a control unit provided in an embodiment of the present application;

FIG. 3 is a schematic flowchart of a neural network operation method according to an embodiment of the present disclosure;

FIG. 4 is a schematic flowchart of a neural network compression method according to an embodiment of the present application;

FIG. 5A is a schematic diagram of a neural network architecture provided by an embodiment of the present application;

FIG. 5B is a schematic diagram of a fully-connected layer weight matrix according to an embodiment of the present disclosure;

FIG. 5C is a schematic diagram illustrating an operation of compressing a fully-connected layer weight matrix according to an embodiment of the present application;

FIG. 5D is a diagram illustrating a structure of convolution kernels in a convolutional layer according to an embodiment of the present disclosure;

FIG. 5E is a diagram of a fully-connected layer weight matrix according to another embodiment of the present application;

FIG. 5F is a schematic diagram of an operation of compressing the LSTM layer according to an embodiment of the present application;

FIG. 6 is a schematic block diagram of another computing device provided in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a main processing circuit provided in an embodiment of the present application;

FIG. 8 is a schematic block diagram of another computing device provided in an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a tree module provided in an embodiment of the present application;

FIG. 10 is a block diagram of yet another computing device provided in an embodiment of the present application;

FIG. 11 is a block diagram of yet another computing device provided in an embodiment of the present application;

FIG. 12 is a block diagram of another computing device provided in embodiments of the present application;

FIG. 13 is a block diagram of a combined processing apparatus according to an embodiment of the present application;

FIG. 14 is a block diagram of another combined processing device provided in an embodiment of the present application;

FIG. 15 is a schematic structural diagram of a board card provided in an embodiment of the present application;

FIG. 16 is a schematic flowchart of a neural network compression method according to an embodiment of the present application;

FIG. 17A is a schematic structural diagram of a neural network compression device according to an embodiment of the present application;

FIG. 17B is a schematic structural diagram of a compressing unit according to an embodiment of the present application;

FIG. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The application provides a compression unit for compressing a first weight matrix into a second weight matrix, which solves the prior-art problem that a neural network pruning algorithm easily makes the topological structure of the neural network irregular. In practical applications, the compression unit may be used in neural network calculations, and in particular in a computing device for performing neural network calculations. The invention is described below with reference to the computing device shown in FIG. 1B.

Referring to FIG. 1B, which is a schematic structural diagram of a computing device according to an embodiment of the present invention, the computing device is configured to perform machine learning calculation and includes a controller unit 11, an arithmetic unit 12 and a compression unit 13, wherein the controller unit 11 is connected with the arithmetic unit 12 and the compression unit 13, respectively;

the controller unit 11 is configured to obtain a compression request for first input data, and instruct the compression unit to compress the first input data according to the compression request; wherein the first input data comprises a first weight matrix; in an alternative, the compression request may be triggered by a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins;

the compressing unit 13 is configured to compress the first weight matrix into a second weight matrix; wherein the second weight matrix comprises at least two sub-matrices;

in a specific implementation, the compressing unit 13 includes a decomposing unit 131, a solving unit 132, and a training unit 133. The decomposing unit 131 is configured to decompose the first weight matrix into a third weight matrix, wherein the third weight matrix comprises at least two sub-matrices. The solving unit 132 is configured to determine the size of each of the at least two sub-matrices according to a first formula, the first formula being $Q \approx Q_1 * Q_2 * \cdots * Q_n$, wherein $Q$ represents the first weight matrix, $Q_1$ represents a first sub-matrix of the at least two sub-matrices, $Q_2$ represents a second sub-matrix of the at least two sub-matrices, and $Q_n$ represents an nth sub-matrix of the at least two sub-matrices. The training unit 133 is configured to adjust the size of each of the at least two sub-matrices and to train the compressed machine learning model to obtain a second weight matrix meeting the preset precision.

The controller unit 11 is further configured to obtain second input data and a calculation instruction; the second input data comprises the second weight matrix and input neuron data; in an alternative, the second input data and the calculation instruction may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.

The controller unit 11 is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the second input data to the arithmetic unit 12;

the arithmetic unit 12 is configured to obtain the operation instructions, and execute a neural network calculation according to the operation instructions and the second input data.
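The reason a factored second weight matrix reduces the operation amount can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names `matvec`, `matmul`, `apply_factored` and `multiply_count` are hypothetical, and the hardware units described above are not modeled. The sketch applies the sub-matrices to the input neuron data from right to left instead of first rebuilding the full matrix.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices (used here only to rebuild the full matrix)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def apply_factored(factors: List[Matrix], x: Vector) -> Vector:
    """Compute Q1 * Q2 * ... * Qn * x by applying the factors right to left."""
    for factor in reversed(factors):
        x = matvec(factor, x)
    return x


def multiply_count(factors: List[Matrix]) -> int:
    """Scalar multiplications needed for one matrix-vector pass per factor."""
    return sum(len(f) * len(f[0]) for f in factors)


# Hypothetical example: a 4x4 weight matrix that factors as (4x1) * (1x4).
q1 = [[1], [2], [3], [4]]
q2 = [[1, 0, 2, 1]]
x = [1, 1, 1, 1]
print(apply_factored([q1, q2], x))      # same result as the full product
print(multiply_count([matmul(q1, q2)]), multiply_count([q1, q2]))
```

In this toy case the full matrix needs 16 multiplications per input vector while the two sub-matrices need 8; the layer keeps its input and output sizes, so the network topology stays regular.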

In one implementation, the computing device is provided with a "compression instruction". In this case, the controller unit 11 is configured to obtain the first input data and the compression instruction, wherein the first input data comprises a first weight matrix. In an alternative, the first input data and the compression instruction may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.

The controller unit 11 is further configured to analyze the compression instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the first weight matrix to the compression unit;

the compressing unit 13 is configured to compress the first weight matrix into a second weight matrix according to the plurality of operation instructions;

the controller unit 11 is further configured to obtain second input data and a calculation instruction; the second input data comprises the second weight matrix and input neuron data; in an alternative, the second input data and the calculation instruction may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.

In a specific implementation, the arithmetic unit 12 includes a main processing circuit 101 and a plurality of slave processing circuits 102, where the main processing circuit 101 is configured to perform preprocessing on the second input data and to exchange data and operation instructions with the plurality of slave processing circuits;

a plurality of slave processing circuits 102 configured to perform an intermediate operation in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;

and the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
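The division of labor between the main processing circuit and the slave processing circuits can be sketched in software as follows. This is a minimal sketch under assumptions that the text does not fix: the preprocessing is taken to be a row-wise split of the weight matrix, the subsequent processing is taken to be concatenation followed by a ReLU activation, and threads stand in for the slave circuits. The names `slave_compute` and `master_run` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def slave_compute(rows: Matrix, x: Vector) -> Vector:
    """Intermediate operation of one slave: dot products for its block of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def master_run(weights: Matrix, x: Vector, num_slaves: int = 2) -> Vector:
    # Preprocessing (assumed): split the weight rows into contiguous blocks.
    block = max(1, (len(weights) + num_slaves - 1) // num_slaves)
    blocks = [weights[i:i + block] for i in range(0, len(weights), block)]

    # The slaves work in parallel; pool.map returns results in block order.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(lambda b: slave_compute(b, x), blocks))

    # Subsequent processing (assumed): concatenate, then apply ReLU.
    merged = [value for part in partials for value in part]
    return [max(0.0, value) for value in merged]


print(master_run([[1, 0], [0, 1], [-1, -1]], [2, 3]))
```

Other splits (for example by columns, with the main circuit summing partial results) would fit the same description; the row-wise split is chosen here only because it keeps the merge step trivial.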

Optionally, the second input data may specifically include: a second weight matrix and input neuron data. The calculation result may specifically be: the result of the neural network operation is output neuron data.

In one embodiment, the computing device may further include: the storage unit 10 and the direct memory access unit 50, the storage unit 10 may include: one or any combination of a register and a cache, specifically, the cache is used for storing the calculation instruction; the register is used for storing the input data and a scalar; the cache is a scratch pad cache. The direct memory access unit 50 is used to read or store data from the storage unit 10.

In the embodiment of the present application, as shown in fig. 2, the controller unit 11 includes: an instruction cache unit 110, an instruction processing unit 111, a dependency processing unit 112, and a store queue unit 113;

the instruction cache unit 110 is configured to store computation instructions associated with the artificial neural network operation. While a zeroth computation instruction is being executed, other instructions that have not been submitted for execution are cached in the instruction cache unit 110. After the zeroth computation instruction has been executed, if a first computation instruction is the earliest of the uncommitted instructions in the instruction cache unit 110, the first computation instruction is submitted; once it is submitted, the changes that its operations make to the device state cannot be cancelled;

the instruction processing unit 111 is configured to obtain the computation instruction from the instruction cache unit, and analyze the computation instruction to obtain a plurality of operation instructions;

the dependency processing unit 112 is configured to determine, when there are multiple operation instructions, whether a first operation instruction has an association relationship with a zeroth operation instruction that precedes it. If so, the first operation instruction is stored in the store queue unit 113; after the zeroth operation instruction has been executed and the association relationship is thereby released, the first operation instruction is extracted from the store queue unit 113 and transmitted to the arithmetic unit;

the determining whether the first operation instruction has an association relationship with a zeroth operation instruction before the first operation instruction comprises:

extracting a first storage address interval of required data (such as a matrix) in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of the required matrix in the zeroth operation instruction according to the zeroth operation instruction, if the first storage address interval and the zeroth storage address interval have an overlapped area, determining that the first operation instruction and the zeroth operation instruction have an association relationship, and if the first storage address interval and the zeroth storage address interval do not have an overlapped area, determining that the first operation instruction and the zeroth operation instruction do not have an association relationship.
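The address-interval test described above can be sketched in a few lines of Python. The sketch assumes half-open intervals `[start, end)`, which the text does not specify, and the names `OperationInstruction`, `has_association`, `schedule` and `release` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class OperationInstruction:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last address (half-open interval, an assumption)


def has_association(first: OperationInstruction, zeroth: OperationInstruction) -> bool:
    """True if the two storage address intervals share an overlapping region."""
    return first.start < zeroth.end and zeroth.start < first.end


def schedule(zeroth: OperationInstruction, first: OperationInstruction,
             store_queue: Deque[OperationInstruction]) -> str:
    """Hold the first instruction in the store queue if it depends on the zeroth."""
    if has_association(first, zeroth):
        store_queue.append(first)
        return "queued"
    return "issued"


def release(store_queue: Deque[OperationInstruction]) -> Optional[OperationInstruction]:
    """Called after the zeroth instruction has executed: pop the oldest held one."""
    return store_queue.popleft() if store_queue else None


queue: Deque[OperationInstruction] = deque()
zeroth = OperationInstruction("op0", 0, 64)
print(schedule(zeroth, OperationInstruction("op1", 32, 96), queue))   # overlaps op0
print(schedule(zeroth, OperationInstruction("op2", 64, 128), queue))  # disjoint
```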

The store queue unit 113 is configured to store an instruction queue, which comprises a plurality of operation instructions or calculation instructions to be executed in the front-to-back order of the queue.

In the embodiment of the present application, as shown in fig. 2, the instruction processing unit 111 includes an instruction fetching module, a decoding module, and an instruction queue, where the instruction fetching module is configured to obtain a computation instruction of a neural network from the instruction cache unit 110; the decoding module is used for decoding the calculation instruction acquired by the instruction fetching module to obtain an operation instruction of the neural network; and the instruction queue is used for sequentially storing the operation instructions obtained after decoding according to the sequence to be executed.

For example, in an alternative embodiment, the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, specifically configured to decode instructions into microinstructions. Of course, in another alternative, the slave arithmetic processing circuit may also include another controller unit that includes a slave instruction processing unit, specifically for receiving and processing microinstructions. The micro instruction may be a next-stage instruction of the instruction, and the micro instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.

In one alternative, the structure of the calculation instruction may be as shown in the following table.

TABLE 1

Operation code | Register or immediate | Register or immediate | ...

The ellipsis in the above table indicates that multiple registers or immediates may be included.
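The layout in Table 1 (one operation code followed by any number of register or immediate fields) can be modeled as a small data structure. The sketch below is illustrative only: the textual encoding `COMPRESS r0, r1, #8`, the opcode name, and the class names are all assumptions introduced here, since the source does not define a concrete instruction encoding.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Register:
    number: int


@dataclass(frozen=True)
class Immediate:
    value: int


Operand = Union[Register, Immediate]


@dataclass(frozen=True)
class ComputationInstruction:
    opcode: str
    operands: List[Operand]


def parse(text: str) -> ComputationInstruction:
    """Parse a hypothetical textual form such as 'COMPRESS r0, r1, #8'."""
    opcode, _, rest = text.strip().partition(" ")
    operands: List[Operand] = []
    for token in filter(None, (t.strip() for t in rest.split(","))):
        if token.startswith("r"):
            operands.append(Register(int(token[1:])))    # register number
        elif token.startswith("#"):
            operands.append(Immediate(int(token[1:])))   # immediate value
        else:
            raise ValueError(f"unrecognized operand: {token!r}")
    return ComputationInstruction(opcode, operands)


print(parse("COMPRESS r0, r1, #8"))
```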

In another alternative, the computing instructions may include: one or more operation domains and an opcode. The computation instructions may include neural network operation instructions, and may also include compression instructions as referred to above. Taking the neural network operation instruction as an example, as shown in table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be a number of one or more registers.

TABLE 2

The register may be an off-chip memory, and in practical applications, may also be an on-chip memory for storing data, where the data may specifically be n-dimensional data, where n is an integer greater than or equal to 1, and for example, when n is equal to 1, the data is 1-dimensional data, that is, a vector, and when n is equal to 2, the data is 2-dimensional data, that is, a matrix, and when n is equal to 3 or more, the data is a multidimensional tensor.

In an embodiment of the present invention, a process of the computing device executing the neural network operation is shown in fig. 3, and includes:

Step S1, the controller unit receives the compression instruction, decodes and analyzes the compression instruction into a plurality of operation instructions, and sends the plurality of operation instructions to the compression unit.

After the controller unit reads the compression instruction from the storage unit, the controller unit analyzes the compression instruction into operation instructions and sends them to the compression unit. Specifically, the instruction fetching module of the instruction processing unit 111 in the controller unit 11 obtains the compression instruction from the instruction cache unit 110 and transmits it to the decoding module. The decoding module decodes the compression instruction to obtain the operation instruction and splits the operation instruction into the operation code and the different operation domains according to the preset instruction rule; the composition and function of the operation code and the operation domains are as described above and are not repeated here. The decoding module transmits the decoded operation instruction to the instruction queue for sequential storage. In the instruction queue, the data address of the data to be processed by the operation instruction is acquired according to the operation code and the operation domains of the operation instruction, and the data address is transmitted to the dependency processing unit 112. The dependency processing unit analyzes whether the instruction has an association relationship with an instruction that is being executed. If so, the operation instruction is stored in the store queue unit 113 until the association relationship is released; if not, the operation instruction is transmitted to the compression unit to execute the corresponding operation.

Step S2, the compression unit receives the operation instructions sent by the controller unit, and carries out compression processing on the first weight matrix read from the storage unit, so as to obtain a second weight matrix meeting the preset precision.

In the following, referring to the schematic flow chart of the neural network compression method provided in the embodiment of the present invention shown in fig. 4, how to implement compression on the first weight matrix to obtain the second weight matrix in the embodiment of the present invention is specifically described, which may include, but is not limited to, the following steps:

Step S21, decomposing the first weight matrix into a third weight matrix; wherein the third weight matrix comprises at least two sub-matrices.

In a specific implementation, the weight data in the first weight matrix may be any real number. Here, the weight data refers to a connection value between layers of the neural network, that is, information transfer strength between neurons.

In one embodiment, the third weight matrix includes two sub-matrices, and each of the two sub-matrices includes a compression parameter K. Here, the compression parameter K is an unknown number, that is, when the first weight matrix is decomposed, it may be determined that the first weight matrix may be decomposed into two sub-matrices, but the size of each of the two sub-matrices is not determined.

In another embodiment, the number of the sub-matrices in the third weight matrix is n, where n is a positive integer greater than 2. The number of compression parameters K included in the n sub-matrices is (n-1). Taking the first weight matrix divided into three sub-matrices as an example, the compression parameters K to be solved may include K1 and K2.
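The relationship between the compression parameters and the sub-matrix sizes can be made concrete with a short sketch. It assumes, as in a standard chained factorization, that each compression parameter K is the shared inner dimension between two neighboring sub-matrices; the source only states that the sub-matrices contain the unknown K, so this reading and the function names `factor_shapes`, `parameter_count` and `compression_ratio` are assumptions.

```python
from typing import List, Tuple


def factor_shapes(rows: int, cols: int, ks: List[int]) -> List[Tuple[int, int]]:
    """Shapes of n sub-matrices for a rows x cols matrix and (n - 1) parameters K."""
    dims = [rows, *ks, cols]
    return [(dims[i], dims[i + 1]) for i in range(len(dims) - 1)]


def parameter_count(shapes: List[Tuple[int, int]]) -> int:
    """Total number of weights stored across all sub-matrices."""
    return sum(r * c for r, c in shapes)


def compression_ratio(rows: int, cols: int, ks: List[int]) -> float:
    """Weights in the original matrix divided by weights in the sub-matrices."""
    return (rows * cols) / parameter_count(factor_shapes(rows, cols, ks))


# Two sub-matrices, one compression parameter K = 64.
print(factor_shapes(1024, 1024, [64]), compression_ratio(1024, 1024, [64]))
# Three sub-matrices, two compression parameters K1 = 64 and K2 = 32.
print(factor_shapes(1024, 1024, [64, 32]))
```

Under this reading, a 1024 x 1024 matrix with K = 64 is stored as 131072 weights instead of 1048576, a ratio of 8.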

Step S22, determining the size of each sub-matrix in the at least two sub-matrices according to a first formula, wherein the first formula is Q ≈ Q₁*Q₂*......*Q_n(ii) a Wherein Q represents a first weight matrix; said Q₁Representing a first sub-matrix of the at least two sub-matrices; said Q₂Representing a second sub-matrix of the at least two sub-matrices; said Q_nRepresents an nth sub-matrix of the at least two sub-matrices.

In a specific implementation, the operation symbol "+" in the first formula represents the multiplication operation of the matrix.

In one embodiment, when the third weight matrix includes two sub-matrices, the first formula may be represented as:

Q ≈ Q₁*Q₂ (1.1)

In another embodiment, when the third weight matrix includes more than two sub-matrices, the first formula may be represented as:

Q ≈ Q₁*Q₂*......*Q_n (1.2)

in the above formula (1.2), n is a positive integer greater than 2.

In a specific implementation, the size of each of the at least two sub-matrices is determined according to the first formula and a second formula, where the second formula is ||Q - Q₁*Q₂*......*Q_n|| ≤ T, wherein T represents a preset error threshold.

In a specific implementation, the predetermined error threshold referred to herein may be between 5% and 10%. It can be understood that the smaller the preset error threshold is set, the better the attribute characteristics of the first weight matrix can be represented by the at least two sub-matrices determined according to the first formula and the second formula.
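The check expressed by the second formula can be sketched as follows. This is a minimal illustration only: it assumes the norm is the Frobenius norm and that the 5% to 10% threshold is read as an error relative to ||Q||, neither of which is stated in the text; the helper names (matmul, frobenius, within_threshold) and the sample matrices are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frobenius(m):
    """Frobenius norm (an assumed choice of norm)."""
    return math.sqrt(sum(v * v for row in m for v in row))

def within_threshold(q, factors, t):
    """True if ||Q - Q1*Q2*...*Qn|| / ||Q|| <= t (relative error is an assumption)."""
    product = factors[0]
    for f in factors[1:]:
        product = matmul(product, f)
    diff = [[q[i][j] - product[i][j] for j in range(len(q[0]))]
            for i in range(len(q))]
    return frobenius(diff) / frobenius(q) <= t

# Hypothetical 2x2 weight matrix approximated by a (2,1) and a (1,2) sub-matrix.
q = [[1.0, 2.0], [2.0, 4.1]]
q1 = [[1.0], [2.0]]
q2 = [[1.0, 2.0]]
ok = within_threshold(q, [q1, q2], 0.05)
```

With these sample values the relative error is roughly 2%, so the decomposition passes a 5% threshold but would fail a 1% threshold.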

Step S23, adjusting the size of each sub-matrix of the at least two sub-matrices, and training the compressed machine learning model to obtain a second weight matrix meeting the preset precision.

In a specific implementation, the process of adjusting the size of each of the at least two sub-matrices is substantially a dynamic variation process of the value of the compression parameter K, so as to find the optimal compression parameter K. As the compression parameter K changes, the compression ratio between the first weight matrix and the second weight matrix also changes.

Taking the application scenario of speech recognition as an example, in a given word sequence, some words may be inserted, deleted or replaced by mistake. For example, for an initially recognized word sequence containing N words, if I words are inserted, D words are deleted, and E words are replaced, then the word error rate (WER) is:

WER＝(I+D+E)/N (1.3)

The word error rate WER is usually expressed as a percentage.
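Formula (1.3) can be evaluated as in the following short sketch. The function name word_error_rate and the sample counts are hypothetical and chosen only for illustration.

```python
def word_error_rate(inserted, deleted, replaced, n_words):
    """WER = (I + D + E) / N, following formula (1.3)."""
    if n_words <= 0:
        raise ValueError("the recognized word sequence must contain at least one word")
    return (inserted + deleted + replaced) / n_words

# Hypothetical counts: 1 insertion, 2 deletions, 3 replacements in 20 words.
wer = word_error_rate(inserted=1, deleted=2, replaced=3, n_words=20)
print(f"WER = {wer:.0%}")
```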

When the neural network model is adopted to identify the word sequence, the detection precision of the word error rate of the word sequence can be obtained. In the embodiment of the present invention, the preset precision referred to herein is the detection precision of the neural network model before compression for the word error rate WER. For example, the preset accuracy is 70%. In general, the error rate WER of the compressed neural network becomes large, which means that the accuracy of the compressed neural network becomes poor.

In the embodiment of the invention, the detection precision of the word error rate of the neural network model corresponding to different compression ratios (different compression parameter K values) is measured to obtain the second weight matrix meeting the preset precision.

In a preferred embodiment, the training unit is configured to adjust a size of each of the at least two sub-matrices and train the compressed machine learning model to obtain a second weight matrix satisfying a preset precision, and includes:

the training unit is specifically configured to adjust the size of each of the at least two sub-matrices, and train the compressed machine learning model to obtain a second weight matrix that meets a preset precision and a compression ratio with the first weight matrix that meets a preset compression ratio.

It can be understood that, in this embodiment, the compression parameter K in the current state not only enables the neural network model to obtain the optimal compression effect, but also enables the compressed neural network model to meet the preset precision when detecting the word error rate WER. When the neural network model is in the optimal compression effect, the operation amount of the neural network can be further reduced.

Taking the fully-connected layer of the neural network as an example, the fully-connected layer means that for the n-1 layer and the n layer, any one node of the n-1 layer is connected with all nodes of the n layer. Specifically, referring to fig. 5A, the structural diagram of a one-dimensional fully-connected layer of a neural network provided in an embodiment of the present invention is shown in fig. 5A, where the neural network includes an input layer, a hidden layer, and an output layer, where a two-dimensional parameter matrix of the fully-connected layer between the input layer and the hidden layer is (3,4), and the two-dimensional parameter matrix (3,4) indicates that, in the fully-connected layer structure between the input layer and the hidden layer, the number of input neurons is 3, the number of output neurons is 4, and the number of weights is 12. In a specific implementation, the 12 weights may be represented as a weight matrix with 4 rows and 3 columns, and the representation form of the weight matrix may be as shown in fig. 5B.

In a fully-connected layer neural network, the first formula includes: M ≈ M₁*M₂; the two sub-matrices comprise a first sub-matrix M₁ and a second sub-matrix M₂, wherein M₁ is an N_in × K matrix and M₂ is a K × N_out matrix; where K is a compression parameter, N_in is the number of input neurons of the neural network, and N_out is the number of output neurons of the neural network; the compression parameter characterizes the number of output neurons of M₁ and the number of input neurons of M₂, and K is a positive integer greater than 0 and less than or equal to min(N_in, N_out).
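The effect of the compression parameter K on the number of stored weights can be illustrated by simple counting. The sketch below is an assumption-labelled illustration: the function name fc_weight_counts and the layer sizes (1024, 512, K = 64) are hypothetical, and the counts are plain arithmetic on the matrix shapes rather than a compression ratio definition taken from the text.

```python
def fc_weight_counts(n_in, n_out, k):
    """Return (weights in M, weights in M1 plus M2) for M ≈ M1*M2."""
    if not 0 < k <= min(n_in, n_out):
        raise ValueError("K must satisfy 0 < K <= min(N_in, N_out)")
    original = n_in * n_out          # M is N_in x N_out
    factored = n_in * k + k * n_out  # M1 is N_in x K, M2 is K x N_out
    return original, factored

# Hypothetical layer sizes chosen for illustration.
original, factored = fc_weight_counts(1024, 512, 64)
print(original, factored)
```

Counting in this way, the two sub-matrices hold fewer weights than the original matrix only when K is smaller than N_in*N_out/(N_in + N_out), which is one reason the value of K has to be tuned.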

As mentioned above, the process of adjusting the size of each of the two sub-matrices is essentially a dynamic variation process of the value of the compression parameter K to find the optimal compression parameter K. In practical application, a binary search mode can be adopted to determine the compression parameter K value in the full-connection layer neural network, so that a second weight matrix meeting the preset precision is obtained. In one embodiment, the compression parameter K determined by the binary search method may enable the second weight matrix to satisfy the predetermined accuracy. In another embodiment, the compression parameter K determined by using the binary search method can enable the second weight matrix to meet the preset precision, and meanwhile, the compression ratio of the first weight matrix and the second weight matrix meets the preset compression ratio, that is, a better compression effect is obtained for the compression of the neural network model.

In a specific implementation, the compression parameter K has different values, i.e. the first weight matrix is compressed based on a plurality of different compression ratios, where in the fully-connected layer neural network, the compression ratio is

The following describes how to determine the value of the compression parameter K by binary search. First, two parameters KL and KR are set. At initialization, KL = 1 and KR = min(N_in, N_out). During parameter adjustment, K = (KL + KR)/2. If the second weight matrix represented by M₁*M₂ causes the accuracy of the compressed neural network model to decrease by X% (here, X is 1 to 10, etc.), the parameter KL is adjusted so that KL = K. If the second weight matrix represented by M₁*M₂ causes the compressed neural network model to meet the preset precision, KR is adjusted so that KR = K. These steps are repeated until the ending condition K = KL or K = KR is met.
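The binary search over K described above can be sketched as follows. This is not a definitive implementation: it assumes integer (floor) division for K = (KL + KR)/2, assumes that precision does not get worse as K grows, and assumes that K = min(N_in, N_out) meets the preset precision. The callback meets_precision is hypothetical and stands in for retraining and evaluating the compressed model at a given K.

```python
def search_compression_parameter(n_in, n_out, meets_precision):
    """Binary search for K using the KL/KR update rule.

    meets_precision(k) is a hypothetical callback that returns True when
    the model compressed with parameter k meets the preset precision.
    """
    kl, kr = 1, min(n_in, n_out)
    while True:
        k = (kl + kr) // 2           # floor division is an assumption
        if k == kl or k == kr:       # ending condition: K = KL or K = KR
            break
        if meets_precision(k):
            kr = k                   # precision is met: try a smaller K
        else:
            kl = k                   # precision dropped: K must grow
    # Which bound to report is not specified in the text; here the smaller
    # bound is preferred when it also meets the precision.
    return kl if meets_precision(kl) else kr

# Toy stand-in for model evaluation: precision is met once K >= 5.
best_k = search_compression_parameter(16, 12, lambda k: k >= 5)
print(best_k)
```

In practice each call to the callback would involve training the compressed model, so the logarithmic number of evaluations is the main benefit of searching this way.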

Taking the fully-connected layer from the input layer to the hidden layer in fig. 5A as an example, the value of the compression parameter K is a positive integer greater than 0 and less than or equal to 3. The compression parameter K is determined to be 2 by the binary search, that is, the first sub-matrix M1 in the second weight matrix satisfying the preset precision is a (3,2) matrix, and the second sub-matrix M2 is a (2,4) matrix. In particular, the compression for this fully connected layer between the input layer to the hidden layer in FIG. 5A may be as shown in FIG. 5C.

In one embodiment, when the expression form of the first formula is as shown in formula (1.2), that is, the number of the sub-matrices in the third weight matrix is n, where n is a positive integer greater than 2, the number of the compression parameters K is (n-1), which can be expressed as K_1, K_2, ......, K_{n-1}. In practical applications, an adaptive algorithm (e.g., a genetic algorithm) may be used to determine the (n-1) compression parameters K in the fully-connected layer neural network, so as to obtain a second weight matrix satisfying a predetermined precision and/or satisfying a compression effect. The following details how the genetic algorithm is used to determine the (n-1) compression parameter K values in the fully-connected layer neural network:

Step 1: randomly generating a population: setting the size of the population to P and the maximum number of iterations to T_max (e.g., T_max = 100). In the initial state, setting the iteration counter t = 0, the crossover probability Pc = A (e.g., A = 0.4), and the mutation probability Pm = B (e.g., B = 0.6); each row of the matrix of the population represents one gene-string individual, and each column represents the number of individuals; here, each individual is one candidate solution for the set of compression parameter values K (e.g., K_j);

Step 2: calculating the fitness of each individual in the population; here, the fitness refers to the compression ratio and/or precision of the first weight matrix and the second weight matrix corresponding to the individual, where the compression ratio is used to represent the compression effect for the neural network;

Step 3: applying the selection operator to the population, and passing the optimized individuals directly to the next generation;

Step 4: applying the crossover operator to the population: for any two individuals, randomly generating a number of positions in the gene strings and exchanging the values of the two individuals at those positions;

Step 5: applying the mutation operator to the population: for any individual, randomly generating a number of positions in the gene string and then changing the values at those positions; here, mutation means randomly changing the value of K_j;

Step 6: retaining the individual with the highest fitness in each generation and carrying it into the next generation;

Step 7: judging whether the maximum number of iterations T_max is reached; if t = T_max, outputting the individual with the maximum fitness and terminating the calculation; otherwise, jumping to step 2 to continue execution.

The (n-1) compression parameter K values in the fully-connected layer neural network can thus be determined according to the genetic algorithm described above.
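The seven steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method of the embodiment: the selection rule (keep the fitter half), the per-position swap in the crossover, the fixed random seed, and the toy fitness function are all assumptions, and genetic_search and toy_fitness are hypothetical names. A real fitness function would train and evaluate the compressed model.

```python
import random

def genetic_search(bounds, fitness, pop_size=20, t_max=100, pc=0.4, pm=0.6, seed=0):
    """Search for (K_1, ..., K_{n-1}); bounds[j] is the largest allowed K_j."""
    if pop_size < 2:
        raise ValueError("pop_size must be at least 2")
    rng = random.Random(seed)
    # Step 1: random initial population, one candidate solution per individual.
    pop = [[rng.randint(1, b) for b in bounds] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(t_max):                       # Step 7: stop after t_max generations
        # Step 2: rank individuals by fitness.
        ranked = sorted(pop, key=fitness, reverse=True)
        # Step 3: selection (assumed rule: the fitter half become parents).
        parents = ranked[:max(2, pop_size // 2)]
        children = []
        while len(children) < pop_size - 1:
            a, b = (list(p) for p in rng.sample(parents, 2))
            # Step 4: crossover, exchanging values at random positions.
            if rng.random() < pc:
                for j in range(len(bounds)):
                    if rng.random() < 0.5:
                        a[j], b[j] = b[j], a[j]
            # Step 5: mutation, randomly changing one K_j.
            if rng.random() < pm:
                j = rng.randrange(len(bounds))
                a[j] = rng.randint(1, bounds[j])
            children.append(a)
        # Step 6: the fittest individual seen so far survives unchanged.
        best = max([best, ranked[0]], key=fitness)
        pop = children + [list(best)]
    return max(pop, key=fitness)

def toy_fitness(ks):
    """Hypothetical fitness: small K values are better, but only if every K_j >= 3."""
    return -sum(ks) if all(k >= 3 for k in ks) else -1000

result = genetic_search([8, 8], toy_fitness)
print(result)
```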

Taking the convolutional layer of the neural network as an example, as shown in FIG. 5D, the convolutional layer can be considered as a four-dimensional matrix (N_fin, N_fout, K_x, K_y), wherein N_fin is the number of input feature images, N_fout is the number of output feature images, and (K_x, K_y) is the size of the convolution kernel in the convolutional layer.

In a convolutional neural network, the convolutional neural network comprises N_fin*N_fout convolution kernels; the first formula includes: F ≈ F₁*F₂; wherein F represents any one of the N_fin*N_fout convolution kernels; F₁ is a first sub-convolution kernel; F₂ is a second sub-convolution kernel; the first sub-convolution kernel F₁ is a (K_x, R) matrix, the second sub-convolution kernel F₂ is a (R, K_y) matrix, (K_x, K_y) represents the size of the convolution kernel, R is a compression parameter, and R is a positive integer greater than 0 and less than or equal to min(K_x, K_y).
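As a concrete instance of F ≈ F₁*F₂, a separable kernel factorizes exactly with R = 1. The sketch below uses the well-known 3 × 3 Sobel kernel purely as an illustration; it is not taken from the embodiment, and the helper matmul is a hypothetical name.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# F1 has shape (K_x, R) = (3, 1) and F2 has shape (R, K_y) = (1, 3).
f1 = [[1], [2], [1]]
f2 = [[1, 0, -1]]

# The product reproduces the 3x3 horizontal Sobel kernel exactly.
f = matmul(f1, f2)
for row in f:
    print(row)

weights_before = 3 * 3          # K_x * K_y
weights_after = 3 * 1 + 1 * 3   # K_x * R + R * K_y
```

Most trained kernels are not exactly separable, so in general a larger R or a tolerated approximation error is needed.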

As mentioned above, the process of adjusting the size of each of the two sub-matrices is essentially a dynamic variation process of the compression parameter R value to find the optimal compression parameter R. In practical application, a binary search mode can be adopted to determine a compression parameter R value in the convolutional layer neural network, so that a second weight matrix meeting preset precision is obtained.

In a specific implementation, the compression parameters R have different values, i.e. the first weight matrix is compressed based on a plurality of different compression ratios, where in the convolutional layer neural network, the compression ratio is

In the embodiment of the present invention, the implementation process of determining the value of the compression parameter R by using a binary search mode refers to the foregoing text description, which is not repeated herein.

For example, in the convolutional layer neural network structure shown in fig. 5D, the convolutional layer includes 4 convolution kernels, and the size of the convolution kernels is 3 × 3, where, in the 1 st convolution kernel, N is_fin＝4，N_foutThe compression parameter R value is a positive integer greater than 0 and equal to or less than 4. The compression parameter R is determined to be 4 by the binary search, that is, the first sub-convolution kernel F1 in the 1 st convolution kernel satisfying the preset precision is a (3,4) matrix, and the second sub-convolution kernel F2 is a (4,3) matrix. In one embodiment, the other convolution kernels shown in fig. 5D may adopt the same compression method as the 1 st convolution kernel, or may adopt a compression method different from the 1 st convolution kernel, and the embodiment of the present invention is not particularly limited.

In one embodiment, when the expression form of the first formula is shown in formula (1.2), an adaptive algorithm (e.g., a genetic algorithm) may be used to determine the compression parameter R value in the convolutional neural network, and the specific implementation process refers to the foregoing description, which is not repeated herein.

Taking the long short-term memory (LSTM) layer of the neural network as an example, the weight of the LSTM layer is composed of a plurality of fully-connected layer weights. Let the weight of the LSTM layer consist of t fully-connected layer weights, t being a positive integer greater than 0. For example, the jth fully-connected layer weight is (N_{in_j}, N_{out_j}), wherein N_{in_j} represents the number of input neurons of the jth fully-connected layer, N_{out_j} represents the number of output neurons of the jth fully-connected layer, and the number of weights of the jth fully-connected layer is N_{in_j}*N_{out_j}.

In an LSTM layer neural network, the LSTM layer comprises N fully-connected layers, wherein N is a positive integer greater than 0; for the jth fully-connected layer, the first formula includes: M_j ≈ M_{j_1}*M_{j_2}; the two sub-matrices in the jth fully-connected layer comprise a first sub-matrix M_{j_1} and a second sub-matrix M_{j_2}, wherein M_{j_1} is an N_{in_j} × S matrix and M_{j_2} is an S × N_{out_j} matrix; where S is a compression parameter, N_{in_j} is the number of input neurons of the jth fully-connected layer of the neural network, and N_{out_j} is the number of output neurons of the jth fully-connected layer of the neural network; the compression parameter characterizes the number of output neurons of M_{j_1} and the number of input neurons of M_{j_2}, and S is a positive integer greater than 0 and less than or equal to min(N_{in_j}, N_{out_j}).

As mentioned above, the process of adjusting the size of each of the two sub-matrices is essentially a dynamic variation process of the compression parameter S value to find the optimal compression parameter S. In practical application, a binary search mode can be adopted to determine the compression parameter S value in the LSTM layer neural network, so that a second weight matrix meeting the preset precision is obtained.

In a specific implementation, for the jth fully-connected layer, the compression parameters S have different values, that is, the first weight matrix is compressed based on a plurality of different compression ratios, where, in the jth fully-connected layer, the compression ratio is different

In the embodiment of the present invention, the implementation process of determining the compression parameter S value in the jth fully-connected layer by using a binary search mode refers to the foregoing text description, which is not repeated herein.

Taking the neural network architecture shown in fig. 5A as an example, the neural network includes an input layer, a hidden layer and an output layer, wherein the 1st fully-connected layer lies between the input layer and the hidden layer, and the 2nd fully-connected layer lies between the hidden layer and the output layer. For a detailed description of the fully-connected layer structure from the input layer to the hidden layer (i.e., the 1st fully-connected layer), please refer to the foregoing description, which is not repeated herein. As can be seen from fig. 5A, the two-dimensional parameter matrix of the fully-connected layer between the hidden layer and the output layer is (4,2), which indicates that, in the fully-connected layer structure between the hidden layer and the output layer, the number of input neurons is 4, the number of output neurons is 2, and the number of weights is 8. In a specific implementation, the 8 weights may be represented as a weight matrix with 2 rows and 4 columns, and the representation form of the weight matrix may be as shown in fig. 5E. In this case, the compression parameter S takes a positive integer value greater than 0 and less than or equal to 2. The compression parameter S is determined to be 2 by binary search, that is, in the 2nd fully-connected layer, the first sub-matrix M_{2_1} in the second weight matrix meeting the preset precision is a (4,2) matrix, and the second sub-matrix M_{2_2} is a (2,2) matrix. In particular, the compression of the fully-connected layer between the input layer and the hidden layer and of the fully-connected layer between the hidden layer and the output layer in fig. 5A may be as shown in fig. 5F.
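The sub-matrix sizes quoted for the two fully-connected layers of fig. 5A follow directly from the shapes (N_in, S) and (S, N_out). The sketch below recomputes them; the function name factor_shapes is hypothetical.

```python
def factor_shapes(n_in, n_out, s):
    """Shapes of the two sub-matrices for a layer with compression parameter s."""
    if not 0 < s <= min(n_in, n_out):
        raise ValueError("the compression parameter must lie in (0, min(N_in, N_out)]")
    return (n_in, s), (s, n_out)

# (N_in, N_out, compression parameter) for the two layers of fig. 5A,
# using the parameter values given in the text (2 for both layers).
layers = [(3, 4, 2), (4, 2, 2)]
shapes = [factor_shapes(*layer) for layer in layers]
for j, (first, second) in enumerate(shapes, start=1):
    print(f"layer {j}: first sub-matrix {first}, second sub-matrix {second}")
```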

In one embodiment, when the expression form of the first formula is shown in formula (1.2), an adaptive algorithm (e.g., a genetic algorithm) may be used to determine the compression parameter S value in the LSTM layer neural network, and the specific implementation process refers to the foregoing description, which is not repeated herein.

According to the embodiment of the invention, after the controller unit obtains the compression instruction, the controller unit analyzes the compression instruction to obtain a plurality of operation instructions, then sends the operation instructions and the first weight matrix to the compression unit, and then the compression unit decomposes the first weight matrix to obtain the second weight matrix. In the concrete implementation, the second weight matrix comprises at least two sub-matrices, and then the second weight matrix meeting the preset precision is obtained by adjusting the size of each sub-matrix in the at least two sub-matrices and combining with a machine learning model after training and compression, so that the problem that the topological structure of the neural network is irregular easily caused by the adoption of a neural network pruning algorithm in the prior art is solved. In addition, the neural network is compressed, so that the operation amount of the neural network can be reduced, and the operation speed is further improved.

S3, the controller unit obtains second input data and a calculation instruction, wherein the second input data includes a second weight matrix and input neuron data.

S4, the controller unit analyzes the calculation instruction into an operation instruction, and sends the operation instruction and the second input data to the operation unit.

In a specific implementation, for the implementation manner in which the controller unit obtains the computation instruction and analyzes the computation instruction to obtain a plurality of operation instructions, please refer to the text description of the compressed instruction obtained by the controller unit, which is not described herein.

And S5, the arithmetic unit receives the arithmetic instruction sent by the controller unit and executes neural network calculation according to the arithmetic instruction and the second input data.

In practical applications, the neural network computation referred to herein may include an artificial neural network operation, a convolutional neural network operation, and so on.

Taking the artificial neural network operation as an example, if the artificial neural network operation has multiple layers, the input neurons and the output neurons of the multilayer operation do not refer to the neurons in the input layer and the neurons in the output layer of the whole neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the network forward operation are the input neurons, and the neurons in the upper layer of the network forward operation are the output neurons. Taking a convolutional neural network as an example, let a convolutional neural network have L layers, with K = 1, 2, ..., L-1; for the K-th layer and the (K+1)-th layer, the K-th layer is referred to as the input layer, in which the neurons are the input neurons, and the (K+1)-th layer is referred to as the output layer, in which the neurons are the output neurons. That is, each layer except the topmost layer can be used as an input layer, and the next layer is the corresponding output layer.

In the specific implementation, the operation in the neural network may be a layer of operation in the neural network, and for a multilayer neural network, the implementation process is that, in the forward operation, after the execution of the artificial neural network in the previous layer is completed, the operation instruction in the next layer takes the output neuron calculated in the operation unit as the input neuron in the next layer to perform operation (or performs some operation on the output neuron and then takes the output neuron as the input neuron in the next layer), and at the same time, the weight is also replaced by the weight in the next layer; in the reverse operation, after the reverse operation of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer to perform operation (or performs some operation on the input neuron gradient and then takes the input neuron gradient as the output neuron gradient of the next layer), and at the same time, the weight value is replaced by the weight value of the next layer.
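The layer-by-layer forward operation, in which the output neurons of one layer become the input neurons of the next and the weights are swapped for those of the next layer, can be sketched as follows. The weights are hypothetical values laid out to match the shapes of fig. 5A (4 rows by 3 columns, then 2 rows by 4 columns); activation functions are omitted for brevity, and dense and forward are hypothetical names.

```python
def dense(weights, inputs):
    """One layer: each row of weights yields one output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]

def forward(layers, inputs):
    """Run the layers in order, feeding each output to the next layer."""
    neurons = inputs
    for weights in layers:       # the weights are replaced layer by layer
        neurons = dense(weights, neurons)
    return neurons

# Hypothetical weights: 3 inputs -> 4 hidden neurons -> 2 outputs.
layers = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]],
    [[1, 1, 1, 1], [1, -1, 1, -1]],
]
out = forward(layers, [1, 2, 3])
print(out)
```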

Taking the forward operation process of the neural network as an example, first, the operation unit reads the second input data from the storage unit, wherein the second input data includes the second weight matrix and the input neuron data.

Secondly, the main processing circuit reads the corresponding neuron data and broadcasts it to each slave processing circuit in a designated order. In practical applications, the neuron data may be broadcast only once, in which case each slave processing circuit receives the data and temporarily stores it in a buffer or a register so that it can be reused. Alternatively, the neuron data may be broadcast multiple times, in which case each slave processing circuit uses the data directly after receiving it, without reuse. In one possible embodiment, the main processing circuit broadcasts the neuron data directly after reading the neuron data.

And then, each slave processing circuit carries out inner product operation on the read neuron data and the second weight matrix according to the operation instruction, and then transmits the inner product result back to the main processing circuit.

In one embodiment, the slave processing circuit may transmit the partial sum obtained by performing the inner product operation each time back to the master processing circuit for accumulation; in one embodiment, the partial sum obtained by the inner product operation executed by the slave processing circuit each time may be stored in a register and/or an on-chip cache of the slave processing circuit, and may be transmitted back to the master processing circuit after the accumulation is completed; in one embodiment, the partial sum obtained by the inner product operation performed by the slave processing circuit may be stored in a register and/or an on-chip buffer of the slave processing circuit in some cases, and may be transmitted to the master processing circuit in some cases to be accumulated, and may be transmitted back to the master processing circuit after the accumulation is completed.

And finally, after the main processing circuit carries out operations such as accumulation, activation and the like on the results of all the slave processing circuits until the forward operation process of the neural network is completed, an error value between a prediction result and an actual result, namely the neuron gradient data of the last layer is obtained and stored in a storage unit.

In the embodiment of the present invention, the arithmetic unit 12 may be configured as a one-master multi-slave structure. In an alternative embodiment, the arithmetic unit 12 may comprise a master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 6. In one embodiment, as shown in fig. 6, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the other adjacent slave processing circuits, and the master processing circuit is connected with K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column. It should be noted that, as shown in fig. 6, the K slave processing circuits include only the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column, that is, the K slave processing circuits are the slave processing circuits directly connected to the master processing circuit among the plurality of slave processing circuits.

And the K slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits.

Optionally, as shown in fig. 7, the main processing circuit may further include: one or any combination of the conversion processing circuit 110, the activation processing circuit 111, and the addition processing circuit 112;

a conversion processing circuit 110 for performing an interchange between the first data structure and the second data structure (e.g., conversion of continuous data and discrete data) on the data block or intermediate result received by the main processing circuit; or performing an interchange between the first data type and the second data type (e.g., a fixed point type to floating point type conversion) on a data block or intermediate result received by the main processing circuitry;

an activation processing circuit 111 for performing an activation operation of data in the main processing circuit;

and an addition processing circuit 112 for performing addition operation or accumulation operation.
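The kind of data type interchange performed by the conversion processing circuit 110 (fixed point to floating point and back) can be illustrated in software. The sketch below is only an analogy: the number of fractional bits (8) and the function names are assumptions, and nothing here describes how the circuit itself is built.

```python
def float_to_fixed(value, frac_bits=8):
    """Quantize a float to a fixed-point integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))

def fixed_to_float(raw, frac_bits=8):
    """Recover the floating-point value represented by a fixed-point integer."""
    return raw / (1 << frac_bits)

raw = float_to_fixed(0.75)    # 0.75 * 256
back = fixed_to_float(raw)
print(raw, back)
```

Values that are not multiples of 2^-8 would be rounded, so the round trip is exact only for representable values such as 0.75.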

The master processing circuit is configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the slave processing circuit;

the plurality of slave processing circuits are used for executing operation on the received data blocks according to the operation instruction to obtain an intermediate result and transmitting the operation result to the main processing circuit;

and the main processing circuit is used for processing the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction and sending the result of the calculation instruction to the controller unit.

The slave processing circuit includes: a multiplication processing circuit;

the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result;

forwarding processing circuitry (optional) for forwarding the received data block or the product result.

And the accumulation processing circuit is used for performing accumulation operation on the product result to obtain the intermediate result.

In another embodiment, the operation instruction is a matrix by matrix instruction, an accumulation instruction, an activation instruction, or the like.

The following describes a specific calculation method of the calculation apparatus shown in fig. 1B by a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be s = s(∑w*x_i + b), wherein the weight w is multiplied by the input data x_i, the products are summed, the bias b is added, and the activation operation s(h) is performed to obtain the final output result s.
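The weighted sum, bias and activation in this formula can be written out for a single output neuron as follows. The sigmoid activation is one possible choice of s(h), and the weights, inputs and the function name neuron are hypothetical.

```python
import math

def sigmoid(h):
    """One possible activation s(h)."""
    return 1.0 / (1.0 + math.exp(-h))

def neuron(weights, inputs, bias, activation=sigmoid):
    """Multiply weights by inputs, sum, add the bias, then apply the activation."""
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(h)

# Hypothetical values: the weighted sum is 0, so the sigmoid returns 0.5.
out = neuron([0.5, -0.5], [2.0, 2.0], 0.0)
print(out)
```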

In an alternative embodiment, as shown in fig. 8, the arithmetic unit comprises: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits; the tree module has a transceiving function and is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the plurality of slave processing circuits, so that data of the main processing circuit can be transmitted to each slave processing circuit, and data of each slave processing circuit can be transmitted to the main processing circuit.

Optionally, the tree module is an optional structure of the computing device, and may include at least 1 layer of nodes, where the nodes are line structures with a forwarding function, and the nodes themselves may not have a computing function. If the tree module has zero layers of nodes, the tree module is not needed.

Optionally, the tree module may have an n-ary tree structure, for example, a binary tree structure as shown in fig. 9, or may have a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment does not limit the specific value of n; the number of layers may be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example, to the nodes of the last layer shown in fig. 9.

Optionally, the arithmetic unit may carry a separate cache and, as shown in fig. 10, may include a neuron buffer unit 63, which buffers the input neuron vector data and the output neuron value data of the slave processing circuit.

As shown in fig. 11, the arithmetic unit may further include: and a weight buffer unit 64, configured to buffer weight data required by the slave processing circuit in the calculation process.

In an alternative embodiment, the arithmetic unit 12, as shown in fig. 12, may include a branch processing circuit 103; the specific connection structure is shown in fig. 12, wherein,

the main processing circuit 101 is connected to branch processing circuit(s) 103, the branch processing circuit 103 being connected to one or more slave processing circuits 102;

a branch processing circuit 103 for forwarding data or instructions between the main processing circuit 101 and the slave processing circuit 102.

In an alternative embodiment, taking the fully-connected operation in the neural network operation as an example, the process may be: y = f(wx + b), where x is an input neuron matrix, w is a weight matrix, b is a bias scalar, and f is an activation function, which may specifically be any one of the sigmoid, tanh, relu and softmax functions. Here, a binary tree structure with 8 slave processing circuits is assumed, and the implementation method may be:

the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;

the main processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, divides the weight matrix w into 8 sub-matrices, distributes the 8 sub-matrices to the 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;

the slave processing circuits execute, in parallel, the multiplication and accumulation operations of the 8 sub-matrices with the input neuron matrix x to obtain 8 intermediate results, and send the 8 intermediate results to the main processing circuit;

the main processing circuit arranges the 8 intermediate results in order to obtain the operation result of wx, executes the offset b operation on the operation result, and then executes the activation operation to obtain a final result y; it sends the final result y to the controller unit, which outputs the final result y or stores it into the storage unit.
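The data flow of the fully-connected example above can be sketched in software as follows. This is only an illustrative model, not the hardware implementation: the function names are hypothetical, x is simplified to a single input vector, the weight matrix is assumed to be split by rows, and relu is picked arbitrarily from the activation functions listed above.

```python
import math

def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]

def split_rows(w, parts):
    """Split the rows of w into `parts` contiguous sub-matrices (distribution data)."""
    size = math.ceil(len(w) / parts)
    return [w[i * size:(i + 1) * size] for i in range(parts)]

def relu(v):
    return max(0.0, v)

def fully_connected(w, x, b, activation, num_slaves=8):
    # Main circuit: split w into one sub-matrix per slave circuit.
    sub_matrices = split_rows(w, num_slaves)
    # Slave circuits: multiply-accumulate each sub-matrix with the broadcast x.
    intermediate = [matvec(sub, x) for sub in sub_matrices]
    # Main circuit: arrange the intermediate results in order to form wx.
    wx = [value for part in intermediate for value in part]
    # Main circuit: apply the scalar offset b, then the activation.
    return [activation(value + b) for value in wx]

# Assumed example sizes: 16 output neurons, 4 input neurons.
w = [[float(r + c) for c in range(4)] for r in range(16)]
x = [1.0, -2.0, 0.5, 3.0]
y = fully_connected(w, x, b=0.5, activation=relu)
```

Because each slave circuit only needs its own rows of w together with the full x, the 8 partial products are independent and can be computed in parallel; the main circuit only has to concatenate them in the original row order.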

The method for the computing device shown in fig. 1B to execute the neural network forward operation instruction may specifically be:

the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the at least one operation code to the operation unit.

The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted), transmits the weight w and the offset b to the main processing circuit of the arithmetic unit, extracts the input data Xi from the storage unit, and transmits the input data Xi to the main processing circuit.

The main processing circuit determines multiplication operation according to the at least one operation code, determines input data Xi as broadcast data, determines weight data as distribution data, and splits the weight w into n data blocks;

the instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one operation code, and sends the multiplication instruction, the offset instruction and the accumulation instruction to the master processing circuit; the master processing circuit sends the multiplication instruction and the input data Xi to the plurality of slave processing circuits in a broadcasting mode, and distributes the n data blocks to the plurality of slave processing circuits (for example, if there are n slave processing circuits, each slave processing circuit is sent one data block); the plurality of slave processing circuits execute the multiplication operation on the input data Xi and the received data block according to the multiplication instruction to obtain intermediate results, and send the intermediate results to the master processing circuit; the master processing circuit executes the accumulation operation on the intermediate results sent by the plurality of slave processing circuits according to the accumulation instruction to obtain an accumulation result, executes the offset b on the accumulation result according to the offset instruction to obtain a final result, and sends the final result to the controller unit.

In addition, the order of addition and multiplication may be reversed.
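The multiply, accumulate and offset sequence described above can be illustrated with a small sketch. It rests on assumptions that the text does not state: w is treated as the weight vector of a single output neuron, the n data blocks are contiguous slices of w, and each slave circuit pairs its block with the matching slice of the broadcast input Xi. The function names are hypothetical.

```python
def split_blocks(values, n):
    """Split a list into n contiguous blocks of near-equal size."""
    base, extra = divmod(len(values), n)
    blocks, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        blocks.append(values[start:end])
        start = end
    return blocks

def forward(w, xi, b, n):
    w_blocks = split_blocks(w, n)    # master circuit: distribution data
    x_blocks = split_blocks(xi, n)   # slice of the broadcast Xi used by each slave
    # Slave circuits: multiplication instruction on the received data block.
    intermediate = [sum(wv * xv for wv, xv in zip(wb, xb))
                    for wb, xb in zip(w_blocks, x_blocks)]
    accumulated = sum(intermediate)  # master circuit: accumulation instruction
    return accumulated + b           # master circuit: offset instruction

w = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
xi = [1.0, 0.0, -1.0, 2.0, 0.5, 1.0]
result = forward(w, xi, b=1.0, n=3)  # blocks give 1.0, 5.0, 8.5; plus offset -> 15.5
```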

According to the technical scheme, the multiplication and offset operations of the neural network are achieved through one instruction, namely the neural network operation instruction. The intermediate results of the neural network calculation do not need to be stored or extracted, which reduces the storage and extraction operations for intermediate data; the scheme therefore has the advantages of reducing the corresponding operation steps and improving the computing performance of the neural network.

The application also discloses a machine learning operation device, which comprises one or more computing devices mentioned in the application, and is used for acquiring data to be operated on and control information from other processing devices, executing a specified machine learning operation, and transmitting the execution result to peripheral equipment through an I/O interface. The peripheral equipment may be, for example, a camera, a display, a mouse, a keyboard, a network card, a wifi interface or a server. When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example, interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the computing devices may share the same control system or have separate control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection mode can be any interconnection topology.

The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.

The application also discloses a combined processing device which comprises the machine learning arithmetic device, the universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user. Fig. 13 is a schematic view of the combined processing device.

The other processing devices include one or more types of general purpose/special purpose processors such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing data transfer and basic control such as starting and stopping the machine learning arithmetic device; the other processing devices may also cooperate with the machine learning arithmetic device to perform computing tasks.

And the universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; control instructions can be obtained from other processing devices and written into a control cache on a machine learning arithmetic device chip; the data in the storage module of the machine learning arithmetic device can also be read and transmitted to other processing devices.

Alternatively, as shown in fig. 14, the configuration may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing device, respectively. The storage device is used for storing data in the machine learning arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing device.

The combined processing device can be used as an SOC (system on chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle or video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed, and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card or a wifi interface.

In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or the combined processing device.

In some embodiments, a chip package structure is provided, which includes the above chip.

In some embodiments, a board card is provided, which includes the above chip package structure. Referring to fig. 15, the board card may include other components in addition to the chip 389, including but not limited to: a memory device 390, an interface device 391 and a control device 392;

the memory device 390 is connected to the chip in the chip package structure through a bus for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).

DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read out on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the memory cells. Each group of the memory cells may include a plurality of DDR4 chips (granules). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits are used for ECC check. It can be understood that when DDR4-3200 chips are adopted in each group of memory cells, the theoretical bandwidth of data transmission can reach 25600 MB/s.
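As a worked check of the figure quoted above, assuming that it refers to a single group of memory cells behind one controller with a 64-bit data path, and using the DDR4-3200 transfer rate of 3200 MT/s:

$$
3200\ \text{MT/s} \times \frac{64\ \text{bit}}{8\ \text{bit/byte}} = 3200 \times 8\ \text{MB/s} = 25600\ \text{MB/s}
$$

The 8 ECC bits of the 72-bit interface carry check information only and therefore do not contribute to this data bandwidth.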

In one embodiment, each group of the memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.

The interface device is electrically connected with the chip in the chip package structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface, through which the server transmits the data to be processed to the chip, thereby implementing data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.
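The 16000 MB/s figure can be reproduced from the raw signalling rate of PCIE 3.0, which is 8 GT/s per lane with one bit per transfer per lane. The second line is an added observation rather than part of the application: PCIE 3.0 uses 128b/130b line encoding, so the payload rate is slightly lower than the raw rate.

$$
8\ \text{GT/s} \times 16\ \text{lanes} \times \frac{1\ \text{byte}}{8\ \text{bit}} = 16\ \text{GB/s} = 16000\ \text{MB/s}
$$

$$
16000\ \text{MB/s} \times \frac{128}{130} \approx 15754\ \text{MB/s}
$$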

The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a Micro Controller Unit (MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may carry a plurality of loads. Therefore, the chip can be in different working states such as heavy load and light load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores and/or the plurality of processing circuits in the chip.

In some embodiments, an electronic device is provided that includes the above board card.

The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle comprises an airplane, a ship and/or a road vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.

In the embodiment of the present invention, the compression method for the neural network may be applied in, but is not limited to, the above-mentioned computing device; it may also be applied in other scenarios, for example, reducing the precision loss of the neural network. On this basis, the following describes how to compress the first weight matrix to obtain the second weight matrix, with reference to the schematic flow diagram of the neural network compression method provided in the embodiment of the present invention shown in fig. 16. The method may include, but is not limited to, the following steps:

Step S100, acquiring first input data; wherein the first input data comprises a first weight matrix.

Step S102, compressing the first weight matrix into a second weight matrix; wherein the second weight matrix comprises at least two sub-matrices.

In one embodiment, the compressing the first weight matrix into a second weight matrix includes:

decomposing the first weight matrix into a third weight matrix; wherein the third weight matrix comprises at least two sub-matrices;

determining the size of each of the at least two sub-matrices according to a first formula, wherein the first formula is $Q \approx Q_1 * Q_2 * \cdots * Q_n$; wherein Q represents the first weight matrix, $Q_1$ represents a first sub-matrix of the at least two sub-matrices, $Q_2$ represents a second sub-matrix of the at least two sub-matrices, and $Q_n$ represents an nth sub-matrix of the at least two sub-matrices;

and adjusting the size of each sub-matrix of the at least two sub-matrices, and training the compressed machine learning model to obtain a second weight matrix meeting the preset precision.

$Q \approx Q_1 * Q_2$ (1.1)

wherein Q represents the first weight matrix, $Q_1$ represents a first sub-matrix of the at least two sub-matrices, and $Q_2$ represents a second sub-matrix of the at least two sub-matrices.

$Q \approx Q_1 * Q_2 * \cdots * Q_n$ (1.2)

In the above formula (1.2), n is a positive integer greater than 2.

In a specific implementation, when the first weight matrix is compressed into the second weight matrix, the operations involved differ depending on the neural network to which the first weight matrix is applied (e.g., a fully-connected layer neural network, a convolutional layer neural network, or an LSTM layer neural network): namely, decomposing the first weight matrix, solving each of the at least two sub-matrices, and adjusting each of the at least two sub-matrices to obtain a second weight matrix satisfying the preset precision. These cases are described in detail below:

(1) Fully-connected layer neural network:

A fully-connected layer means that, for layer n-1 and layer n, any node of layer n-1 is connected with all nodes of layer n. Specifically, fig. 5A is a structural diagram of a one-dimensional fully-connected layer of a neural network provided in an embodiment of the present invention. As shown in fig. 5A, the neural network includes an input layer, a hidden layer, and an output layer. The two-dimensional parameter matrix of the fully-connected layer between the input layer and the hidden layer is (3,4), which indicates that, in the fully-connected layer structure between the input layer and the hidden layer, the number of input neurons is 3, the number of output neurons is 4, and the number of weights is 12. In a specific implementation, the 12 weights may be represented as a weight matrix with 4 rows and 3 columns, as shown in fig. 5B.

In a fully-connected layer neural network, the first formula includes: $M \approx M_1 * M_2$; the two sub-matrices comprise a first sub-matrix $M_1$ and a second sub-matrix $M_2$, where $M_1$ is an $N_{in} \times K$ matrix and $M_2$ is a $K \times N_{out}$ matrix; K is a compression parameter, $N_{in}$ is the number of input neurons of the neural network, and $N_{out}$ is the number of output neurons of the neural network; the compression parameter characterizes the number of output neurons of $M_1$ and the number of input neurons of $M_2$, and K is a positive integer greater than 0 and less than or equal to $\min(N_{in}, N_{out})$.
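To make the effect of the compression parameter K concrete, the following sketch computes the shapes of the two sub-matrices and compares the number of weights before and after factorization. The function names and the 1024 x 512 layer size are hypothetical and chosen only for illustration; the (3,4) case corresponds to the layer size shown in fig. 5A.

```python
def factor_shapes(n_in, n_out, k):
    """Return the shapes of M1 (n_in x k) and M2 (k x n_out)."""
    if not 0 < k <= min(n_in, n_out):
        raise ValueError("k must satisfy 0 < k <= min(n_in, n_out)")
    return (n_in, k), (k, n_out)

def weight_counts(n_in, n_out, k):
    """Return (weights in M, weights in M1 plus weights in M2)."""
    (r1, c1), (r2, c2) = factor_shapes(n_in, n_out, k)
    return n_in * n_out, r1 * c1 + r2 * c2

# Hypothetical large layer: 1024 inputs, 512 outputs, K = 64.
original, compressed = weight_counts(1024, 512, 64)

# Layer size from fig. 5A: 3 inputs, 4 outputs, K = 1.
small_original, small_compressed = weight_counts(3, 4, 1)
```

The factorized form stores K * (N_in + N_out) weights instead of N_in * N_out, so it only saves storage when K is smaller than N_in * N_out / (N_in + N_out). For the (3,4) layer this holds for K = 1 (7 weights instead of 12) but not for K = 2 (14 weights).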

The following describes how to determine the value of the compression parameter K by binary search. First, two parameters KL and KR are set. At initialization, KL = 1 and KR = $\min(N_{in}, N_{out})$. During parameter adjustment, K = (KL + KR)/2. If the second weight matrix expressed by $M_1 * M_2$ causes the accuracy of the compressed neural network model to decrease by X% (here, X is 1 to 10, etc.), the parameter KL is adjusted so that KL = K. If the second weight matrix expressed by $M_1 * M_2$ allows the compressed neural network model to meet the preset precision, the parameter KR is adjusted so that KR = K. These steps are repeated until the end condition K = KL or K = KR is met.
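The binary search above can be sketched as follows. The callback `meets_precision` is a hypothetical stand-in for training the compressed model with a given K and checking it against the preset precision. The sketch makes three assumptions: the division is integer division, precision does not get worse as K grows, and the largest K meets the preset precision. The initial check of KL is an addition that is not part of the described procedure.

```python
def search_compression_parameter(n_in, n_out, meets_precision):
    """Return the smallest K found by the KL/KR binary search."""
    kl, kr = 1, min(n_in, n_out)
    # Extra check (not in the described procedure): the loop below never
    # evaluates K = KL, so test the smallest candidate up front.
    if meets_precision(kl):
        return kl
    while True:
        k = (kl + kr) // 2          # K = (KL + KR) / 2, rounded down
        if k == kl or k == kr:      # end condition: K = KL or K = KR
            return kr
        if meets_precision(k):
            kr = k                  # precision met: try a smaller K
        else:
            kl = k                  # precision lost: K must be larger

# Hypothetical precision test: pretend any K >= 3 keeps the preset precision.
best_k = search_compression_parameter(8, 16, lambda k: k >= 3)
```

With this stand-in the search evaluates K = 4, 2 and 3 in turn and stops with KR = 3, the smallest value for which the hypothetical precision test passes.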

In one embodiment, when the expression form of the first formula is as shown in formula (1.2), that is, the number of the sub-matrices in the third weight matrix is n, where n is a positive integer greater than 2, the number of the compression parameters K is (n-1). In practical applications, an adaptive algorithm (e.g., a genetic algorithm) may be used to determine the (n-1) compression parameters K in the fully-connected layer neural network, so as to obtain a second weight matrix satisfying a predetermined precision and/or satisfying a compression effect. The following details how the genetic algorithm is used to determine the (n-1) compression parameter K values in the fully-connected layer neural network:

Step 5: applying a mutation operator to the population: for any individual, positions in the gene string are randomly generated, and the values at those positions are changed; here, mutation means randomly changing the value of $K_j$;
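A minimal sketch of such a mutation operator is given below. It assumes that an individual is encoded as a list of the (n-1) compression parameters, that each $K_j$ has a known upper bound, and that each position mutates independently with a fixed probability; the encoding, the bounds and the rate are illustrative assumptions that are not specified in the text.

```python
import random

def mutate(individual, upper_bounds, rate, rng):
    """Return a copy of `individual` with randomly chosen K_j values replaced."""
    mutated = list(individual)
    for j, bound in enumerate(upper_bounds):
        if rng.random() < rate:                 # randomly pick positions to mutate
            mutated[j] = rng.randint(1, bound)  # new K_j in [1, bound]
    return mutated

rng = random.Random(0)       # seeded only to make the sketch repeatable
parent = [4, 8, 2]           # hypothetical individual: (K_1, K_2, K_3)
bounds = [16, 32, 8]         # hypothetical upper bound for each K_j
child = mutate(parent, bounds, rate=0.5, rng=rng)
```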

(2) Convolutional layer neural network:

In a specific implementation, the compression parameter R may take different values, that is, the first weight matrix may be compressed at a plurality of different compression ratios, where, in a convolutional neural network, the compression ratio

For example, in the convolutional layer neural network structure shown in fig. 5D, the convolutional layer includes 4 convolution kernels, and the size of each convolution kernel is 3 × 3, where, in the 1st convolution kernel, $N_{fin} = 4$, $N_{fout}$ …, and the compression parameter R is a positive integer greater than 0 and less than or equal to 4. The compression parameter R is determined to be 4 by binary search; that is, the first sub-convolution kernel $F_1$ in the 1st convolution kernel satisfying the preset precision is a (3,4) matrix, and the second sub-convolution kernel $F_2$ is a (4,3) matrix. In one embodiment, the other convolution kernels shown in fig. 5D may adopt the same compression method as the 1st convolution kernel, or may adopt a compression method different from that of the 1st convolution kernel; the embodiment of the present invention is not particularly limited in this respect.

(3) LSTM layer neural network:

Taking the Long Short-Term Memory (LSTM) layer of the neural network as an example, the weight of the LSTM layer is composed of a plurality of fully-connected layer weights. Let the weight of the LSTM layer consist of t fully-connected layer weights, t being a positive integer greater than 0. For example, the jth fully-connected layer weight is $(N_{in_j}, N_{out_j})$, where $N_{in_j}$ represents the number of input neurons of the jth fully-connected layer, $N_{out_j}$ represents the number of output neurons of the jth fully-connected layer, and the number of weights of the jth fully-connected layer is $N_{in_j} * N_{out_j}$.

In the LSTM layer neural network, the LSTM layer comprises N fully-connected layers, where N is a positive integer greater than 0; for the jth fully-connected layer, the first formula includes: $M_j \approx M_{j_1} * M_{j_2}$; the two sub-matrices in the jth fully-connected layer comprise a first sub-matrix $M_{j_1}$ and a second sub-matrix $M_{j_2}$, where $M_{j_1}$ is an $N_{in_j} \times S$ matrix and $M_{j_2}$ is an $S \times N_{out_j}$ matrix; S is a compression parameter, $N_{in_j}$ is the number of input neurons of the jth fully-connected layer of the neural network, and $N_{out_j}$ is the number of output neurons of the jth fully-connected layer of the neural network; the compression parameter characterizes the number of output neurons of $M_{j_1}$ and the number of input neurons of $M_{j_2}$, and S is a positive integer greater than 0 and less than or equal to $\min(N_{in_j}, N_{out_j})$.

Taking the neural network architecture shown in fig. 5A as an example, the neural network includes an input layer, a hidden layer, and an output layer, where the 1st fully-connected layer is between the input layer and the hidden layer, and the 2nd fully-connected layer is between the hidden layer and the output layer. For the fully-connected layer structure between the input layer and the hidden layer (i.e., the 1st fully-connected layer), refer to the foregoing description; details are not repeated here. As can be seen from fig. 5A, the two-dimensional parameter matrix of the fully-connected layer between the hidden layer and the output layer is (4,2), which indicates that, in this fully-connected layer structure, the number of input neurons is 4, the number of output neurons is 2, and the number of weights is 8. In a specific implementation, the 8 weights may be represented as a weight matrix with 2 rows and 4 columns, as shown in fig. 5E. In this case, the compression parameter S is a positive integer greater than 0 and less than or equal to 2. The compression parameter S is determined to be 2 by binary search; that is, in the 2nd fully-connected layer, the first sub-matrix $M_{2_1}$ in the second weight matrix meeting the preset precision is a (4,2) matrix, and the second sub-matrix $M_{2_2}$ is a (2,2) matrix. Specifically, the compression of the fully-connected layers between the input layer and the hidden layer and between the hidden layer and the output layer in fig. 5A may be as shown in fig. 5F.

Step S104, performing neural network calculation according to second input data, wherein the second input data comprises a second weight matrix and neuron data.

According to the embodiment of the invention, the at least two sub-matrixes containing the compression parameters can be obtained by decomposing the first weight matrix, then each sub-matrix in the at least two sub-matrixes is solved according to a formula, and the second weight matrix meeting the preset precision is obtained by training the compressed neural network, so that the problem that the topological structure of the neural network is irregular easily caused by the adoption of a neural network pruning algorithm in the prior art is solved, the neural network can be deeply compressed, the calculated amount of the neural network can be reduced, and the operation speed is improved.

In order to better implement the above scheme of the embodiment of the present invention, the present invention further provides a neural network compression apparatus, which is described in detail below with reference to the accompanying drawings:

fig. 17A is a schematic structural diagram of a neural network compression device according to an embodiment of the present invention, where the neural network compression device includes: an acquisition unit 300, a compression unit 13, and a calculation unit 304;

the acquiring unit 300 is configured to acquire first input data; wherein the first input data comprises a first weight matrix;

the calculating unit 304 is configured to perform a neural network calculation according to second input data, where the second input data includes the second weight matrix and input neuron data.

In one embodiment, as shown in fig. 17B, the compression unit 13 includes a decomposition unit 130, a solving unit 131, and a training unit 132;

the decomposition unit 130 is configured to decompose the first weight matrix into a third weight matrix; wherein the third weight matrix comprises at least two sub-matrices;

the solving unit 131 is configured to determine the size of each of the at least two sub-matrices according to a first formula, where the first formula is $Q \approx Q_1 * Q_2 * \cdots * Q_n$; wherein Q represents the first weight matrix, $Q_1$ represents a first sub-matrix of the at least two sub-matrices, $Q_2$ represents a second sub-matrix of the at least two sub-matrices, and $Q_n$ represents an nth sub-matrix of the at least two sub-matrices;

the training unit 132 is configured to adjust the size of each of the at least two sub-matrices, and train the compressed machine learning model to obtain a second weight matrix meeting the preset precision.

Optionally, the solving unit 131 is specifically configured to determine the size of each sub-matrix of the at least two sub-matrices according to the first formula and a second formula, where the second formula is $\|Q - Q_1 * Q_2 * \cdots * Q_n\| \le T$, wherein T represents a preset error threshold.

Optionally, the training unit 132 is specifically configured to adjust the size of each of the at least two sub-matrices, and train the compressed machine learning model to obtain a second weight matrix that meets the preset precision and whose compression ratio relative to the first weight matrix meets a preset compression ratio.

Optionally, the neural network comprises a fully-connected layer neural network; the first formula includes: $M \approx M_1 * M_2$; the two sub-matrices comprise a first sub-matrix $M_1$ and a second sub-matrix $M_2$, where $M_1$ is an $N_{in} \times K$ matrix and $M_2$ is a $K \times N_{out}$ matrix; K is a compression parameter, $N_{in}$ is the number of input neurons of the neural network, and $N_{out}$ is the number of output neurons of the neural network; the compression parameter characterizes the number of output neurons of $M_1$ and the number of input neurons of $M_2$, and K is a positive integer greater than 0 and less than or equal to $\min(N_{in}, N_{out})$.

Optionally, the neural network comprises a convolutional layer neural network; the convolutional layer neural network comprises $N_{fin} * N_{fout}$ convolution kernels; the first formula includes: $F \approx F_1 * F_2$; wherein F represents any one of the $N_{fin} * N_{fout}$ convolution kernels, $F_1$ is a first sub-convolution kernel, and $F_2$ is a second sub-convolution kernel; the first sub-convolution kernel $F_1$ is $(K_x, R)$, the second sub-convolution kernel $F_2$ is $(R, K_y)$, $(K_x, K_y)$ represents the size of a convolution kernel, R is a compression parameter, and R is a positive integer greater than 0 and less than or equal to $\min(K_x, K_y)$.

Optionally, the neural network includes an LSTM layer neural network, where the LSTM layer includes N fully-connected layers, and N is a positive integer greater than 0; for the jth fully-connected layer, the first formula includes: $M_j \approx M_{j_1} * M_{j_2}$; the two sub-matrices in the jth fully-connected layer comprise a first sub-matrix $M_{j_1}$ and a second sub-matrix $M_{j_2}$, where $M_{j_1}$ is an $N_{in_j} \times S$ matrix and $M_{j_2}$ is an $S \times N_{out_j}$ matrix; S is a compression parameter, $N_{in_j}$ is the number of input neurons of the jth fully-connected layer of the neural network, and $N_{out_j}$ is the number of output neurons of the jth fully-connected layer of the neural network; the compression parameter characterizes the number of output neurons of $M_{j_1}$ and the number of input neurons of $M_{j_2}$, and S is a positive integer greater than 0 and less than or equal to $\min(N_{in_j}, N_{out_j})$.

According to the embodiment of the invention, the at least two sub-matrixes containing the compression parameters can be obtained by decomposing the first weight matrix, then each sub-matrix in the at least two sub-matrixes is solved according to a formula, and the second weight matrix meeting the preset precision is obtained by training the compressed neural network, so that the problem that the topological structure of the neural network is irregular easily caused by the adoption of a neural network pruning algorithm in the prior art is solved, the neural network is deeply compressed, the calculated amount of the neural network can be reduced, and the operation speed is improved.
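The error-threshold condition used by the solving unit (the norm of Q minus the product of the sub-matrices must not exceed T) can be checked as in the sketch below. The text does not say which matrix norm is meant; the Frobenius norm is used here purely as an assumption, and the function names and example matrices are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frobenius_error(q, factors):
    """Return ||Q - Q1 * Q2 * ... * Qn|| using the Frobenius norm (assumed)."""
    product = factors[0]
    for factor in factors[1:]:
        product = matmul(product, factor)
    return math.sqrt(sum((q[i][j] - product[i][j]) ** 2
                         for i in range(len(q)) for j in range(len(q[0]))))

def within_threshold(q, factors, t):
    """Check the second formula: the approximation error must not exceed T."""
    return frobenius_error(q, factors) <= t

q1 = [[2.0], [1.0], [3.0]]                  # 3 x 1 sub-matrix
q2 = [[1.0, 2.0]]                           # 1 x 2 sub-matrix
exact = [[2.0, 4.0], [1.0, 2.0], [3.0, 6.0]]
perturbed = [[2.5, 4.0], [1.0, 2.0], [3.0, 6.0]]

exact_error = frobenius_error(exact, [q1, q2])          # exact product -> 0.0
perturbed_error = frobenius_error(perturbed, [q1, q2])  # one entry off by 0.5
```

A candidate set of sub-matrix sizes would be accepted only if `within_threshold` returns true for the chosen threshold T.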

In order to better implement the above scheme of the embodiment of the present invention, the present invention further provides another electronic device, which is described in detail below with reference to the accompanying drawings:

Fig. 18 is a schematic structural diagram of an electronic device provided in the embodiment of the present invention. As shown in fig. 18, the electronic device 40 may include a processor 401, a memory 404, and a communication module 405, which may be connected to each other through a bus 406. The memory 404 may be a Random Access Memory (RAM) or a non-volatile memory (e.g., at least one disk memory). The memory 404 may optionally be at least one memory system located remotely from the aforementioned processor 401. The memory 404 is used for storing application program codes, and may include an operating system, a network communication module, a user interface module, and a data processing program; the communication module 405 is used for information interaction with an external device; the processor 401 is configured to call the program code and perform the following steps:

acquiring first input data; wherein the first input data comprises a first weight matrix;

compressing the first weight matrix into a second weight matrix; wherein, the second weight matrix comprises at least two sub-matrixes;

performing neural network computations based on second input data, wherein the second input data comprises the second weight matrix and input neuron data.

Wherein, the processor 401 compresses the first weight matrix into a second weight matrix; wherein, the second weight matrix includes at least two sub-matrices, which may include:

determining a size of each of the at least two sub-matrices, the first formula being Q ≈ Q₁*Q₂*......*Q_n(ii) a Wherein Q represents a first weight matrix; said Q₁Representing a first sub-matrix of the at least two sub-matrices; said Q₂Representing a second sub-matrix of the at least two sub-matrices; said Q_nRepresenting an nth sub-matrix of the at least two sub-matrices;

Wherein the processor 401 determining the size of each of the at least two sub-matrices according to the first formula, the first formula being Q ≈ Q_1 * Q_2 * ... * Q_n, may include:

determining the size of each of the at least two sub-matrices according to the first formula and a second formula, the second formula being ||Q - Q_1 * Q_2 * ... * Q_n|| ≤ T, wherein T represents a preset error threshold.
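One way to read the first and second formulas together, for the case of two sub-matrices, is to choose the smallest inner dimension for which the product of the sub-matrices stays within the error threshold T. The following is a minimal sketch only. It assumes a truncated singular value decomposition and the Frobenius norm, neither of which is prescribed by the text above, and the function name is hypothetical.

```python
import numpy as np

def factorize_within_threshold(Q, T):
    """Return (Q1, Q2, k), with k the smallest size such that ||Q - Q1 @ Q2||_F <= T.

    Assumption: truncated SVD is used to build the sub-matrices, and the
    norm in the second formula is taken to be the Frobenius norm.
    """
    if T < 0:
        raise ValueError("T must be non-negative")
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    # tail[k] is the Frobenius error left after keeping the first k singular values.
    tail = np.sqrt(np.concatenate([np.cumsum((s ** 2)[::-1])[::-1], [0.0]]))
    k = int(np.argmax(tail <= T))  # first k whose error is within the threshold
    k = max(k, 1)                  # keep at least one column/row
    Q1 = U[:, :k] * s[:k]          # shape (rows, k)
    Q2 = Vt[:k, :]                 # shape (k, cols)
    return Q1, Q2, k
```

A larger T permits a smaller k and hence stronger compression, at the cost of a larger approximation error before any retraining.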

Wherein the processor 401 adjusting the size of each of the at least two sub-matrices and training the compressed machine learning model to obtain a second weight matrix meeting the preset precision may include:

adjusting the size of each of the at least two sub-matrices, and training the compressed machine learning model to obtain a second weight matrix that meets the preset precision and whose compression ratio relative to the first weight matrix meets a preset compression ratio.
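The size adjustment can be illustrated for a weight matrix of N_in × N_out entries split into two sub-matrices with inner dimension K. The sketch below assumes that the compression ratio is defined as the number of original parameters divided by the number of parameters after compression; the text above does not fix this definition, and the function names are hypothetical. The training step that restores the preset precision is omitted.

```python
import math

def compression_ratio(n_in, n_out, k):
    """Original parameter count divided by compressed parameter count (assumed definition)."""
    return (n_in * n_out) / (k * (n_in + n_out))

def max_size_for_ratio(n_in, n_out, target_ratio):
    """Largest K in [1, min(n_in, n_out)] whose compression ratio is at least target_ratio.

    Returns None when no admissible K reaches the target.
    """
    k = math.floor((n_in * n_out) / (target_ratio * (n_in + n_out)))
    k = min(k, n_in, n_out)
    return k if k >= 1 else None
```

For example, with N_in = 1024, N_out = 512 and a target ratio of 4, the largest admissible size under this definition is K = 85; training would then check whether the preset precision can be met at that size, and K would be adjusted if it cannot.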

Wherein the neural network is a fully-connected layer neural network; the first formula includes: M ≈ M_1 * M_2; the two sub-matrices comprise a first sub-matrix M_1 and a second sub-matrix M_2, wherein M_1 is an N_in × K matrix and M_2 is a K × N_out matrix; K is a compression parameter, N_in is the number of input neurons of the neural network, and N_out is the number of output neurons of the neural network; the compression parameter characterizes the number of output neurons of M_1 and the number of input neurons of M_2, and K is a positive integer greater than 0 and less than or equal to min(N_in, N_out).
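The effect of the fully-connected decomposition on the amount of computation can be sketched as follows. The example assumes a row-vector convention (output = input × weight) and omits bias and activation; all function names and the concrete sizes are illustrative assumptions, not taken from the embodiment.

```python
import numpy as np

def fc_forward(x, M):
    """Uncompressed layer: one N_in x N_out matrix."""
    return x @ M

def fc_forward_factored(x, M1, M2):
    """Compressed layer: N_in x K followed by K x N_out."""
    return (x @ M1) @ M2

def mults_per_sample(n_in, n_out, k=None):
    """Multiplications per input sample, before (k is None) or after compression."""
    return n_in * n_out if k is None else k * (n_in + n_out)

rng = np.random.default_rng(0)
n_in, n_out, k = 64, 32, 8            # illustrative sizes
M1 = rng.standard_normal((n_in, k))
M2 = rng.standard_normal((k, n_out))
x = rng.standard_normal((5, n_in))    # batch of 5 input neuron vectors

y_full = fc_forward(x, M1 @ M2)       # layer with the product as its single weight matrix
y_fact = fc_forward_factored(x, M1, M2)
```

With these sizes the uncompressed layer needs 64 × 32 = 2048 multiplications per sample and the factored layer needs 8 × (64 + 32) = 768, while the layer still maps N_in inputs to N_out outputs, so the network topology is unchanged.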

Wherein the neural network is a convolutional layer neural network; the convolutional layer neural network comprises N_fin * N_fout convolution kernels; the first formula includes: F ≈ F_1 * F_2; wherein F represents any one of the N_fin * N_fout convolution kernels; F_1 is a first sub-convolution kernel; F_2 is a second sub-convolution kernel; the first sub-convolution kernel F_1 has size (K_x, R), the second sub-convolution kernel F_2 has size (R, K_y), (K_x, K_y) represents the size of the convolution kernel, R is a compression parameter, and R is a positive integer greater than 0 and less than or equal to min(K_x, K_y).
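The per-kernel decomposition can be sketched for all N_fin * N_fout kernels at once. The example assumes the kernels are stored as an array of shape (N_fin, N_fout, K_x, K_y) and uses a truncated singular value decomposition to obtain the sub-kernels; both choices, and the function name, are assumptions made for illustration.

```python
import numpy as np

def factorize_kernels(F, R):
    """Split every (K_x, K_y) kernel into a (K_x, R) and an (R, K_y) sub-kernel.

    F has shape (N_fin, N_fout, K_x, K_y). Returns F1 of shape
    (N_fin, N_fout, K_x, R) and F2 of shape (N_fin, N_fout, R, K_y).
    """
    k_x, k_y = F.shape[-2:]
    if not 1 <= R <= min(k_x, k_y):
        raise ValueError("R must satisfy 1 <= R <= min(K_x, K_y)")
    # np.linalg.svd operates on the last two axes of a stacked array.
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    F1 = U[..., :, :R] * s[..., None, :R]
    F2 = Vt[..., :R, :]
    return F1, F2
```

Each kernel then holds R × (K_x + K_y) values instead of K_x × K_y; for a 3 × 3 kernel with R = 1 this is 6 values instead of 9.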

Wherein the neural network is an LSTM layer neural network; the LSTM layer neural network comprises N fully-connected layers, wherein N is a positive integer greater than 0; for the jth fully-connected layer, the first formula includes: M_j ≈ M_{j_1} * M_{j_2}; the two sub-matrices in the jth fully-connected layer comprise a first sub-matrix M_{j_1} and a second sub-matrix M_{j_2}, wherein M_{j_1} is an N_{in_j} × S matrix and M_{j_2} is an S × N_{out_j} matrix; S is a compression parameter, N_{in_j} is the number of input neurons of the jth fully-connected layer of the neural network, and N_{out_j} is the number of output neurons of the jth fully-connected layer of the neural network; the compression parameter characterizes the number of output neurons of M_{j_1} and the number of input neurons of M_{j_2}, and S is a positive integer greater than 0 and less than or equal to min(N_{in_j}, N_{out_j}).
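For the LSTM case, the same decomposition is applied to each of the N fully-connected layers in turn. The sketch below assumes that one compression parameter S is shared by all layers and that each layer's weights are given as an N_{in_j} × N_{out_j} array; truncated singular value decomposition is again an illustrative choice, and the function names are hypothetical.

```python
import numpy as np

def factorize_layers(weights, S):
    """Return a list of (M_j1, M_j2) pairs, one per fully-connected layer."""
    factors = []
    for M in weights:
        n_in, n_out = M.shape
        if not 1 <= S <= min(n_in, n_out):
            raise ValueError("S must satisfy 1 <= S <= min(N_in_j, N_out_j)")
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        factors.append((U[:, :S] * s[:S], Vt[:S, :]))
    return factors

def parameter_counts(weights, S):
    """Total number of weights before and after decomposition."""
    before = sum(M.shape[0] * M.shape[1] for M in weights)
    after = sum(S * (M.shape[0] + M.shape[1]) for M in weights)
    return before, after
```

Because S is bounded by the smallest layer dimension, a shared S is limited by the narrowest layer; choosing a separate size per layer would be a natural variation of this sketch.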

It should be noted that, for the step executed by the processor in the electronic device 40 in the embodiment of the present invention, reference may be made to the specific implementation manner of the operation of the electronic device in the embodiment of fig. 16 in the foregoing method embodiments, and details are not described here again.

In practical applications, the electronic device 40 may include one or more processors 401; the number is not limited to one. In one embodiment, the electronic device 40 further includes a Graphics Processing Unit (GPU) for processing images, and may also include an embedded Neural Network Processor (NPU). In this case, the compression method for the neural network may be integrated in the NPU. In one embodiment, the processor 401 may control the NPU to perform the compression method for the first weight matrix.

In a specific implementation, as described above, the electronic device 40 may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device; the embodiment of the present invention is not specifically limited in this respect.

An embodiment of the present invention further provides a computer storage medium for storing computer software instructions for the electronic device shown in fig. 16, which includes a program for executing the method embodiment described above. By executing the stored program, the first weight matrix can be compressed to obtain a second weight matrix meeting the preset precision, so that an irregular topological structure of the neural network model is avoided and the amount of computation of the neural network is reduced.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a division by logical function, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or in other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.

The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. The aforementioned memory comprises: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other various media capable of storing program code.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A computing device configured to perform machine learning computations, the computing device comprising: a compression unit, an arithmetic unit, and a controller unit;

2. The computing device of claim 1, wherein the compression unit comprises: a decomposition unit, a solving unit, and a training unit;

the decomposition unit is configured to decompose the first weight matrix into a third weight matrix; wherein the third weight matrix comprises at least two sub-matrices;

the solving unit is configured to determine the size of each of the at least two sub-matrices according to a first formula, the first formula being Q ≈ Q_1 * Q_2 * ... * Q_n, wherein Q represents the first weight matrix; Q_1 represents a first sub-matrix of the at least two sub-matrices; Q_2 represents a second sub-matrix of the at least two sub-matrices; and Q_n represents an nth sub-matrix of the at least two sub-matrices;

and the training unit is used for adjusting the size of each sub-matrix of the at least two sub-matrices and obtaining a second weight matrix meeting the preset precision by training the compressed machine learning model.

3. The computing device according to claim 2, wherein the solving unit being configured to determine the size of each of the at least two sub-matrices according to the first formula, the first formula being Q ≈ Q_1 * Q_2 * ... * Q_n, comprises:

the solving unit is specifically configured to determine the size of each of the at least two sub-matrices according to the first formula and a second formula, the second formula being ||Q - Q_1 * Q_2 * ... * Q_n|| ≤ T, wherein T represents a preset error threshold.

4. The computing apparatus according to claim 2, wherein the training unit is configured to adjust a size of each of the at least two sub-matrices and train the compressed machine learning model to obtain a second weight matrix satisfying a predetermined precision, and the training unit is configured to:

5. The computing device of any of claims 2 to 4, wherein the computing device is configured to perform fully-connected layer neural network computations; the at least two sub-matrices comprise two sub-matrices; the first formula includes: M ≈ M_1 * M_2; the two sub-matrices include a first sub-matrix M_1 and a second sub-matrix M_2, wherein M_1 is an N_in × K matrix and M_2 is a K × N_out matrix; K is a compression parameter, N_in is the number of input neurons of the neural network, and N_out is the number of output neurons of the neural network; the compression parameter characterizes the number of output neurons of M_1 and the number of input neurons of M_2, and K is a positive integer greater than 0 and less than or equal to min(N_in, N_out).

6. The computing device of any of claims 2-4, wherein the computing device is configured to perform convolutional layer neural network computations; the convolutional layer neural network comprises N_fin * N_fout convolution kernels; the first formula includes: F ≈ F_1 * F_2; wherein F represents any one of the N_fin * N_fout convolution kernels; F_1 is a first sub-convolution kernel; F_2 is a second sub-convolution kernel; the first sub-convolution kernel F_1 has size (K_x, R), the second sub-convolution kernel F_2 has size (R, K_y), (K_x, K_y) represents the size of the convolution kernel, R is a compression parameter, and R is a positive integer greater than 0 and less than or equal to min(K_x, K_y).

7. The computing device of any of claims 2-4, wherein the computing device is configured to perform LSTM layer neural network computations, wherein the LSTM layer comprises N fully-connected layers, and wherein N is a positive integer greater than 0; for the jth fully-connected layer, the first formula includes: M_j ≈ M_{j_1} * M_{j_2}; the two sub-matrices in the jth fully-connected layer comprise a first sub-matrix M_{j_1} and a second sub-matrix M_{j_2}, wherein M_{j_1} is an N_{in_j} × S matrix and M_{j_2} is an S × N_{out_j} matrix; S is a compression parameter, N_{in_j} is the number of input neurons of the jth fully-connected layer of the neural network, and N_{out_j} is the number of output neurons of the jth fully-connected layer of the neural network; the compression parameter characterizes the number of output neurons of M_{j_1} and the number of input neurons of M_{j_2}, and S is a positive integer greater than 0 and less than or equal to min(N_{in_j}, N_{out_j}).

8. The computing device according to claim 1, wherein the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;

the main processing circuit performs pre-processing on the second input data and exchanges data and operation instructions with the plurality of slave processing circuits;

the plurality of slave processing circuits execute intermediate operation in parallel according to the data and the operation instruction transmitted from the main processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the main processing circuit;

and the main processing circuit executes subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
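The division of work between the main processing circuit and the slave processing circuits can be illustrated in software. The sketch below is an analogy only: it assumes that pre-processing amounts to splitting a weight matrix into column blocks, that each slave multiplies the input by its block, and that subsequent processing amounts to concatenating the partial outputs. Threads stand in for hardware circuits, and all names are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def slave_compute(x, w_block):
    """Intermediate operation carried out by one slave: a partial matrix product."""
    return x @ w_block

def master_compute(x, W, n_slaves=4):
    """Split W among the slaves, run them in parallel, and combine the results."""
    blocks = np.array_split(W, n_slaves, axis=1)   # pre-processing by the master
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partials = list(pool.map(lambda b: slave_compute(x, b), blocks))
    return np.concatenate(partials, axis=1)        # subsequent processing by the master
```

Splitting by output columns means each slave produces a disjoint group of output neurons, so the master only needs to concatenate; other splits would require a reduction step instead.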

9. The computing device of claim 1, further comprising: a storage unit and a direct memory access unit, the storage unit comprising: any combination of a register and a cache;

the cache is used for storing the first input data and the second input data;

the register is used for storing scalar data in the first input data and the second input data;

the cache comprises a scratch pad cache;

the controller unit includes: an instruction storage unit, an instruction processing unit, and a storage queue unit;

the instruction storage unit is used for storing a calculation instruction associated with the artificial neural network operation;

the instruction processing unit is used for analyzing the calculation instruction to obtain a plurality of operation instructions;

the storage queue unit is configured to store an instruction queue, where the instruction queue includes: a plurality of operation instructions or calculation instructions to be executed according to the front and back sequence of the queue;

the control unit includes a main processing circuit including: a dependency processing unit;

the dependency relationship processing unit is configured to determine whether an association relationship exists between a first operation instruction and a zeroth operation instruction before the first operation instruction, if the association relationship exists between the first operation instruction and the zeroth operation instruction, cache the first operation instruction in the instruction storage unit, and after the zeroth operation instruction is executed, extract the first operation instruction from the instruction storage unit and transmit the first operation instruction to the operation unit;

extracting a first storage address interval of required data in the first operation instruction according to the first operation instruction, extracting a zeroth storage address interval of the required data in the zeroth operation instruction according to the zeroth operation instruction, if the first storage address interval and the zeroth storage address interval have an overlapped area, determining that the first operation instruction and the zeroth operation instruction have an association relation, and if the first storage address interval and the zeroth storage address interval do not have an overlapped area, determining that the first operation instruction and the zeroth operation instruction do not have an association relation.
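The association check between the first and the zeroth operation instruction reduces to an interval-overlap test. The sketch below assumes that each storage address interval is given as a closed pair (start, end) with start ≤ end; the representation and the function names are illustrative assumptions.

```python
def intervals_overlap(first, zeroth):
    """True if two closed address intervals (start, end) share at least one address."""
    f_start, f_end = first
    z_start, z_end = zeroth
    return f_start <= z_end and z_start <= f_end

def must_wait(first_interval, zeroth_interval):
    """The first instruction is held back while the zeroth instruction is
    still executing if, and only if, their address intervals overlap."""
    return intervals_overlap(first_interval, zeroth_interval)
```

For instance, intervals (0, 10) and (5, 15) overlap, so the first operation instruction would be cached until the zeroth operation instruction has finished; intervals (0, 4) and (5, 9) do not overlap, so the first operation instruction could be transmitted to the operation unit directly.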

10. A machine learning arithmetic device, characterized in that the machine learning arithmetic device comprises one or more computing devices according to any one of claims 1 to 9, and is used for acquiring input data and control information to be operated from other processing devices, executing specified machine learning operation, and transmitting the execution result to other processing devices through an I/O interface;

when the machine learning arithmetic device comprises a plurality of computing devices, the plurality of computing devices can be connected through a specific structure and transmit data;

the plurality of computing devices are interconnected through a Peripheral Component Interconnect Express (PCIe) bus and transmit data through it, so as to support larger-scale machine learning operations; the plurality of computing devices share the same control system or have their own control systems; the plurality of computing devices share memory or have their own memories; the plurality of computing devices are interconnected in any interconnection topology.

11. A combined processing apparatus, characterized in that the combined processing apparatus comprises the machine learning arithmetic apparatus according to claim 10, a universal interconnection interface, a storage apparatus and other processing apparatuses;

the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user; the storage device is connected with the machine learning arithmetic device and the other processing device respectively and is used for storing the data of the machine learning arithmetic device and the other processing device.

12. A neural network chip, wherein the neural network chip comprises the machine learning arithmetic device of claim 10 or the combined processing device of claim 11.

13. An electronic device, characterized in that it comprises a chip according to claim 12.

14. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the neural network chip of claim 12;

wherein, the neural network chip is respectively connected with the storage device, the control device and the interface device;

the storage device is used for storing data;

the interface device is used for realizing data transmission between the chip and external equipment;

and the control device is used for monitoring the state of the chip.

15. A computing method for executing a machine learning model, wherein the computing method is applied to a computing device for executing machine learning computation; the computing device includes: a compression unit, an arithmetic unit, and a controller unit; the method comprises the following steps:

the controller unit acquires second input data and a calculation instruction; the second input data comprises the second weight matrix and input neuron data;

the controller unit analyzes the calculation instruction to obtain a plurality of operation instructions and sends the operation instructions and the second input data to an operation unit;

16. The method of claim 15, wherein the compression unit comprises: a decomposition unit, a solving unit, and a training unit;

the solving unit determines the size of each of the at least two sub-matrices according to a first formula, the first formula being Q ≈ Q_1 * Q_2 * ... * Q_n, wherein Q represents the first weight matrix; Q_1 represents a first sub-matrix of the at least two sub-matrices; Q_2 represents a second sub-matrix of the at least two sub-matrices; and Q_n represents an nth sub-matrix of the at least two sub-matrices;

17. The method according to claim 16, wherein the solving unit determining the size of each of the at least two sub-matrices according to the first formula, the first formula being Q ≈ Q_1 * Q_2 * ... * Q_n, comprises:

18. The method according to claim 16, wherein the training unit is configured to adjust a size of each of the at least two sub-matrices and train the compressed machine learning model to obtain a second weight matrix satisfying a predetermined precision, and includes:

19. The method of any one of claims 16-18, wherein the computing device is configured to perform fully-connected layer neural network computations; the at least two sub-matrices comprise two sub-matrices; the first formula includes: M ≈ M_1 * M_2; the two sub-matrices include a first sub-matrix M_1 and a second sub-matrix M_2, wherein M_1 is an N_in × K matrix and M_2 is a K × N_out matrix; K is a compression parameter, N_in is the number of input neurons of the neural network, and N_out is the number of output neurons of the neural network; the compression parameter characterizes the number of output neurons of M_1 and the number of input neurons of M_2, and K is a positive integer greater than 0 and less than or equal to min(N_in, N_out).

20. The method of any one of claims 16-18, wherein the computing device is configured to perform convolutional layer neural network computations; the convolutional layer neural network comprises N_fin * N_fout convolution kernels; the first formula includes: F ≈ F_1 * F_2; wherein F represents any one of the N_fin * N_fout convolution kernels; F_1 is a first sub-convolution kernel; F_2 is a second sub-convolution kernel; the first sub-convolution kernel F_1 has size (K_x, R), the second sub-convolution kernel F_2 has size (R, K_y), (K_x, K_y) represents the size of the convolution kernel, R is a compression parameter, and R is a positive integer greater than 0 and less than or equal to min(K_x, K_y).

21. The method of any of claims 16-18, wherein the computing device is configured to perform LSTM layer neural network computations, wherein the LSTM layer comprises N fully-connected layers, wherein N is a positive integer greater than 0; for the jth fully-connected layer, the first formula includes: M_j ≈ M_{j_1} * M_{j_2}; the two sub-matrices in the jth fully-connected layer comprise a first sub-matrix M_{j_1} and a second sub-matrix M_{j_2}, wherein M_{j_1} is an N_{in_j} × S matrix and M_{j_2} is an S × N_{out_j} matrix; S is a compression parameter, N_{in_j} is the number of input neurons of the jth fully-connected layer of the neural network, and N_{out_j} is the number of output neurons of the jth fully-connected layer of the neural network; the compression parameter characterizes the number of output neurons of M_{j_1} and the number of input neurons of M_{j_2}, and S is a positive integer greater than 0 and less than or equal to min(N_{in_j}, N_{out_j}).

22. The method of claim 15, wherein the arithmetic unit comprises: a master processing circuit and a plurality of slave processing circuits;

23. The method of claim 15, wherein the computing device further comprises: a storage unit and a direct memory access unit, the storage unit comprising: any combination of a register and a cache;

the cache stores the first input data and the second input data;

the register stores scalars in the first input data and the second input data; the cache comprises a scratch pad cache;

the instruction storage unit stores a calculation instruction associated with the artificial neural network operation;

the instruction processing unit analyzes the calculation instruction to obtain a plurality of operation instructions;

the store queue unit stores an instruction queue comprising: a plurality of operation instructions or calculation instructions to be executed according to the front and back sequence of the queue;

the dependency relationship processing unit determines whether a first operation instruction and a zeroth operation instruction before the first operation instruction have an association relationship, if the first operation instruction and the zeroth operation instruction have the association relationship, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction is executed, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit;