WO2021082722A1 - Computing device and method, and related product - Google Patents


Info

Publication number
WO2021082722A1
WO2021082722A1 (PCT/CN2020/113160)
Authority
WO
WIPO (PCT)
Prior art keywords
data
transformation
sub
result
unit
Prior art date
Application number
PCT/CN2020/113160
Other languages
French (fr)
Chinese (zh)
Inventor
张英男
曾洪博
张尧
刘少礼
黄迪
周诗怡
张曦珊
刘畅
郭家明
高钰峰
Original Assignee
中科寒武纪科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2021082722A1


Classifications

    • G06F17/15 Correlation function computation including computation of convolution operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

A computing device and method, and a related product. The device comprises a master control unit, a slave control unit, a storage unit, a master computing unit, and a slave computing unit. The device improves the energy efficiency ratio and computing speed of deep learning networks at the hardware-architecture level, thereby improving the performance of the deep learning networks.

Description

Computing device, method, and related products
This application claims priority to the Chinese patent application with application number 2019110610783, titled "Computing device, method and related products", filed with the Chinese Patent Office on November 1, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a computing device, a method, and related products.
Background
With the maturation of emerging technologies such as big data and machine learning, more and more tasks involve a wide variety of matrix operations, and the speed of these matrix operations determines the speed of the computer algorithms built on them.
Popular deep learning networks contain a large number of matrix multiplication operations. In a fully connected layer of a deep learning network, the output neurons are computed as y = f(wx + b), where w is the weight matrix, x is the input vector, and b is the bias vector. The output y is obtained by multiplying the matrix w by the vector x, adding the vector b, and then applying an activation function element-wise to the resulting vector. In this process, the matrix-vector multiplication is far more expensive than the subsequent bias addition and activation, so implementing it efficiently has the greatest impact on the overall computation.
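As an illustrative sketch (not part of the disclosure), the fully connected computation y = f(wx + b) can be written in a few lines of NumPy, here using ReLU as an example activation function f:

```python
import numpy as np

def fully_connected(w, x, b):
    """Fully connected layer y = f(wx + b); ReLU is used as an example f."""
    z = w @ x + b              # matrix-vector multiply dominates the cost
    return np.maximum(z, 0.0)  # element-wise activation (ReLU)

w = np.array([[1.0, 2.0],
              [3.0, -4.0]])   # weight matrix
x = np.array([1.0, 1.0])      # input vector
b = np.array([0.5, -0.5])     # bias vector
y = fully_connected(w, x, b)  # -> [3.5, 0.0]
```

For an n-by-n weight matrix, the product wx costs on the order of n squared multiplications, while the bias addition and activation cost only on the order of n operations, which is why the matrix multiplication dominates.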
However, in hardware implementations, the multiplications of the original convolution are costly, resulting in a low energy efficiency ratio and long computation times for deep learning networks.
Summary
The present disclosure provides a computing device, a method, and related products to improve the energy efficiency ratio and computing speed of deep learning networks at the hardware-architecture level, thereby improving the performance of deep learning networks.
In a first aspect, the present disclosure provides a computing device, including: a master control unit, a slave control unit, a storage unit, a master arithmetic unit, and a slave arithmetic unit;
The master control unit is configured to send a first control instruction, which instructs the master arithmetic unit to perform the forward transformation of a Winograd convolution and instructs the slave control unit to send a second control instruction, which in turn instructs the slave arithmetic unit to perform the multiply-add and inverse transformation of the Winograd convolution;
The storage unit is configured to store the data used in the Winograd convolution;
The master arithmetic unit is configured to, in response to the first control instruction, fetch data from the storage unit and perform the forward transformation of the Winograd convolution to obtain a forward transformation result, where the forward transformation is decomposed into summation operations;
The slave arithmetic unit is configured to, in response to the second control instruction, obtain the forward transformation result from the master arithmetic unit, fetch data from the storage unit, and perform the multiply-add and inverse transformation of the Winograd convolution to obtain the Winograd convolution result, where the inverse transformation is decomposed into summation operations.
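To make the division of labor concrete, the following hedged sketch models the master step as the Winograd forward transformation of a 4x4 feature tile and the slave step as the element-wise multiplication with pre-transformed 3x3 weights followed by the inverse transformation. The function names and the F(2x2, 3x3) tile size are illustrative choices, and the transform matrices are the standard textbook ones, not matrices taken from the disclosure:

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices (textbook values).
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def master_forward_transform(d):
    """Master-unit step: forward-transform a 4x4 feature tile."""
    return B_T @ d @ B_T.T

def slave_multiply_and_inverse(V, U):
    """Slave-unit step: multiply element-wise with the pre-transformed
    weights, then inverse-transform to a 2x2 output tile."""
    return A_T @ (U * V) @ A_T.T

g = np.array([[1.0, 2.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])               # 3x3 weights
d = np.arange(16, dtype=float).reshape(4, 4)  # 4x4 feature tile
U = G @ g @ G.T                               # weight transform, done once and stored
Y = slave_multiply_and_inverse(master_forward_transform(d), U)

# Reference: direct 2x2 'valid' convolution of d with g.
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
```

The Winograd result Y equals the direct convolution reference, while the only multiplications that scale with the data are the 16 element-wise products in the slave step.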
In a second aspect, the present disclosure provides a neural network computing device that includes one or more computing devices as described in the first aspect, configured to obtain data to be operated on and control information from other processing devices, perform the specified neural network operations, and pass the execution results to other processing devices through an I/O interface;
When the neural network computing device contains multiple such computing devices, the computing devices are connected through a specific structure to transfer data;
Specifically, the multiple computing devices are interconnected and transfer data through a high-speed external device interconnect bus to support larger-scale neural network operations; the computing devices may share a single control system or have their own control systems; they may share memory or have their own memories; and they may be interconnected in an arbitrary topology.
In a third aspect, the present disclosure provides an artificial intelligence chip including the computing device of any implementation of the first aspect.
In a fourth aspect, the present disclosure provides an electronic device including the artificial intelligence chip of the third aspect.
In a fifth aspect, the present disclosure provides a board card, including: a storage device, an interface device, a control device, and the artificial intelligence chip of the third aspect;
The artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively;
The storage device is configured to store data;
The interface device is configured to implement data transfer between the artificial intelligence chip and external equipment;
The control device is configured to monitor the state of the artificial intelligence chip.
In a sixth aspect, the present disclosure provides a computing method applied to a computing device that includes a master control unit, a slave control unit, a storage unit, a master arithmetic unit, and a slave arithmetic unit. The method includes:
The master control unit sends a first control instruction, which instructs the master arithmetic unit to perform the forward transformation of a Winograd convolution and instructs the slave control unit to send a second control instruction, which in turn instructs the slave arithmetic unit to perform the multiply-add and inverse transformation of the Winograd convolution;
The storage unit stores the data used in the Winograd convolution;
In response to the first control instruction, the master arithmetic unit fetches data from the storage unit and performs the forward transformation of the Winograd convolution to obtain a forward transformation result, where the forward transformation is decomposed into summation operations;
In response to the second control instruction, the slave arithmetic unit obtains the forward transformation result from the master arithmetic unit, fetches data from the storage unit, and performs the multiply-add and inverse transformation of the Winograd convolution to obtain the Winograd convolution result, where the inverse transformation is decomposed into summation operations.
In the computing device, method, and related products provided by the present disclosure, the computing device is provided with a master control unit, a slave control unit, a storage unit, a master arithmetic unit, and a slave arithmetic unit. The master control unit sends a first control instruction, which instructs the master arithmetic unit to perform the forward transformation of the Winograd convolution and instructs the slave control unit to send a second control instruction, which in turn instructs the slave arithmetic unit to perform the multiply-add and inverse transformation. The storage unit stores the data used in the Winograd convolution. In response to the first control instruction, the master arithmetic unit fetches data from the storage unit and performs the forward transformation, obtaining the forward transformation result, where the forward transformation is decomposed into summation operations. In response to the second control instruction, the slave arithmetic unit obtains the forward transformation result from the master arithmetic unit, fetches data from the storage unit, and performs the multiply-add and inverse transformation, obtaining the Winograd convolution result, where the inverse transformation is decomposed into summation operations. This scheme can effectively improve the energy efficiency ratio and computing speed of deep learning networks at the hardware-architecture level, improving the performance of deep learning networks.
The present disclosure can be applied in scenarios including, but not limited to: data processing; electronic products such as robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, and wearable devices; vehicles such as aircraft, ships, and automobiles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and medical equipment such as nuclear magnetic resonance scanners, B-mode ultrasound scanners, and electrocardiographs.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
Fig. 1 is a schematic structural diagram of a computing device provided by an embodiment of this application;
Fig. 2 is a schematic structural diagram of a computing device provided by another embodiment of this application;
Fig. 3 is a pipeline timing diagram of a computing device provided by yet another embodiment of this application;
Fig. 4 is a schematic flowchart of a computing method provided by yet another embodiment of this application;
Fig. 5 is a schematic diagram of a processing system for a computing method according to an embodiment of the present disclosure;
Fig. 6 is a structural block diagram of a board card according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, specification, and drawings of the present disclosure are used to distinguish different objects, not to describe a particular order. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this specification of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]", depending on the context.
To clarify the technical solutions of the present application, the technical terms involved in the prior art and in the embodiments of the present application are explained below:
Convolution operation: a convolution operation starts from the upper-left corner of an image by opening a sliding window of the same size as the convolution kernel. The kernel values are multiplied element-wise with the corresponding image pixels inside the window and the products are summed; the result is the first pixel value of the new image produced by the convolution. The window then moves one column to the right, and the element-wise multiplication and summation are repeated to give the second pixel value. Continuing in this way, from left to right and from top to bottom, yields a complete new image.
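The sliding-window procedure described above can be sketched as a minimal reference implementation (illustrative NumPy code, not code from the disclosure):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Direct sliding-window ('valid') convolution: multiply the kernel
    element-wise with each window of the image and sum the products to
    produce one output pixel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):            # top to bottom
        for j in range(ow):        # left to right
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(9, dtype=float).reshape(3, 3)
kernel = np.ones((2, 2))
result = conv2d_valid(image, kernel)  # -> [[8., 12.], [20., 24.]]
```

Each output pixel costs one kernel-sized batch of multiplications, which is the per-window cost that Winograd convolution reduces.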
Winograd convolution: Winograd convolution is a convolution acceleration technique based on polynomial interpolation. The two inputs of the convolution operation, a first target matrix and a second target matrix, are each subjected to the Winograd forward transformation; the transformed first and second target matrices are then multiplied element-wise; finally, the Winograd inverse transformation is applied to the element-wise product, giving a convolution result equivalent to that of the original convolution operation.
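For concreteness, the three stages (forward transforms, element-wise multiplication, inverse transform) can be illustrated with the standard 1-D Winograd algorithm F(2,3), which produces two convolution outputs from a 4-element input tile and a 3-tap filter using only four element-wise multiplications. The transform matrices below are the textbook ones, not matrices taken from the disclosure:

```python
import numpy as np

# Textbook Winograd F(2,3) transform matrices.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)   # input forward transform
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])                 # filter forward transform
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)   # inverse transform

def winograd_f23(d, g):
    U = G @ g        # forward transform of the 3-tap filter
    V = B_T @ d      # forward transform of the 4-element input tile
    M = U * V        # element-wise (Hadamard) multiplication: 4 products
    return A_T @ M   # inverse transform -> 2 convolution outputs

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 1.0, 1.0])
y = winograd_f23(d, g)   # -> [6.0, 9.0], same as the direct convolution
```

Computing the same two outputs directly needs six multiplications (d[0:3] with g and d[1:4] with g); F(2,3) needs four, and for larger tiles the savings grow, which is the source of the efficiency gain.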
Convolutional neural network model: a convolutional neural network model is a class of feedforward neural network model that contains convolution computations and has a deep structure, and it is one of the representative models of deep learning. In network layers of a convolutional neural network model such as the convolutional layers and fully connected layers, convolution operations between neurons and convolution kernels are required to obtain feature data.
Fig. 1 is a schematic structural diagram of a computing device provided by an embodiment of this application. As shown in Fig. 1, the device of this embodiment may include a master control unit, a slave control unit, a storage unit, a master arithmetic unit, and a slave arithmetic unit. The master control unit is configured to send a first control instruction, which instructs the master arithmetic unit to perform the forward transformation of the Winograd convolution and instructs the slave control unit to send a second control instruction, which in turn instructs the slave arithmetic unit to perform the multiply-add and inverse transformation of the Winograd convolution. The storage unit is configured to store the data used in the Winograd convolution. The master arithmetic unit is configured to, in response to the first control instruction, fetch data from the storage unit and perform the forward transformation of the Winograd convolution to obtain a forward transformation result, where the forward transformation is decomposed into summation operations. The slave arithmetic unit is configured to, in response to the second control instruction, obtain the forward transformation result from the master arithmetic unit, fetch data from the storage unit, and perform the multiply-add and inverse transformation of the Winograd convolution to obtain the Winograd convolution result, where the inverse transformation is decomposed into summation operations.
In this embodiment, the master control unit and the slave control unit receive an operation control signal and decode it to obtain the corresponding first and second control instructions; the master control unit then sends the first control instruction to the master arithmetic unit, and the second control instruction is sent to the slave arithmetic unit. In response to the first control instruction, the master arithmetic unit fetches data from the storage unit and performs the forward transformation of the Winograd convolution, obtaining the forward transformation result. In response to the second control instruction, the slave arithmetic unit obtains the forward transformation result from the master arithmetic unit, fetches data from the storage unit, and performs the multiply-add and inverse transformation of the Winograd convolution, obtaining the Winograd convolution result.
In this embodiment, the computing device is provided with a master control unit, a slave control unit, a storage unit, a master arithmetic unit, and a slave arithmetic unit. The master control unit sends the first control instruction, which instructs the master arithmetic unit to perform the forward transformation of the Winograd convolution and instructs the slave control unit to send the second control instruction, which in turn instructs the slave arithmetic unit to perform the multiply-add and inverse transformation. The storage unit stores the data used in the Winograd convolution. In response to the first control instruction, the master arithmetic unit fetches data from the storage unit and performs the forward transformation, obtaining the forward transformation result. In response to the second control instruction, the slave arithmetic unit obtains the forward transformation result from the master arithmetic unit, fetches data from the storage unit, and performs the multiply-add and inverse transformation, obtaining the Winograd convolution result. Because the forward transformation in the master arithmetic unit and the inverse transformation in the slave arithmetic unit are both decomposed into summation operations, the scheme effectively improves the energy efficiency ratio and computing speed of deep learning networks at the hardware-architecture level, improving the performance of deep learning networks.
Fig. 2 is a schematic structural diagram of a computing device provided by another embodiment of this application. As shown in Fig. 2, the device of this embodiment may include: a master control unit, a slave control unit, a master storage unit, a slave storage unit, a master arithmetic unit, and a slave arithmetic unit.
In a first optional implementation, the master storage unit is configured to receive and store the feature data, and the slave storage unit is configured to receive and store the forward-transformed weight data.
In this embodiment, the master arithmetic unit obtains the feature data from the master storage unit and decomposes it into multiple sub-tensors; it performs transformation operations on the sub-tensors and sums the results, obtaining the forward-transformed feature data from the summation.
Optionally, the master arithmetic unit parses the feature data into multiple sub-tensors, where the feature data equals the sum of the sub-tensors, the number of sub-tensors equals the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and that element equals the non-zero element at the corresponding position in the feature data. The master arithmetic unit obtains the Winograd transformation result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is the tensor obtained by setting the sub-tensor's non-zero element to 1. It multiplies each meta sub-tensor's Winograd transformation result by the value of the sub-tensor's non-zero element, used as a coefficient, to obtain the sub-tensor's Winograd transformation result, and adds the Winograd transformation results of the sub-tensors to obtain the forward-transformed feature data. For each sub-tensor, the master arithmetic unit multiplies the corresponding meta sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the meta sub-tensor's Winograd transformation result; both matrices are determined by the size of the sub-tensor and by the Winograd transformation type, which includes the forward-transformation type and the inverse-transformation type.
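Because the Winograd transformation is linear, the decomposition described above is exact: transforming each sub-tensor's meta sub-tensor (whose transforms can be precomputed as constants), scaling by the non-zero value, and summing reproduces the full transform. A small sketch, using the standard F(2,3) forward-transform matrix as the left/right multiplication matrices for a 4x4 tile (an illustrative choice, not the disclosure's exact matrices):

```python
import numpy as np

# Standard F(2,3) forward-transform matrix, used here as the left
# multiplication matrix for a 4x4 tile; its transpose serves as the
# right multiplication matrix.
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
B = B_T.T

def transform_by_subtensor_decomposition(d):
    """Compute B^T d B by splitting d into sub-tensors that each keep a
    single non-zero element of d, transforming the corresponding meta
    sub-tensor (the non-zero element replaced by 1), scaling by the
    element value, and summing."""
    result = np.zeros_like(d)
    for (i, j), value in np.ndenumerate(d):
        if value == 0:
            continue
        meta = np.zeros_like(d)
        meta[i, j] = 1.0                    # meta sub-tensor
        result += value * (B_T @ meta @ B)  # precomputable transform, scaled
    return result

d = np.arange(16, dtype=float).reshape(4, 4)
direct = B_T @ d @ B
decomposed = transform_by_subtensor_decomposition(d)
```

In hardware, the transforms of the meta sub-tensors are constants whose entries here are only 0 and plus or minus 1, so the scaled accumulation involves no general multiplications; this is how the forward (and, analogously, the inverse) transformation reduces to summation.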
In this embodiment, the slave arithmetic unit may also store the Winograd convolution result in a preset address space of the storage unit. This enables targeted storage of the computation results and makes full use of the storage space.
In this embodiment, the master arithmetic unit is communicatively connected to multiple slave arithmetic units, and different slave arithmetic units are responsible for operating on different forward transformation results. This design improves computing capacity and efficiency and enables asynchronous operation.
In this embodiment, the slave arithmetic unit obtains the forward-transformed weight data from the slave storage unit, and then performs an element-wise multiplication of the forward-transformed feature data and the forward-transformed weight data to obtain a multiplication result; it decomposes the multiplication result into multiple sub-tensors, and performs transformation operations on the sub-tensors and sums the results to obtain the Winograd convolution result.
Optionally, the slave operation unit parses the multiplication result into multiple sub-tensors, where the multiplication result is the sum of the sub-tensors, the number of sub-tensors equals the number of non-zero elements in the multiplication result, each sub-tensor contains a single non-zero element, and that non-zero element is identical to the non-zero element at the corresponding position in the multiplication result.
In this embodiment, the slave operation unit obtains the Winograd transformation result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; multiplies the Winograd transformation result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the Winograd transformation result of the sub-tensor; and adds the Winograd transformation results of the sub-tensors to obtain the Winograd convolution result. For each sub-tensor, the slave operation unit multiplies the corresponding meta sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the Winograd transformation result of the meta sub-tensor, where both matrices are determined by the size of the sub-tensor and the Winograd transformation type, and the Winograd transformation type includes a forward-transformation type and an inverse-transformation type.
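A minimal sketch of the meta-sub-tensor scheme, assuming the standard F(2x2, 3x3) forward-transformation matrix B^T (this application does not list concrete matrices): the transform of each meta sub-tensor is a constant matrix that can be precomputed once, and by linearity the coefficient-weighted sum of those constants equals the full left- and right-multiplication of the original tile.

```python
import numpy as np

# Standard forward-transform matrix for a 4x4 tile in F(2x2, 3x3)
# Winograd convolution (assumed for illustration).
BT = np.array([[1., 0., -1., 0.],
               [0., 1., 1., 0.],
               [0., -1., 1., 0.],
               [0., 1., 0., -1.]])

d = np.arange(16, dtype=float).reshape(4, 4)  # illustrative feature tile

# The transform of a meta sub-tensor (a single 1 at position (i, j)) is the
# constant matrix BT @ e_ij @ B, precomputable for every position.
meta_results = {}
for i in range(4):
    for j in range(4):
        e = np.zeros((4, 4))
        e[i, j] = 1.0
        meta_results[(i, j)] = BT @ e @ BT.T

# Scale each precomputed meta result by the element value and sum: by
# linearity this equals the direct left/right matrix transform BT @ d @ B.
acc = sum(d[i, j] * meta_results[(i, j)] for i in range(4) for j in range(4))
assert np.allclose(acc, BT @ d @ BT.T)
```

Since the entries of B^T are only 0 and ±1, the precomputed meta results contain only 0 and ±1 as well, so scaling and summing them involves no general multiplications; this is the sense in which the forward transformation is decomposed into summation operations.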
In a second optional implementation, the main storage unit is used to receive and store the feature data, and the slave storage unit is used to receive and store the weight data.
In this embodiment, the main operation unit obtains the feature data from the main storage unit, decomposes it into multiple sub-tensors, performs transformation operations on the sub-tensors and sums them, and obtains the forward transformation data of the feature data from the result of the summation.
Optionally, the main operation unit parses the feature data into multiple sub-tensors, where the feature data is the sum of the sub-tensors, the number of sub-tensors equals the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and that non-zero element is identical to the non-zero element at the corresponding position in the feature data. The main operation unit obtains the Winograd transformation result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; multiplies the Winograd transformation result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the Winograd transformation result of the sub-tensor; and adds the Winograd transformation results of the sub-tensors to obtain the forward transformation data of the feature data.
In this embodiment, for each sub-tensor, the main operation unit multiplies the corresponding meta sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the Winograd transformation result of the meta sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the size of the sub-tensor and the Winograd transformation type, and the Winograd transformation type includes a forward-transformation type and an inverse-transformation type.
In this embodiment, the slave operation unit obtains the weight data from the slave storage unit, decomposes it into multiple sub-tensors, performs transformation operations on the sub-tensors and sums them, and obtains the forward transformation data of the weight data from the result of the summation. The slave operation unit then performs element-wise multiplication of the forward transformation data of the feature data and the forward transformation data of the weight data to obtain a multiplication result, decomposes the multiplication result into multiple sub-tensors, and performs transformation operations on the sub-tensors and sums them to obtain the Winograd convolution result.
Optionally, the slave operation unit parses the multiplication result into multiple sub-tensors, where the multiplication result is the sum of the sub-tensors, the number of sub-tensors equals the number of non-zero elements in the multiplication result, each sub-tensor contains a single non-zero element, and that non-zero element is identical to the non-zero element at the corresponding position in the multiplication result. The slave operation unit obtains the Winograd transformation result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; multiplies the Winograd transformation result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the Winograd transformation result of the sub-tensor; and adds the Winograd transformation results of the sub-tensors to obtain the Winograd convolution result.
In this embodiment, for each sub-tensor, the slave operation unit multiplies the corresponding meta sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the Winograd transformation result of the meta sub-tensor, where both matrices are determined by the size of the sub-tensor and the Winograd transformation type, and the Winograd transformation type includes a forward-transformation type and an inverse-transformation type.
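The same linearity argument applies to the inverse transformation performed by the slave operation unit; a sketch assuming the standard F(2x2, 3x3) inverse-transform matrix A^T (not fixed by this application):

```python
import numpy as np

# Standard inverse-transform matrix for F(2x2, 3x3) Winograd convolution
# (assumed for illustration): maps a 4x4 product tile to a 2x2 output tile.
AT = np.array([[1., 1., 1., 0.],
               [0., 1., -1., -1.]])

M = np.random.rand(4, 4)  # illustrative element-wise multiplication result

# Decompose M into single-non-zero sub-tensors and inverse-transform each
# via its meta sub-tensor; by linearity the sum equals AT @ M @ A.
acc = np.zeros((2, 2))
for i in range(4):
    for j in range(4):
        e = np.zeros((4, 4))
        e[i, j] = 1.0                       # meta sub-tensor for position (i, j)
        acc += M[i, j] * (AT @ e @ AT.T)    # coefficient times transformed meta
assert np.allclose(acc, AT @ M @ AT.T)
```

As with the forward transform, the entries of A^T are only 0 and ±1, so the per-position meta results reduce the inverse transformation to summation operations.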
In a third optional implementation, the main storage unit is used to receive and store the feature data and the weight data, and the slave storage unit is used to receive and store the forward transformation data of the weight data.
In this embodiment, the main operation unit obtains the feature data from the main storage unit, decomposes it into multiple sub-tensors, performs transformation operations on the sub-tensors and sums them, and obtains the forward transformation data of the feature data from the result of the summation.
Optionally, the main operation unit parses the feature data into multiple sub-tensors, where the feature data is the sum of the sub-tensors, the number of sub-tensors equals the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and that non-zero element is identical to the non-zero element at the corresponding position in the feature data.
In this embodiment, the main operation unit obtains the Winograd transformation result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; multiplies the Winograd transformation result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the Winograd transformation result of the sub-tensor; and adds the Winograd transformation results of the sub-tensors to obtain the forward transformation data of the feature data. For each sub-tensor, the main operation unit multiplies the corresponding meta sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the Winograd transformation result of the meta sub-tensor, where both matrices are determined by the size of the sub-tensor and the Winograd transformation type, and the Winograd transformation type includes a forward-transformation type and an inverse-transformation type.
In this embodiment, the main operation unit decomposes the weight data into multiple sub-tensors, performs transformation operations on the sub-tensors and sums them, obtains the forward transformation data of the weight data from the result of the summation, and sends the forward transformation data of the weight data to the slave storage unit. The slave operation unit obtains the forward transformation data of the weight data from the slave storage unit, performs element-wise multiplication of the forward transformation data of the feature data and the forward transformation data of the weight data to obtain a multiplication result, decomposes the multiplication result into multiple sub-tensors, and performs transformation operations on the sub-tensors and sums them to obtain the Winograd convolution result.
Optionally, the slave operation unit parses the multiplication result into multiple sub-tensors, where the multiplication result is the sum of the sub-tensors, the number of sub-tensors equals the number of non-zero elements in the multiplication result, each sub-tensor contains a single non-zero element, and that non-zero element is identical to the non-zero element at the corresponding position in the multiplication result. The slave operation unit obtains the Winograd transformation result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; multiplies the Winograd transformation result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the Winograd transformation result of the sub-tensor; and adds the Winograd transformation results of the sub-tensors to obtain the Winograd convolution result. For each sub-tensor, the slave operation unit multiplies the corresponding meta sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the Winograd transformation result of the meta sub-tensor, where both matrices are determined by the size of the sub-tensor and the Winograd transformation type, and the Winograd transformation type includes a forward-transformation type and an inverse-transformation type.
In a fourth optional implementation, the main operation unit includes a main processing module and a cache. The main processing module is used to extract data from the storage unit in response to the first control instruction and perform the forward transformation operation of the Winograd convolution to obtain a forward transformation result; the cache is used to store the forward transformation result.
In this embodiment, the forward transformation results may be sent to the slave operation unit once a preset number of them have accumulated in the cache. This design batches the data to be processed by the slave operation unit and avoids keeping the slave operation unit permanently busy.
In a fifth optional implementation, the main operation unit and the slave operation unit operate in parallel: before the main operation unit has finished computing the forward transformation data of the feature data, the slave operation unit performs element-wise multiplication on the already-computed element positions of the forward transformation data of the feature data and the corresponding element positions of the forward transformation data of the weight data, until the element-wise product at every element position has been computed, yielding the multiplication result.
In this embodiment, the feature data stored in the main storage unit is divided into multiple pieces of first data for the Winograd convolution operation, where the size of the first data is determined by the size of the convolution kernel; the forward transformation data of the weight data stored in the slave storage unit is divided into multiple pieces of second data for the Winograd convolution operation, where the size of the second data is determined by the size of the first data. In response to the first control instruction sent by the main control unit, the main processing module of the main operation unit sequentially obtains the first data from the main storage unit, performs the forward transformation operation on it to obtain the forward transformation result of the first data, and stores that result in the cache. When the forward transformation results of the first data in the cache reach a preset number, the main processing module sequentially sends them to the slave operation unit. In response to the second control instruction sent by the slave control unit, the slave operation unit obtains the second data from the slave storage unit, performs element-wise multiplication of the forward transformation result of the first data and the second data to obtain an element-wise multiplication result, and performs the inverse transformation operation on that result to obtain an inverse transformation result. The slave operation unit obtains the Winograd convolution result from the inverse transformation result and sends it to the main storage unit for storage.
Fig. 3 is a pipeline timing diagram of an operation device provided by another embodiment of this application. As shown in Fig. 3, the slave functional unit and the main processing module are operation modules, while the cache, the main processing memory and the slave processing memory are storage modules. The input data pre-stored in the main processing memory and the slave processing memory has already been partitioned. The label "main functional unit to main processing memory" indicates that the main processing memory sends feature data to the main functional unit. The following takes a convolution with input size 4x4, kernel size 3x3, stride=1 and output size 2x2 as a detailed example. Suppose m denotes the number of feature data blocks partitioned along the height and width directions, and n denotes the number of weight data blocks partitioned along the Cout direction; k<=4 denotes the number of partitions along the Cin direction, and res(0,0)…(1,0) denote the output results. bd(i,j) denotes the i-th data block of bottom_data along the height and width directions and the j-th along the Cin direction after partitioning; a data block has size 16*16*512 bit, i.e. a block of syn_reuse_iter kernel-sized units, where syn_reuse_iter denotes the number of times a weight is reused and kernel denotes the most basic unit of the Winograd transformation. Wino_bd(i,j) denotes the bottom_data block in the Winograd domain after the Winograd transformation. Wino_w(i,j) denotes the i-th data block along the Cout direction and the j-th along the Cin direction of the transformed Winograd-domain weight; each such block has size 64*16*512 bit and corresponds to a feature data block, and the total number of blocks is n*k, with n=Cout/64 and k<=4. bd(i,j) is processed by the main processing module to produce the bottom_data transformation result Wino_bd(i1,j1), which is then sent to the slave functional unit, undergoes element-wise multiplication with the corresponding Wino_w(i2,j2), and then undergoes the inverse transformation operation to obtain the operation result.
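The worked configuration above (input 4x4, kernel 3x3, stride 1, output 2x2) corresponds to F(2x2, 3x3) Winograd convolution. A sketch using the standard transformation matrices (assumed here, since this application does not list them) runs the full forward transform / element-wise multiply / inverse transform pipeline and checks the result against direct convolution:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd matrices (assumed for illustration).
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0, 0, 1]])
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

d = np.random.rand(4, 4)  # feature data tile (bottom_data)
g = np.random.rand(3, 3)  # 3x3 convolution kernel (weight)

V = BT @ d @ BT.T         # forward-transformed feature data (Wino_bd)
U = G @ g @ G.T           # forward-transformed weight data (Wino_w)
M = U * V                 # element-wise multiplication in the Winograd domain
Y = AT @ M @ AT.T         # inverse transformation: 2x2 output tile

# Check against direct stride-1 convolution over the 4x4 input.
direct = np.array([[np.sum(d[r:r+3, c:c+3] * g) for c in range(2)]
                   for r in range(2)])
assert np.allclose(Y, direct)
```

Extending this per-tile computation with the Cin-direction accumulation and the block indices bd(i,j)/Wino_w(i,j) described above yields the pipelined behaviour shown in Fig. 3.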
It should be noted that in Fig. 3, "cache to slave functional unit" means that the cache sends data blocks to the slave functional unit, and "slave processing memory to slave functional unit" means that the slave processing memory sends data blocks to the slave functional unit. During these transfers, a change in the index of the data block being sent reflects the corresponding data reuse relationship. In addition, when the slave functional unit sends data blocks to the main processing memory, it does so at intervals, i.e. data that can still be accumulated along the Cin direction is temporarily kept in the slave functional unit. The above process loops block by block, with Cin as the innermost loop dimension, i.e. the transformed blocks are operated on along the Cin direction. During computation along the Cin direction, the slave functional unit does not output a buffered result immediately, but waits until accumulation along Cin has finished. The number of loop iterations along Cin is <=4; otherwise the transformation would have to be repeated. The bottom_data transformation result of the main functional unit does not have to wait until the whole feature data block has been forward-transformed before being sent to the slave functional unit, and the inversely transformed result of the slave functional unit can likewise be output as soon as part of the operation has completed, without waiting for the transformation of the entire kernel to finish. This design reduces the latency of data operations and increases the operation speed.
In this embodiment, the operation device is provided with a main control unit, a slave control unit, a main storage unit, a slave storage unit, a main operation unit and a slave operation unit. The main control unit sends a first control instruction, which instructs the main operation unit to perform the forward transformation operation of the Winograd convolution and instructs the slave control unit to send a second control instruction; the second control instruction instructs the slave operation unit to perform the multiply-add and inverse transformation operations of the Winograd convolution. The storage units store the data used for the Winograd convolution. In response to the first control instruction, the main operation unit extracts data from the main storage unit and performs the forward transformation operation of the Winograd convolution to obtain a forward transformation result. In response to the second control instruction, the slave operation unit obtains the forward transformation result from the main operation unit, extracts data from the slave storage unit, and performs the multiply-add and inverse transformation operations of the Winograd convolution to obtain the Winograd convolution result. Because the forward transformation of the main operation unit and the inverse transformation of the slave operation unit are both decomposed into summation operations, the energy-efficiency ratio and operation speed of deep learning networks on the hardware architecture are effectively improved, improving the performance of deep learning networks.
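The efficiency claim can be made concrete for the F(2x2, 3x3) case used in the examples of this application: after the transforms are reduced to summations, the only general multiplications left are the element-wise products in the Winograd domain.

```python
# Direct 3x3 convolution of a 2x2 output tile: 2*2 outputs, 9 multiplications each.
direct_muls = 2 * 2 * 3 * 3    # 36
# F(2x2, 3x3) Winograd: one element-wise product of two 4x4 tiles.
winograd_muls = 4 * 4          # 16
assert direct_muls == 36 and winograd_muls == 16
assert direct_muls / winograd_muls == 2.25  # roughly 2.25x fewer multiplications
```

The transform overhead that replaces the saved multiplications consists of additions only, which is what makes the scheme attractive on hardware where multipliers dominate area and energy.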
The present disclosure provides a neural network operation device that includes one or more operation devices as shown in Fig. 1 and Fig. 2, used to obtain data to be operated on and control information from other processing devices, perform specified neural network operations, and pass the execution results to other processing devices through an I/O interface. When the neural network operation device contains multiple operation devices, they may be connected through a specific structure to transmit data; specifically, the operation devices may be interconnected through a fast external device interconnect bus to transmit data, so as to support larger-scale neural network operations. The multiple operation devices may share a single control system or have their own control systems, may share memory or have their own memories, and may be interconnected in any interconnection topology.
Fig. 4 is a schematic flowchart of an operation method provided by another embodiment of this application. As shown in Fig. 4, the method of this embodiment is applied to the operation device shown in Fig. 1 and Fig. 2, and may include the following steps.
Step S101: the main control unit sends a first control instruction, which instructs the main operation unit to perform the forward transformation operation of the Winograd convolution and instructs the slave control unit to send a second control instruction; the second control instruction instructs the slave operation unit to perform the multiply-add and inverse transformation operations of the Winograd convolution.
Step S102: in response to the first control instruction, the main operation unit extracts data from the storage unit and performs the forward transformation operation of the Winograd convolution to obtain a forward transformation result, where the forward transformation operation is decomposed into summation operations.
Step S103: in response to the second control instruction, the slave operation unit obtains the forward transformation result from the main operation unit, extracts data from the storage unit, and performs the multiply-add and inverse transformation operations of the Winograd convolution to obtain the Winograd convolution result, where the inverse transformation operation is decomposed into summation operations.
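Steps S101 to S103 split the Winograd convolution between the two units: the main operation unit owns the forward transform, the slave operation unit owns the multiply-add and inverse transform. A minimal sketch of that division of labour, assuming standard F(2x2, 3x3) matrices and illustrative function names (not from this application):

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd matrices (assumed for illustration).
BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
               [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0], [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5], [0, 0, 1]])
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def main_unit_forward(feature_tile):
    # Step S102: forward transformation performed by the main operation unit.
    return BT @ feature_tile @ BT.T

def slave_unit_mul_inverse(v, u):
    # Step S103: element-wise multiply-add and inverse transformation,
    # performed by the slave operation unit.
    return AT @ (u * v) @ AT.T

feature = np.random.rand(4, 4)
weight = np.random.rand(3, 3)
u = G @ weight @ G.T  # forward-transformed weight data held in slave storage
result = slave_unit_mul_inverse(main_unit_forward(feature), u)
assert result.shape == (2, 2)
```

In the device itself, each matrix product above would be realized as the coefficient-weighted sum of precomputed meta-sub-tensor transforms, i.e. as the summation operations described in steps S102 and S103.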
In a possible design, the storage unit includes a main storage unit and a slave storage unit, where:
the main storage unit is used to receive and store the feature data; and
the slave storage unit is used to receive and store the forward transformation data of the weight data.
In a possible design, the main operation unit is specifically used to decompose the feature data into multiple sub-tensors, perform transformation operations on the sub-tensors and sum them, and obtain the forward transformation data of the feature data from the result of the summation.
In a possible design, the main operation unit is specifically used to parse the feature data into multiple sub-tensors, where the feature data is the sum of the sub-tensors, the number of sub-tensors equals the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and that non-zero element is identical to the non-zero element at the corresponding position in the feature data.
In a possible design, the main operation unit is specifically used to obtain the Winograd transformation result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; multiply the Winograd transformation result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the Winograd transformation result of the sub-tensor; and add the Winograd transformation results of the sub-tensors to obtain the forward transformation data of the feature data.
In a possible design, the main operation unit is specifically used, for each sub-tensor, to multiply the corresponding meta sub-tensor by a left-multiplication matrix on the left and a right-multiplication matrix on the right to obtain the Winograd transformation result of the meta sub-tensor, where both matrices are determined by the size of the sub-tensor and the Winograd transformation type, and the Winograd transformation type includes a forward-transformation type and an inverse-transformation type.
In a possible design, the slave operation unit is specifically used to perform element-wise multiplication of the forward transformation data of the feature data and the forward transformation data of the weight data to obtain a multiplication result; decompose the multiplication result into multiple sub-tensors; and perform transformation operations on the sub-tensors and sum them to obtain the Winograd convolution result.
In a possible design, the slave operation unit is specifically configured to parse the multiplication result into multiple sub-tensors, where the multiplication result is the sum of the multiple sub-tensors, the number of sub-tensors equals the number of non-zero elements in the multiplication result, each sub-tensor has a single non-zero element, and the non-zero element of each sub-tensor equals the non-zero element at the corresponding position in the multiplication result.
In a possible design, the slave operation unit is specifically configured to: obtain the winograd transform result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1; multiply the winograd transform result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the winograd transform result of the sub-tensor; and add the winograd transform results of the multiple sub-tensors to obtain the winograd convolution result.

In a possible design, the slave operation unit is specifically configured to, for each sub-tensor, left-multiply the meta sub-tensor corresponding to the sub-tensor by a left-multiplication matrix and right-multiply it by a right-multiplication matrix to obtain the winograd transform result of the meta sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and by the winograd transform type, the winograd transform type including a forward winograd transform type and an inverse winograd transform type.
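Putting the forward transform, element-wise multiplication, and sub-tensor-decomposed inverse transform together, the whole pipeline can be sketched with the standard F(2×2, 3×3) matrices (an illustrative assumption — the disclosure does not prescribe specific matrices) and checked against a direct sliding-window correlation:

```python
import numpy as np

# Assumed standard F(2x2, 3x3) Winograd matrices.
BT = np.array([[1, 0, -1,  0],
               [0, 1,  1,  0],
               [0, -1, 1,  0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_conv_2x2_3x3(d, g):
    U = G @ g @ G.T        # forward-transformed weight data (4x4)
    V = BT @ d @ BT.T      # forward-transformed feature data (4x4)
    M = U * V              # element-wise multiplication
    # Inverse transform disassembled into a sum over sub-tensors of M:
    Y = np.zeros((2, 2))
    for i, j in zip(*np.nonzero(M)):
        meta = np.zeros_like(M)
        meta[i, j] = 1.0
        Y += M[i, j] * (AT @ meta @ AT.T)
    return Y

def direct_corr(d, g):
    # 2x2 "valid" cross-correlation, i.e. the convolution used in CNNs.
    return np.array([[np.sum(d[r:r+3, c:c+3] * g) for c in range(2)]
                     for r in range(2)])

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
assert np.allclose(winograd_conv_2x2_3x3(d, g), direct_corr(d, g))
```

The element-wise product replaces the 36 multiplications of a direct 2×2-output 3×3 convolution with 16, while the forward and inverse transforms cost only additions once decomposed as above.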
In a possible design, the storage unit includes a master storage unit and a slave storage unit;

the master storage unit is configured to receive and store the feature data;

the slave storage unit is configured to receive and store the weight data.
In a possible design, the slave operation unit is further configured to disassemble the weight data into multiple sub-tensors, perform transform operations on the multiple sub-tensors and sum them, and obtain the forward-transformed weight data from the result of the summation.
In a possible design, the storage unit includes a master storage unit and a slave storage unit;

the master storage unit is configured to receive and store the feature data and the weight data;

the slave storage unit is configured to receive and store the forward-transformed weight data.
In a possible design, the master operation unit is further configured to disassemble the weight data into multiple sub-tensors, perform transform operations on the multiple sub-tensors and sum them, obtain the forward-transformed weight data from the result of the summation, and send the forward-transformed weight data to the slave storage unit.
In a possible design, the master operation unit includes a master processing module and a cache;

the master processing module is configured to, in response to the first control instruction, fetch data from the storage unit and perform the forward transform of the winograd convolution to obtain a forward transform result;

the cache is configured to store the forward transform results.
In a possible design, the cache is further configured to send the stored forward transform results to the slave operation unit once they have accumulated to a preset number.
In a possible design, the slave operation unit is further configured to store the winograd convolution result in a preset address space of the storage unit.
In a possible design, the master operation unit is communicatively connected to multiple slave operation units, and different slave operation units are responsible for operating on different forward transform results.
In a possible design, the master operation unit and the slave operation unit operate in parallel: before the master operation unit has finished computing the forward-transformed feature data, the slave operation unit performs element-wise multiplication between the element positions of the forward-transformed feature data already computed and the element positions of the corresponding forward-transformed weight data, until the element-wise product at every element position has been computed, yielding the multiplication result.
In a possible design, the feature data stored in the master storage unit is divided into multiple pieces of first data used for the winograd convolution, the size of the first data being determined by the size of the convolution kernel;

the forward-transformed weight data stored in the slave storage unit is divided into multiple pieces of second data used for the winograd convolution, the size of the second data being determined by the size of the first data;

the master processing module of the master operation unit, in response to the first control instruction sent by the master control unit, fetches the first data from the master storage unit in sequence, performs the forward transform on the first data to obtain the forward transform result of the first data, and stores the forward transform result of the first data in the cache;

when the forward transform results of the first data in the cache reach a preset number, the master processing module of the master operation unit sends the forward transform results of the first data in the cache to the slave operation unit in sequence;

the slave operation unit, in response to the second control instruction sent by the slave control unit, fetches the second data from the slave storage unit, performs element-wise multiplication between the forward transform result of the first data and the second data to obtain an element-wise product, and performs the inverse transform on the element-wise product to obtain an inverse transform result;

the slave operation unit obtains the winograd convolution result from the inverse transform result and sends the winograd convolution result to the master storage unit for storage.
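The buffering and hand-off in the flow above can be sketched schematically. The flush threshold, tile indexing, and round-robin dispatch below are illustrative assumptions, and the transform math is reduced to string tags so that only the cache/dispatch behavior is modelled:

```python
from collections import deque

PRESET = 2    # assumed cache flush threshold ("preset number")
N_SLAVES = 2  # multiple slave units handle different forward-transform results

def run(first_data):
    cache, out = deque(), {}

    def dispatch():
        while cache:
            idx, v = cache.popleft()
            slave = idx % N_SLAVES                           # round-robin
            out.setdefault(slave, []).append(f"inv({v}*W)")  # multiply + inverse

    for idx, tile in enumerate(first_data):
        cache.append((idx, f"fwd({tile})"))  # master: forward transform, cache
        if len(cache) == PRESET:             # preset number reached: send
            dispatch()
    dispatch()                               # drain any remainder
    return out

res = run(["d0", "d1", "d2", "d3", "d4"])
assert res[0] == ["inv(fwd(d0)*W)", "inv(fwd(d2)*W)", "inv(fwd(d4)*W)"]
assert res[1] == ["inv(fwd(d1)*W)", "inv(fwd(d3)*W)"]
```

The sketch shows why the slave units can start their multiply/inverse work before the master unit has transformed every tile: each flushed batch is independent of the tiles still to come.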
The operation method according to embodiments of the present disclosure can be applied to any processor of a processing system (for example, an artificial intelligence chip) that includes multiple processors (multiple cores). The processor may be a general-purpose processor such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-like operations, and so on, where machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like. The artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processing unit), and a Field-Programmable Gate Array (FPGA) chip. The present disclosure does not limit the specific type of the processor. In addition, the multiple processors in the processing system may be of the same or different types, which is not limited in the present disclosure.
In a possible implementation, the processor mentioned in the present disclosure may include multiple processing units, and each processing unit may independently run the various tasks assigned to it, such as convolution tasks, pooling tasks, or fully-connected tasks. The present disclosure does not limit the processing units or the tasks they run.

Fig. 5 is a schematic diagram of a processing system for the operation method according to an embodiment of the present disclosure. As shown in Fig. 5, the processing system 10 includes multiple processors 11 and a memory 12; the multiple processors 11 are used to execute instruction sequences, and the memory 12 is used to store data and may include random access memory (RAM) and a register file. The multiple processors 11 in the processing system 10 may share part of the storage space, for example part of the RAM space and the register file, and may also each have their own storage space.
It should be noted that, for brevity, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art will appreciate that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in another order or simultaneously. Furthermore, those skilled in the art will also appreciate that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.

It should be further noted that although the steps in the flowcharts are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on their order of execution, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different moments; their order of execution is not necessarily sequential, and they may be executed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
It should be understood that the foregoing device embodiments are merely illustrative, and the device of the present disclosure may also be implemented in other ways. For example, the division of units/modules in the foregoing embodiments is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not executed.

In addition, unless otherwise specified, the functional units/modules in the embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist physically on its own, or two or more units/modules may be integrated together. The integrated unit/module may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit/module is implemented in the form of hardware, the hardware may be a digital circuit, an analog circuit, and so on. Physical implementations of the hardware structure include but are not limited to transistors, memristors, and so on. Unless otherwise specified, the artificial intelligence processor may be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and so on. Unless otherwise specified, the storage unit may be any appropriate magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.

If the integrated unit/module is implemented as a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
In a possible implementation, an artificial intelligence chip is also disclosed, which includes the above operation device.

In a possible implementation, a board card is also disclosed, which includes a storage device, an interface device, a control device, and the above artificial intelligence chip, where the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
Fig. 6 is a structural block diagram of a board card according to an embodiment of the present disclosure. Referring to Fig. 6, in addition to the chip 389, the board card may include other supporting components, including but not limited to: a storage device 390, an interface device 391, and a control device 392;

The storage device 390 is connected to the artificial intelligence chip through a bus and is used to store data. The storage device may include multiple groups of storage units 393. Each group of storage units is connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without raising the clock frequency: it allows data to be read on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, and each group may include multiple DDR4 memory chips. In one embodiment, the artificial intelligence chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical data-transfer bandwidth can reach 25600 MB/s.
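The 25600 MB/s figure follows directly from the stated configuration, assuming DDR4-3200 (3200 MT/s) on a 64-bit data path with the 8 ECC bits of the 72-bit controller excluded:

```python
# Back-of-envelope check of the 25600 MB/s theoretical bandwidth quoted above.
transfers_per_second = 3200 * 10**6   # DDR4-3200: 3200 mega-transfers/s
bytes_per_transfer = 64 // 8          # 64-bit data bus (ECC bits excluded)
bandwidth_mb_s = transfers_per_second * bytes_per_transfer // 10**6
assert bandwidth_mb_s == 25600
```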
In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface device is electrically connected to the artificial intelligence chip and is used to implement data transmission between the artificial intelligence chip and external equipment (for example, a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface: the data to be processed is transferred from the server to the chip through the standard PCIe interface. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the calculation results of the artificial intelligence chip are transmitted back to the external equipment (for example, a server) by the interface device.
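The roughly 16000 MB/s figure is consistent with nominal PCIe 3.0 x16 parameters (assumed here, since the disclosure only quotes the total): 8 GT/s per lane, 128b/130b line coding, 16 lanes:

```python
# Rough consistency check of the ~16000 MB/s theoretical bandwidth figure.
raw_bits_per_lane = 8e9                          # PCIe 3.0: 8 GT/s per lane
payload_bits_per_lane = raw_bits_per_lane * 128 / 130  # 128b/130b encoding
pcie_mb_s = payload_bits_per_lane / 8 * 16 / 1e6       # 16 lanes, in MB/s
assert 15000 < pcie_mb_s < 16000   # ~15754 MB/s, quoted as ~16000 MB/s
```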
The control device is electrically connected to the artificial intelligence chip and is used to monitor the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads; it can therefore be in different working states such as heavy load and light load. Through the control device, the working states of the multiple processing chips, processing cores, and/or processing circuits in the artificial intelligence chip can be regulated.
In a possible implementation, an electronic device is disclosed, which includes the above artificial intelligence chip. The electronic device includes a data processing apparatus, robot, computer, printer, scanner, tablet, smart terminal, mobile phone, dashboard camera, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, headset, mobile storage, wearable device, vehicle, household appliance, and/or medical device.

The vehicle includes an airplane, ship, and/or car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, and range hood; the medical device includes a nuclear magnetic resonance instrument, B-mode ultrasound scanner, and/or electrocardiograph.
In the above embodiments, the description of each embodiment has its own emphasis; for a part not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features have been described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.

The foregoing may be better understood in light of the following clauses:
Clause 1. An operation device for performing a winograd convolution, including: a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit;

the master control unit is configured to send a first control instruction, the first control instruction instructing the master operation unit to perform the forward transform of the winograd convolution and instructing the slave control unit to send a second control instruction, the second control instruction instructing the slave operation unit to perform the multiply-add and inverse-transform operations of the winograd convolution;

the storage unit is configured to store data used for the winograd convolution;

the master operation unit is configured to, in response to the first control instruction, fetch data from the storage unit and perform the forward transform of the winograd convolution to obtain a forward transform result, where the forward transform is disassembled into summation operations;

the slave operation unit is configured to, in response to the second control instruction, obtain the forward transform result from the master operation unit, fetch data from the storage unit, and perform the multiply-add and inverse-transform operations of the winograd convolution to obtain the winograd convolution result, where the inverse transform is disassembled into summation operations.
Clause 2. The operation device according to Clause 1, where the storage unit includes a master storage unit and a slave storage unit; the master storage unit is configured to receive and store the feature data; and the slave storage unit is configured to receive and store the forward-transformed weight data.
Clause 3. The operation device according to Clause 2, where the master operation unit is specifically configured to disassemble the feature data into multiple sub-tensors, perform transform operations on the multiple sub-tensors and sum them, and obtain the forward-transformed feature data from the result of the summation.
Clause 4. The operation device according to Clause 3, where the master operation unit is specifically configured to parse the feature data into multiple sub-tensors, the feature data being the sum of the multiple sub-tensors, the number of sub-tensors equaling the number of non-zero elements in the feature data, each sub-tensor having a single non-zero element, and the non-zero element of each sub-tensor equaling the non-zero element at the corresponding position in the feature data.
Clause 5. The operation device according to Clause 4, where the master operation unit is specifically configured to: obtain the winograd transform result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1; multiply the winograd transform result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the winograd transform result of the sub-tensor; and add the winograd transform results of the multiple sub-tensors to obtain the forward-transformed feature data.
Clause 6. The operation device according to Clause 5, where the master operation unit is specifically configured to, for each sub-tensor, left-multiply the meta sub-tensor corresponding to the sub-tensor by a left-multiplication matrix and right-multiply it by a right-multiplication matrix to obtain the winograd transform result of the meta sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and by the winograd transform type, the winograd transform type including a forward winograd transform type and an inverse winograd transform type.
Clause 7. The operation device according to Clause 2, where the slave operation unit is specifically configured to: perform element-wise multiplication on the forward-transformed feature data and the forward-transformed weight data to obtain a multiplication result; disassemble the multiplication result into multiple sub-tensors; and perform transform operations on the multiple sub-tensors and sum them to obtain the winograd convolution result.
Clause 8. The operation device according to Clause 7, where the slave operation unit is specifically configured to parse the multiplication result into multiple sub-tensors, the multiplication result being the sum of the multiple sub-tensors, the number of sub-tensors equaling the number of non-zero elements in the multiplication result, each sub-tensor having a single non-zero element, and the non-zero element of each sub-tensor equaling the non-zero element at the corresponding position in the multiplication result.
Clause 9. The operation device according to Clause 8, where the slave operation unit is specifically configured to: obtain the winograd transform result of the meta sub-tensor corresponding to each sub-tensor, where a meta sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1; multiply the winograd transform result of the corresponding meta sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the winograd transform result of the sub-tensor; and add the winograd transform results of the multiple sub-tensors to obtain the winograd convolution result.
Clause 10. The operation device according to Clause 9, where the slave operation unit is specifically configured to, for each sub-tensor, left-multiply the meta sub-tensor corresponding to the sub-tensor by a left-multiplication matrix and right-multiply it by a right-multiplication matrix to obtain the winograd transform result of the meta sub-tensor, where both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and by the winograd transform type, the winograd transform type including a forward winograd transform type and an inverse winograd transform type.
Clause 11. The computing device according to Clause 1, wherein the storage unit includes a master storage unit and a slave storage unit;
the master storage unit is configured to receive and store feature data;
the slave storage unit is configured to receive and store weight data.
Clause 12. The computing device according to Clause 11, wherein
the slave operation unit is further configured to decompose the weight data into multiple sub-tensors, perform transform operations on the multiple sub-tensors and sum the results, and obtain the forward-transformed weight data from the result of the summation.
Clause 13. The computing device according to Clause 1, wherein the storage unit includes a master storage unit and a slave storage unit;
the master storage unit is configured to receive and store feature data and weight data;
the slave storage unit is configured to receive and store the forward-transformed weight data.
Clause 14. The computing device according to Clause 13, wherein
the master operation unit is further configured to decompose the weight data into multiple sub-tensors, perform transform operations on the multiple sub-tensors and sum the results, and obtain the forward-transformed weight data from the result of the summation; and
to send the forward-transformed weight data to the slave storage unit.
Clause 15. The computing device according to any one of Clauses 1-14, wherein the master operation unit includes a main processing module and a cache;
the main processing module is configured to, in response to the first control instruction, extract data from the storage unit and perform the forward transform of the Winograd convolution to obtain a forward transform result;
the cache is configured to store the forward transform result.
Clause 16. The computing device according to Clause 15, wherein the cache is further configured to send the stored forward transform results to the slave operation unit once they have accumulated to a preset number.
Clause 17. The computing device according to any one of Clauses 1-14, wherein the slave operation unit is further configured to store the Winograd convolution result in a preset address space of the storage unit.
Clause 18. The device according to any one of Clauses 1-14, wherein the master operation unit is communicatively connected to multiple slave operation units, and different slave operation units are responsible for operating on different forward transform results.
Clause 19. The device according to Clause 7, wherein the master operation unit and the slave operation unit operate in parallel: before the master operation unit finishes computing the forward-transformed feature data, the slave operation unit performs element-wise multiplication between the already-computed element positions of the forward-transformed feature data and the corresponding element positions of the forward-transformed weight data, until the element-wise product at every element position has been computed, yielding the multiplication result.
Clause 20. The device according to Clause 19, wherein
the feature data stored in the master storage unit is divided into multiple first data used for the Winograd convolution, the size of the first data being determined by the size of the convolution kernel;
the forward-transformed weight data stored in the slave storage unit is divided into multiple second data used for the Winograd convolution, the size of the second data being determined by the size of the first data;
the main processing module of the master operation unit, in response to the first control instruction sent by the master control unit, fetches the first data from the master storage unit in sequence, performs the forward transform on the first data to obtain the forward transform result of the first data, and stores that result in the cache;
when the forward transform results of the first data in the cache reach a preset number, the main processing module of the master operation unit sends them to the slave operation unit in sequence;
the slave operation unit, in response to the second control instruction sent by the slave control unit, fetches the second data from the slave storage unit, performs element-wise multiplication between the forward transform result of the first data and the second data to obtain an element-wise product, and performs the inverse transform on the element-wise product to obtain an inverse transform result;
the slave operation unit obtains the Winograd convolution result from the inverse transform result and sends it to the master storage unit for storage.
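Clauses 19-20 describe a tiled pipeline: the master unit forward-transforms tiles of the feature data while the slave unit multiplies them element-wise with pre-transformed weights and applies the inverse transform. A minimal single-threaded sketch of that data flow, assuming the standard 1D Winograd F(2, 3) matrices with tile size 4 and stride 2 (the hardware's actual tiling, buffering, and parallelism are not modeled):

```python
import numpy as np

# Standard 1D Winograd F(2, 3) transform matrices (an assumption; the
# application determines them from the sub-tensor size and transform type).
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_conv1d(feature, kernel):
    """Clause-20-style flow: tile the feature ("first data"), forward-transform
    each tile (master unit's role), element-wise multiply with the
    pre-transformed kernel ("second data" in the slave storage unit),
    inverse-transform (slave unit's role), and concatenate the tile outputs."""
    U = G @ kernel                               # forward-transformed weight data
    outputs = []
    for start in range(0, len(feature) - 3, 2):  # 4-wide tiles, stride 2
        tile = feature[start:start + 4]          # first data: size set by the kernel size
        V = B_T @ tile                           # forward transform of the feature tile
        M = U * V                                # element-wise multiplication
        outputs.append(A_T @ M)                  # inverse transform -> 2 outputs per tile
    return np.concatenate(outputs)

feature = np.arange(8, dtype=float)
kernel = np.array([1.0, 2.0, 3.0])
expected = np.convolve(feature, kernel[::-1], mode="valid")  # direct correlation
assert np.allclose(winograd_conv1d(feature, kernel), expected)
```

Each 4-element tile yields 2 outputs, so the per-tile element-wise products can be produced and consumed independently, which is what lets the slave unit start multiplying before the master unit has transformed the whole feature map.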
Clause 21. An artificial intelligence chip, comprising the computing device according to any one of Clauses 1-20.
Clause 22. An electronic device, comprising the artificial intelligence chip according to Clause 21.
Clause 23. A board card, comprising: a storage device, an interface device, a control device, and the artificial intelligence chip according to Clause 21;
wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively;
the storage device is configured to store data;
the interface device is configured to implement data transfer between the artificial intelligence chip and external equipment;
the control device is configured to monitor the state of the artificial intelligence chip.
Clause 24. The board card according to Clause 23, wherein the storage device includes multiple groups of storage units, each group of storage units being connected to the artificial intelligence chip via a bus, and the storage units being DDR SDRAM;
the chip includes a DDR controller configured to control data transfer to, and data storage in, each storage unit;
the interface device is a standard PCIe interface.
Clause 25. An operation method applied to a computing device, the computing device comprising: a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit; the method comprising:
the master control unit sends a first control instruction, which instructs the master operation unit to perform the forward transform of the Winograd convolution and instructs the slave control unit to send a second control instruction, which in turn instructs the slave operation unit to perform the multiply-accumulate and inverse-transform operations of the Winograd convolution;
the storage unit stores data used for the Winograd convolution;
the master operation unit, in response to the first control instruction, extracts data from the storage unit and performs the forward transform of the Winograd convolution to obtain a forward transform result, wherein the forward transform is decomposed into summation operations;
the slave operation unit, in response to the second control instruction, obtains the forward transform result from the master operation unit, extracts data from the storage unit, and performs the multiply-accumulate and inverse-transform operations of the Winograd convolution to obtain a Winograd convolution result, wherein the inverse transform is decomposed into summation operations.
The above scheme effectively improves the processing efficiency of the chip. However, because matrix multiplications are still performed in the Winograd forward and inverse transforms, considerable overhead remains in the hardware implementation. To further improve processing efficiency, the embodiments of the present application therefore also propose a Winograd convolution operation method, which can be applied in the hardware implementation of convolutional neural networks.
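For reference, the forward transforms, element-wise multiplication, and inverse transform that the master and slave units cooperate on can be written end to end for the 2D case. A sketch using the standard F(2x2, 3x3) Winograd matrices (an assumption; the application does not fix particular transform matrices), checked against direct convolution:

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd matrices (assumed for illustration).
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """A 4x4 input tile and a 3x3 kernel yield a 2x2 output: each operand is
    left-multiplied by its transform matrix and right-multiplied by that
    matrix's transpose, the results are multiplied element-wise, and the
    product is inverse-transformed the same way."""
    U = G @ g @ G.T          # forward-transformed weight data
    V = B_T @ d @ B_T.T      # forward-transformed feature tile
    M = U * V                # element-wise multiplication
    return A_T @ M @ A_T.T   # inverse transform

def direct_conv2d_valid(d, g):
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out

rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
assert np.allclose(winograd_f2x2_3x3(d, g), direct_conv2d_valid(d, g))
```

Every entry of `B_T` and `A_T` is 0 or ±1, so the left- and right-multiplications they describe reduce to additions and subtractions, which is exactly the decomposition into summation operations that the clauses above rely on.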
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure; the descriptions of the above embodiments are intended only to aid understanding of the methods of the present disclosure and their core ideas. Likewise, any changes or modifications made by those skilled in the art based on the ideas, specific embodiments, and scope of application of the present disclosure fall within its protection scope. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (44)

  1. A computing device for performing Winograd convolution, characterized by comprising: a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit;
    the master control unit is configured to send a first control instruction, the first control instruction instructing the master operation unit to perform the forward transform of the Winograd convolution and instructing the slave control unit to send a second control instruction, the second control instruction instructing the slave operation unit to perform the multiply-accumulate and inverse-transform operations of the Winograd convolution;
    the storage unit is configured to store data used for the Winograd convolution;
    the master operation unit is configured to, in response to the first control instruction, extract data from the storage unit and perform the forward transform of the Winograd convolution to obtain a forward transform result, wherein the forward transform is decomposed into summation operations;
    the slave operation unit is configured to, in response to the second control instruction, obtain the forward transform result from the master operation unit, extract data from the storage unit, and perform the multiply-accumulate and inverse-transform operations of the Winograd convolution to obtain a Winograd convolution result, wherein the inverse transform is decomposed into summation operations.
  2. The computing device according to claim 1, wherein the storage unit comprises a master storage unit and a slave storage unit;
    the master storage unit is configured to receive and store feature data;
    the slave storage unit is configured to receive and store the forward-transformed weight data.
  3. The computing device according to claim 2, wherein
    the master operation unit is specifically configured to decompose the feature data into multiple sub-tensors, perform transform operations on the multiple sub-tensors and sum the results, and obtain the forward-transformed feature data from the result of the summation.
  4. The computing device according to claim 3, wherein
    the master operation unit is specifically configured to parse the feature data into multiple sub-tensors, wherein the feature data is the sum of the multiple sub-tensors, the number of sub-tensors equals the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and the non-zero element in the sub-tensor equals the non-zero element at the corresponding position in the feature data.
  5. The computing device according to claim 4, wherein
    the master operation unit is specifically configured to obtain the Winograd transform result of the meta sub-tensor corresponding to each sub-tensor, wherein a meta sub-tensor is the tensor obtained by setting the sub-tensor's non-zero element to 1; multiply the sub-tensor's non-zero element value, as a coefficient, by the Winograd transform result of the corresponding meta sub-tensor to obtain the Winograd transform result of the sub-tensor; and add the Winograd transform results of the multiple sub-tensors to obtain the forward-transformed feature data.
  6. The computing device according to claim 5, wherein the master operation unit is specifically configured to, for each sub-tensor, left-multiply the meta sub-tensor corresponding to the sub-tensor by a left-multiplication matrix and right-multiply it by a right-multiplication matrix to obtain the Winograd transform result of the meta sub-tensor, wherein the left-multiplication matrix and the right-multiplication matrix are both determined by the size of the sub-tensor and by the Winograd transform type, the Winograd transform type being either a forward transform or an inverse transform.
  7. The computing device according to claim 2, wherein
    the slave operation unit is specifically configured to perform element-wise multiplication between the forward-transformed feature data and the forward-transformed weight data to obtain a multiplication result;
    decompose the multiplication result into multiple sub-tensors; and perform transform operations on the multiple sub-tensors and sum the results to obtain the Winograd convolution result.
  8. The computing device according to claim 7, wherein
    the slave operation unit is specifically configured to parse the multiplication result into multiple sub-tensors, wherein the multiplication result is the sum of the multiple sub-tensors, the number of sub-tensors equals the number of non-zero elements in the multiplication result, each sub-tensor contains a single non-zero element, and the non-zero element in the sub-tensor equals the non-zero element at the corresponding position in the multiplication result.
  9. The computing device according to claim 8, wherein
    the slave operation unit is specifically configured to obtain the Winograd transform result of the meta sub-tensor corresponding to each sub-tensor, wherein a meta sub-tensor is the tensor obtained by setting the sub-tensor's non-zero element to 1; multiply the sub-tensor's non-zero element value, as a coefficient, by the Winograd transform result of the corresponding meta sub-tensor to obtain the Winograd transform result of the sub-tensor; and add the Winograd transform results of the multiple sub-tensors to obtain the Winograd convolution result.
  10. The computing device according to claim 9, wherein the slave operation unit is specifically configured to, for each sub-tensor, left-multiply the meta sub-tensor corresponding to the sub-tensor by a left-multiplication matrix and right-multiply it by a right-multiplication matrix to obtain the Winograd transform result of the meta sub-tensor, wherein the left-multiplication matrix and the right-multiplication matrix are both determined by the size of the sub-tensor and by the Winograd transform type, the Winograd transform type being either a forward transform or an inverse transform.
  11. The computing device according to claim 1, wherein the storage unit comprises a master storage unit and a slave storage unit;
    the master storage unit is configured to receive and store feature data;
    the slave storage unit is configured to receive and store weight data.
  12. The computing device according to claim 11, wherein
    the slave operation unit is further configured to decompose the weight data into multiple sub-tensors, perform transform operations on the multiple sub-tensors and sum the results, and obtain the forward-transformed weight data from the result of the summation.
  13. The computing device according to claim 1, wherein the storage unit comprises a master storage unit and a slave storage unit;
    the master storage unit is configured to receive and store feature data and weight data;
    the slave storage unit is configured to receive and store the forward-transformed weight data.
  14. The computing device according to claim 13, wherein
    the master operation unit is further configured to decompose the weight data into multiple sub-tensors, perform transform operations on the multiple sub-tensors and sum the results, and obtain the forward-transformed weight data from the result of the summation; and
    to send the forward-transformed weight data to the slave storage unit.
  15. The computing device according to any one of claims 1-14, wherein the master operation unit comprises a main processing module and a cache;
    the main processing module is configured to, in response to the first control instruction, extract data from the storage unit and perform the forward transform of the Winograd convolution to obtain a forward transform result;
    the cache is configured to store the forward transform result.
  16. The computing device according to claim 15, wherein the cache is further configured to send the stored forward transform results to the slave operation unit once they have accumulated to a preset number.
  17. The computing device according to any one of claims 1-14, wherein the slave operation unit is further configured to store the Winograd convolution result in a preset address space of the storage unit.
  18. The device according to any one of claims 1-14, wherein the master operation unit is communicatively connected to multiple slave operation units, and different slave operation units are responsible for operating on different forward transform results.
  19. The device according to claim 7, wherein the master operation unit and the slave operation unit operate in parallel: before the master operation unit finishes computing the forward-transformed feature data, the slave operation unit performs element-wise multiplication between the already-computed element positions of the forward-transformed feature data and the corresponding element positions of the forward-transformed weight data, until the element-wise product at every element position has been computed, yielding the multiplication result.
  20. The device according to claim 19, wherein
    the feature data stored in the master storage unit is divided into multiple first data used for the Winograd convolution, the size of the first data being determined by the size of the convolution kernel;
    the forward-transformed weight data stored in the slave storage unit is divided into multiple second data used for the Winograd convolution, the size of the second data being determined by the size of the first data;
    the main processing module of the master operation unit, in response to the first control instruction sent by the master control unit, fetches the first data from the master storage unit in sequence, performs the forward transform on the first data to obtain the forward transform result of the first data, and stores that result in a cache;
    when the forward transform results of the first data in the cache reach a preset number, the main processing module of the master operation unit sends them to the slave operation unit in sequence;
    the slave operation unit, in response to the second control instruction sent by the slave control unit, fetches the second data from the slave storage unit, performs element-wise multiplication between the forward transform result of the first data and the second data to obtain an element-wise product, and performs the inverse transform on the element-wise product to obtain an inverse transform result;
    the slave operation unit obtains the Winograd convolution result from the inverse transform result and sends it to the master storage unit for storage.
  21. An artificial intelligence chip, characterized in that the chip comprises the computing device according to any one of claims 1-20.
  22. An electronic device, characterized in that the electronic device comprises the artificial intelligence chip according to claim 21.
  23. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the artificial intelligence chip according to claim 21;
    wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device, respectively;
    the storage device is configured to store data;
    the interface device is configured to implement data transfer between the artificial intelligence chip and external equipment;
    the control device is configured to monitor the state of the artificial intelligence chip.
  24. The board card according to claim 23, wherein
    the storage device comprises multiple groups of storage units, each group of storage units being connected to the artificial intelligence chip via a bus, and the storage units being DDR SDRAM;
    the chip comprises a DDR controller configured to control data transfer to, and data storage in, each storage unit;
    the interface device is a standard PCIe interface.
  25. An operation method applied to a computing device, characterized in that the computing device comprises: a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit; the method comprising:
    the master control unit sends a first control instruction, the first control instruction instructing the master operation unit to perform the forward transform of the Winograd convolution and instructing the slave control unit to send a second control instruction, the second control instruction instructing the slave operation unit to perform the multiply-accumulate and inverse-transform operations of the Winograd convolution;
    the storage unit stores data used for the Winograd convolution;
    the master operation unit, in response to the first control instruction, extracts data from the storage unit and performs the forward transform of the Winograd convolution to obtain a forward transform result, wherein the forward transform is decomposed into summation operations;
    the slave operation unit, in response to the second control instruction, obtains the forward transform result from the master operation unit, extracts data from the storage unit, and performs the multiply-accumulate and inverse-transform operations of the Winograd convolution to obtain a Winograd convolution result, wherein the inverse transform is decomposed into summation operations.
  26. The operation method according to claim 25, wherein the storage unit comprises a master storage unit and a slave storage unit;
    the storage unit storing data used for the Winograd convolution comprises:
    the master storage unit receiving and storing feature data;
    the slave storage unit receiving and storing the forward-transformed weight data.
  27. The operation method according to claim 26, wherein the master operation unit, in response to the first control instruction, extracting data from the storage unit and performing the forward transform of the Winograd convolution to obtain a forward transform result comprises:
    the master operation unit decomposing the feature data into multiple sub-tensors;
    the master operation unit performing transform operations on the multiple sub-tensors and summing the results, and obtaining the forward-transformed feature data from the result of the summation.
  28. The operation method according to claim 27, wherein the main operation unit decomposing the feature data into a plurality of sub-tensors comprises:
    the main operation unit parsing the feature data to obtain the plurality of sub-tensors;
    wherein the feature data is the sum of the plurality of sub-tensors, the number of sub-tensors equals the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and the non-zero element in each sub-tensor is identical to the non-zero element at the corresponding position in the feature data.
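The decomposition recited in claim 28 can be illustrated with a minimal Python sketch (not part of the claims; the function name and the example tile are illustrative assumptions): a tile is split into one sub-tensor per non-zero element, each holding that element at its original position, so the sub-tensors sum back to the original tile.

```python
def split_into_subtensors(tile):
    """Return one sub-tensor per non-zero element of `tile`."""
    rows, cols = len(tile), len(tile[0])
    subtensors = []
    for i in range(rows):
        for j in range(cols):
            if tile[i][j] != 0:
                # Each sub-tensor has a single non-zero element,
                # identical to the element at the same position in `tile`.
                sub = [[0] * cols for _ in range(rows)]
                sub[i][j] = tile[i][j]
                subtensors.append(sub)
    return subtensors

tile = [[1, 0], [2, 3]]
subs = split_into_subtensors(tile)

# The number of sub-tensors equals the number of non-zero elements,
# and element-wise summation recovers the original tile.
total = [[sum(s[i][j] for s in subs) for j in range(2)] for i in range(2)]
assert len(subs) == 3
assert total == tile
```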
  29. The operation method according to claim 28, wherein the main operation unit performing transformation operations on the plurality of sub-tensors and summing the results, and obtaining the forward-transformed feature data according to the result of the summation comprises:
    the main operation unit obtaining the Winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, wherein a meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1;
    the main operation unit multiplying the Winograd transformation result of the corresponding meta-sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the Winograd transformation result of the sub-tensor; and
    the main operation unit adding the Winograd transformation results of the plurality of sub-tensors to obtain the forward-transformed feature data.
  30. The operation method according to claim 29, wherein the main operation unit obtaining the Winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor comprises:
    for each sub-tensor, the main operation unit left-multiplying the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix and right-multiplying it by a right-multiplication matrix to obtain the Winograd transformation result of the meta-sub-tensor;
    wherein both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and by the Winograd transformation type, the Winograd transformation type including a forward-transformation type and an inverse-transformation type.
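Claims 29 and 30 can be illustrated together by a hedged sketch. Assuming the common F(2x2, 3x3) Winograd variant (not specified in the claims), the forward-transform left matrix is B^T and the right matrix is its transpose B; by linearity of matrix multiplication, summing the scaled transforms of the meta-sub-tensors (each holding a single 1) equals transforming the tile directly. The helper names and the example tile are illustrative.

```python
def matmul(a, b):
    # Plain-list matrix multiply, so the sketch has no dependencies.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Forward-transform matrix B^T for 4x4 input tiles in F(2x2, 3x3).
BT = [[1, 0, -1, 0],
      [0, 1,  1, 0],
      [0, -1, 1, 0],
      [0, 1,  0, -1]]
B = transpose(BT)

d = [[1, 2, 0, 0],      # example sparse 4x4 input tile
     [0, 3, 0, 0],
     [0, 0, 4, 0],
     [0, 0, 0, 5]]

# Direct transform: left-multiply by B^T, right-multiply by B.
direct = matmul(matmul(BT, d), B)

# Per claims 29-30: transform each meta-sub-tensor (a single 1 at one
# position), scale by the element value as a coefficient, and sum.
acc = [[0] * 4 for _ in range(4)]
for i in range(4):
    for j in range(4):
        if d[i][j] != 0:
            E = [[0] * 4 for _ in range(4)]
            E[i][j] = 1                    # meta-sub-tensor
            t = matmul(matmul(BT, E), B)   # its Winograd transform
            for r in range(4):
                for c in range(4):
                    acc[r][c] += d[i][j] * t[r][c]

assert acc == direct   # summed sub-tensor transforms match the direct transform
```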
  31. The operation method according to claim 26, wherein the slave operation unit responding to the second control instruction, obtaining the forward transformation result from the main operation unit, extracting data from the storage unit, and performing the multiply-add operation and the inverse transformation of the Winograd convolution operation to obtain the Winograd convolution result comprises:
    the slave operation unit performing element-wise multiplication of the forward-transformed feature data and the forward-transformed weight data to obtain a multiplication result; and
    the slave operation unit decomposing the multiplication result into a plurality of sub-tensors, performing transformation operations on the plurality of sub-tensors, and summing the results to obtain the Winograd convolution result.
  32. The operation method according to claim 31, wherein the slave operation unit decomposing the multiplication result into a plurality of sub-tensors comprises:
    the slave operation unit parsing the multiplication result to obtain the plurality of sub-tensors;
    wherein the multiplication result is the sum of the plurality of sub-tensors, the number of sub-tensors equals the number of non-zero elements in the multiplication result, each sub-tensor contains a single non-zero element, and the non-zero element in each sub-tensor is identical to the non-zero element at the corresponding position in the multiplication result.
  33. The operation method according to claim 32, wherein the slave operation unit decomposing the multiplication result into a plurality of sub-tensors, performing transformation operations on the plurality of sub-tensors, and summing the results to obtain the Winograd convolution result comprises:
    the slave operation unit obtaining the Winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor;
    wherein a meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1;
    the slave operation unit multiplying the Winograd transformation result of the corresponding meta-sub-tensor by the non-zero element value of the sub-tensor as a coefficient to obtain the Winograd transformation result of the sub-tensor; and
    the slave operation unit adding the Winograd transformation results of the plurality of sub-tensors to obtain the Winograd convolution result.
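The slave-unit steps in claims 32-33 apply the same decomposition to the inverse transform. A hedged sketch, again assuming F(2x2, 3x3) (so the inverse-transform left matrix is the 2x4 matrix A^T and the right matrix is A): scaling each meta-sub-tensor's precomputable transform by the element value and accumulating turns the inverse transform into the summation operations recited in claim 25. Names and the example data are illustrative.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Inverse-transform matrix A^T for F(2x2, 3x3): maps a 4x4
# element-wise product back to a 2x2 output tile.
AT = [[1, 1,  1,  0],
      [0, 1, -1, -1]]
A = transpose(AT)

M = [[4, 0, 0, 0],      # example element-wise multiplication result
     [0, -3, 1, 0],
     [0, 0, 2, 0],
     [1, 0, 0, 5]]

direct = matmul(matmul(AT, M), A)   # direct inverse transform, 2x2

# Decomposition per claims 32-33: one meta-sub-tensor per non-zero
# element of the multiplication result, scaled and summed.
acc = [[0] * 2 for _ in range(2)]
for i in range(4):
    for j in range(4):
        if M[i][j] != 0:
            E = [[0] * 4 for _ in range(4)]
            E[i][j] = 1
            t = matmul(matmul(AT, E), A)   # 2x2 per meta-sub-tensor
            for r in range(2):
                for c in range(2):
                    acc[r][c] += M[i][j] * t[r][c]

assert acc == direct
```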
  34. The operation method according to claim 33, wherein the slave operation unit obtaining the Winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor comprises:
    for each sub-tensor, the slave operation unit left-multiplying the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix and right-multiplying it by a right-multiplication matrix to obtain the Winograd transformation result of the meta-sub-tensor;
    wherein both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and by the Winograd transformation type, the Winograd transformation type including a forward-transformation type and an inverse-transformation type.
  35. The operation method according to claim 25, wherein the storage unit comprises a main storage unit and a slave storage unit;
    the storage unit storing data for the Winograd convolution operation comprises:
    the main storage unit receiving and storing feature data; and
    the slave storage unit receiving and storing weight data.
  36. The operation method according to claim 35, further comprising:
    the slave operation unit decomposing the weight data into a plurality of sub-tensors; and
    the slave operation unit performing transformation operations on the plurality of sub-tensors and summing the results, and obtaining the forward-transformed weight data according to the result of the summation.
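The weight transform in claim 36 follows the same pattern but with non-square matrices, which illustrates why claim 34 makes the left and right matrices depend on the scale of the sub-tensor. A hedged sketch, assuming F(2x2, 3x3) so that the weight-transform left matrix is the 4x3 matrix G and the right matrix is G^T; the example kernel is illustrative.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Weight-transform matrix G for 3x3 kernels in F(2x2, 3x3).
G = [[1.0,  0.0, 0.0],
     [0.5,  0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0,  0.0, 1.0]]
GT = transpose(G)

g = [[1, 2, 0],      # example sparse 3x3 kernel
     [0, 3, 0],
     [0, 0, 4]]

direct = matmul(matmul(G, g), GT)   # forward-transformed weights, 4x4

# Decomposition per claim 36: one 3x3 meta-sub-tensor per non-zero
# kernel element, transformed, scaled, and summed.
acc = [[0.0] * 4 for _ in range(4)]
for i in range(3):
    for j in range(3):
        if g[i][j] != 0:
            E = [[0] * 3 for _ in range(3)]
            E[i][j] = 1
            t = matmul(matmul(G, E), GT)   # 4x4 per meta-sub-tensor
            for r in range(4):
                for c in range(4):
                    acc[r][c] += g[i][j] * t[r][c]

assert all(abs(acc[r][c] - direct[r][c]) < 1e-9
           for r in range(4) for c in range(4))
```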
  37. The operation method according to claim 25, wherein the storage unit comprises a main storage unit and a slave storage unit;
    the storage unit storing data for the Winograd convolution operation comprises:
    the main storage unit receiving and storing feature data and weight data; and
    the slave storage unit receiving and storing the forward-transformed weight data.
  38. The operation method according to claim 37, further comprising:
    the main operation unit decomposing the weight data into a plurality of sub-tensors, performing transformation operations on the plurality of sub-tensors, summing the results, and obtaining the forward-transformed weight data according to the result of the summation; and
    the main operation unit sending the forward-transformed weight data to the slave storage unit.
  39. The operation method according to any one of claims 25-38, wherein the main operation unit comprises a main processing module and a cache;
    the main operation unit responding to the first control instruction, extracting data from the storage unit, and performing the forward transformation of the Winograd convolution operation to obtain the forward transformation result comprises:
    the main processing module, in response to the first control instruction, extracting data from the storage unit and performing the forward transformation of the Winograd convolution operation to obtain the forward transformation result; and
    the cache storing the forward transformation result.
  40. The operation method according to claim 39, further comprising:
    the cache sending the forward transformation results to the slave operation unit when the stored forward transformation results accumulate to a preset number.
  41. The operation method according to any one of claims 25-38, further comprising:
    the slave operation unit storing the Winograd convolution result in a preset address space of the storage unit.
  42. The operation method according to any one of claims 25-38, wherein the main operation unit is communicatively connected with a plurality of slave operation units, and different slave operation units are responsible for operating on different forward transformation results.
  43. The operation method according to claim 31, wherein the main operation unit and the slave operation unit operate in parallel: before the main operation unit has finished computing the forward-transformed feature data, the slave operation unit performs element-wise multiplication on the already-computed element positions of the forward-transformed feature data and the corresponding element positions of the forward-transformed weight data, until the element-wise product has been computed for every element position, thereby obtaining the multiplication result.
  44. The operation method according to claim 43, wherein the feature data stored in the main storage unit is divided into a plurality of pieces of first data for the Winograd convolution operation, the size of the first data being determined according to the size of the convolution kernel;
    the forward-transformed weight data stored in the slave storage unit is divided into a plurality of pieces of second data for the Winograd convolution operation, the size of the second data being determined according to the size of the first data;
    the main processing module of the main operation unit, in response to the first control instruction sent by the main control unit, sequentially obtains the first data from the main storage unit, performs the forward transformation on the first data to obtain the forward transformation result of the first data, and stores the forward transformation result of the first data in the cache;
    when the forward transformation results of the first data in the cache reach a preset number, the main processing module of the main operation unit sequentially sends the forward transformation results of the first data in the cache to the slave operation unit;
    the slave operation unit, in response to the second control instruction sent by the slave control unit, obtains the second data from the slave storage unit, performs element-wise multiplication of the forward transformation result of the first data and the second data to obtain an element-wise multiplication result, and performs the inverse transformation on the element-wise multiplication result to obtain an inverse transformation result; and
    the slave operation unit obtains the Winograd convolution result according to the inverse transformation result and sends the Winograd convolution result to the main storage unit for storage.
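The main/slave flow in claims 25 and 43-44 can be sketched end-to-end. A hedged, deliberately simplified 1-D F(2, 3) example (the claims themselves do not fix the tile sizes, and the variable names are illustrative): forward-transform the input tile and the kernel, multiply element-wise, inverse-transform, and check against direct convolution.

```python
def matvec(m, v):
    # Matrix-vector product on plain lists.
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

# Standard F(2, 3) transform matrices.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G  = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

d = [1.0, 2.0, 3.0, 4.0]   # input tile ("first data" in the main storage unit)
g = [1.0, 1.0, 1.0]        # kernel (slave storage holds its forward transform)

V = matvec(BT, d)          # main unit: forward transform of the input tile
U = matvec(G, g)           # forward transform of the weights
M = [u * v for u, v in zip(U, V)]   # slave unit: element-wise multiplication
y = matvec(AT, M)          # slave unit: inverse transform -> 2 outputs

# Reference: direct valid convolution (neural-network correlation).
ref = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
assert all(abs(a - b) < 1e-9 for a, b in zip(y, ref))
```

With the example data above, both paths yield the output tile [6.0, 9.0], matching the two valid positions of the sliding window.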
PCT/CN2020/113160 2019-11-01 2020-09-03 Computing device and method, and related product WO2021082722A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911061078.3A CN112765539B (en) 2019-11-01 2019-11-01 Computing device, computing method and related product
CN201911061078.3 2019-11-01

Publications (1)

Publication Number Publication Date
WO2021082722A1 true WO2021082722A1 (en) 2021-05-06

Family

ID=75692126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113160 WO2021082722A1 (en) 2019-11-01 2020-09-03 Computing device and method, and related product

Country Status (2)

Country Link
CN (1) CN112765539B (en)
WO (1) WO2021082722A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325591A (en) * 2018-09-26 2019-02-12 中国科学院计算技术研究所 Neural network processor towards Winograd convolution
CN109359730A (en) * 2018-09-26 2019-02-19 中国科学院计算技术研究所 Neural network processor towards fixed output normal form Winograd convolution

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
WO2018107383A1 (en) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Neural network convolution computation method and device, and computer-readable storage medium
CN108229656A (en) * 2016-12-14 2018-06-29 上海寒武纪信息科技有限公司 Neural network computing device and method
WO2018108126A1 (en) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Neural network convolution operation device and method
US10482155B2 (en) * 2016-12-30 2019-11-19 Intel Corporation Winograd algorithm on a matrix processing architecture
US10990648B2 (en) * 2017-08-07 2021-04-27 Intel Corporation System and method for an optimized winograd convolution accelerator
US10372787B2 (en) * 2017-12-12 2019-08-06 Facebook, Inc. Hardware accelerator pre-configured with coefficients for matrix-transform operations
CN110163349B (en) * 2018-02-12 2021-03-23 上海寒武纪信息科技有限公司 Network model calculation method and device
CN110147249B (en) * 2018-02-12 2021-02-09 上海寒武纪信息科技有限公司 Network model calculation method and device
US11586907B2 (en) * 2018-02-27 2023-02-21 Stmicroelectronics S.R.L. Arithmetic unit for deep learning acceleration


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FENG SHI, HAOCHEN LI, YUHE GAO, BENJAMIN KUSCHNER, SONG-CHUN ZHU: "Sparse Winograd Convolutional neural networks on small-scale systolic arrays", COMPUTER SCIENCE, 3 October 2018 (2018-10-03), pages 1 - 7, XP080933823 *

Also Published As

Publication number Publication date
CN112765539A (en) 2021-05-07
CN112765539B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN109522052B (en) Computing device and board card
CN109543832B (en) Computing device and board card
TWI795519B (en) Computing apparatus, machine learning computing apparatus, combined processing device, neural network chip, electronic device, board, and method for performing machine learning calculation
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
CN110059797B (en) Computing device and related product
WO2021082725A1 (en) Winograd convolution operation method and related product
WO2021083101A1 (en) Data processing method and apparatus, and related product
CN109670581B (en) Computing device and board card
CN115221102B (en) Method for optimizing convolution operation of system-on-chip and related product
WO2021185262A1 (en) Computing apparatus and method, board card, and computer readable storage medium
WO2021082723A1 (en) Operation apparatus
WO2021082722A1 (en) Computing device and method, and related product
WO2021082746A1 (en) Operation apparatus and related product
WO2021223642A1 (en) Data processing method and apparatus, and related product
WO2021082721A1 (en) Winograd convolution operation method, apparatus, and device, and storage medium
WO2021082724A1 (en) Operation method and related product
CN111382852B (en) Data processing device, method, chip and electronic equipment
WO2021082747A1 (en) Operational apparatus and related product
CN111047030A (en) Operation method, operation device, computer equipment and storage medium
CN111061507A (en) Operation method, operation device, computer equipment and storage medium
WO2021223644A1 (en) Data processing method and device, and related product
CN111222632B (en) Computing device, computing method and related product
WO2021169914A1 (en) Data quantification processing method and apparatus, electronic device and storage medium
WO2021223645A1 (en) Data processing method and apparatus, and related product
WO2021212972A1 (en) Operation method, processor, and related product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20880593

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20880593

Country of ref document: EP

Kind code of ref document: A1