WO2021083097A1

WO2021083097A1 - Data processing method and apparatus, and computer device and storage medium

Info

Publication number: WO2021083097A1
Application number: PCT/CN2020/123837
Authority: WO
Inventors: 张英男; 曾洪博; 张尧; 刘少礼; 黄迪; 周诗怡; 张曦珊; 刘畅; 郭家明; 高钰峰
Original assignee: 中科寒武纪科技股份有限公司
Priority date: 2019-11-01
Filing date: 2020-10-27
Publication date: 2021-05-06
Also published as: CN112765538B; CN112765538A

Abstract

A data processing method and apparatus, and a computer device and a storage medium. The method comprises: splitting a first convolution kernel according to a step length N to obtain a plurality of second convolution kernels (S11); splitting first input data according to the step length N to obtain a plurality of second input data corresponding to the plurality of first convolution kernels (S12); for any second input data, executing a Winograd convolution operation on the second input data and the corresponding second convolution kernel to obtain a convolution result corresponding to the second input data (S13); and determining that a sum of the convolution results corresponding to the plurality of second input data is a convolution result of the first convolution kernel and the first input data (S14). By means of the above method, the reusability of data can be improved.

Description

Data processing method, device, computer equipment and storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201911061027.0, and the invention title is "data processing method, device, computer equipment and storage medium" on November 1, 2019, the entire content of which is incorporated by reference In this application.

Technical field

The present disclosure relates to the field of data processing technology, and in particular to a data processing method, device, computer equipment, and storage medium.

Background technique

In the field of artificial intelligence technology, neural network algorithm is a very popular machine learning algorithm recently, and has achieved very good results in various fields, such as image recognition, speech recognition, natural language processing, etc. With the development of neural network algorithms, the complexity of the algorithm is getting higher and higher. In order to improve the recognition, the scale of the model is gradually increasing. Using GPU and CPU to process these large-scale models requires a lot of computing time and consumes a lot of power.

Summary of the invention

Based on this, the embodiments of the present disclosure provide a data processing method, device, computer equipment, and storage medium that can improve the multiplexing rate of data.

According to an aspect of the present disclosure, there is provided a data processing method, including:

Split the first convolution kernel according to the step size N to obtain multiple second convolution kernels;

Splitting the first input data according to the step size N to obtain a plurality of second input data corresponding to the plurality of first convolution kernels;

For any of the second input data, perform a winograd convolution operation on the second input data and the corresponding second convolution kernel to obtain a convolution result corresponding to the second input data;

It is determined that the sum of the convolution results corresponding to the plurality of second input data is the convolution result of the first convolution kernel and the first input data.

According to another aspect of the present disclosure, there is provided a data processing device, including:

The first splitting module is used to split the first convolution kernel according to the step size N to obtain multiple second convolution kernels;

The second splitting module is configured to split the first input data according to the step size N to obtain multiple second input data corresponding to the multiple first convolution kernels;

The convolution module is configured to perform a winograd convolution operation on the second input data and the corresponding second convolution kernel for any of the second input data to obtain a convolution result corresponding to the second input data ；

The determining module is configured to determine that the sum of the convolution results corresponding to the plurality of second input data is the convolution result of the first convolution kernel and the first input data.

According to another aspect of the present disclosure, an artificial intelligence chip is provided, and the chip includes the data processing device according to any one of the foregoing.

According to another aspect of the present disclosure, there is provided an electronic device including the aforementioned artificial intelligence chip.

According to another aspect of the present disclosure, there is provided a board card, the board card comprising: a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip;

Wherein, the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;

The storage device is used to store data;

The interface device is used to implement data transmission between the artificial intelligence chip and external equipment;

The control device is used to monitor the state of the artificial intelligence chip.

According to another aspect of the present disclosure, there is provided an electronic device including:

processor;

A memory for storing processor executable instructions;

Wherein, the processor is configured to call instructions stored in the memory to execute the method described in any one of the foregoing.

According to another aspect of the present disclosure, there is provided a computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions implement the method described in any one of the foregoing when the computer program instructions are executed by a processor .

In this way, the data processing method, device, computer equipment, and storage medium provided by the embodiments of the present disclosure can split the first convolution kernel with a step size greater than 1 and the first input data into multiple second volumes with a step size of 1. The product core and multiple second input data improve the data multiplexing rate.

According to the following detailed description of exemplary embodiments with reference to the accompanying drawings, other features and aspects of the present disclosure will become clear.

Description of the drawings

The drawings included in the specification and constituting a part of the specification together with the specification illustrate exemplary embodiments, features, and aspects of the present disclosure, and are used to explain the principle of the present disclosure.

Figure 1 shows a data processing method provided by an embodiment of the present disclosure;

Fig. 2 shows a schematic diagram of a data processing method of an example of the present disclosure;

FIG. 3 shows a structural block diagram of a data processing device provided by an embodiment of the present disclosure;

Figure 4 shows a block diagram of a board according to an embodiment of the present disclosure;

FIG. 5 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure;

FIG. 6 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.

Detailed ways

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, rather than all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work shall fall within the protection scope of the present disclosure.

It should be understood that the terms "first", "second", "third" and "fourth" in the claims, specification and drawings of the present disclosure are used to distinguish different objects, rather than to describe a specific order. . The terms "comprising" and "comprising" used in the specification and claims of the present disclosure indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude one or more other features, wholes The existence or addition of, steps, operations, elements, components, and/or their collections.

It should also be understood that the terms used in this specification of the present disclosure are only for the purpose of describing specific embodiments, and are not intended to limit the present disclosure. As used in the specification and claims of this disclosure, unless the context clearly indicates otherwise, the singular forms of "a", "an" and "the" are intended to include plural forms. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any combination of one or more of the items listed in association and all possible combinations, and includes these combinations.

As used in this specification and claims, the term "if" can be interpreted as "when" or "once" or "in response to determination" or "in response to detection" depending on the context. Similarly, the phrase "if determined" or "if detected [described condition or event]" can be interpreted as meaning "once determined" or "in response to determination" or "once detected [described condition or event]" depending on the context ]" or "in response to detection of [condition or event described]".

Winograd convolution is a convolution acceleration implementation method based on polynomial interpolation algorithm. It passes the two inputs of the convolution operation: the input data and the convolution kernel are divided into a certain size and then linear transformation (winograd positive transformation) is performed respectively, and then the transformed input data and the convolution kernel are subjected to bitwise multiplication, and finally The linear transformation (winograd inverse transformation) is performed on the result of the bit multiplication again to obtain the convolution result equivalent to the original convolution operation.

The expression of winograd transformation is as follows:

For one-dimensional input data and convolution kernel: S=A ^T ((Gg)⊙(B ^T d))

For two-dimensional input data and convolution kernel: S=A ^T ((GgG ^T )⊙(B ^T dB))A

Among them, g represents the convolution kernel, G represents the left multiplication positive transformation matrix ^{corresponding to the convolution kernel, G T} represents the right multiplication positive transformation matrix corresponding to the convolution kernel, d represents the input data, and B represents the right multiplication positive transformation corresponding to the input data Matrix, B ^T represents the left multiplication positive transformation matrix corresponding to the input data, ⊙ represents the bitwise multiplication operation, A represents the right multiplication inverse transformation matrix, and ^AT represents the left multiplication and inverse transformation matrix. For input data of different dimensions, there are B and B ^{T corresponding to them} ; similarly, for the convolution kernels of different dimensions, there are G and G ^{T corresponding to them} .

Replacing the original convolution operation by winograd convolution can bring greater benefits in hardware energy efficiency and computing time, and at the same time, higher neural network performance can be achieved without increasing or increasing less hardware overhead. However, when winograd convolution has a larger step size of the convolution kernel, the data reuse rate is lower.

The present disclosure provides a data processing method, which can split a convolution kernel with a step size greater than 1 in a winograd convolution process into a convolution kernel with a step size of 1, so as to improve the data multiplexing rate.

Fig. 1 shows a data processing method provided by an embodiment of the present disclosure. The method may be applied to a processor. As shown in Fig. 1, the method may include:

In step S11, the first convolution kernel is split according to the step size N to obtain multiple second convolution kernels;

In step S12, the first input data is split according to the step size N to obtain a plurality of second input data corresponding to the plurality of first convolution kernels.

For example, the first convolution kernel can be split into multiple second convolution kernels with a step size of 1 according to the step size N, and the first input data can be split into multiple steps with a step size of 1 according to the step size N The second input data. Among them, the original first input data may be image data, sound data, or video data. Taking the input data as image data as an example, the input data can be expressed in the form of NHWC (batch, height, width, channels), N represents the number of images, HW can respectively represent the number of pixels in the height and width directions, and C can represent The number of channels, for example: C can represent three channels of RGB (Red, Green, Blue). It should be noted that the above representation is only an example of the present disclosure, and the present disclosure is not limited to this.

In a possible implementation manner, the foregoing splitting of the first convolution kernel according to the step size N to obtain multiple second convolution kernels may include:

For the rows and columns of the first convolution kernel, the first convolution kernel is split with an interval of N-1 steps to obtain multiple second convolution kernels.

For example, when the step size is N, for each row and each column of the first convolution kernel, the first convolution kernel can be split with an interval of N-1 steps, that is, for each row and each column of the first convolution kernel, The current element of, obtain an element in the row and column at an interval of N-1, and this element and the current element belong to the same second convolution kernel, and execute the cycle with this element as the current element, and the interval N-1 to obtain the second The process of convolution kernel elements.

In a possible implementation manner, for the rows and columns of the first convolution kernel, the first convolution kernel is split with an interval of N-1 steps to obtain multiple second convolution kernels, including:

Traverse the elements in the first convolution kernel, repeat every interval N-1 rows to determine a row target row, for the target row, get one element every interval N-1 column, and the acquired multiple elements form a second convolution The process of the kernel until the elements in the first convolution kernel are traversed.

For example, for the first convolution kernel of m×n, you can start from the first row of the first convolution kernel, and determine a behavior target row every N-1 rows. For each target row, you can start from the first row of the first convolution kernel. Starting with one element, one element is obtained every N-1 column, and the elements in each target row obtained are determined to form a second convolution kernel. Continue to traverse the second element in each target row, and obtain an element every N-1 column interval, determine that the elements in each target row obtained form another second convolution kernel, until the first convolution kernel Any element of is traversed.

The splitting the first input data according to the step size N to obtain multiple second input data corresponding to the multiple first convolution kernels includes:

For the rows and columns of the first input data, the first input data is split with an interval of N-1 steps to obtain multiple second input data corresponding to the multiple first convolution kernels.

For example, when the step size is N, for each row and each column of the first input data, the first input data can be split with an interval of N-1 steps, that is, for the current element in the first input data , Obtain an element in the row and column at intervals of N-1, and the element and the current element belong to the same second input data, and execute cyclically with this element as the current element, and obtain the elements that make up the second input data at intervals of N-1 the process of.

In a possible implementation manner, for the rows and columns of the first input data, the first input data is split with an interval of N-1 steps to obtain multiple data corresponding to the multiple first convolution kernels. The second input data can include:

Traverse the elements in the first input data, repeat every interval N-1 rows to determine a row target row, for the target row, every interval N-1 column to obtain an element, the obtained multiple elements form a second input data The process until the elements in the first input data are traversed.

For example, for the first input data of m×n, you can start from the first row of the first input data, and determine a behavior target row every N-1 rows. For each target row, you can start from the first element At the beginning, an element is obtained in every interval N-1 column, and the elements in each target row obtained are determined to form a second input data. Continue to traverse the second element in each target row, and obtain an element every N-1 column interval, determine that the elements in each target row obtained constitute another second input data, until any of the first input data All elements are traversed.

Fig. 2 shows a schematic diagram of a data processing method of an example of the present disclosure.

As shown in Figure 2, for the winograd convolution with the first input data being 5×5, the first convolution kernel being 3×3, and the step size of 2, the first input data and the first convolution can be based on the step size of 2. The core is split, the first convolution kernel is split into 4 second convolution kernels, and the first input data is split into 4 second input data, specifically:

For the first convolution kernel, a target line is determined at an interval of 1 line, then the first line and the third line are determined as the target line. For the target line, start from the first element and take one element at an interval of 1 element (marked in the figure "1"), all the finally obtained elements form the second convolution kernel (1), where the elements taken from each target row form a row of the second convolution kernel in order.

For the second element of the target row, one element is taken at an interval of one element (marked as "2" in the figure), and all the elements finally obtained form the second convolution kernel (2).

At this point, the elements of the target row have been traversed, and the target row of the second line is re-determined. For the target row, start from the first element and take one element (marked as "3" in the figure), and finally obtain all the elements. The elements form the second convolution kernel (3).

Starting from the second element, one element is taken at an interval of 1 element (marked as "4" in the figure), and all the elements finally obtained form the second convolution kernel (4).

So far, after traversing the elements in the first convolution kernel, 4 second convolution kernels are obtained.

Similarly, for the first input data, a target row is determined at an interval of 1 line, and then the first row, the third row, and the fifth row are determined as the target row. For the target row, start from the first element and take the interval 1 element One element (marked as "1" in the figure), and all the elements finally obtained form the second input data (1).

For the second element of the target row, one element is taken at an interval of 1 element (marked as "2" in the figure), and all the elements finally obtained form the second input data (2).

At this point, the elements of the target row have completed the traversal, and the second and fourth rows of the target row have been re-determined. For the target row, start from the first element, and take one element (marked as "3" in the figure). All elements finally obtained constitute the second input data (3).

Starting from the second element, one element (marked as "4" in the figure) is taken one element apart, and all the elements finally obtained form the second input data (4).

So far, after traversing the elements in the first input data, 4 second input data are obtained.

In a possible implementation, the correspondence between the second input data and the second convolution kernel is specifically: the first element in the second input data is in the first input data The position of is the same as the position of the first element in the second convolution kernel in the first convolution kernel.

For example, assuming that the position of the first element of the second input data in the first input data is the x-th row and the y-th column, the position of the first element in the corresponding second convolution kernel in the first convolution It is also the xth row and the yth column.

Taking the above example as an example, the second convolution kernel (1) has a corresponding relationship with the second input data (1), and the position of the first element in the second convolution kernel (1) in the first convolution kernel is In the first row and the first column, the position of the first element in the second input data (1) in the first input data is the first row and the first column.

The second convolution kernel (2) has a corresponding relationship with the second input data (2), and the position of the first element in the second convolution kernel (2) in the first convolution kernel is the first row and the second column , The position of the first element in the second input data (2) in the first input data is the first row and the second column.

The second convolution kernel (3) has a corresponding relationship with the second input data (3), and the position of the first element in the second convolution kernel (3) in the first convolution kernel is the second row and the first column , The position of the first element in the second input data (3) in the first input data is the second row and the first column.

The second convolution kernel (4) has a corresponding relationship with the second input data (4), and the position of the first element in the second convolution kernel (4) in the first convolution kernel is the second row and the second column , The position of the first element in the second input data (4) in the first input data is the second row and the second column.

In step S13, for any of the second input data, perform a winograd convolution operation on the second input data and the corresponding second convolution kernel to obtain a convolution result corresponding to the second input data.

In step S14, it is determined that the sum of the convolution results corresponding to the plurality of second input data is the convolution result of the first convolution kernel and the first input data.

For example, the winograd convolution operation can be performed on any second input data and the corresponding second convolution kernel to obtain the convolution result corresponding to the second input data, and perform the convolution result of all the second input data The summation operation determines that the sum of the convolution results of all the second input data is the convolution result of the first convolution kernel and the first input data.

Still taking the above example as an example, perform the winograd convolution operation on the second input data (1) and the corresponding second convolution kernel (1) to obtain the convolution result corresponding to the second input data (1); The input data (2) and the corresponding second convolution kernel (2) perform the winograd convolution operation to obtain the convolution result corresponding to the second input data (2); the second input data (3) and the corresponding second The convolution kernel (3) performs the winograd convolution operation to obtain the convolution result corresponding to the second input data (3); performs winograd convolution on the second input data (4) and the corresponding second convolution kernel (4) Operate to obtain the convolution result corresponding to the second input data (4). Determine the convolution result corresponding to the second input data (1), the convolution result corresponding to the second input data (2), the convolution result corresponding to the second input data (3), and the volume corresponding to the second input data (4) The sum of the product results is the winograd convolution result of the first input data and the first convolution kernel.

In this way, the data processing method provided by the embodiments of the present disclosure can split the first convolution kernel with a step size greater than 1 and the first input data into multiple second convolution kernels with a step size of 1 and multiple second inputs. Data, improve the multiplexing rate of data.

However, the disadvantages of winograd convolution are still more obvious, and a large number of multiplication operations still consume a long time in the calculation process.

In order to solve the above technical problems, the present disclosure provides a data processing method that can disassemble the multiplication operation in the winograd convolution process into an addition operation, thereby saving calculation time, reducing energy consumption, and convolving winograd The data in the process is quantified to further improve the calculation performance.

In a possible implementation manner, for any of the second input data, the second input data and the corresponding second convolution kernel perform a winograd convolution operation to obtain the second input The convolution result corresponding to the data can include:

Disassemble the winograd forward transformation of the second input data into a summation operation, and perform calculation to obtain the winograd forward transformation result of the second input data;

Disassemble the winograd positive transformation of the second convolution kernel into a summation operation, and perform calculation to obtain the winograd positive transformation result of the second convolution kernel;

Performing an alignment multiplication operation of the winograd forward transformation result of the second input data and the winograd forward transformation result of the second convolution kernel to obtain the alignment multiplication result;

The winograd inverse transform of the alignment multiplication result is disassembled into a summation operation to obtain a convolution result of the second input data and the corresponding second convolution kernel.

In a possible implementation manner, the above-mentioned disassembling the winograd forward transformation of the second input data into a summation operation, and performing calculation to obtain the winograd forward transformation result of the second input data may include:

The second input data is disassembled into a plurality of first sub-tensors, and winograd positive transformation is performed on the plurality of first sub-tensors and summed to obtain a winograd positive transformation result of the second input data.

In a possible implementation manner, the number of the plurality of first sub-tensors is the same as the number of non-zero elements of the second input data, and each of the plurality of first sub-tensors One element in a sub-tensor is the same as the element at the corresponding position in the second input data, and the other elements are all zero.

For example, suppose that the second input data is expressed as:

The second input data is a 4×4 matrix including 16 elements. Therefore, the second input data can be decomposed into 16 first sub-tensors.

Then, according to the disassembly method of the present disclosure, the 16 first sub-tensors are:

One element in each first sub-tensor is the same as the element at the corresponding position in the second input data, and the other elements are all 0 means: taking the first sub-tensor d ₀₀ as an example, in the first row and first column The position element is the same as the position element of the second input data in the first row and first column. The other elements are all 0, and the other first subtensors also have the same attributes.

It should be noted that the above disassembly methods are only some examples of the present disclosure, and do not limit the present disclosure in any way. For example, if the second input data has an element with a value of 0, the first sub-tensor obtained by disassembly is The number may be less than the number of elements of the second input data. For example, the number of multiple first subtensors is the same as the number of non-zero elements of the second input data.

In a possible implementation manner, performing winograd forward transformation on the multiple first subtensors and summing them to obtain the winograd forward transformation result of the second input data may include the following process:

Obtain the winograd positive transformation result of the first sub-tensor corresponding to the first sub-tensor; where the first sub-tensor corresponding to the first sub-tensor is: the value of the element at the first position in the first sub-tensor Is 1, where the position of the first position in the first sub-tensor is the same as the position of the non-zero element in the first sub-tensor;

Multiplying the non-zero element value of the first sub-tensor by the coefficient of the winograd positive transformation result of the corresponding first-element sub-tensor to obtain the winograd positive transformation result of the first sub-tensor;

The winograd positive transformation results of the multiple first subtensors are added to obtain the winograd positive transformation result of the second input data.

Still taking the first sub-tensor d ₀₀ as an example, the first-element sub-tensor corresponding to _{d 00 can be}

In other words, the first sub-tensor is to extract the values of non-zero elements in the first sub-tensor, and the values of non-zero elements can be used as coefficients of the first sub-tensor.

Among them, the winograd positive transformation result of the first sub-tensor corresponding to the first sub-tensor can be obtained in advance through the following process: For each first sub-tensor, the first sub-tensor corresponding to the first sub-tensor The left side of the sub-tensor is multiplied by the positive transformation, the left-multiplied matrix, and the right is multiplied by the positive transformation, and the right-multiplied matrix is used to obtain the winograd positive transformation result of the first sub-tensor.

For matrices of different sizes, the form of the corresponding first element sub-tensor is determined, and the corresponding positive transformation left-multiplication matrix and forward transformation right-multiplication matrix are also determined.

Therefore, the winograd positive transformation result of the first sub-tensor can be calculated in advance, and the specific process is as described above. For example, still taking d ₀₀ as an example, the corresponding winograd positive transformation result of the first sub-tensor is:

For another example, taking d ₀₁ as an example, the winograd positive transformation result of the corresponding first-element sub-tensor is:

Since the element values of the positive transformation left multiplication matrix and the positive transformation right multiplication matrix are both 0 and ±1, the element value of the first sub-tensor is 0 or 1, and the element in the winograd positive transformation result of the first sub-tensor is also 0, ±1. Therefore, the matrix multiplication operation can be broken down into an addition operation.

The process of calculating the winograd forward transformation result of the first element sub-tensor involves more multiplication operations. Through the method of the present disclosure, the pre-calculated winograd forward transformation results of the first element subtensor of various scales can be saved. In this way, in the actual calculation process, it can be directly obtained without repeated calculations, thereby shortening calculation time and saving calculation resources.

After obtaining the winograd positive transformation result of the first sub-tensor corresponding to the first sub-tensor, the non-zero element value in the first sub-tensor can be multiplied by the winograd positive transformation result of the corresponding first sub-tensor, You can get the winograd positive transformation result of the first subtensor. For example, still taking d ₀₀ as an example, the corresponding winograd positive transformation result is:

For another example, taking d ₀₁ as an example, the winograd positive transformation result of _{d 01 is}

The winograd positive transformation results of all the first sub-tensors are calculated through the above process, and the winograd positive transformation results of multiple first sub-tensors are added to obtain the winograd positive transformation results of the second input data.

Similarly, the winograd positive transformation of the second convolution kernel can be disassembled into a summation operation, and calculations are performed to obtain the winograd positive transformation result of the second convolution kernel. In a possible implementation manner, the Disassembling the winograd positive transformation of the second convolution kernel into a summation operation, and performing calculations to obtain the winograd positive transformation result of the second convolution kernel may include:

The second convolution kernel is disassembled into a plurality of second sub-tensors, and winograd positive transformation is performed on the plurality of second sub-tensors and summed to obtain a winograd positive transformation result of the second convolution kernel.

In a possible implementation manner, the number of the plurality of second sub-tensors is the same as the number of elements of the second convolution kernel, and each second sub-tensor of the plurality of second sub-tensors One element in the sub-tensor is the same as the element at the corresponding position in the second convolution kernel, and the other elements are all zero.

Assuming that the second convolution kernel can be expressed as:

The second convolution kernel is a 3×3 matrix and includes 9 elements. Therefore, the second convolution kernel can be decomposed into 9 second sub-tensors.

Then, according to the disassembly method of the present disclosure, the 9 second sub-tensors are:

Similarly, one element in each second subtensor is the same as the element at the corresponding position in the second convolution kernel, and the other elements are all zero.

The process of performing winograd positive transformation on the multiple second subtensors and summing them to obtain the winograd positive transformation result of the second convolution kernel can be obtained by referring to the aforementioned winograd positive transformation on multiple first subtensors and summing them. The process of the winograd positive transformation result of the second input data is not repeated here in this disclosure.

After obtaining the winograd positive transformation result of the second input data and the winograd positive transformation result of the second convolution kernel, the winograd positive transformation result of the second input data and the winograd positive transformation result of the second convolution kernel can be executed The counter-multiply operation of, get the result of the counter-multiply.

Wherein, the bitwise multiplication may refer to the data obtained by multiplying the data at the corresponding positions of the two tensors as the value of the corresponding position in the bitwise multiplication result.

Assuming that the winograd positive transformation result B ^T d _4×4 B of the second input data can be expressed as:

Winograd positive transformation result of the second convolution kernel

It can be expressed as:

Then the result of counter multiplication can be:

The winograd convolution result of the second input data can be expressed as S _4×4 =A ^T (G _4×4 ⊙D _4×4 )A, the present disclosure can take A ^T (G _4×4 ⊙D _4×4 )A The disassembly is a summation operation, and calculation is performed to obtain the winograd convolution result of the second input data, thereby further saving calculation time and reducing energy consumption.

In a possible implementation manner, the above-mentioned disassembling the winograd inverse transform of the alignment multiplication result into a summation operation obtains the convolution result of the second input data and the corresponding second convolution kernel , Can include:

The result of the alignment multiplication is disassembled into a plurality of third sub-tensors, and winograd inverse transformation is performed on the plurality of third sub-tensors and summed to obtain the second input data and the corresponding second The convolution result of the convolution kernel.

In a possible implementation manner, the number of the plurality of third sub-tensors is the same as the number of the non-zero elements of the alignment multiplication result, and each of the plurality of third sub-tensors One element in the third sub-tensor is the same as the element at the corresponding position in the alignment multiplication result, and the other elements are all zero.

Assume that the result of counter multiplication is

The result of the alignment multiplication is disassembled into multiple third sub-tensors, for example, it can be disassembled into 16, and the 16 third sub-tensors are:

After the disassembly is completed, winograd inverse transformation may be performed on the multiple third sub-tensors and summed to obtain the winograd convolution result of the second input data.

In a possible implementation manner, performing winograd inverse transformation on the multiple third subtensors and summing them to obtain the winograd convolution result of the second input data may include the following process:

Obtain the winograd inverse transform result of the third sub-tensor corresponding to the third sub-tensor; where the third-element sub-tensor corresponding to the third sub-tensor is: the value of the element at the second position in the third-element sub-tensor Is 1, where the position of the second position in the third sub-tensor is the same as the position of the non-zero element in the third sub-tensor;

Multiplying the non-zero element value of the third subtensor with the winograd inverse transform result of the corresponding third subtensor as the coefficient to obtain the winograd inverse transform result of the third subtensor;

The winograd inverse transform results of the multiple third subtensors are added to obtain the winograd convolution result of the second input data.

The method for determining the third-element sub-tensor corresponding to the third sub-tensor is the same as the method for determining the first-element sub-tensor above, and will not be repeated here. Among them, the winograd inverse transform result of the third sub-tensor is obtained in advance through the following process: For each third sub-tensor, the left side of the third sub-tensor corresponding to the third sub-tensor is multiplied by the inverse transform Multiplying the matrix on the left, multiplying by the inverse transformation on the right, and multiplying the matrix on the right to obtain the winograd inverse transformation result of the third-element subtensor.

For matrices of different sizes, the form of the corresponding third-element sub-tensor is determined, and the corresponding inverse transform left multiplication matrix and inverse transform right multiplication matrix are also determined. Therefore, the winograd inverse transformation result of the third-element sub-tensor can be calculated in advance, and the specific process is as described above. For the examples listed in this article, the left multiplication matrix of the inverse transformation is a 2×4 matrix, for example:

The inverse transformation right multiplication matrix is a 4×2 matrix, for example:

The dimension of the inverse transformation matrix can be determined according to the dimension of the second input data and the dimension of the second convolution kernel and the convolution step length. The above is only an example and does not limit the present disclosure in any way.

The inverse transformation matrix consists of 0,

±1 constitutes, so the matrix multiplication operation of the inverse transformation can be realized by disassembling it into addition and shift operations. Multiply the inverse transformation matrix by the third-element sub-tensor to obtain the winograd inverse transformation result of the third-element sub-tensor. The element value in the winograd inverse transformation result of the third-element sub-tensor is 0,

With the composition of ±1, the fraction can be calculated by a simple shift operation, which can still save calculation time compared to the multiplication operation.

For the winograd inverse transform result of multiplying the element values of the third subtensor that are not 0 as coefficients by the corresponding third subtensor, the winograd inverse transform result of the third subtensor is obtained, and the multiple third subtensors For the specific process of adding the winograd inverse transform results of the second input data to obtain the winograd convolution result of the second input data, refer to the above-mentioned multiplying the element values of the first subtensor that are not 0 as coefficients by the corresponding first element subtensor The winograd positive transformation result of the first sub-tensor is obtained, and the winograd positive transformation results of multiple first sub-tensors are added to obtain the winograd positive transformation result of the input data, but the third element The result of the winograd inverse transformation of tensor is not completely composed of 0 and ±1, but the score can be calculated by a simple shift operation. Compared with the multiplication operation, the present disclosure can still save calculation time after disassembling the ordinary inverse transformation process. , The effect of reducing energy consumption.

According to the above-mentioned embodiments of the present disclosure, it can be known that multiple third sub-tensors are obtained by disassembling the bit multiplication results, and the winograd inverse transform results of the third-element sub-tensors corresponding to the third sub-tensors obtained in advance and The non-zero element value of the third subtensor can be summed to obtain the winograd convolution result of the input data. According to the arithmetic device of the present disclosure, disassembling the multiplication operation into a summation operation can save calculation time and reduce energy consumption.

It should be noted that for the foregoing method embodiments, for the sake of simple description, they are all expressed as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described sequence of actions. Because according to the present disclosure, certain steps can be performed in other order or at the same time. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the involved actions and modules are not necessarily required by the present disclosure.

It should be further noted that although the various steps in the flowchart of FIGS. 1-2 are displayed in sequence as indicated by the arrows, these steps are not necessarily performed in sequence in the order indicated by the arrows. Unless there is a clear description in this article, there is no strict order for the execution of these steps, and these steps can be executed in other orders. Moreover, at least part of the steps in Figure 1-2 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times. These sub-steps or stages The execution order of is not necessarily performed sequentially, but may be performed alternately or alternately with at least a part of other steps or sub-steps or stages of other steps.

Fig. 3 shows a structural block diagram of a data processing device provided by an embodiment of the present disclosure. As shown in Fig. 3, the device may include:

The first splitting module 301 may be used to split the first convolution kernel according to the step size N to obtain multiple second convolution kernels;

The second splitting module 302 may be used to split the first input data according to the step size N to obtain multiple second input data corresponding to the multiple first convolution kernels;

The convolution module 303 may be used to perform a winograd convolution operation on the second input data and the corresponding second convolution kernel for any of the second input data to obtain the volume corresponding to the second input data Product result

The determining module 304 may be used to determine that the sum of the convolution results corresponding to the plurality of second input data is the convolution result of the first convolution kernel and the first input data.

In this way, the data processing device provided by the embodiment of the present disclosure can split the first convolution kernel with a step size greater than 1 and the first input data into multiple second convolution kernels with a step size of 1 and multiple second inputs. Data, improve the multiplexing rate of data.

In a possible implementation manner, the first splitting module may also be used for:

For the rows and columns of the first convolution kernel, split the first convolution kernel with an interval of N-1 steps to obtain multiple second convolution kernels;

The second splitting module can also be used for:

Traverse the elements in the first convolution kernel, repeat every interval N-1 rows to determine a row target row, for the target row, get one element every interval N-1 column, and the acquired multiple elements form a second convolution The process of the kernel until the elements in the first convolution kernel are traversed;

The second splitting module can also be used for:

In a possible implementation manner, the above convolution module can also be used for:

In a possible implementation manner, the convolution module may also be used for:

In a possible implementation manner, the number of the plurality of second sub-tensors is the same as the number of elements of the second convolution kernel, and each second sub-tensor of the plurality of second sub-tensors One element in the tensor is the same as the element at the corresponding position in the second convolution kernel, and the other elements are all zero.

In some embodiments of the present disclosure, the functions or modules contained in the device provided in the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments. For specific implementation and technical effects, please refer to the above method embodiments. Description, for the sake of brevity, I will not repeat it here.

It should be understood that the foregoing device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways. For example, the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation. For example, multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.

In addition, unless otherwise specified, the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may exist. The modules are integrated together. The above-mentioned integrated unit/module can be implemented in the form of hardware or software program module.

If the integrated unit/module is implemented in the form of hardware, the hardware may be a digital circuit, an analog circuit, and so on. The physical realization of the hardware structure includes but is not limited to transistors, memristors and so on. Unless otherwise specified, the artificial intelligence processor may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on. Unless otherwise specified, the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as RRAM (Resistive Random Access Memory), DRAM (Dynamic Random Access Memory), Static random access memory SRAM (Static Random-Access Memory), enhanced dynamic random access memory EDRAM (Enhanced Dynamic Random Access Memory), high-bandwidth memory HBM (High-Bandwidth Memory), hybrid storage cube HMC (Hybrid Memory Cube), etc. Wait.

If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer readable memory. Based on this understanding, the technical solution of the present disclosure essentially or the part that contributes to the prior art or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a memory, It includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned memory includes: U disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.

In a possible implementation manner, an artificial intelligence chip is also disclosed, which includes the above-mentioned data processing device.

In a possible implementation manner, a board card is also disclosed, which includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip; wherein, the artificial intelligence chip is connected to the storage device and the control device And the interface devices are respectively connected; the storage device is used to store data; the interface device is used to realize data transmission between the artificial intelligence chip and an external device; the control device is used to The state of the artificial intelligence chip is monitored.

Fig. 4 shows a structural block diagram of a board card according to an embodiment of the present disclosure. Referring to Fig. 4, the board card may include other supporting components in addition to the chip 389 described above. The supporting components include, but are not limited to: a storage device 390, Interface device 391 and control device 392;

The storage device 390 is connected to the artificial intelligence chip through a bus for storing data. The storage device may include multiple groups of storage units 393. Each group of the storage unit and the artificial intelligence chip are connected through a bus. It can be understood that each group of the storage units may be DDR SDRAM (English: Double Data Rate SDRAM, double-rate synchronous dynamic random access memory).

DDR does not need to increase the clock frequency to double the speed of SDRAM. DDR allows data to be read on the rising and falling edges of the clock pulse. The speed of DDR is twice that of standard SDRAM. In an embodiment, the storage device may include 4 groups of the storage units. Each group of the storage unit may include a plurality of DDR4 particles (chips). In one embodiment, the artificial intelligence chip may include four 72-bit DDR4 controllers. In the 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC verification. It can be understood that when DDR4-3200 particles are used in each group of the storage units, the theoretical bandwidth of data transmission can reach 25600MB/s.

In one embodiment, each group of the storage unit includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel. DDR can transmit data twice in one clock cycle. A controller for controlling the DDR is provided in the chip, which is used to control the data transmission and data storage of each storage unit.

The interface device is electrically connected with the artificial intelligence chip. The interface device is used to implement data transmission between the artificial intelligence chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface. For example, the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer. Preferably, when the PCIE3.0X 16 interface is used for transmission, the theoretical bandwidth can reach 16000MB/s. In another embodiment, the interface device may also be other interfaces. The present disclosure does not limit the specific manifestations of the other interfaces mentioned above, as long as the interface unit can realize the switching function. In addition, the calculation result of the artificial intelligence chip is still transmitted by the interface device back to an external device (such as a server).

The control device is electrically connected with the artificial intelligence chip. The control device is used to monitor the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (Micro Controller Unit, MCU). For example, the artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the artificial intelligence chip can be in different working states such as multi-load and light-load. The control device can realize the regulation and control of the working states of multiple processing chips, multiple processing and or multiple processing circuits in the artificial intelligence chip.

In a possible implementation manner, an electronic device is disclosed, which includes the aforementioned artificial intelligence chip. Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, cameras, cameras, projectors, watches, headsets , Mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment. The transportation means include airplanes, ships, and/or vehicles; the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; the medical equipment includes nuclear magnetic resonance, B-ultrasound and/or electrocardiograph.

The embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, and the computer program instructions implement the above-mentioned method when executed by a processor. The computer-readable storage medium may be a non-volatile computer-readable storage medium.

An embodiment of the present disclosure also provides an electronic device, including: a processor; a memory for storing executable instructions of the processor; wherein the processor is configured to call the instructions stored in the memory to execute the above method.

The electronic device can be provided as a terminal, server or other form of device.

FIG. 5 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and other terminals.

5, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, and a sensor component 814 , And communication component 816.

The processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the foregoing method. In addition, the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations in the electronic device 800. Examples of these data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, etc. The memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable and Programmable read only memory (EPROM), programmable read only memory (PROM), read only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.

The power supply component 806 provides power for various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), and when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal. The received audio signal may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module. The above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.

The sensor component 814 includes one or more sensors for providing the electronic device 800 with various aspects of state evaluation. For example, the sensor component 814 can detect the on/off status of the electronic device 800 and the relative positioning of the components. For example, the component is the display and the keypad of the electronic device 800. The sensor component 814 can also detect the electronic device 800 or the electronic device 800. The position of the component changes, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and the temperature change of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-available A programmable gate array (FPGA), controller, microcontroller, microprocessor, or other electronic components are implemented to implement the above methods.

In an exemplary embodiment, there is also provided a non-volatile computer-readable storage medium, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete the foregoing method.

FIG. 6 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. 6, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by the memory 1932, for storing instructions executable by the processing component 1922, such as application programs. The application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions to perform the above-described methods.

The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to the network, and an input output (I/O) interface 1958 . The electronic device 1900 can operate an operating system stored in the memory 1932, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM or the like.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the foregoing method.

In the above-mentioned embodiments, the description of each embodiment has its own focus. For parts that are not described in detail in an embodiment, reference may be made to related descriptions of other embodiments. The technical features of the above-mentioned embodiments can be combined arbitrarily. In order to make the description concise, all possible combinations of the various technical features in the above-mentioned embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, they should all be combined. It is considered as the range described in this specification.

The foregoing can be better understood according to the following clauses:

Clause A1, a data processing method, the method includes splitting the first convolution kernel according to the step size N to obtain multiple second convolution kernels; splitting the first input data according to the step size N to obtain A plurality of second input data corresponding to a plurality of the first convolution kernels; for any of the second input data, perform a winograd convolution operation on the second input data and the corresponding second convolution kernel , Obtain the convolution result corresponding to the second input data; determine that the sum of the convolution results corresponding to the plurality of second input data is the convolution result of the first convolution kernel and the first input data.

Clause A2, according to the method of clause A1, the corresponding relationship between the second input data and the second convolution kernel is specifically: the first element in the second input data is in the first input The position in the data is the same as the position of the first element in the second convolution kernel in the first convolution kernel.

Clause A3, according to the method of any one of clauses A1-A2, said splitting the first convolution kernel according to the step size N to obtain multiple second convolution kernels includes:

Clause A4, according to the method described in Clause A3, for the rows and columns of the first convolution kernel, the first convolution kernel is split with an interval of N-1 steps to obtain multiple second convolution kernels, include:

For the rows and columns of the first input data, the first input data is split with an interval of N-1 steps to obtain multiple second input data corresponding to the multiple first convolution kernels, including:

Clause A5, according to the method of any one of clauses A1 to A4, for any of the second input data, perform a winograd convolution operation on the second input data and the corresponding second convolution kernel, Obtaining the convolution result corresponding to the second input data includes:

Clause A6, according to the method described in clause A5, the disassembling the winograd positive transformation of the second input data into a summation operation, and performing calculation to obtain the winograd positive transformation result of the second input data includes:

Clause A7, according to the method described in clause A5, the disassembling the winograd positive transformation of the second convolution kernel into a summation operation, and performing calculation to obtain the winograd positive transformation result of the second convolution kernel includes:

Clause A8, according to the method of clause A6, the number of the plurality of first sub-tensors is the same as the number of non-zero elements of the second input data, and each of the plurality of first sub-tensors One element in the first sub-tensor is the same as the element at the corresponding position in the second input data, and the other elements are all 0.

Clause A9, according to the method of clause A7, the number of the plurality of second sub-tensors is the same as the number of elements of the second convolution kernel, and each of the plurality of second sub-tensors One element in the two sub-tensors is the same as the element at the corresponding position in the second convolution kernel, and the other elements are all zero.

Clause A10, according to the method described in clause A5, said disassembling the winograd inverse transform of the result of the alignment multiplication into a summation operation to obtain the difference between the second input data and the corresponding second convolution kernel Convolution results, including:

Clause A11, according to the method of clause A10, the number of the plurality of third sub-tensors is the same as the number of non-zero elements of the result of the alignment multiplication, and the number of the third sub-tensors in the plurality of One element in each third sub-tensor is the same as the element at the corresponding position in the alignment multiplication result, and the other elements are all zero.

Clause A12, a data processing device, including:

Clause A13, according to the device of clause A12, the correspondence between the second input data and the second convolution kernel is specifically: the first element in the second input data is in the first input The position in the data is the same as the position of the first element in the second convolution kernel in the first convolution kernel.

Clause A14, according to the device described in any one of clauses A12 to A13, the first splitting module is further used for:

The second splitting module is also used for:

Clause A15, according to the device of clause A14, the first splitting module is further used for:

The second splitting module is also used for:

Clause A16, the device according to any one of clauses A12 to A15, the convolution module is further used for:

Clause A17, the device according to clause A16, the convolution module is further used for:

Clause A18, the device according to clause A16, the convolution module is further used for:

Clause A19. The device according to clause A17, wherein the number of the plurality of first sub-tensors is the same as the number of non-zero elements of the second input data, and each of the plurality of first sub-tensors One element in the first sub-tensor is the same as the element at the corresponding position in the second input data, and the other elements are all 0.

Clause A20, the device according to clause A18, wherein the number of the plurality of second sub-tensors is the same as the number of elements of the second convolution kernel, and each of the plurality of second sub-tensors One element in the two sub-tensors is the same as the element at the corresponding position in the second convolution kernel, and the other elements are all zero.

Clause A21, the device according to clause A16, the convolution module is further used for:

Clause A22, according to the device of clause A21, the number of the plurality of third sub-tensors is the same as the number of non-zero elements of the alignment multiplication result, and of the plurality of third sub-tensors One element in each third sub-tensor is the same as the element at the corresponding position in the alignment multiplication result, and the other elements are all zero.

Clause A23, an artificial intelligence chip, the chip comprising the data processing device according to any one of clauses A12 to A22.

Clause A24, an electronic device including the artificial intelligence chip as described in Clause A23.

Clause A25, a board card comprising: a storage device, an interface device, a control device, and the artificial intelligence chip as described in clause A23;

The storage device is used to store data;

Clause A26, the board according to clause A25, the storage device includes: multiple groups of storage units, each group of the storage unit is connected to the artificial intelligence chip through a bus, and the storage unit is: DDR SDRAM;

The chip includes: a DDR controller, which is used to control the data transmission and data storage of each storage unit;

The interface device is: a standard PCIE interface.

Clause A27, an electronic device, characterized in that it includes:

processor;

A memory for storing processor executable instructions;

Wherein, the processor is configured to call instructions stored in the memory to execute the method described in any one of clauses A1 to A11.

Clause A28, a computer-readable storage medium with computer program instructions stored thereon, characterized in that, when the computer program instructions are executed by a processor, the method described in any one of clauses A1 to A11 is implemented.

The embodiments of the present disclosure are described in detail above, and specific examples are used in this article to illustrate the principles and implementations of the present disclosure. The descriptions of the above embodiments are only used to help understand the methods and core ideas of the present disclosure. At the same time, changes or modifications made by those skilled in the art based on the ideas of the present disclosure, the specific embodiments and the scope of application of the present disclosure, are all within the protection scope of the present disclosure. In summary, the content of this specification should not be construed as a limitation of this disclosure.

Claims

A data processing method, characterized in that it comprises:

Split the first convolution kernel according to the step size N to obtain multiple second convolution kernels;

Splitting the first input data according to the step size N to obtain a plurality of second input data corresponding to the plurality of first convolution kernels;

For any of the second input data, perform a winograd convolution operation on the second input data and the corresponding second convolution kernel to obtain a convolution result corresponding to the second input data;

It is determined that the sum of the convolution results corresponding to the plurality of second input data is the convolution result of the first convolution kernel and the first input data.
The method according to claim 1, wherein the corresponding relationship between the second input data and the second convolution kernel is specifically: the first element in the second input data is in the first The position of an input data is the same as the position of the first element in the second convolution kernel in the first convolution kernel.
The method according to claim 1 or 2, wherein the splitting the first convolution kernel according to the step size N to obtain a plurality of second convolution kernels comprises:

For the rows and columns of the first convolution kernel, split the first convolution kernel with an interval of N-1 steps to obtain multiple second convolution kernels;

The splitting the first input data according to the step size N to obtain multiple second input data corresponding to the multiple first convolution kernels includes:

For the rows and columns of the first input data, the first input data is split with an interval of N-1 steps to obtain multiple second input data corresponding to the multiple first convolution kernels.
The method according to claim 3, characterized in that, for the rows and columns of the first convolution kernel, the first convolution kernel is split with an interval of N-1 steps to obtain a plurality of second convolution kernels. Nuclear, including:

Traverse the elements in the first convolution kernel, repeat every interval N-1 rows to determine a row target row, for the target row, get one element every interval N-1 column, and the acquired multiple elements form a second convolution The process of the kernel until the elements in the first convolution kernel are traversed;

For the rows and columns of the first input data, the first input data is split with an interval of N-1 steps to obtain multiple second input data corresponding to the multiple first convolution kernels, including:

Traverse the elements in the first input data, repeat every interval N-1 rows to determine a row target row, for the target row, every interval N-1 column to obtain an element, the obtained multiple elements form a second input data The process until the elements in the first input data are traversed.
The method according to any one of claims 1 to 4, wherein for any of the second input data, perform winograd convolution on the second input data and the corresponding second convolution kernel The operation to obtain the convolution result corresponding to the second input data includes:

Disassemble the winograd forward transformation of the second input data into a summation operation, and perform calculation to obtain the winograd forward transformation result of the second input data;

Disassemble the winograd positive transformation of the second convolution kernel into a summation operation, and perform calculation to obtain the winograd positive transformation result of the second convolution kernel;

Performing an alignment multiplication operation of the winograd forward transformation result of the second input data and the winograd forward transformation result of the second convolution kernel to obtain the alignment multiplication result;

The winograd inverse transform of the alignment multiplication result is disassembled into a summation operation to obtain a convolution result of the second input data and the corresponding second convolution kernel.
The method according to claim 5, wherein the disassembling the winograd positive transformation of the second input data into a summation operation, and performing calculation to obtain the winograd positive transformation result of the second input data comprises:

The second input data is disassembled into a plurality of first sub-tensors, and winograd positive transformation is performed on the plurality of first sub-tensors and summed to obtain a winograd positive transformation result of the second input data.
The method according to claim 5, wherein the disassembling the winograd positive transformation of the second convolution kernel into a summation operation, and performing calculation to obtain the winograd positive transformation result of the second convolution kernel comprises:

The second convolution kernel is disassembled into a plurality of second sub-tensors, and winograd positive transformation is performed on the plurality of second sub-tensors and summed to obtain a winograd positive transformation result of the second convolution kernel.
The method according to claim 6, wherein the number of the plurality of first sub-tensors is the same as the number of non-zero elements of the second input data, and the number of the first sub-tensors One element in each first subtensor of is the same as the element at the corresponding position in the second input data, and the other elements are all 0.
The method according to claim 7, wherein the number of the plurality of second sub-tensors is the same as the number of elements of the second convolution kernel, and each of the plurality of second sub-tensors One element in the second sub-tensor is the same as the element at the corresponding position in the second convolution kernel, and the other elements are all zero.
The method according to claim 5, wherein the inverse winograd transform of the result of the alignment multiplication is disassembled into a summation operation to obtain the second input data and the corresponding second convolution The convolution results of the kernel, including:

The result of the alignment multiplication is disassembled into a plurality of third sub-tensors, and winograd inverse transformation is performed on the plurality of third sub-tensors and summed to obtain the second input data and the corresponding second The convolution result of the convolution kernel.
The method according to claim 10, wherein the number of the plurality of third sub-tensors is the same as the number of non-zero elements of the alignment multiplication result, and the plurality of third sub-tensors In each third sub-tensor in the tensor, one element is the same as the element at the corresponding position in the alignment multiplication result, and the other elements are all zero.
A data processing device, characterized in that it comprises:

The first splitting module is used to split the first convolution kernel according to the step size N to obtain multiple second convolution kernels;

The second splitting module is configured to split the first input data according to the step size N to obtain multiple second input data corresponding to the multiple first convolution kernels;

The convolution module is configured to perform a winograd convolution operation on the second input data and the corresponding second convolution kernel for any of the second input data to obtain a convolution result corresponding to the second input data ；

The determining module is configured to determine that the sum of the convolution results corresponding to the plurality of second input data is the convolution result of the first convolution kernel and the first input data.
An artificial intelligence chip, characterized in that the chip includes the data processing device according to claim 12.
An electronic device, wherein the electronic device comprises the artificial intelligence chip according to claim 13.
A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the artificial intelligence chip according to claim 13;

Wherein, the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;

The storage device is used to store data;

The interface device is used to implement data transmission between the artificial intelligence chip and external equipment;

The control device is used to monitor the state of the artificial intelligence chip.
The board card according to claim 15, characterized in that,

The storage device includes: multiple groups of storage units, each group of the storage unit is connected to the artificial intelligence chip through a bus, and the storage unit is: DDR SDRAM;

The chip includes: a DDR controller, which is used to control the data transmission and data storage of each storage unit;

The interface device is: a standard PCIE interface.
An electronic device, characterized in that it comprises:

processor;

A memory for storing processor executable instructions;

Wherein, the processor is configured to call instructions stored in the memory to execute the method according to any one of claims 1 to 11.
A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions implement the method according to any one of claims 1 to 11 when the computer program instructions are executed by a processor.