WO2022246617A1

WO2022246617A1 - Convolution operation method and apparatus, image processing method and apparatus, and storage medium

Info

Publication number: WO2022246617A1
Application number: PCT/CN2021/095619
Authority: WO
Inventors: 李鹏; 罗岚; 祝武勇; 高岳
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2022-12-01

Abstract

A convolution operation method and apparatus, an image processing method and apparatus, and a storage medium. The convolution operation apparatus comprises a data acquisition node and a MAC array; the data acquisition node is used for acquiring data to be processed, and determining, in a clock cycle on the basis of the data to be processed, node input data corresponding to each of a plurality of MAC nodes in the multiplier MAC array; the MAC array is connected to the data acquisition node and used for acquiring node input data by means of the data acquisition node, and performing convolution calculation on the corresponding node input data by means of the MAC nodes to obtain node calculation results, wherein the node calculation results are used for determining a convolution operation result corresponding to the data to be processed. According to the method, redundancy in adder resources can be eliminated, and it is beneficial to reducing the chip area, reducing the power consumption, and reducing the data processing cost; in addition, all nodes can be operated in a short time, and it is beneficial to improving the data processing performance.

Description

Convolution operation method, image processing method, device and storage medium

Technical field

The embodiments of the present invention relate to the technical field of data processing, and in particular, to a convolution operation method, an image processing method, a device, and a storage medium.

Background

With the rapid development of science and technology, artificial intelligence technology is being applied ever more widely. Whether it is applied on the device side or the cloud side, there are strict requirements on power consumption and cost. For example, end-side consumers are sensitive to price, and the higher the power consumption of a device, the shorter its standby time. Users on the cloud side have large computing needs and deploy chips in large numbers: if the power consumption of a single chip is high, the cumulative power consumption is large, and if the cost of a single chip is high, the cumulative cost is high as well.

Summary of the invention

Embodiments of the present invention provide a convolution operation method, an image processing method, a device, and a storage medium, which can improve data processing efficiency and reduce data processing power consumption and cost.

The first aspect of the present invention is to provide a convolution computing device, comprising:

A data acquisition node, configured to acquire data to be processed, and determine within a clock cycle based on the data to be processed, node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array;

The MAC array is connected to the data acquisition node, and is used to acquire the input data of the node through the data acquisition node, and perform convolution calculation on the input data of the node to obtain a calculation result of the node, and the calculation result of the node is used for determining the convolution operation result corresponding to the data to be processed.

The second aspect of the present invention is to provide a convolution operation method, including:

Obtaining data to be processed, and determining node input data corresponding to each MAC node in the plurality of MAC nodes in the multiplier MAC array within a clock cycle based on the data to be processed;

The MAC nodes in the MAC array are used to perform convolution calculation on the input data of the nodes to obtain node calculation results, and the node calculation results are used to determine the convolution operation results corresponding to the data to be processed.

The third aspect of the present invention is to provide an image processing method based on a multiplier MAC array, comprising:

Acquiring an image to be processed, and determining node input data corresponding to each MAC node among the plurality of MAC nodes in the MAC array within a clock cycle based on the image to be processed;

Using the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain node calculation results;

A convolution processing result corresponding to the image to be processed is determined based on the node calculation result.

A fourth aspect of the present invention is to provide a convolution computing device, comprising:

memory for storing computer programs;

a processor for running a computer program stored in said memory to:

A fifth aspect of the present invention is to provide an image processing device based on a multiplier MAC array, including:

memory for storing computer programs;

a processor for running a computer program stored in said memory to:

The sixth aspect of the present invention is to provide a computer-readable storage medium. The computer-readable storage medium stores program instructions, and the program instructions are used to implement the convolution operation method described in the second aspect above.

A seventh aspect of the present invention is to provide a computer-readable storage medium. The computer-readable storage medium stores program instructions, and the program instructions are used to implement the image processing method based on the multiplier MAC array described in the third aspect above.

In the technical solution provided by the embodiments of the present invention, the data acquisition node determines, within one clock cycle, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. After the node input data is acquired, the MAC nodes included in the MAC array can perform convolution calculation on their corresponding node input data simultaneously within one clock cycle to obtain the node calculation results. This not only improves data processing efficiency but also reduces data processing power consumption and cost. It therefore effectively solves the problems of long processing time and low processing performance that arise when data is input in the form of triangle input, thereby improving the quality and efficiency of the convolution operation.

Description of drawings

The drawings described here are used to provide a further understanding of the application and constitute a part of the application. The schematic embodiments of the application and their descriptions are used to explain the application and do not constitute an improper limitation of the application. In the drawings:

Fig. 1 is a schematic diagram of the principle of convolution calculation provided by an embodiment in the related art;

Fig. 2 is a schematic diagram of a systolic array for realizing convolution calculation provided by an embodiment in the related art;

FIG. 3 is a schematic diagram of sequentially inputting feature maps for different rows of a systolic array provided by an embodiment in the related art;

FIG. 4 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of inputting node input data to a MAC array provided by an embodiment of the present invention;

FIG. 6 is a first schematic diagram of cascading a node unit group and an adder provided by an embodiment of the present invention;

FIG. 7 is a second schematic diagram of cascading node unit groups and adders provided by an embodiment of the present invention;

FIG. 8 is a third schematic diagram of cascading a node unit group and an adder provided by an embodiment of the present invention;

FIG. 9 is a schematic diagram of a clock cycle that enables a MAC node in a MAC array to operate provided in the related art;

FIG. 10 is a schematic diagram of a clock cycle for enabling a MAC node in a MAC array to operate provided in an embodiment of the present invention;

FIG. 11 is a schematic flowchart of a convolution operation method provided by an embodiment of the present invention;

FIG. 12 is a schematic flowchart of another convolution operation method provided by an embodiment of the present invention;

FIG. 13 is a schematic flowchart of using the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain the calculation results of the nodes provided by an embodiment of the present invention;

FIG. 14 is a schematic flowchart of an image processing method based on a multiplier MAC array provided by an embodiment of the present invention;

FIG. 15 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention;

FIG. 16 is a schematic structural diagram of an image processing device based on a multiplier MAC array provided by an embodiment of the present invention.

Detailed description

In order to make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field of the invention. The terms used herein in the description of the present invention are for the purpose of describing specific embodiments only, and are not intended to limit the present invention.

In order to facilitate the understanding of the specific implementation process and technical effects of the technical solution in this embodiment, first, the principle of convolution calculation is explained in conjunction with Figure 1:

After a feature map is obtained, convolution calculation can be performed on it to obtain a convolution calculation result. When the feature map is convolved, each point of the output feature map is calculated from the convolution kernel and the corresponding points of the input feature map. If there are multiple input feature maps, the calculation results for the same position of each feature map are accumulated to obtain the result for that point in the output feature map. As shown in Figure 1, when performing convolution calculation on an input feature map, two points of the output feature map can be calculated.
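The accumulation across input feature maps can be illustrated with a short sketch. It assumes stride 1, no padding, and no kernel flip (the usual convention in convolutional neural networks); the function name `conv2d_point` and the sample maps and kernels are hypothetical and are not taken from Figure 1.

```python
def conv2d_point(ifms, kernels, row, col):
    """Compute one output point by multiply-accumulating over every input map."""
    acc = 0
    for ifm, kernel in zip(ifms, kernels):
        for kr, kernel_row in enumerate(kernel):
            for kc, weight in enumerate(kernel_row):
                # Multiply the kernel weight by the corresponding input point.
                acc += weight * ifm[row + kr][col + kc]
    return acc


# Two hypothetical 3x3 input feature maps, each with its own 2x2 kernel.
ifms = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernels = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]

# Two neighbouring points of the output feature map.
out_00 = conv2d_point(ifms, kernels, 0, 0)
out_01 = conv2d_point(ifms, kernels, 0, 1)
print(out_00, out_01)
```

Each output point is the sum, over all input maps, of the per-map multiply-accumulate results at the same position.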

In a specific implementation, when performing convolution calculation on the input feature map, a convolutional neural network that implements the convolution calculation can be used to analyze and process the input feature map. The convolutional neural network is briefly introduced below. The convolutional neural network can include:

The fully connected layer (FC layer) is used to implement the multiply-accumulate operation, wherein, for each output point, each input point is used.

The first type of fully connected layer (1stFC layer) is used to implement the multiply-accumulate operation, where each output point uses every point of every input map.

The other fully connected layer (elsFC layer) is used to implement the multiply-accumulate operation, where each output point uses each point in the input one-dimensional data.

For the fully connected layer, in order to improve computing efficiency, the layer can perform batch processing so that the weight values in the computing unit are reused and multiple pieces of data are processed at the same time. Specifically, for image input, each recognition of one image is one batch process. Batch processing reduces data accesses, lowers power consumption, saves bandwidth, and allows some data to be reused.

In some scenarios, in order to simplify the design, the batch processing of the 1stFC layer is converted into batch processing of the elsFC layer: the map-like input feature map of the 1stFC layer is expanded in advance into a continuous point-like feature map, so as to meet the elsFC layer's requirements for the input feature map.
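The expansion of map-like input into point-like input can be sketched as follows. This is a minimal illustration, assuming row-major flattening and weights flattened in the same order; the helper names `flatten_maps` and `fc_output` and the sample values are hypothetical.

```python
def flatten_maps(maps):
    """Expand a list of 2-D maps into one continuous 1-D list (row-major)."""
    return [value for one_map in maps for row in one_map for value in row]


def fc_output(points, weights):
    """elsFC-style output point: multiply-accumulate over 1-D input data."""
    return sum(p * w for p, w in zip(points, weights))


# Hypothetical 1stFC input: two 2x2 feature maps with matching weight maps.
ifms = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
weight_maps = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]

# 1stFC-style output point: every point of every input map is used.
first_fc = sum(
    w * v
    for ifm, wmap in zip(ifms, weight_maps)
    for row, wrow in zip(ifm, wmap)
    for v, w in zip(row, wrow)
)

# Same output point computed elsFC-style on the flattened data.
els_fc = fc_output(flatten_maps(ifms), flatten_maps(weight_maps))
print(first_fc, els_fc)
```

Because the same points and weights are paired in both forms, the two results agree, which is why a single elsFC-style datapath can serve both layer types.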

In a specific implementation, a systolic array can be used to implement the convolution calculation. As shown in Figure 2, the systolic array includes:

Input data loading node (IFM_LOAD), used to load input data to be processed, for example: feature map IFM;

Weight loading node (WEIGHT_LOAD), used to load weight;

A number of multiply-accumulate operation nodes (MAC nodes), which form the systolic calculation array. In this systolic array, the feature map IFM is passed horizontally through the array in a systolic, streaming manner; that is, the IFM is streamed in and the MAC nodes in each row reuse the IFM one after another. The weight coefficients are passed vertically through the array in the same systolic manner.

When the systolic array is used to analyze and process a feature map, as shown in Figure 3, N-0 is the 0th ifm data of the Nth row of a picture and N-1 is the 1st ifm data of the Nth row of the picture. The feature map IFM is input to successive rows of the systolic array at intervals of one time period. The MAC node in one row passes its calculation result to the MAC node in the next row, where the accumulation is completed, and the accumulated result is then passed on to the MAC node in the row after that.

For example, if the input feature map data ifm and the weight information are both signed int8, each MAC node completes one multiplication (int8*int8) and one addition. To also support int16*int16 convolution, each int16 is split into (mas_8b, lsb_8b), so that the multiplication int16*int16 becomes (mas_8b, lsb_8b)*(mas_8b, lsb_8b). mas_8b holds the high bits of the int16 and carries the sign bit; extending its sign bit to the 9th bit turns it into an int9. lsb_8b holds the low bits of the int16 and has no sign bit, so a 0 bit is added in the highest position to turn it into an int9. The multiplier therefore has to change from an int8*int8 multiplier to an int9*int9 multiplier. When lsb_int9*lsb_int9 is calculated as part of int16*int16, the product needs 17 bits, so the multiplier specification is unified as int9*int9=int17.
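The int16 split can be checked with a small sketch. It assumes two's-complement int16 operands, with the high byte taken by an arithmetic shift (playing the role of mas_8b) and the low byte taken by masking (playing the role of lsb_8b); the function names are hypothetical, and the recombination by shifts is the standard schoolbook identity rather than something the text spells out.

```python
def split_int16(x):
    """Split a signed 16-bit value into (signed high byte, unsigned low byte)."""
    assert -(1 << 15) <= x < (1 << 15)
    hi = x >> 8    # arithmetic shift keeps the sign: -128..127
    lo = x & 0xFF  # 0..255, no sign bit
    return hi, lo


def fits_signed(value, bits):
    """True if value is representable as a signed integer of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def mul_int16_via_int9(a, b):
    """Multiply two int16 values using four int9*int9 partial products."""
    a_hi, a_lo = split_int16(a)
    b_hi, b_lo = split_int16(b)
    # Both halves fit in a signed 9-bit operand.
    assert all(fits_signed(v, 9) for v in (a_hi, a_lo, b_hi, b_lo))
    hh, hl, lh, ll = a_hi * b_hi, a_hi * b_lo, a_lo * b_hi, a_lo * b_lo
    # Every partial product fits in a signed 17-bit result.
    assert all(fits_signed(p, 17) for p in (hh, hl, lh, ll))
    # a*b = hh*2^16 + (hl + lh)*2^8 + ll
    return (hh << 16) + ((hl + lh) << 8) + ll


print(mul_int16_via_int9(-12345, 32767) == -12345 * 32767)
```

The largest partial product is 255*255 = 65025, which needs 16 magnitude bits plus a sign bit, consistent with the 17-bit product width stated above.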

In this way, the int17 product is added to the calculation result passed down by the MAC node in the previous row. This running value, also called the partial sum psum_out, is an intermediate result, and accumulating all intermediate results gives the final result. For convenience of description, the analysis and processing of a feature map by a 64x64 MAC array is taken as an example. In this case the MAC array contains 64 rows and 64 columns of MAC nodes, and the adder resources for each row of MAC nodes are as follows:

Row 0: there is no input partial sum psum_in and no adder, and the partial sum result psum_out is 17b;

Row 1: the input partial sum psum_in is 17b, the adder resource is 17b+17b=18b, and the partial sum result psum_out is 18b;

Row 2: the input partial sum psum_in is 18b, the adder resource is 17b+18b=19b, and the partial sum result psum_out is 19b;

Row 3: the input partial sum psum_in is 19b, the adder resource is 17b+19b=19b, and the partial sum result psum_out is 19b;

Row 4: the input partial sum psum_in is 19b, the adder resource is 17b+19b=20b, and the partial sum result psum_out is 20b;

Row 5: the input partial sum psum_in is 20b, the adder resource is 17b+20b=20b, and the partial sum result psum_out is 20b;

Row 6: the input partial sum psum_in is 20b, the adder resource is 17b+20b=20b, and the partial sum result psum_out is 20b;

Row 7: the input partial sum psum_in is 20b, the adder resource is 17b+20b=20b, and the partial sum result psum_out is 20b;

Row 8: the input partial sum psum_in is 20b, the adder resource is 17b+20b=21b, and the partial sum result psum_out is 21b;

Row 9: the input partial sum psum_in is 21b, the adder resource is 17b+21b=21b, and the partial sum result psum_out is 21b;

…

Row 15: the input partial sum psum_in is 21b, the adder resource is 17b+21b=21b, and the partial sum result psum_out is 21b;

Row 16: the input partial sum psum_in is 21b, the adder resource is 17b+21b=22b, and the partial sum result psum_out is 22b;

Row 17: the input partial sum psum_in is 22b, the adder resource is 17b+22b=22b, and the partial sum result psum_out is 22b;

…

Row 31: the input partial sum psum_in is 22b, the adder resource is 17b+22b=22b, and the partial sum result psum_out is 22b;

Row 32: the input partial sum psum_in is 22b, the adder resource is 17b+22b=23b, and the partial sum result psum_out is 23b;

Row 33: the input partial sum psum_in is 23b, the adder resource is 17b+23b=23b, and the partial sum result psum_out is 23b;

…

Row 63: the input partial sum psum_in is 23b, the adder resource is 17b+23b=23b, and the partial sum result psum_out is 23b.

In the above implementation, each MAC node includes an adder, and during data operations there is a large amount of redundancy in the adder resources of the MAC nodes. For example, in row 2 the adder resource is 17b+18b=19b. Only 18b+18b would fully use a 19b adder, yet 19b must be used here because one addend is 18b, so the adder resource is redundant. In row 3 the adder resource is 17b+19b=19b; the result does not need 20b, so the adder resource is not redundant here. Similarly, in row 4 the adder resource is 17b+19b=20b. Only 19b+19b would fully use a 20b adder, yet 20b must be used here because one addend is 19b, so the adder resource is redundant. The same redundancy exists in the adders of the MAC nodes in rows 5 and 6, while in row 7 the adder resource is not redundant. The same pattern holds for rows 8 to 15, rows 16 to 31, and rows 32 to 63.
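The per-row widths listed above follow a simple rule, sketched below. It assumes that row n (counting from 0) has accumulated n + 1 products of 17 bits each, so its partial sum needs 17 + ceil(log2(n + 1)) bits. The redundancy test (an adder is fully used only when n + 1 is a power of two) is one reading of the examples given for rows 2 to 7, not a definition stated in the text; the function names are hypothetical.

```python
PRODUCT_BITS = 17  # width of one int9*int9 product, as stated in the text


def psum_out_bits(row):
    """Bits needed after accumulating (row + 1) products of PRODUCT_BITS each."""
    # For count = row + 1 >= 1, ceil(log2(count)) equals (count - 1).bit_length().
    return PRODUCT_BITS + row.bit_length()


def adder_fully_used(row):
    """True when row + 1 is a power of two, so the adder range is exactly filled."""
    count = row + 1
    return count & (count - 1) == 0


widths = [psum_out_bits(r) for r in range(64)]
redundant_rows = [r for r in range(1, 64) if not adder_fully_used(r)]

print(widths[:10])         # widths for rows 0..9
print(redundant_rows[:5])  # first rows whose adder is wider than strictly needed
print(len(redundant_rows), "of 63 adders are redundant under this reading")
```

Under this reading only rows 1, 3, 7, 15, 31 and 63 use their adders fully, which matches the examples given for rows 3 and 7.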

To sum up, it can be seen that the implementation of convolution calculation in related technologies has the following defects:

(1) The MAC nodes in the MAC array include adders, and the resources of the adders are redundant, which increases the area of the chip, increases the power consumption, and increases the cost.

(2) When using the MAC array to analyze and process the input data, the input data (IFM) is a triangle input, so if all the MAC nodes need to be used, it will take a long time and reduce the data processing performance.

(3) The input data (IFM) is a triangle input, so the input data loading node has to take the input cycle of each piece of input data into account, which makes the design logic of the input data loading node more complicated and also increases area and power consumption.

(4) Each column of MAC nodes operates systolically, so some working enable signals, for example the weight loading signal, need a register (a control-signal register) in every MAC node in the column in order to pass the enable signal along systolically. This consumes more registers and takes up more resources.

In order to solve the above technical problems, this embodiment provides a convolution operation method, an image processing method, a device, and a storage medium. The convolution operation device includes a data acquisition node and a MAC array connected to the data acquisition node. The data acquisition node is used to acquire the data to be processed and to determine, within one clock cycle and based on the data to be processed, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. The MAC array is used to obtain the node input data through the data acquisition node and to perform convolution calculation on the node input data to obtain node calculation results, which are used to determine the convolution operation result corresponding to the data to be processed.

In the technical solution provided by this embodiment, the data acquisition node determines, within one clock cycle, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. After the node input data is obtained, the MAC nodes included in the MAC array can simultaneously perform convolution calculation on their corresponding node input data within one clock cycle to obtain node calculation results. This not only improves data processing efficiency but also reduces data processing power consumption and cost. The problems of long processing time and low processing performance that arise when data is input in the form of triangle input are therefore effectively solved, thereby improving the quality and efficiency of the convolution operation method.

Some implementations of a convolution operation method, image processing method, device, and storage medium in the present invention will be described in detail below with reference to the accompanying drawings. Under the condition that there is no conflict between the various embodiments, the following embodiments and the features in the embodiments can be combined with each other.

Fig. 4 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention, and Fig. 5 is a schematic diagram of inputting node input data to a MAC array provided by an embodiment of the present invention. Referring to Fig. 4 and Fig. 5, this embodiment provides a convolution operation device that can perform convolution operation processing on the data to be processed and can achieve the following: (1) it can eliminate or reduce the resource redundancy in the adders, which reduces the chip area, the power consumption, and the data processing cost; (2) the input data (IFM) of the convolution operation device is input within one clock cycle, that is, in a rectangular manner, so that all MAC nodes can be put to use within a shorter time, which shortens the time consumed and improves processing performance; (3) because the input data (IFM) is input in a rectangular manner, the design logic of the data acquisition node that feeds the MAC array is simple, which also brings area and power consumption benefits; (4) each column of MAC nodes in the MAC array works synchronously, so some working enable signals, for example the weight loading signal, need only one register to drive all the MAC nodes in the column, which reduces the number of registers and the resources occupied.

Specifically, the convolution operation device may include a data acquisition node and a MAC array connected to the data acquisition node. The data acquisition node is used to acquire the data to be processed and to determine, within one clock cycle and based on the data to be processed, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. The MAC array is used to obtain the node input data through the data acquisition node and to perform convolution calculation on the node input data to obtain node calculation results, and the node calculation results are used to determine the convolution operation result corresponding to the data to be processed.

The data to be processed refers to data that requires convolution calculation. In different application scenarios, the data to be processed can be of different types, for example text data, image data, or video data. When the user needs convolution calculation on the data to be processed, the data can be obtained through the data acquisition node. Specifically, the data to be processed can be stored in a preset area, and the data acquisition node obtains it by accessing the preset area; alternatively, the data to be processed is stored in a third device, and the data acquisition node communicates with the third device to obtain the data to be processed from it. Of course, the data acquisition node can also acquire the data to be processed in other ways, as long as the accuracy and reliability of the acquisition can be guaranteed; details are not repeated here.

After the data acquisition node acquires the data to be processed, the data can be sent to the MAC array so that the MAC array can analyze and process it. The MAC array is not limited to a two-dimensional MAC array; it can also be, for example, a three-dimensional MAC array. For ease of understanding, a two-dimensional MAC array is used as an example. Since the MAC array includes multiple rows and multiple columns of MAC nodes, when the data acquisition node communicates with the MAC array, it communicates with the MAC nodes in the same column. To enable the MAC nodes in the same column to analyze and process the data to be processed within one clock cycle, the data acquisition node, after acquiring the data to be processed, analyzes and processes it to determine, within one clock cycle, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array.

It can be understood that, for each row and each column of nodes in the MAC array, the node input data corresponding to the first MAC node can be part of the data to be processed. The node input data corresponding to a MAC node other than the first can be determined based on the output result of the preceding MAC node, and that output result is in turn determined based on the node input data corresponding to the first MAC node. Therefore, after the data to be processed is obtained, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array can be determined based on the data to be processed.

For example, as shown in FIG. 5, for image data to be processed, node input data may be determined based on the image data. The node input data may include data 0-1 corresponding to the first row of MAC nodes, data 1-1 corresponding to the second row of MAC nodes, ..., and data N-1 corresponding to the Nth row of MAC nodes, where data N-N represents the Nth data of the Nth row in the image data. The node input data can then be transmitted to the MAC array within one clock cycle for analysis and processing.

After the multiple MAC nodes in the MAC array obtain the node input data, each MAC node can perform convolution calculation on its node input data to obtain a node calculation result. It can be understood that a node calculation result is an intermediate result of the convolution calculation on the data to be processed. After the node calculation results of the MAC nodes in the MAC array are obtained, the convolution operation result corresponding to the data to be processed can be determined based on all the node calculation results, which effectively achieves a stable convolution calculation on the data to be processed.

In some examples, the convolution operation device in this embodiment may further include a weight cache node connected to the MAC array. The weight cache node is used to determine the corresponding weight coefficients, which can then be transmitted to the MAC array, so that the MAC array can use the MAC nodes and the weight coefficients to perform convolution calculation on the node input data and obtain the node calculation results.

In some other examples, after the node calculation results are obtained, the convolution operation result can be determined based on them. Specifically, the convolution operation device in this embodiment can also include a data processing node connected to the MAC array. The data processing node is used to determine the convolution operation result corresponding to the data to be processed based on the node calculation results, thereby effectively ensuring the accuracy and reliability of the convolution operation result.

In the convolution operation device provided in this embodiment, the data acquisition node determines, within one clock cycle, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. After the node input data is acquired, the MAC nodes included in the MAC array can simultaneously perform convolution calculation on their corresponding node input data within one clock cycle to obtain node calculation results, and the node calculation results can be used to determine the convolution operation result corresponding to the data to be processed. In this way, the data to be processed can be transmitted to the MAC array in the form of rectangular input within one clock cycle, so that the MAC array obtains the node input data to be processed at the initial moment and within the same clock cycle. This also improves data processing efficiency and reduces the power consumption and cost of data processing, which effectively solves the problems of long processing time and low processing performance that arise when data is input in the form of triangle input, thereby improving the practicality of the convolution operation device.
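The difference between triangle input and rectangular input can be made concrete with a small scheduling sketch. It assumes that, for triangle input, each row starts one cycle after the row before it (the one-period interval described for the systolic array), and that, for rectangular input, every row receives its k-th data item in the same cycle. The function name `input_schedule` is hypothetical; the labels reuse the "row-index" notation of the figures.

```python
def input_schedule(rows, items_per_row, skewed):
    """Map each cycle to the data items fed to the array in that cycle.

    skewed=True  models triangle input: row r starts at cycle r.
    skewed=False models rectangular input: every row starts at cycle 0.
    """
    schedule = {}
    for r in range(rows):
        start = r if skewed else 0
        for k in range(items_per_row):
            schedule.setdefault(start + k, []).append(f"{r}-{k}")
    return schedule


triangle = input_schedule(rows=3, items_per_row=2, skewed=True)
rectangle = input_schedule(rows=3, items_per_row=2, skewed=False)

for name, sched in (("triangle", triangle), ("rectangle", rectangle)):
    print(name)
    for cycle in sorted(sched):
        print(f"  cycle {cycle}: {sched[cycle]}")
```

With R rows and T items per row, the skewed schedule spans (R - 1) + T cycles and no cycle feeds all rows unless T >= R, while the rectangular schedule spans T cycles and feeds every row in each of them.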

In some examples, as shown in FIG. 6, in order to further address the redundancy of adder resources in the MAC nodes of the prior art, when the MAC array is used to analyze and process the node input data, each column of the MAC array is divided into a plurality of node unit groups, and each node unit group includes a plurality of MAC nodes. Each node unit group corresponds to at least one adder, and the adder is used to accumulate the calculation results of the MAC nodes in the node unit group, so that the node processing results corresponding to each column in the MAC array can be obtained.

When each column of MAC nodes in the MAC array is divided into multiple node unit groups, division parameters for the node unit groups can be obtained. The division parameters can include at least one of the following: the number of node unit groups, the number of MAC nodes included in each node unit group, and so on. After the division parameters are obtained, each column in the MAC array can be divided into multiple node unit groups based on them. In some examples, in order to ensure the quality and efficiency of the convolution calculation, the node unit groups corresponding to the same column in the MAC array each include the same number of MAC nodes.

For example, when the MAC array is a 64*64 array, each column includes 64 MAC nodes, and the 64 MAC nodes can be evenly divided into 8 node unit groups, in which case each node unit group includes 8 MAC nodes. Alternatively, the 64 MAC nodes may be evenly divided into 4 node unit groups, in which case each node unit group includes 16 MAC nodes. Alternatively, the 64 MAC nodes can be evenly divided into 16 node unit groups, in which case each node unit group includes 4 MAC nodes. In this way, when the node unit groups are used to analyze and process the node input data, the quality and efficiency of data analysis and processing can be effectively ensured.
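
The even division described above can be illustrated with a minimal Python sketch. The function name `partition_column` and the representation of MAC nodes as integer indices are assumptions made for illustration only and do not appear in the original disclosure.

```python
def partition_column(num_nodes, num_groups):
    """Split the node indices 0..num_nodes-1 of one column into equal groups.

    Hypothetical helper: num_groups plays the role of a division parameter.
    """
    if num_groups <= 0 or num_nodes % num_groups != 0:
        raise ValueError("num_nodes must be evenly divisible by num_groups")
    size = num_nodes // num_groups
    # Group g holds the consecutive node indices [g*size, (g+1)*size).
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


if __name__ == "__main__":
    # The three divisions of a 64-node column mentioned in the text.
    for groups in (8, 4, 16):
        parts = partition_column(64, groups)
        print(groups, "node unit groups of", len(parts[0]), "MAC nodes each")
```

Grouping consecutive indices is only one possible choice; the text does not specify which MAC nodes of a column belong to which node unit group.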

In some other examples, in order to improve the flexibility and reliability of performing convolution calculation on data, the numbers of node unit groups corresponding to different columns in the MAC array may be the same or different. For example, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. When the node unit group division operation is performed on each column in the MAC array, the 64 MAC nodes in the first column can be evenly divided into 8 node unit groups, each including 8 MAC nodes, while the 64 MAC nodes in the second column are evenly divided into 4 node unit groups, each including 16 MAC nodes; in this case, the numbers of node unit groups corresponding to different columns in the MAC array are different. Alternatively, the 64 MAC nodes in the second column can also be evenly divided into 8 node unit groups, each including 8 MAC nodes; in this case, the numbers of node unit groups corresponding to different columns in the MAC array are the same. In this way, when the node unit groups are used to analyze and process the node input data, the flexibility of data analysis and processing can be effectively guaranteed.

It should be noted that the MAC nodes in the MAC array in this embodiment do not include an adder. Therefore, in order to perform convolution operation processing on the data, each node unit group divided in the MAC array corresponds to at least one adder, and the adder is used to accumulate the calculation results of the MAC nodes in the node unit group to obtain the node processing results corresponding to each column in the MAC array. In some examples, the at least one adder may include a first-stage adder connected to the node unit group and a second-stage adder connected to the first-stage adder, where the data resources supportable by the second-stage adder are larger than the data resources supportable by the first-stage adder. In the process of data processing, one control signal can be used to uniformly control the adders corresponding to each column of MAC nodes.

The first-stage adder is directly connected to the node unit group and can accumulate the output results of the MAC nodes included in the node unit group. The second-stage adder is connected to the first-stage adder and is used to accumulate the output results of the first-stage adder. It can be understood that the number of second-stage adders can be one or more, and that the data resources supportable by the second-stage adder are larger than those supportable by the first-stage adder.

In some examples, since the adder is used to accumulate the output results of the MAC nodes in the node unit group, the data resources supportable by the adder are related to the number of MAC nodes included in the node unit group and to the data bit width for convolution calculation in the MAC nodes. Specifically, when the number of MAC nodes included in the node unit group is 2 to the Nth power, the data resource supportable by the adder is the sum of the data bit width for convolution calculation in the MAC nodes and N.

For example, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. When the node unit group division operation is performed on each column in the MAC array, the 64 MAC nodes in the first column can be evenly divided into 8 node unit groups, each including 8 (2³) MAC nodes. At this time, each node unit group can be connected with a first-stage adder, and the first-stage adders can include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, ..., and adder 7 corresponding to the eighth node unit group.

For the above-mentioned first-stage adder, if the output result of a MAC node in the node unit group is 17b and the first-stage adder is used to accumulate the product results of the 8 (2³) MAC nodes in the node unit group, then the data resource supportable by the first-stage adder is the sum of the data bit width for convolution calculation in the MAC node and N, that is, 17b+3b=20b. The second-stage adder is used to accumulate the output results of the first-stage adders. If the second-stage adder also uses an 8-input adder, then only one 8-input 20b adder is needed, and the output bit width of the second-stage adder is 20b+3b=23b.
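
The bit-width arithmetic above can be checked with a short Python sketch. It assumes that the 17b and 20b values are signed two's-complement integers; the original text does not state the number representation, and the helper names are hypothetical.

```python
def signed_range(bits):
    """Smallest and largest value of a signed two's-complement integer (assumed format)."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def fits(value, bits):
    """True if value is representable in the given signed width."""
    lo, hi = signed_range(bits)
    return lo <= value <= hi


def worst_case_sums(input_bits, fan_in):
    """Most negative and most positive sum of fan_in inputs of input_bits each."""
    lo, hi = signed_range(input_bits)
    return lo * fan_in, hi * fan_in


if __name__ == "__main__":
    # First stage: 8 inputs of 17b; second stage: 8 inputs of 20b.
    for in_bits, fan_in, out_bits in ((17, 8, 20), (20, 8, 23)):
        lo, hi = worst_case_sums(in_bits, fan_in)
        ok = fits(lo, out_bits) and fits(hi, out_bits)
        too_small = fits(lo, out_bits - 1) and fits(hi, out_bits - 1)
        print(in_bits, "b x", fan_in, "->", out_bits, "b sufficient:", ok,
              "| one bit fewer sufficient:", too_small)
```

Under this assumption, an 8-input accumulation needs exactly 3 extra bits: the output width is sufficient for the worst case, and one bit fewer is not.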

For the convolution operation device, the more complicated its structure and calculation logic, the more difficult it is to meet high-frequency hardware timing requirements, so the operating clock frequency of the convolution operation device cannot be too high. Therefore, when the MAC array in the convolution operation device is divided into node unit groups, the number of MAC nodes included in each node unit group and the number of adders corresponding to the node unit groups directly affect the data processing time of the convolution operation device; for example, the more adders there are, the worse the timing of the convolution operation device.

Based on the above, when the convolution operation device is used to analyze and process the data to be processed, if the timing corresponding to the 8-input 20b adder cannot meet the requirements, another arrangement of node unit groups and corresponding adders can be used. For example, as shown in FIG. 7, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. When the node unit group division operation is performed on each column in the MAC array, the 64 MAC nodes in the first column can be evenly divided into 8 node unit groups, each including 8 (2³) MAC nodes. At this time, each node unit group can be connected with a first-stage adder, and the first-stage adders may include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, ..., and adder 7 corresponding to the eighth node unit group.

At this time, the first-stage adder is an 8-input adder, and the second-stage adder connected to the first-stage adder can be a 4 (2²)-input adder. The second-stage adders can include adder 8, which is connected to adder 0, adder 1, adder 2 and adder 3, and adder 9, which is connected to adder 4, adder 5, adder 6 and adder 7. The number of the above-mentioned second-stage adders is two, and their output bit width is the sum of the bit width of the data they accumulate and N, that is, 20b+2b=22b.

It should be noted that the second-stage adder also includes an adder 10 connected to the above adder 8 and adder 9. The adder 10 is a 2 (2¹)-input adder, and its output bit width is the sum of the bit width of the data it accumulates and N, that is, 22b+1b=23b.

Based on the above, when the convolution operation device is used to analyze and process the data to be processed, if the timing corresponding to the convolution operation device with the above structure cannot meet the requirements, another connection structure of node unit groups and adders can be used. For example, referring to FIG. 8, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. When the node unit group division operation is performed on each column in the MAC array, the 64 MAC nodes in the first column can be evenly divided into 16 node unit groups, each including 4 (2²) MAC nodes. At this time, each node unit group can be connected with a first-stage adder, and the first-stage adders may include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, ..., and adder 15 corresponding to the sixteenth node unit group.

For the convolution operation device with the above structure, the first-stage adder is a 4-input adder, that is, it can accumulate the products of 4 (2²) MAC nodes. At this time, the output bit width of the first-stage adder is 17b+2b=19b. The second-stage adder connected with the first-stage adder can be a 4 (2²)-input adder, and the second-stage adders can comprise adder 16, adder 17, adder 18 and adder 19, wherein adder 16 is connected with adder 0, adder 1, adder 2 and adder 3; adder 17 is connected with adder 4, adder 5, adder 6 and adder 7; adder 18 is connected with adder 8, adder 9, adder 10 and adder 11; and adder 19 is connected with adder 12, adder 13, adder 14 and adder 15. The number of the above-mentioned second-stage adders is four, and each second-stage adder can accumulate the results of 4 (2²) first-stage adders. At this time, the output bit width of the second-stage adder is 19b+2b=21b.

It should be noted that the second-stage adder also includes an adder 20 connected to the above-mentioned adder 16, adder 17, adder 18 and adder 19. The adder 20 is a 4 (2²)-input 21b adder, and its output bit width is the sum of the bit width of the data it accumulates and N, that is, 21b+2b=23b.
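
The three adder arrangements described for a 64-node column (8-input then 8-input; 8-input, 4-input, then 2-input; and three levels of 4-input adders) can be compared with a small Python sketch. The function `tree_profile` and the list-of-fan-ins representation are assumptions introduced for illustration; the sketch only handles fan-ins that are powers of two.

```python
def tree_profile(num_nodes, product_bits, fan_ins):
    """Return [(adder_count, output_bits), ...] for each level of one column.

    fan_ins lists the number of inputs of the adders at each level
    (hypothetical representation, power-of-two fan-ins only).
    """
    levels = []
    inputs, bits = num_nodes, product_bits
    for fan_in in fan_ins:
        n = fan_in.bit_length() - 1          # N such that fan_in == 2**N
        if fan_in < 2 or (1 << n) != fan_in:
            raise ValueError("fan-in must be a power of two")
        if inputs % fan_in != 0:
            raise ValueError("inputs not evenly divisible by fan-in")
        inputs //= fan_in                    # one adder per group of fan_in inputs
        bits += n                            # output width = input width + N
        levels.append((inputs, bits))
    if inputs != 1:
        raise ValueError("levels must reduce the column to a single result")
    return levels


if __name__ == "__main__":
    for fan_ins in ([8, 8], [8, 4, 2], [4, 4, 4]):
        print(fan_ins, tree_profile(64, 17, fan_ins))
```

All three arrangements end at the same 23b column result; they differ in the number of adders and in the fan-in of each adder, which is the trade-off against timing discussed above.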

The convolution operation device provided in this embodiment divides the MAC array evenly into several node unit groups, and then makes the bit width of the adder connected to each node unit group match, as closely as possible, the sum of the data bit width for convolution calculation in the MAC node and N. Here N is related to the number of inputs of the adder: when the adder has 4 inputs, N is 2, that is, N = log₂(number of adder inputs). This yields a non-redundant cascaded adder structure based on the systolic array; that is, the adder resources included in the convolution operation device have no redundancy at all, which can save a large amount of chip area, reduce power consumption and cost, further improve the quality and efficiency of the convolution operation performed by the convolution operation device, and ensure the practicability of the convolution operation device.

In some other examples, when the number of MAC nodes included in the node unit group is not an integer power of 2, that is, it lies between 2 to the Nth power and 2 to the (N+1)th power, the data resource supportable by the adder is the sum of the data bit width for convolution calculation in the MAC node and N+1.

For example, when the MAC array is a 48*48 array, each column includes 48 MAC nodes. When the node unit group division operation is performed on each column in the MAC array, the 48 MAC nodes in the first column can be evenly divided into 4 node unit groups, each including 12 (2⁴ > 12 > 2³) MAC nodes. At this time, each node unit group can be connected with a first-stage adder, so the number of first-stage adders is 4, and each first-stage adder is used to accumulate the product results of the 12 MAC nodes in its node unit group. The data resource supportable by the first-stage adder is then the sum of the data bit width for convolution calculation in the MAC node and N+1, that is, 17b+3b+1b=21b. The second-stage adder is used to accumulate the output results of the first-stage adders. If the second-stage adder uses a 4 (2²)-input adder, the output bit width of the second-stage adder is 21b+2b=23b.
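
The two width rules (add N when the group size is 2 to the Nth power, add N+1 when it lies strictly between 2 to the Nth power and 2 to the (N+1)th power) amount to adding the ceiling of log₂ of the number of inputs. The following Python sketch expresses this with hypothetical helper names; it is an illustration of the rule, not the device's implementation.

```python
def growth_bits(num_inputs):
    """Extra bits needed when accumulating num_inputs values.

    Equals N for 2**N inputs and N+1 for 2**N < num_inputs < 2**(N+1).
    """
    if num_inputs < 1:
        raise ValueError("an adder needs at least one input")
    return (num_inputs - 1).bit_length()


def adder_output_bits(input_bits, num_inputs):
    """Output width of an adder accumulating num_inputs values of input_bits each."""
    return input_bits + growth_bits(num_inputs)


if __name__ == "__main__":
    first = adder_output_bits(17, 12)     # 12 MAC nodes per node unit group
    second = adder_output_bits(first, 4)  # 4 first-stage adders per column
    print("first stage:", first, "b; second stage:", second, "b")
```

For the 48*48 example this gives 21b after the first stage and 23b after the second, matching the figures in the text.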

In the convolution operation device provided in this embodiment, the MAC array is divided into several node unit groups, and when the number of MAC nodes included in a node unit group is not an integer power of 2, the data resource supportable by the adder is the sum of the data bit width for convolution calculation in the MAC node and N+1. In this case, although the adder resources included in the convolution operation device have some redundancy, the degree of resource redundancy is effectively reduced. This can also save chip area, reduce power consumption and cost, further improve the quality and efficiency of the convolution operation performed by the convolution operation device, and ensure the practicability of the convolution operation device.

In specific implementations, compared with the convolution operation device in the related art, the convolution operation device provided by this embodiment effectively improves the utilization rate of the MAC nodes and is more conducive to improving data processing performance. As shown in FIG. 9 and FIG. 10, taking a 64*64 MAC array as an example, when the data to be processed ifm is input to the MAC array in triangle form, it takes 128 clock cycles before every MAC node in the MAC array is running; when the data to be processed ifm is input to the MAC array in rectangular form, it takes only 64 clock cycles for every MAC node in the MAC array to be running. Moreover, since the MAC nodes in each column of the convolution operation device work synchronously, a large number of systolic control signals are eliminated, and the quality and efficiency of the convolution operation performed by the convolution operation device are further improved.

It should be noted that the convolution operation device provided in this embodiment is not limited to the implementations described above. Those skilled in the art can divide the MAC array according to specific application scenarios or application requirements, and can also configure, after the division, the adder resources corresponding to the obtained node unit groups, which further improves the flexibility of the convolution operation device. In addition, the convolution operation device can perform convolution calculations not only on int8 precision data but also on other types of data, such as int4 precision data and int16 precision data, thereby expanding the applicable scope of the convolution operation device.

FIG. 11 is a schematic flow chart of a convolution operation method provided by an embodiment of the present invention. Referring to FIG. 11, this embodiment provides a convolution operation method, and the execution subject of the method may be a convolution operation device. It can be understood that the convolution operation device can be implemented as software, or as a combination of software and hardware. Specifically, the convolution operation method can include:

Step S1101: Obtain the data to be processed, and determine the node input data corresponding to each of the MAC nodes in the multiplier MAC array within one clock cycle based on the data to be processed.

Here, the data to be processed refers to the data that requires convolution calculation. In different application scenarios, the data to be processed can correspond to different data formats; for example, it can be text data, image data, video data, etc. When the user has convolution calculation requirements for the data to be processed, the data can be obtained through the data acquisition node. Specifically, the data to be processed can be stored in a preset area, and the data acquisition node can obtain it by accessing the preset area; alternatively, the data to be processed is stored in a third device, and the data acquisition node communicates with the third device so as to obtain the data to be processed through the third device. Of course, the data acquisition node can also acquire the data to be processed in other ways, as long as the accuracy and reliability of the acquisition can be guaranteed; details are not repeated here.

After the data acquisition node acquires the data to be processed, the data can be sent to the MAC array so that the MAC array can analyze and process it. The above-mentioned MAC array is not limited to a two-dimensional MAC array; it can also be a three-dimensional MAC array, etc. For ease of understanding, a two-dimensional MAC array is used as an example. Since the MAC array includes multiple rows and multiple columns of MAC nodes, when the data acquisition node communicates with the MAC array, the data acquisition node communicates with the MAC nodes in the same column. In order to enable the MAC nodes in the same column to analyze and process the data to be processed within one clock cycle, after the data acquisition node acquires the data to be processed, it can analyze and process that data so as to determine, within one clock cycle, the node input data corresponding to each of the MAC nodes in the multiplier MAC array. It can be understood that the node input data may be a part of the data to be processed.

Step S1102: Use the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain node calculation results, and the node calculation results are used to determine the convolution operation results corresponding to the data to be processed.

In some examples, using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain the node calculation results may include: determining the weight coefficients corresponding to the multiple MAC nodes in the MAC array; and using the MAC nodes and the weight coefficients to perform convolution calculation on the node input data to obtain the node calculation results.

Since the node calculation results are used to determine the convolution operation result corresponding to the data to be processed, after the node calculation results are obtained, the method in this embodiment may further include: determining the convolution operation result corresponding to the data to be processed based on the node processing results.
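
As a rough illustration of the steps above, the following Python sketch models one column of MAC nodes: each node multiplies its node input data by its weight coefficient, and the column result is the accumulation of those products. The function names and the modelling of a MAC node as a single integer multiplication are assumptions for illustration; the sketch does not model clock cycles or the hardware data path.

```python
def mac_node(node_input, weight):
    """Node calculation result of one MAC node (modelled as a product)."""
    return node_input * weight


def column_convolution(node_inputs, weights):
    """Return (node calculation results, accumulated column result)."""
    if len(node_inputs) != len(weights):
        raise ValueError("each MAC node needs one input and one weight")
    # All nodes of the column receive their inputs together (rectangular input).
    node_results = [mac_node(x, w) for x, w in zip(node_inputs, weights)]
    return node_results, sum(node_results)


if __name__ == "__main__":
    results, total = column_convolution([1, 2, 3, 4], [1, 0, -1, 2])
    print(results, total)  # [1, 0, -3, 8] 6
```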

In the convolution operation method provided in this embodiment, the data to be processed is obtained, the node input data corresponding to each of the MAC nodes in the multiplier MAC array is determined within one clock cycle based on the data to be processed, and the MAC nodes in the MAC array are then used to perform convolution calculation on the node input data to obtain the node calculation results. This effectively realizes the convolution operation on the data to be processed within one clock cycle, improves data processing efficiency, and reduces the power consumption and cost of data processing, thereby effectively solving the problems of long time consumption and low processing performance that arise when data is input in the form of triangle input, and improving the practicability of the convolution operation method.

FIG. 12 is a schematic flowchart of another convolution operation method provided by an embodiment of the present invention. On the basis of the above embodiment, referring to FIG. 12, before the MAC nodes in the MAC array are used to perform convolution calculation on the node input data, the method in this embodiment may also include:

Step S1201: Obtain division parameters for dividing each column of MAC nodes in the MAC array.

Step S1202: Divide each column in the MAC array into a plurality of node unit groups based on the division parameter, and the node unit group includes a plurality of MAC nodes.

When the MAC nodes in the MAC array are used to perform convolution operations on the node input data, the MAC array can be divided into node groups in order to improve the quality and efficiency of the convolution operations. Specifically, a division parameter for dividing each column of MAC nodes in the MAC array is obtained; the division parameter may be a parameter stored in a preset area, or a parameter configured by a user based on an application scenario or an application requirement. After the division parameter is obtained, each column in the MAC array may be divided into multiple node unit groups based on the division parameter, and each obtained node unit group includes multiple MAC nodes. In some examples, the node unit groups corresponding to the same column in the MAC array include the same number of MAC nodes. The numbers of node unit groups corresponding to different columns in the MAC array may be the same or different.

After each column in the MAC array is divided into a plurality of node unit groups based on the division parameters, the node unit groups can be used to perform convolution operations on the node input data, so that the node operation results can be obtained.

In this embodiment, the division parameters for dividing each column of MAC nodes in the MAC array are obtained, and each column in the MAC array is then divided into multiple node unit groups based on the division parameters, so that the node unit groups can be used to perform the convolution operation on the node input data, which is beneficial to improving the quality and efficiency of the convolution operation.

FIG. 13 is a schematic flow diagram, provided by an embodiment of the present invention, of using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain the node calculation results. This embodiment provides an implementation of using the MAC nodes in the MAC array to perform convolution calculation on the node input data. Specifically, in this embodiment, using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain the node calculation results can include:

Step S1301: Obtain at least one adder corresponding to the node unit group in the MAC array.

Step S1302: using at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain the node processing results corresponding to each column in the MAC array.

When the MAC nodes in the MAC array are used to perform convolution calculation on the node input data, an accumulation operation is required. In order to avoid redundancy of adder resources, when the MAC array is divided into multiple node unit groups, the adder resources corresponding to the node unit groups can be configured. Therefore, in order to realize the convolution operation, at least one adder corresponding to the node unit group in the MAC array can be obtained first. In some examples, obtaining at least one adder corresponding to the node unit group in the MAC array can include: obtaining the number of MAC nodes included in the node unit group; determining the data bit width for convolution calculation in the MAC nodes corresponding to the at least one adder; and determining the data resources supportable by the adder based on the number of MAC nodes and the data bit width. Specifically, determining the data resources supportable by the adder based on the number of MAC nodes and the data bit width may include: when the number of MAC nodes included in the node unit group is the Nth power of 2, determining the data resources supportable by the adder based on the sum of the data bit width and N; and when the number of MAC nodes included in the node unit group is not an integer power of 2, determining the data resources supportable by the adder based on the sum of the data bit width and N+1.

After the at least one adder corresponding to the node unit group in the MAC array is obtained, the at least one adder can be used to accumulate the calculation results of the MAC nodes in the node unit group, so that the node processing results corresponding to each column in the MAC array can be obtained stably and effectively, which further ensures the accuracy of the obtained node processing results.
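
Steps S1301 and S1302 can be sketched in Python as a staged accumulation: each level of adders sums consecutive groups of its inputs until a single column result remains. The function names and the list of per-level fan-ins are hypothetical, and the sketch uses unbounded Python integers, so it illustrates the accumulation order only, not hardware bit widths.

```python
def accumulate_stage(values, fan_in):
    """One level of adders: each adder sums fan_in consecutive inputs."""
    if len(values) % fan_in != 0:
        raise ValueError("inputs not evenly divisible by fan-in")
    return [sum(values[i:i + fan_in]) for i in range(0, len(values), fan_in)]


def accumulate_column(products, fan_ins):
    """Reduce the MAC node results of one column through cascaded adders."""
    values = list(products)
    for fan_in in fan_ins:
        values = accumulate_stage(values, fan_in)
    if len(values) != 1:
        raise ValueError("adder levels must reduce the column to one result")
    return values[0]


if __name__ == "__main__":
    products = list(range(-32, 32))  # 64 illustrative node calculation results
    for fan_ins in ([8, 8], [8, 4, 2], [4, 4, 4]):
        print(fan_ins, accumulate_column(products, fan_ins))
```

Whatever grouping is chosen, the column result equals the plain sum of the node calculation results; the grouping only changes how many adders are used and how wide each one must be.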

The implementation, implementation principle, and implementation effect of the method in this embodiment are similar to those of the convolution operation device shown in the figures described above; for details, refer to the above description, which is not repeated here.

FIG. 14 is a schematic flow diagram of an image processing method based on a multiplier MAC array provided by an embodiment of the present invention. Referring to FIG. 14, this embodiment provides an image processing method based on a multiplier MAC array, which can be executed by an image processing device based on the multiplier MAC array. It can be understood that the image processing device can be implemented as software, or as a combination of software and hardware. Specifically, the image processing method based on the multiplier MAC array may include:

Step S1401: Acquire the image to be processed, and determine the node input data corresponding to each MAC node among the plurality of MAC nodes in the MAC array within one clock cycle based on the image to be processed.

Step S1402: use the MAC nodes in the MAC array to perform convolution calculation on the input data of the nodes, and obtain the calculation results of the nodes.

Step S1403: Determine the convolution processing result corresponding to the image to be processed based on the node calculation result.

The implementation, implementation principle, and implementation effect of the method in this embodiment are similar to the implementation, implementation principle, and implementation effect of the convolution operation device shown in FIGS. 4-10 above for the data to be processed. Reference is made to the content of the above statements, and no further details are given here.

FIG. 15 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention. Referring to FIG. 15, this embodiment provides a convolution operation device, which is used to execute the convolution operation method shown in FIG. 11 above. Specifically, the convolution operation device may include:

The first memory 12 is used to store computer programs;

The first processor 11 is configured to run the computer program stored in the first memory 12 to realize:

Obtaining the data to be processed, and determining the node input data corresponding to each MAC node in the plurality of MAC nodes in the multiplier MAC array within a clock cycle based on the data to be processed;

Wherein, the structure of the electronic device may further include a first communication interface 13, which is used for the electronic device to communicate with other devices or a communication network.

In some examples, before using the MAC nodes in the MAC array to perform convolution calculation on the node input data, the first processor 11 in this embodiment is also configured to perform: obtaining division parameters for dividing each column of MAC nodes in the MAC array; and dividing each column in the MAC array into multiple node unit groups based on the division parameters, where each node unit group includes multiple MAC nodes.

In some examples, the number of MAC nodes included in the node unit groups corresponding to the same column in the MAC array is the same.

In some examples, the numbers of node unit groups corresponding to different columns in the MAC array are the same or different.

In some examples, when the first processor 11 uses the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain the node calculation results, the first processor 11 is configured to perform: acquiring at least one adder corresponding to the node unit group in the MAC array; and using the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain the node processing results corresponding to each column in the MAC array.

In some examples, when the first processor 11 acquires at least one adder corresponding to the node unit group in the MAC array, the first processor 11 is configured to perform: acquiring the number of MAC nodes included in the node unit group; determining the data bit width for convolution calculation in the MAC nodes corresponding to the at least one adder; and determining the data resources supportable by the adder based on the number of MAC nodes and the data bit width.

In some examples, when the first processor 11 determines the data resources supportable by the adder based on the number of MAC nodes and the data bit width, the first processor 11 is configured to perform: when the number of MAC nodes included in the node unit group is the Nth power of 2, determining the data resources supportable by the adder based on the sum of the data bit width and N; and when the number of MAC nodes included in the node unit group is not an integer power of 2, determining the data resources supportable by the adder based on the sum of the data bit width and N+1.

In some examples, when the first processor 11 uses the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain the node calculation results, the first processor 11 is configured to perform: determining the weight coefficients corresponding to the multiple MAC nodes in the MAC array; and using the MAC nodes and the weight coefficients to perform convolution calculation on the node input data to obtain the node calculation results.

In some examples, after the node calculation result is obtained, the first processor 11 is configured to: determine a convolution operation result corresponding to the data to be processed based on the node processing result.

The device shown in FIG. 15 can execute the method of the embodiment shown in FIG. 11-FIG. 13. For the parts not described in detail in this embodiment, refer to the relevant description of the embodiment shown in FIG. 11-FIG. 13. For the execution process and technical effect of this technical solution, refer to the description in the embodiment shown in FIG. 11-FIG. 13 , which will not be repeated here.

FIG. 16 is a schematic structural diagram of an image processing device based on a multiplier MAC array provided by an embodiment of the present invention. Referring to FIG. 16, this embodiment provides an image processing device based on a multiplier MAC array, which is used to perform the image processing method based on the multiplier MAC array shown in FIG. 14. Specifically, the image processing device based on the multiplier MAC array may include:

The second memory 22 is used to store computer programs;

The second processor 21 is configured to run the computer program stored in the second memory 22 to realize:

Obtain the image to be processed, and determine the node input data corresponding to each MAC node in a plurality of MAC nodes in the MAC array within a clock cycle based on the image to be processed;

Use the MAC nodes in the MAC array to perform convolution calculations on the node input data to obtain the node calculation results;

Wherein, the structure of the electronic device may further include a second communication interface 23 for the electronic device to communicate with other devices or a communication network.

The device shown in FIG. 16 can execute the method of the embodiment shown in FIG. 14. For parts not described in detail in this embodiment, refer to the relevant description of the embodiment shown in FIG. 14. For the execution process and technical effect of this technical solution, refer to the description of the embodiment shown in FIG. 14; details are not repeated here.
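For orientation, the image processing flow can be approximated in software as a plain 2D convolution in which each output pixel corresponds to the multiply-accumulate work of one MAC node. This is a minimal sketch under stated assumptions: the function name `convolve_image` is hypothetical, the kernel is applied without flipping (as is common in neural-network convolution), and no padding or stride is modeled.

```python
from typing import List, Sequence


def convolve_image(
    image: Sequence[Sequence[int]], kernel: Sequence[Sequence[int]]
) -> List[List[int]]:
    """Slide the kernel over the image and accumulate products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    if out_h < 1 or out_w < 1:
        raise ValueError("kernel must not be larger than the image")

    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Work that a single MAC node would perform for this output pixel
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(convolve_image(img, [[1, 0], [0, 1]]))
```

The nested loops run sequentially here, whereas the MAC array is described as determining the input data for all of its nodes within a clock cycle.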

In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by an electronic device, including the programs involved in executing the convolution operation method in the method embodiments shown in FIG. 11 to FIG. 13 above.

In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by an electronic device, including the programs involved in executing the image processing method based on the multiplier MAC array in the method embodiment shown in FIG. 14.

The technical solutions and technical features in each of the above embodiments can be used alone or in combination provided that they do not conflict with the present invention; as long as they do not exceed the scope of cognition of those skilled in the art, they all belong to equivalent embodiments within the scope of protection of the present application.

In the several embodiments provided by the present invention, it should be understood that the disclosed related detection devices and methods can be implemented in other ways. For example, the above-described embodiment of the detection device is only illustrative. The division of the modules or units is only a logical function division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of detection devices or units may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer processor to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

The above is only an embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the description and accompanying drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that it is still possible to modify the technical solutions described in the foregoing embodiments, or to make equivalent replacements for some or all of the technical features, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

A convolution computing device, characterized in that it comprises:

A data acquisition node, configured to acquire data to be processed, and determine within a clock cycle based on the data to be processed, node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array;

The MAC array is connected to the data acquisition node, and is used to acquire the input data of the node through the data acquisition node, and perform convolution calculation on the input data of the node to obtain a calculation result of the node, and the calculation result of the node is used for determining the convolution operation result corresponding to the data to be processed.
The device according to claim 1, wherein each column in the MAC array includes a plurality of node unit groups, each node unit group includes a plurality of MAC nodes, and each node unit group corresponds to at least one adder, the adder being configured to accumulate the calculation results of the MAC nodes in the node unit group to obtain a node processing result corresponding to each column in the MAC array.
The device according to claim 2, wherein the number of MAC nodes included in the node unit groups corresponding to the same column in the MAC array is the same.
The device according to claim 2, wherein the number of node unit groups corresponding to different columns in the MAC array is the same or different.
The device according to claim 2, wherein the at least one adder comprises: a first-stage adder connected to the node unit group and a second-stage adder connected to the first-stage adder, wherein the data resources supported by the second-stage adder are greater than the data resources supported by the first-stage adder.
The device according to claim 5, wherein the data resources supported by the adder are related to the number of MAC nodes included in the node unit group and the data bit width for convolution calculation in the MAC nodes.
The device according to claim 6, wherein, when the number of MAC nodes included in the node unit group is the Nth power of 2, the data resources supported by the adder are the sum of the data bit width for convolution calculation in the MAC nodes and N.
The device according to claim 6, wherein, when the number of MAC nodes included in the node unit group is not the Nth power of 2, the data resources supported by the adder are the sum of the data bit width for convolution calculation in the MAC nodes and N+1.
The device according to any one of claims 1-8, wherein the device further comprises:

a weight cache node, configured to determine weight coefficients corresponding to each of the multiple MAC nodes in the MAC array;

The MAC array is connected to the weight cache node, and is used to perform convolution calculation on the input data of the node by using the MAC node and the weight coefficient to obtain a node calculation result.
The device according to any one of claims 1-8, wherein the device further comprises:

A data processing node, connected to the MAC array, configured to determine a convolution operation result corresponding to the data to be processed based on the calculation result of the node.
A convolution operation method, characterized in that, comprising:

Obtaining data to be processed, and determining node input data corresponding to each MAC node in the plurality of MAC nodes in the multiplier MAC array within a clock cycle based on the data to be processed;

The MAC nodes in the MAC array are used to perform convolution calculation on the input data of the nodes to obtain node calculation results, and the node calculation results are used to determine the convolution operation results corresponding to the data to be processed.
The method according to claim 11, wherein before using the MAC nodes in the MAC array to perform convolution calculation on the node input data, the method further comprises:

Acquiring division parameters for dividing each column of MAC nodes in the MAC array;

Divide each column in the MAC array into a plurality of node unit groups based on the division parameter, and the node unit group includes a plurality of MAC nodes.
The method according to claim 12, wherein the number of MAC nodes included in the node unit groups corresponding to the same column in the MAC array is the same.
The method according to claim 12, wherein the number of node unit groups corresponding to different columns in the MAC array is the same or different.
The method according to claim 12, wherein the MAC nodes in the MAC array are used to perform convolution calculations on the input data of the nodes to obtain node calculation results, including:

Obtain at least one adder corresponding to the node unit group in the MAC array;

Using the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain a node processing result corresponding to each column in the MAC array.
The method according to claim 15, wherein obtaining at least one adder corresponding to the node unit group in the MAC array comprises:

obtaining the number of MAC nodes included in the node unit group;

Determining the data bit width for performing convolution calculation in the respective MAC nodes corresponding to the at least one adder;

Based on the number of the MAC nodes and the data bit width, determine the data resources that the adder can support.
The method according to claim 16, wherein, based on the number of the MAC nodes and the data bit width, determining the data resources that the adder can support includes:

When the number of MAC nodes included in the node unit group is the Nth power of 2, determining the data resources that the adder can support based on the sum of the data bit width and N;

When the number of MAC nodes included in the node unit group is not the Nth power of 2, the data resources supported by the adder are determined based on the sum of the data bit width and N+1.
The method according to claim 11, characterized in that, using the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain node calculation results, including:

determining weight coefficients corresponding to each of the multiple MAC nodes in the MAC array;

The MAC node and the weight coefficient are used to perform convolution calculation on the input data of the node to obtain a node calculation result.
The method according to claim 18, characterized in that, after obtaining the node calculation results, the method further comprises:

A convolution operation result corresponding to the data to be processed is determined based on the node processing result.
An image processing method based on a multiplier MAC array, characterized in that it comprises:

Acquiring an image to be processed, and determining node input data corresponding to each MAC node among the plurality of MAC nodes in the MAC array within a clock cycle based on the image to be processed;

Using the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain node calculation results;

A convolution processing result corresponding to the image to be processed is determined based on the node calculation result.
A convolution computing device, characterized in that it comprises:

memory for storing computer programs;

a processor for running a computer program stored in said memory to:

Obtaining data to be processed, and determining node input data corresponding to each MAC node in the plurality of MAC nodes in the multiplier MAC array within a clock cycle based on the data to be processed;

The MAC nodes in the MAC array are used to perform convolution calculation on the input data of the nodes to obtain node calculation results, and the node calculation results are used to determine the convolution operation results corresponding to the data to be processed.
The device according to claim 21, wherein the processor is further configured to execute:

Acquiring division parameters for dividing each column of MAC nodes in the MAC array;

Divide each column in the MAC array into a plurality of node unit groups based on the division parameter, and the node unit group includes a plurality of MAC nodes.
The device according to claim 22, wherein the number of MAC nodes included in the node unit groups corresponding to the same column in the MAC array is the same.
The device according to claim 22, wherein the number of node unit groups corresponding to different columns in the MAC array is the same or different.
The device according to claim 22, wherein when the processor uses the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain the node calculation results, the processor is configured to execute:

Obtain at least one adder corresponding to the node unit group in the MAC array;

Using the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain a node processing result corresponding to each column in the MAC array.
The device according to claim 25, wherein when the processor obtains at least one adder corresponding to the node unit group in the MAC array, the processor is configured to perform:

obtaining the number of MAC nodes included in the node unit group;

Determining the data bit width for performing convolution calculation in the respective MAC nodes corresponding to the at least one adder;

Based on the number of the MAC nodes and the data bit width, determine the data resources that the adder can support.
The device according to claim 26, wherein when the processor determines the data resources that the adder can support based on the number of the MAC nodes and the data bit width, the processor is configured to execute:

When the number of MAC nodes included in the node unit group is the Nth power of 2, determining the data resources that the adder can support based on the sum of the data bit width and N;

When the number of MAC nodes included in the node unit group is not the Nth power of 2, the data resources supported by the adder are determined based on the sum of the data bit width and N+1.
The device according to claim 21, wherein when the processor uses the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain the node calculation results, the processor is configured to execute:

determining weight coefficients corresponding to each of the multiple MAC nodes in the MAC array;

The MAC node and the weight coefficient are used to perform convolution calculation on the input data of the node to obtain a node calculation result.
The device according to claim 28, wherein after obtaining the node calculation result, the processor is configured to perform:

A convolution operation result corresponding to the data to be processed is determined based on the node processing result.
An image processing device based on a multiplier MAC array, characterized in that it comprises:

memory for storing computer programs;

a processor for running a computer program stored in said memory to:

Acquiring an image to be processed, and determining node input data corresponding to each MAC node among the plurality of MAC nodes in the MAC array within a clock cycle based on the image to be processed;

Using the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain node calculation results;

A convolution processing result corresponding to the image to be processed is determined based on the node calculation result.
A computer-readable storage medium, characterized in that the storage medium is a computer-readable storage medium in which program instructions are stored, and the program instructions are used to implement the convolution operation method according to any one of claims 11-19.
A computer-readable storage medium, characterized in that the storage medium is a computer-readable storage medium in which program instructions are stored, and the program instructions are used to implement the image processing method based on the multiplier MAC array.