WO2022246617A1 - Convolution operation method and apparatus, image processing method and apparatus, and storage medium - Google Patents

Convolution operation method and apparatus, image processing method and apparatus, and storage medium

Info

Publication number
WO2022246617A1
WO2022246617A1 PCT/CN2021/095619 CN2021095619W WO2022246617A1 WO 2022246617 A1 WO2022246617 A1 WO 2022246617A1 CN 2021095619 W CN2021095619 W CN 2021095619W WO 2022246617 A1 WO2022246617 A1 WO 2022246617A1
Authority
WO
WIPO (PCT)
Prior art keywords
mac
node
data
nodes
adder
Prior art date
Application number
PCT/CN2021/095619
Other languages
French (fr)
Chinese (zh)
Inventor
李鹏
罗岚
祝武勇
高岳
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to PCT/CN2021/095619 priority Critical patent/WO2022246617A1/en
Publication of WO2022246617A1 publication Critical patent/WO2022246617A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present invention relate to the technical field of data processing, and in particular, to a convolution operation method, an image processing method, a device, and a storage medium.
  • Embodiments of the present invention provide a convolution operation method, an image processing method, a device, and a storage medium, which can improve data processing efficiency and reduce data processing power consumption and cost.
  • the first aspect of the present invention is to provide a convolution computing device, comprising:
  • a data acquisition node configured to acquire data to be processed and to determine, within one clock cycle and based on the data to be processed, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array;
  • the MAC array, which is connected to the data acquisition node and is used to acquire the node input data through the data acquisition node and to perform convolution calculation on the node input data to obtain node calculation results, the node calculation results being used to determine the convolution operation result corresponding to the data to be processed.
  • the second aspect of the present invention is to provide a convolution operation method, including:
  • the MAC nodes in the MAC array are used to perform convolution calculation on the input data of the nodes to obtain node calculation results, and the node calculation results are used to determine the convolution operation results corresponding to the data to be processed.
  • the third aspect of the present invention is to provide an image processing method based on a multiplier MAC array, comprising:
  • a convolution processing result corresponding to the image to be processed is determined based on the node calculation result.
  • a fourth aspect of the present invention is to provide a convolution computing device, comprising:
  • a processor for running a computer program stored in said memory to:
  • the MAC nodes in the MAC array are used to perform convolution calculation on the input data of the nodes to obtain node calculation results, and the node calculation results are used to determine the convolution operation results corresponding to the data to be processed.
  • a fifth aspect of the present invention is to provide an image processing device based on a multiplier MAC array, including:
  • a processor for running a computer program stored in said memory to:
  • a convolution processing result corresponding to the image to be processed is determined based on the node calculation result.
  • the sixth aspect of the present invention is to provide a computer-readable storage medium in which program instructions are stored, the program instructions being used to implement the convolution operation method of the second aspect.
  • a seventh aspect of the present invention is to provide a computer-readable storage medium in which program instructions are stored, the program instructions being used to implement the image processing method based on the multiplier MAC array of the third aspect.
  • the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array is determined by the data acquisition node within one clock cycle. After the node input data is acquired, the MAC nodes included in the MAC array can simultaneously perform convolution calculation on their corresponding node input data within one clock cycle to obtain the node calculation results. This not only improves data processing efficiency but also reduces data processing power consumption and cost. It therefore effectively solves the problems of long processing time and low processing performance that arise when data is input in the form of a triangle input, thereby improving the quality and efficiency of convolution operations.
  • Fig. 1 is a schematic diagram of the principle of convolution calculation provided by an embodiment in the related art
  • Fig. 2 is a schematic diagram of a systolic array for realizing convolution calculation provided by an embodiment in the related art
  • FIG. 3 is a schematic diagram of sequentially inputting feature maps for different rows of a systolic array provided by an embodiment in the related art
  • FIG. 4 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of inputting node input data to a MAC array provided by an embodiment of the present invention.
  • Fig. 6 is a schematic diagram 1 of cascading a node unit group and an adder provided by an embodiment of the present invention
  • FIG. 7 is a second schematic diagram of cascading node unit groups and adders provided by an embodiment of the present invention.
  • Fig. 8 is a schematic diagram 3 of the cascade connection of the node unit group and the adder provided by the embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a clock cycle that enables a MAC node in a MAC array to operate provided in the related art
  • FIG. 10 is a schematic diagram of a clock cycle for enabling a MAC node in a MAC array to operate provided in an embodiment of the present invention
  • FIG. 11 is a schematic flowchart of a convolution operation method provided by an embodiment of the present invention.
  • FIG. 12 is a schematic flowchart of another convolution operation method provided by an embodiment of the present invention.
  • FIG. 13 is a schematic flowchart of using the MAC nodes in the MAC array to perform convolution calculations on the input data of the nodes to obtain the calculation results of the nodes provided by an embodiment of the present invention
  • FIG. 14 is a schematic flowchart of an image processing method based on a multiplier MAC array provided by an embodiment of the present invention.
  • FIG. 15 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention.
  • FIG. 16 is a schematic structural diagram of an image processing device based on a multiplier MAC array provided by an embodiment of the present invention.
  • the convolution calculation can be performed on the feature map, so that the convolution calculation result can be obtained.
  • When performing convolution on the feature map, each point of the output feature map is calculated from the convolution kernel and the corresponding points of the input feature map. If there are multiple input feature maps, the calculation results for the same position in each feature map are accumulated to obtain the result of that point in the output feature map. As shown in Figure 1, when performing convolution calculation on an input feature map, two points of the output feature map can be calculated.
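  • The accumulation described above can be sketched in a few lines of Python. This is only an illustration under assumptions that the text does not state: stride 1, no padding, one kernel per input feature map, and the function name `conv_output_point` and all sample values are hypothetical.

```python
def conv_output_point(feature_maps, kernels, row, col):
    """Compute one output point by accumulating kernel * input products
    over every input feature map (assumed: stride 1, no padding)."""
    total = 0
    for fmap, kernel in zip(feature_maps, kernels):
        for i, kernel_row in enumerate(kernel):
            for j, weight in enumerate(kernel_row):
                total += weight * fmap[row + i][col + j]
    return total


# Two hypothetical 3x3 input feature maps and one 2x2 kernel for each.
ifm0 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ifm1 = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
k0 = [[1, 0], [0, 1]]
k1 = [[1, 1], [1, 1]]

# Two neighbouring points of the output feature map.
point_a = conv_output_point([ifm0, ifm1], [k0, k1], 0, 0)
point_b = conv_output_point([ifm0, ifm1], [k0, k1], 0, 1)
print(point_a, point_b)
```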
  • the convolutional neural network used to realize the convolution calculation can be used to analyze and process the input feature map.
  • the convolutional neural network can include:
  • The fully connected layer (FC layer), which is used to implement the multiply-accumulate operation, wherein each output point uses every input point.
  • the first type of fully connected layer (1stFC layer), which is used to implement the multiply-accumulate operation, wherein each output point uses every point of every feature map.
  • the other fully connected layer (elsFC layer) is used to implement the multiply-accumulate operation, where each output point uses each point in the input one-dimensional data.
  • to improve computing efficiency, the fully connected layer can perform batch processing so that the weight values in the computing unit are reused and multiple data items are processed at the same time. For example, when images are given, each image to be recognized is one item of the batch. Batch processing reduces data accesses, lowers power consumption, saves bandwidth, and allows some data to be reused.
  • the batch processing of the 1stFC layer is converted into batch processing of the elsFC layer: the map-like input feature maps of the 1stFC layer are expanded in advance into a continuous, point-like (one-dimensional) feature map, so as to meet the elsFC layer's requirements for the input feature map.
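  • As a rough sketch of this conversion, the Python below flattens map-like feature maps into one-dimensional data and then runs a batched multiply-accumulate in which each weight row is loaded once and reused for every sample. The function names, the row-major flattening order, and the sample values are assumptions made for illustration; the text does not specify them.

```python
def flatten_feature_maps(feature_maps):
    """Expand map-like feature maps into one continuous 1-D list
    (assumed order: map by map, row by row)."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def fc_batch(batch_inputs, weights):
    """Fully connected layer over a batch: weights[o][i] is the weight from
    input point i to output point o. Each weight row is fetched once and
    reused for every sample in the batch."""
    outputs = [[0] * len(weights) for _ in batch_inputs]
    for o, weight_row in enumerate(weights):
        for s, sample in enumerate(batch_inputs):
            outputs[s][o] = sum(w * x for w, x in zip(weight_row, sample))
    return outputs


# Two hypothetical samples, each made of two 2x2 feature maps.
sample_a = flatten_feature_maps([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
sample_b = flatten_feature_maps([[[1, 1], [1, 1]], [[1, 1], [1, 1]]])
weights = [[1] * 8, [0, 1, 0, 1, 0, 1, 0, 1]]
results = fc_batch([sample_a, sample_b], weights)
print(results)
```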
  • a systolic array can be used to describe the convolution calculation, as shown in Figure 2, the systolic array includes:
  • Input data loading node used to load input data to be processed, for example: feature map IFM;
  • Weight loading node (WEIGHT_LOAD), used to load weights;
  • a number of multiply-accumulate operation nodes (MAC nodes), which form the systolic computing array.
  • the feature map IFM is passed horizontally through the systolic array in a streaming manner, that is, the IFM is input as a stream and the MAC nodes in each row reuse the IFM one after another; the weight coefficients are passed vertically through the systolic array.
  • N-0 is the 0th ifm data of the Nth row of a certain picture
  • N-1 is the 1st ifm data of the Nth row of a certain picture
  • each MAC node can complete one multiplication (int8*int8) and one addition. To make the convolution calculation compatible with int16*int16, each int16 is split into (mas_8b, lsb_8b), so that the int16*int16 multiplication becomes (mas_8b, lsb_8b)*(mas_8b, lsb_8b). mas_8b is the high byte of the int16 and carries the sign bit; sign-extending it to the 9th bit turns it into an int9. lsb_8b is the low byte of the int16 and has no sign bit, so one 0 bit is added in the highest position to turn it into an int9. The multiplier therefore needs to change from an int8*int8 multiplier to an int9*int9 multiplier; in this way, when calculating lsb_int9*lsb_in
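  • The split described above can be checked with a short Python sketch. It only demonstrates the arithmetic identity behind splitting an int16 into a signed high byte and an unsigned low byte that both fit in an int9; how the four partial products are actually scheduled on the MAC nodes is not shown in the text, and the function names here are hypothetical.

```python
def split_int16(value):
    """Split an int16 into (high, low): high is the signed upper byte,
    low is the unsigned lower byte. Both fit in a signed 9-bit range."""
    assert -32768 <= value <= 32767
    high = value >> 8    # arithmetic shift keeps the sign: -128..127
    low = value & 0xFF   # zero-extended: 0..255
    return high, low


def mul_int16_via_int9(a, b):
    """Multiply two int16 values using four products whose operands
    all lie inside the int9 range (-256..255)."""
    a_hi, a_lo = split_int16(a)
    b_hi, b_lo = split_int16(b)
    for part in (a_hi, a_lo, b_hi, b_lo):
        assert -256 <= part <= 255
    # a = a_hi * 2**8 + a_lo and b = b_hi * 2**8 + b_lo, so:
    return ((a_hi * b_hi) << 16) + ((a_hi * b_lo + a_lo * b_hi) << 8) + a_lo * b_lo


print(mul_int16_via_int9(1234, -5678) == 1234 * -5678)
```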
  • the product (int17) is added to the calculation result passed down by the MAC node in the previous row. This value, also called the partial sum psum_out, is an intermediate result, and all intermediate results are accumulated to obtain the final result.
  • the analysis and processing of the feature map by the 64x64 MAC array is taken as an example to illustrate the data processing process.
  • the MAC array includes 64 rows and 64 columns of MAC nodes. For each row of MAC nodes, the adder resources are as follows:
  • the MAC nodes in the MAC array include adders, and the resources of the adders are redundant, which increases the area of the chip, increases the power consumption, and increases the cost.
  • the input data is a triangle input, so if all the MAC nodes need to be used, it will take a long time and reduce the data processing performance.
  • the input data (IFM) is a triangle input, so the input data loading node has to take the input cycle of each piece of input data into account, which makes the design logic of the input data loading node more complicated and also increases area and power consumption.
  • Each column of MAC nodes works in a systolic manner, so for some working enable signals, for example the weight loading signal, every MAC node in the column has to spend a register (a control-signal register) in order to pass the enable signal along, which consumes more registers and takes up more resources.
  • this embodiment provides a convolution operation method, an image processing method, a device, and a storage medium, wherein the convolution operation device includes a data acquisition node and a MAC array connected to the data acquisition node. The data acquisition node is used to acquire the data to be processed and to determine, within one clock cycle and based on the data to be processed, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. The MAC array is used to obtain the node input data through the data acquisition node and to perform convolution calculation on the node input data to obtain the node calculation results, which are used to determine the convolution operation result corresponding to the data to be processed.
  • the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array is determined by the data acquisition node within one clock cycle. After the node input data is obtained, the MAC nodes included in the MAC array can simultaneously perform convolution calculation on their corresponding node input data within one clock cycle to obtain the node calculation results, which not only improves data processing efficiency but also reduces data processing power consumption and cost. The problems of long processing time and low processing performance that arise when data is input in the form of a triangle input are therefore effectively solved, thereby improving the quality and efficiency of the convolution operation method.
  • Fig. 4 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention
  • Fig. 5 is a schematic diagram of inputting node input data to a MAC array provided by an embodiment of the present invention.
  • Referring to Fig. 4 and the following figures, this embodiment provides a convolution operation device that can perform convolution operation processing on the data to be processed and can achieve the following: (1) the resource redundancy of the adders is eliminated or reduced, which reduces the chip area, the power consumption, and the data processing cost; (2) the input data (IFM) of the convolution operation device is input within one clock cycle, that is, in a rectangular manner, so that all MAC nodes can be put to use in a shorter time, the time consumed is shortened, and processing performance is improved; (3) because the input data (IFM) is input in a rectangular manner, the design logic of the data acquisition node that feeds the MAC array is simple, which also brings area and power consumption benefits; (4) each column of MAC nodes in the MAC array works synchronously, so for some working enable signals, for example the weight loading signal, only one register is needed to drive all the MAC nodes in the column.
  • the convolution operation device may include a data acquisition node and a MAC array connected to the data acquisition node. The data acquisition node is used to acquire the data to be processed and to determine, within one clock cycle and based on the data to be processed, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. The MAC array performs convolution calculation on the node input data to obtain the node calculation results, and the results are used to determine the convolution operation result corresponding to the data to be processed.
  • the data to be processed refers to the data that requires convolution calculation.
  • the data to be processed can correspond to different types of data.
  • the data to be processed can be text data, image data, video data, etc.
  • the data to be processed can be obtained through the data acquisition node.
  • the data to be processed can be stored in a preset area, and the data acquisition node can obtain the data to be processed by accessing the preset area; or, the data to be processed is stored in a third device, and the data acquisition node communicates with the third device so that it can obtain the data to be processed through the third device.
  • the data acquisition node can also acquire the data to be processed in other ways, as long as the accuracy and reliability of the acquisition of the data to be processed can be guaranteed, and details will not be repeated here.
  • the data to be processed can be sent to the MAC array so that the MAC array can analyze and process it. The MAC array is not limited to a two-dimensional MAC array; it can also be a three-dimensional MAC array, etc.
  • a two-dimensional MAC array is used as an example for illustration. Since the MAC array includes multiple rows and multiple columns of MAC nodes, when the data acquisition node communicates with the MAC array, the data acquisition node communicates with the MAC nodes in the same column.
  • After the data acquisition node acquires the data to be processed, it can analyze and process the data to be processed so as to determine, within one clock cycle, the node input data corresponding to each of the MAC nodes in the multiplier MAC array.
  • the node input data corresponding to the first MAC node can be part of the data to be processed. For a MAC node other than the first MAC node, the node input data can be determined based on the output of the corresponding preceding MAC node, and that output is in turn determined based on the node input data corresponding to the first MAC node. Therefore, after the data to be processed is obtained, the node input data corresponding to each of the plurality of MAC nodes in the multiplier MAC array can be determined based on the data to be processed.
  • node input data may be determined based on the image data. The node input data may include data 0-1 corresponding to the first row of MAC nodes, data 1-1 corresponding to the second row of MAC nodes, ..., and data N-1 corresponding to the Nth row of MAC nodes, where data N-N denotes the Nth data of the Nth row in the image data. The node input data can then be transmitted to the MAC array within one clock cycle for analysis and processing.
  • After the multiple MAC nodes in the MAC array obtain the node input data, each MAC node can perform convolution calculation on its node input data to obtain a node calculation result. It can be understood that a node calculation result is an intermediate result of the convolution calculation on the data to be processed.
  • after the node calculation results of the MAC nodes in the MAC array are obtained, the convolution operation result corresponding to the data to be processed can be determined based on the calculation results of all nodes, so that a stable convolution calculation on the data to be processed is effectively realized.
  • the convolution operation device in this embodiment may further include a weight cache node connected to the MAC array. The weight cache node is used to determine the corresponding weight coefficients, which can then be transmitted to the MAC array so that the MAC array can use the MAC nodes and the weight coefficients to perform convolution calculation on the node input data and obtain the node calculation results.
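  • A minimal behavioural model of this step is sketched below in Python. It assumes, for illustration only, that each row of the array receives one input value in the cycle, that node (r, c) holds the weight `weights[r][c]`, and that the results of one column are accumulated outside the nodes. The function names and the sample values are hypothetical and do not come from the source.

```python
def mac_array_step(node_inputs, weights):
    """One clock cycle: every MAC node multiplies the input of its row by
    its own weight. Returns the node calculation results products[r][c]."""
    return [[x * w for w in weight_row]
            for x, weight_row in zip(node_inputs, weights)]


def column_results(products):
    """Accumulate the node calculation results of each column."""
    return [sum(column) for column in zip(*products)]


# Hypothetical 3-row, 2-column array.
node_inputs = [1, 2, 3]               # supplied by the data acquisition node
weights = [[1, 0], [0, 1], [2, 2]]    # supplied by the weight cache node
products = mac_array_step(node_inputs, weights)
sums = column_results(products)
print(products, sums)
```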
  • the convolution operation result can be determined based on the node calculation result.
  • the convolution operation device in this embodiment can also include a data processing node connected to the MAC array. The data processing node is used to determine the convolution operation result corresponding to the data to be processed based on the node calculation results, thereby effectively ensuring the accuracy and reliability of the convolution operation result.
  • the node input data corresponding to each of the MAC nodes in the multiplier MAC array is determined by the data acquisition node within one clock cycle. After the node input data is acquired, the MAC nodes included in the MAC array can simultaneously perform convolution calculation on their corresponding node input data within one clock cycle to obtain the node calculation results, and the node calculation results can be used to determine the convolution operation result corresponding to the data to be processed. In this way the data to be processed is transmitted to the MAC array in the form of a rectangular input within one clock cycle, so that the MAC array obtains all the node input data to be processed at the initial moment and within the same clock cycle. This improves data processing efficiency and reduces the power consumption and cost of data processing, effectively solving the problems of long processing time and low processing performance that arise when data is input in the form of a triangle input, and thereby improving the practicality of the convolution operation device.
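  • The difference between triangle input and rectangular input can be illustrated with a small scheduling sketch in Python. The model assumes that, with triangle input, each row starts one clock cycle later than the row above it, which is a common systolic-array arrangement but is not spelled out in the text; the function name `input_schedule` and its `skew` parameter are hypothetical.

```python
def input_schedule(num_rows, items_per_row, skew):
    """Return, for every clock cycle, the list of (row, item) pairs that
    enter the array. skew is the assumed start delay between adjacent
    rows: 1 models triangle input, 0 models rectangular input."""
    total_cycles = items_per_row + skew * (num_rows - 1)
    schedule = []
    for cycle in range(total_cycles):
        entering = [(row, cycle - skew * row)
                    for row in range(num_rows)
                    if 0 <= cycle - skew * row < items_per_row]
        schedule.append(entering)
    return schedule


triangle = input_schedule(64, 64, skew=1)
rectangle = input_schedule(64, 64, skew=0)
print(len(triangle), len(rectangle))        # input cycles needed
print(len(triangle[0]), len(rectangle[0]))  # rows fed in the first cycle
```

  • Under these assumptions, rectangular input feeds every row from the first cycle on, whereas triangle input needs extra cycles before all rows are busy.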
  • each column of the MAC array is divided into a plurality of node unit groups, and each node unit group includes a plurality of MAC nodes. Each node unit group corresponds to at least one adder, and the adder is used to accumulate the calculation results of the MAC nodes in the node unit group, so that the node processing results corresponding to each column in the MAC array can be obtained.
  • the division parameters of the node unit groups can be obtained, and the division parameters can include at least one of the following: the number of node unit groups, the number of MAC nodes included in each node unit group, etc. After the division parameters are obtained, each column in the MAC array may be divided into multiple node unit groups based on the division parameters.
  • the number of MAC nodes included in the node unit group corresponding to the same column in the MAC array is the same.
  • each column includes 64 MAC nodes, and the 64 MAC nodes can be evenly divided into 8 node unit groups. In this case, each node unit group includes 8 MAC nodes.
  • 64 MAC nodes may be evenly divided into 4 node unit groups, and in this case, each node unit group includes 16 MAC nodes.
  • 64 MAC nodes can be evenly divided into 16 node unit groups, and in this case, each node unit group includes 4 MAC nodes. In this way, when the node unit groups are used to analyze and process the node input data, the quality and efficiency of data analysis and processing can be effectively ensured.
  • the numbers of node unit groups corresponding to different columns in the MAC array are the same or different.
  • each column includes 64 MAC nodes, so when the node unit group division operation is performed on each column in the MAC array, the 64 MAC nodes in the first column can be evenly divided into 8 node unit groups, each including 8 MAC nodes, while the 64 MAC nodes in the second column are evenly divided into 4 node unit groups, each including 16 MAC nodes. In this case, the numbers of node unit groups corresponding to different columns in the MAC array are different.
  • the 64 MAC nodes in the second column can also be evenly divided into 8 node unit groups, each including 8 MAC nodes. In this case, the numbers of node unit groups corresponding to different columns in the MAC array are the same. In this way, when the node unit groups are used to analyze and process the node input data, the flexibility of data analysis and processing can be effectively guaranteed.
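  • The division of a column into equally sized node unit groups can be sketched as follows. The function name `divide_column` and the representation of MAC nodes as row indices are hypothetical; only the 64-node column and the group counts come from the examples above.

```python
def divide_column(num_nodes, num_groups):
    """Evenly divide the MAC nodes of one column (identified by their row
    indices) into node unit groups of equal size."""
    if num_groups < 1 or num_nodes % num_groups != 0:
        raise ValueError("nodes must divide evenly into groups")
    size = num_nodes // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


# Different columns may use the same or different numbers of groups.
groups_per_column = {0: 8, 1: 4}  # column index -> number of node unit groups
layout = {col: divide_column(64, n) for col, n in groups_per_column.items()}
print(len(layout[0]), len(layout[0][0]))  # groups and group size, column 0
print(len(layout[1]), len(layout[1][0]))  # groups and group size, column 1
```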
  • the MAC nodes in the MAC array in this embodiment do not include an adder.
  • the node unit groups divided in the MAC array each correspond to at least one adder. The adder is used to accumulate the calculation results of the MAC nodes in the node unit group to obtain the node processing results corresponding to each column in the MAC array.
  • the at least one adder may include a first-stage adder connected to the node unit group and a second-stage adder connected to the first-stage adder, where the data resources supported by the second-stage adder are larger than the data resources supported by the first-stage adder.
  • one control signal can be used to uniformly control the adders corresponding to each column of MAC nodes.
  • the first-stage adder is directly connected to the node unit group and accumulates the output results of the MAC nodes included in the node unit group. The second-stage adder is connected to the first-stage adder and accumulates the output results of the first-stage adder. It can be understood that there can be one or more second-stage adders, and the data resources that the second-stage adder can support are larger than those the first-stage adder can support.
  • since the adder is used to accumulate the output results of the MAC nodes in the node unit group, the data resources supported by the adder are related to the number of MAC nodes included in the node unit group and to the data bit width of the convolution calculation in the MAC nodes. Specifically, when the number of MAC nodes included in the node unit group is 2 to the Nth power, the data resource supported by the adder is the sum of the data bit width of the convolution calculation in the MAC nodes and N.
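  • The rule for power-of-two group sizes can be sanity-checked with a short Python sketch. It assumes signed two's-complement values, which the text does not state explicitly, and the 17-bit operand width is borrowed from the int17 product mentioned earlier purely as an illustration; the function names are hypothetical.

```python
def signed_range(bits):
    """Smallest and largest value of a signed two's-complement number."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def sum_fits(data_width, group_size, adder_width):
    """Check whether the sum of group_size values, each data_width bits
    wide, always fits in adder_width bits."""
    lo, hi = signed_range(data_width)
    out_lo, out_hi = signed_range(adder_width)
    return out_lo <= group_size * lo and group_size * hi <= out_hi


# A group of 2**3 = 8 nodes producing 17-bit results needs 17 + 3 bits.
print(sum_fits(17, 8, 17 + 3))  # enough bits
print(sum_fits(17, 8, 17 + 2))  # one bit too few
```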
  • each column includes 64 MAC nodes, so when the node unit group division operation is performed on each column in the MAC array, the 64 MAC nodes in the first column can be evenly divided into 8 node unit groups, each including 8 (2^3) MAC nodes. In this case, each node unit group can be connected to a first-stage adder, and the first-stage adders can include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, ..., and adder 7 corresponding to the eighth node unit group.
  • For the convolution operation device, the more complicated its structure and calculation logic are, the lower the clock frequency it can reach, and the harder it is to meet high-frequency hardware timing requirements. Therefore, when the MAC array in the convolution operation device is divided into node unit groups, the number of MAC nodes included in each node unit group and the number of adders corresponding to the node unit groups directly affect the data processing time of the convolution operation device; for example, the more adders there are, the worse the timing of the convolution operation device.
  • each column includes 64 MAC nodes, so when the node unit group division operation is performed on each column in the MAC array, the 64 MAC nodes in the first column are evenly divided into 8 node unit groups, and each node unit group includes 8 (2^3) MAC nodes.
  • each node unit group can be connected to a first-stage adder. The first-stage adders may include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, ..., and adder 7 corresponding to the eighth node unit group.
  • the first-stage adder is an 8-input adder, and the second-stage adder connected to the first-stage adders can be a 4 (2^2)-input adder. The second-stage adders can include adder 8, which is connected to adder 0, adder 1, adder 2 and adder 3, and adder 9, which is connected to adder 4, adder 5, adder 6 and adder 7.
  • the second-stage adder also includes an adder 10 connected to the above adder 8 and adder 9, the adder 10 is a 2 (2^1)-input adder, at this time, the adder
  • each column includes 64 MAC nodes, so when the node unit group division operation is performed on each column in the MAC array, the 64 MAC nodes in the first column can be evenly divided into 16 node unit groups, and each node unit group includes 4 (2^2) MAC nodes.
  • each node unit group can be connected to a first-stage adder. The first-stage adders may include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, ..., and adder 15 corresponding to the sixteenth node unit group.
  • the second-stage adder connected to the first-stage adders can be a 4 (2^2)-input adder, and the second-stage adders can comprise adder 16, adder 17, adder 18 and adder 19, wherein adder 16 is connected to adder 0, adder 1, adder 2 and adder 3; adder 17 is connected to adder 4, adder 5, adder 6 and adder 7; adder 18 is connected to adder 8, adder 9, adder 10 and adder 11; and adder 19 is connected to adder 12, adder 13, adder 14 and adder 15.
  • the second-stage adder also includes an adder 20 connected to the above-mentioned adder 16, adder 17, adder 18 and adder 19, and the adder 20 is a 4 (2^2) input 21b
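  • The two cascade layouts described above (8-input, 4-input and 2-input stages with adders 0 to 10, and three 4-input stages with adders 0 to 20) can be modelled with a short Python sketch. This is a behavioural illustration only: the function names and the `fan_ins` parameter are hypothetical, and the model ignores bit widths and timing.

```python
def cascade_sum(node_results, fan_ins):
    """Reduce the node calculation results of one column through cascaded
    adder stages; fan_ins[k] is the number of inputs of each adder in
    stage k."""
    level = list(node_results)
    for fan_in in fan_ins:
        if len(level) % fan_in != 0:
            raise ValueError("stage inputs must divide evenly among adders")
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
    if len(level) != 1:
        raise ValueError("the cascade must end in a single adder")
    return level[0]


def adders_per_stage(num_nodes, fan_ins):
    """Number of adders needed in each stage of the cascade."""
    counts, remaining = [], num_nodes
    for fan_in in fan_ins:
        remaining //= fan_in
        counts.append(remaining)
    return counts


column = list(range(64))                   # hypothetical node results
print(cascade_sum(column, (8, 4, 2)))      # groups of 8 MAC nodes
print(cascade_sum(column, (4, 4, 4)))      # groups of 4 MAC nodes
print(adders_per_stage(64, (8, 4, 2)))     # adders 0-7, 8-9, 10
print(adders_per_stage(64, (4, 4, 4)))     # adders 0-15, 16-19, 20
```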
  • the convolution operation device provided in this embodiment divides each column of the MAC array evenly into several node unit groups, so that the data resource of each adder connected to a node unit group matches, as closely as possible, the sum of the data bit width of the convolution calculation in the MAC node and N. N depends on the number of inputs of the adder: N is log2(number of adder inputs), so when the adder has 4 inputs, N is 2.
  • this yields a non-redundant cascaded adder structure based on the systolic array, that is, the adder resources included in the convolution operation device have no redundancy at all, which saves a lot of chip area, reduces power consumption and cost, further improves the quality and efficiency of the convolution operation performed by the convolution operation device, and guarantees the practicability of the convolution operation device.
  • the data resources supported by the adder are the data bit width of the convolution calculation in the MAC node and the ratio of N+1 and value.
  • each column includes 48 MAC nodes, so when performing the node unit group division operation on each column in the MAC array, the 48 MAC nodes in the first column can be evenly divided into 4 node unit groups, each node unit group including 12 (2^4 > 12 > 2^3) MAC nodes; at this time, each node unit group can be connected with a first-stage adder, the number of first-stage adders is 4, and each first-stage adder is used to accumulate the product results of the 12 MAC nodes in its node unit group.
  • in the convolution operation device, the MAC array is divided into several node unit groups; when the number of MAC nodes included in a node unit group is not the Nth power of 2, the data resources supported by the adder are the sum of the data bit width for convolution calculation in the MAC node and N+1.
  • although the adder resources included in the convolution operation device are then redundant, the degree of resource redundancy is effectively reduced. This also saves chip area, reduces power consumption and cost, further improves the quality and efficiency of the convolution operation performed by the convolution operation device, and ensures the practicability of the convolution operation device.
  • the convolution operation device effectively improves the utilization rate of MAC nodes and is more conducive to the improvement of data processing performance compared with the convolution operation device in the related art.
  • As shown in FIG. 9 and FIG. 10, taking a 64*64 MAC array as an example, when the data to be processed (ifm) is input to the MAC array in a triangular manner, it takes 128 clock cycles before every MAC node in the MAC array is running; when the data to be processed (ifm) is input to the MAC array in a rectangular manner, it takes only 64 clock cycles for every MAC node in the MAC array to be running.
  • because each column of MAC nodes in the convolution operation device works synchronously, a large number of systolic control signals are eliminated, and the quality and efficiency of the convolution operation performed by the convolution operation device are further improved.
  • the convolution operation device provided in this embodiment is not limited to the implementation described above; those skilled in the art can divide the MAC array according to specific application scenarios or application requirements and, after the division, can also configure the resources of the adder corresponding to each obtained node unit group, which further improves the flexibility of the convolution operation device.
  • the convolution operation device can not only perform convolution calculations on int8 precision data, but also perform convolution calculations on other types of data, such as int4 precision data, int16 precision data, etc., thereby expanding the applicable scope of the convolution operation device.
  • FIG. 11 is a schematic flow chart of a convolution operation method provided by an embodiment of the present invention. Referring to FIG. 11, this embodiment provides a convolution operation method, and the execution subject of the convolution operation method may be a convolution operation device. It can be understood that the convolution operation device can be implemented as software, or as a combination of software and hardware. Specifically, the convolution operation method can include:
  • Step S1101: Obtain the data to be processed, and determine, within one clock cycle and based on the data to be processed, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array.
  • the data to be processed refers to the data that requires convolution calculation.
  • the data to be processed can correspond to different data formats.
  • the data to be processed can be text data, image data, video data, etc.
  • the data to be processed can be obtained through the data acquisition node.
  • the data to be processed can be stored in a preset area, and the data acquisition node can obtain the data to be processed by accessing the preset area; or, the data to be processed is stored in a third device, and the data acquisition node communicates with the third device, so that the data acquisition node can obtain the data to be processed through the third device.
  • the data acquisition node can also acquire the data to be processed in other ways, as long as the accuracy and reliability of the acquisition of the data to be processed can be guaranteed, and details will not be repeated here.
  • the data to be processed can be sent to the MAC array, so that the MAC array can analyze and process the data to be processed, wherein the above-mentioned MAC array is not limited to a two-dimensional MAC array; it can also be a three-dimensional MAC array, etc.
  • a two-dimensional MAC array is used as an example for illustration. Since the MAC array includes multiple rows and multiple columns of MAC nodes, when the data acquisition node communicates with the MAC array, the data acquisition node communicates with the MAC nodes in the same column.
  • after the data acquisition node acquires the data to be processed, it can analyze and process the data to be processed, so as to determine, within one clock cycle, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. It can be understood that the node input data may be a part of the data to be processed.
  • Step S1102: Use the MAC nodes in the MAC array to perform convolution calculations on the node input data to obtain node calculation results, where the node calculation results are used to determine the convolution operation result corresponding to the data to be processed.
  • after the multiple MAC nodes in the MAC array obtain the node input data, the MAC nodes can perform convolution calculation on the node input data, so that the node calculation results can be obtained. It can be understood that a node calculation result is an intermediate result of the convolution calculation of the data to be processed.
  • after the node calculation results of the MAC nodes in the MAC array are obtained, the convolution operation result corresponding to the data to be processed can be determined based on the calculation results of all nodes, so that a stable convolution calculation of the data to be processed is effectively realized.
  • using the MAC nodes in the MAC array to perform convolution calculations on the node input data and obtaining the node calculation results may include: determining weight coefficients corresponding to the multiple MAC nodes in the MAC array; and using the MAC nodes and the weight coefficients to perform convolution calculation on the node input data to obtain the node calculation results.
  • the method in this embodiment may further include: determining the convolution operation result corresponding to the data to be processed based on the node processing result.
  • the convolution operation method provided in this embodiment obtains the data to be processed, determines, within one clock cycle and based on the data to be processed, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array, and then uses the MAC nodes in the MAC array to perform convolution calculations on the node input data to obtain the node calculation results. This effectively realizes the convolution operation of the data to be processed within one clock cycle, improves data processing efficiency, and reduces the power consumption and cost of data processing, thereby effectively solving the problems of long time consumption and low processing performance when data is input in a triangular manner, and improving the practicability of the convolution operation method.
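As a rough illustration of steps S1101 and S1102, the sketch below models one column of MAC nodes as a list of weight coefficients: every node multiplies its node input data by its weight, and the products are summed into one result for the column. The function names and the plain-list data layout are assumptions made for illustration only; the text does not specify them.

```python
def mac_node_products(node_inputs, weights):
    """Each MAC node multiplies its input value by its weight coefficient."""
    if len(node_inputs) != len(weights):
        raise ValueError("one input value is needed per MAC node")
    return [x * w for x, w in zip(node_inputs, weights)]


def column_result(node_inputs, weights):
    """Accumulate the node calculation results of one column."""
    return sum(mac_node_products(node_inputs, weights))


# A 3x3 window and a 3x3 kernel, both flattened onto a 9-node column.
patch = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 0, -1, 1, 0, -1, 1, 0, -1]
print(column_result(patch, kernel))
```

The per-node products correspond to the intermediate node calculation results, and their sum corresponds to one element of the convolution operation result.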
  • Fig. 12 is a schematic flowchart of another convolution operation method provided by the embodiment of the present invention.
  • before the MAC nodes in the MAC array are used to perform convolution calculation on the node input data, the method in this embodiment may further include:
  • Step S1201: Obtain division parameters for dividing each column of MAC nodes in the MAC array.
  • Step S1202: Divide each column in the MAC array into a plurality of node unit groups based on the division parameter, where each node unit group includes a plurality of MAC nodes.
  • node division can be performed on the MAC array.
  • each column of MAC nodes in the MAC array is divided according to a division parameter, which may be a parameter stored in a preset area, or a parameter configured by a user based on an application scenario or an application requirement.
  • each column in the MAC array may be divided into multiple node unit groups based on the division parameter, and the obtained node unit group includes multiple MAC nodes.
  • the number of MAC nodes included in the node unit groups corresponding to the same column in the MAC array is the same.
  • the numbers of node unit groups corresponding to different columns in the MAC array are the same or different.
  • the node unit groups can be used to perform convolution operations on the node input data, so that the node operation results can be obtained.
  • using the node unit groups to perform the convolution operation on the node input data helps improve the quality and efficiency of the convolution operation.
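Steps S1201 and S1202 can be sketched as follows. The sketch assumes that the division parameter is simply the number of MAC nodes per node unit group, which the text does not state explicitly, and `divide_column` is a hypothetical helper name.

```python
def divide_column(num_nodes, group_size):
    """Split node indices 0..num_nodes-1 into equally sized node unit groups.

    Assumption: group_size plays the role of the division parameter.
    """
    if group_size <= 0 or num_nodes % group_size != 0:
        raise ValueError("group_size must evenly divide num_nodes")
    return [list(range(start, start + group_size))
            for start in range(0, num_nodes, group_size)]


groups_64 = divide_column(64, 4)    # 16 node unit groups of 4 MAC nodes
groups_48 = divide_column(48, 12)   # 4 node unit groups of 12 MAC nodes
print(len(groups_64), len(groups_48))
```

Because every group in a column has the same size, the same adder configuration can be reused for all groups of that column.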
  • Fig. 13 is a schematic flow diagram of using the MAC nodes in the MAC array to perform convolution calculations on the node input data to obtain the node calculation results, provided by the embodiment of the present invention.
  • this embodiment provides an implementation of using the MAC nodes in the MAC array to perform convolution calculation on the node input data.
  • using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain the node calculation results can include:
  • Step S1301: Obtain at least one adder corresponding to the node unit group in the MAC array.
  • Step S1302: Use the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain the node processing results corresponding to each column in the MAC array.
  • at least one adder corresponding to the node unit group in the MAC array can be obtained first. In some examples, obtaining at least one adder corresponding to the node unit group in the MAC array can include: obtaining the number of MAC nodes included in the node unit group; determining the data bit width for convolution calculation in the MAC nodes corresponding to the at least one adder; and determining, based on the number of MAC nodes and the data bit width, the data resources that the adder can support.
  • determining the data resources supported by the adder may include: when the number of MAC nodes included in the node unit group is the Nth power of 2, determining the data resources that the adder can support based on the sum of the data bit width and N; when the number of MAC nodes included in the node unit group is not the Nth power of 2, determining the data resources that the adder can support based on the sum of the data bit width and N+1.
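The bit-width rule above can be written as a small function. This is a minimal sketch: `adder_bit_width` is a hypothetical name, and the 16-bit data width used in the example is an assumed value chosen for illustration, not a figure taken from the text. For a group size that is not a power of 2, N is taken as the exponent satisfying 2^N < size < 2^(N+1), as in the 12-node example (2^4 > 12 > 2^3).

```python
def adder_bit_width(num_nodes, data_bit_width):
    """Bit width needed by an adder that sums num_nodes values.

    If num_nodes == 2**N, the width is data_bit_width + N.
    Otherwise, with 2**N < num_nodes < 2**(N + 1), it is data_bit_width + N + 1.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be positive")
    n = num_nodes.bit_length() - 1                      # floor(log2(num_nodes))
    is_power_of_two = (num_nodes & (num_nodes - 1)) == 0
    return data_bit_width + n if is_power_of_two else data_bit_width + n + 1


DATA_BITS = 16  # assumed width of each MAC node result, for illustration only
print(adder_bit_width(4, DATA_BITS))    # group of 4 = 2**2 nodes
print(adder_bit_width(12, DATA_BITS))   # group of 12 nodes, 2**3 < 12 < 2**4
print(adder_bit_width(16, DATA_BITS))   # group of 16 = 2**4 nodes
```

With these assumed numbers a 12-node group needs the same adder width as a 16-node group, which is the kind of redundancy that power-of-2 group sizes avoid.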
  • the at least one adder can be used to accumulate the calculation results of the MAC nodes in the node unit group, so that the node processing result corresponding to each column in the MAC array can be obtained stably and effectively, which further ensures the accuracy of the node processing results.
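Steps S1301 and S1302 amount to a grouped accumulation, sketched below under simplifying assumptions: the first-stage adders each sum one node unit group, and all later stages are collapsed into a single final sum. The function name and the product values are made up for illustration.

```python
def accumulate_column(products, group_size):
    """Sum a column's MAC products group by group, then combine the groups."""
    if len(products) % group_size != 0:
        raise ValueError("group_size must evenly divide the column length")
    # First-stage adders: one partial sum per node unit group.
    group_sums = [sum(products[i:i + group_size])
                  for i in range(0, len(products), group_size)]
    # Later stages are collapsed into one addition of the partial sums.
    return group_sums, sum(group_sums)


products = list(range(1, 49))            # 48 made-up product values
group_sums, column_total = accumulate_column(products, 12)
print(group_sums, column_total)
```

The grouped result equals the plain sum of all products in the column; the grouping only changes how the additions are arranged in hardware.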
  • FIG. 14 is a schematic flow diagram of an image processing method based on a multiplier MAC array provided by an embodiment of the present invention. Referring to FIG. 14, this embodiment provides an image processing method based on a multiplier MAC array, which can be executed by an image processing device based on the multiplier MAC array. It can be understood that the image processing device can be implemented as software or as a combination of software and hardware. Specifically, the image processing method based on the multiplier MAC array may include:
  • Step S1401: Acquire the image to be processed, and determine, within one clock cycle and based on the image to be processed, the node input data corresponding to each MAC node among the plurality of MAC nodes in the MAC array.
  • Step S1402: Use the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain the node calculation results.
  • Step S1403: Determine the convolution processing result corresponding to the image to be processed based on the node calculation results.
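Steps S1401 to S1403 can be pictured with a plain software model. The sketch below is not the hardware data path: it assumes a single-channel image stored as nested lists, no padding, stride 1, and the sliding-window multiply-accumulate form commonly used in neural networks (no kernel flip). All names are hypothetical.

```python
def extract_patches(image, k):
    """Yield (row, col, flattened k x k window) for every valid position."""
    height, width = len(image), len(image[0])
    for r in range(height - k + 1):
        for c in range(width - k + 1):
            yield r, c, [image[r + i][c + j] for i in range(k) for j in range(k)]


def convolve_image(image, kernel):
    """Sliding-window multiply-accumulate over the image."""
    k = len(kernel)
    weights = [w for row in kernel for w in row]     # one weight per MAC node
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    output = [[0] * out_w for _ in range(out_h)]
    for r, c, patch in extract_patches(image, k):
        # Node input data times weights, accumulated into one output pixel.
        output[r][c] = sum(x * w for x, w in zip(patch, weights))
    return output


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
kernel = [[1, 0],
          [0, -1]]
print(convolve_image(image, kernel))
```

Each flattened window plays the role of the node input data for one group of MAC nodes, and each output pixel is one accumulated node calculation result.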
  • FIG. 15 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention. Referring to FIG. 15, this embodiment provides a convolution operation device, which is used to execute the convolution operation method shown in FIG. 11. Specifically, the convolution operation device may include:
  • the first memory 12 is used to store computer programs
  • the first processor 11 is configured to run the computer program stored in the first memory 12 to realize:
  • the MAC nodes in the MAC array are used to perform convolution calculation on the input data of the nodes to obtain node calculation results, and the node calculation results are used to determine the convolution operation results corresponding to the data to be processed.
  • the structure of the electronic device may further include a first communication interface 13, which is used for the electronic device to communicate with other devices or a communication network.
  • the first processor 11 in this embodiment is also configured to perform: obtaining division parameters for dividing each column of MAC nodes in the MAC array; and dividing, based on the division parameters, each column in the MAC array into multiple node unit groups, where each node unit group includes multiple MAC nodes.
  • the number of MAC nodes included in the node unit groups corresponding to the same column in the MAC array is the same.
  • the numbers of node unit groups corresponding to different columns in the MAC array are the same or different.
  • when the first processor 11 uses the MAC nodes in the MAC array to perform convolution calculations on the node input data to obtain the node calculation results, the first processor 11 is used to perform: acquiring at least one adder corresponding to the node unit group in the MAC array; and using the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain the node processing results corresponding to each column in the MAC array.
  • when the first processor 11 acquires at least one adder corresponding to the node unit group in the MAC array, the first processor 11 is configured to: acquire the number of MAC nodes included in the node unit group; determine the data bit width for convolution calculation in the respective MAC nodes corresponding to the at least one adder; and determine, based on the number of MAC nodes and the data bit width, the data resources that the adder can support.
  • when the first processor 11 determines the data resources that the adder can support based on the number of MAC nodes and the data bit width, the first processor 11 is configured to perform: when the number of MAC nodes included in the node unit group is the Nth power of 2, determining the data resources supported by the adder based on the sum of the data bit width and N; when the number of MAC nodes included in the node unit group is not the Nth power of 2, determining the data resources supported by the adder based on the sum of the data bit width and N+1.
  • when the first processor 11 uses the MAC nodes in the MAC array to perform convolution calculations on the node input data to obtain the node calculation results, the first processor 11 is used to perform: determining the weight coefficients corresponding to the multiple MAC nodes in the MAC array; and using the MAC nodes and the weight coefficients to perform convolution calculations on the node input data to obtain the node calculation results.
  • the first processor 11 is configured to: determine a convolution operation result corresponding to the data to be processed based on the node processing result.
  • the device shown in FIG. 15 can execute the method of the embodiment shown in FIG. 11-FIG. 13.
  • for the parts not described in detail in this embodiment, refer to the relevant description of the embodiment shown in FIG. 11-FIG. 13.
  • FIG. 16 is a schematic structural diagram of an image processing device based on a multiplier MAC array provided by an embodiment of the present invention. Referring to FIG. 16, this embodiment provides an image processing device based on a multiplier MAC array, which is used to perform the image processing method based on the multiplier MAC array shown in FIG. 14. Specifically, the image processing device based on the multiplier MAC array may include:
  • the second memory 22 is used to store computer programs
  • the second processor 21 is configured to run the computer program stored in the second memory 22 to realize:
  • a convolution processing result corresponding to the image to be processed is determined based on the node calculation result.
  • the structure of the electronic device may further include a second communication interface 23 for the electronic device to communicate with other devices or a communication network.
  • the device shown in FIG. 16 can execute the method of the embodiment shown in FIG. 14 .
  • for the parts not described in detail in this embodiment, refer to the relevant description of the embodiment shown in FIG. 14.
  • an embodiment of the present invention provides a computer storage medium, which is used to store computer software instructions used by electronic devices, including the programs involved in executing the convolution operation method in the method embodiments shown in FIGS. 11-13 above.
  • an embodiment of the present invention provides a computer storage medium, which is used to store computer software instructions used by electronic devices, including the programs involved in executing the image processing method based on the multiplier MAC array in the method embodiment shown in FIG. 14.
  • the disclosed related detection devices and methods can be implemented in other ways.
  • the above-described embodiment of the detection device is only illustrative.
  • the division of the modules or units is only a logical function division.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of detection devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium, including several instructions for causing a computer processor to execute all or part of the steps of the method described in each embodiment of the present invention.
  • the aforementioned storage medium includes: a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program codes.

Abstract

A convolution operation method and apparatus, an image processing method and apparatus, and a storage medium. The convolution operation apparatus comprises a data acquisition node and a MAC array; the data acquisition node is used for acquiring data to be processed, and determining, in a clock cycle on the basis of the data to be processed, node input data corresponding to each of a plurality of MAC nodes in the multiplier MAC array; the MAC array is connected to the data acquisition node and used for acquiring node input data by means of the data acquisition node, and performing convolution calculation on the corresponding node input data by means of the MAC nodes to obtain node calculation results, wherein the node calculation results are used for determining a convolution operation result corresponding to the data to be processed. According to the method, redundancy in adder resources can be eliminated, and it is beneficial to reducing the chip area, reducing the power consumption, and reducing the data processing cost; in addition, all nodes can be operated in a short time, and it is beneficial to improving the data processing performance.

Description

Convolution operation method, image processing method, device and storage medium
Technical field
The embodiments of the present invention relate to the technical field of data processing, and in particular to a convolution operation method, an image processing method, a device, and a storage medium.
Background
With the rapid development of science and technology, artificial intelligence technology is being applied ever more widely. When it is applied, whether on the device side or on the cloud side, there are strict requirements on power consumption and cost. For example, device-side consumers are price-sensitive; the lower the price, the higher the power consumption, and higher power consumption shortens standby time. Cloud-side users have large computing demands and deploy large numbers of chips: if a single chip consumes a lot of power, the cumulative power consumption is large; if a single chip is expensive, the cumulative cost is even higher.
Summary of the invention
Embodiments of the present invention provide a convolution operation method, an image processing method, a device, and a storage medium, which can improve data processing efficiency and reduce the power consumption and cost of data processing.
A first aspect of the present invention provides a convolution operation device, comprising:
a data acquisition node, configured to acquire data to be processed, and to determine, within one clock cycle and based on the data to be processed, node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array;
a MAC array, connected to the data acquisition node, configured to acquire the node input data through the data acquisition node and to perform convolution calculation on the node input data to obtain node calculation results, where the node calculation results are used to determine the convolution operation result corresponding to the data to be processed.
A second aspect of the present invention provides a convolution operation method, comprising:
acquiring data to be processed, and determining, within one clock cycle and based on the data to be processed, node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array;
using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, where the node calculation results are used to determine the convolution operation result corresponding to the data to be processed.
A third aspect of the present invention provides an image processing method based on a multiplier MAC array, comprising:
acquiring an image to be processed, and determining, within one clock cycle and based on the image to be processed, node input data corresponding to each of the multiple MAC nodes in the MAC array;
using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results;
determining a convolution processing result corresponding to the image to be processed based on the node calculation results.
A fourth aspect of the present invention provides a convolution operation device, comprising:
a memory, configured to store a computer program;
a processor, configured to run the computer program stored in the memory to implement:
acquiring data to be processed, and determining, within one clock cycle and based on the data to be processed, node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array;
using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, where the node calculation results are used to determine the convolution operation result corresponding to the data to be processed.
A fifth aspect of the present invention provides an image processing device based on a multiplier MAC array, comprising:
a memory, configured to store a computer program;
a processor, configured to run the computer program stored in the memory to implement:
acquiring an image to be processed, and determining, within one clock cycle and based on the image to be processed, node input data corresponding to each of the multiple MAC nodes in the MAC array;
using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results;
determining a convolution processing result corresponding to the image to be processed based on the node calculation results.
A sixth aspect of the present invention provides a computer-readable storage medium in which program instructions are stored, the program instructions being used to implement the convolution operation method of the second aspect.
A seventh aspect of the present invention provides a computer-readable storage medium in which program instructions are stored, the program instructions being used to implement the image processing method based on the multiplier MAC array of the third aspect.
In the technical solution provided by the embodiments of the present invention, the data acquisition node determines, within one clock cycle, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. After the node input data is acquired, the MAC nodes included in the MAC array can simultaneously perform convolution calculation on their respective node input data within one clock cycle to obtain the node calculation results. This improves data processing efficiency and reduces the power consumption and cost of data processing, which effectively solves the problems of long time consumption and low processing performance that exist when data is input in a triangular manner, and in turn improves the quality and efficiency of the convolution operation.
Description of drawings
The drawings described here are used to provide a further understanding of this application and constitute a part of this application. The schematic embodiments of this application and their descriptions are used to explain this application and do not constitute an improper limitation of this application. In the drawings:
FIG. 1 is a schematic diagram of the principle of convolution calculation provided by an embodiment in the related art;
FIG. 2 is a schematic diagram of a systolic array for realizing convolution calculation provided by an embodiment in the related art;
FIG. 3 is a schematic diagram of inputting feature maps to different rows of a systolic array sequentially at intervals, provided by an embodiment in the related art;
FIG. 4 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of inputting node input data to a MAC array provided by an embodiment of the present invention;
FIG. 6 is a first schematic diagram of cascading node unit groups and adders provided by an embodiment of the present invention;
FIG. 7 is a second schematic diagram of cascading node unit groups and adders provided by an embodiment of the present invention;
FIG. 8 is a third schematic diagram of cascading node unit groups and adders provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram of the clock cycles needed for the MAC nodes in a MAC array to run, as provided in the related art;
FIG. 10 is a schematic diagram of the clock cycles needed for the MAC nodes in a MAC array to run, as provided in an embodiment of the present invention;
FIG. 11 is a schematic flowchart of a convolution operation method provided by an embodiment of the present invention;
FIG. 12 is a schematic flowchart of another convolution operation method provided by an embodiment of the present invention;
FIG. 13 is a schematic flowchart of using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, provided by an embodiment of the present invention;
FIG. 14 is a schematic flowchart of an image processing method based on a multiplier MAC array provided by an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of an image processing device based on a multiplier MAC array provided by an embodiment of the present invention.
Detailed Description
To make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the technical field of the present invention. The terms used in this specification are for the purpose of describing specific embodiments only and are not intended to limit the present invention.
To facilitate understanding of the specific implementation process and technical effects of the technical solution in this embodiment, the principle of convolution calculation is first explained with reference to Fig. 1:
After a feature map is obtained, convolution calculation can be performed on it to obtain a convolution result. To compute each point of the output feature map, the convolution kernel is applied to the points at the corresponding positions of the input feature map. If there are multiple input feature maps, the results computed at the same position of every feature map must be accumulated to obtain the value of that point in the output feature map. As shown in Fig. 1, when convolution is performed on one input feature map, two points of the output feature map can be calculated.
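As an illustration only, the accumulation described above can be sketched in Python. The function name `conv_output_point`, the stride of 1, the absence of padding, and the sample values are all assumptions made for this sketch and are not taken from the embodiments.

```python
def conv_output_point(feature_maps, kernels, row, col):
    """Accumulate one output point over all input feature maps.

    Assumes stride 1, no padding, and one kernel per input feature map.
    """
    total = 0
    for fmap, kernel in zip(feature_maps, kernels):
        for i, kernel_row in enumerate(kernel):
            for j, weight in enumerate(kernel_row):
                # Multiply the kernel weight by the point at the matching position.
                total += weight * fmap[row + i][col + j]
    return total


# Two hypothetical 3x3 input feature maps, each with its own 2x2 kernel.
feature_maps = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
kernels = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]

# Two neighbouring points of the output feature map.
point_00 = conv_output_point(feature_maps, kernels, 0, 0)
point_01 = conv_output_point(feature_maps, kernels, 0, 1)
```

Each output point is a sum of multiply-accumulate steps, which is the operation that the MAC nodes discussed later carry out in hardware.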
In a specific implementation, a convolutional neural network that implements convolution calculation can be used to analyze and process the input feature map. The convolutional neural network is briefly introduced below. It may include:
A fully connected layer (FC layer), which implements multiply-accumulate operations, where every output point uses every input point.
A first fully connected layer (1stFC layer), which implements multiply-accumulate operations, where every output point uses every point of every map.
Other fully connected layers (elsFC layers), which implement multiply-accumulate operations, where every output point uses every point of the input one-dimensional data.
To improve computing efficiency, a fully connected layer can perform batch processing, which reuses the weight values held in the computing units and allows multiple pieces of data to be processed at the same time. Specifically, given an image, each recognition of one image is one batch-processing pass. This saves some data accesses, reduces power consumption, saves bandwidth, and allows some data to be reused.
In some scenarios, to simplify the design, batch processing for the 1stFC layer is converted into batch processing for an elsFC layer. The map-shaped input feature maps of the 1stFC layer are unrolled in advance into a continuous, point-shaped feature map to meet the input requirements of the elsFC layer.
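The unrolling and weight reuse described above can be sketched as follows. This is a minimal Python illustration under assumed names (`flatten`, `fc_batch`) and made-up data; it models only the arithmetic, not the hardware.

```python
def flatten(feature_maps):
    """Unroll map-shaped feature maps into one continuous list of points."""
    return [point for fmap in feature_maps for row in fmap for point in row]


def fc_batch(weights, batch):
    """Fully connected layer over a batch: out[b][j] = sum_i weights[j][i] * batch[b][i]."""
    outputs = [[0] * len(weights) for _ in batch]
    for j, weight_row in enumerate(weights):   # fetch one set of weights once
        for b, inputs in enumerate(batch):     # reuse it for every input in the batch
            outputs[b][j] = sum(w * x for w, x in zip(weight_row, inputs))
    return outputs


# Two hypothetical inputs, each a single 2x2 feature map, unrolled to 4 points.
batch = [flatten([[[1, 2], [3, 4]]]), flatten([[[0, 1], [1, 0]]])]
weights = [[1, 1, 1, 1], [1, 0, 0, -1]]
outputs = fc_batch(weights, batch)
```

The outer loop over weight rows shows why batching saves accesses: each weight row is read once and applied to every input in the batch.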
In a specific implementation, convolution calculation can be described with a systolic array. As shown in Fig. 2, the systolic array includes:
An input data loading node (IFM_LOAD), which loads the input data to be processed, for example the feature map IFM;
A weight loading node (WEIGHT_LOAD), which loads the weights;
Several multiply-accumulate nodes (MAC nodes), which form a systolic computing array. In this systolic array, the feature map IFM is passed horizontally in a pipelined, systolic manner, meaning that for the pipelined IFM input the MAC nodes in each row reuse the IFM one after another; the weight coefficients are passed vertically through the systolic array in a pipelined, systolic manner.
When a systolic array is used to analyze and process a feature map, as shown in Fig. 3, N-0 is the 0th ifm data of the Nth row of an image and N-1 is the 1st ifm data of the Nth row. The feature map IFM is input to successive rows of the systolic array at intervals of one cycle, so that the MAC node in one row can pass its calculation result to the MAC node in the next row, where the result is accumulated and then passed on to the MAC node in the following row.
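The staggered input described above can be illustrated with a small scheduling sketch. The function `skewed_schedule` and the array size are assumptions chosen for illustration; the sketch only records which row receives which ifm point in which cycle.

```python
def skewed_schedule(num_rows, num_points):
    """Map each cycle to the (row, point index) pairs fed in that cycle.

    Row r starts r cycles after row 0, so point k of row r arrives in cycle r + k.
    """
    schedule = {}
    for row in range(num_rows):
        for k in range(num_points):
            schedule.setdefault(row + k, []).append((row, k))
    return schedule


# Hypothetical array with 4 rows, each receiving 3 points.
schedule = skewed_schedule(num_rows=4, num_points=3)
total_cycles = len(schedule)
```

In this sketch only row 0 is fed in cycle 0, and the last row does not receive its first point until cycle `num_rows - 1`, which gives the input its triangular shape.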
For example, if the input feature map data ifm and the weight information are both signed int8 numbers, each MAC node performs one multiplication (int8*int8) and one addition. To remain compatible with int16*int16 convolution, each int16 is split into (mas_8b, lsb_8b), so that an int16*int16 multiplication becomes (mas_8b, lsb_8b)*(mas_8b, lsb_8b). mas_8b is the high byte of the int16 and carries the sign bit; extending its sign bit to the 9th bit turns it into an int9. lsb_8b is the low byte of the int16 and has no sign bit; one 0 bit must be added as its highest bit, which also turns it into an int9. The multiplier therefore has to change from an int8*int8 multiplier to an int9*int9 multiplier. Because the lsb_int9*lsb_int9 term used to implement int16*int16 requires a 17-bit product, the multiplier specification is unified as int9*int9=int17.
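The decomposition above can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, and the recombination by shifts of 16 and 8 bits is the standard long-multiplication identity rather than a step stated in the text.

```python
def split_int16(x):
    """Split an int16 into a signed high byte and an unsigned low byte.

    Both parts fit in a signed 9-bit value (int9), range -256..255.
    """
    assert -32768 <= x <= 32767
    high = x >> 8    # arithmetic shift keeps the sign: -128..127
    low = x & 0xFF   # 0..255, non-negative once a 0 bit is added on top
    return high, low


def mul_int16_via_int9(a, b):
    """Multiply two int16 values using four int9*int9 partial products."""
    a_high, a_low = split_int16(a)
    b_high, b_low = split_int16(b)
    partials = [a_high * b_high, a_high * b_low, a_low * b_high, a_low * b_low]
    # Every partial product fits in a signed 17-bit value (int17).
    assert all(-(1 << 16) <= p < (1 << 16) for p in partials)
    return (partials[0] << 16) + ((partials[1] + partials[2]) << 8) + partials[3]


samples = [-32768, -12345, -1, 0, 1, 255, 256, 6789, 32767]
all_match = all(mul_int16_via_int9(a, b) == a * b for a in samples for b in samples)
```

The largest partial product is 255*255 = 65025, which needs 16 magnitude bits plus a sign bit, consistent with the int17 product width given above.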
In this way, the int17 product is added to the calculation result passed down from the MAC node in the previous row to give the partial sum psum_out, which is an intermediate result; once all intermediate results have been accumulated, the final sum is obtained. For ease of description, the data processing is explained using a 64x64 MAC array that analyzes and processes a feature map. The MAC array contains 64 rows and 64 columns of MAC nodes, and the adder resources for the MAC nodes of each row are as follows:
Row 0: no input partial sum psum_in and no adder is needed; the output partial sum psum_out is 17b;
Row 1: the input partial sum psum_in is 17b, the adder resource is 17b+17b=18b, and the output partial sum psum_out is 18b;
Row 2: the input partial sum psum_in is 18b, the adder resource is 17b+18b=19b, and the output partial sum psum_out is 19b;
Row 3: the input partial sum psum_in is 19b, the adder resource is 17b+19b=19b, and the output partial sum psum_out is 19b;
Row 4: the input partial sum psum_in is 19b, the adder resource is 17b+19b=20b, and the output partial sum psum_out is 20b;
Row 5: the input partial sum psum_in is 20b, the adder resource is 17b+20b=20b, and the output partial sum psum_out is 20b;
Row 6: the input partial sum psum_in is 20b, the adder resource is 17b+20b=20b, and the output partial sum psum_out is 20b;
Row 7: the input partial sum psum_in is 20b, the adder resource is 17b+20b=20b, and the output partial sum psum_out is 20b;
Row 8: the input partial sum psum_in is 20b, the adder resource is 17b+20b=21b, and the output partial sum psum_out is 21b;
Row 9: the input partial sum psum_in is 21b, the adder resource is 17b+21b=21b, and the output partial sum psum_out is 21b;
......
Row 15: the input partial sum psum_in is 21b, the adder resource is 17b+21b=21b, and the output partial sum psum_out is 21b;
Row 16: the input partial sum psum_in is 21b, the adder resource is 17b+21b=22b, and the output partial sum psum_out is 22b;
Row 17: the input partial sum psum_in is 22b, the adder resource is 17b+22b=22b, and the output partial sum psum_out is 22b;
......
Row 31: the input partial sum psum_in is 22b, the adder resource is 17b+22b=22b, and the output partial sum psum_out is 22b;
Row 32: the input partial sum psum_in is 22b, the adder resource is 17b+22b=23b, and the output partial sum psum_out is 23b;
Row 33: the input partial sum psum_in is 23b, the adder resource is 17b+23b=23b, and the output partial sum psum_out is 23b;
......
Row 63: the input partial sum psum_in is 23b, the adder resource is 17b+23b=23b, and the output partial sum psum_out is 23b.
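The widths listed above follow a regular pattern: accumulating the products of rows 0 to r, that is r + 1 products of 17 bits each, needs 17 + ceil(log2(r + 1)) bits. This closed form is an inference from the listed rows rather than a formula stated in the text, and the names in the following Python sketch are hypothetical.

```python
PRODUCT_BITS = 17  # width of one int9*int9 product


def psum_out_bits(row):
    """Bits needed for the partial sum leaving the given row.

    The partial sum accumulates row + 1 products; for row >= 0,
    row.bit_length() equals ceil(log2(row + 1)).
    """
    return PRODUCT_BITS + row.bit_length()


# Output partial-sum width for every row of a 64-row column.
widths = {row: psum_out_bits(row) for row in range(64)}
```

Under this assumption the sketch reproduces every row listed above, including the jumps at rows 1, 2, 4, 8, 16 and 32.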
In the above implementation, each MAC node includes an adder, and during data operations the adder resources in the MAC nodes contain a large amount of redundancy. For example, in row 2 the adder resource is 17b+18b=19b. Only 18b+18b would fully use a 19b adder, but a 19b adder must be used here because one addend is 18b, so the adder resource is redundant. In row 3 the adder resource is 17b+19b=19b and the result does not need 20b, so the adder resource has no redundancy. Similarly, in row 4 the adder resource is 17b+19b=20b. Only 19b+19b would fully use a 20b adder, but a 20b adder must be used here because one addend is 19b, so the adder resource is redundant. The same redundancy exists in the adders of the MAC nodes in rows 5 and 6, whereas in row 7 the adder resource has no redundancy. The same applies to rows 8 to 15, rows 16 to 31, and rows 32 to 63.
In summary, the implementation of convolution calculation in the related art has the following defects:
(1) Each MAC node in the MAC array includes an adder, and the adder resources are redundant, which enlarges the chip area, raises power consumption, and increases cost.
(2) When the MAC array is used to analyze and process input data, the input data (IFM) is input in a triangular pattern, so it takes a long time before all MAC nodes are in use, which lowers data processing performance.
(3) Because the input data (IFM) is input in a triangular pattern, the input data loading node has to account for the input cycle of each piece of input data, which complicates the design logic of the input data loading node and also increases area and power consumption.
(4) The MAC nodes in each column work in a systolic manner. Some enable signals, for example the weight loading signal, consume one register (a control signal register) in every MAC node of the column so that the enable signal can be passed on systolically. This consumes a large number of registers and occupies considerable resources.
To solve the above technical problems, this embodiment provides a convolution operation method, an image processing method, a device, and a storage medium. The convolution operation device includes a data acquisition node and a MAC array connected to the data acquisition node. The data acquisition node is used to acquire the data to be processed and, based on the data to be processed, to determine within one clock cycle the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. The MAC array is used to obtain the node input data through the data acquisition node and to perform convolution calculation on the node input data to obtain node calculation results, which are used to determine the convolution operation result corresponding to the data to be processed.
In the technical solution provided by this embodiment, the data acquisition node determines, within one clock cycle, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. After the node input data is obtained, the MAC nodes in the MAC array can simultaneously perform convolution calculation on their respective node input data within one clock cycle to obtain node calculation results. This improves data processing efficiency and reduces data processing power consumption and cost, which effectively solves the long time consumption and low processing performance associated with triangular data input, and in turn improves the quality and efficiency of the convolution operation method.
Some implementations of the convolution operation method, image processing method, device, and storage medium of the present invention are described in detail below with reference to the accompanying drawings. Where no conflict arises between embodiments, the following embodiments and the features in them can be combined with one another.
Fig. 4 is a schematic structural diagram of a convolution operation device provided by an embodiment of the present invention, and Fig. 5 is a schematic diagram of inputting node input data to a MAC array provided by an embodiment of the present invention. With reference to Figs. 4 and 5, this embodiment provides a convolution operation device that can perform convolution processing on data to be processed and achieves the following: (1) it eliminates or reduces the resource redundancy in the adders, which reduces chip area, lowers power consumption, and lowers data processing cost; (2) the input data (IFM) of the device is input within one clock cycle, which realizes rectangular input, so that all MAC nodes can be put to use within a short time, shortening the time consumed and improving processing performance; (3) because the input data (IFM) is input in a rectangular pattern, the design logic of the data acquisition node is simple when data is input to the MAC array, which also yields area and power benefits; (4) the MAC nodes in each column of the MAC array work synchronously, so some enable signals, for example the weight loading signal, need only one register to drive all MAC nodes in the column, which reduces the number of registers and the resources occupied.
Specifically, the convolution operation device may include a data acquisition node and a MAC array connected to the data acquisition node. The data acquisition node is used to acquire the data to be processed and, based on the data to be processed, to determine within one clock cycle the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. The MAC array is used to obtain the node input data through the data acquisition node and to perform convolution calculation on the node input data to obtain node calculation results, which are used to determine the convolution operation result corresponding to the data to be processed.
The data to be processed is the data on which convolution calculation needs to be performed. In different application scenarios it may be of different types, for example text data, image data, or video data. When a user needs convolution calculation on the data to be processed, the data can be obtained through the data acquisition node. Specifically, the data to be processed may be stored in a preset area, and the data acquisition node obtains it by accessing the preset area. Alternatively, the data to be processed is stored in a third device to which the data acquisition node is communicatively connected, so that the data acquisition node can obtain the data through the third device. The data acquisition node may also obtain the data to be processed in other ways, as long as the data is obtained accurately and reliably; details are not repeated here.
After the data acquisition node obtains the data to be processed, it can send the data to the MAC array so that the MAC array can analyze and process it. The MAC array is not limited to a two-dimensional MAC array and may also be, for example, a three-dimensional MAC array. For ease of understanding, a two-dimensional MAC array is used as the example. Because the MAC array includes multiple rows and multiple columns of MAC nodes, when the data acquisition node is communicatively connected to the MAC array, it is connected to the MAC nodes of the same column. To enable the MAC nodes of the same column to analyze and process the data to be processed within one clock cycle, the data acquisition node, after obtaining the data to be processed, can analyze and process it so as to determine within one clock cycle the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array.
It can be understood that, for each row and each column of nodes in the MAC array, the node input data of the first MAC node may be a part of the data to be processed, whereas the node input data of a MAC node other than the first may be determined from the output of the preceding MAC node, and that output is in turn determined from the node input data of the first MAC node. Therefore, once the data to be processed is obtained, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array can be determined from it.
For example, as shown in Fig. 5, for image data to be processed, the node input data can be determined from the image data. The node input data may include data 0-1 corresponding to the first row of MAC nodes, data 1-1 corresponding to the second row of MAC nodes, ... and data N-1 corresponding to the Nth row of MAC nodes, where data N-N denotes the Nth data of the Nth row of the image data. The node input data can then be transmitted to the MAC array within one clock cycle for analysis and processing.
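By way of illustration, rectangular input can be sketched as selecting one element for every row in the same cycle. The function names and the sample image are assumptions; the cycle-count comparison assumes a one-cycle delay per row for the triangular case, as in the related art described earlier.

```python
def rectangular_input(data, cycle):
    """Node input data delivered in one clock cycle: one element for every row at once."""
    return [row[cycle] for row in data]


def cycles_needed(num_rows, points_per_row, skewed):
    """Cycles until the last row has received its last point."""
    return points_per_row + (num_rows - 1 if skewed else 0)


# Hypothetical image data: 4 rows with 3 points per row.
image = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
first_cycle = rectangular_input(image, 0)
rect_cycles = cycles_needed(4, 3, skewed=False)
tri_cycles = cycles_needed(4, 3, skewed=True)
```

With rectangular input every row is active from the first cycle, so the number of input cycles no longer grows with the number of rows.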
After the multiple MAC nodes in the MAC array obtain the node input data, each MAC node can perform convolution calculation on its node input data to obtain a node calculation result. It can be understood that a node calculation result is an intermediate result of the convolution calculation on the data to be processed. Once the node calculation results produced by the MAC nodes in the MAC array have been obtained, the convolution operation result corresponding to the data to be processed can be determined from all of the node calculation results, which effectively realizes a stable convolution calculation on the data to be processed.
In some examples, the convolution operation device in this embodiment may further include a weight cache node connected to the MAC array. The weight cache node is used to determine the weight coefficient corresponding to each of the multiple MAC nodes in the MAC array and then to transmit the weight coefficients to the MAC array, so that the MAC array can use the MAC nodes and the weight coefficients to perform convolution calculation on the node input data and obtain the node calculation results.
In still other examples, after the node calculation results are obtained, the convolution operation result can be determined from them. Specifically, the convolution operation device in this embodiment may further include a data processing node connected to the MAC array. The data processing node is used to determine, based on the node calculation results, the convolution operation result corresponding to the data to be processed, which effectively ensures that the convolution operation result is determined accurately and reliably.
In the convolution operation device provided by this embodiment, the data acquisition node determines, within one clock cycle, the node input data corresponding to each of the multiple MAC nodes in the multiplier MAC array. After the node input data is obtained, the MAC nodes in the MAC array can simultaneously perform convolution calculation on their respective node input data within one clock cycle to obtain node calculation results, which can be used to determine the convolution operation result corresponding to the data to be processed. In this way, the data to be processed can be transmitted to the MAC array as rectangular input within one clock cycle, so that the MAC array obtains the node input data at the initial moment and within the same clock cycle. Data processing efficiency is also improved and data processing power consumption and cost are reduced, which effectively solves the long time consumption and low processing performance associated with triangular data input and improves the practicability of the convolution operation device.
In some examples, with reference to Fig. 6, to further address the redundancy of adder resources in the MAC nodes of the prior art, each column of the MAC array can be divided into multiple node unit groups when the MAC array is used to analyze and process the node input data. Each node unit group includes multiple MAC nodes and corresponds to at least one adder, and the adder accumulates the calculation results of the MAC nodes in the node unit group, so that the node processing result corresponding to each column of the MAC array can be obtained.
When each column of MAC nodes in the MAC array is divided into multiple node unit groups, division parameters for the node unit groups can be obtained. The division parameters may include at least one of the following: the number of node unit groups, the number of MAC nodes included in each node unit group, and so on. After the division parameters are obtained, each column of the MAC array can be divided into multiple node unit groups based on them. In some examples, to ensure the quality and efficiency of the convolution calculation, the node unit groups of the same column in the MAC array all include the same number of MAC nodes.
For example, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. The 64 MAC nodes can be divided evenly into 8 node unit groups, each including 8 MAC nodes. Alternatively, they can be divided evenly into 4 node unit groups, each including 16 MAC nodes, or into 16 node unit groups, each including 4 MAC nodes. When the node unit groups are then used to analyze and process the node input data, the quality and efficiency of the analysis and processing can be effectively ensured.
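The even division in this example can be sketched as follows. The function `split_column` and its parameters are hypothetical names used only to illustrate the division parameters (number of groups and nodes per group).

```python
def split_column(num_nodes, group_size):
    """Divide one column of MAC node indices into equal node unit groups."""
    if num_nodes % group_size != 0:
        raise ValueError("group_size must divide num_nodes evenly")
    return [list(range(start, start + group_size))
            for start in range(0, num_nodes, group_size)]


# The three divisions of a 64-node column mentioned in the example.
groups_of_8 = split_column(64, 8)
groups_of_16 = split_column(64, 16)
groups_of_4 = split_column(64, 4)
```

The sketch rejects group sizes that do not divide the column evenly, which reflects the equal-size grouping within one column described above.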
In other examples, to improve the flexibility and reliability of the convolution calculation, the numbers of node unit groups corresponding to different columns of the MAC array may be the same or different. For example, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. When each column is divided into node unit groups, the 64 MAC nodes of the first column may be evenly divided into 8 node unit groups of 8 MAC nodes each, and the 64 MAC nodes of the second column may be evenly divided into 4 node unit groups of 16 MAC nodes each; in this case the numbers of node unit groups of the different columns differ. Alternatively, the 64 MAC nodes of the second column may also be evenly divided into 8 node unit groups of 8 MAC nodes each; in this case the numbers of node unit groups of the different columns are the same. In this way, when the node unit groups are used to process the node input data, the flexibility of the data processing can be effectively ensured.
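The division described above can be illustrated with a minimal sketch. The function names and the form of the division parameter (one group count per column) are assumptions made for illustration only; the embodiment does not prescribe them.

```python
from typing import List


def split_column(num_nodes: int, num_groups: int) -> List[range]:
    """Split the MAC nodes of one column into equally sized node unit groups."""
    if num_groups <= 0 or num_nodes % num_groups != 0:
        raise ValueError("num_groups must evenly divide num_nodes")
    size = num_nodes // num_groups
    return [range(g * size, (g + 1) * size) for g in range(num_groups)]


def partition_array(num_rows: int, groups_per_column: List[int]) -> List[List[range]]:
    """Apply a (hypothetical) per-column division parameter to every column."""
    return [split_column(num_rows, groups) for groups in groups_per_column]


# Column 0 uses 8 groups, column 1 uses 4 groups, the remaining 62 columns use 8.
layout = partition_array(64, [8, 4] + [8] * 62)
print(len(layout[0]), len(layout[0][0]))  # groups and nodes per group in column 0
print(len(layout[1]), len(layout[1][0]))  # groups and nodes per group in column 1
```

Within a column every group has the same size, while different columns may use different group counts, which corresponds to the two cases described above.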
It should be noted that the MAC nodes of the MAC array in this embodiment do not include adders. To perform the convolution operation, each node unit group in the MAC array therefore corresponds to at least one adder, which accumulates the calculation results of the MAC nodes in the node unit group to obtain the node processing result corresponding to each column of the MAC array. In some examples, the at least one adder may include a first-stage adder connected to the node unit group and a second-stage adder connected to the first-stage adder, where the data resource supported by the second-stage adder is larger than that supported by the first-stage adder. During data processing, a single control signal may be used to uniformly control the adders corresponding to each column of MAC nodes.
The first-stage adder is directly connected to the node unit group and accumulates the outputs of the MAC nodes included in that group. The second-stage adder is connected to the first-stage adder and accumulates the outputs of the first-stage adder. It can be understood that there may be one or more second-stage adders, and that the data resource supported by a second-stage adder is larger than that supported by a first-stage adder.
In some examples, because the adder accumulates the outputs of the MAC nodes in the node unit group, the data resource that the adder can support depends on the number of MAC nodes included in the node unit group and on the bit width of the data on which the MAC nodes perform the convolution calculation. Specifically, when the number of MAC nodes included in the node unit group is 2 to the power N, the data resource supported by the adder is the sum of that data bit width and N.
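The reason N additional bits suffice can be written out as a short bound. The symbols W (bit width of one MAC node output) and p_i (the output of the i-th MAC node) are introduced here for illustration, and signed two's-complement values are assumed; the same argument holds for unsigned values.

```latex
% Each MAC node output p_i is a signed W-bit value:
-2^{W-1} \le p_i \le 2^{W-1} - 1 , \qquad i = 1, \dots, 2^{N}
% Summing 2^N such values gives
-2^{W+N-1} \;\le\; \sum_{i=1}^{2^{N}} p_i \;\le\; 2^{N}\left(2^{W-1} - 1\right) \;<\; 2^{W+N-1}
% so the sum is always representable in W + N bits.
```

For W = 17 and N = 3 (eight MAC nodes per group) this gives a 20-bit sum.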
For example, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. When each column of the MAC array is divided into node unit groups, the 64 MAC nodes of the first column may be evenly divided into 8 node unit groups, each including 8 (2^3) MAC nodes. Each node unit group may then be connected to a first-stage adder. The first-stage adders may include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, and so on up to adder 7 corresponding to the eighth node unit group.
For the first-stage adders above, suppose the output of each MAC node in a node unit group is 17b wide and each first-stage adder accumulates the products of the 8 (2^3) MAC nodes in its node unit group. The data resource supported by the first-stage adder is then the sum of the data bit width and N, namely 17b + 3b = 20b. The second-stage adder accumulates the outputs of the first-stage adders. If the second-stage adder is also an 8-input adder, only one 8-input 20b adder is needed, and its output bit width is 20b + 3b = 23b.
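The two-stage accumulation and its bit widths can be checked with a small software model. This is only a sketch: it assumes signed two's-complement products, and the helper names are hypothetical rather than part of the embodiment.

```python
from typing import List, Tuple


def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a signed two's-complement number of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def accumulate_column(products: List[int], group_size: int = 8) -> Tuple[List[int], int]:
    """One first-stage adder per node unit group, then one second-stage adder."""
    first_stage = [sum(products[i:i + group_size])
                   for i in range(0, len(products), group_size)]
    return first_stage, sum(first_stage)


# Worst case: all 64 MAC nodes output the most negative 17-bit value.
worst = [-(1 << 16)] * 64
partial, total = accumulate_column(worst)
assert all(fits_signed(p, 20) for p in partial)  # 17b + 3b
assert fits_signed(total, 23)                    # 20b + 3b
assert not fits_signed(total, 22)                # one bit fewer would overflow
print(partial[0], total)
```

The last assertion shows that, under this assumption, 23 bits are not only sufficient but also necessary for the column result.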
For a convolution operation device, the more complex its structure and calculation logic, the lower the clock frequency it can reach, so that its operating frequency cannot be very high and high-frequency hardware timing requirements become harder to meet. Therefore, when the MAC array in the convolution operation device is divided into node unit groups, the number of MAC nodes included in a node unit group and the number of adders corresponding to the node unit group directly affect the data processing latency of the device; for example, the more adders there are, the worse the timing of the convolution operation device.
From the above it follows that, when the convolution operation device processes the data to be processed, if the timing of the 8-input 20b adder cannot meet the requirement, another correspondence between node unit groups and adders may be used. For example, as shown in FIG. 7, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. When each column of the MAC array is divided into node unit groups, the 64 MAC nodes of the first column may be evenly divided into 8 node unit groups, each including 8 (2^3) MAC nodes. Each node unit group may then be connected to a first-stage adder. The first-stage adders may include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, and so on up to adder 7 corresponding to the eighth node unit group.
In this case, each first-stage adder is an 8-input adder, and each second-stage adder connected to the first-stage adders may be a 4 (2^2)-input adder. The second-stage adders may include adder 8, which is connected to adder 0, adder 1, adder 2 and adder 3, and adder 9, which is connected to adder 4, adder 5, adder 6 and adder 7. There are thus two such second-stage adders, and their output bit width is the sum of the bit width of the data they accumulate and N, namely 20b + 2b = 22b.
It should be noted that the second-stage adders further include adder 10, which is connected to adder 8 and adder 9. Adder 10 is a 2 (2^1)-input adder, and its output bit width is again the sum of the bit width of the data it accumulates and N, namely 22b + 1b = 23b.
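The bit-width progression of the two structures described so far (8-input adders followed by one 8-input adder, and 8-input adders followed by 4-input and 2-input adders) can be reproduced with the following sketch. The function name and the representation of a structure as a list of fan-ins are assumptions for illustration.

```python
from typing import List, Tuple


def adder_tree(num_nodes: int, product_bits: int, fan_ins: List[int]) -> List[Tuple[int, int]]:
    """Return (number of adders, output bit width) for each adder level."""
    levels = []
    signals, bits = num_nodes, product_bits
    for fan_in in fan_ins:
        if fan_in < 2 or fan_in & (fan_in - 1) or signals % fan_in:
            raise ValueError("fan-in must be a power of two dividing the signal count")
        signals //= fan_in               # one adder per group of fan_in signals
        bits += fan_in.bit_length() - 1  # add N = log2(fan_in) bits
        levels.append((signals, bits))
    if signals != 1:
        raise ValueError("the levels must reduce the column to a single result")
    return levels


print(adder_tree(64, 17, [8, 8]))     # 8 adders at 20b, then 1 adder at 23b
print(adder_tree(64, 17, [8, 4, 2]))  # 8 at 20b, 2 at 22b, 1 at 23b
```

Both structures end at the same 23b column result; they differ only in how many inputs each adder must handle, which is the property that affects timing.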
From the above it follows that, when the convolution operation device processes the data to be processed, if the timing of a device with the above structure cannot meet the requirement, another connection structure between node unit groups and adders may be used. For example, referring to FIG. 8, when the MAC array is a 64*64 array, each column includes 64 MAC nodes. When each column of the MAC array is divided into node unit groups, the 64 MAC nodes of the first column may be evenly divided into 16 node unit groups, each including 4 (2^2) MAC nodes. Each node unit group may then be connected to a first-stage adder. The first-stage adders may include adder 0 corresponding to the first node unit group, adder 1 corresponding to the second node unit group, adder 2 corresponding to the third node unit group, adder 3 corresponding to the fourth node unit group, adder 4 corresponding to the fifth node unit group, and so on up to adder 15 corresponding to the sixteenth node unit group.
For a convolution operation device with this structure, each first-stage adder is a 4-input adder, that is, it accumulates the products of 4 (2^2) MAC nodes, and its output bit width is 17b + 2b = 19b. Each second-stage adder connected to the first-stage adders may be a 4 (2^2)-input adder. The second-stage adders may include adder 16, adder 17, adder 18 and adder 19, where adder 16 is connected to adder 0, adder 1, adder 2 and adder 3; adder 17 is connected to adder 4, adder 5, adder 6 and adder 7; adder 18 is connected to adder 8, adder 9, adder 10 and adder 11; and adder 19 is connected to adder 12, adder 13, adder 14 and adder 15. There are thus four such second-stage adders, each accumulating the results of 4 (2^2) first-stage adders, and their output bit width is 19b + 2b = 21b.
It should be noted that the second-stage adders further include adder 20, which is connected to adder 16, adder 17, adder 18 and adder 19. Adder 20 is a 4 (2^2)-input 21b adder, and its output bit width is the sum of the bit width of the data it accumulates and N, namely 21b + 2b = 23b.
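The adder numbering and connections of the FIG. 8 structure can be generated mechanically. The sketch below is illustrative only; the function name and the dictionary representation are assumptions, and it simply numbers adders level by level in the same order as the description above.

```python
from typing import Dict, List


def build_wiring(num_groups: int, fan_in: int) -> Dict[int, List[int]]:
    """Map each later-stage adder to the adders whose outputs it accumulates."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    wiring: Dict[int, List[int]] = {}
    current = list(range(num_groups))  # first-stage adders 0 .. num_groups - 1
    next_id = num_groups
    while len(current) > 1:
        following = []
        for i in range(0, len(current), fan_in):
            wiring[next_id] = current[i:i + fan_in]
            following.append(next_id)
            next_id += 1
        current = following
    return wiring


wiring = build_wiring(num_groups=16, fan_in=4)
for adder, sources in wiring.items():
    print(adder, sources)
```

With 16 node unit groups and 4-input adders this yields adders 16 to 19 fed by first-stage adders 0 to 15 in blocks of four, and adder 20 fed by adders 16 to 19.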
The convolution operation device provided in this embodiment divides the MAC array evenly into several node unit groups, so that the data resource of each adder connected to a node unit group can match, as closely as possible, the sum of the data bit width used for the convolution calculation and N. Here N depends on the number of inputs of the adder: N = log2(number of adder inputs), so N is 2 when the adder has 4 inputs. This realizes a redundancy-free cascaded adder suitable for a systolic array, that is, the adder resources included in the convolution operation device have no redundancy at all. A large amount of chip area is thereby saved and power consumption and cost are reduced, which further improves the quality and efficiency of the convolution operation and ensures the practicability of the convolution operation device.
In still other examples, when the number of MAC nodes included in a node unit group is not a power of 2, that is, when it lies strictly between 2^N and 2^(N+1), the data resource supported by the adder is the sum of the data bit width used for the convolution calculation in the MAC nodes and N+1.
For example, when the MAC array is a 48*48 array, each column includes 48 MAC nodes. When each column of the MAC array is divided into node unit groups, the 48 MAC nodes of the first column may be evenly divided into 4 node unit groups, each including 12 MAC nodes (2^4 > 12 > 2^3, so N = 3). Each node unit group may then be connected to a first-stage adder, giving 4 first-stage adders, each of which accumulates the products of the 12 MAC nodes in its node unit group. The data resource supported by the first-stage adder is then the sum of the data bit width and N+1, namely 17b + 3b + 1b = 21b. The second-stage adder accumulates the outputs of the first-stage adders. If the second-stage adder is a 4 (2^2)-input adder, its output bit width is 21b + 2b = 23b.
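Both width rules (N extra bits for 2^N inputs, N+1 extra bits when the input count lies strictly between 2^N and 2^(N+1)) can be combined in one small helper. The function name is hypothetical and the sketch only restates the arithmetic of the examples above.

```python
def adder_output_bits(input_bits: int, num_inputs: int) -> int:
    """Output width of an adder that accumulates num_inputs values of input_bits each."""
    if num_inputs < 2:
        raise ValueError("an adder needs at least two inputs")
    n = num_inputs.bit_length() - 1          # largest N with 2**N <= num_inputs
    is_power_of_two = num_inputs == (1 << n)
    return input_bits + n if is_power_of_two else input_bits + n + 1


first_stage = adder_output_bits(17, 12)            # 48*48 array, groups of 12 nodes
second_stage = adder_output_bits(first_stage, 4)   # one 4-input second-stage adder
print(first_stage, second_stage)
```

For the 64*64 examples the same helper gives 20b for 8 inputs of 17b and 19b for 4 inputs of 17b.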
In the convolution operation device provided in this embodiment, the MAC array is divided into several node unit groups, and when the number of MAC nodes included in a node unit group is not a power of 2, the data resource supported by the adder is the sum of the data bit width used for the convolution calculation in the MAC nodes and N+1. In this case the adder resources included in the convolution operation device do have some redundancy, but the degree of redundancy is effectively reduced. Chip area is likewise saved and power consumption and cost are reduced, which further improves the quality and efficiency of the convolution operation and ensures the practicability of the convolution operation device.
In a specific implementation, compared with convolution operation devices in the related art, the convolution operation device provided in this embodiment effectively improves the utilization of the MAC nodes and is more favorable to data processing performance. Referring to FIG. 9 and FIG. 10 and taking a 64*64 MAC array as an example, when the data to be processed (ifm) is input to the MAC array in a triangular pattern, all MAC nodes in the array are running only after 128 clock cycles; when the data to be processed (ifm) is input to the MAC array in a rectangular pattern, only 64 clock cycles are needed before all MAC nodes in the array are running. In addition, because the MAC nodes of each column in the convolution operation device work synchronously, a large number of systolic control signals are eliminated, which further improves the quality and efficiency of the convolution operation.
It should be noted that the convolution operation device provided in this embodiment is not limited to the implementations described above. Those skilled in the art may divide the MAC array according to the specific application scenario or application requirement, and may also configure the adder resources corresponding to the node unit groups obtained by the division, which further improves the flexibility of the convolution operation device. In addition, the convolution operation device can perform convolution calculations not only on int8 data but also on data of other types, for example int4 data, int16 data, and so on, which broadens its range of application.
FIG. 11 is a schematic flowchart of a convolution operation method provided by an embodiment of the present invention. Referring to FIG. 11, this embodiment provides a convolution operation method. The method may be executed by a convolution operation device, and it can be understood that the convolution operation device may be implemented as software or as a combination of software and hardware. Specifically, the convolution operation method may include:
Step S1101: Obtain data to be processed and, based on the data to be processed, determine within one clock cycle the node input data corresponding to each of a plurality of MAC nodes in a multiplier MAC array.
The data to be processed is data on which a convolution calculation is to be performed. In different application scenarios the data to be processed may have different formats; for example, it may be text data, image data, video data, and so on. When a user needs a convolution calculation on the data to be processed, the data may be obtained through a data acquisition node. Specifically, the data to be processed may be stored in a preset area, and the data acquisition node obtains it by accessing the preset area. Alternatively, the data to be processed may be stored in a third device with which the data acquisition node is communicatively connected, so that the data acquisition node obtains the data through the third device. Of course, the data acquisition node may also obtain the data to be processed in other ways, provided that the data is obtained accurately and reliably; details are not repeated here.
After the data acquisition node obtains the data to be processed, it may send the data to the MAC array so that the MAC array can process it. The MAC array is not limited to a two-dimensional MAC array and may also be, for example, a three-dimensional MAC array; for ease of understanding, a two-dimensional MAC array is used as an example. Because the MAC array includes multiple rows and multiple columns of MAC nodes, the data acquisition node, when communicatively connected to the MAC array, is connected to the MAC nodes of the same column. To enable the MAC nodes of the same column to process the data within one clock cycle, the data acquisition node may, after obtaining the data to be processed, analyze it so as to determine within one clock cycle the node input data corresponding to each of the plurality of MAC nodes in the multiplier MAC array. It can be understood that the node input data may be a part of the data to be processed.
Step S1102: Use the MAC nodes in the MAC array to perform a convolution calculation on the node input data to obtain node calculation results, where the node calculation results are used to determine the convolution operation result corresponding to the data to be processed.
After the plurality of MAC nodes in the MAC array obtain the node input data, the MAC nodes may perform a convolution calculation on the node input data to obtain node calculation results. It can be understood that a node calculation result is an intermediate result of the convolution calculation on the data to be processed. After the node calculation results of the MAC nodes in the MAC array are obtained, the convolution operation result corresponding to the data to be processed may be determined based on all of the node calculation results, so that the convolution calculation on the data to be processed is carried out stably.
In some examples, using the MAC nodes in the MAC array to perform the convolution calculation on the node input data to obtain the node calculation results may include: determining the weight coefficient corresponding to each of the plurality of MAC nodes in the MAC array; and using the MAC nodes and the weight coefficients to perform the convolution calculation on the node input data to obtain the node calculation results.
Because the node calculation results are used to determine the convolution operation result corresponding to the data to be processed, after the node calculation results are obtained, the method in this embodiment may further include: determining, based on the node processing results, the convolution operation result corresponding to the data to be processed.
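Steps S1101 and S1102 for a single column can be sketched in software as follows. This is a simplified model under stated assumptions: each MAC node only multiplies its node input by its weight coefficient, and the accumulation is done by group adders followed by a final adder; the function names and the sample data are hypothetical.

```python
from typing import Sequence


def mac_node(node_input: int, weight: int) -> int:
    """One MAC node in this sketch: multiply the input by the weight (no local adder)."""
    return node_input * weight


def convolve_column(node_inputs: Sequence[int], weights: Sequence[int], group_size: int) -> int:
    """Node calculation results of one column, accumulated group by group."""
    if len(node_inputs) != len(weights):
        raise ValueError("one weight coefficient is needed per MAC node")
    products = [mac_node(x, w) for x, w in zip(node_inputs, weights)]
    group_sums = [sum(products[i:i + group_size])
                  for i in range(0, len(products), group_size)]
    return sum(group_sums)


inputs = list(range(64))   # hypothetical node input data for one clock cycle
weights = [1, -1] * 32     # hypothetical weight coefficients
result = convolve_column(inputs, weights, group_size=8)
print(result)
```

Because addition is associative, grouping the products does not change the column result; it only changes which adder produces which partial sum.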
In the convolution operation method provided in this embodiment, the data to be processed is obtained, the node input data corresponding to each of the plurality of MAC nodes in the multiplier MAC array is determined within one clock cycle based on the data to be processed, and the MAC nodes in the MAC array then perform a convolution calculation on the node input data to obtain node calculation results. The convolution operation on the data to be processed can thus be carried out within one clock cycle, the data processing efficiency is improved, and the power consumption and cost of data processing are reduced. This effectively solves the problems of long processing time and low processing performance that arise when data is input in a triangular pattern, and improves the practicability of the convolution operation method.
FIG. 12 is a schematic flowchart of another convolution operation method provided by an embodiment of the present invention. On the basis of the above embodiment and referring to FIG. 12, before the MAC nodes in the MAC array are used to perform the convolution calculation on the node input data, the method in this embodiment may further include:
Step S1201: Obtain division parameters for dividing each column of MAC nodes in the MAC array.
Step S1202: Divide each column of the MAC array into a plurality of node unit groups based on the division parameters, where each node unit group includes a plurality of MAC nodes.
When the MAC nodes in the MAC array are used to perform the convolution operation on the node input data, the MAC array may be divided in order to improve the quality and efficiency of the convolution operation. Specifically, division parameters for dividing each column of MAC nodes in the MAC array may be obtained. The division parameters may be stored in a preset area, or may be configured by a user based on the application scenario or application requirement. After the division parameters are obtained, each column of the MAC array may be divided into a plurality of node unit groups based on them, and each resulting node unit group includes a plurality of MAC nodes. In some examples, the node unit groups of the same column of the MAC array include the same number of MAC nodes, and the numbers of node unit groups corresponding to different columns of the MAC array may be the same or different.
After each column of the MAC array has been divided into a plurality of node unit groups based on the division parameters, the node unit groups may be used to perform the convolution operation on the node input data, so that the node calculation results can be obtained.
In this embodiment, the division parameters for dividing each column of MAC nodes in the MAC array are obtained, and each column of the MAC array is then divided into a plurality of node unit groups based on the division parameters. The node unit groups can thus be used to perform the convolution operation on the node input data, which helps to improve the quality and efficiency of the convolution operation.
FIG. 13 is a schematic flowchart, provided by an embodiment of the present invention, of using the MAC nodes in the MAC array to perform a convolution calculation on the node input data to obtain node calculation results. On the basis of the above embodiment and referring to FIG. 13, this embodiment provides an implementation of that convolution calculation. Specifically, using the MAC nodes in the MAC array to perform the convolution calculation on the node input data to obtain the node calculation results may include:
Step S1301: Obtain at least one adder corresponding to a node unit group in the MAC array.
Step S1302: Use the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain the node processing result corresponding to each column of the MAC array.
When the MAC nodes in the MAC array are used to perform the convolution calculation on the node input data, an accumulation operation is required. To avoid redundancy in the adder resources, the adder resources corresponding to the node unit groups may be configured when the MAC array is divided into a plurality of node unit groups. Therefore, to carry out the convolution operation, at least one adder corresponding to a node unit group in the MAC array may first be obtained. In some examples, obtaining the at least one adder corresponding to the node unit group in the MAC array may include: obtaining the number of MAC nodes included in the node unit group; determining the bit width of the data on which the MAC nodes corresponding to the at least one adder perform the convolution calculation; and determining, based on the number of MAC nodes and the data bit width, the data resource that the adder can support. Specifically, determining the data resource that the adder can support based on the number of MAC nodes and the data bit width may include: when the number of MAC nodes included in the node unit group is 2 to the power N, determining the data resource that the adder can support based on the sum of the data bit width and N; and when the number of MAC nodes included in the node unit group is not a power of 2, determining the data resource that the adder can support based on the sum of the data bit width and N+1.
After the at least one adder corresponding to the node unit group in the MAC array is obtained, the at least one adder may be used to accumulate the calculation results of the MAC nodes in the node unit group, so that the node processing result corresponding to each column of the MAC array can be obtained stably and effectively, which further ensures the accuracy of the node processing results.
The implementation, principles, and effects of the method in this embodiment are similar to those of the convolution operation device shown in FIG. 4 to FIG. 10 above. For details, refer to the foregoing description; they are not repeated here.
FIG. 14 is a schematic flowchart of an image processing method based on a multiplier MAC array according to an embodiment of the present invention. Referring to FIG. 14, this embodiment provides an image processing method based on a multiplier MAC array. The method may be executed by an image processing device based on a multiplier MAC array; it can be understood that the image processing device may be implemented as software, or as a combination of software and hardware. Specifically, the image processing method based on the multiplier MAC array may include:
Step S1401: Acquire an image to be processed, and determine, within one clock cycle and based on the image to be processed, node input data corresponding to each of a plurality of MAC nodes in the MAC array.
Step S1402: Use the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results.
Step S1403: Determine, based on the node calculation results, a convolution processing result corresponding to the image to be processed.
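Steps S1401 to S1403 can be sketched in plain Python as follows. This is only an illustrative software model, not the hardware described here: it assumes a square kernel, stride 1, no padding, and no kernel flip, and the function names `node_inputs`, `mac_array`, and `convolve` are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def node_inputs(image: Matrix, row: int, col: int, k: int) -> List[int]:
    """S1401: flatten the k x k window at (row, col), one value per MAC node."""
    return [image[row + i][col + j] for i in range(k) for j in range(k)]


def mac_array(inputs: List[int], weights: List[int]) -> List[int]:
    """S1402: each MAC node multiplies its input by its weight coefficient."""
    return [x * w for x, w in zip(inputs, weights)]


def convolve(image: Matrix, kernel: Matrix) -> Matrix:
    """S1403: accumulate the node results into one output value per window."""
    k = len(kernel)
    weights = [w for kernel_row in kernel for w in kernel_row]
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [sum(mac_array(node_inputs(image, r, c, k), weights)) for c in range(out_w)]
        for r in range(out_h)
    ]
```

In this model each call to `node_inputs` stands in for the data delivered to the MAC nodes in one clock cycle; real hardware would process those values in parallel rather than in a loop.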
The implementation, principles, and effects of the method in this embodiment are similar to those of the convolution operation performed on the data to be processed by the convolution operation device shown in FIG. 4 to FIG. 10 above. For details, refer to the foregoing description; they are not repeated here.
FIG. 15 is a schematic structural diagram of a convolution operation device according to an embodiment of the present invention. Referring to FIG. 15, this embodiment provides a convolution operation device configured to execute the convolution operation method shown in FIG. 11 above. Specifically, the convolution operation device may include:
a first memory 12, configured to store a computer program;
a first processor 11, configured to run the computer program stored in the first memory 12 to implement:
acquiring data to be processed, and determining, within one clock cycle and based on the data to be processed, node input data corresponding to each of a plurality of MAC nodes in a multiplier MAC array; and
using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, the node calculation results being used to determine a convolution operation result corresponding to the data to be processed.
The structure of the electronic device may further include a first communication interface 13, which is used for communication between the electronic device and other devices or a communication network.
In some examples, before the MAC nodes in the MAC array are used to perform convolution calculation on the node input data, the first processor 11 in this embodiment is further configured to: obtain a division parameter used to divide each column of MAC nodes in the MAC array; and divide each column in the MAC array into a plurality of node unit groups based on the division parameter, each node unit group including a plurality of MAC nodes.
In some examples, the node unit groups corresponding to the same column in the MAC array include the same number of MAC nodes.
In some examples, the numbers of node unit groups corresponding to different columns in the MAC array are the same or different.
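The column division described above might be modeled as in the following Python sketch. It assumes the division parameter is simply a group size per column, which the text does not specify, and the names `split_column` and `split_array` are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_column(column: Sequence[T], group_size: int) -> List[List[T]]:
    """Split one column of MAC nodes into equally sized node unit groups."""
    if group_size < 1 or len(column) % group_size != 0:
        raise ValueError("group_size must evenly divide the column length")
    return [list(column[i:i + group_size]) for i in range(0, len(column), group_size)]


def split_array(
    columns: Sequence[Sequence[T]], group_sizes: Sequence[int]
) -> List[List[List[T]]]:
    """Split every column; group sizes, and so group counts, may differ per column."""
    return [split_column(col, size) for col, size in zip(columns, group_sizes)]
```

Within a column all groups have the same size, while different columns may use different sizes and therefore end up with different numbers of groups.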
In some examples, when the first processor 11 uses the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, the first processor 11 is configured to: obtain at least one adder corresponding to a node unit group in the MAC array; and use the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain a node processing result corresponding to each column in the MAC array.
In some examples, when the first processor 11 obtains the at least one adder corresponding to the node unit group in the MAC array, the first processor 11 is configured to: obtain the number of MAC nodes included in the node unit group; determine the data bit width used for convolution calculation in the MAC nodes corresponding to each of the at least one adder; and determine, based on the number of MAC nodes and the data bit width, the data resources that the adder can support.
In some examples, when the first processor 11 determines the data resources that the adder can support based on the number of MAC nodes and the data bit width, the first processor 11 is configured to: when the number of MAC nodes included in the node unit group is the Nth power of 2, determine the data resources that the adder can support based on the sum of the data bit width and N; and when the number of MAC nodes included in the node unit group is not the Nth power of 2, determine the data resources that the adder can support based on the sum of the data bit width and N+1.
In some examples, when the first processor 11 uses the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, the first processor 11 is configured to: determine a weight coefficient corresponding to each of the plurality of MAC nodes in the MAC array; and use the MAC nodes and the weight coefficients to perform convolution calculation on the node input data to obtain the node calculation results.
In some examples, after the node calculation results are obtained, the first processor 11 is configured to: determine, based on the node processing results, a convolution operation result corresponding to the data to be processed.
The device shown in FIG. 15 can perform the methods of the embodiments shown in FIG. 11 to FIG. 13. For parts not described in detail in this embodiment, refer to the descriptions of the embodiments shown in FIG. 11 to FIG. 13. For the execution process and technical effects of this technical solution, see the descriptions of those embodiments; details are not repeated here.
FIG. 16 is a schematic structural diagram of an image processing device based on a multiplier MAC array according to an embodiment of the present invention. Referring to FIG. 16, this embodiment provides an image processing device based on a multiplier MAC array, configured to execute the image processing method based on the multiplier MAC array shown in FIG. 14 above. Specifically, the image processing device based on the multiplier MAC array may include:
a second memory 22, configured to store a computer program;
a second processor 21, configured to run the computer program stored in the second memory 22 to implement:
acquiring an image to be processed, and determining, within one clock cycle and based on the image to be processed, node input data corresponding to each of a plurality of MAC nodes in the MAC array;
using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results; and
determining, based on the node calculation results, a convolution processing result corresponding to the image to be processed.
The structure of the electronic device may further include a second communication interface 23, which is used for communication between the electronic device and other devices or a communication network.
The device shown in FIG. 16 can perform the method of the embodiment shown in FIG. 14. For parts not described in detail in this embodiment, refer to the description of the embodiment shown in FIG. 14. For the execution process and technical effects of this technical solution, see the description of that embodiment; details are not repeated here.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by an electronic device, the instructions including a program for executing the convolution operation method in the method embodiments shown in FIG. 11 to FIG. 13 above.
In addition, an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by an electronic device, the instructions including a program for executing the image processing method based on the multiplier MAC array in the method embodiment shown in FIG. 14 above.
The technical solutions and technical features in the above embodiments may be used alone or in combination, provided they do not conflict with the present invention. As long as they do not exceed the knowledge of those skilled in the art, they all constitute equivalent embodiments within the scope of protection of this application.
In the several embodiments provided by the present invention, it should be understood that the disclosed related detection devices and methods may be implemented in other ways. For example, the detection device embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, detection devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer processor to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (32)

  1. A convolution operation device, characterized by comprising:
    a data acquisition node, configured to acquire data to be processed and to determine, within one clock cycle and based on the data to be processed, node input data corresponding to each of a plurality of MAC nodes in a multiplier MAC array; and
    the MAC array, connected to the data acquisition node and configured to obtain the node input data through the data acquisition node and perform convolution calculation on the node input data to obtain node calculation results, the node calculation results being used to determine a convolution operation result corresponding to the data to be processed.
  2. The device according to claim 1, characterized in that each column in the MAC array includes a plurality of node unit groups, each node unit group includes a plurality of MAC nodes, and each node unit group corresponds to at least one adder, the adder being configured to accumulate the calculation results of the MAC nodes in the node unit group to obtain a node processing result corresponding to each column in the MAC array.
  3. The device according to claim 2, characterized in that the node unit groups corresponding to the same column in the MAC array include the same number of MAC nodes.
  4. The device according to claim 2, characterized in that the numbers of node unit groups corresponding to different columns in the MAC array are the same or different.
  5. The device according to claim 2, characterized in that the at least one adder comprises a first-stage adder connected to the node unit group and a second-stage adder connected to the first-stage adder, wherein the data resources that the second-stage adder can support are greater than the data resources that the first-stage adder can support.
  6. The device according to claim 5, characterized in that the data resources that the adder can support are related to the number of MAC nodes included in the node unit group and to the data bit width used for convolution calculation in the MAC nodes.
  7. The device according to claim 6, characterized in that, when the number of MAC nodes included in the node unit group is the Nth power of 2, the data resources that the adder can support are the sum of N and the data bit width used for convolution calculation in the MAC nodes.
  8. The device according to claim 6, characterized in that, when the number of MAC nodes included in the node unit group is not the Nth power of 2, the data resources that the adder can support are the sum of N+1 and the data bit width used for convolution calculation in the MAC nodes.
  9. The device according to any one of claims 1-8, characterized in that the device further comprises:
    a weight cache node, configured to determine a weight coefficient corresponding to each of the plurality of MAC nodes in the MAC array; and
    the MAC array, connected to the weight cache node and configured to use the MAC nodes and the weight coefficients to perform convolution calculation on the node input data to obtain the node calculation results.
  10. The device according to any one of claims 1-8, characterized in that the device further comprises:
    a data processing node, connected to the MAC array and configured to determine, based on the node calculation results, the convolution operation result corresponding to the data to be processed.
  11. A convolution operation method, characterized by comprising:
    acquiring data to be processed, and determining, within one clock cycle and based on the data to be processed, node input data corresponding to each of a plurality of MAC nodes in a multiplier MAC array; and
    using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, the node calculation results being used to determine a convolution operation result corresponding to the data to be processed.
  12. The method according to claim 11, characterized in that, before the MAC nodes in the MAC array are used to perform convolution calculation on the node input data, the method further comprises:
    obtaining a division parameter used to divide each column of MAC nodes in the MAC array; and
    dividing each column in the MAC array into a plurality of node unit groups based on the division parameter, each node unit group including a plurality of MAC nodes.
  13. The method according to claim 12, characterized in that the node unit groups corresponding to the same column in the MAC array include the same number of MAC nodes.
  14. The method according to claim 12, characterized in that the numbers of node unit groups corresponding to different columns in the MAC array are the same or different.
  15. The method according to claim 12, characterized in that using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results comprises:
    obtaining at least one adder corresponding to a node unit group in the MAC array; and
    using the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain a node processing result corresponding to each column in the MAC array.
  16. The method according to claim 15, characterized in that obtaining the at least one adder corresponding to the node unit group in the MAC array comprises:
    obtaining the number of MAC nodes included in the node unit group;
    determining the data bit width used for convolution calculation in the MAC nodes corresponding to each of the at least one adder; and
    determining, based on the number of MAC nodes and the data bit width, the data resources that the adder can support.
  17. The method according to claim 16, characterized in that determining, based on the number of MAC nodes and the data bit width, the data resources that the adder can support comprises:
    when the number of MAC nodes included in the node unit group is the Nth power of 2, determining the data resources that the adder can support based on the sum of the data bit width and N; and
    when the number of MAC nodes included in the node unit group is not the Nth power of 2, determining the data resources that the adder can support based on the sum of the data bit width and N+1.
  18. The method according to claim 11, characterized in that using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results comprises:
    determining a weight coefficient corresponding to each of the plurality of MAC nodes in the MAC array; and
    using the MAC nodes and the weight coefficients to perform convolution calculation on the node input data to obtain the node calculation results.
  19. The method according to claim 18, characterized in that, after the node calculation results are obtained, the method further comprises:
    determining, based on the node processing results, the convolution operation result corresponding to the data to be processed.
  20. An image processing method based on a multiplier MAC array, characterized by comprising:
    acquiring an image to be processed, and determining, within one clock cycle and based on the image to be processed, node input data corresponding to each of a plurality of MAC nodes in the MAC array;
    using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results; and
    determining, based on the node calculation results, a convolution processing result corresponding to the image to be processed.
  21. A convolution operation device, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to run the computer program stored in the memory to implement:
    acquiring data to be processed, and determining, within one clock cycle and based on the data to be processed, node input data corresponding to each of a plurality of MAC nodes in a multiplier MAC array; and
    using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, the node calculation results being used to determine a convolution operation result corresponding to the data to be processed.
  22. The device according to claim 21, characterized in that, before the MAC nodes in the MAC array are used to perform convolution calculation on the node input data, the processor is further configured to:
    obtain a division parameter used to divide each column of MAC nodes in the MAC array; and
    divide each column in the MAC array into a plurality of node unit groups based on the division parameter, each node unit group including a plurality of MAC nodes.
  23. The device according to claim 22, characterized in that the node unit groups corresponding to the same column in the MAC array include the same number of MAC nodes.
  24. The device according to claim 22, characterized in that the numbers of node unit groups corresponding to different columns in the MAC array are the same or different.
  25. The device according to claim 22, characterized in that, when the processor uses the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, the processor is configured to:
    obtain at least one adder corresponding to a node unit group in the MAC array; and
    use the at least one adder to accumulate the calculation results of the MAC nodes in the node unit group to obtain a node processing result corresponding to each column in the MAC array.
  26. The device according to claim 25, characterized in that, when the processor obtains the at least one adder corresponding to the node unit group in the MAC array, the processor is configured to:
    obtain the number of MAC nodes included in the node unit group;
    determine the data bit width used for convolution calculation in the MAC nodes corresponding to each of the at least one adder; and
    determine, based on the number of MAC nodes and the data bit width, the data resources that the adder can support.
  27. The device according to claim 26, characterized in that, when the processor determines, based on the number of MAC nodes and the data bit width, the data resources that the adder can support, the processor is configured to:
    when the number of MAC nodes included in the node unit group is the Nth power of 2, determine the data resources that the adder can support based on the sum of the data bit width and N; and
    when the number of MAC nodes included in the node unit group is not the Nth power of 2, determine the data resources that the adder can support based on the sum of the data bit width and N+1.
  28. The device according to claim 21, characterized in that, when the processor uses the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results, the processor is configured to:
    determine a weight coefficient corresponding to each of the plurality of MAC nodes in the MAC array; and
    use the MAC nodes and the weight coefficients to perform convolution calculation on the node input data to obtain the node calculation results.
  29. The device according to claim 28, characterized in that, after the node calculation results are obtained, the processor is configured to:
    determine, based on the node processing results, the convolution operation result corresponding to the data to be processed.
  30. An image processing device based on a multiplier MAC array, characterized by comprising:
    a memory, configured to store a computer program;
    a processor, configured to run the computer program stored in the memory to implement:
    acquiring an image to be processed, and determining, within one clock cycle and based on the image to be processed, node input data corresponding to each of a plurality of MAC nodes in the MAC array;
    using the MAC nodes in the MAC array to perform convolution calculation on the node input data to obtain node calculation results; and
    determining, based on the node calculation results, a convolution processing result corresponding to the image to be processed.
  31. 一种计算机可读存储介质,其特征在于,所述存储介质为计算机可读存储介质,该计算机可读存储介质中存储有程序指令,所述程序指令用于实现权利要求11-19中任意一项所述的卷积运算方法。A computer-readable storage medium, characterized in that the storage medium is a computer-readable storage medium, and program instructions are stored in the computer-readable storage medium, and the program instructions are used to implement any one of claims 11-19. The convolution operation method described in the item.
  32. A computer-readable storage medium, wherein the storage medium is a computer-readable storage medium storing program instructions, the program instructions being used to implement the image processing method based on a multiplier MAC array according to claim 20.
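
The data flow recited in claims 28 to 30 can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the claimed hardware: the function name `mac_array_convolve`, the row-major assignment of kernel weights to MAC nodes, and the use of a plain sum to combine node calculation results are all hypothetical choices made for illustration. The claims do not specify how node results are combined, and the sketch uses the sliding-window form without kernel flipping that is common in neural networks.

```python
def mac_array_convolve(image, kernel):
    """Software model of a MAC array computing a valid-mode 2D convolution.

    Assumptions (not taken from the claims): one MAC node per kernel
    element, weights assigned in row-major order, and node results
    combined by summation.
    """
    kh, kw = len(kernel), len(kernel[0])
    # One weight coefficient per MAC node (claim 28, first step).
    weights = [w for row in kernel for w in row]

    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        out_row = []
        for c in range(out_w):
            # Node input data for every MAC node; the hardware is said to
            # determine these within one clock cycle, which this loop
            # iteration stands in for.
            node_inputs = [image[r + i][c + j]
                           for i in range(kh) for j in range(kw)]
            # Each MAC node multiplies its input by its weight, giving a
            # per-node calculation result.
            node_results = [x * w for x, w in zip(node_inputs, weights)]
            # Combine node results into one output value (assumed: sum).
            out_row.append(sum(node_results))
        output.append(out_row)
    return output


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    kernel = [[1, 0],
              [0, -1]]
    print(mac_array_convolve(image, kernel))  # [[-4, -4], [-4, -4]]
```

With a 2x2 kernel the model uses four MAC nodes per output value; a larger kernel simply maps to more nodes under the same assumptions.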
PCT/CN2021/095619 2021-05-24 2021-05-24 Convolution operation method and apparatus, image processing method and apparatus, and storage medium WO2022246617A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/095619 WO2022246617A1 (en) 2021-05-24 2021-05-24 Convolution operation method and apparatus, image processing method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/095619 WO2022246617A1 (en) 2021-05-24 2021-05-24 Convolution operation method and apparatus, image processing method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2022246617A1 (en) 2022-12-01

Family

ID=84229131

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095619 WO2022246617A1 (en) 2021-05-24 2021-05-24 Convolution operation method and apparatus, image processing method and apparatus, and storage medium

Country Status (1)

Country Link
WO (1) WO2022246617A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363559A (en) * 2018-02-13 2018-08-03 北京旷视科技有限公司 Multiplication processing method, equipment and the computer-readable medium of neural network
CN108629405A (en) * 2017-03-22 2018-10-09 杭州海康威视数字技术股份有限公司 The method and apparatus for improving convolutional neural networks computational efficiency
CN111694643A (en) * 2020-05-12 2020-09-22 中国科学院计算技术研究所 Task scheduling execution system and method for graph neural network application
CN112214222A (en) * 2020-10-27 2021-01-12 华中科技大学 Sequential structure for realizing feedforward neural network in COStream and compiling method thereof

Similar Documents

Publication Publication Date Title
US11449576B2 (en) Convolution operation processing method and related product
CN111226230B (en) Neural network processing system with multiple processors and neural network accelerators
CN110520853B (en) Queue management for direct memory access
WO2021057713A1 (en) Method for splitting neural network model by using multi-core processor, and related product
CN109102065B (en) Convolutional neural network accelerator based on PSoC
WO2020073211A1 (en) Operation accelerator, processing method, and related device
GB2568086A (en) Hardware implementation of convolution layer of deep neutral network
CN109685201B (en) Operation method, device and related product
CN113330421A (en) Dot product calculator and operation method thereof
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
US8711160B1 (en) System and method for efficient resource management of a signal flow programmed digital signal processor code
CN109726822B (en) Operation method, device and related product
US11789733B2 (en) Instruction processing apparatus, acceleration unit, and server
WO2021143217A1 (en) Processing component, method for processing data, and related apparatus
Bai et al. A unified hardware architecture for convolutions and deconvolutions in CNN
EP4260174A1 (en) Data-type-aware clock-gating
WO2021083101A1 (en) Data processing method and apparatus, and related product
WO2021082725A1 (en) Winograd convolution operation method and related product
WO2022246617A1 (en) Convolution operation method and apparatus, image processing method and apparatus, and storage medium
CN109711538B (en) Operation method, device and related product
Wu et al. Skeletongcn: a simple yet effective accelerator for gcn training
CN116227599A (en) Inference model optimization method and device, electronic equipment and storage medium
CN111047005A (en) Operation method, operation device, computer equipment and storage medium
US11256940B1 (en) Method, apparatus and system for gradient updating of image processing model
Tavakoli et al. A high throughput hardware CNN accelerator using a novel multi-layer convolution processor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21942206

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE