CN116306823B - Method, device and chip for providing data for MAC array - Google Patents

Info

Publication number
CN116306823B
Authority
CN
China
Prior art keywords
image characteristic
value
data
characteristic values
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310466151.5A
Other languages
Chinese (zh)
Other versions
CN116306823A (en
Inventor
胡文静
梁喆
马振强
孙猛
靳馥华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aixin Technology Co ltd
Aixin Yuanzhi Semiconductor Shanghai Co Ltd
Original Assignee
Beijing Aixin Technology Co ltd
Aixin Yuanzhi Semiconductor Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aixin Technology Co ltd, Aixin Yuanzhi Semiconductor Shanghai Co Ltd filed Critical Beijing Aixin Technology Co ltd
Priority to CN202310466151.5A priority Critical patent/CN116306823B/en
Publication of CN116306823A publication Critical patent/CN116306823A/en
Application granted granted Critical
Publication of CN116306823B publication Critical patent/CN116306823B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a method, a device and a chip for providing data for a MAC array. The method comprises: acquiring the size of a convolution window; acquiring, from an on-chip memory, the image feature values to be processed this time by the convolution window to form an image feature value array; processing the image feature value array to obtain a plurality of feature value splicing results, and storing the feature value splicing results in a data buffer; and, when the data amount M of image feature values that the MAC array can process in one clock cycle is smaller than the total number of image feature values in the image feature value array, dividing the image feature values in all of the feature value splicing results in the data buffer into batches according to the data amount M to obtain a plurality of batches of image feature values, and sending the plurality of batches to the MAC array over a plurality of clock cycles. The image feature data is thus stored in a single data buffer that feeds the MAC array, which reduces the hardware cost of supplying data to the MAC array.

Description

Method, device and chip for providing data for MAC array
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, and a chip for providing data for a MAC array.
Background
Currently, a MAC (Multiply Accumulate) array is generally used to implement convolution operations, and convolution is one of the important operations with which an artificial intelligence chip implements an artificial neural network.
When the MAC array performs a convolution operation on an image, the amount of image feature value data that the MAC array can process in one clock cycle is limited. In the related art, a plurality of data buffers are therefore generally used to buffer the multiple rows of image feature values required by the convolution operation, and the image feature values are supplied to the MAC array from the plurality of data buffers. However, because of the relatively large number of data buffers used, this approach is costly to implement for the MAC array. Therefore, how to provide the image feature values participating in the convolution operation to the MAC array at low cost is a technical problem to be solved at present.
Disclosure of Invention
According to a first aspect of the present application, there is provided a method of providing data for a MAC array, comprising: obtaining the size of a convolution window, wherein the size of the convolution window is n×m×c, n represents the number of rows of the convolution window, m represents the number of columns of the convolution window, and c represents the number of channels of the convolution window, where n and m are integers greater than 1 and c is an integer greater than or equal to 1; acquiring, from an on-chip memory, the n×m×c image feature values to be processed this time by the convolution window to form an image feature value array, wherein the image feature value array has n rows, m columns and c channels; for each column in the image feature value array and for each of the c channels, splicing the image feature values on that channel in each row of the current column to obtain a first feature value splicing result of the current column on that channel, and splicing the first feature value splicing results of the current column on all channels to obtain a second feature value splicing result of the current column; sequentially storing the second feature value splicing results corresponding to the columns of the image feature value array into a data buffer; when the data amount M of image feature values that the MAC array can process in one clock cycle is smaller than K, dividing the image feature values in all second feature value splicing results in the data buffer into batches according to the data amount M to obtain a plurality of batches of image feature values, wherein the number of image feature values in each batch is equal to M and K represents the total number of image feature values in the image feature value array; and sequentially sending the plurality of batches of image feature values to the MAC array over a plurality of consecutive clock cycles according to the storage order of the batches.
According to the method for providing data for the MAC array, when the MAC array is used for a convolution operation, the size of the convolution window is obtained; the image feature values to be processed by the convolution window are acquired from an on-chip memory to form an image feature value array; for each column in the image feature value array and for each of the c channels, the image feature values on that channel in each row of the current column are spliced to obtain a first feature value splicing result of the current column on that channel, and the first feature value splicing results of the current column on all channels are spliced to obtain a second feature value splicing result of the current column; the second feature value splicing results of the columns of the image feature value array are sequentially stored into a data buffer; and, when the data amount M of image feature values that the MAC array can process in one clock cycle is smaller than the total number K of image feature values in the image feature value array, the image feature values in all second feature value splicing results in the data buffer are divided into batches according to the data amount M to obtain a plurality of batches of image feature values, which are sequentially sent to the MAC array over a plurality of consecutive clock cycles according to their storage order. An approach is thus provided in which the image feature data required by the convolution operation is stored in one data buffer and supplied to the MAC array from that data buffer, which reduces the hardware cost of supplying data to the MAC array and supplies that data in a convenient manner.
In one possible implementation of the present application, the method further includes:
and under the condition that the data quantity M of the image characteristic values which can be processed by the MAC array in one clock period is larger than or equal to K, sequentially reading the image characteristic values from the second characteristic value splicing results of the data buffer to the MAC array until the number of the read image characteristic values is K.
In one possible implementation manner of the present application, the obtaining, from an on-chip memory, n rows of image feature values to be processed by the convolution window to form an image feature value array includes: simultaneously reading n lines of image characteristic values to be processed of the convolution window from an on-chip memory through n paths of read data memory channels RDMA, wherein each path of RDMA reads one line of image characteristic values; and forming the image characteristic value array according to the n rows of image characteristic values.
In one possible implementation manner of the present application, the sequentially sending the plurality of batches of image feature values to the MAC array over a plurality of consecutive clock cycles according to the storage order of the plurality of batches of image feature values includes: sorting the plurality of batches of image feature values according to their storage order to obtain a sorting result; for an i-th clock cycle of the plurality of consecutive clock cycles, determining a target start address in the data buffer of the target batch of image feature values at the i-th position in the sorting result, where i is a positive integer smaller than L, and L represents the total number of the plurality of consecutive clock cycles; reading M consecutive image feature values from the data buffer starting from the target start address; taking the M image feature values read out as the i-th batch of image feature values; and sending the i-th batch of image feature values to the MAC array.
In one possible implementation manner of the present application, the determining the target start address in the data buffer of the target batch of image feature values at the i-th position in the sorting result includes: when i is greater than 1, determining an initial start address in the data buffer of the batch of image feature values at the first position in the sorting result; and determining the target start address in the data buffer of the target batch of image feature values at the i-th position in the sorting result according to the initial start address, i and the data amount M.
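The address determination described above can be illustrated with a minimal Python sketch. The text only states that the target start address depends on the initial start address, i and M; the linear formula `initial_start + (i - 1) * m`, the addressing unit (one address per image feature value) and the function names `batch_start_address` and `read_batch` are assumptions made for illustration.

```python
from typing import List, Sequence


def batch_start_address(initial_start: int, i: int, m: int) -> int:
    """Start address of the i-th batch (i counted from 1).

    Assumption: addresses count image feature values and batches are
    stored back to back, so consecutive batches are m addresses apart.
    """
    if i < 1 or m < 1:
        raise ValueError("i and m must be positive")
    return initial_start + (i - 1) * m


def read_batch(buffer: Sequence[int], initial_start: int, i: int, m: int) -> List[int]:
    """Read the m consecutive values that form the i-th batch."""
    start = batch_start_address(initial_start, i, m)
    return list(buffer[start:start + m])


# Hypothetical data buffer holding 18 feature values.
data_buffer = list(range(100, 118))
second_batch = read_batch(data_buffer, initial_start=0, i=2, m=6)
```

Under these assumptions the first batch starts at the initial start address itself, which is why the text only needs the extra computation when i is greater than 1.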
In one possible implementation manner of the present application, in a case where the depth of the data buffer is changed from a first value to a second value and the bit width of the data buffer is changed from a third value to a fourth value, where the product of the first value and the third value is equal to the product of the second value and the fourth value, the fourth value is B times the third value, and B is an integer greater than 1, the reading from the data buffer starting from the target start address includes: reading C pieces of data from the data buffer starting from the target start address, where C is determined according to the data amount M, the third value and the fourth value; determining M pieces of data from the C pieces of data, where the bit width of each piece of data is the third value; and performing a B-to-1 selection operation on each of the M pieces of data to obtain the image feature value corresponding to each of the M pieces of data.
According to a second aspect of embodiments of the present application, there is provided an apparatus for providing data for a MAC array, including: a first obtaining module, configured to obtain the size of a convolution window, where the size of the convolution window is n×m×c, n represents the number of rows of the convolution window, m represents the number of columns of the convolution window, c represents the number of channels of the convolution window, n and m are integers greater than 1, and c is an integer greater than or equal to 1; a second obtaining module, configured to acquire, from an on-chip memory, the n×m×c image feature values to be processed this time by the convolution window to form an image feature value array, where the image feature value array has n rows, m columns and c channels; a splicing module, configured to, for each column in the image feature value array and for each of the c channels, splice the image feature values on that channel in each row of the current column to obtain a first feature value splicing result of the current column on that channel, and splice the first feature value splicing results of the current column on all channels to obtain a second feature value splicing result of the current column; a storage module, configured to sequentially store the second feature value splicing results corresponding to the columns of the image feature value array into a data buffer; a batch module, configured to, when the data amount M of image feature values that the MAC array can process in one clock cycle is smaller than K, divide the image feature values in all second feature value splicing results in the data buffer into batches according to the data amount M to obtain a plurality of batches of image feature values, where the number of image feature values in each batch is equal to M and K represents the total number of image feature values in the image feature value array; and a first sending module, configured to sequentially send the plurality of batches of image feature values to the MAC array over a plurality of consecutive clock cycles according to the storage order of the batches.
The device for providing data for the MAC array, provided by the embodiment of the application, acquires the size of a convolution window when the convolution operation is performed by adopting the MAC array; acquiring image characteristic values to be processed of a convolution window from an on-chip memory to form an image characteristic value array, and for each channel in c channels in each column in the image characteristic value array, splicing the image characteristic values on the channels in each row in the current column to obtain a first characteristic value splicing result of the current column on the channel, and splicing the first characteristic value splicing results of the current column on each channel to obtain a second characteristic value splicing result of the current column; and sequentially storing second characteristic value splicing results of the image characteristic value arrays on each column into a data buffer, carrying out batch processing on the image characteristic values in all the second characteristic value splicing results in the data buffer according to the data quantity M under the condition that the data quantity M of the image characteristic values which can be processed by the MAC array in one clock period is smaller than the total number K of the image characteristic values in the image characteristic value arrays so as to obtain a plurality of batches of image characteristic values, and sequentially sending the plurality of batches of image characteristic values to the MAC array in a plurality of continuous clock periods according to the storage sequence of the plurality of batches of image characteristic values. 
Therefore, an approach is provided in which the image feature data required by the convolution operation is stored in one data buffer and supplied to the MAC array from that data buffer, which reduces the hardware cost of supplying data to the MAC array and supplies that data in a convenient manner.
In one possible implementation manner of the present application, the apparatus further includes a second sending module, configured to, when the data amount M of image feature values that the MAC array can process in one clock cycle is greater than or equal to K, sequentially read image feature values from the second feature value splicing results in the data buffer and send them to the MAC array until the number of image feature values read is K.
In one possible implementation manner of the present application, the first obtaining module is specifically configured to: simultaneously reading n lines of image characteristic values to be processed of the convolution window from an on-chip memory through n paths of read data memory channels RDMA, wherein each path of RDMA reads one line of image characteristic values; and forming the image characteristic value array according to the n rows of image characteristic values.
In one possible implementation manner of the present application, the first sending module includes: a sorting unit, configured to sort the plurality of batches of image feature values according to their storage order to obtain a sorting result; a first determining unit, configured to determine, for an i-th clock cycle of the plurality of consecutive clock cycles, a target start address in the data buffer of the target batch of image feature values at the i-th position in the sorting result, where i is a positive integer smaller than L, and L represents the total number of the plurality of consecutive clock cycles; a reading unit, configured to read M consecutive image feature values from the data buffer starting from the target start address; a second determining unit, configured to take the M image feature values read out as the i-th batch of image feature values; and a sending unit, configured to send the i-th batch of image feature values to the MAC array.
In one possible implementation manner of the present application, the first determining unit is specifically configured to: when i is greater than 1, determine an initial start address in the data buffer of the batch of image feature values at the first position in the sorting result; and determine the target start address in the data buffer of the target batch of image feature values at the i-th position in the sorting result according to the initial start address, i and the data amount M.
In one possible implementation manner of the present application, in a case where the depth of the data buffer is changed from a first value to a second value and the bit width of the data buffer is changed from a third value to a fourth value, where the product of the first value and the third value is equal to the product of the second value and the fourth value, the fourth value is B times the third value, and B is an integer greater than 1, the reading unit is specifically configured to: read C pieces of data from the data buffer starting from the target start address, where C is determined according to the data amount M, the third value and the fourth value; determine M pieces of data from the C pieces of data, where the bit width of each piece of data is the third value; and perform a B-to-1 selection operation on each of the M pieces of data to obtain the image feature value corresponding to each of the M pieces of data.
According to a third aspect of the present application, there is provided a chip which may include the apparatus for providing data to a multiplier accumulator MAC array as set forth in the embodiments of the second aspect of the present application.
According to a fourth aspect of the present application, an electronic device is provided, which may include a chip set forth in an embodiment of the third aspect of the present application.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow diagram illustrating a method of providing data for a MAC array in accordance with an exemplary embodiment;
FIG. 2 is an exemplary diagram of a simple format for storing image feature values in an OCM and data buffer;
FIG. 3 is a flow chart illustrating a method of providing data for a MAC array according to another exemplary embodiment;
FIG. 4 is a flow chart illustrating a method of providing data for a MAC array according to another exemplary embodiment;
FIG. 5 is a flow chart illustrating a method of providing data for a MAC array according to another exemplary embodiment;
FIG. 6 is an exemplary diagram of a convolution window including 8 batches of image feature values;
FIG. 7 is an exemplary diagram of a convolution window including 11 batches of image feature values;
FIG. 8 is an exemplary diagram of image feature values on consecutive 1152 bits in a data buffer;
FIG. 9 is a schematic diagram illustrating an apparatus for providing data to a MAC array in accordance with an exemplary embodiment;
FIG. 10 is a block diagram of an electronic device for implementing a method of providing data for a MAC array, according to an example embodiment.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
The method, the device and the electronic device for providing data for the MAC array provided in the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flow chart illustrating a method of providing data for a MAC array according to an exemplary embodiment. It should be noted that the method for providing data to the MAC array may be applied to a device for providing data to a multiplier accumulator (Multiply Accumulate, MAC) array. The device for providing data to the MAC array may be configured in a chip having the MAC array, for example, may be configured in an artificial intelligence chip having the MAC array. Wherein the chip may be configured in an electronic device.
The electronic device may include a terminal device, a server, and the like, and the embodiment is not particularly limited to the electronic device.
As shown in fig. 1, the method for providing data to a MAC array may include the steps of:
Step 101, obtaining the size of a convolution window, wherein the size of the convolution window is n×m×c, n represents the number of rows of the convolution window, m represents the number of columns of the convolution window, and c represents the number of channels of the convolution window.
Wherein n and m are integers greater than 1.
Wherein c is an integer greater than or equal to 1.
Step 102, acquiring, from an on-chip memory, the n rows of image feature values to be processed this time by the convolution window to form an image feature value array.
Wherein each of the n rows of image feature values comprises m×c image feature values.
The number of rows of the image characteristic value array is n, the number of columns is m, and the number of channels is c.
In some exemplary embodiments, the n lines of image feature values to be processed at this time of the convolution window may be obtained from an On-Chip Memory (OCM) according to the size of the convolution window.
In some examples, to improve the efficiency of acquiring n rows of image feature values, n rows of image feature values to be processed at this time by a convolution window may be simultaneously read from an on-chip memory through n-way read data memory channels (Read Data Memory Access, RDMA), where each way RDMA reads a row of image feature values; and forming an image feature value array based on the read n lines of image feature values.
Step 103, for each column in the image eigenvalue array, for each channel in the c channels, the image eigenvalues on the channels in each row in the current column are spliced to obtain a first eigenvalue splicing result of the current column on the channel, and the first eigenvalue splicing results of the current column on each channel are spliced to obtain a second eigenvalue splicing result of the current column.
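The two-level splicing in Step 103 can be sketched in Python as follows. This is only an illustration: representing the image feature value array as a nested list indexed `array[row][column][channel]` and the function name `splice_columns` are assumptions, and splicing is modeled as list concatenation rather than bit-level concatenation.

```python
from typing import List, Sequence


def splice_columns(array: Sequence[Sequence[Sequence[str]]]) -> List[List[str]]:
    """Return one second splicing result per column.

    For each column, the values of one channel are gathered across all
    rows (first splicing result), and the per-channel results are then
    joined in channel order (second splicing result).
    """
    n = len(array)            # rows
    m = len(array[0])         # columns
    c = len(array[0][0])      # channels
    results = []
    for col in range(m):
        second = []
        for ch in range(c):
            first = [array[row][col][ch] for row in range(n)]
            second.extend(first)
        results.append(second)
    return results


# Hypothetical 2x2x2 array; each value names its (row, column, channel).
example = [[[f"r{r}c{col}k{ch}" for ch in range(2)] for col in range(2)]
           for r in range(2)]
spliced = splice_columns(example)
```

Storing the per-column results one after another, as Step 104 describes, then yields a buffer ordered by column first, channel second and row last.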
In some exemplary embodiments, when c is equal to 1, the image feature values in each column of the image feature value array may be spliced separately to obtain the feature value splicing result corresponding to each column, and the feature value splicing results corresponding to the columns are sequentially stored in one data buffer, so that the n rows of image feature values are buffered by a single data buffer.
In some examples, the image feature value corresponding to a pixel may be represented by the pixel value of that pixel in the image to be processed.
In some exemplary embodiments, the image feature values are stored in the OCM in the order (h, w, ic, fb). That is, with a precision of fb bits per point, the data in the ic direction is stored first, then the data in the w direction, and finally the data in the h direction. One point is a point whose extent is 1 in the w (short for width) direction, the h (short for height) direction, and the ic direction.
Wherein w represents the width direction of the image to which the image feature value belongs, h represents the height direction of the image to which the image feature value belongs, ic represents the channel direction corresponding to the channel of the image to which the image feature value belongs, and fb is the number of bits occupied by one point.
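As a worked illustration of the (h, w, ic, fb) storage order, the following Python sketch computes the bit offset of one point in the OCM. It assumes a densely packed layout with no padding between rows or channels; the function name `ocm_bit_offset` and the parameters `width` and `channels` (the image width and channel count) are hypothetical names introduced here.

```python
def ocm_bit_offset(h: int, w: int, ic: int, width: int, channels: int, fb: int) -> int:
    """Bit offset of point (h, w, ic) when ic varies fastest, then w, then h.

    Assumption: points are packed back to back, each occupying fb bits.
    """
    if not (0 <= w < width and 0 <= ic < channels and h >= 0):
        raise ValueError("coordinate out of range")
    point_index = (h * width + w) * channels + ic
    return point_index * fb


# Hypothetical image: 4 points wide, 16 channels, 8 bits per point.
next_channel = ocm_bit_offset(0, 0, 1, width=4, channels=16, fb=8)
next_column = ocm_bit_offset(0, 1, 0, width=4, channels=16, fb=8)
next_row = ocm_bit_offset(1, 0, 0, width=4, channels=16, fb=8)
```

With these example numbers, moving one step in ic advances fb bits, one step in w advances channels×fb bits, and one step in h advances width×channels×fb bits.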
As an example, for each column in the image feature value array, in the process of stitching the image feature values in the current column, stitching processing may be performed on the image feature values in the current column based on a preset stitching granularity, so as to obtain a feature value stitching result of the current column.
Wherein the splice granularity in this example is determined based on fb and c. Where c in this example represents the total number of channels of the image to which the image feature value belongs.
For example, when data is spliced in the ic direction with 128 bits as the splicing granularity, each row of image feature value data is loaded by one of the n RDMA channels, and the data from the n RDMA channels is then spliced according to the splicing granularity and output to the data buffer. The storage order of the data in the data buffer is (w, ic×fb/128, n, 128). FIG. 2 shows a simple example of the storage formats of the image feature values in the OCM and in the data buffer. In FIG. 2, the exemplary embodiment is described with n equal to 4 and m equal to 4. The feature value splicing result at the first position in the data buffer (denoted A in the drawing) is formed as follows: the image feature values of the first column of the image feature value array that correspond to the first batch of channels are spliced to obtain a first feature value splicing result, and a second feature value splicing result is then appended after it, the second feature value splicing result being obtained by splicing the image feature values of the first column that correspond to the second batch of channels. The feature value splicing results at the other three positions in the data buffer (the second, third and fourth positions, denoted B, C and D in FIG. 2, respectively) are determined in a manner similar to that of the first position A, which is not repeated here.
Step 104, sequentially storing the second feature value splicing results corresponding to each column in the image feature value array into a data buffer.
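The (w, ic×fb/128, n, 128) ordering can be sketched in Python under some simplifying assumptions: each 128-bit chunk is modeled as a tuple of 128/fb channel values, the channel count is assumed to be a whole number of chunks, and the input is a nested list `rows[row][w][ic]` as it might arrive from the n RDMA channels. The function name `fill_data_buffer` is hypothetical.

```python
from typing import List, Sequence, Tuple


def fill_data_buffer(rows: Sequence[Sequence[Sequence[object]]], fb: int,
                     granularity: int = 128) -> List[Tuple[object, ...]]:
    """Emit chunks in the order (w, channel group, row).

    Each emitted tuple stands for one chunk of `granularity` bits that
    holds granularity // fb consecutive channel values of one point.
    """
    if granularity % fb:
        raise ValueError("fb must divide the splicing granularity")
    per_chunk = granularity // fb
    n, width, channels = len(rows), len(rows[0]), len(rows[0][0])
    if channels % per_chunk:
        raise ValueError("channels must fill whole chunks (assumption)")
    buffer = []
    for w in range(width):
        for group in range(channels // per_chunk):
            for row in range(n):
                lo = group * per_chunk
                buffer.append(tuple(rows[row][w][lo:lo + per_chunk]))
    return buffer


# Hypothetical input: 2 rows, 2 columns, 4 channels, 64-bit points,
# so each 128-bit chunk holds 2 channel values.
rows = [[[(r, w, ic) for ic in range(4)] for w in range(2)] for r in range(2)]
chunks = fill_data_buffer(rows, fb=64)
```

In this sketch the chunks for one column and one channel group from all n rows sit next to each other, followed by the next channel group of the same column, which mirrors how position A is described for FIG. 2.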
In some examples, under the condition that the value of c is equal to 1, the image feature values of each column in the image feature value array may be spliced respectively to obtain a row vector, where the row vector includes feature value splicing results corresponding to each column in the image feature value array.
In some examples, the plurality of eigenvalue splice results in the row vector may be sequentially stored in the data buffer according to the sequential positions of the eigenvalue splice results in the row vector.
In other examples, when the value of c is greater than 1, after the second eigenvalue splicing result corresponding to each column in the image eigenvalue array is obtained, a row vector may be formed according to the order of the columns in the image eigenvalue array, where the row vector includes the second eigenvalue splicing results in that order. The second eigenvalue splicing results in the row vector are then stored sequentially into the data buffer according to their positions in the row vector.
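The splicing and storing described for steps 103 and 104 can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware implementation: the function names `stitch_columns` and `store_in_buffer`, the nested-list layout `array[row][col][ch]`, and the small example values are all assumptions introduced here.

```python
from typing import List


def stitch_columns(array: List[List[List[int]]]) -> List[List[int]]:
    """Return one second-level splicing result per column.

    array[row][col][ch] holds one image feature value (layout assumed here).
    """
    n = len(array)            # rows of the convolution window
    m = len(array[0])         # columns of the convolution window
    c = len(array[0][0])      # channels
    row_vector = []
    for col in range(m):
        second = []
        for ch in range(c):
            # First splicing result: the n values of this column on one channel.
            first = [array[row][col][ch] for row in range(n)]
            # Second splicing result: first results of all channels, in order.
            second.extend(first)
        row_vector.append(second)
    return row_vector


def store_in_buffer(row_vector: List[List[int]]) -> List[int]:
    """Store the per-column results one after another in a single buffer."""
    data_buffer: List[int] = []
    for second in row_vector:
        data_buffer.extend(second)
    return data_buffer


# Example with n=2, m=2, c=2; each value encodes its position as 100*row + 10*col + ch.
feature = [[[100 * r + 10 * col + ch for ch in range(2)] for col in range(2)]
           for r in range(2)]
data_buffer = store_in_buffer(stitch_columns(feature))
print(data_buffer)
```

With c equal to 1 the inner channel loop runs once, which corresponds to the simpler case in which each column is spliced directly.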
Step 105, under the condition that the data quantity M of the image characteristic values which can be processed by the MAC array in one clock period is smaller than K, carrying out batch processing on the image characteristic values in all the second characteristic value splicing results in the data buffer according to the data quantity M so as to obtain a plurality of batches of image characteristic values.
Wherein the number of image feature values in each batch of image feature values is equal to M.
Where K represents the total number of image feature values in the image feature value array.
It is understood that the value of K is the same as the size of the convolution window, i.e., K=n×m×c.
Step 106, sequentially transmitting the image characteristic values of the plurality of batches to the MAC array on a plurality of continuous clock cycles according to the storage sequence of the image characteristic values of the plurality of batches.
Where a clock cycle is the most basic, smallest unit of time in a computer.
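Steps 105 and 106 can be illustrated with a minimal Python sketch. It assumes the data buffer is modelled as a flat list of image feature values and that one batch is sent per clock cycle; the names `make_batches` and `send_over_cycles` are hypothetical, and the handling of a shorter final batch when K is not a multiple of M is an assumption rather than something the text specifies.

```python
from typing import Iterator, List, Tuple


def make_batches(data_buffer: List[int], M: int) -> List[List[int]]:
    """Split the buffer into batches of M values, keeping the storage order."""
    if M <= 0:
        raise ValueError("M must be positive")
    return [data_buffer[start:start + M]
            for start in range(0, len(data_buffer), M)]


def send_over_cycles(batches: List[List[int]]) -> Iterator[Tuple[int, List[int]]]:
    """Yield (clock cycle, batch): one batch per consecutive clock cycle."""
    for cycle, batch in enumerate(batches, start=1):
        yield cycle, batch


# Example: K = 8 values in the buffer, M = 4 values per clock cycle.
buffer_model = list(range(8))
schedule = list(send_over_cycles(make_batches(buffer_model, 4)))
print(schedule)
```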
It will be appreciated that in the related art it is generally necessary to provide n-1 data buffers to buffer n-1 rows of image feature values and to supply data to the MAC array through the plurality of data buffers; however, this approach tends to increase the hardware cost of implementing the convolution operation. In this example, the image feature data required by the MAC array is stored in one data buffer, and data is supplied to the MAC array from that data buffer. This reduces the hardware cost of supplying data to the MAC array, makes the configuration of the convolution operation core more flexible, and reduces the cost of implementing the convolution operation with the MAC array.
It should be noted that, for the process in which the MAC array, after acquiring the image feature values, processes them based on the weight data corresponding to the MAC array, reference may be made to the description in the related art, which is not repeated here; the focus of this example is on how to supply data to the MAC array.
According to the method for providing data for the MAC array, when the MAC array is used for a convolution operation, the size of the convolution window is obtained, and the image characteristic values to be processed of the convolution window are acquired from an on-chip memory to form an image characteristic value array. For each column in the image characteristic value array and for each of the c channels, the image characteristic values on that channel in each row of the current column are spliced to obtain a first characteristic value splicing result of the current column on that channel, and the first characteristic value splicing results of the current column on all channels are spliced to obtain a second characteristic value splicing result of the current column. The second characteristic value splicing results of the columns of the image characteristic value array are stored sequentially into a data buffer. When the data quantity M of image characteristic values that the MAC array can process in one clock period is smaller than the total number K of image characteristic values in the image characteristic value array, the image characteristic values in all the second characteristic value splicing results in the data buffer are divided into batches according to the data quantity M to obtain a plurality of batches of image characteristic values, and the plurality of batches are sent sequentially to the MAC array over a plurality of continuous clock periods according to their storage order. The image characteristic data required by the convolution processing is thus stored in one data buffer and supplied to the MAC array from that data buffer, which reduces the hardware cost of supplying data to the MAC array and supplies the data in a convenient manner.
The method for providing data to the MAC array according to the embodiment of the present application is further described below with reference to fig. 3. In this embodiment, the above embodiment is further refined or described.
Fig. 3 is a flow chart illustrating a method of providing data for a MAC array according to another exemplary embodiment.
As shown in fig. 3, the method for providing data to the MAC array includes the following steps:
Step 301, obtaining the size of the convolution window, where the size of the convolution window is n×m×c, n represents the number of rows of the convolution window, m represents the number of columns of the convolution window, and c represents the number of channels of the convolution window.
Wherein n and m are integers greater than 1.
Wherein c is an integer greater than or equal to 1.
Step 302, acquiring n rows of image feature values to be processed of the convolution window from an on-chip memory to form an image feature value array, wherein each of the n rows of image feature values comprises m×c image feature values, the number of rows of the image feature value array is n, the number of columns is m, and the number of channels is c.
Step 303, for each column in the image feature value array, for each channel in the c channels, stitching the image feature values on the channels in each row in the current column to obtain a first feature value stitching result of the current column on the channel, and stitching the first feature value stitching results of the current column on each channel to obtain a second feature value stitching result of the current column.
Step 304, sequentially storing the second characteristic value splicing result of the image characteristic value array on each column into a data buffer.
It should be noted that, regarding the specific implementation manner of the steps 301 to 304, reference may be made to the related description of the embodiments of the present disclosure, which is not repeated here.
Step 305, determining whether the data amount M of image feature values that can be processed by the MAC array in one clock cycle is greater than or equal to K; if yes, executing step 306, otherwise executing steps 307 and 308.
Where K is used to represent the total number of image feature values in the array of image feature values.
In some exemplary embodiments, the data amount M of image feature values that can be processed by the MAC array in one clock cycle may be acquired, and it is then determined whether the data amount M is greater than or equal to K; if the data amount M is greater than or equal to K, step 306 is performed, and if the data amount M is less than K, steps 307 and 308 are performed.
Step 306, sequentially reading the image characteristic values from the second characteristic value splicing results of the data buffer to the MAC array until the number of the read image characteristic values is K.
Step 307, performing batch processing on the image feature values in all the second feature value splicing results in the data buffer according to the data amount M to obtain a plurality of batches of image feature values, wherein the number of image feature values in each batch is equal to M, and K represents the total number of image feature values in the image feature value array.
Step 308, sequentially transmitting the plurality of batches of image feature values to the MAC array over a plurality of consecutive clock cycles in accordance with the storage order of the plurality of batches of image feature values.
In this example, the image feature values in the image feature value array are stored in one data buffer, and when the data amount M of image feature values that can be processed by the MAC array in one clock cycle is smaller than the total number K of image feature values in the image feature value array, the image feature values in the data buffer are supplied to the MAC array in batches, so that the hardware cost required for supplying data to the MAC array is reduced.
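The decision in step 305 can be sketched as a small planning function. This is a hedged illustration only: `plan_supply` is a hypothetical name, the cycle count of 1 for the M ≥ K branch and the rounding up of K/M for the batched branch are assumptions made here, and the example sizes are chosen only because 288 is one of the values of M mentioned later in the text.

```python
import math
from typing import Dict, Union


def plan_supply(n: int, m: int, c: int, M: int) -> Dict[str, Union[int, str]]:
    """Choose between step 306 and steps 307/308 for an n x m x c window."""
    K = n * m * c  # total number of image feature values in the array
    if M >= K:
        # Step 306: the MAC array can take all K values at once (assumed 1 cycle).
        return {"K": K, "steps": "306", "cycles": 1}
    # Steps 307 and 308: one batch of M values per clock cycle (assumed).
    return {"K": K, "steps": "307-308", "cycles": math.ceil(K / M)}


print(plan_supply(3, 3, 64, 288))   # K = 576 exceeds M = 288, so batching is used
print(plan_supply(3, 3, 1, 288))    # K = 9 fits in one cycle
```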
The method for providing data to the MAC array according to the embodiment of the present application is further described below with reference to fig. 4.
Fig. 4 is a flow chart illustrating a method of providing data for a MAC array according to another exemplary embodiment.
Here, an exemplary embodiment is described taking the number of channels c of the convolution window equal to 1 as an example.
As shown in fig. 4, the method for providing data to the MAC array includes the steps of:
Step 401, obtaining the size of a convolution window, where the size of the convolution window is n×m.
Where n represents the number of rows of the convolution window and m represents the number of columns of the convolution window.
Step 402, acquiring n rows of image characteristic values to be processed in the convolution window from an on-chip memory.
Wherein each of the n rows of image feature values includes m image feature values.
In some exemplary embodiments, n lines of image feature values required for the present processing of the convolution window may be obtained from on-chip memory.
Step 403, forming an image feature value array according to the n rows of image feature values.
Step 404, respectively splicing each column of image characteristic values in the image characteristic value array to obtain characteristic value splicing results corresponding to each column of image characteristic values, and sequentially storing a plurality of characteristic value splicing results into a data buffer.
Step 405, determining whether the data amount M of image feature values that can be processed by the MAC array in one clock cycle is greater than or equal to K; if yes, executing step 406, otherwise executing steps 407 and 408.
Where K is used to represent the total number of image feature values in the array of image feature values.
In some exemplary embodiments, the data amount M of image feature values that can be processed by the MAC array in one clock cycle may be acquired, and it is then determined whether the data amount M is greater than or equal to K; if the data amount M is greater than or equal to K, step 406 is performed, and if the data amount M is less than K, steps 407 and 408 are performed.
Step 406, sequentially reading the image characteristic values from the characteristic value splicing results of the data buffer to the MAC array until the number of the read image characteristic values is K.
In some exemplary embodiments, the image feature values may be sequentially read from the respective feature value concatenation results of the data buffer to the MAC array according to the storage order of the respective feature value concatenation results in the data buffer until the number of the read image feature values is K.
Step 407, performing batch processing on the image feature values in all the feature value splicing results in the data buffer according to the data amount M to obtain a plurality of batches of image feature values, wherein the number of image feature values in each batch is smaller than or equal to M.
Step 408, sequentially transmitting the plurality of batches of image feature values to the MAC array over a plurality of consecutive clock cycles in the order in which the plurality of batches of image feature values are stored.
In this example, the image feature values in the image feature value array are stored in one data buffer, and when the data amount M of image feature values that can be processed by the MAC array in one clock cycle is smaller than the total number K of image feature values in the image feature value array, the image feature values in the data buffer are supplied to the MAC array in batches, so that the hardware cost required for supplying data to the MAC array is reduced.
In order to make it clear how the multiple batches of image feature values will be transmitted sequentially to the MAC array over multiple successive clock cycles in the order in which they are stored, a further exemplary description of the method of providing data to the MAC array of this embodiment is provided below in connection with fig. 5.
Fig. 5 is a flow chart illustrating a method of providing data for a MAC array according to another exemplary embodiment.
As shown in fig. 5, one possible implementation of sequentially sending the plurality of batches of image feature values to the MAC array over a plurality of consecutive clock cycles in the order in which the batches are stored may include the steps of:
Step 501, sorting the plurality of batches of image feature values according to their storage order to obtain a sorting result.
In some exemplary embodiments, the plurality of batches of image feature values may be sorted from front to back according to their storage order to obtain the sorting result.
Step 502, determining, for an ith clock cycle of a plurality of consecutive clock cycles, a target start address in the data buffer of the target batch of image feature values located at the ith position in the sorting result, where i is a positive integer and is less than L, where L represents a total number of the plurality of consecutive clock cycles.
In some exemplary embodiments, to improve the accuracy of the determined target start address, one possible implementation of determining the target start address in the data buffer of the target batch of image feature values located at the ith position in the sorting result is: when i is greater than 1, determining the initial start address in the data buffer of the batch of image feature values located at the first position in the sorting result; and determining the target start address in the data buffer of the target batch of image feature values located at the ith position in the sorting result according to the initial start address, i and the data amount M.
It can be understood that, when i is equal to 1, the target start address is the initial start address in the data buffer of the batch of image feature values located at the first position in the sorting result.
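The text states only that the target start address is determined from the initial start address, i and M. One natural reading, assumed here and not confirmed by the source, is that batches are stored back to back, so the address advances by M image feature values per batch. The function name `target_start_address` and the unit of the address (one image feature value) are likewise assumptions.

```python
def target_start_address(initial_start: int, i: int, M: int) -> int:
    """Start address of the batch at the i-th position of the sorting result.

    Assumes contiguous batches of M values; addresses count feature values.
    """
    if i < 1:
        raise ValueError("i must be a positive integer")
    if i == 1:
        # The first batch starts at the initial start address itself.
        return initial_start
    # Each earlier batch occupies M consecutive values (assumed layout).
    return initial_start + (i - 1) * M


# Start addresses of the first four batches for M = 288, initial address 0.
print([target_start_address(0, i, 288) for i in range(1, 5)])
```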
In order to clearly understand the process of determining the target start address, the process of reading the image feature values from the data buffer to the MAC array is described below with reference to fig. 6, which shows an example in which one convolution window contains 8 batches of image feature values when a non-hole convolution operation is performed by the MAC array. Correspondingly, the 8 batches of image feature values in this convolution window need to be sent to the MAC array. Based on the content shown in fig. 6, it can be seen that, after an output feature value is calculated based on the 8 batches of image feature values contained in one convolution window, the convolution window can be controlled to slide by a preset sliding step, and the next output feature value can be calculated based on the 8 batches of image feature values contained in the convolution window after sliding.
Specifically, in one clock cycle, each image feature value in the first batch of image feature values is read to the MAC array starting from the initial start address of the first batch. The target start address of the second batch is then determined based on the initial start address and the number of image feature values in the first batch, and in the next clock cycle each image feature value in the second batch is read to the MAC array starting from that target start address. The target start address of each following batch is determined in the same way, and that batch is read to the MAC array in the following clock cycle, until the 8th batch of image feature values in the data buffer has been sent to the MAC array.
It should be noted that the MAC array may calculate one output feature value based on 8 batches of image feature values, all of which are in the convolution window. Every output feature value is calculated based on the 8 batches of image feature values in the convolution window; only the contents of those 8 batches differ from one output feature value to the next. The convolution window in this example slides on the image to be processed with a preset sliding step. Correspondingly, when the next output feature value is calculated, the convolution window can be controlled to slide once on the image to be processed by the preset sliding step, and the convolution operation is performed based on the image feature values of the pixel points currently in the convolution window.
It should be understood that, in the process that the convolution window slides on one line of image feature values of the image to be processed, after the convolution window contains the last image feature value in the line of image feature values, after calculating an output feature value based on each image feature value in the convolution window, the convolution window can be controlled to move to the beginning position of the next line of image feature values of the image to be processed.
For example, the sliding step length is set in a unit of one pixel point, and when the convolution window includes the image feature values corresponding to N pixel points in the line of image feature values, the number of total output feature values that can be output by the convolution window on the line of image feature values is equal to N.
In some examples, when a non-hole convolution operation is performed based on the MAC array, the control procedure for reading data from the data buffer is exemplified as follows: computing one output feature value requires the data amount of one convolution window, which is sent to the MAC array in (nxm_rd_batch_max+1) batches; when the next output feature value is computed, the window slides by a fixed step nxm_rd_stride_4B_1; when the last output feature value of one row has been computed, the window slides to the position where the data of the next row begins, with a sliding step of nxm_line_last_stride_stride_4B.
The corresponding configuration parameters have the following meanings: (nxm_rd_batch_max+1) is the data amount of one convolution window, in units of batches. Dimension 0 is not used, and its parameters are configured to be 0. Dimension 1 corresponds to the calculation of one row: nxm_rd_stride_4B_1 is the step by which the window slides when the next output feature value (out pixel) is computed, in units of 32 bits; nxm_rd_max_1+1 is the number of output feature values (pixels) corresponding to the convolution of one row.
Wherein, the description of the two-dimensional configuration parameters is shown in the following table.
As can be seen from the above table, the bit widths of nxm_rd_batch_max, nxm_rd_max_0 and nxm_rd_max_1 are all 4, so each of nxm_rd_batch_max, nxm_rd_max_0 and nxm_rd_max_1 can take any value from 0 to 15. The bit widths of nxm_rd_stride_4B_0 and nxm_rd_stride_4B_1 are both 10, so each of nxm_rd_stride_4B_0 and nxm_rd_stride_4B_1 can take any value from 0 to 1023.
It is understood that, when a non-hole convolution operation is performed, only the three parameters nxm_rd_batch_max, nxm_rd_stride_4B_1 and nxm_rd_max_1 need to be configured according to the actual situation, and both nxm_rd_stride_4B_0 and nxm_rd_max_0 may be configured as zero.
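The parameter ranges and the non-hole configuration can be modelled with a small Python sketch. The parameter names and bit widths come from the text; the `ReadConfig` class, the `window_starts` helper and the example values (8 batches per window as in fig. 6, a stride of 36 words, 4 outputs per row) are assumptions for illustration. The line-end stride is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReadConfig:
    """Hypothetical software model of the read-control parameters."""
    nxm_rd_batch_max: int = 0      # 4-bit field: batches per window minus 1
    nxm_rd_max_0: int = 0          # 4-bit field: unused (0) for non-hole convolution
    nxm_rd_stride_4B_0: int = 0    # 10-bit field: unused (0) for non-hole convolution
    nxm_rd_max_1: int = 0          # 4-bit field: outputs per row minus 1
    nxm_rd_stride_4B_1: int = 0    # 10-bit field: window step in 32-bit units

    def __post_init__(self) -> None:
        for name in ("nxm_rd_batch_max", "nxm_rd_max_0", "nxm_rd_max_1"):
            if not 0 <= getattr(self, name) <= 15:
                raise ValueError(f"{name} must fit in 4 bits (0 to 15)")
        for name in ("nxm_rd_stride_4B_0", "nxm_rd_stride_4B_1"):
            if not 0 <= getattr(self, name) <= 1023:
                raise ValueError(f"{name} must fit in 10 bits (0 to 1023)")


def window_starts(cfg: ReadConfig, base: int = 0) -> List[int]:
    """Window start offsets (32-bit units) for each output value of one row."""
    return [base + pixel * cfg.nxm_rd_stride_4B_1
            for pixel in range(cfg.nxm_rd_max_1 + 1)]


cfg = ReadConfig(nxm_rd_batch_max=7, nxm_rd_max_1=3, nxm_rd_stride_4B_1=36)
print(cfg.nxm_rd_batch_max + 1, window_starts(cfg))
```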
In some examples, the process of reading image feature values from the data buffer to the MAC array is described below with reference to fig. 7, taking one hole convolution operation performed by the MAC array as an example. In this example, one convolution window contains 11 batches of image feature values, of which 8 batches are valid image feature values and 3 batches are invalid image feature values, as shown in fig. 7. Based on the content shown in fig. 7, after an output feature value is calculated from the 11 batches of image feature values contained in a convolution window, the convolution window can be controlled to slide by a preset sliding step, and the next output feature value is calculated based on the 11 batches of image feature values then contained in the convolution window.
Note that, in fig. 7, the 3 batches of invalid image feature values are the 3rd, 6th and 9th batches of image feature values, respectively.
Wherein each batch of valid image feature values is formed based on the image feature values currently at the corresponding non-hole points in the convolution window.
Wherein each batch of invalid image feature values is formed based on the image feature values currently at the corresponding hole points in the convolution window.
Correspondingly, in one clock cycle, the first batch of image feature values is read from the data buffer to the MAC array starting from the initial start address of the first batch. The target start address of the second batch of image feature values is determined based on the initial start address and the number of image feature values in the first batch, and in the next clock cycle the second batch is read from the data buffer to the MAC array starting from that target start address. The target start address of the third batch is then determined from the target start address of the second batch and the number of image feature values in the second batch. If the third batch is determined to be a batch of invalid image feature values, the third batch may be skipped, and the target start address of the fourth batch is determined based on the initial start address and the numbers of image feature values in the first, second and third batches. If the fourth batch is determined to be a batch of valid image feature values, it is sent to the MAC array in the corresponding next clock cycle. The subsequent batches are processed in the same way until the last batch of image feature values has been read out and sent to the MAC array.
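The skipping of invalid batches can be sketched as follows, using the fig. 7 example of 11 batches with the 3rd, 6th and 9th invalid. The function name `schedule_valid_batches`, the batch size of 4 values and the assumption that a skipped batch does not consume a clock cycle are illustrative choices made here, not details confirmed by the text.

```python
from typing import List, Set, Tuple


def schedule_valid_batches(initial_start: int,
                           batch_sizes: List[int],
                           invalid: Set[int]) -> List[Tuple[int, int, int]]:
    """Return (clock cycle, 1-based batch index, start address) per valid batch."""
    schedule = []
    address = initial_start
    cycle = 0
    for index, size in enumerate(batch_sizes, start=1):
        if index not in invalid:
            cycle += 1                      # only valid batches use a cycle (assumed)
            schedule.append((cycle, index, address))
        # The address advances past every batch, valid or not, so the start
        # address of the next batch accounts for all earlier batches.
        address += size
    return schedule


# Fig. 7 style example: 11 batches of 4 values each; batches 3, 6 and 9 are hole points.
plan = schedule_valid_batches(0, [4] * 11, {3, 6, 9})
for entry in plan:
    print(entry)
```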
In some examples, when a hole convolution operation is performed based on the MAC array, the control procedure for reading data from the data buffer is exemplified as follows: computing one non-hole point (in practice, the n vertical points obtained after the n rows are concatenated) requires a data amount of n×ic×(feature-bit), which is sent to the MAC calculation unit in (nxm_rd_batch_max+1) batches; after the current non-hole point has been supplied, the hole point is skipped and data for the next non-hole point is supplied, and this is repeated (nxm_rd_max_0+1) times until data for all non-hole points has been supplied, which requires configuring dimension 0 of the two-dimensional read; the remaining two steps are the same as in the read control for non-hole convolution.
Wherein, the corresponding configuration information is: (nxm_rd_batch_max+1) is the data amount of one non-hole point, and is in batch units.
The 0 th dimension corresponds to the calculation of an output pixel, and nxm_rd_stride_4B_0 is the data quantity of a non-hole point plus a hole point, and takes 32-bit as a unit; nxm_rd_max_0+1 is how many non-hole points there are for one convolution window. The 1 st dimension corresponds to the calculation of a row, the nxm_rd_stride_4B_1 is the step length of window sliding when calculating the next out pixel, and the unit is 32-bit; nxm_rd_max_1+1 is the number of output eigenvalues pixel corresponding to one row convolution.
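The two-dimensional read control can be sketched as three nested loops. The parameter names are taken from the text, while the function `hole_read_starts`, the `batch_words` argument (size of one batch in 32-bit units) and the way the two strides are added to form an address are assumptions introduced for illustration; the line-end stride is not modelled.

```python
from typing import Iterator, Tuple


def hole_read_starts(nxm_rd_batch_max: int,
                     nxm_rd_max_0: int,
                     nxm_rd_stride_4B_0: int,
                     nxm_rd_max_1: int,
                     nxm_rd_stride_4B_1: int,
                     batch_words: int,
                     base: int = 0) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (output pixel, non-hole point, batch, start offset in 32-bit units)."""
    for out_pixel in range(nxm_rd_max_1 + 1):          # dimension 1: one row of outputs
        for point in range(nxm_rd_max_0 + 1):          # dimension 0: non-hole points
            # Stride 0 covers one non-hole point plus the hole point that follows it.
            point_start = (base
                           + out_pixel * nxm_rd_stride_4B_1
                           + point * nxm_rd_stride_4B_0)
            for batch in range(nxm_rd_batch_max + 1):  # batches of one non-hole point
                yield out_pixel, point, batch, point_start + batch * batch_words


# Three non-hole points per window, each 2 words long and followed by a 2-word hole.
starts = list(hole_read_starts(nxm_rd_batch_max=0, nxm_rd_max_0=2,
                               nxm_rd_stride_4B_0=4, nxm_rd_max_1=1,
                               nxm_rd_stride_4B_1=1, batch_words=2))
print([offset for _, _, _, offset in starts])
```

Setting nxm_rd_max_0 and nxm_rd_stride_4B_0 to zero collapses the middle loop to a single pass, which matches the non-hole configuration described earlier.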
The image feature values in this example are the image feature values corresponding to the pixel points currently inside the convolution window as the convolution window slides over the image to be processed.
Step 503, reading M consecutive image feature values from the data buffer starting from the target start address.
In some exemplary embodiments, the depth of the data buffer is changed from a first value to a second value and the bit width of the data buffer is changed from a third value to a fourth value, where the product of the first value and the third value equals the product of the second value and the fourth value, the fourth value is B times the third value, and B is an integer greater than 1. In this case, reading M consecutive image feature values from the data buffer starting from the target start address includes: reading C data items from the data buffer starting from the target start address, where C is determined according to M, the third value and the fourth value; determining M data items from the C data items, where the bit width corresponding to each data item is the third value; and performing a B-to-1 selection operation on each of the M data items to obtain the image feature value corresponding to each of the M data items.
Wherein M in this example may be configured as one of three values: 1152, 576 or 288.
The first value, the second value, the third value and the fourth value are preset.
For example, the first value may be 296 and the third value may be 32; the fourth value is B times 32, so when B is equal to 8 the fourth value is 256 and the second value is 37. Assuming that M is 1152, in order to read 1152 image feature values out of the data buffer quickly, the data buffer with depth 296 and bit width 32 can be reshaped into a data buffer with depth 37 and bit width 256. First, 1152+256-32=1376 consecutive bits are read from the reshaped data buffer into registers, so the data selection logic of the first-stage data selector (mux) consists of 1376 37-to-1 selectors. Then, 1152 bits are selected from the 1376 bits; because the data is 32-bit aligned, the data selection logic of the second-stage mux consists of 1152 8-to-1 selectors. In this way, a single-stage mux circuit of 1152 296-to-1 selectors is converted into a two-stage mux circuit of 1376 37-to-1 selectors and 1152 8-to-1 selectors, which greatly reduces the circuit area and improves timing.
It should be noted that, when the bit width of the data buffer is 256 bits, an exemplary diagram of 1152 consecutive bits of image feature values in the data buffer is shown in fig. 8. The example in fig. 8 acquires 1152 bits starting from the start address of the 8th subgroup in the first 256 bits. Each 256 bits is divided into 8 subgroups in units of 32 bits; that is, each small cell labeled a in fig. 8 is one subgroup, and the bit width of one subgroup is 32 bits.
Step 504, taking the M read image characteristic values as an ith batch of image characteristic values.
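The two-stage selection can be checked with a word-level software model, where one list element stands for one 32-bit word. The constants follow the numbers in the text (296, 32, 8, 37, 1152, 1376); the function names, the word-level abstraction and the wrap-around at the end of the buffer are assumptions made here for the sketch, not a description of the actual circuit.

```python
from typing import List

DEPTH_WORDS = 296                        # first value: depth at 32-bit width
WORD_BITS = 32                           # third value
B = 8                                    # fourth value / third value (256 / 32)
ROWS = DEPTH_WORDS // B                  # second value: 37 rows of 256 bits
M_WORDS = 1152 // WORD_BITS              # 36 words make up one 1152-bit read
STAGE1_WORDS = M_WORDS + B - 1           # 43 words, i.e. 1376 bits


def direct_read(buffer: List[int], start: int) -> List[int]:
    """Reference model: a single-stage read of 36 words at any word address."""
    return [buffer[(start + k) % DEPTH_WORDS] for k in range(M_WORDS)]


def two_stage_read(buffer: List[int], start: int) -> List[int]:
    """Two-stage model: pick a 256-bit row, then a 32-bit aligned offset."""
    row, offset = divmod(start, B)
    # Stage 1 (37-to-1): fetch 1376 bits beginning at the chosen row boundary.
    stage1 = [buffer[(row * B + k) % DEPTH_WORDS] for k in range(STAGE1_WORDS)]
    # Stage 2 (8-to-1): each output word comes from one of 8 possible offsets.
    return stage1[offset:offset + M_WORDS]


buffer_words = list(range(DEPTH_WORDS))
# Fig. 8 style example: start at the 8th 32-bit subgroup of the first row.
print(two_stage_read(buffer_words, 7)[:3], STAGE1_WORDS * WORD_BITS)
```

In this model the first stage only ever chooses among 37 row-aligned positions and the second among 8 offsets, which is the reduction in selector width that the text attributes to the reshaped buffer.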
Step 505, the ith batch of image feature values are sent to the MAC array.
In this example, each batch of image characteristic values in the data buffer is sent to the MAC array batch by batch, so that data is supplied to the MAC array from a single data buffer.
Fig. 9 is a schematic diagram illustrating a structure of an apparatus for providing data to a MAC array according to an exemplary embodiment.
As shown in fig. 9, the apparatus 900 for providing data to a MAC array includes: a first acquisition module 901, a second acquisition module 902, a stitching module 903, a saving module 904, a batch module 905, and a first sending module 906, wherein:
A first obtaining module 901, configured to obtain a size of a convolution window, where the size of the convolution window is n×m×c, n represents a number of rows of the convolution window, m represents a number of columns of the convolution window, c represents a number of channels of the convolution window, where n and m are integers greater than 1, and c is an integer greater than or equal to 1;
a second obtaining module 902, configured to obtain n rows of image feature values to be processed at this time by using a convolution window from an on-chip memory, so as to form an image feature value array, where n rows of image feature values include m×c image feature values, the number of rows of the image feature value array is n, the number of columns is m, and the number of channels is c;
the stitching module 903 is configured to stitch, for each column in the image feature value array, for each channel in the c channels, the image feature values on the channels in each row in the current column to obtain a first feature value stitching result of the current column on the channel, and stitch the first feature value stitching results of the current column on each channel to obtain a second feature value stitching result of the current column;
the storage module 904 is configured to sequentially store the second feature value splicing results corresponding to each column in the image feature value array into a data buffer;
The batching module 905 is configured to batch-process, according to the data amount M, the image feature values in all the second feature value splicing results in the data buffer to obtain a plurality of batches of image feature values when the data amount M of the image feature values that can be processed by the MAC array in one clock cycle is smaller than K, where K represents the total number of the image feature values in the image feature value array;
a first sending module 906, configured to send the plurality of batches of image feature values to the MAC array sequentially on a plurality of consecutive clock cycles according to the storage order of the plurality of batches of image feature values.
In one embodiment of the present application, the apparatus further comprises:
and the second sending module is used for reading the image characteristic values from the second characteristic value splicing results of the data buffer in sequence to the MAC array until the number of the read image characteristic values is K under the condition that the data quantity M of the image characteristic values which can be processed by the MAC array in one clock period is larger than or equal to K.
In one embodiment of the present application, the second obtaining module is specifically configured to:
simultaneously read, from the on-chip memory through n read data memory channels (RDMA), the n rows of image feature values to be processed by the convolution window, where each RDMA channel reads one row of image feature values.
In one embodiment of the present application, the first sending module includes:
a sorting unit, configured to sort the plurality of batches of image feature values according to their storage order to obtain a sorting result;
a first determining unit, configured to determine, for the i-th clock cycle of the plurality of consecutive clock cycles, a target start address in the data buffer of the target batch of image feature values at the i-th position in the sorting result, where i is a positive integer and is smaller than L, and L represents the total number of the plurality of consecutive clock cycles;
a reading unit, configured to read M consecutive image feature values from the data buffer starting from the target start address;
a second determining unit, configured to take the M image feature values read as the i-th batch of image feature values;
a sending unit, configured to send the i-th batch of image feature values to the MAC array.
In one embodiment of the present application, the first determining unit is specifically configured to:
when i is greater than 1, determine an initial start address in the data buffer of the batch of image feature values at the first position in the sorting result;
and determine, according to the initial start address, i, and the data amount M, the target start address in the data buffer of the target batch of image feature values at the i-th position in the sorting result.
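The text states only that the target start address is determined from the initial start address, i, and M. One natural reading, offered here as an assumption rather than as the formula used by the application, is a linear offset of (i − 1)·M entries, with addresses counted in image feature values (one value per buffer entry). The function name `batch_start_address` is hypothetical.

```python
def batch_start_address(initial_start: int, i: int, M: int) -> int:
    """Start address of the i-th batch (1-based), assuming contiguous batches of M values."""
    if i < 1:
        raise ValueError("i is a positive integer")
    # Batches are stored back to back, so batch i begins (i - 1) * M entries
    # after the first batch. For i == 1 this is the initial start address itself.
    return initial_start + (i - 1) * M


first = batch_start_address(0, 1, 6)
third = batch_start_address(0, 3, 6)
offset = batch_start_address(100, 2, 4)
```

Under this reading, each clock cycle needs only one multiply-add (or a running increment by M) to locate its batch.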
In one embodiment of the present application, the depth of the data buffer is changed from a first value to a second value and the bit width of the data buffer is changed from a third value to a fourth value, where the product of the first value and the third value equals the product of the second value and the fourth value, the fourth value is B times the third value, and B is an integer greater than 1. In this case, the reading unit is specifically configured to: read C data from the data buffer starting from the target start address, where C is determined according to M, the third value, and the fourth value; determine M data from the C data, where the bit width of each of the M data is the third value; and perform a B-to-1 selection operation on each of the M data to obtain the image feature value corresponding to each of the M data.
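A rough software model of reading from the widened buffer is sketched below. Several points are assumptions of this sketch and are not stated in the text: C is taken as ceil(M × third value / fourth value); narrow values are packed into each wide word starting from the least significant bits; the start address is counted in narrow entries and is aligned to a wide word; and the B-to-1 selection is modeled as picking one of the B narrow slots of a wide word for each output. The names `pack_wide_words` and `read_batch` are hypothetical.

```python
from typing import List


def pack_wide_words(values: List[int], narrow_bits: int, B: int) -> List[int]:
    """Pack B narrow values into each wide word (lowest slot first); test helper."""
    mask = (1 << narrow_bits) - 1
    words = []
    for start in range(0, len(values), B):
        word = 0
        for slot, value in enumerate(values[start:start + B]):
            word |= (value & mask) << (slot * narrow_bits)
        words.append(word)
    return words


def read_batch(wide_words: List[int], start_address: int, M: int,
               narrow_bits: int, B: int) -> List[int]:
    """Read M narrow values starting at start_address (in narrow entries)."""
    wide_bits = B * narrow_bits            # fourth value = B * third value
    if start_address % B != 0:
        raise ValueError("this sketch assumes a word-aligned start address")
    C = -(-(M * narrow_bits) // wide_bits)  # ceil(M * third / fourth)
    first_word = start_address // B
    words = wide_words[first_word:first_word + C]   # the C data read out
    mask = (1 << narrow_bits) - 1
    batch = []
    for k in range(M):
        word = words[k // B]
        select = k % B                      # B-to-1 selection: which slot to take
        batch.append((word >> (select * narrow_bits)) & mask)
    return batch


packed = pack_wide_words(list(range(1, 13)), narrow_bits=8, B=4)
batch = read_batch(packed, start_address=4, M=6, narrow_bits=8, B=4)
```

In the example, twelve 8-bit values occupy three 32-bit words; reading M = 6 values from entry 4 touches C = 2 wide words and discards the two unused slots of the second word.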
It should be noted that the foregoing explanation of the embodiments of the method for providing data for a MAC array also applies to the apparatus for providing data for a MAC array of this embodiment, which is not specifically limited in this embodiment.
With the apparatus for providing data for a MAC array of this embodiment, the size of the convolution window is acquired when the MAC array is used for a convolution operation; the image feature values to be processed by the convolution window are acquired from the on-chip memory to form an image feature value array; for each column in the image feature value array and for each of the c channels, the image feature values on that channel in each row of the current column are spliced to obtain a first feature value splicing result of the current column on that channel, and the first feature value splicing results of the current column on all channels are spliced to obtain a second feature value splicing result of the current column; and the second feature value splicing results of the columns of the image feature value array are sequentially stored into a data buffer. When the data amount M of image feature values that the MAC array can process in one clock cycle is smaller than the total number K of image feature values in the image feature value array, the image feature values in all the second feature value splicing results in the data buffer are divided into batches according to the data amount M to obtain a plurality of batches of image feature values, and the batches are sent to the MAC array sequentially over a plurality of consecutive clock cycles in their storage order. The required image feature data is thus stored in a single data buffer and supplied to the MAC array from that buffer, which reduces the hardware cost of supplying data to the MAC array and makes supplying the data convenient.
In order to implement the above embodiments, the present application further proposes a chip, which includes the apparatus for providing data for a MAC array.
In order to achieve the above embodiments, the present application further provides an electronic device, where the electronic device includes the chip provided by the embodiments of the present disclosure.
In order to implement the above embodiments, the present application further proposes an electronic device. FIG. 10 is a block diagram of an electronic device for implementing a method for providing data for a MAC array according to an exemplary embodiment. As shown in FIG. 10, the electronic device 1000 may include:
a memory 1010, a processor 1020, and a bus 1030 connecting the different components (including the memory 1010 and the processor 1020), where the memory 1010 stores a computer program that, when executed by the processor 1020, implements the method for providing data for a MAC array of the embodiments of the present application.
Bus 1030 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 1000 typically includes multiple computer-readable media. Such media can be any available media that is accessible by the electronic device 1000 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 1010 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 1040 and/or cache memory 1050. Electronic device 1000 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 1060 may be used to read from or write to a non-removable, non-volatile magnetic media (not shown in FIG. 10, commonly referred to as a "hard disk drive"). Although not shown in fig. 10, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 1030 through one or more data medium interfaces. Memory 1010 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present application.
A program/utility 1080 having a set (at least one) of program modules 1070 may be stored, for example, in memory 1010, such program modules 1070 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 1070 generally perform the functions and/or methods in the embodiments described herein.
The electronic device 1000 can also communicate with one or more external devices 1090 (e.g., keyboard, pointing device, display 1091, etc.), with one or more devices that enable a user to interact with the electronic device 1000, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device 1000 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1092. Also, the electronic device 1000 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 1093. As shown in fig. 10, the network adapter 1093 communicates with other modules of the electronic device 1000 via the bus 1030. It should be appreciated that although not shown in fig. 10, other hardware and/or software modules may be used in connection with electronic device 1000, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 1020 executes various functional applications and data processing by running programs stored in the memory 1010.
It should be noted that, the implementation process and the technical principle of the electronic device in this embodiment refer to the foregoing explanation of the method for providing data for the MAC array in this embodiment, and are not repeated herein.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A method of providing data for a MAC array, the method comprising:
Obtaining the size of a convolution window, wherein the size of the convolution window is n multiplied by m multiplied by c, n represents the number of rows of the convolution window, m represents the number of columns of the convolution window, c represents the number of channels of the convolution window, wherein n and m are integers greater than 1, and c is an integer greater than or equal to 1;
acquiring n rows of image characteristic values to be processed of the convolution window from an on-chip memory to form an image characteristic value array, wherein the n rows of image characteristic values comprise m multiplied by c image characteristic values, the number of rows of the image characteristic value array is n, the number of columns is m and the number of channels is c;
for each column in the image characteristic value array and for each channel of the c channels, splicing the image characteristic values on that channel in each row of the current column to obtain a first characteristic value splicing result of the current column on that channel, and splicing the first characteristic value splicing results of the current column on all channels to obtain a second characteristic value splicing result of the current column;
sequentially storing second characteristic value splicing results corresponding to each column in the image characteristic value array into a data buffer;
when the data quantity M of the image characteristic values which can be processed by the MAC array in one clock period is smaller than K, dividing the image characteristic values in all second characteristic value splicing results in the data buffer into batches according to the data quantity M to obtain a plurality of batches of image characteristic values, wherein the number of the image characteristic values in each batch is equal to M, and K represents the total number of the image characteristic values in the image characteristic value array;
and sequentially transmitting the image characteristic values of the plurality of batches to the MAC array on a plurality of continuous clock cycles according to the storage sequence of the image characteristic values of the plurality of batches.
2. The method of claim 1, wherein the method further comprises:
and under the condition that the data quantity M of the image characteristic values which can be processed by the MAC array in one clock period is larger than or equal to K, sequentially reading the image characteristic values from the second characteristic value splicing results of the data buffer to the MAC array until the number of the read image characteristic values is K.
3. The method of claim 1, wherein the acquiring n rows of image characteristic values to be processed of the convolution window from an on-chip memory to form an image characteristic value array comprises:
simultaneously reading, from the on-chip memory through n read data memory channels (RDMA), the n rows of image characteristic values to be processed by the convolution window, wherein each RDMA channel reads one row of image characteristic values;
and forming the image characteristic value array according to the n rows of image characteristic values.
4. The method of claim 1, wherein the sequentially transmitting the image characteristic values of the plurality of batches to the MAC array on a plurality of continuous clock cycles according to the storage sequence of the image characteristic values of the plurality of batches comprises:
sorting the image characteristic values of the plurality of batches according to the storage sequence of the image characteristic values of the plurality of batches to obtain a sorting result;
determining, for an i-th clock cycle of the plurality of continuous clock cycles, a target start address in the data buffer of the target batch of image characteristic values at the i-th position in the sorting result, wherein i is a positive integer and is less than L, and L represents the total number of the plurality of continuous clock cycles;
reading out M continuous image characteristic values from the data buffer starting from the target start address;
taking the M image characteristic values read out as an i-th batch of image characteristic values;
and sending the i-th batch of image characteristic values to the MAC array.
5. The method of claim 4, wherein the determining a target start address in the data buffer of the target batch of image characteristic values at the i-th position in the sorting result comprises:
when i is greater than 1, determining an initial start address in the data buffer of the batch of image characteristic values at the first position in the sorting result;
and determining, according to the initial start address, i and the data quantity M, the target start address in the data buffer of the target batch of image characteristic values at the i-th position in the sorting result.
6. The method of claim 4, wherein, in a case where a depth of the data buffer is changed from a first value to a second value and a bit width of the data buffer is changed from a third value to a fourth value, a product of the first value and the third value being equal to a product of the second value and the fourth value, the fourth value being B times the third value, and B being an integer greater than 1, the reading out M continuous image characteristic values from the data buffer starting from the target start address comprises:
reading C data from the data buffer starting from the target start address, wherein C is determined according to the data quantity M, the third value and the fourth value;
determining M data from the C data, wherein the bit width of each of the M data is the third value;
and performing a B-to-1 selection operation on each of the M data to obtain the image characteristic value corresponding to each of the M data.
7. An apparatus for providing data to a MAC array, the apparatus comprising:
a first obtaining module, configured to obtain a size of a convolution window, where the size of the convolution window is n×m×c, n represents a number of rows of the convolution window, m represents a number of columns of the convolution window, c represents a number of channels of the convolution window, n and m are integers greater than 1, and c is an integer greater than or equal to 1;
the second acquisition module is used for acquiring n rows of image characteristic values to be processed of the convolution window from an on-chip memory to form an image characteristic value array, wherein the n rows of image characteristic values comprise m multiplied by c image characteristic values, the number of rows of the image characteristic value array is n, the number of columns is m and the number of channels is c;
the splicing module is used for, for each column in the image characteristic value array and for each channel of the c channels, splicing the image characteristic values on that channel in each row of the current column to obtain a first characteristic value splicing result of the current column on that channel, and splicing the first characteristic value splicing results of the current column on all channels to obtain a second characteristic value splicing result of the current column;
the storage module is used for sequentially storing second characteristic value splicing results corresponding to each column in the image characteristic value array into a data buffer;
the batching module is used for, when the data quantity M of the image characteristic values which can be processed by the MAC array in one clock period is smaller than K, dividing the image characteristic values in all the second characteristic value splicing results in the data buffer into batches according to the data quantity M to obtain a plurality of batches of image characteristic values, wherein the number of the image characteristic values in each batch is equal to M, and K represents the total number of the image characteristic values in the image characteristic value array;
and the first sending module is used for sequentially sending the image characteristic values of the plurality of batches to the MAC array on a plurality of continuous clock cycles according to the storage sequence of the image characteristic values of the plurality of batches.
8. The apparatus of claim 7, wherein the apparatus further comprises:
a second sending module, used for, when the data quantity M of the image characteristic values which can be processed by the MAC array in one clock period is greater than or equal to K, sequentially reading image characteristic values from the second characteristic value splicing results in the data buffer out to the MAC array until the number of image characteristic values read is K.
9. The apparatus of claim 7, wherein the second acquisition module is specifically configured to:
simultaneously reading, from the on-chip memory through n read data memory channels (RDMA), the n rows of image characteristic values to be processed by the convolution window, wherein each RDMA channel reads one row of image characteristic values;
and forming the image characteristic value array according to the n rows of image characteristic values.
10. The apparatus of claim 7, wherein the first transmission module comprises:
the sorting unit is used for sorting the image characteristic values of the plurality of batches according to the storage sequence of the image characteristic values of the plurality of batches so as to obtain a sorting result;
a first determining unit, configured to determine, for an i-th clock cycle of the plurality of consecutive clock cycles, a target start address in the data buffer of the target batch of image characteristic values at the i-th position in the sorting result, where i is a positive integer and is smaller than L, and L represents a total number of the plurality of consecutive clock cycles;
the reading unit is used for reading out M continuous image characteristic values from the data buffer starting from the target start address;
a second determining unit, configured to take the read M image feature values as an ith batch of image feature values;
and the sending unit is used for sending the ith batch of image characteristic values to the MAC array.
11. The apparatus according to claim 10, wherein the first determining unit is specifically configured to:
when i is greater than 1, determining an initial start address in the data buffer of the batch of image characteristic values at the first position in the sorting result;
and determining, according to the initial start address, i and the data quantity M, the target start address in the data buffer of the target batch of image characteristic values at the i-th position in the sorting result.
12. The apparatus of claim 10, wherein, in a case where a depth of the data buffer is changed from a first value to a second value and a bit width of the data buffer is changed from a third value to a fourth value, a product of the first value and the third value being equal to a product of the second value and the fourth value, the fourth value being B times the third value, and B being an integer greater than 1, the reading unit is specifically configured to:
read C data from the data buffer starting from the target start address, wherein C is determined according to the data quantity M, the third value and the fourth value;
determine M data from the C data, wherein the bit width of each of the M data is the third value;
and perform a B-to-1 selection operation on each of the M data to obtain the image characteristic value corresponding to each of the M data.
13. A chip comprising the apparatus for providing data to a MAC array as claimed in any one of claims 7 to 12.
14. An electronic device comprising the chip of claim 13.
CN202310466151.5A 2023-04-27 2023-04-27 Method, device and chip for providing data for MAC array Active CN116306823B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310466151.5A CN116306823B (en) 2023-04-27 2023-04-27 Method, device and chip for providing data for MAC array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310466151.5A CN116306823B (en) 2023-04-27 2023-04-27 Method, device and chip for providing data for MAC array

Publications (2)

Publication Number Publication Date
CN116306823A CN116306823A (en) 2023-06-23
CN116306823B true CN116306823B (en) 2023-08-04

Family

ID=86790806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310466151.5A Active CN116306823B (en) 2023-04-27 2023-04-27 Method, device and chip for providing data for MAC array

Country Status (1)

Country Link
CN (1) CN116306823B (en)

Also Published As

Publication number Publication date
CN116306823A (en) 2023-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant