CN114648444A - Vector up-sampling calculation method and device applied to neural network data processing - Google Patents


Publication number
CN114648444A
CN114648444A (application CN202210180461.6A)
Authority
CN
China
Prior art keywords
data
DDR
state
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210180461.6A
Other languages
Chinese (zh)
Inventor
王峥
肖玺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210180461.6A priority Critical patent/CN114648444A/en
Publication of CN114648444A publication Critical patent/CN114648444A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformations in the plane of the image
    • G06T3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046: Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/60: Memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the field of neural networks, and in particular to a vector up-sampling calculation method and device applied to neural network data processing. The method and device can efficiently process data from multiple pictures, address quickly within intermediate network layers, and read and write the DDR as continuously as possible, effectively reducing the number of interactions between the operator and the DDR, greatly improving operator performance, and reducing system latency and power consumption.

Description

Vector up-sampling calculation method and device applied to neural network data processing
Technical Field
The invention relates to the field of neural networks, in particular to a vector up-sampling calculation method and device applied to neural network data processing.
Background
Sampling of neural network pictures falls into two categories: downsampling and upsampling. Downsampling (subsampling) is in effect the pooling operation in a convolutional neural network, and serves two main purposes: 1. making the image fit the size of the display area; 2. generating a thumbnail of the corresponding image. The main purpose of upsampling, by contrast, is to enlarge the original image so that it can be displayed on a higher-resolution display device. It is worth noting that scaling an image does not add information to it, so the quality of the image will inevitably be affected. However, some scaling methods can add information to the image, so that the quality of the scaled image exceeds that of the original.
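The patent does not name an interpolation scheme here, but the pixel-replication behavior described later in the detailed description corresponds to nearest-neighbor upsampling, which can be sketched in plain Python (the function name and nested-list picture representation are illustrative only):

```python
def upsample_nearest(img, s):
    """Enlarge a 2-D picture s times: each pixel becomes an s x s block."""
    out = []
    for row in img:
        widened = [v for v in row for _ in range(s)]   # repeat each pixel s times
        out.extend([list(widened) for _ in range(s)])  # repeat each row s times
    return out

src = [[1, 2],
       [3, 4]]
dst = upsample_nearest(src, 2)
# dst == [[1, 1, 2, 2],
#         [1, 1, 2, 2],
#         [3, 3, 4, 4],
#         [3, 3, 4, 4]]
```

For magnification S, the output area is S² times the input area while every value is a copy of an input pixel, matching the copy counts discussed below.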
Meanwhile, in a convolutional neural network, as the layers grow deeper, the initial input feature map shrinks after passing through early functional layers such as convolution and pooling. Feature information may therefore be lost later on, and some intermediate layers need their pictures enlarged before the data vectors are processed further.
The prior art offers some approaches to these problems, but the following drawbacks remain:
1. Pictures are processed page by page on custom chips, so processing speed and efficiency are low by design.
2. Moving picture data from the off-chip DDR into a specific operator is slow, and the many interactions between SRAM and DDR cause large system latency.
3. Some operators are designed as standalone modules and integrate poorly with the other modules of the system, complicating the system's operation, cluttering the code design, and wasting resources.
Disclosure of Invention
The embodiments of the invention provide a vector up-sampling calculation method and device applied to neural network data processing, solving at least the technical problem of low efficiency in processing an input feature map in the prior art.
According to an embodiment of the present invention, there is provided a vector upsampling calculation method applied in data processing of a neural network, including the following steps:
directly calculating, from a pixel position in the original input feature map, the starting position of the corresponding pixel in the final enlarged output feature map;
and transmitting, by the control module, the read-write address to the DMA controller to control the read-write address of the off-chip DDR.
Further, the derivation used by the method is as follows:
S·(i − W·k) + S·W·S·k
= S·i − S·W·k + S²·W·k
= S·i + k·[S·W·(S − 1)]
where W is the width of the input feature map, S is its magnification factor, i is the position of each pixel in the original input feature map, and k is the row index in the original input feature map, an integer starting from 0;
in the formula, (i − W·k) yields the column address of a given pixel position in the original input feature map; multiplying by S gives the column position in the enlarged output feature map; and adding S·W·S·k gives the starting position in the final output feature map of the pixel corresponding to the input feature map.
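The starting-position formula can be checked numerically with a short sketch (function and variable names are illustrative; single-channel pixel indexing as in the detailed description is assumed):

```python
def output_start(i, k, w, s):
    # Column of pixel i within input row k, scaled to the output map,
    # plus k full output-row blocks of height s in the s*w-wide output map.
    return s * (i - w * k) + s * w * s * k

def output_start_closed(i, k, w, s):
    # Closed form from the derivation: S*i + k*[S*W*(S - 1)].
    return s * i + k * (s * w * (s - 1))

# Both forms agree; e.g. for a 4-wide map enlarged 3x,
# pixel i = 5 on row k = 1 starts at output position 39.
```

The closed form needs only a multiply-accumulate per pixel, which is what makes it convenient for the hardware address generator described later.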
Further, the calculation process of the method is implemented with an upsample algorithm; the upsample operator is realized under state machine control, the state machine drives the storage and calculation modules in an instruction-set-based mode, and concatenated bits in the instruction set represent specific feature vectors.
Furthermore, in the method, the system first resets (RST) and idles (IDLE); it then starts work and writes an instruction set carrying the information of the working network layer into the CFG-RAM, and once all instructions have been received they are decoded;
first, the type of each network layer is determined from the specific binary code corresponding to that layer type in the instruction set, and when the control condition is met the state machine jumps to that layer's working state for subsequent operation;
the subsequent instruction-set information then also decodes the size, the channels, and the starting position in the DDR of the input feature map to be upsampled; the layer's network data in the DDR is written into the memory Tensor_RAM, and the state machine immediately jumps to the STATE_UPSAMPLE_LOAD_DDR state; on each write from the DDR to the Tensor_RAM, only the data of all channels corresponding to a single pixel position of the layer's network picture is written, and the data is finally written back to the DDR as the destination requires.
Further, when data is written back from the RAM to the DDR, the state machine jumps to the STATE_UPSAMPLE_COMMIT state, then performs fast reads and writes according to the upsample algorithm: the DDR is written S times, and each time the data in the RAM is written S times consecutively;
after all channel data of one pixel in the layer's network have been processed, all data corresponding to the next pixel position are processed; upon receiving the indication signal the state machine jumps to the STATE_UPDATE_INDEX state, the RAM caches the new pixel's data, the above operations are repeated until all pixels have been scanned, and the state machine then jumps to the STATE_UPDATE_END_CONDITION state;
if the pixels of the network layer have not all been processed, the state machine jumps to the STATE_UPDATE_LOAD_DDR_TRIGGER state and continues processing the remaining picture data; if all pixels of the network layer have been processed, then once all data corresponding to the last pixel position are finished the state machine jumps to STATE_UPDATE_INST and STATE_LOAD_INST to operate on the next network layer.
Further, in the method, the input feature map is preprocessed before entering the accelerator so that the input picture is square.
Furthermore, in the method, the pixels of the input and output feature maps are stored in the DDR by depth; only one channel layer is needed to obtain each pixel's jump starting position according to the upsample algorithm, and the addresses of the subsequent channels are then read and written consecutively.
According to another embodiment of the present invention, there is provided a vector upsampling calculation apparatus applied in data processing of a neural network, including:
a pixel starting-position acquisition unit, configured to directly calculate, from a pixel position in the original input feature map, the starting position of the corresponding pixel in the final enlarged output feature map;
and a read-write address control unit, configured to control the read-write address of the off-chip DDR by having the control module transmit the read-write address to the DMA controller.
A storage medium storing a program file capable of implementing any one of the above vector up-sampling computation methods applied in neural network data processing.
A processor, configured to run a program, wherein the program executes the vector up-sampling computation method applied in data processing of a neural network.
The vector up-sampling calculation method and device applied to neural network data processing in the embodiments of the invention directly calculate, from a pixel position in the original input feature map, the starting position of the corresponding pixel in the final enlarged output feature map, and the control module transmits it to the DMA controller to control the read-write address of the off-chip DDR. Multiple pictures' data can thus be processed efficiently, addressing within intermediate network layers is fast, and the DDR is read and written as continuously as possible, effectively reducing the number of interactions between the operator and the DDR, greatly improving operator performance, and reducing system latency and power consumption.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a diagram illustrating the operation of an upsample operator in a vector upsampling calculation method applied in data processing of a neural network according to the present invention;
FIG. 2 is a diagram illustrating a working process of an example of taking a single channel by an upsample operator in a vector upsampling calculation method applied in data processing of a neural network according to the present invention;
FIG. 3 is a diagram of the working process of the state machine in a vector up-sampling calculation method applied in data processing of a neural network according to the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
According to an embodiment of the present invention, there is provided a vector upsampling calculation method applied in data processing of a neural network, including the following steps:
directly calculating, from a pixel position in the original input feature map, the starting position of the corresponding pixel in the final enlarged output feature map;
and transmitting, by the control module, the read-write address to the DMA controller to control the read-write address of the off-chip DDR.
The vector up-sampling calculation method applied to neural network data processing in this embodiment directly calculates, from a pixel position in the original input feature map, the starting position of the corresponding pixel in the final enlarged output feature map, and the control module transmits it to the DMA controller to control the read-write address of the off-chip DDR. Multiple pictures' data can thus be processed efficiently, addressing within intermediate network layers is fast, and the DDR is read and written as continuously as possible, effectively reducing the number of interactions between the operator and the DDR, greatly improving operator performance, and reducing system latency and power consumption.
The calculation process of the method is as follows:
S·(i − W·k) + S·W·S·k
= S·i − S·W·k + S²·W·k
= S·i + k·[S·W·(S − 1)]
where W is the width of the input feature map, S is its magnification factor, i is the position of each pixel in the original input feature map, and k is the row index in the original input feature map, an integer starting from 0;
in the formula, (i − W·k) yields the column address of a given pixel position in the original input feature map; multiplying by S gives the column position in the enlarged output feature map; and adding S·W·S·k gives the starting position in the final output feature map of the pixel corresponding to the input feature map.
The calculation process of the method is implemented with an upsample algorithm; the upsample operator is realized under state machine control, the state machine drives the storage and calculation modules in an instruction-set-based mode, and concatenated bits in the instruction set represent specific feature vectors.
In the method, the system first resets (RST) and idles (IDLE); it then starts work and writes an instruction set carrying the information of the working network layer into the CFG-RAM, and once all instructions have been received they are decoded;
first, the type of each network layer is determined from the specific binary code corresponding to that layer type in the instruction set, and when the control condition is met the state machine jumps to that layer's working state for subsequent operation;
the subsequent instruction-set information then also decodes the size, the channels, and the starting position in the DDR of the input feature map to be upsampled; the layer's network data in the DDR is written into the memory Tensor_RAM, and the state machine immediately jumps to the STATE_UPSAMPLE_LOAD_DDR state; on each write from the DDR to the Tensor_RAM, only the data of all channels corresponding to a single pixel position of the layer's network picture is written, and the data is finally written back to the DDR as the destination requires.
When data is written back from the RAM to the DDR, the state machine jumps to the STATE_UPSAMPLE_COMMIT state, then performs fast reads and writes according to the upsample algorithm: the DDR is written S times, and each time the data in the RAM is written S times consecutively;
after all channel data of one pixel in the layer's network have been processed, all data corresponding to the next pixel position are processed; upon receiving the indication signal the state machine jumps to the STATE_UPDATE_INDEX state, the RAM caches the new pixel's data, the above operations are repeated until all pixels have been scanned, and the state machine then jumps to the STATE_UPDATE_END_CONDITION state;
if the pixels of the network layer have not all been processed, the state machine jumps to the STATE_UPDATE_LOAD_DDR_TRIGGER state and continues processing the remaining picture data; if all pixels of the network layer have been processed, then once all data corresponding to the last pixel position are finished the state machine jumps to STATE_UPDATE_INST and STATE_LOAD_INST to operate on the next network layer.
In the method, the input feature map is preprocessed before entering the accelerator so that the input picture is square.
In the method, the pixels of the input and output feature maps are stored in the DDR by depth; only one channel layer is needed to obtain each pixel's jump starting position according to the upsample algorithm, and the addresses of the subsequent channels are then read and written consecutively.
The following describes in detail a vector upsampling calculation method applied in neural network data processing according to a specific embodiment of the present invention:
In a convolutional neural network there may be a large amount of data computation and storage, the most common operations being convolution, pooling, full connection (fully-connected), upsampling (upsample), stitching (route), and residual (shortcut). As neural network structures grow more complex and layers deeper, the network must also apply some necessary processing to certain intermediate quantities to guarantee the accuracy and precision of the network data. For example, after the input features pass through convolution or pooling operations, the feature map shrinks in both length and width. The layer's pictures then need to be enlarged, which improves picture precision and sharpness and allows display on higher-resolution screens. The invention therefore designs an upsample operator that operates on the picture data vectors of the required intermediate network layer in the convolutional neural network so as to enlarge the picture.
1. Principle of operator design
FIG. 1 shows the working process of the upsample operator. On the left is the input feature map; on the right is the output feature map obtained after processing by the upsample operator. In short, a picture is enlarged: the input feature map on the left has size 4x4x6, and after 3x enlargement the 12x12x6 output feature map is obtained. Because both the length and the width of the picture are enlarged 3 times, the area of the enlarged picture is 9 times the original, while the depth stays the same, 6 in the example of the figure.
Because the operation used most in neural networks is convolution, multithreaded cooperative processing and weight sharing are used to raise the operation rate, and the input feature map has many channels, the picture data in the off-chip DDR is stored by depth rather than page by page. For the input feature map in FIG. 1, for example, 0 to 15 denote the pixel positions of the input feature map, not picture data, and the DDR stores the data corresponding to each identical picture position. Specifically, the memory first stores all data of position 0 by depth, i.e., all data of region 0 in the figure, and notifies the control module once region 0 is fully stored; it then stores the data of the next position 1, i.e., all picture data of region 1, and the above operation loops to continue with the next pixel position. When the row's pixel data are finished, processing jumps to the next row and stores the data at position 4, i.e., all data in region 4, and so on until all data of the last pixel position 15, region 15 in the figure, have been stored and the information of the input feature map is completely stored in the DDR. The output feature map on the right is stored on the same principle, but since it is an enlarged version of the input feature map, the storage addresses of the data processed by the operator differ and occupy more memory: the amount of data in the region corresponding to 0 in the output feature map, for example, is 9 times that of the original input feature map, and the data of the other pixels is enlarged in the same way.
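The depth-major layout described above reduces to a simple linear-offset rule (a sketch under the assumption of one storage word per channel value; the function name is illustrative):

```python
def depth_major_offset(pixel, channel, num_channels):
    # All channels of pixel 0 are stored first, then all channels of pixel 1, ...
    return pixel * num_channels + channel

# For the 4x4x6 input feature map of FIG. 1:
# region 0 occupies offsets 0..5, region 1 occupies offsets 6..11, and so on.
```

This is why one address computation per pixel suffices: the channels that follow are contiguous and can be streamed without further addressing.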
2. Algorithm implementation
From the introduction of the upsample operator's principle above, it is clear that the operator reads the input feature map's data from the DDR, copies it according to a specific target, and writes it back to the DDR as required. The most important thing in this process is therefore that every read and write address is correct: under the direction of the control module, the DDR and SRAM must read and write data correctly at the right time and at the right address. To guarantee the correctness of the data storage addresses, the invention designs a specific algorithm for this problem.
Since the sizes of the input and output feature maps are unknown for an intermediate layer of the neural network, a parameterized approach is adopted. In the invention, pictures are preprocessed before being fed to the accelerator of this design so that the input picture is square, i.e., its length equals its width. Because the pixels of the input and output feature maps are stored in the DDR by depth, only one channel layer is needed to obtain each pixel's jump starting position according to the specific algorithm, and the addresses of the subsequent channels can then be read and written consecutively. Assume an upsample layer whose input feature map has width W (both length and width), with magnification denoted S (scale), e.g. S = 3 in FIG. 1; each pixel position of the original input feature map is denoted i (index), i.e., 0 to 15 in the figure; and the row number in the original input feature map is denoted k. For the convenience of the subsequent hardware circuit implementation, k follows the storage rule of the DDR and SRAM memories and also starts from 0: in FIG. 1, input pixels 0, 1, 2, 3 belong to row 0 (k = 0); 4, 5, 6, 7 to row 1 (k = 1); 8, 9, 10, 11 to row 2 (k = 2); and 12, 13, 14, 15 to row 3 (k = 3); in general, row k is the (k + 1)-th row of the input feature map. To illustrate the problem, the invention takes a single-channel feature map as an example. As shown in FIG. 2, the left side is an input feature map of size 4 × 4, the upper right is the output feature map enlarged threefold (S = 3), and the lower right is the output feature map enlarged twofold (S = 2). As is clear from the previous analysis, the purpose of this algorithm is to guarantee the accuracy of the data transfer addresses.
Specifically, when S = 3, since both the length and the width of the picture are enlarged 3 times, its area is enlarged 9 times; that is, each pixel of each channel of the input feature map is copied 9 times. As shown in FIG. 2, the data at position 0 (i = 0) of the original is copied into 9 copies stored at the 9 positions 0, 1, 2, 12, 13, 14, 24, 25, and 26 of the output feature map. Picture data are stored by depth, so for the input feature map all data of address 0 are processed first, which requires handling the 9 corresponding addresses of the output feature map; then all channel data of input address 1 are processed, together with all data of all channels at its 9 output addresses; this cycles in order until all data corresponding to the last address 15 are processed and the upsample operation of the whole input feature map is complete. Similarly, for twofold enlargement (S = 2), the area of the output feature map is enlarged 4 times, the data corresponding to input address 0 is copied to the four output regions 0, 1, 8, and 9, the other positions follow likewise, and the procedure is the same as for S = 3 except that each pixel is copied 4 times instead of 9. Therefore, the invention can directly calculate, from a pixel position in the original input feature map, the starting position of the corresponding pixel in the final enlarged output feature map, and the control module transmits it to the DMA controller to control the read-write address of the off-chip DDR. The derivation is as follows:
S·(i − W·k) + S·W·S·k
= S·i − S·W·k + S²·W·k
= S·i + k·[S·W·(S − 1)]
where (i − W·k) yields the column address of the given pixel position in the original input feature map; multiplying by S gives the column position in the enlarged output feature map; and adding S·W·S·k gives the starting position in the final output feature map of the pixel corresponding to the input feature map.
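Applying the formula, the S² output positions covered by one input pixel can be enumerated; this reproduces the FIG. 2 examples (a sketch with illustrative names):

```python
def covered_positions(i, k, w, s):
    # Starting position from the derivation, then s consecutive pixels
    # on each of s consecutive rows of the s*w-wide output map.
    start = s * i + k * (s * w * (s - 1))
    out_w = s * w
    return [start + r * out_w + c for r in range(s) for c in range(s)]

# 4x4 map, S = 3, pixel 0 -> [0, 1, 2, 12, 13, 14, 24, 25, 26]
# 4x4 map, S = 2, pixel 0 -> [0, 1, 8, 9]
```

Only `start` must be computed per pixel; the remaining S² − 1 positions follow by fixed strides, which is what enables the continuous DDR accesses discussed next.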
3. State machine and code implementation
The designed upsample operator is realized under state machine control; the state machine drives the storage and calculation modules in an instruction-set-based mode. To allow more operator functions to be added to the chip later, certain concatenated bits in the instruction set carry specific meanings, the most common being the neural network layer number, the operator's operation type, the size and channels of the input feature map, the size and channels of the output feature map, and the start and end of work. The specific state machine is shown in FIG. 3.
The left half of FIG. 3 mainly shows the initial system configuration and control part; the right half holds the operating states associated with the UPSAMPLE operator. First come the system's RST and IDLE. When the chip starts work, the instruction set carrying the information of the working network layer is written into the CFG-RAM, the corresponding state being STATE_CFG_RAM; once all instructions are received they are decoded, the corresponding state being STATE_DECODE_INST. At this point the type of the current network layer is judged: each layer type has a specific binary code in the instruction set, and when the control condition is met the state machine jumps to that layer's working state for subsequent operation. For the UPSAMPLE layer, for example, bits [2:0] of the instruction-set parameter var_layer_type are binary 101, so the state machine jumps to the STATE_UPSAMPLE state, and subsequent instruction-set information also decodes the size, channels, and starting position in the DDR of the input feature map to be upsampled. The invention then provides a memory Tensor_RAM; the layer's network data in the DDR must be written into Tensor_RAM, and the state machine immediately jumps to the STATE_UPSAMPLE_LOAD_DDR state. Because the network pictures' data in the DDR are stored by pixel depth, and UPSAMPLE can enlarge several pictures at once, the upsample operation scans and copies the corresponding pixels one by one; hence each write from the DDR to Tensor_RAM covers only the data of all channels at one pixel position of the layer's network picture, and the data is finally written back to the DDR as intended.
When the data needs to be written back from the RAM into the DDR, the state machine jumps to the STATE_UPSAMPLE_COMMIT state, and here the invention performs fast reads and writes according to the algorithm above. Take three-fold enlargement (S = 3) as an example: the area of the enlarged image is 9 times the original, i.e. each pixel of the original input feature map is copied 9 times. Because the memory stores data by pixel depth, the first 3 of the 9 copies are contiguous in DDR memory, and the starting positions of the remaining groups of 3 can be obtained directly from the algorithm. The invention therefore performs only 3 writes to the DDR, each writing 3 copies of the RAM data contiguously; in this way 9 copies of the original data are written to the DDR while the DDR address jumps only 3 times, saving 6 accesses per pixel, and over all the pixels of one network layer the number of DDR accesses is greatly reduced. This saves a large amount of delay, because various measurements show that roughly seventy percent of the system delay in a neural network chip comes from accessing the large volume of feature map data, while the delay of the actual data processing stage is very small.
After this process finishes, all channel data of one pixel of the layer have been processed, and all data at the next pixel position must be processed next. After receiving the indication signal, the state machine jumps to the STATE_UPSAMPLE_INDEX state, the RAM re-caches the data of the new pixel, and the above operations are repeated to scan all pixels; the state machine then jumps to the STATE_UPSAMPLE_END_CONDITION state. If the pixels of the layer have not all been processed, the state machine jumps to the STATE_UPSAMPLE_LOAD_DDR_TRIGGER state and continues to process the remaining image data. If all the pixels of the network layer have been processed, then after the data of the last pixel position are finished the state machine jumps to STATE_UPDATE_INST and STATE_LOAD_INST, indicating that this layer's upsample operation is complete and the next network layer can be run; the instruction set of the next layer is then updated and decoded, and the next layer is executed.
By means of batch processing, the upsample operator can efficiently process the data of several pictures at the same time. An index algorithm for the storage addresses of the corresponding target network layer is designed, so that addressing can be performed quickly in an intermediate network layer and the SRAM reads and writes the DDR as contiguously as possible, which effectively reduces the number of interactions between the operator and the DDR, greatly improves operator performance, and reduces system delay and power consumption. The front-end functional simulation of the method and the device has passed, and the back-end FPGA verification result is correct.
Example 2
According to another embodiment of the present invention, there is provided a vector upsampling calculation apparatus applied in data processing of a neural network, including:
the pixel starting position acquisition unit, used for directly calculating the pixel starting position of the final enlarged output feature map from the position of the original input feature map; and
the read-write address control unit, used for controlling the read-write address of the off-chip DDR, the pixel starting position being transmitted to the DMA controller by the control module.
The vector up-sampling computing device applied to neural network data processing in this embodiment of the invention calculates the pixel starting position of the final enlarged output feature map directly from the position in the original input feature map, and the control module transmits it to the DMA controller to control the read-write address of the off-chip DDR. In this way the data of several pictures can be processed efficiently, addressing can be performed quickly in an intermediate network layer, the DDR can be read and written as contiguously as possible, the number of interactions between the operator and the DDR is effectively reduced, operator performance is greatly improved, and system delay and power consumption are reduced.
The following describes in detail a vector up-sampling computation apparatus applied in neural network data processing according to an embodiment of the present invention:
In a convolutional neural network there is a large amount of data computation and storage; the most commonly used operations include convolution, pooling, full connection, upsampling (upsample), concatenation (route), and residual (shortcut). As the structure of neural networks becomes more complex and the number of network layers grows deeper, the network further needs to apply some necessary processing to certain intermediate quantities to guarantee the accuracy and precision of the network data. For example, after the input features undergo convolution or pooling operations, the feature map size becomes smaller, including its length and width. At this point the layer's pictures need to be enlarged, which improves the precision and definition of the pictures and allows them to be displayed on a higher-resolution display. The invention therefore designs an upsample operator, which performs vector operations on the picture data of the required intermediate network layer in the convolutional neural network so as to enlarge the picture.
1. Principle of operator design
FIG. 1 shows the working process of the upsample operator. The left side is the input feature map, and the right side is the output feature map obtained after processing by the upsample operator. In short, a picture is enlarged: the input feature map on the left has size 4x4x6, and after 3-fold enlargement the 12x12x6 output feature map is obtained. Since the length and the width of the picture are each enlarged 3 times, the area of the enlarged picture is 9 times the original, while the depth stays the same, 6 in this example.
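The enlargement in fig. 1 is ordinary nearest-neighbor upsampling. As an illustrative sketch in plain Python (not the hardware implementation, and the function name is hypothetical), each pixel of the input feature map is simply replicated S times along both spatial axes, leaving the depth untouched:

```python
def upsample_nearest(fmap, S):
    """Nearest-neighbor upsample: replicate every pixel S times in both
    spatial dimensions. fmap is a list of rows, each row a list of
    per-pixel values (a value may itself carry all channel data)."""
    out = []
    for row in fmap:
        wide_row = []
        for px in row:
            wide_row.extend([px] * S)                    # enlarge width
        out.extend([list(wide_row) for _ in range(S)])   # enlarge height
    return out

# the 4x4 pixel-position grid of fig. 1 (values stand in for positions)
fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
big = upsample_nearest(fmap, 3)   # 4x4 input -> 12x12 output
```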
Because the operation most used in a neural network is convolution, multi-thread cooperative processing and weight sharing are used to increase the operation rate, and the input feature map has many channels; as a result, the picture data in the off-chip DDR is stored by depth rather than page by page. For the input feature map in fig. 1, for example, 0 to 15 in the figure denote pixel positions of the input feature map, not picture data, and the DDR stores the data corresponding to each identical position of the picture. Specifically, the memory first stores all the data of position 0 by depth, i.e. all the data of region 0 in the figure, then notifies the control module once all the data of region 0 are stored, and then stores the data of the next position 1, i.e. all the picture data of region 1; the above operation repeats for each following pixel position. When the pixel data of one row are finished, it jumps to the next row, storing the data at position 4 (region 4 in the figure), and this continues until all the data of the last pixel position 15 (region 15) are stored, at which point the information of the input feature map is fully stored in the DDR. The storage principle of the output feature map on the right is the same, but it is an enlarged version of the input feature map, so the storage addresses of the data processed by the operator differ and occupy more memory: for example, the data volume of the region corresponding to position 0 in the output feature map is 9 times that of the original input feature map, and the data of the other pixels are likewise enlarged.
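The depth-major layout described above can be sketched with a hypothetical flat address model (one word per value, function name illustrative): all C channel values of pixel position 0 are contiguous, then all C values of position 1, and so on.

```python
def ddr_offset(i, c, C):
    """Word offset of channel c at pixel position i in a depth-major
    layout (pixel positions in order, channels contiguous per position)
    with C channels."""
    return i * C + c

C = 6  # channel count of the 4x4x6 input feature map in fig. 1
# all 6 channel values of position 0 come first ...
assert [ddr_offset(0, c, C) for c in range(C)] == [0, 1, 2, 3, 4, 5]
# ... immediately followed by the 6 values of position 1
assert ddr_offset(1, 0, C) == 6
```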
2. Algorithm implementation
From the introduction to the principle of the upsample operator, it is clear that the operator reads the data of the input feature map from the DDR, copies it according to a specific target, and then writes it back to the DDR as required. The most important thing in this process is therefore to guarantee that the addresses for reading and writing the data are all correct and error-free, i.e. to tell the DDR and SRAM, under the direction of the control module, to read and write the data correctly at the correct time and at the correct address. To guarantee the correctness of the data storage addresses, the invention designs a specific algorithm for this problem.
Since the sizes of the input and output feature maps of an intermediate layer of the neural network are not known in advance, a parameterized approach is used. In the invention, the pictures are pre-processed before being input into the accelerator designed in this project, so that the input pictures are square, i.e. their length equals their width. Since the pixels of the input and output feature maps are stored in the DDR by depth, the invention only needs one channel to obtain the jump starting position of each pixel according to the algorithm; the addresses of the following channels can then be read and written contiguously. Assume an upsample layer whose input feature map has width W (its length equals its width), and denote the magnification by S (scale); for example, S = 3 in fig. 1. Each pixel position of the original input feature map is denoted i (index), i.e. 0 to 15 in the figure, and the row number in the original input feature map is denoted k. For the convenience of the subsequent hardware circuit, k follows the storage rule of the DDR and SRAM memories and also starts from 0: in fig. 1, input feature map positions 0, 1, 2, 3 belong to row 0 (k = 0); 4, 5, 6, 7 to row 1 (k = 1); 8, 9, 10, 11 to row 2 (k = 2); and 12, 13, 14, 15 to row 3 (k = 3). In general, row k is the (k + 1)-th row of the input feature map. To illustrate the problem, the invention takes a single-channel feature map as an example. Fig. 2 shows a 4 x 4 input feature map on the left, the output feature map enlarged three times (S = 3) at the upper right, and the output feature map enlarged two times (S = 2) at the lower right. As the preceding analysis makes clear, the purpose of this algorithm is to guarantee the correctness of the data transfer addresses.
Specifically, when S = 3, since the length and width of the picture are each enlarged 3 times, the area is enlarged 9 times, i.e. each pixel of each channel of the input feature map is copied 9 times. As shown in fig. 2, the data at position 0 (i = 0) of the original is copied into 9 copies and stored at the 9 positions 0, 1, 2, 12, 13, 14, 24, 25, and 26 of the output feature map. The image data are stored by depth, so for the input feature map all data of address 0 are processed first, which requires handling 9 addresses of the output feature map; then all channel data of address 1 of the input feature map are processed, filling all channels of the corresponding 9 output addresses, and this cycle repeats in order until all data corresponding to the last address 15 are processed, at which point the upsample operation of the whole input feature map is finished. Similarly, for two-fold enlargement (S = 2) the area of the output feature map is enlarged 4 times, the data corresponding to input feature map address 0 are copied to the four positions 0, 1, 8, and 9 of the output feature map, and the other positions follow the same pattern; the operation process is the same as for S = 3, except that each pixel is copied 4 times instead of 9. The invention can therefore calculate the pixel starting position of the final enlarged output feature map directly from the position in the original input feature map, and the control module transmits it to the DMA controller to control the read-write address of the off-chip DDR. The calculation process is as follows:
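Under the depth-major layout, the set of single-channel output positions that receive the copies of one input pixel can be enumerated directly. A sketch (function name illustrative; W is the input width):

```python
def copy_targets(i, W, S):
    """Row-major output-feature-map positions (single channel) that
    receive the S*S copies of input pixel position i."""
    k = i // W            # row of pixel i in the input feature map
    col = i - W * k       # column of pixel i
    out_w = S * W         # width of the enlarged output feature map
    return [(S * k + dr) * out_w + S * col + dc
            for dr in range(S) for dc in range(S)]

# position 0 of a 4x4 map: 9 output positions for 3x enlargement,
# matching the list given in the text above
assert copy_targets(0, 4, 3) == [0, 1, 2, 12, 13, 14, 24, 25, 26]
# and 4 output positions for 2x enlargement
assert copy_targets(0, 4, 2) == [0, 1, 8, 9]
```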
S·(i - W·k) + S·W·S·k
= S·i - S·W·k + S^2·W·k
= S·i + k·[S·W·(S - 1)]
where (i - W·k) gives the column address of the pixel position in the original input feature map; multiplying by S gives its column position in the enlarged output feature map; and adding the row offset S·W·S·k gives the starting position in the final output feature map of the pixel corresponding to the input feature map position.
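The derivation can be checked numerically: the closed form S·i + k·[S·W·(S-1)] must equal the step-by-step computation (column address times S, plus the row offset S·W·S·k). A quick sketch with illustrative function names:

```python
def start_pos(i, k, W, S):
    # step by step: column in the input map -> column in the output map,
    # plus the row offset of the enlarged output row
    col = i - W * k                  # column address in the input map
    return S * col + S * W * S * k   # = S*(i - W*k) + S*W*S*k

def start_pos_closed(i, k, W, S):
    # algebraically simplified closed form used for addressing
    return S * i + k * (S * W * (S - 1))

W, S = 4, 3
for i in range(W * W):
    k = i // W
    assert start_pos(i, k, W, S) == start_pos_closed(i, k, W, S)
# e.g. pixel i=5 (row k=1) of the 4x4 map starts at output position 39
```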
3. State machine and code implementation
The designed upsample operator is implemented with state machine control, and the state machine controls the storage and computation modules through an instruction set. To allow more operator functions to be added to the chip later, certain bits in the instruction set, taken together, encode specific meanings; the most common include the layer number of the neural network, the operation type of the operator, the size and channel count of the input feature map, the size and channel count of the output feature map, and the start and end of operation. The specific state machine is shown in figure 3.
The left half of fig. 3 mainly shows the earlier system configuration and control part, and the right half shows the operating states associated with the UPSAMPLE operator. First the system is reset (RST) and enters IDLE. When the chip starts working, the instruction set containing the information of the working network layer is written into the CFG_RAM; the corresponding state is STATE_CFG_RAM. After all instructions have been received they are decoded; the corresponding state is STATE_DECODE_INST. At this point the type of the current network layer is judged: the instruction set carries a specific binary code for each network layer type, and when the control condition is met the state machine jumps to the working state of that network layer for subsequent operation. For example, for the UPSAMPLE layer, bits [2:0] of the instruction-set parameter var_layer_type are binary 101, so the state machine jumps to the STATE_UPSAMPLE state, and the subsequent instruction-set information is decoded to obtain the size, channel count, and starting DDR location of the input feature map to be upsampled. The invention designs a memory Tensor_RAM; the layer's network data in the DDR must be written into Tensor_RAM, and the state machine immediately jumps to the STATE_UPSAMPLE_LOAD_DDR state. Because the network picture data in the DDR is stored by pixel depth, and UPSAMPLE can enlarge several pictures at once, the upsample operation scans and copies the corresponding pixel positions one by one. Each transfer from DDR to Tensor_RAM therefore writes the data of all channels at one pixel position of the layer's feature map, and the result is finally written back to the DDR as required.
When the data needs to be written back from the RAM into the DDR, the state machine jumps to the STATE_UPSAMPLE_COMMIT state, and here the invention performs fast reads and writes according to the algorithm above. Take three-fold enlargement (S = 3) as an example: the area of the enlarged image is 9 times the original, i.e. each pixel of the original input feature map is copied 9 times. Because the memory stores data by pixel depth, the first 3 of the 9 copies are contiguous in DDR memory, and the starting positions of the remaining groups of 3 can be obtained directly from the algorithm. The invention therefore performs only 3 writes to the DDR, each writing 3 copies of the RAM data contiguously; in this way 9 copies of the original data are written to the DDR while the DDR address jumps only 3 times, saving 6 accesses per pixel, and over all the pixels of one network layer the number of DDR accesses is greatly reduced. This saves a large amount of delay, because various measurements show that roughly seventy percent of the system delay in a neural network chip comes from accessing the large volume of feature map data, while the delay of the actual data processing stage is very small.
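The burst-write pattern can be sketched as follows: for S = 3, the 9 copies of one pixel occupy 3 contiguous runs of 3 positions each, so 3 DDR write transactions (one per touched output row) suffice instead of 9. An illustrative sketch using a hypothetical flat address model:

```python
def burst_writes(i, k, W, S):
    """Return (start_position, run_length) for each DDR write burst
    needed to store the S*S copies of input pixel i (row k).
    Copies within one output row are contiguous, so one burst per
    touched output row suffices: S bursts of length S, not S*S writes."""
    base = S * i + k * (S * W * (S - 1))   # closed-form starting position
    out_w = S * W                          # output feature map width
    return [(base + dr * out_w, S) for dr in range(S)]

# pixel 0 of a 4x4 map, 3x enlargement: 3 bursts instead of 9 writes,
# saving 6 DDR accesses for this pixel as described in the text
assert burst_writes(0, 0, 4, 3) == [(0, 3), (12, 3), (24, 3)]
```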
After this process finishes, all channel data of one pixel of the layer have been processed, and all data at the next pixel position must be processed next. After receiving the indication signal, the state machine jumps to the STATE_UPSAMPLE_INDEX state, the RAM re-caches the data of the new pixel, and the above operations are repeated to scan all pixels; the state machine then jumps to the STATE_UPSAMPLE_END_CONDITION state. If the pixels of the layer have not all been processed, the state machine jumps to the STATE_UPSAMPLE_LOAD_DDR_TRIGGER state and continues to process the remaining image data. If all the pixels of the network layer have been processed, then after the data of the last pixel position are finished the state machine jumps to STATE_UPDATE_INST and STATE_LOAD_INST, indicating that this layer's upsample operation is complete and the next network layer can be run; the instruction set of the next layer is then updated and decoded, and the next layer is executed.
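The control flow described above can be sketched as a small software state machine (illustrative only: the state names follow the description, while the transition conditions are simplified assumptions, not the actual RTL):

```python
# Simplified model of the UPSAMPLE control-flow states and transitions.
TRANSITIONS = {
    "STATE_CFG_RAM":           "STATE_DECODE_INST",        # all instructions received
    "STATE_DECODE_INST":       "STATE_UPSAMPLE",           # var_layer_type[2:0] == 0b101
    "STATE_UPSAMPLE":          "STATE_UPSAMPLE_LOAD_DDR",  # load one pixel's channels
    "STATE_UPSAMPLE_LOAD_DDR": "STATE_UPSAMPLE_COMMIT",    # write S bursts back to DDR
    "STATE_UPSAMPLE_COMMIT":   "STATE_UPSAMPLE_INDEX",     # advance to next pixel
    "STATE_UPSAMPLE_INDEX":    "STATE_UPSAMPLE_END_CONDITION",
}

def run_layer(n_pixels):
    """Walk the states for one upsample layer with n_pixels positions."""
    trace, state, done = [], "STATE_CFG_RAM", 0
    while True:
        trace.append(state)
        if state == "STATE_UPSAMPLE_END_CONDITION":
            done += 1
            if done == n_pixels:                  # layer finished
                trace.append("STATE_UPDATE_INST") # fetch next layer's instructions
                return trace
            trace.append("STATE_UPSAMPLE_LOAD_DDR_TRIGGER")  # more pixels remain
            state = "STATE_UPSAMPLE_LOAD_DDR"
        else:
            state = TRANSITIONS[state]

trace = run_layer(2)   # a toy layer with 2 pixel positions
```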
By means of batch processing, the upsample operator can efficiently process the data of several pictures at the same time. An index algorithm for the storage addresses of the corresponding target network layer is designed, so that addressing can be performed quickly in an intermediate network layer and the SRAM reads and writes the DDR as contiguously as possible, which effectively reduces the number of interactions between the operator and the DDR, greatly improves operator performance, and reduces system delay and power consumption. The front-end functional simulation of the invention has passed, and the back-end FPGA verification result is also correct.
Example 3
A storage medium storing a program file capable of implementing any one of the above vector up-sampling computation methods applied in neural network data processing.
Example 4
A processor, configured to run a program, wherein the program, when running, executes any one of the above vector up-sampling calculation methods applied to neural network data processing.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described system embodiments are merely illustrative, and for example, a division of a unit may be a logical division, and an actual implementation may have another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A vector up-sampling calculation method applied to neural network data processing is characterized by comprising the following steps:
directly calculating the pixel starting position of the final enlarged output feature map from the position of the original input feature map;
and transmitting, by the control module, the pixel starting position to the DMA controller to control the read-write address of the off-chip DDR.
2. The method for computing vector up-sampling applied in neural network data processing according to claim 1, wherein the calculation process of the method is as follows:
S·(i - W·k) + S·W·S·k
= S·i - S·W·k + S^2·W·k
= S·i + k·[S·W·(S - 1)]
wherein the width of the input feature map is W (its length equals its width), the magnification factor of the input feature map is denoted by S, the position of each pixel point of the original input feature map is denoted by i, the row number in the original input feature map is denoted by k, and k is an integer starting from 0;
in the formula, (i - W·k) calculates the column address of the pixel position in the original input feature map; multiplying by S gives its column position in the enlarged output feature map; and adding S·W·S·k gives the starting position in the final output feature map of the pixel corresponding to the input feature map.
3. The vector up-sampling calculation method applied to neural network data processing according to claim 2, wherein the calculation process of the method is implemented using an upsample algorithm; the upsample operator is implemented based on state machine control, the state machine controls the storage and calculation modules in an instruction-set-based mode, and bits in the instruction set are connected to represent specific meanings.
4. The vector up-sampling calculation method applied to neural network data processing according to claim 3, wherein the system first resets (RST) and idles (IDLE), then starts working and writes the instruction set containing the information of the working network layer into the CFG_RAM, and all the instruction sets are decoded after being received;
first, the type of the network layer is judged according to a specific binary code corresponding to each network layer type in the instruction set, and when the control condition is met the state machine jumps to the working state of that network layer for subsequent operation;
then the subsequent instruction set information is decoded to obtain the size, channel count, and starting DDR location of the input feature map to be upsampled; the layer's network data in the DDR is written into the memory Tensor_RAM, and the state machine immediately jumps to the STATE_UPSAMPLE_LOAD_DDR state; each transfer from DDR to Tensor_RAM writes only the data of all channels corresponding to one pixel position of the layer's network map, and the data is finally written back into the DDR as required.
5. The vector up-sampling calculation method applied to neural network data processing according to claim 4, wherein when writing data back into the DDR from the RAM, the state machine jumps to the STATE_UPSAMPLE_COMMIT state and then performs fast reading and writing according to the upsample algorithm, performing S writes into the DDR and writing S copies of the RAM data contiguously each time;
after all channel data of one pixel point in the layer network are processed, all data corresponding to the next pixel point position are processed; after the state machine receives the indication signal it jumps to the STATE_UPSAMPLE_INDEX state, the RAM caches the new pixel point data, the above operations are repeated to scan all pixel points, and the state machine jumps to the STATE_UPSAMPLE_END_CONDITION state;
if the pixel points in the network layer have not all been processed, the state machine jumps to the STATE_UPSAMPLE_LOAD_DDR_TRIGGER state and continues to process the unprocessed image data; and if all the pixel points of the network layer have been processed, then after all the data corresponding to the last pixel point position are processed, the state machine jumps to STATE_UPDATE_INST and STATE_LOAD_INST to perform the operation of the next network layer.
6. The method of claim 5, wherein the input feature map is pre-processed before being input into the accelerator, so that the input feature map is square.
7. The vector up-sampling calculation method applied to neural network data processing according to claim 6, wherein the pixels of the input feature map and the output feature map stored in the DDR are stored by depth, only one channel is needed to obtain the jump starting position of each pixel according to the upsample algorithm, and the addresses of the following channels are read and written contiguously.
8. A vector upsampling computing device for use in neural network data processing, comprising:
the pixel starting position acquisition unit, used for directly calculating the pixel starting position of the final enlarged output feature map from the position of the original input feature map; and
the read-write address control unit, used for controlling the read-write address of the off-chip DDR, the pixel starting position being transmitted to the DMA controller by the control module.
9. A storage medium storing a program file for implementing the vector up-sampling computation method applied to the data processing of the neural network according to any one of claims 1 to 7.
10. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the vector upsampling calculation method applied in the neural network data processing in the method according to any one of claims 1 to 7 in the running process.
CN202210180461.6A 2022-02-25 2022-02-25 Vector up-sampling calculation method and device applied to neural network data processing Pending CN114648444A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210180461.6A CN114648444A (en) 2022-02-25 2022-02-25 Vector up-sampling calculation method and device applied to neural network data processing

Publications (1)

Publication Number Publication Date
CN114648444A true CN114648444A (en) 2022-06-21

Family

ID=81994500


Similar Documents

Publication Publication Date Title
JP2021100247A (en) Distorted document image correction method and device
US8941669B1 (en) Split push buffer rendering for scalability
KR20190022237A (en) Method and apparatus for performing convolution operation in neural network
CN112991142B (en) Matrix operation method, device, equipment and storage medium for image data
EP3663938B1 (en) Signal processing method and apparatus
CN113724128A (en) Method for expanding training sample
CN114022748B (en) Target identification method, device, equipment and storage medium
US20220044104A1 (en) Method and apparatus for forward computation of neural network, and computer-readable storage medium
CN117217274A (en) Vector processor, neural network accelerator, chip and electronic equipment
CN115291813B (en) Data storage method and device, data reading method and device, and equipment
CN114648444A (en) Vector up-sampling calculation method and device applied to neural network data processing
CN113554164A (en) Neural network model optimization method, neural network model data processing method, neural network model optimization device, neural network model data processing device and storage medium
CN107230183A (en) Image rasterization processing method and processing device
CN116051345A (en) Image data processing method, device, computer equipment and readable storage medium
JP4636526B2 (en) Method for correcting non-functional pixels in digital X-ray imaging in real time
CN116958375A (en) Graphics processor, system, apparatus, device, and method
JP5045652B2 (en) Correlation processing device and medium readable by correlation processing device
US7737988B1 (en) Using font filtering engines for texture blitting
CN116306823B (en) Method, device and chip for providing data for MAC array
CN116527908B (en) Motion field estimation method, motion field estimation device, computer device and storage medium
CN111047037B (en) Data processing method, device, equipment and storage medium
CN113553009B (en) Data reading method, data writing method and data reading and writing method
CN117314730B (en) Median filtering computing device and method for accelerating digital image processing
US11842273B2 (en) Neural network processing
CN113554095B (en) Feature map processing method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination