WO2021056143A1 - Image processing method and apparatus, and mobile device - Google Patents

Image processing method and apparatus, and mobile device

Info

Publication number
WO2021056143A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
read
original pixels
pixels
filter
Prior art date
Application number
PCT/CN2019/107299
Other languages
French (fr)
Chinese (zh)
Inventor
仇晓颖
韩彬
吴迪
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司
Priority to PCT/CN2019/107299 (WO2021056143A1)
Priority to CN201980033764.1A (CN112154475A)
Publication of WO2021056143A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20024 Filtering details

Definitions

  • The present disclosure relates to the field of image processing, and in particular to an image processing method, an image processing device, and a mobile device.
  • Filtering is widely used in the field of image processing.
  • When the image processing device executes the filtering algorithm, it first reads the original pixels of the image from the off-chip memory, and then uses the arithmetic units to filter the original pixels.
  • For example, an image processing device has eight arithmetic units. If only four original pixels can be read per operation cycle, only four arithmetic units take part in the operation and the remaining four cannot be used, which limits the overall image filtering performance and reduces the efficiency of image processing.
  • The present disclosure provides an image processing method applied to a vector processing unit that includes multipliers. The method includes: reading P_read original pixels of an image, wherein the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit; reading N coefficients of a filter, wherein the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used to filter the image; and multiplying, by the multipliers, each of the N coefficients by the P_read original pixels to obtain multiple product results, the product results being used to calculate the pixel values of the pixels in the filtered image.
  • The present disclosure also provides an image processing device, which includes: an external storage unit that stores an image and a filter; and a vector processing unit that includes multipliers. The vector processing unit is used to read P_read original pixels of the image, wherein the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit, and to read N coefficients of the filter, wherein the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used to filter the image. The multipliers are used to multiply each of the N coefficients by the P_read original pixels to obtain multiple product results, and the product results are used to calculate the pixel values of the pixels in the filtered image.
  • the present disclosure also provides a mobile device, which includes: the above-mentioned image processing device.
  • In each operation cycle, the present disclosure reads N coefficients of the filter, where the number N of coefficients read is determined according to the number of multipliers of the vector processing unit, and multiplies each of the N coefficients by each of the original pixels to obtain product results. Compared with the prior art, more multipliers take part in the filtering operation and the computing resources are fully utilized, which improves the overall performance of image filtering and the efficiency of image processing.
  • FIG. 1 is a flowchart of an image processing method according to an embodiment of the disclosure.
  • FIG. 2 shows schematic diagrams of the operations in the 1st to 3rd operation cycles, respectively.
  • FIG. 3 shows schematic diagrams of the operations in the 4th to 8th operation cycles, respectively.
  • FIG. 4 shows schematic diagrams of the operations in the 9th to 18th operation cycles, respectively.
  • FIG. 5 shows schematic diagrams of the operations in the 19th and 23rd operation cycles, respectively.
  • FIG. 6 shows schematic diagrams of the operations in the 24th and 28th operation cycles, respectively.
  • FIG. 7 is a schematic diagram of a process of an image processing method according to an embodiment of the disclosure.
  • FIG. 8 shows schematic diagrams of the operations in the 54th to 56th operation cycles, respectively.
  • FIG. 9 is a schematic diagram of a filtered image of the image processing method according to an embodiment of the disclosure.
  • FIG. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the disclosure.
  • FIG. 11 is a schematic diagram of the structure when the arithmetic unit of the image processing device of the embodiment of the disclosure performs preprocessing.
  • FIG. 12 is a schematic diagram of the structure when the arithmetic unit of the image processing device of the embodiment of the disclosure performs parallel processing.
  • An embodiment of the present disclosure provides an image processing method, which is applied to a vector processing unit, and the vector processing unit includes a multiplier. As shown in FIG. 1, the method includes:
  • Step S101: Read P_read original pixels of the image, where the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit;
  • Step S102: Read N coefficients of the filter, where the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image;
  • Step S103: Through the multipliers, multiply each of the N coefficients by the P_read original pixels to obtain multiple product results, where the product results are used to calculate the pixel values of the pixels in the filtered image.
  • the image processing method of this embodiment is run by an image processing device, and the image processing device includes: an external storage unit and a processor.
  • the processor can be any type of chip with vector processing capabilities such as CPU, DSP, FPGA, etc.
  • the vector processing unit can perform filtering and other processing on the image.
  • the external storage unit is its off-chip memory.
  • the external storage unit stores the image to be processed and the filter.
  • the vector processing unit uses the filter to filter the image, and the filtered image obtained can also be stored in an external storage unit.
  • The vector processing unit of the DSP includes multiple multiply-and-accumulate units (MAC, Multiply and Accumulate), and each MAC includes one multiplier and one adder, which are used to perform the multiplication and accumulation operations in filtering.
  • In step S101, P_read original pixels of the image are read, where the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit.
  • the entire filtering process needs to go through multiple calculation cycles to complete.
  • the vector processing unit needs to read P read original pixels of the image.
  • The number P_read of original pixels read is equal to the maximum number of original pixels that the vector processing unit can read in each operation cycle.
  • the maximum number that can be read depends on the access bit width of the data bus and the bit width of the original pixels.
  • the bit width refers to how many bits are used to represent each original pixel.
  • the bit width of the original pixel can be 8bit, 16bit, 32bit, and so on.
  • the memory access bit width refers to how many bit lines the data bus has, that is, how many bits can be transmitted by the data bus at a time, and is generally a multiple of 8.
  • the memory access bit width can be 8bit, 16bit, 32bit, 64bit, etc.
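  • The dependence of P_read on the two bit widths can be sketched as follows. The text only says that the maximum number of readable pixels depends on the memory access bit width and the pixel bit width; the integer division used here, and the function name `pixels_per_cycle`, are assumptions made for illustration.

```python
def pixels_per_cycle(access_bit_width: int, pixel_bit_width: int) -> int:
    """Assumed relation: number of whole pixels that fit in one bus transfer."""
    if access_bit_width <= 0 or pixel_bit_width <= 0:
        raise ValueError("bit widths must be positive")
    # Assumption: pixels are packed back to back with no padding bits.
    return access_bit_width // pixel_bit_width


# A 32-bit access width with 8-bit pixels gives the P_read = 4 used in the
# worked example of this embodiment.
p_read = pixels_per_cycle(32, 8)
```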
  • In step S102, N coefficients of the filter are read, where the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image.
  • the vector processing unit reads the N coefficients of the filter corresponding to the P read original pixels.
  • The number N of coefficients read each time is determined according to the number of multipliers. Specifically, if the vector processing unit includes N_calc multipliers, the number of coefficients is
  • $N = N_{calc} / P_{read}$.
  • The number of multipliers is usually an integer multiple of the number of original pixels read in each operation cycle, and that multiple is the number of filter coefficients read from the external storage unit.
  • The value of N can also be the result of N_calc / P_read rounded up. For example, if the number of multipliers is 10 and the vector processing unit reads 4 original pixels per operation cycle, 10/4 is 2.5, so 3 coefficients are read. Any 2 of the coefficients are each multiplied by each of the 4 original pixels, which uses 8 multipliers; the remaining coefficient is multiplied by any two of the 4 original pixels, which uses the remaining 2 multipliers. In this way, all 10 multipliers are used, and more multiplication operations are performed in one operation cycle.
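  • The rounding-up rule and the 10-multiplier example can be reproduced with a short sketch. The function name `plan_multipliers` and the greedy assignment order are illustrative assumptions; the text only requires that all multipliers end up in use.

```python
import math


def plan_multipliers(n_calc: int, p_read: int) -> list[int]:
    """Return, per coefficient read, how many multipliers it occupies."""
    n = math.ceil(n_calc / p_read)  # number of coefficients N, rounded up
    usage = []
    remaining = n_calc
    for _ in range(n):
        used = min(p_read, remaining)  # a coefficient needs at most P_read multipliers
        usage.append(used)
        remaining -= used
    return usage


usage_10 = plan_multipliers(10, 4)  # example from the text: 10 multipliers, 4 pixels
usage_8 = plan_multipliers(8, 4)    # 8 multipliers, 4 pixels
```

  • With 10 multipliers and 4 pixels, two coefficients occupy 4 multipliers each and the third occupies the remaining 2; with 8 multipliers, N is exactly 2.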
  • In step S103, the multipliers multiply each of the N coefficients by the P_read original pixels to obtain multiple product results.
  • the product result is used to calculate the pixel value of the pixel in the filtered image.
  • the image processing method of this embodiment includes preprocessing and parallel processing.
  • When the vector processing unit reads the P_read original pixels in the first row of the image, preprocessing is performed. In the preprocessing, the vector processing unit reads the P_read original pixels in the first line of the image and either one coefficient or N coefficients in the first line of the filter, and multiplies the one coefficient by the P_read original pixels, or multiplies each of the N coefficients by the P_read original pixels, to obtain product results.
  • parallel processing is performed.
  • the vector processing unit reads the P read original pixels in a row of the image and the N coefficients of the filter, where the N coefficients are located in the same column of the adjacent N rows of the filter.
  • the N calc multipliers multiply each of the N coefficients and the P read original pixels to obtain the product result.
  • The original size of the image is 16 × 4.
  • This embodiment may pad the image.
  • The size of the padded image is 20 × 8, that is, the width M_w is 20 and the height N_h is 8. Compared with the original image, it extends multiple elements beyond the original image boundary.
  • The specific padding method can be to copy the elements adjacent to the boundary into the area outside the boundary of the original image, or to use preset padding elements. This is only an exemplary description and does not limit the present disclosure.
  • The size of the filter is 5 × 5, that is, both the width F_w and the height F_h are 5.
  • the filter performs filtering operations on the filled image. The following takes two-dimensional convolution as an example to describe the filtering operation in detail.
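  • As a point of reference for the cycle-by-cycle description, the filtering result can be written as a plain nested loop. This is a minimal sketch, not the scheduled method of this embodiment. The indexing C[i][j] = sum over p, q of A[p][q] * B[i+p][j+q] is an assumption inferred from the operation cycles described in this embodiment (coefficient A0,1 is multiplied by pixels B0,1 to B0,4, and so on), and the function name `filter2d` is hypothetical.

```python
def filter2d(image, kernel):
    """Reference result: C[i][j] = sum_p sum_q A[p][q] * B[i + p][j + q]."""
    n_h, m_w = len(image), len(image[0])
    f_h, f_w = len(kernel), len(kernel[0])
    out_h, out_w = n_h - f_h + 1, m_w - f_w + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for p in range(f_h):
                for q in range(f_w):
                    out[i][j] += kernel[p][q] * image[i + p][j + q]
    return out


# Padded 20 x 8 image (width x height) and 5 x 5 filter, as in this embodiment.
padded = [[1] * 20 for _ in range(8)]
coeffs = [[1] * 5 for _ in range(5)]
filtered = filter2d(padded, coeffs)
```

  • For the padded 20 × 8 image and the 5 × 5 filter, the result has 4 rows and 16 columns, which matches the original 16 × 4 image size.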
  • The original pixels of these lines are used only once: when the product operation in the filtering is performed on the original pixels of other lines, the original pixels of these lines are not reused. Therefore, the original pixels of these lines can be calculated by preprocessing within the line.
  • the first line of the image is in the above situation.
  • the original pixels in the first row are used only once, and the original pixels in the first row will not be multiplexed when performing product operations in the filtering on the original pixels in the other rows. Therefore, the original pixels of the first row are calculated by preprocessing. Please refer to Figure 2 and Figure 7 together to introduce the preprocessing process of the first line of the image.
  • The preprocessing process includes:
  • Cycle 1: four multipliers multiply the coefficient A0,0 by the four original pixels B0,0, B0,1, B0,2, and B0,3 of the image, respectively, to obtain the product results of the coefficient A0,0.
  • Cycle 2: four multipliers multiply the coefficient A0,1 by the four original pixels B0,1, B0,2, B0,3, and B0,4 of the image, respectively, to obtain the product results of the coefficient A0,1; the other four multipliers multiply the coefficient A0,2 by the four original pixels B0,2, B0,3, B0,4, and B0,5 of the image, respectively, to obtain the product results of the coefficient A0,2.
  • Cycle 3: four multipliers multiply the coefficient A0,3 by the four original pixels B0,3, B0,4, B0,5, and B0,6 of the image, respectively, to obtain the product results of the coefficient A0,3; the other four multipliers multiply the coefficient A0,4 by the four original pixels B0,4, B0,5, B0,6, and B0,7 of the image, respectively, to obtain the product results of the coefficient A0,4.
  • the preprocessing is completed, and the product result obtained by the preprocessing is used to calculate the four pixels in the first line of the filtered image.
  • The preprocessing process of the first line is introduced above. By analogy, the above steps can be repeated from the second line of the image to perform intra-line preprocessing on other lines of the image, to obtain the product results used to calculate the pixels of each line of the filtered image. Accumulating the product results that are used to calculate the pixel value of the same pixel in the filtered image yields the pixel value of that pixel.
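  • The preprocessing of one image line can be sketched as follows. The grouping of one coefficient in the first cycle and N coefficients in each later cycle mirrors the worked example; the function names `preprocess_schedule` and `preprocess_row` are hypothetical, and accumulating directly into one list is an assumption about how the product results are combined.

```python
def preprocess_schedule(f_w: int, n: int) -> list[list[int]]:
    """Filter columns handled in each preprocessing cycle (first cycle: one coefficient)."""
    schedule = [[0]]
    col = 1
    while col < f_w:
        schedule.append(list(range(col, min(col + n, f_w))))
        col += n
    return schedule


def preprocess_row(image_row, kernel_row, p_read: int, n: int) -> list[int]:
    """Accumulate the products of one filter row with one image row for P_read outputs."""
    acc = [0] * p_read
    for cols in preprocess_schedule(len(kernel_row), n):
        for q in cols:                # coefficients multiplied in the same cycle
            for j in range(p_read):   # one multiplier per original pixel
                acc[j] += kernel_row[q] * image_row[q + j]
    return acc


schedule = preprocess_schedule(5, 2)
partial = preprocess_row(list(range(8)), [1] * 5, p_read=4, n=2)
```

  • For F_w = 5 and N = 2 the schedule has three cycles, which agrees with the three preprocessing cycles of the first image line.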
  • As described above, the original pixels of some lines of the image are used only once and are not reused. For other lines of the image, however, the original pixels are used multiple times when the filtering results are calculated: when the product operation in the filtering is performed on the original pixels of other lines, the original pixels of these lines can be reused. Therefore, the original pixels of these lines can be processed in parallel within the line.
  • parallel processing can be performed on other rows after the first row, and the parallel processing can further improve the operation efficiency of two-dimensional filtering and the overall performance of image filtering. Please refer to Figure 3 and Figure 7 together to introduce the parallel processing process.
  • the parallel processing process includes:
  • Cycle 4: eight multipliers multiply the coefficients A0,0 and A1,0 by the four original pixels B1,0, B1,1, B1,2, and B1,3 of the image, respectively, to obtain the product results of the coefficients A0,0 and A1,0. The 4 product results of the coefficient A1,0 are used to calculate the four pixels in the first line of the filtered image and are accumulated to the accumulation result of cycle 3; the 4 product results of the coefficient A0,0 are used to calculate the four pixels in the second line of the filtered image.
  • Cycle 5: eight multipliers multiply the coefficients A0,1 and A1,1 by the four original pixels B1,1, B1,2, B1,3, and B1,4 of the image, respectively, to obtain the product results of the coefficients A0,1 and A1,1. The 4 product results of the coefficient A1,1 are used to calculate the four pixels in the first line of the filtered image and are accumulated to the accumulation result of cycle 4; the 4 product results of the coefficient A0,1 are used to calculate the four pixels in the second line of the filtered image and are accumulated to the product results of cycle 4.
  • Cycle 6: eight multipliers multiply the coefficients A0,2 and A1,2 by the four original pixels B1,2, B1,3, B1,4, and B1,5 of the image, respectively, to obtain the product results of the coefficients A0,2 and A1,2. The 4 product results of the coefficient A1,2 are used to calculate the four pixels in the first line of the filtered image and are accumulated to the accumulation result of cycle 5; the 4 product results of the coefficient A0,2 are used to calculate the four pixels in the second line of the filtered image and are accumulated to the accumulation result of cycle 5.
  • Cycle 7: eight multipliers multiply the coefficients A0,3 and A1,3 by the four original pixels B1,3, B1,4, B1,5, and B1,6 of the image, respectively, to obtain the product results of the coefficients A0,3 and A1,3. The 4 product results of the coefficient A1,3 are used to calculate the four pixels in the first line of the filtered image and are accumulated to the accumulation result of cycle 6; the 4 product results of the coefficient A0,3 are used to calculate the four pixels in the second line of the filtered image and are accumulated to the accumulation result of cycle 6.
  • Cycle 8: eight multipliers multiply the coefficients A0,4 and A1,4 by the four original pixels B1,4, B1,5, B1,6, and B1,7 of the image, respectively, to obtain the product results of the coefficients A0,4 and A1,4. The 4 product results of the coefficient A1,4 are used to calculate the four pixels in the first line of the filtered image and are accumulated to the accumulation result of cycle 7; the 4 product results of the coefficient A0,4 are used to calculate the four pixels in the second line of the filtered image and are accumulated to the accumulation result of cycle 7.
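  • One parallel-processing cycle can be sketched as a function that multiplies N coefficients from the same filter column by the same P_read pixels and routes each product to the filtered-image line it belongs to. The routing rule (a product of filter row p with image row i contributes to filtered row i - p) is inferred from the cycles listed above; the function name `parallel_cycle` and the accumulator layout are assumptions.

```python
def parallel_cycle(acc, image, kernel, row, col, filter_rows, p_read):
    """One operation cycle: coefficients kernel[p][col] for p in filter_rows
    are each multiplied by the same P_read pixels of image[row]."""
    pixels = image[row][col:col + p_read]
    for p in filter_rows:
        out_row = row - p  # filtered-image line that this coefficient contributes to
        for j, pixel in enumerate(pixels):
            acc[out_row][j] += kernel[p][col] * pixel
    return acc


# Second image line (row index 1), filter rows 0 and 1, five cycles (columns 0..4).
image = [[1] * 8, [3] * 8]
kernel = [[1] * 5, [2] * 5]
acc = [[0] * 4 for _ in range(2)]
for col in range(5):
    parallel_cycle(acc, image, kernel, row=1, col=col, filter_rows=(0, 1), p_read=4)
```

  • Each call uses 2 × 4 = 8 multiplications, which corresponds to the eight multipliers being busy in every cycle of the parallel stage.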
  • The parallel processing process of the third line of the image is similar to the parallel processing process of the second line above.
  • The difference is that the 4 original pixels of the third line of the image are read, together with the coefficients in the same column of two adjacent rows of the filter, namely the second and third rows. Refer to FIG. 4(a) to FIG. 4(e) and FIG. 7; the specific process is not repeated here.
  • the parallel processing process is as follows:
  • Cycle 19: eight multipliers multiply the coefficients A3,0 and A4,0 by the four original pixels B4,0, B4,1, B4,2, and B4,3 of the image, respectively, to obtain the product results of the coefficients A3,0 and A4,0. The 4 product results of the coefficient A4,0 are used to calculate the four pixels in the first line of the filtered image and are accumulated to the accumulation result of cycle 18; the 4 product results of the coefficient A3,0 are used to calculate the four pixels in the second line of the filtered image and are accumulated to the accumulation result of cycle 18.
  • Cycle 23: eight multipliers multiply the coefficients A3,4 and A4,4 by the four original pixels B4,4, B4,5, B4,6, and B4,7 of the image, respectively, to obtain the product results of the coefficients A3,4 and A4,4. The 4 product results of the coefficient A4,4 are used to calculate the four pixels in the first line of the filtered image and are accumulated to the accumulation result of cycle 22; the accumulation result of cycle 23 is the four pixels C0,0-C0,3 of the first line of the filtered image. The 4 product results of the coefficient A3,4 are used to calculate the four pixels in the second line of the filtered image and are accumulated to the accumulation result of cycle 22.
  • Another P read original pixels are read from the internal storage unit of the vector processing unit, and the other P read original pixels are read from the image in the previous operation cycle and stored in the internal storage unit.
  • Each of the N coefficients located at the end of the filter height direction is respectively multiplied by the P read original pixels of a line read from the image to obtain the product result;
  • Each of the N coefficients located at the head of the filter height direction is respectively multiplied by another P read original pixels read from the internal storage unit to obtain the product result.
  • The 4 original pixels of the sixth row of the image, namely B5,4, B5,5, B5,6, and B5,7, are read from the external storage unit, and the 4 original pixels of the third row of the image, namely B2,4, B2,5, B2,6, and B2,7, are read from the internal storage unit.
  • The 4 original pixels B2,4, B2,5, B2,6, and B2,7 were read from the image in cycle 13 and stored in the internal storage unit.
  • Four multipliers multiply the filter-head coefficient A0,4 by the 4 original pixels B2,4, B2,5, B2,6, and B2,7 of the image, respectively, to obtain the product results of the coefficient A0,4; the 4 product results of the coefficient A0,4 are used to calculate the four pixels in the third line of the filtered image and are accumulated to the accumulation result of cycle 27;
  • the other four multipliers multiply the filter-tail coefficient A4,4 by the four original pixels B5,4, B5,5, B5,6, and B5,7 of the image, respectively, to obtain the product results of the coefficient A4,4; the 4 product results of the coefficient A4,4 are used to calculate the four pixels in the second line of the filtered image and are accumulated to the accumulation result of cycle 27.
  • The accumulation result of cycle 28 is the four pixels C1,0-C1,3 of the second line of the filtered image;
  • by analogy, the four pixels C2,0-C2,3 of the third row of the filtered image can be obtained. Starting from cycle 28 and passing through a further 25 operation cycles, that is, after a total of 53 operation cycles, the four pixels C3,0-C3,3 of the fourth row of the filtered image are obtained, thereby obtaining the pixels in columns 1 to 4 of all four rows of the filtered image.
  • The height N_h of the image is 8 and the height F_h of the filter is 5, so the height N_o of the filtered image is 4.
  • The number N of filter coefficients read in each operation cycle is 2, so the height N_o of the filtered image is divisible by the number N of filter coefficients.
  • The filtering process for the pixels in the first to fourth columns of the filtered image is then complete. If the height N_o of the filtered image is not divisible by the number N of filter coefficients, the filtering of the pixels in the first to fourth columns of the filtered image is not yet complete, and the image must be processed further to obtain the pixels of the remaining lines of the filtered image. Because the remaining lines of the filtered image cannot be processed in parallel, the following in-line processing is performed:
  • the one coefficient is multiplied by the P read original pixels, or each coefficient of the N coefficients is multiplied by the P read original pixels to obtain a product result.
  • The filtering process of the pixels in columns 1 to 4 of the filtered image has not yet been completed. At this time, it is necessary to continue processing the image to obtain the pixels in the fifth row of the filtered image. Because only one row of pixels remains, namely the fifth row of the filtered image, it cannot be processed in parallel and is processed in-line instead.
  • the above-mentioned preprocessing process is performed once for each of the 5th to 9th rows of the image. That is, for each of the 5th to 9th rows of the image, read one or N coefficients in a row of the filter, and multiply the one or N coefficients with each original pixel of the row to obtain the product result.
  • The processing process specifically includes:
  • Cycle 54: four multipliers multiply the coefficient A0,0 by the four original pixels B4,0, B4,1, B4,2, and B4,3 of the image, respectively, to obtain the product results of the coefficient A0,0.
  • Cycle 55: four multipliers multiply the coefficient A0,1 by the four original pixels B4,1, B4,2, B4,3, and B4,4 of the image, respectively, to obtain the product results of the coefficient A0,1; the other four multipliers multiply the coefficient A0,2 by the four original pixels B4,2, B4,3, B4,4, and B4,5 of the image, respectively, to obtain the product results of the coefficient A0,2.
  • Cycle 56: four multipliers multiply the coefficient A0,3 by the four original pixels B4,3, B4,4, B4,5, and B4,6 of the image, respectively, to obtain the product results of the coefficient A0,3; the other four multipliers multiply the coefficient A0,4 by the four original pixels B4,4, B4,5, B4,6, and B4,7 of the image, respectively, to obtain the product results of the coefficient A0,4, and the two sets of product results are accumulated to the accumulation result of cycle 55.
  • the preprocessing of the fifth line of the image is completed, and the product results obtained in each of the foregoing calculation cycles are used to calculate the four pixels of the fifth line of the filtered image.
  • the above steps are repeated, and the above-mentioned preprocessing is performed on the 6th to 9th rows of the image and the second to 5th rows of the corresponding filter respectively, and the preprocessing of each row requires 3 operation cycles.
  • After a total of 68 operation cycles, the pixels of the fifth row of the filtered image are obtained, so that the pixels in the first to fourth columns of all five rows of the filtered image are obtained, and the filtering of columns 1 to 4 of the filtered image is complete.
  • The image processing method of this embodiment has been described above for the case where the number of multipliers N_calc is 8, the number P_read of original pixels read in each operation cycle is 4, the number N of filter coefficients is 2, the image width M_w is 20, and the width F_w and the height F_h of the filter are both 5.
  • the values of the above parameters are not limited to this.
  • the above parameters in this embodiment can also take other values.
  • the image processing method is similar to the above description, and those skilled in the art should be fully aware of the specific process of the image processing method.
  • the calculation time of P read pixels in the first row of the filtered image is:
  • T pre is the calculation time of the above preprocessing;
  • F w and F h are the width and height of the filter, respectively;
  • cycle is a calculation cycle.
  • The preprocessing operation time T_pre is: $T_{pre} = (1 + \lceil (F_w - 1)/N \rceil) \times cycle$.
  • When the width F_w and the height F_h of the filter are both 5 and N is 2, the preprocessing calculation time T_pre is 3 calculation cycles, so the calculation time is 23 calculation cycles. In other words, after 23 operation cycles, the 4 pixels in the first row of the filtered image are obtained.
  • the width F w and the height F h of the filter are both 5, and the calculation time is 25 calculation cycles. In other words, 4 pixels in each row from the second row to the fourth row of the filtered image require 25 operation cycles.
  • The operation time of the P_read pixels in each line of the last r lines of the filtered image is: $T_{pre} \times F_h$.
  • When the height N_h of the image is 9, the last line of the filtered image, that is, the fifth line, is obtained by in-line calculation. With a filter height F_h of 5, the calculation time is 15 calculation cycles; that is, the 4 pixels in the fifth row of the filtered image are obtained through 15 operation cycles.
  • N_o is the height of the filtered image, $N_o = N_h - F_h + 1$;
  • N_h is the height of the image;
  • r is the remainder of dividing N_o by N, that is, the number of lines of the filtered image that need to be processed in-line;
  • T_r is the operation time of the in-line processing: $T_r = r \times T_{pre} \times F_h$.
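  • The timing expressions for the preprocessing and the in-line processing can be checked numerically. This is a sketch that counts time in operation cycles (cycle = 1); the function names `preprocess_cycles` and `inline_cycles` are hypothetical, and only the expressions for T_pre, N_o, r, and T_r stated in the text are used.

```python
import math


def preprocess_cycles(f_w: int, n: int) -> int:
    """T_pre = 1 + ceil((F_w - 1) / N), in operation cycles."""
    return 1 + math.ceil((f_w - 1) / n)


def inline_cycles(n_h: int, f_h: int, f_w: int, n: int) -> tuple[int, int, int]:
    """Return (N_o, r, T_r) with N_o = N_h - F_h + 1, r = N_o mod N, T_r = r * T_pre * F_h."""
    n_o = n_h - f_h + 1
    r = n_o % n
    t_r = r * preprocess_cycles(f_w, n) * f_h
    return n_o, r, t_r


t_pre = preprocess_cycles(5, 2)
tall = inline_cycles(n_h=9, f_h=5, f_w=5, n=2)   # image height 9
even = inline_cycles(n_h=8, f_h=5, f_w=5, n=2)   # image height 8
```

  • For an image height of 9 this gives N_o = 5, r = 1, and T_r = 15 cycles, matching the 15 cycles quoted for the fifth row; for an image height of 8, r is 0 and no in-line processing is needed.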
  • The original pixels in this embodiment are read in order, that is, from top to bottom in the image height direction, from the first row to the last row, and from left to right within each row.
  • the pixels of the filtered image are also output in order.
  • Reading the P_read original pixels of one line of the image includes:
  • The n-th group of original pixels comprises the P_read original pixels from column n to column n + P_read - 1, where 1 ≤ n ≤ F_w.
  • the original pixels of each row are read from top to bottom in the height direction of the image.
  • 5 groups of original pixels are read in sequence:
  • the first group of original pixels includes the 4 original pixels from column 1 to column 4, such as B1,0, B1,1, B1,2, B1,3;
  • the second group of original pixels includes the 4 original pixels from column 2 to column 5, such as B1,1, B1,2, B1,3, B1,4;
  • the third group of original pixels includes the 4 original pixels from column 3 to column 6, such as B1,2, B1,3, B1,4, B1,5;
  • the fourth group of original pixels includes the 4 original pixels from column 4 to column 7, such as B1,3, B1,4, B1,5, B1,6;
  • the fifth group of original pixels includes the 4 original pixels from column 5 to column 8, for example, B1,4, B1,5, B1,6, B1,7.
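  • The grouping rule can be written as a sliding window over one image line. In this sketch the column index is zero-based, so group n of the text (1 ≤ n ≤ F_w) is `groups[n - 1]`; the function name `pixel_groups` is hypothetical.

```python
def pixel_groups(row, p_read: int, f_w: int):
    """Group n (zero-based) holds the P_read pixels starting at column n."""
    return [row[n:n + p_read] for n in range(f_w)]


second_line = [f"B1,{c}" for c in range(8)]
groups = pixel_groups(second_line, p_read=4, f_w=5)
```

  • Consecutive groups overlap in P_read - 1 pixels, which is what makes reuse from the internal storage unit possible.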
  • the order of reading the original pixels of the image is not limited, and reading can be sequential, reversed, or skipped. As long as it can traverse all the original pixels of the image, traverse all the filter coefficients, and complete the M*N*F w *F h multiplication and accumulation operations, the filtering can be completed. It's just that in the case of reverse or skip reading, the output order of the pixels of the filtered image is different.
  • P read original pixels of the image are read from the external storage unit each time.
  • After each operation cycle reads P_read original pixels of the image from the external storage unit, the P_read original pixels can be stored in the internal storage unit of the vector processing unit. In this way, in a subsequent operation cycle, if some of the P_read original pixels that need to be read have already been stored in the internal storage unit, that part can be read directly from the internal storage unit, and only the other part, which is not stored in the internal storage unit, needs to be read from the external storage unit.
  • the P read original pixels of one line of the read image include:
  • reading the remaining part of the P read original pixels from the external storage unit, to obtain the P read original pixels.
  • B 1,0 , B 1,1 , B 1,2 , B 1,3 are stored in the internal storage unit of the vector processing unit.
  • B 1,1 , B 1,2 , and B 1,3 can be read from the internal storage unit. This reduces the amount of data read by the vector processing unit from the off-chip memory and saves bandwidth.
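The bandwidth saving can be made concrete with a small counting model. It is a sketch under simplifying assumptions that are not stated in the disclosure: the internal storage unit is modelled as an unbounded set, and the helper name `external_reads` is hypothetical.

```python
def external_reads(groups, use_internal_storage):
    """Count pixels fetched from external memory while reading the groups in order."""
    stored = set()   # pixels already held in the internal storage unit (assumed unbounded)
    fetched = 0
    for group in groups:
        for pixel in group:
            if use_internal_storage and pixel in stored:
                continue          # served from the internal storage unit
            fetched += 1          # has to come from the external storage unit
            stored.add(pixel)
    return fetched


# Five overlapping groups of 4 pixels in image row 1: columns 0-3, 1-4, ..., 4-7.
groups = [[(1, c) for c in range(start, start + 4)] for start in range(5)]
print(external_reads(groups, use_internal_storage=False))  # 20
print(external_reads(groups, use_internal_storage=True))   # 8
```

In this model the five groups need 20 external pixel reads without reuse and only 8 with reuse, because only 8 distinct pixels are involved.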
  • the multiplication and accumulation operations are performed in each operation cycle, that is, the filter coefficients are multiplied by the original pixels, and the result of the multiplication is accumulated to the accumulation result of the previous operation cycle.
  • the present disclosure is not limited to this. It is also possible to perform multiplication first to obtain all the product results used to calculate the pixel points of the filtered image and then perform the accumulation.
  • the present disclosure reads N coefficients of the filter in each operation cycle, and the number N of coefficients read is determined according to the number of multipliers that execute the image processing method; each of the N coefficients is multiplied by each of the original pixels to obtain the product results.
  • more vector processing units are involved in the filtering operation.
  • the calculation resources are fully utilized, the overall performance of image filtering is greatly improved, and the image processing efficiency is greatly improved.
  • Another embodiment of the present disclosure provides an image processing device, as shown in FIG. 10, including:
  • the external storage unit stores images and filters.
  • the vector processing unit includes a multiplier; the vector processing unit is used to read P read original pixels of the image, wherein the value of P read is determined according to the memory access bit width corresponding to the vector processing unit, and to read N coefficients of the filter, wherein the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image;
  • the multiplier is used to multiply each of the N coefficients by each of the P read original pixels to obtain multiple product results, and the product results are used to calculate the pixel values of the pixels in the filtered image.
  • the vector processing unit can be an arithmetic unit in the processor, which can perform processing such as filtering on the image.
  • the processor can be any type of chip with vector processing capabilities such as CPU, DSP, FPGA, etc.
  • the external storage unit is its off-chip memory. The external storage unit stores the image to be processed and the filter. The vector processing unit uses the filter to filter the image, and the filtered image obtained can also be stored in an external storage unit.
  • the vector processing unit of the DSP includes a plurality of MACs, and each MAC includes a multiplier and an adder, which are used to perform multiplication and accumulation operations in filtering.
  • FIG. 10 only schematically shows the structure of the image processing apparatus.
  • there may be one or more external storage units and the images and filters may be stored in one or multiple external storage units.
  • the vector processing unit needs to read P read original pixels of a line of the image.
  • the number of original pixels P read to be read is equal to the maximum number of original pixels that can be read in each operation cycle.
  • the maximum number that can be read depends on the access bit width of the data bus and the bit width of the original pixels. That is, the maximum number in this embodiment is equal to the quotient of the memory access bit width and the original pixel bit width.
  • the vector processing unit reads the N coefficients of the filter corresponding to the P read original pixels from the external storage unit.
  • the number N of coefficients read each time is determined according to the number of multipliers. Specifically, if the image processing device includes N calc multipliers, the number of coefficients is N = N calc / P read .
  • the number of multipliers is usually a multiple of the number of original pixels read in each operation cycle.
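The two sizing rules above can be checked with a short calculation. The 32-bit access width and 8-bit pixel width below are assumed values chosen only so that the result matches the example's P read = 4; they are not given in the text. The function names are hypothetical, and rejecting a multiplier count that is not a multiple of P read is a simplification of "usually a multiple".

```python
def pixels_per_cycle(access_bit_width, pixel_bit_width):
    """P_read: the quotient of the memory access bit width and the pixel bit width."""
    return access_bit_width // pixel_bit_width


def coefficients_per_cycle(n_calc, p_read):
    """N = N_calc / P_read, for a multiplier count that is a multiple of P_read."""
    if n_calc % p_read != 0:
        raise ValueError("this sketch only handles N_calc that is a multiple of P_read")
    return n_calc // p_read


p_read = pixels_per_cycle(access_bit_width=32, pixel_bit_width=8)  # assumed widths
n = coefficients_per_cycle(n_calc=8, p_read=p_read)
print(p_read, n)  # 4 2
```

With 8 multipliers and 4 pixels per read, 2 coefficients per cycle keep all 8 multipliers busy.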
  • the number of multipliers is a multiple of the number of original pixels read in each operation cycle, just read from the storage unit
  • the coefficient of the filter can be 8, 16, 32, and so on.
  • after the original pixels and the filter coefficients are read, the multipliers multiply each of the N coefficients by each original pixel to obtain the product results.
  • when the vector processing unit reads P read original pixels in the first row of the image, preprocessing is performed. In the preprocessing, the vector processing unit reads the P read original pixels in the first row of the image and one or N coefficients of the first row of the filter, and the one or N coefficients are respectively multiplied by each original pixel of the first row of the image to obtain the product results.
  • parallel processing is performed.
  • the vector processing unit reads the P read original pixels in the first row of the image and the N coefficients of the filter, where the N coefficients are located in the same column of the adjacent N rows of the filter.
  • the N calc multipliers respectively multiply the N coefficients by each of the P read original pixels to obtain the product results.
  • the size of the filled image is 20 × 8, that is, the width M w is 20 and the height N h is 8.
  • the size of the filter is 5 × 5, that is, both the width F w and the height F h are 5.
  • the filter performs filtering operations on the filled image. The following takes two-dimensional convolution as an example to describe the filtering operation in detail.
  • the original pixels of these lines are only used once.
  • when the product operation in the filtering is performed on the original pixels of other lines, these original pixels will not be reused. Therefore, the original pixels of these lines can be handled by in-line preprocessing.
  • the first line of the image is in the above situation.
  • the original pixels in the first row are used only once, and the original pixels in the first row will not be multiplexed when performing product operations in the filtering on the original pixels in the other rows. Therefore, the original pixels of the first row are calculated by preprocessing.
  • when performing preprocessing, the vector processing unit is as shown in FIG. 11, which shows the multipliers and adders of 8 MACs (MAC0-MAC7).
  • the vector processing unit also includes:
  • Input buffers: group A buffers A1 and A2 and group B buffers B1, B2, B3, B4, B5, which are used to buffer the read filter coefficients and the original pixels of the image.
  • Output buffers: ACC0, ACC1, ACC2, ACC3.
  • the multiplier of each MAC is connected to the group B buffer, the multipliers of the first 4 MACs are connected to the buffer A1, and the multipliers of the last 4 MACs are connected to the buffer A2.
  • the multipliers of MAC0-MAC3 respectively multiply the coefficient A 0,0 by the original pixels B 0,0 , B 0,1 , B 0,2 , B 0,3 to obtain the product results of the coefficient A 0,0 , which are respectively cached to ACC0, ACC1, ACC2, ACC3.
  • B2 first passes its buffered original pixel B 0,1 to B1 via MUX2 and then caches the read original pixel B 0,2 ; B3 caches the read original pixel B 0,3 , B4 caches the read original pixel B 0,4 , and B5 caches the read original pixel B 0,5 ;
  • the multipliers of MAC0-MAC3 multiply the coefficient A 0,1 by the four original pixels B 0,1 , B 0,2 , B 0,3 , B 0,4 of the image, respectively, to obtain the product results of the coefficient A 0,1 ;
  • the multipliers of MAC4-MAC7 multiply the coefficient A 0,2 by the four original pixels B 0,2 , B 0,3 , B 0,4 , B 0,5 of the image, respectively, to obtain the product results of the coefficient A 0,2 ;
  • the accumulators of MAC0-MAC7 accumulate the two sets of product results to ACC0, ACC1, ACC2, ACC3.
  • B3 first passes its buffered original pixel B 0,3 to B1 via MUX2 and then caches the read original pixel B 0,5 ; B2 caches the read original pixel B 0,4 , B4 caches the read original pixel B 0,6 , and B5 caches the read original pixel B 0,7 ;
  • the multipliers of MAC0-MAC3 multiply the coefficient A 0,3 by the four original pixels B 0,3 , B 0,4 , B 0,5 , B 0,6 of the image, respectively, to obtain the product results of the coefficient A 0,3 ;
  • the multipliers of MAC4-MAC7 multiply the coefficient A 0,4 by the four original pixels B 0,4 , B 0,5 , B 0,6 , B 0,7 of the image, respectively, to obtain the product results of the coefficient A 0,4 ;
  • the accumulators of MAC0-MAC7 accumulate the two sets of product results to ACC0, ACC1, ACC2, ACC3.
  • the preprocessing is completed, and the product result obtained by the preprocessing is used to calculate the four pixels in the first line of the filtered image.
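The three preprocessing cycles described above (one coefficient in the first cycle, then two coefficients per cycle) can be modelled as follows. This is a behavioural sketch, not a description of the hardware: the function name `preprocess_row` and the list-based accumulators are hypothetical, and the buffers and multiplexers are not modelled.

```python
def preprocess_row(coeff_row, pixel_row, p_read=4, n=2):
    """Accumulate one filter row against one image row into p_read accumulators.

    Returns the accumulator values (ACC0..ACC3 in the example) and the number
    of operation cycles used.
    """
    acc = [0] * p_read
    # Cycle 1 applies a single coefficient; each later cycle applies up to n.
    schedule = [[0]] + [list(range(j, min(j + n, len(coeff_row))))
                        for j in range(1, len(coeff_row), n)]
    for coeffs_this_cycle in schedule:
        for j in coeffs_this_cycle:
            # Coefficient j is multiplied by the pixel group starting at column j.
            for k in range(p_read):
                acc[k] += coeff_row[j] * pixel_row[j + k]
    return acc, len(schedule)


acc, cycles = preprocess_row([1, 2, 3, 4, 5], list(range(8)))
print(acc, cycles)  # [40, 55, 70, 85] 3
```

For a filter width of 5 and N = 2 the schedule has three entries, which agrees with the three operation cycles of the first-line preprocessing.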
  • the preprocessing of the first line is described above. By analogy, the above steps can be repeated from the second line of the image, performing in-line preprocessing on the other lines of the image to obtain the product results used to calculate the pixels of each line of the filtered image.
  • the product results used to calculate the pixel value of the same pixel in the filtered image are accumulated together, and the pixel value of that pixel is obtained.
  • the original pixels of some lines of the image are used only once and are not reused. For other lines of the image, however, the original pixels are used multiple times when the filtering results are calculated: when the product operation in the filtering is performed on the original pixels of other lines, the original pixels of these lines can be reused. Therefore, the original pixels of these lines can be handled by in-line parallel processing.
  • parallel processing can be performed on other rows after the first row, and the parallel processing can further improve the operation efficiency of two-dimensional filtering and the overall performance of image filtering.
  • the vector processing unit also includes:
  • the internal storage units R0, R1, R2, R3, R4, R5, R6, R7; R0-R7 are the on-chip memories of the vector processing unit.
  • the multipliers of MAC0-MAC3 multiply the coefficient A 1,0 by the four original pixels B 1,0 , B 1,1 , B 1,2 , B 1,3 of the image, respectively, to obtain the product results of the coefficient A 1,0 ;
  • the accumulators of MAC0-MAC3 accumulate the product results to ACC0, ACC1, ACC2, ACC3;
  • the multipliers of MAC4-MAC7 multiply the coefficient A 0,0 by the four original pixels B 1,0 , B 1,1 , B 1,2 , B 1,3 of the image, respectively, to obtain the product results of the coefficient A 0,0 ;
  • the accumulators of MAC4-MAC7 accumulate the product results to ACC4, ACC5, ACC6, ACC7.
  • the multipliers of MAC0-MAC3 multiply the coefficient A 1,1 by the four original pixels B 1,1 , B 1,2 , B 1,3 , B 1,4 of the image, respectively, to obtain the product results of the coefficient A 1,1 ;
  • the accumulators of MAC0-MAC3 accumulate the product results to ACC0, ACC1, ACC2, ACC3;
  • the multipliers of MAC4-MAC7 multiply the coefficient A 0,1 by the four original pixels B 1,1 , B 1,2 , B 1,3 , B 1,4 of the image, respectively, to obtain the product results of the coefficient A 0,1 ;
  • the accumulators of MAC4-MAC7 accumulate the product results to ACC4, ACC5, ACC6, ACC7.
  • the multipliers of MAC0-MAC3 multiply the coefficient A 1,4 by the four original pixels B 1,4 , B 1,5 , B 1,6 , B 1,7 of the image, respectively, to obtain the product results of the coefficient A 1,4 ;
  • the accumulators of MAC0-MAC3 accumulate the product results to ACC0, ACC1, ACC2, ACC3;
  • the multipliers of MAC4-MAC7 multiply the coefficient A 0,4 by the four original pixels B 1,4 , B 1,5 , B 1,6 , B 1,7 of the image, respectively, to obtain the product results of the coefficient A 0,4 ;
  • the accumulators of MAC4-MAC7 accumulate the product results to ACC4, ACC5, ACC6, ACC7.
  • the parallel processing of the other lines of the image is similar to the parallel processing of the second line described above. After 23 calculation cycles, four pixels in the first row of the filtered image are obtained, and ACC0, ACC1, ACC2, and ACC3 are cleared (clear0).
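The idea behind the parallel processing, namely that one group of pixels read from an image row feeds N adjacent output rows at once (each paired with a different filter row), can be sketched as a dataflow model and checked against a direct sliding-window computation. The model is an assumption-laden illustration: it is not cycle-accurate, it re-reads image rows for each pair of output rows instead of using the internal storage units, and all function names are hypothetical.

```python
def filter_direct(image, kernel, p_read=4):
    """Reference: plain 2-D sliding-window sum for the first p_read output columns."""
    f_h, f_w = len(kernel), len(kernel[0])
    n_o = len(image) - f_h + 1
    return [[sum(kernel[i][j] * image[o + i][k + j]
                 for i in range(f_h) for j in range(f_w))
             for k in range(p_read)]
            for o in range(n_o)]


def filter_row_streaming(image, kernel, p_read=4, n=2):
    """Same result, computed by sharing each pixel group among up to n output rows."""
    f_h, f_w = len(kernel), len(kernel[0])
    n_o = len(image) - f_h + 1
    out = [[0] * p_read for _ in range(n_o)]
    for first in range(0, n_o, n):
        group = range(first, min(first + n, n_o))    # output rows sharing the reads
        for r in range(first, group[-1] + f_h):      # image rows this group needs
            for j in range(f_w):                     # one pixel group per step
                window = image[r][j:j + p_read]
                for o in group:
                    i = r - o                        # filter row paired with image row r
                    if 0 <= i < f_h:
                        for k in range(p_read):
                            out[o][k] += kernel[i][j] * window[k]
    return out


# Arbitrary integer test data: 9 image rows, 8 columns, 5 x 5 filter.
image = [[(3 * r + c) % 7 for c in range(8)] for r in range(9)]
kernel = [[i - j for j in range(5)] for i in range(5)]
print(filter_row_streaming(image, kernel) == filter_direct(image, kernel))  # True
```

When image row 1 is processed for output rows 0 and 1, the model pairs it with filter rows 1 and 0 respectively, which mirrors the A 1,j and A 0,j coefficients used by the two MAC groups in the second-line example.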
  • the vector processing unit reads another P read original pixels from the internal storage unit, and the other P read original pixels are read from the image in the previous operation cycle and stored in the internal storage unit.
  • Each of the N coefficients at the end of the filter height direction is multiplied by P read original pixels read from the image to obtain the product result;
  • Each of the N coefficients located at the head of the filter height direction is respectively multiplied by another P read original pixels read from the internal storage unit to obtain the product result.
  • read the 4 original pixels of the sixth row of the image from the external storage unit, namely B 5,0 , B 5,1 , B 5,2 , B 5,3 ; B 5,0 , B 5,1 , B 5,2 , B 5,3 are respectively cached to B1, B2, B3, B4; read the 4 original pixels of the third row of the image from the internal storage units R0, R1, R2, R3, namely B 2,0 , B 2,1 , B 2,2 , B 2,3 .
  • the four original pixels B 2,0 , B 2,1 , B 2,2 , and B 2,3 are read from the image in cycle 9 and stored in the internal storage units R0, R1, R2, R3.
  • the multipliers of MAC0-MAC3 multiply the coefficient A 0,0 by the four original pixels B 2,0 , B 2,1 , B 2,2 , B 2,3 of the image, respectively, to obtain the product results of the coefficient A 0,0 ;
  • the accumulators of MAC0-MAC3 accumulate the product results to ACC0, ACC1, ACC2, ACC3;
  • the accumulators of MAC4-MAC7 accumulate the product results to ACC4, ACC5, ACC6, ACC7.
  • the four pixels C 2,0 -C 2,3 of the third row of the filtered image can be obtained; then, counting from cycle 28, after a further 25 operation cycles, that is, after a total of 53 operation cycles, the four pixels C 3,0 -C 3,3 of the fourth row of the filtered image can be obtained, thereby obtaining the pixels in columns 1 to 4 of the first four rows of the filtered image.
  • the filtering process of the pixels in the first to fourth columns of the filtered image is still not completed. At this time, it is necessary to continue processing the image to obtain the pixels of the remaining lines of the filtered image. Because the remaining lines of the filtered image cannot be processed in parallel, the following in-line processing is performed:
  • the multiplier multiplies the one coefficient by the P read original pixels, or multiplies each coefficient of the N coefficients by the P read original pixels to obtain the product result.
  • the filtering process for the pixels in the first to fourth columns of the filtered image has not yet been completed. At this time, it is necessary to continue processing the image to obtain the pixels in the fifth row of the filtered image. Because only one row of pixels remains, namely the fifth row of the filtered image, these pixels cannot be processed in parallel and are instead processed in-line.
  • the above-mentioned preprocessing process is performed once for each of the 5th to 9th rows of the image. That is, for each of the 5th to 9th rows of the image, read one or N coefficients in a row of the filter, and multiply the one or N coefficients with each original pixel of the row to obtain the product result.
  • the processing process specifically includes:
  • B 4,0 is buffered to B1 via MUX2, B 4,1 , B 4,2 , B 4,3 are cached to B2, B3, B4, and 0 is cached to B5 via MUX3;
  • the multipliers of MAC0-MAC3 respectively multiply the coefficient A 0,0 by the four original pixels B 4,0 , B 4,1 , B 4,2 , B 4,3 of the image to obtain the product results of the coefficient A 0,0 , which are respectively cached to ACC0, ACC1, ACC2, ACC3.
  • the multipliers of MAC0-MAC3 multiply the coefficient A 0,1 by the four original pixels B 4,1 , B 4,2 , B 4,3 , B 4,4 of the image, respectively, to obtain the product results of the coefficient A 0,1 ;
  • the multipliers of MAC4-MAC7 multiply the coefficient A 0,2 by the four original pixels B 4,2 , B 4,3 , B 4,4 , B 4,5 of the image, respectively, to obtain the product results of the coefficient A 0,2 ;
  • the accumulators of MAC0-MAC7 accumulate the two sets of product results to ACC0, ACC1, ACC2, ACC3.
  • B3 first passes its buffered original pixel B 4,3 to B1 via MUX2 and then caches the read original pixel B 4,5 ; B2 caches the read original pixel B 4,4 , B4 caches the read original pixel B 4,6 , and B5 caches the read original pixel B 4,7 ;
  • the multipliers of MAC0-MAC3 multiply the coefficient A 0,3 by the four original pixels B 4,3 , B 4,4 , B 4,5 , B 4,6 of the image, respectively, to obtain the product results of the coefficient A 0,3 ;
  • the multipliers of MAC4-MAC7 multiply the coefficient A 0,4 by the four original pixels B 4,4 , B 4,5 , B 4,6 , B 4,7 of the image, respectively, to obtain the product results of the coefficient A 0,4 ;
  • the accumulators of MAC0-MAC7 accumulate the two sets of product results to ACC0, ACC1, ACC2, ACC3.
  • the preprocessing of the fifth line of the image is completed, and the product results obtained in each of the foregoing calculation cycles are used to calculate the four pixels of the fifth line of the filtered image.
  • the above steps are repeated, and the above-mentioned preprocessing is performed on the 6th to 9th rows of the image and the second to 5th rows of the corresponding filter respectively.
  • the preprocessing of each row requires 3 operation cycles. After 15 operation cycles of in-line processing, for a total of 68 operation cycles, the pixels of the fifth row of the filtered image are obtained; the pixels in the first to fourth columns of all five rows of the filtered image are thus obtained, and the filtering process for the first to fourth columns of the filtered image is completed.
  • the image processing apparatus of this embodiment has been described above using an example in which the number of multipliers N calc is 8, the number of original pixels P read read in each operation cycle is 4, the number of filter coefficients N is 2, the width of the image M w is 20, and the width F w and the height F h of the filter are both 5.
  • the values of the above parameters are not limited to this.
  • the above parameters in this embodiment can also take other values.
  • the image processing device is similar to the above description, and those skilled in the art should fully understand the specific structure of the image processing device.
  • the calculation time of P read pixels in the first row of the filtered image is:
  • T pre is the calculation time of the above preprocessing;
  • F w and F h are the width and height of the filter, respectively;
  • cycle is a calculation cycle.
  • the preprocessing operation time T pre is: (1 + ceil((F w - 1) / N)) × cycle.
  • the operation time of P read pixels in each line of the last r lines of the filtered image is: T pre × F h .
  • N o is the height of the filtered image, N o = N h - F h + 1;
  • N h is the height of the image;
  • r is the remainder of N o divided by N, that is, the number of lines of the filtered image that need to be processed in-line;
  • T r is the operation time of the in-line processing: T r = r × T pre × F h .
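The timing expressions above can be evaluated for the worked example. This sketch only covers the formulas that appear in the text (T pre, N o, r and T r), measured in operation cycles. The function names are hypothetical, and the filtered-image height of 5 is taken from the five filtered rows of the example rather than computed from the stated image size.

```python
import math


def preprocessing_cycles(f_w, n):
    """T_pre in cycles: 1 + ceil((F_w - 1) / N)."""
    return 1 + math.ceil((f_w - 1) / n)


def filtered_height(n_h, f_h):
    """N_o = N_h - F_h + 1."""
    return n_h - f_h + 1


def in_line_cycles(n_o, n, f_w, f_h):
    """T_r in cycles: r * T_pre * F_h, with r the remainder of N_o divided by N."""
    r = n_o % n
    return r * preprocessing_cycles(f_w, n) * f_h


print(preprocessing_cycles(f_w=5, n=2))           # 3
print(in_line_cycles(n_o=5, n=2, f_w=5, f_h=5))   # 15
```

Both values agree with the example: 3 operation cycles per preprocessed row and 15 operation cycles of in-line processing for the fifth row of the filtered image.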
  • the original pixels in this embodiment are read in order, that is, from top to bottom in the image height direction, from the first row to the last row, and from left to right within each row.
  • the pixels of the filtered image are also output in order.
  • the P read original pixels of one line of the read image include:
  • the n-th group of original pixels comprises the P read original pixels from column n to column n + P read - 1, wherein 1 ≤ n ≤ F w .
  • the original pixels of each row are read from top to bottom in the height direction of the image, and in each row, 5 groups of original pixels are read sequentially.
  • the present disclosure is not limited to this.
  • the order in which the original pixels of the image are read is not restricted; they can be read in order, in reverse order, or with skipping. As long as all the original pixels of the image and all the filter coefficients are traversed and the M w × N h × F w × F h multiply-accumulate operations are completed, the filtering can be completed. In the case of reversed or skipping reads, only the output order of the pixels of the filtered image differs.
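That the read order does not affect the result follows from the fact that each filtered pixel is a plain sum of products. The sketch below illustrates this with integer data by accumulating the same product terms in sequential, reversed, and shuffled order. The helper names are hypothetical, and the output size here is the unpadded N h - F h + 1 by M w - F w + 1 window count, which is an assumption of the sketch.

```python
import random


def product_terms(image, kernel):
    """List every (output row, output column, filter row, filter column) term."""
    f_h, f_w = len(kernel), len(kernel[0])
    n_o = len(image) - f_h + 1
    m_o = len(image[0]) - f_w + 1
    return [(o, k, i, j) for o in range(n_o) for k in range(m_o)
            for i in range(f_h) for j in range(f_w)]


def accumulate(image, kernel, terms):
    """Multiply-accumulate the given terms in the order supplied."""
    n_o = len(image) - len(kernel) + 1
    m_o = len(image[0]) - len(kernel[0]) + 1
    out = [[0] * m_o for _ in range(n_o)]
    for o, k, i, j in terms:
        out[o][k] += kernel[i][j] * image[o + i][k + j]
    return out


image = [[(5 * r + 2 * c) % 11 for c in range(6)] for r in range(6)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

terms = product_terms(image, kernel)
shuffled = terms[:]
random.Random(0).shuffle(shuffled)   # fixed seed, arbitrary "skipping" order

sequential = accumulate(image, kernel, terms)
print(sequential == accumulate(image, kernel, terms[::-1]))  # True
print(sequential == accumulate(image, kernel, shuffled))     # True
```

Only the moment at which each output pixel becomes complete changes with the order, which corresponds to the different output order mentioned in the text.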
  • P read original pixels of one line of the image are read from the storage unit each time.
  • after P read original pixels of one line of the image are read from the external storage unit in each operation cycle, the P read original pixels can also be stored in the internal storage unit of the vector processing unit. In this way, in a subsequent operation cycle, if some of the P read original pixels to be read are already stored in the internal storage unit, that part can be read directly from the internal storage unit, and only the remaining original pixels, which are not stored in the internal storage unit, need to be read from the external storage unit.
  • the P read original pixels of one line of the read image include:
  • reading the remaining part of the P read original pixels from the external storage unit, to obtain the P read original pixels.
  • the multiplication and accumulation operations are performed every operation cycle, that is, the filter coefficients are multiplied by the original pixels, and the multiplication result is accumulated to the accumulation result of the previous operation cycle.
  • the present disclosure is not limited to this, and the multiplication operation may be performed first to obtain all the product results used to calculate the pixel points of the filtered image and then the accumulation is performed.
  • a mobile device including: the image processing apparatus described in the previous embodiment.
  • the mobile device is at least one of a portable mobile terminal, a drone, a handheld PTZ, and a remote controller.

Abstract

An image processing method and apparatus, and a mobile device. The image processing method is applied to a vector processing unit, the vector processing unit comprising a multiplier, and the method comprising: reading Pread original pixels of an image, the value of Pread being determined according to an access bit width corresponding to the vector processing unit; reading N coefficients of a filter, the value of N being determined according to the number of multipliers of the vector processing unit, and the filter being used for filtering the image; and by means of a multiplier, multiplying each coefficient of the N coefficients with each Pread original pixel to obtain multiple product results, the product results being used for calculating pixel values of pixel points in the filtered image.

Description

Image processing method, apparatus, and mobile device
Technical field
The present disclosure relates to the field of image processing, and in particular to an image processing method, an image processing apparatus, and a mobile device.
Background
Filtering is widely used in the field of image processing. When an image processing apparatus executes a filtering algorithm, it first reads the original pixels of the image from an off-chip memory and then uses arithmetic units to filter the original pixels.
In the prior art, in each operation cycle, the number of arithmetic units that process the original pixels usually equals the number of original pixels read. As a result, when the number of arithmetic units is several times the number of original pixels read each time, some arithmetic units sit idle during filtering and are not fully utilized.
For example, if an image processing apparatus has eight arithmetic units but can read only four original pixels per operation cycle, only four arithmetic units take part in the operation and the remaining four cannot be used, which limits the overall performance of image filtering and reduces image processing efficiency.
Summary
The present disclosure provides an image processing method. The method is applied to a vector processing unit that includes a multiplier, and the method includes: reading P read original pixels of an image, wherein the value of P read is determined according to the memory access bit width corresponding to the vector processing unit; reading N coefficients of a filter, wherein the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image; and multiplying, by the multiplier, each of the N coefficients by each of the P read original pixels to obtain multiple product results, the product results being used to calculate the pixel values of the pixels in the filtered image.
The present disclosure also provides an image processing apparatus, including: an external storage unit that stores an image and a filter; and a vector processing unit that includes a multiplier. The vector processing unit is used to read P read original pixels of the image, wherein the value of P read is determined according to the memory access bit width corresponding to the vector processing unit, and to read N coefficients of the filter, wherein the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image. The multiplier is used to multiply each of the N coefficients by each of the P read original pixels to obtain multiple product results, the product results being used to calculate the pixel values of the pixels in the filtered image.
The present disclosure also provides a mobile device, which includes the above image processing apparatus.
The present disclosure reads N coefficients of the filter in each operation cycle, and the number N of coefficients read is determined according to the number of multipliers of the vector processing unit; the N coefficients are each multiplied by each of the original pixels to obtain the product results. Compared with the prior art, more multipliers take part in the filtering operation and the computing resources are fully utilized, which effectively improves the overall performance of image filtering and the image processing efficiency.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the present disclosure and constitute a part of the specification. Together with the following detailed description, they serve to explain the present disclosure but do not limit it. In the drawings:
FIG. 1 is a flowchart of an image processing method according to an embodiment of the disclosure.
(a), (b), and (c) of FIG. 2 are schematic diagrams of the operations in the 1st to 3rd operation cycles, respectively.
(a), (b), (c), (d), and (e) of FIG. 3 are schematic diagrams of the operations in the 4th to 8th operation cycles, respectively.
(a), (b), (c), (d), and (e) of FIG. 4 are schematic diagrams of the operations in the 9th to 18th operation cycles, respectively.
(a) and (b) of FIG. 5 are schematic diagrams of the operations in the 19th and 23rd operation cycles, respectively.
(a) and (b) of FIG. 6 are schematic diagrams of the operations in the 24th and 28th operation cycles, respectively.
FIG. 7 is a schematic diagram of the process of an image processing method according to an embodiment of the disclosure.
(a), (b), and (c) of FIG. 8 are schematic diagrams of the operations in the 54th to 56th operation cycles, respectively.
FIG. 9 is a schematic diagram of the filtered image of the image processing method according to an embodiment of the disclosure.
FIG. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the disclosure.
FIG. 11 is a schematic structural diagram of the arithmetic unit of the image processing apparatus according to an embodiment of the disclosure when performing preprocessing.
FIG. 12 is a schematic structural diagram of the arithmetic unit of the image processing apparatus according to an embodiment of the disclosure when performing parallel processing.
具体实施方式detailed description
下面将结合实施例和实施例中的附图,对本公开技术方案进行清楚、完整的描述。显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。基于本公开中的实施例,本领域普通技术人员在没有做出 创造性劳动前提下所获得的所有其他实施例,都属于本公开保护的范围。The technical solutions of the present disclosure will be clearly and completely described below in conjunction with the embodiments and the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, rather than all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present disclosure.
An embodiment of the present disclosure provides an image processing method. The image processing method is applied to a vector processing unit that includes multipliers. As shown in FIG. 1, the method includes:
Step S101: reading P_read original pixels of an image, where the value of P_read is determined according to the memory access bit width of the vector processing unit;
Step S102: reading N coefficients of a filter, where the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used to filter the image;
Step S103: multiplying, by the multipliers, each of the N coefficients by the P_read original pixels to obtain a plurality of product results, where the product results are used to calculate the pixel values of pixels in the filtered image.
Once the product results have been obtained, the product results that belong to the same pixel of the filtered image are accumulated to give the pixel value of that pixel, and the entire filtered image is obtained in this way.
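The accumulation described above can be written as a direct reference computation. The following is a minimal sketch rather than the patented schedule: it assumes the indexing used in the cycle-by-cycle example of this embodiment, in which coefficient A_{r,k} multiplies pixel B_{i+r,j+k}, and the function name filter_direct is hypothetical.

```python
def filter_direct(image, kernel):
    """Reference 2D filtering: sum every product that belongs to one output pixel."""
    f_h, f_w = len(kernel), len(kernel[0])
    out_h = len(image) - f_h + 1
    out_w = len(image[0]) - f_w + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for r in range(f_h):
                for k in range(f_w):
                    # product of coefficient A[r][k] and pixel B[i+r][j+k]
                    out[i][j] += kernel[r][k] * image[i + r][j + k]
    return out


result = filter_direct([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
```

The vector processing unit computes the same sums, but reorders the multiplications so that each operation cycle keeps all multipliers busy.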
The image processing method of this embodiment is executed by an image processing apparatus that includes an external storage unit and a processor. The processor may be any type of chip with vector processing capability, such as a CPU, a DSP, or an FPGA.
Taking a DSP as an example, the DSP contains a vector processing unit that can perform filtering and other processing on an image. From the point of view of the vector processing unit, the external storage unit is its off-chip memory. The external storage unit stores the image to be processed and the filter. The vector processing unit filters the image with the filter, and the resulting filtered image can also be stored in the external storage unit.
The vector processing unit of the DSP includes a plurality of multiply-accumulators (MAC, Multiply and Accumulate). Each MAC includes one of the multipliers and one adder, which perform the multiplication and accumulation operations of the filtering.
In step S101, P_read original pixels of the image are read, where the value of P_read is determined according to the memory access bit width of the vector processing unit.
The entire filtering process takes multiple operation cycles to complete. In each operation cycle, the vector processing unit reads P_read original pixels of the image. To improve the efficiency of the image processing method, as many original pixels as possible should be read in each operation cycle. That is, the number of original pixels read, P_read, equals the maximum number of original pixels that the vector processing unit can read in one operation cycle.
When the vector processing unit reads the P_read original pixels from the storage unit, the maximum number that can be read depends on the memory access bit width of the data bus and the bit width of the original pixels. The bit width is the number of bits used to represent each original pixel; for example, it may be 8 bits, 16 bits, or 32 bits. The memory access bit width is the number of bit lines of the data bus, that is, the number of bits the data bus can transfer at a time, and is generally a multiple of 8; for example, it may be 8, 16, 32, or 64 bits. When the original pixel bit width is 16 bits and the memory access bit width is 64 bits, the vector processing unit can read 64/16 = 4 original pixels from the storage unit in each operation cycle. That is, the maximum number in this embodiment equals the quotient of the memory access bit width and the original pixel bit width.
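As a small worked illustration of this quotient, the sketch below computes P_read from the two bit widths. The function name pixels_per_read is hypothetical and is not part of the disclosure.

```python
def pixels_per_read(access_bits, pixel_bits):
    """Maximum number of original pixels fetched per operation cycle."""
    # quotient of the memory access bit width and the pixel bit width
    return access_bits // pixel_bits


p_read = pixels_per_read(64, 16)  # the 64-bit bus, 16-bit pixel case from the text
```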
After the original pixels have been read, step S102 reads N coefficients of the filter, where the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used to filter the image.
In each operation cycle, the vector processing unit reads the N filter coefficients that correspond to the P_read original pixels. To improve the overall performance of image filtering and the efficiency of image processing, in this embodiment the number N of coefficients read each time is determined according to the number of multipliers. Specifically, if the vector processing unit includes N_calc multipliers, the number of coefficients is
N = N_calc / P_read.
The number of multipliers is usually an integer multiple of the number of original pixels read per operation cycle, and that multiple is the number of filter coefficients read from the external storage unit. The number of multipliers may be 8, 16, 32, and so on. When the vector processing unit reads 4 original pixels per operation cycle and there are 8 multipliers, N = 8/4 = 2, that is, 2 filter coefficients are read per operation cycle.
If the number of multipliers is not an integer multiple of the number of original pixels read per operation cycle, N may be N_calc / P_read rounded up. For example, with 10 multipliers and 4 original pixels read per operation cycle, 10/4 = 2.5, so 3 coefficients are read. Any 2 of the coefficients are each multiplied by every one of the 4 original pixels, which uses 8 multipliers; the remaining coefficient is multiplied by any two of the 4 original pixels, which uses the remaining 2 multipliers. All 10 multipliers are thus used, and more multiplications are performed in one operation cycle.
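The choice of N and the distribution of multipliers among the coefficients can be sketched as follows. This is an illustrative sketch only; the function names coeffs_per_cycle and assign_multipliers are hypothetical, and the assignment simply gives each coefficient up to P_read multipliers until none remain.

```python
def coeffs_per_cycle(n_calc, p_read):
    """N = N_calc / P_read, rounded up when the division is not exact."""
    return (n_calc + p_read - 1) // p_read


def assign_multipliers(n_calc, p_read):
    """For each coefficient, the number of original pixels it is multiplied by."""
    counts = []
    remaining = n_calc
    for _ in range(coeffs_per_cycle(n_calc, p_read)):
        used = min(p_read, remaining)  # a coefficient needs at most p_read multipliers
        counts.append(used)
        remaining -= used
    return counts


even_case = assign_multipliers(8, 4)    # 8 multipliers, 4 pixels per cycle
uneven_case = assign_multipliers(10, 4)  # 10 multipliers, 4 pixels per cycle
```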
After the original pixels and the filter coefficients have been read, step S103 multiplies, by the multipliers, each of the N coefficients by the P_read original pixels to obtain a plurality of product results, which are used to calculate the pixel values of pixels in the filtered image. The image processing method of this embodiment includes preprocessing and parallel processing.
Preprocessing is performed when the vector processing unit reads P_read original pixels from the first row of the image. In preprocessing, the vector processing unit reads P_read original pixels from the first row of the image and either one coefficient or N coefficients from the first row of the filter, and multiplies the one coefficient by the P_read original pixels, or multiplies each of the N coefficients by the P_read original pixels, to obtain product results.
Parallel processing is performed from the second row of the image onward, that is, when the vector processing unit reads P_read original pixels from a row after the first row. In parallel processing, the vector processing unit reads P_read original pixels from one row of the image and N coefficients of the filter, where the N coefficients lie in the same column of N adjacent rows of the filter. The N_calc multipliers multiply each of the N coefficients by the P_read original pixels to obtain product results.
The above processing is described below with reference to the drawings, using an example with 8 multipliers in which 4 original pixels and 2 filter coefficients are read per operation cycle.
Assume that the original size of the image is 16×4. So that the filtered image has the same size as the original image, this embodiment may pad the image. The padded image has a size of 20×8, that is, a width M_w of 20 and a height N_h of 8; compared with the original image, it is extended by several elements beyond the original image boundary. The padding may simply copy the elements adjacent to the boundary into the area outside the original image boundary, or it may use preset padding elements; this is only an example and does not limit the padding method of the present disclosure. The filter has a size of 5×5, that is, its width F_w and its height F_h are both 5. The filter performs the filtering operation on the padded image. The filtering operation is described in detail below, taking two-dimensional convolution as an example.
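The boundary-copy padding mentioned above can be sketched as follows. This is one possible reading of "copying the elements adjacent to the boundary", assuming a padding of (F - 1) / 2 = 2 elements on each side for the 5×5 filter; the function name pad_replicate is hypothetical.

```python
def pad_replicate(image, pad_h, pad_w):
    """Extend an image (list of rows) by repeating its border elements."""
    padded = []
    for row in image:
        # repeat the first and last pixel of every row pad_w times
        padded.append([row[0]] * pad_w + list(row) + [row[-1]] * pad_w)
    top = [list(padded[0]) for _ in range(pad_h)]      # copies of the first row
    bottom = [list(padded[-1]) for _ in range(pad_h)]  # copies of the last row
    return top + padded + bottom


image = [[r * 16 + c for c in range(16)] for r in range(4)]  # 16 wide, 4 high
padded = pad_replicate(image, 2, 2)                            # 20 wide, 8 high
```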
In two-dimensional convolution, the original pixels of some rows of the image are used only once when the filtering results are calculated; they are not reused in the product operations performed for the original pixels of other rows. The original pixels of these rows can therefore be processed within the row by preprocessing.
In the above example, the first row of the image is such a row. When the filtering results are calculated, the original pixels of the first row are used only once and are not reused in the product operations performed for the original pixels of later rows. The original pixels of the first row are therefore processed by preprocessing. The preprocessing of the first row of the image is described below with reference to FIG. 2 and FIG. 7. The preprocessing includes:
As shown in FIG. 2(a), in the first operation cycle (cycle 1):
4 original pixels of the first row of the image are read, namely B_{0,0}, B_{0,1}, B_{0,2}, B_{0,3};
one coefficient A_{0,0} of the first row of the filter is read;
4 multipliers multiply the coefficient A_{0,0} by each of the 4 original pixels B_{0,0}, B_{0,1}, B_{0,2}, B_{0,3} to obtain the product results of A_{0,0}.
As shown in FIG. 2(b), in the second operation cycle (cycle 2):
4 original pixels of the first row of the image are read, namely B_{0,2}, B_{0,3}, B_{0,4}, B_{0,5};
2 coefficients A_{0,1}, A_{0,2} of the first row of the filter are read;
4 multipliers multiply the coefficient A_{0,1} by each of the 4 original pixels B_{0,1}, B_{0,2}, B_{0,3}, B_{0,4} to obtain the product results of A_{0,1}; the other 4 multipliers multiply the coefficient A_{0,2} by each of the 4 original pixels B_{0,2}, B_{0,3}, B_{0,4}, B_{0,5} to obtain the product results of A_{0,2}; both sets of product results are accumulated onto the product results of cycle 1.
As shown in FIG. 2(c), in the third operation cycle (cycle 3):
4 original pixels of the first row of the image are read, namely B_{0,4}, B_{0,5}, B_{0,6}, B_{0,7};
2 coefficients A_{0,3}, A_{0,4} of the first row of the filter are read;
4 multipliers multiply the coefficient A_{0,3} by each of the 4 original pixels B_{0,3}, B_{0,4}, B_{0,5}, B_{0,6} to obtain the product results of A_{0,3}; the other 4 multipliers multiply the coefficient A_{0,4} by each of the 4 original pixels B_{0,4}, B_{0,5}, B_{0,6}, B_{0,7} to obtain the product results of A_{0,4}; both sets of product results are accumulated onto the accumulated result of cycle 2.
Preprocessing is completed after 3 operation cycles, and the product results obtained by preprocessing are used to calculate the four pixels of the first row of the filtered image.
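The three preprocessing cycles can be mimicked in software as follows. This is only a sketch of the arithmetic under the example's assumptions (4 pixels per cycle, one coefficient in cycle 1 and 2 coefficients in each later cycle); it does not model the memory reads, and the function name preprocess_first_row is hypothetical.

```python
def preprocess_first_row(image_row, kernel_row, p_read=4, n=2):
    """Accumulate the first filter row against the first image row, cycle by cycle."""
    acc = [0] * p_read
    groups = [[0]]  # cycle 1 uses only the first coefficient
    for start in range(1, len(kernel_row), n):
        groups.append(list(range(start, min(start + n, len(kernel_row)))))
    for group in groups:             # one group of coefficients per operation cycle
        for k in group:              # each coefficient drives p_read multipliers
            for j in range(p_read):
                acc[j] += kernel_row[k] * image_row[j + k]
    return acc, len(groups)


partial, cycles = preprocess_first_row(list(range(8)), [1, 1, 1, 1, 1])
```

With a 5-tap filter row the groups are [A_{0,0}], [A_{0,1}, A_{0,2}] and [A_{0,3}, A_{0,4}], which matches the three cycles of FIG. 2.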
The preprocessing of the first row has been described above. By analogy, the above steps can be repeated from the second row of the image onward, performing in-row preprocessing on the other rows of the image as well, to obtain the product results used to calculate the pixels of each row of the filtered image. Accumulating the product results that belong to the same pixel of the filtered image gives the pixel value of that pixel. Performing two-dimensional filtering on the image by in-row preprocessing improves the computational efficiency of the two-dimensional filtering, the overall performance of image filtering, and the efficiency of image processing.
As mentioned above, in two-dimensional convolution the original pixels of some rows of the image are used only once and are not reused. For other rows of the image, however, the original pixels are used several times when the filtering results are calculated, and they can be reused in the product operations performed for the original pixels of other rows. The original pixels of these rows can therefore be processed within the row by parallel processing.
In the above example, every row from the second row of the image onward is such a row. This embodiment can perform parallel processing on the rows after the first row, which further improves the computational efficiency of the two-dimensional filtering and the overall performance of image filtering. The parallel processing is described below with reference to FIG. 3 and FIG. 7. The parallel processing includes:
As shown in FIG. 3(a), in the fourth operation cycle (cycle 4):
4 original pixels of the second row of the image are read, namely B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3};
the 2 coefficients A_{0,0}, A_{1,0} in the first column of the two adjacent rows formed by the first and second rows of the filter are read;
8 multipliers multiply the coefficients A_{0,0} and A_{1,0} each by the 4 original pixels B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3} to obtain the product results of A_{0,0} and A_{1,0}. The 4 product results of A_{1,0} are used to calculate the four pixels of the first row of the filtered image and are accumulated onto the accumulated result of cycle 3; the 4 product results of A_{0,0} are used to calculate the four pixels of the second row of the filtered image.
As shown in FIG. 3(b), in the fifth operation cycle (cycle 5):
4 original pixels of the second row of the image are read, namely B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4};
the 2 coefficients A_{0,1}, A_{1,1} in the second column of the two adjacent rows formed by the first and second rows of the filter are read;
8 multipliers multiply the coefficients A_{0,1} and A_{1,1} each by the 4 original pixels B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4} to obtain the product results of A_{0,1} and A_{1,1}. The 4 product results of A_{1,1} are used to calculate the four pixels of the first row of the filtered image and are accumulated onto the accumulated result of cycle 4; the 4 product results of A_{0,1} are used to calculate the four pixels of the second row of the filtered image and are accumulated onto the product results of cycle 4.
As shown in FIG. 3(c), in the sixth operation cycle (cycle 6):
4 original pixels of the second row of the image are read, namely B_{1,2}, B_{1,3}, B_{1,4}, B_{1,5};
the 2 coefficients A_{0,2}, A_{1,2} in the third column of the two adjacent rows formed by the first and second rows of the filter are read;
8 multipliers multiply the coefficients A_{0,2} and A_{1,2} each by the 4 original pixels B_{1,2}, B_{1,3}, B_{1,4}, B_{1,5} to obtain the product results of A_{0,2} and A_{1,2}. The 4 product results of A_{1,2} are used to calculate the four pixels of the first row of the filtered image and are accumulated onto the accumulated result of cycle 5; the 4 product results of A_{0,2} are used to calculate the four pixels of the second row of the filtered image and are accumulated onto the accumulated result of cycle 5.
As shown in FIG. 3(d), in the seventh operation cycle (cycle 7):
4 original pixels of the second row of the image are read, namely B_{1,3}, B_{1,4}, B_{1,5}, B_{1,6};
the 2 coefficients A_{0,3}, A_{1,3} in the fourth column of the two adjacent rows formed by the first and second rows of the filter are read;
8 multipliers multiply the coefficients A_{0,3} and A_{1,3} each by the 4 original pixels B_{1,3}, B_{1,4}, B_{1,5}, B_{1,6} to obtain the product results of A_{0,3} and A_{1,3}. The 4 product results of A_{1,3} are used to calculate the four pixels of the first row of the filtered image and are accumulated onto the accumulated result of cycle 6; the 4 product results of A_{0,3} are used to calculate the four pixels of the second row of the filtered image and are accumulated onto the accumulated result of cycle 6.
As shown in FIG. 3(e), in the eighth operation cycle (cycle 8):
4 original pixels of the second row of the image are read, namely B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7};
the 2 coefficients A_{0,4}, A_{1,4} in the fifth column of the two adjacent rows formed by the first and second rows of the filter are read;
8 multipliers multiply the coefficients A_{0,4} and A_{1,4} each by the 4 original pixels B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7} to obtain the product results of A_{0,4} and A_{1,4}. The 4 product results of A_{1,4} are used to calculate the four pixels of the first row of the filtered image and are accumulated onto the accumulated result of cycle 7; the 4 product results of A_{0,4} are used to calculate the four pixels of the second row of the filtered image and are accumulated onto the accumulated result of cycle 7.
After these 5 operation cycles, the parallel processing of the 8 original pixels of the second row of the image is complete. In each of the 5 operation cycles, the 8 multipliers multiply 2 filter coefficients by 4 original pixels in parallel, so that the product results used to calculate the four pixels of the first row and of the second row of the filtered image are obtained in parallel.
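Cycles 4 to 8 can be sketched as a loop over the filter columns, with two accumulators updated in every cycle. The sketch assumes the example's parameters (4 pixels per cycle, 2 adjacent filter rows); the function name parallel_cycles and its argument names are hypothetical.

```python
def parallel_cycles(pixels, coeff_row_a, coeff_row_b, acc_a, acc_b, p_read=4):
    """Multiply one image row by two adjacent filter rows, one filter column per cycle."""
    cycles = 0
    for k in range(len(coeff_row_a)):
        window = pixels[k:k + p_read]  # the p_read pixels used in this cycle
        for j in range(p_read):        # 2 * p_read multiplications per cycle
            acc_a[j] += coeff_row_a[k] * window[j]
            acc_b[j] += coeff_row_b[k] * window[j]
        cycles += 1
    return cycles


b1 = list(range(8))          # second image row, B_{1,0} to B_{1,7}
a0, a1 = [1] * 5, [2] * 5    # first and second filter rows
acc_row0 = [100] * 4         # stands in for the accumulated result of cycle 3
acc_row1 = [0] * 4           # second output row starts empty
used = parallel_cycles(b1, a0, a1, acc_row1, acc_row0)
```

Here the products of the first filter row go to the second output row and the products of the second filter row go to the first output row, as in the description of cycle 4.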
The parallel processing of the third row of the image is similar to that of the second row described above. The difference is that 4 original pixels of the third row of the image are read, together with the coefficients in the same column of the two adjacent rows formed by the second and third rows of the filter; see FIG. 4(a) to FIG. 4(e) and FIG. 7. The details are not repeated here. By analogy, when the original pixels of the fifth row of the image are read, the parallel processing is as follows:
As shown in FIG. 5(a) and FIG. 7, in the 19th operation cycle (cycle 19):
4 original pixels of the fifth row of the image are read, namely B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3};
the 2 coefficients A_{3,0}, A_{4,0} in the first column of the two adjacent rows formed by the fourth and fifth rows of the filter are read;
8 multipliers multiply the coefficients A_{3,0} and A_{4,0} each by the 4 original pixels B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3} to obtain the product results of A_{3,0} and A_{4,0}. The 4 product results of A_{4,0} are used to calculate the four pixels of the first row of the filtered image and are accumulated onto the accumulated result of cycle 18; the 4 product results of A_{3,0} are used to calculate the four pixels of the second row of the filtered image and are accumulated onto the accumulated result of cycle 18.
Cycles 20, 21 and 22 proceed in the same way as cycle 19. As shown in FIG. 5(b) and FIG. 7, in the 23rd operation cycle (cycle 23):
4 original pixels of the fifth row of the image are read, namely B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7};
the 2 coefficients A_{3,4}, A_{4,4} in the fifth column of the two adjacent rows formed by the fourth and fifth rows of the filter are read;
8 multipliers multiply the coefficients A_{3,4} and A_{4,4} each by the 4 original pixels B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7} to obtain the product results of A_{3,4} and A_{4,4}. The 4 product results of A_{4,4} are used to calculate the four pixels of the first row of the filtered image and are accumulated onto the accumulated result of cycle 22; the accumulated result of cycle 23 is the four pixels C_{0,0} to C_{0,3} of the first row of the filtered image. The 4 product results of A_{3,4} are used to calculate the four pixels of the second row of the filtered image and are accumulated onto the accumulated result of cycle 22.
Thus, after 23 operation cycles, the four pixels of the first row of the filtered image are obtained.
Filtering then continues. Because cycle 23 read the coefficients of the fourth and fifth rows of the filter, in the next operation cycle the two adjacent filter rows wrap around from the tail to the head of the filter in the height direction. In this case, the original pixels of the next image row that is read cannot be shared by both coefficients; the two filter coefficients operate on original pixels of different rows, and the parallel processing proceeds as follows:
Another P_read original pixels are read from the internal storage unit of the vector processing unit. These other P_read original pixels were read from the image in an earlier operation cycle and stored in the internal storage unit.
Each of the N coefficients that lies at the tail of the filter in the height direction is multiplied by the P_read original pixels of the row read from the image to obtain the product results;
each of the N coefficients that lies at the head of the filter in the height direction is multiplied by the other P_read original pixels read from the internal storage unit to obtain the product results.
The above process is described below, taking the 24th to 28th operation cycles as an example.
As shown in FIG. 6(a) and FIG. 7, in the 24th operation cycle (cycle 24):
4 original pixels of the sixth row of the image are read from the external storage unit, namely B_{5,0}, B_{5,1}, B_{5,2}, B_{5,3}, and 4 original pixels of the third row of the image are read from the internal storage unit, namely B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3}. The 4 original pixels B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3} were read from the image in cycle 9 and stored in the internal storage unit.
The 2 coefficients A_{4,0}, A_{0,0} in the first column of the two adjacent rows formed by the fifth and first rows of the filter are read. 4 multipliers multiply the filter head coefficient A_{0,0} by each of the 4 original pixels B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3} to obtain the product results of A_{0,0}; these 4 product results are used to calculate the four pixels of the third row of the filtered image. The other 4 multipliers multiply the filter tail coefficient A_{4,0} by each of the 4 original pixels B_{5,0}, B_{5,1}, B_{5,2}, B_{5,3} to obtain the product results of A_{4,0}; these 4 product results are used to calculate the four pixels of the second row of the filtered image.
Cycles 25, 26 and 27 proceed in the same way as cycle 24. As shown in FIG. 6(b) and FIG. 7, in the 28th operation cycle (cycle 28):
4 original pixels of the sixth row of the image are read from the external storage unit, namely B_{5,4}, B_{5,5}, B_{5,6}, B_{5,7}, and 4 original pixels of the third row of the image are read from the internal storage unit, namely B_{2,4}, B_{2,5}, B_{2,6}, B_{2,7}. The 4 original pixels B_{2,4}, B_{2,5}, B_{2,6}, B_{2,7} were read from the image in cycle 13 and stored in the internal storage unit.
The 2 coefficients A_{4,4}, A_{0,4} in the fifth column of the two adjacent rows formed by the fifth and first rows of the filter are read. 4 multipliers multiply the filter head coefficient A_{0,4} by each of the 4 original pixels B_{2,4}, B_{2,5}, B_{2,6}, B_{2,7} to obtain the product results of A_{0,4}; these 4 product results are used to calculate the four pixels of the third row of the filtered image and are accumulated onto the accumulated result of cycle 27. The other 4 multipliers multiply the filter tail coefficient A_{4,4} by each of the 4 original pixels B_{5,4}, B_{5,5}, B_{5,6}, B_{5,7} to obtain the product results of A_{4,4}; these 4 product results are used to calculate the four pixels of the second row of the filtered image and are accumulated onto the accumulated result of cycle 27. The accumulated result of cycle 28 is the four pixels C_{1,0} to C_{1,3} of the second row of the filtered image.
Thus, after 28 operation cycles, the four pixels C_{0,0} to C_{0,3} of the first row and the four pixels C_{1,0} to C_{1,3} of the second row of the filtered image are obtained.
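The wrap-around case of cycles 24 to 28 differs from the earlier parallel cycles only in that the two coefficients operate on two different pixel rows. A minimal sketch under that reading follows; the function name wraparound_cycles is hypothetical, and the internal storage unit is represented simply by a second list argument.

```python
def wraparound_cycles(external_row, buffered_row, tail_coeffs, head_coeffs,
                      acc_tail, acc_head, p_read=4):
    """Tail filter row times the newly read row; head filter row times the buffered row."""
    for k in range(len(tail_coeffs)):  # one filter column per operation cycle
        for j in range(p_read):
            # e.g. A_{4,k} with the sixth image row read from external storage
            acc_tail[j] += tail_coeffs[k] * external_row[k + j]
            # e.g. A_{0,k} with the third image row held in internal storage
            acc_head[j] += head_coeffs[k] * buffered_row[k + j]
    return len(tail_coeffs)


row2_acc = [0] * 4  # partial sums of the second output row
row3_acc = [0] * 4  # partial sums of the third output row
n_cycles = wraparound_cycles(list(range(8)), [1] * 8, [1] * 5, [2] * 5,
                             row2_acc, row3_acc)
```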
The above parallel processing is repeated. Starting from cycle 28, after another 24 operation cycles the four pixels C_{2,0} to C_{2,3} of the third row of the filtered image are obtained; after 1 further operation cycle, that is, 25 operation cycles after cycle 28 and 53 operation cycles in total, the four pixels C_{3,0} to C_{3,3} of the fourth row of the filtered image are obtained. The pixels in the first to fourth columns of all four rows of the filtered image are thereby obtained.
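The walkthrough above implies a simple routing rule: the products of filter row r with image row i contribute to row i - r of the filtered image. The sketch below applies that rule row by row for the first four output columns. It is an illustration of the accumulation order only, it does not model operation cycles or the internal storage unit, and the function name filter_by_rows is hypothetical.

```python
def filter_by_rows(image, kernel, p_read=4):
    """Accumulate products row by row, routing filter row r of image row i to output row i - r."""
    f_h, f_w = len(kernel), len(kernel[0])
    out_h = len(image) - f_h + 1
    acc = [[0] * p_read for _ in range(out_h)]
    for i, row in enumerate(image):
        for r in range(f_h):
            o = i - r                  # output row that receives these products
            if 0 <= o < out_h:
                for k in range(f_w):
                    for j in range(p_read):
                        acc[o][j] += kernel[r][k] * row[j + k]
    return acc


image = [[(3 * i + 5 * j) % 7 for j in range(20)] for i in range(8)]  # 20 x 8 padded image
kernel = [[(r + 2 * k) % 3 for k in range(5)] for r in range(5)]      # 5 x 5 filter
columns_1_to_4 = filter_by_rows(image, kernel)
```

For the 20×8 image and 5×5 filter of the example this yields four output rows of four pixels each, corresponding to C_{0,0} through C_{3,3}.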
In the above description, the height N_h of the image is 8 and the height F_h of the filter is 5, so the height N_o of the filtered image is 4. The number N of filter coefficients read in each operation cycle is 2, and the height N_o of the filtered image is divisible by N. In this case, after the above 53 operation cycles, the filtering of the pixels in columns 1 to 4 of the filtered image is complete. If the height N_o of the filtered image is not divisible by N, the filtering of the pixels in columns 1 to 4 of the filtered image is not yet complete, and the image must be processed further to obtain the pixels of the remaining rows of the filtered image. Because these remaining rows cannot be processed in parallel, the following in-line processing is performed:
Read a row of P_read original pixels of the image that are used to calculate the last r rows of pixels of the filtered image, where r is the remainder of the height N_o of the filtered image divided by N;
Read one coefficient, or the N coefficients, in a row of the filter;
By means of the multipliers, multiply the one coefficient by the P_read original pixels, or multiply each of the N coefficients by the P_read original pixels, to obtain product results.
For example, assume that the height N_h of the image is 9 and the filter height is still 5, so that the height N_o of the filtered image is 5. N_o is then not divisible by N, and the filtering of the pixels in columns 1 to 4 of the filtered image is not yet complete after the parallel processing. The image must be processed further to obtain the pixels of the fifth row of the filtered image. Because only one row of pixels remains, namely the fifth row of the filtered image, these pixels cannot be processed in parallel either and are instead obtained by in-line processing.
In the in-line processing, the preprocessing described above is performed once for each of rows 5 to 9 of the image. That is, for each of rows 5 to 9 of the image, one or N coefficients in a row of the filter are read, and the one or N coefficients are multiplied by each original pixel of that image row to obtain product results. Taking row 5 of the image as an example, the in-line processing specifically includes:
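The row split in this example can be checked with a short sketch. It is only an illustration under stated assumptions: the function name `split_output_rows` and its parameter names are hypothetical and do not come from the disclosure, and r is taken as the remainder of N_o divided by N, with N_o = N_h - F_h + 1.

```python
def split_output_rows(image_height, filter_height, n_coeffs):
    """Return (N_o, rows handled by parallel processing, r rows left for in-line processing).

    Hypothetical helper: N_o = N_h - F_h + 1 and r = N_o mod N.
    """
    out_height = image_height - filter_height + 1  # N_o
    r = out_height % n_coeffs                      # rows that cannot be paired up
    return out_height, out_height - r, r


# N_h = 8: all four output rows are covered by parallel processing.
print(split_output_rows(8, 5, 2))
# N_h = 9: the fifth output row is left for in-line processing.
print(split_output_rows(9, 5, 2))
```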
As shown in FIG. 8(a), in the 54th operation cycle (cycle 54):
Read four original pixels of the fifth row of the image, namely B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3};
Read one coefficient A_{0,0} of the first row of the filter;
Four multipliers multiply the coefficient A_{0,0} by the four original pixels B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3} of the image to obtain the product results for A_{0,0}.
As shown in FIG. 8(b), in the 55th operation cycle (cycle 55):
Read four original pixels of the fifth row of the image, namely B_{4,2}, B_{4,3}, B_{4,4}, B_{4,5};
Read the two coefficients A_{0,1} and A_{0,2} of the first row of the filter;
Four multipliers multiply the coefficient A_{0,1} by the four original pixels B_{4,1}, B_{4,2}, B_{4,3}, B_{4,4} of the image to obtain the product results for A_{0,1}; the other four multipliers multiply the coefficient A_{0,2} by the four original pixels B_{4,2}, B_{4,3}, B_{4,4}, B_{4,5} to obtain the product results for A_{0,2}; both sets of products are added to the product results of cycle 54.
As shown in FIG. 8(c), in the 56th operation cycle (cycle 56):
Read four original pixels of the fifth row of the image, namely B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7};
Read the two coefficients A_{0,3} and A_{0,4} of the first row of the filter;
Four multipliers multiply the coefficient A_{0,3} by the four original pixels B_{4,3}, B_{4,4}, B_{4,5}, B_{4,6} of the image to obtain the product results for A_{0,3}; the other four multipliers multiply the coefficient A_{0,4} by the four original pixels B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7} to obtain the product results for A_{0,4}; both sets of products are added to the accumulation result of cycle 55.
With these three operation cycles (cycle 54 to cycle 56), the preprocessing of row 5 of the image is complete. The product results obtained in these cycles are used to calculate the four pixels of the fifth row of the filtered image.
By analogy, the above steps are repeated: rows 6 to 9 of the image are preprocessed in the same way with the corresponding rows 2 to 5 of the filter, and the preprocessing of each row takes 3 operation cycles. After 15 operation cycles of in-line processing, that is, 68 operation cycles in total, the pixels of the fifth row of the filtered image are obtained. The pixels in columns 1 to 4 of all five rows of the filtered image are then available, and the filtering for columns 1 to 4 of the filtered image is complete.
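The in-line processing of one output row can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware of the disclosure: the function name `inline_output_row`, the list-of-lists image layout and the 0-based indices are all hypothetical. Each image row contributes one cycle with a single coefficient followed by cycles with up to N coefficients each, and the products are accumulated into P_read running sums.

```python
def inline_output_row(image, filt, out_row, col0=0, p_read=4, n=2):
    """Accumulate P_read pixels of one filtered row; return (sums, cycles used).

    Hypothetical model: 0-based indices, image and filt are lists of rows.
    """
    f_h, f_w = len(filt), len(filt[0])
    acc = [0] * p_read
    cycles = 0
    # Coefficient columns handled per cycle: [0], then groups of up to n.
    chunks = [[0]] + [list(range(s, min(s + n, f_w))) for s in range(1, f_w, n)]
    for i in range(f_h):                  # filter row i pairs with image row out_row + i
        row = image[out_row + i]
        for chunk in chunks:              # one operation cycle per chunk
            for j in chunk:
                for k in range(p_read):
                    acc[k] += filt[i][j] * row[col0 + k + j]
            cycles += 1
    return acc, cycles


image = [[y * 8 + x for x in range(8)] for y in range(9)]      # 9 rows, 8 columns
filt = [[i * 5 + j + 1 for j in range(5)] for i in range(5)]   # 5 x 5 filter
sums, cycles = inline_output_row(image, filt, out_row=4)       # fifth output row
print(sums, cycles)
```

With a 5 × 5 filter and N = 2, the model spends 3 cycles on each of the five image rows, matching the 15 in-line cycles counted above.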
After the target elements in columns 1 to 4 of the filtered image have been obtained, the whole procedure of preprocessing, parallel processing and in-line processing (if any) is repeated. As shown in FIG. 9, taking a filtered image of height 4 as an example, the pixels in columns 5 to 8, columns 9 to 12 and columns 13 to 16 of the filtered image are obtained in turn, which completes the entire filtering process. The width of the filtered image is M_o = P_read * ceil((M_w - F_w + 1) / P_read) and its height is N_o = N_h - F_h + 1, where M_w is the width of the image. When P_read is 4, M_w is 20, N_h is 8, and F_w and F_h are both 5, the filtered image size is 16×4, that is, the width M_o is 16 and the height N_o is 4.
The above example describes the image processing method of this embodiment with the number of vector processing units N_calc equal to 8, the number of original pixels read in each operation cycle P_read equal to 4, the number of filter coefficients N equal to 2, the image width M_w equal to 20, and the filter width F_w and height F_h both equal to 5. The values of these parameters are not limited to the above, and the parameters of this embodiment may take other values. With other values the image processing method is analogous to the above description, and those skilled in the art will readily understand its specific procedure.
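The size formulas can be evaluated directly. The sketch below is illustrative only; `filtered_size` and `column_blocks` are hypothetical names, and the starting columns are given as 0-based indices.

```python
import math


def filtered_size(m_w, n_h, f_w, f_h, p_read):
    """Return (M_o, N_o) from the formulas in the text."""
    m_o = p_read * math.ceil((m_w - f_w + 1) / p_read)
    n_o = n_h - f_h + 1
    return m_o, n_o


def column_blocks(m_o, p_read):
    """0-based starting column of each block of P_read output columns."""
    return list(range(0, m_o, p_read))


m_o, n_o = filtered_size(m_w=20, n_h=8, f_w=5, f_h=5, p_read=4)
print(m_o, n_o)                  # width and height of the filtered image
print(column_blocks(m_o, 4))     # one full pass of the procedure per block
```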
In the image processing method of this embodiment, the operation time for the P_read pixels of the first row of the filtered image is:
T_pre + F_w × (F_h - 1) × cycle
where T_pre is the operation time of the preprocessing described above, F_w and F_h are the width and height of the filter, and cycle is one operation cycle. The preprocessing operation time T_pre is (1 + ceil((F_w - 1) / N)) × cycle.
In the above example, the filter width F_w and height F_h are both 5, N is 2, and the preprocessing operation time T_pre is 3 operation cycles, so this operation time is 3 + 5 × (5 - 1) = 23 operation cycles. In other words, according to this formula the 4 pixels of the first row of the filtered image are obtained after 23 operation cycles.
From the second row of the filtered image onward, the operation time for the P_read pixels of each row in the parallel processing is:
F_w × F_h × cycle
In the above example, the filter width F_w and height F_h are both 5, so this operation time is 25 operation cycles. In other words, the 4 pixels of each of rows 2 to 4 of the filtered image require 25 operation cycles.
If the filtering includes in-line processing, so that the last r rows of the filtered image are obtained by in-line processing, the operation time for the P_read pixels of each of these last r rows is T_pre × F_h.
In the above example, assuming that the image height N_h is 9, the pixels of the last row of the filtered image, namely row 5, are obtained by in-line processing. The filter height F_h is 5, so this operation time is 15 operation cycles. In other words, the 4 pixels of the fifth row of the filtered image are obtained in 15 operation cycles.
In the image processing method of this embodiment, when the height N_o of the filtered image is divisible by the number N of filter coefficients, that is, when there is no in-line processing, the operation time for the N_o × P_read pixels from the first row to the last row of the filtered image is:
T_pre + F_w × F_h × (N_o / N) × cycle
where N_o is the height of the filtered image, N_o = N_h - F_h + 1, and N_h is the height of the image.
In the above example, the filter width F_w and height F_h are both 5 and the preprocessing operation time T_pre is 3 operation cycles. If the image height N_h is 8, the height N_o of the filtered image is 4, so the operation time for the 4 × 4 = 16 pixels of rows 1 to 4 of the filtered image is 3 + 25 × 2 = 53 operation cycles.
When the height N_o of the filtered image is not divisible by the number N of filter coefficients, that is, when there is in-line processing, the operation time for the N_o × P_read pixels from the first row to the last row of the filtered image is:
T_pre + F_w × F_h × ((N_o - r) / N) × cycle + T_r
where r is the remainder of N_o divided by N, that is, the number of rows of the filtered image that must be obtained by in-line processing, and T_r is the operation time of the in-line processing, given by T_r = r × T_pre × F_h.
In the above example, the filter width F_w and height F_h are both 5 and the preprocessing operation time T_pre is 3 operation cycles. If the image height N_h is 9, the height N_o of the filtered image is 5 and r is 1. Substituting into the above formula, the operation time for the 5 × 4 = 20 pixels of rows 1 to 5 of the filtered image is 3 + 25 × 2 + 15 = 68 operation cycles.
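The two totals can be reproduced by coding the formulas as written. This is a sketch for checking the arithmetic only; the function names `t_pre` and `total_cycles` are hypothetical, times are counted in operation cycles, and the image height 9 case uses N_o = N_h - F_h + 1 = 5 with r = 1.

```python
import math


def t_pre(f_w, n):
    """Preprocessing time in cycles: 1 + ceil((F_w - 1) / N)."""
    return 1 + math.ceil((f_w - 1) / n)


def total_cycles(n_h, f_w, f_h, n):
    """Cycles for all N_o rows of one block of P_read output columns."""
    n_o = n_h - f_h + 1
    r = n_o % n                                # rows left for in-line processing
    parallel = f_w * f_h * ((n_o - r) // n)    # F_w * F_h * ((N_o - r) / N)
    inline = r * t_pre(f_w, n) * f_h           # T_r = r * T_pre * F_h
    return t_pre(f_w, n) + parallel + inline


print(total_cycles(n_h=8, f_w=5, f_h=5, n=2))  # no in-line processing
print(total_cycles(n_h=9, f_w=5, f_h=5, n=2))  # one row by in-line processing
```

When r is 0 the in-line term vanishes, so the same function covers both formulas.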
In the above description, the original pixels of this embodiment are read in order: from top to bottom in the height direction of the image, from the first row to the last row, and from left to right within each row. The pixels of the filtered image are also output in order. When reading in order, reading the P_read original pixels of a row of the image includes:
Starting from the second row of the image, reading F_w groups of original pixels in sequence in each row, where the n-th group includes the P_read original pixels in columns n to n + P_read - 1, with 1 ≤ n ≤ F_w.
For example, in the above example the original pixels of each row are read from top to bottom in the height direction of the image, and 5 groups of original pixels are read in sequence in each row. Group 1 includes the 4 original pixels in columns 1 to 4, for example B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3}; group 2 includes the 4 original pixels in columns 2 to 5, for example B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4}; group 3 includes the 4 original pixels in columns 3 to 6, for example B_{1,2}, B_{1,3}, B_{1,4}, B_{1,5}; group 4 includes the 4 original pixels in columns 4 to 7, for example B_{1,3}, B_{1,4}, B_{1,5}, B_{1,6}; group 5 includes the 4 original pixels in columns 5 to 8, for example B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7}.
The present disclosure is not limited to this, however. The order in which the original pixels of the image are read is not restricted: they may be read in order, in reverse order, or with jumps. Filtering is complete as long as all original pixels of the image and all filter coefficients are traversed and the M*N*F_w*F_h multiply-accumulate operations are carried out. With reverse or jumping reads, only the output order of the pixels of the filtered image differs.
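The grouping rule can be written out as a short sketch. The helper name `read_groups` is hypothetical; it maps the 1-based group number n of the text to 0-based pixel indices (row, column), so that group n covers 0-based columns n - 1 to n + P_read - 2.

```python
def read_groups(row, f_w=5, p_read=4):
    """Return the F_w groups of (row, column) indices read in order for one image row."""
    groups = []
    for n in range(1, f_w + 1):          # 1-based group number, 1 <= n <= F_w
        first = n - 1                    # 0-based column of the group's first pixel
        groups.append([(row, c) for c in range(first, first + p_read)])
    return groups


for n, group in enumerate(read_groups(1), start=1):
    print(n, group)
```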
In the above examples, the P_read original pixels of the image are read from the external storage unit every time, but the present disclosure is not limited to this. In each operation cycle, after the P_read original pixels of the image have been read from the external storage unit, they may also be stored in an internal storage unit of the vector processing unit. In a later operation cycle, if some of the P_read original pixels to be read are already stored in the internal storage unit, those pixels can be read directly from the internal storage unit, and only the remaining original pixels that are not stored there need to be read from the external storage unit. Specifically, reading the P_read original pixels of a row of the image includes:
Reading some of the P_read original pixels from the internal storage unit, these original pixels having been stored in a previous operation cycle;
Reading the remaining original pixels of the P_read original pixels from the external storage unit, so as to obtain the P_read original pixels;
Storing the remaining original pixels in the internal storage unit.
For example, as shown in FIG. 7, after B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3} are read in cycle 4, they may be stored in the internal storage unit of the vector processing unit. In cycle 5, only B_{1,4} then needs to be read from the external storage unit, while B_{1,1}, B_{1,2}, B_{1,3} can be read from the internal storage unit. This reduces the amount of data that the vector processing unit reads from off-chip memory and saves bandwidth.
In the above examples, multiplication and accumulation are performed in every operation cycle: the filter coefficients are multiplied by the original pixels, and the products are added to the accumulation result of the previous operation cycle. The present disclosure is not limited to this, however. The multiplications may also be performed first, and the accumulation carried out only after all product results used to calculate the pixels of the filtered image have been obtained.
It can be seen that the present disclosure reads N coefficients of the filter in each operation cycle, where the number N of coefficients read is determined according to the number of multipliers that execute the image processing method, and multiplies each of the N coefficients by each of the original pixels to obtain product results. Compared with the prior art, more vector processing units take part in the filtering operation; compared with the preprocessing approach alone, the computing resources are fully used. The overall performance of image filtering and the image processing efficiency are therefore improved further.
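The bandwidth saving can be estimated with a small counting sketch. It rests on assumptions that are not in the disclosure: the helper name `external_reads` is hypothetical, and the internal storage unit is modelled as an unbounded set that keeps every pixel read so far, whereas a real internal buffer would have a fixed capacity and replacement policy.

```python
def external_reads(groups, use_cache):
    """Count pixels fetched from external storage for a sequence of read groups."""
    cached = set()   # models the internal storage unit (assumed unbounded)
    count = 0
    for group in groups:
        for pixel in group:
            if use_cache and pixel in cached:
                continue             # served from the internal storage unit
            count += 1               # fetched from the external storage unit
            cached.add(pixel)
    return count


# The five sequential groups of image row 1 (0-based indices), 4 pixels each.
groups = [[(1, n + k) for k in range(4)] for n in range(5)]
print(external_reads(groups, use_cache=False))
print(external_reads(groups, use_cache=True))
```

Under this model each group after the first fetches a single new pixel, which mirrors the cycle 5 example where only B_{1,4} is read externally.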
Another embodiment of the present disclosure provides an image processing apparatus, as shown in FIG. 10, including:
An external storage unit, which stores an image and a filter.
A vector processing unit, which includes multipliers. The vector processing unit is configured to read P_read original pixels of the image, where the value of P_read is determined according to the memory access bit width of the vector processing unit, and to read N coefficients of the filter, where the value of N is determined according to the number of multipliers of the vector processing unit; the filter is used to filter the image.
The multipliers are configured to multiply each of the N coefficients by the P_read original pixels to obtain a plurality of product results, and the product results are used to calculate the pixel values of pixels in the filtered image.
The vector processing unit may be an arithmetic unit in a processor and can perform processing such as filtering on the image. The processor may be any type of chip with vector processing capability, such as a CPU, DSP or FPGA. For the vector processing unit, the external storage unit is its off-chip memory. The external storage unit stores the image to be processed and the filter. The vector processing unit filters the image with the filter, and the resulting filtered image may also be stored in the external storage unit.
The vector processing unit of a DSP includes a plurality of MACs. Each MAC includes one of the multipliers and one adder, which perform the multiplication and accumulation operations of the filtering.
It should be noted that FIG. 10 shows the structure of the image processing apparatus only schematically. In this embodiment there may be one or more external storage units, and the image and the filter may be stored in one external storage unit or separately in several.
In each operation cycle, the vector processing unit reads P_read original pixels of a row of the image. To improve the efficiency of the image processing method, the number P_read of original pixels read equals the maximum number of original pixels that can be read in one operation cycle.
When the vector processing unit reads these P_read original pixels from the storage unit, the maximum number that can be read depends on the memory access bit width of the data bus and the bit width of an original pixel. In this embodiment, the maximum number equals the quotient of the memory access bit width and the original pixel bit width.
In each operation cycle, the vector processing unit reads from the external storage unit the N filter coefficients that correspond to the P_read original pixels. To improve the overall performance of image filtering and the image processing efficiency, in this embodiment the number N of coefficients read each time is determined according to the number of multipliers. Specifically, if the image processing apparatus includes N_calc multipliers, the number of coefficients is
N = N_calc / P_read.
The number of multipliers is usually a multiple of the number of original pixels read in each operation cycle, and that multiple is the number of filter coefficients read from the storage unit. The number of multipliers may be 8, 16, 32, and so on.
After the original pixels and the filter coefficients have been read, the multipliers multiply each of the N coefficients by each original pixel to obtain product results.
When the vector processing unit reads P_read original pixels in the first row of the image, preprocessing is performed. In the preprocessing, the vector processing unit reads the P_read original pixels in the first row of the image and one or N coefficients of the first row of the filter, and multiplies the one or N coefficients by each of these original pixels of the first row to obtain product results.
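The two sizing rules can be combined in a short sketch. The bit widths used below (a 32-bit access width and 8-bit pixels) are assumptions chosen only so that P_read comes out as 4; the disclosure does not state them, and the function names are hypothetical.

```python
def pixels_per_read(access_bits, pixel_bits):
    """P_read: quotient of the memory access bit width and the pixel bit width."""
    return access_bits // pixel_bits


def coeffs_per_cycle(n_calc, p_read):
    """N = N_calc / P_read, assuming N_calc is a multiple of P_read."""
    if n_calc % p_read != 0:
        raise ValueError("number of multipliers should be a multiple of P_read")
    return n_calc // p_read


p_read = pixels_per_read(access_bits=32, pixel_bits=8)  # assumed widths
for n_calc in (8, 16, 32):
    print(n_calc, coeffs_per_cycle(n_calc, p_read))
```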
Starting from the second row of the image, that is, when the vector processing unit reads P_read original pixels of a row after the first row of the image, parallel processing is performed. In the parallel processing, the vector processing unit reads the P_read original pixels of that row and N coefficients of the filter, where the N coefficients are located in the same column of N adjacent rows of the filter. The N_calc vector processing units multiply each of the N coefficients by each of the P_read original pixels to obtain product results.
The above processing is described below with reference to the accompanying drawings, taking as an example 8 vector processing units, with 4 original pixels and 2 filter coefficients read in each operation cycle.
Assume that the original size of the image is 16×4. The size of the padded image is 20×8, that is, the width M_w is 20 and the height N_h is 8. The size of the filter is 5×5, that is, the width F_w and the height F_h are both 5. The filter performs the filtering operation on the padded image. The filtering operation is described in detail below, taking two-dimensional convolution as an example.
In two-dimensional convolution, the original pixels of some rows of the image are used only once when the filtering results are calculated: they are not reused when the product operations of the filtering are performed for the original pixels of other rows. The original pixels of these rows can therefore be processed within the row by preprocessing.
In the above example, the first row of the image is such a case. When the filtering results are calculated, the original pixels of the first row are used only once, and they are not reused when the product operations of the filtering are performed for the original pixels of later rows. The original pixels of the first row are therefore processed by preprocessing.
For preprocessing, the image processing apparatus is as shown in FIG. 11, which shows the multipliers and adders of 8 MACs (MAC0 to MAC7). The vector processing unit further includes:
Input buffers: group A buffers A1 and A2 and group B buffers B1, B2, B3, B4 and B5, which buffer the filter coefficients and the original pixels of the image that have been read, respectively.
Output buffers: ACC0, ACC1, ACC2 and ACC3.
Three multiplexers: MUX1, MUX2 and MUX3.
The multiplier of each MAC is connected to the group B buffers; the multipliers of the first 4 MACs are connected to buffer A1, and the multipliers of the last 4 MACs are connected to buffer A2.
在第1个运算周期(cycle 1)中:In the first operation cycle (cycle 1):
读取图像的第一行的4个原像素,即B 0,0,B 0,1,B 0,2,B 0,3,B 0,0经MUX2缓存至B1,B 0,1,B 0,2,B 0,3缓存至B2、B3、B4,0经过MUX3缓存至B5; Read the 4 original pixels of the first line of the image, namely B 0,0 ,B 0,1 ,B 0,2 ,B 0,3 ,B 0,0 buffered by MUX2 to B1,B 0,1 ,B 0 , 2, B 0,3 are cached to B2, B3, B4, and 0 is cached to B5 through MUX3;
读取滤波器的第一行的一个系数A 0,0,缓存至A1;0经过MUX1缓存至A2; Read a coefficient A 0,0 of the first line of the filter and buffer it to A1; 0 is buffered to A2 through MUX1;
MAC0-MAC3的乘法器分别将系数A 0,0和原像素B 0,0,B 0,1,B 0,2,B 0,3相乘,得到系数A 0,0的乘积结果,并分别缓存至ACC0、ACC1、ACC2、ACC3。 The MAC0-MAC3 multipliers respectively multiply the coefficients A 0, 0 and the original pixels B 0 , 0 , B 0 , 1, B 0, 2, B 0, 3 to obtain the product results of the coefficients A 0, 0, and respectively Cache to ACC0, ACC1, ACC2, ACC3.
在第2个运算周期(cycle 2)中:In the second operation cycle (cycle 2):
读取图像的第一行的4个原像素,即B 0,2,B 0,3,B 0,4,B 0,5;B2先将缓存的原像素B 0,1经MUX2缓存至B1,B2再缓存读取的原像素B 0,2;B3缓存读取的原像素B 0,3,B4缓存读取的原像素B 0,4,B5缓存读取的原像素B 0,5Read the 4 original pixels of the first line of the image, namely B 0,2 , B 0,3 , B 0,4 , B 0,5 ; B2 first buffers the buffered original pixel B 0,1 to B1 via MUX2 , B2 caches the read original pixel B 0,2 ; B3 caches the read original pixel B 0,3 , B4 caches the read original pixel B 0,4 , and B5 caches the read original pixel B 0,5 ;
读取滤波器的第一行的2个系数A 0,1,A 0,2;A 0,1缓存至A1,A 0,2缓存至A2; Read the two coefficients A 0,1 and A 0,2 of the first line of the filter; A 0,1 is cached to A1, and A 0,2 is cached to A2;
MAC0-MAC3的乘法器将系数A 0,1分别和图像的4个原像素B 0,1,B 0,2,B 0,3,B 0,4相乘,得到系数A 0,1的乘积结果,MAC4-MAC7的乘法器将系数A 0,2分别和图像的4个原像素B 0,2,B 0,3,B 0,4,B 0,5相乘,得到系数A 0,2的乘积结果,MAC0-MAC7的累加器将两个乘积结果累加至ACC0、ACC1、 ACC2、ACC3。 The multiplier of MAC0-MAC3 multiplies the coefficient A 0,1 with the four original pixels B 0,1 , B 0,2 , B 0,3 , B 0,4 of the image respectively to obtain the product of the coefficient A 0,1 As a result, the MAC4-MAC7 multiplier multiplies the coefficients A 0 , 2 with the four original pixels B 0 , 2, B 0 , 3, B 0 , 4, and B 0, 5 of the image to obtain the coefficients A 0, 2. The accumulator of MAC0-MAC7 accumulates the two product results to ACC0, ACC1, ACC2, ACC3.
在第3个运算周期(cycle 3)中:In the third operation cycle (cycle 3):
读取图像的第一行的4个原像素,即B 0,4,B 0,5,B 0,6,B 0,7;B3先将缓存的原像素B 0,3经MUX2缓存至B1,B3再缓存读取的原像素B 0,5;B2缓存读取的原像素B 0,4,B4缓存读取的原像素B 0,6,B5缓存读取的原像素B 0,7Read the 4 original pixels of the first line of the image, namely B 0,4 , B 0,5 , B 0,6 , B 0,7 ; B3 first buffers the buffered original pixels B 0,3 to B1 via MUX2 , B3 caches the read original pixel B 0,5 ; B2 caches the read original pixel B 0,4 , B4 caches the read original pixel B 0,6 , and B5 caches the read original pixel B 0,7 ;
读取滤波器的第一行的2个系数A 0,3,A 0,4;A 0,3缓存至A1,A 0,4缓存至A2; Read the two coefficients A 0,3 and A 0,4 of the first line of the filter; A 0,3 is cached to A1, and A 0,4 is cached to A2;
MAC0-MAC3的乘法器将系数A 0,3分别和图像的4个原像素B 0,3,B 0,4,B 0,5,B 0,6相乘,得到系数A 0,3的乘积结果,MAC4-MAC7的乘法器将系数A 0,4分别和图像的4个原像素B 0,4,B 0,5,B 0,6,B 0,7相乘,得到系数A 0,4的乘积结果,MAC0-MAC7的累加器将两个乘积结果累加至ACC0、ACC1、ACC2、ACC3。 The MAC0-MAC3 multiplier multiplies the coefficients A 0 , 3 with the four original pixels B 0 , 3, B 0 , 4, B 0, 5 , and B 0, 6 of the image to obtain the product of the coefficients A 0, 3. As a result, the MAC4-MAC7 multiplier multiplies the coefficients A 0 , 4 with the four original pixels B 0 , 4, B 0, 5 , B 0, 6 , and B 0 , 7 of the image, respectively, to obtain the coefficients A 0, 4 The accumulator of MAC0-MAC7 accumulates the two product results to ACC0, ACC1, ACC2, ACC3.
Preprocessing is thus completed in three operation cycles, and the product results it yields are used to calculate the four pixels of the first row of the filtered image.
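The three preprocessing cycles can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the function name `preprocess_row` and the list standing in for ACC0-ACC3 are hypothetical and do not appear in the disclosure. The first cycle applies a single coefficient, and each later cycle applies N = 2 coefficients to a sliding window of P_read = 4 pixels.

```python
def preprocess_row(row, coeffs, p_read=4, n=2):
    """Accumulate one filter row against one image row, cycle by cycle.

    Cycle 1 uses one coefficient (the second MAC group is fed 0);
    every later cycle uses n coefficients, as in cycles 2 and 3.
    """
    acc = [0] * p_read                      # models ACC0..ACC3 (assumption)
    groups = [coeffs[:1]] + [coeffs[k:k + n] for k in range(1, len(coeffs), n)]
    col = 0                                 # column index of the current coefficient
    for group in groups:                    # one group per operation cycle
        for a in group:
            for j in range(p_read):
                acc[j] += a * row[col + j]  # coefficient A_{0,col} times pixel B_{0,col+j}
            col += 1
    return acc, len(groups)


acc, cycles = preprocess_row(list(range(8)), [1, 2, 3, 4, 5])
print(acc, cycles)  # [40, 55, 70, 85] 3
```

With a filter width of 5 the sketch needs three cycles, matching the count given above.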
The preprocessing of the first row is described above. The other rows of the image are handled in the same way: starting from the second row, the above steps are repeated so that in-row preprocessing is also performed on the other rows, yielding the product results used to calculate the pixels of each row of the filtered image. Accumulating the product results that belong to the same pixel of the filtered image gives the value of that pixel. Performing two-dimensional filtering with in-row preprocessing improves the computational efficiency of the two-dimensional filtering, the overall performance of image filtering, and the efficiency of image processing.
As mentioned above, in two-dimensional convolution the original pixels of some rows of the image are used only once and are not reused. For other rows, however, the original pixels are used several times when the filtering results are calculated: they can be reused in the product operations performed when filtering the original pixels of other rows. The original pixels of such rows can therefore be processed in parallel within the row.
In the above example, every row from the second row of the image onward falls into this category. This embodiment can perform parallel processing on the rows after the first row, which further improves the computational efficiency of the two-dimensional filtering and the overall performance of image filtering.
When parallel processing is performed, the image processing apparatus is as shown in FIG. 12. The vector processing unit further includes:
output buffers ACC4, ACC5, ACC6, ACC7;
internal storage units R0, R1, R2, R3, R4, R5, R6, R7, where R0-R7 are on-chip memories of the vector processing unit.
In the fourth operation cycle (cycle 4):
Four original pixels of the second row of the image are read, namely B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3}, and are buffered in B1, B2, B3, B4 respectively;
The two coefficients in the first column of two adjacent filter rows, the first and second rows, are read, namely A_{0,0} and A_{1,0}; A_{1,0} is buffered in A1 and A_{0,0} in A2;
The multipliers of MAC0-MAC3 multiply the coefficient A_{1,0} by each of the four original pixels B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3} of the image to obtain the product results for A_{1,0}, and the accumulators of MAC0-MAC3 accumulate these product results into ACC0, ACC1, ACC2, ACC3; the multipliers of MAC4-MAC7 multiply the coefficient A_{0,0} by each of the same four original pixels B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3} to obtain the product results for A_{0,0}, and the accumulators of MAC4-MAC7 accumulate these product results into ACC4, ACC5, ACC6, ACC7.
In the fifth operation cycle (cycle 5):
Four original pixels of the second row of the image are read, namely B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4}, and are buffered in B1, B2, B3, B4 respectively;
The two coefficients in the second column of two adjacent filter rows, the first and second rows, are read, namely A_{0,1} and A_{1,1}; A_{1,1} is buffered in A1 and A_{0,1} in A2;
The multipliers of MAC0-MAC3 multiply the coefficient A_{1,1} by each of the four original pixels B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4} of the image to obtain the product results for A_{1,1}, and the accumulators of MAC0-MAC3 accumulate these product results into ACC0, ACC1, ACC2, ACC3; the multipliers of MAC4-MAC7 multiply the coefficient A_{0,1} by each of the same four original pixels B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4} to obtain the product results for A_{0,1}, and the accumulators of MAC4-MAC7 accumulate these product results into ACC4, ACC5, ACC6, ACC7.
By analogy, in the eighth operation cycle (cycle 8):
Four original pixels of the second row of the image are read, namely B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7}, and are buffered in B1, B2, B3, B4 respectively;
The two coefficients in the fifth column of two adjacent filter rows, the first and second rows, are read, namely A_{0,4} and A_{1,4}; A_{1,4} is buffered in A1 and A_{0,4} in A2;
The multipliers of MAC0-MAC3 multiply the coefficient A_{1,4} by each of the four original pixels B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7} of the image to obtain the product results for A_{1,4}, and the accumulators of MAC0-MAC3 accumulate these product results into ACC0, ACC1, ACC2, ACC3; the multipliers of MAC4-MAC7 multiply the coefficient A_{0,4} by each of the same four original pixels B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7} to obtain the product results for A_{0,4}, and the accumulators of MAC4-MAC7 accumulate these product results into ACC4, ACC5, ACC6, ACC7.
The parallel processing of the other rows of the image is similar to that of the second row described above. After 23 operation cycles, the four pixels of the first row of the filtered image are obtained, and ACC0, ACC1, ACC2, ACC3 are cleared (clear0).
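The parallel processing of one image row can be sketched as below. This is a hedged illustration rather than the disclosed circuit: `parallel_row`, its parameter names and the two Python lists standing in for ACC0-ACC3 and ACC4-ACC7 are assumptions. In each cycle the same window of four pixels is multiplied by two coefficients taken from the same column of two adjacent filter rows.

```python
def parallel_row(pixels, upper, lower, acc_lo, acc_hi, p_read=4):
    """Process one image row against two adjacent filter rows.

    `lower` is the filter row feeding ACC0..ACC3 (e.g. A_{1,*}) and
    `upper` the row feeding ACC4..ACC7 (e.g. A_{0,*}); both reuse the
    same window of pixels in every cycle.
    """
    for k, (a_lo, a_hi) in enumerate(zip(lower, upper)):  # one cycle per filter column k
        window = pixels[k:k + p_read]                      # pixels in columns k .. k+p_read-1
        for j, b in enumerate(window):
            acc_lo[j] += a_lo * b
            acc_hi[j] += a_hi * b
    return len(lower)                                      # cycles spent on this row


acc_lo, acc_hi = [0] * 4, [0] * 4
cycles = parallel_row(list(range(10, 18)), upper=[1, 0, 0, 0, 0],
                      lower=[0, 0, 0, 0, 1], acc_lo=acc_lo, acc_hi=acc_hi)
print(cycles, acc_lo, acc_hi)  # 5 [14, 15, 16, 17] [10, 11, 12, 13]
```

One image row therefore costs F_w = 5 cycles while contributing to two rows of the filtered image at once.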
When two adjacent filter rows wrap around in the height direction, that is, when the last filter row is paired with the first, the original pixels read from the next row of the image cannot be reused: the two filter coefficients operate on original pixels of different rows. Parallel processing then proceeds as follows:
The vector processing unit reads another P_read original pixels from the internal storage unit; these P_read original pixels were read from the image in an earlier operation cycle and stored in the internal storage unit.
Each of the N coefficients located at the tail of the filter in the height direction is multiplied by each of the P_read original pixels read from the image to obtain the product results;
Each of the N coefficients located at the head of the filter in the height direction is multiplied by each of the other P_read original pixels read from the internal storage unit to obtain the product results.
For example, in the 24th operation cycle (cycle 24):
Four original pixels of the sixth row of the image are read from the external storage unit, namely B_{5,0}, B_{5,1}, B_{5,2}, B_{5,3}, and are buffered in B1, B2, B3, B4 respectively; four original pixels of the third row of the image are read from the internal storage units R0, R1, R2, R3, namely B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3}. These four original pixels B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3} were read from the image in cycle 9 and stored in the internal storage units R0, R1, R2, R3.
The two coefficients in the first column of two adjacent filter rows, the fifth and first rows, are read, namely A_{4,0} and A_{0,0}; A_{0,0} is buffered in A1 and A_{4,0} in A2;
The multipliers of MAC0-MAC3 multiply the coefficient A_{0,0} by each of the four original pixels B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3} of the image to obtain the product results for A_{0,0}, and the accumulators of MAC0-MAC3 then accumulate these product results into ACC0, ACC1, ACC2, ACC3; the multipliers of MAC4-MAC7 multiply the coefficient A_{4,0} by each of the four original pixels B_{5,0}, B_{5,1}, B_{5,2}, B_{5,3} of the image to obtain the product results for A_{4,0}, and the accumulators of MAC4-MAC7 accumulate these product results into ACC4, ACC5, ACC6, ACC7.
By analogy, after 28 operation cycles, the four pixels C_{0,0}-C_{0,3} of the first row and the four pixels C_{1,0}-C_{1,3} of the second row of the filtered image have been obtained, and ACC4, ACC5, ACC6, ACC7 are cleared (clear1).
The above parallel processing is repeated. Starting from cycle 28, after a further 24 operation cycles the four pixels C_{2,0}-C_{2,3} of the third row of the filtered image are obtained; after one more operation cycle, that is, 25 operation cycles after cycle 28 and 53 operation cycles in total, the four pixels C_{3,0}-C_{3,3} of the fourth row of the filtered image are obtained. This yields the pixels in columns 1 to 4 of all four rows of the filtered image.
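The overall schedule for one strip of four columns can be modelled as below. This is a simplified sketch under several assumptions: N = 2, an even filtered-image height, and a hypothetical function `filter_strip` in which one lane (standing for ACC0-ACC3) owns the even rows of the filtered image and the other lane (ACC4-ACC7) owns the odd rows. It is meant only to check that the scheme yields the ordinary two-dimensional filtering result and the total of 53 operation cycles; it does not model the buffers, multiplexers or the exact cycle in which each intermediate row completes.

```python
import math


def filter_strip(image, filt, n_o, p_read=4):
    """Compute an n_o x p_read strip of the filtered image with two MAC lanes."""
    f_h, f_w = len(filt), len(filt[0])
    out = [[0] * p_read for _ in range(n_o)]

    def mac(r, i):
        # Accumulate filter row i against image row r + i into output row r.
        for k in range(f_w):
            for j in range(p_read):
                out[r][j] += filt[i][k] * image[r + i][k + j]

    mac(0, 0)                                   # preprocessing of the first image row
    cycles = 1 + math.ceil((f_w - 1) / 2)       # T_pre in cycles, with N = 2
    # Lane "lo" owns even output rows, lane "hi" owns odd output rows.
    lo = [(r, i) for r in range(0, n_o, 2) for i in range(f_h)][1:]  # (0, 0) already done
    hi = [(r, i) for r in range(1, n_o, 2) for i in range(f_h)]
    for step in range(max(len(lo), len(hi))):
        for lane in (lo, hi):
            if step < len(lane):
                mac(*lane[step])
        cycles += f_w                           # one cycle per filter column
    return out, cycles


image = [[r * 8 + c for c in range(8)] for r in range(8)]
filt = [[i * 5 + k + 1 for k in range(5)] for i in range(5)]
out, cycles = filter_strip(image, filt, n_o=4)
print(cycles)  # 53
```

In the fifth step of this model the two lanes work on different image rows (filter row 1 on image row 3 and filter row 5 on image row 6), which corresponds to the wrap-around case handled through the internal storage unit.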
If the height N_o of the filtered image is not divisible by the number N of filter coefficients, the filtering of the pixels in columns 1 to 4 of the filtered image is not yet complete, and the image must be processed further to obtain the pixels of the remaining rows of the filtered image. Because the remaining rows of the filtered image cannot be processed in parallel, the following in-row processing is performed:
A row of P_read original pixels of the image that is used to calculate the pixels of the last r rows of the filtered image is read, where r is the remainder of the height of the filtered image divided by N;
One coefficient or the N coefficients in a row of the filter are read;
The multipliers multiply the one coefficient by the P_read original pixels, or multiply each of the N coefficients by each of the P_read original pixels, to obtain the product results.
For example, assume that the image height N_h is 9 and the filter height is still 5, so that the height N_o of the corresponding filtered image is 5. N_o is then not divisible by N, and the filtering of the pixels in columns 1 to 4 of the filtered image is not yet complete: the image must be processed further to obtain the pixels of the fifth row of the filtered image. Because only one row remains, namely the fifth row of the filtered image, it cannot be processed in parallel and is handled by in-row processing instead.
In in-row processing, the preprocessing described above is performed once for each of rows 5 to 9 of the image. That is, for each of rows 5 to 9 of the image, one or N coefficients of a filter row are read, and each of them is multiplied by each original pixel of that image row to obtain the product results. Taking row 5 of the image as an example, the processing is as follows:
In the 54th operation cycle (cycle 54):
Four original pixels of the fifth row of the image are read, namely B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3}; B_{4,0} is buffered in B1 via MUX2; B_{4,1}, B_{4,2}, B_{4,3} are buffered in B2, B3, B4; and 0 is buffered in B5 via MUX3;
One coefficient of the first row of the filter, A_{0,0}, is read and buffered in A1; 0 is buffered in A2 via MUX1;
The multipliers of MAC0-MAC3 multiply the coefficient A_{0,0} by each of the four original pixels B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3} of the image to obtain the product results for A_{0,0}, which are stored in ACC0, ACC1, ACC2, ACC3 respectively.
In the 55th operation cycle (cycle 55):
Four original pixels of the fifth row of the image are read, namely B_{4,2}, B_{4,3}, B_{4,4}, B_{4,5}. B2 first passes its buffered original pixel B_{4,1} to B1 via MUX2 and then buffers the newly read B_{4,2}; B3 buffers B_{4,3}, B4 buffers B_{4,4}, and B5 buffers B_{4,5};
Two coefficients of the first row of the filter, A_{0,1} and A_{0,2}, are read; A_{0,1} is buffered in A1 and A_{0,2} in A2;
The multipliers of MAC0-MAC3 multiply the coefficient A_{0,1} by each of the four original pixels B_{4,1}, B_{4,2}, B_{4,3}, B_{4,4} of the image to obtain the product results for A_{0,1}; the multipliers of MAC4-MAC7 multiply the coefficient A_{0,2} by each of the four original pixels B_{4,2}, B_{4,3}, B_{4,4}, B_{4,5} to obtain the product results for A_{0,2}; and the accumulators of MAC0-MAC7 accumulate both sets of product results into ACC0, ACC1, ACC2, ACC3.
In the 56th operation cycle (cycle 56):
Four original pixels of the fifth row of the image are read, namely B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7}. B3 first passes its buffered original pixel B_{4,3} to B1 via MUX2 and then buffers the newly read B_{4,5}; B2 buffers B_{4,4}, B4 buffers B_{4,6}, and B5 buffers B_{4,7};
Two coefficients of the first row of the filter, A_{0,3} and A_{0,4}, are read; A_{0,3} is buffered in A1 and A_{0,4} in A2;
The multipliers of MAC0-MAC3 multiply the coefficient A_{0,3} by each of the four original pixels B_{4,3}, B_{4,4}, B_{4,5}, B_{4,6} of the image to obtain the product results for A_{0,3}; the multipliers of MAC4-MAC7 multiply the coefficient A_{0,4} by each of the four original pixels B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7} to obtain the product results for A_{0,4}; and the accumulators of MAC0-MAC7 accumulate both sets of product results into ACC0, ACC1, ACC2, ACC3.
After these three further operation cycles (cycles 54 to 56), the preprocessing of row 5 of the image is complete, and the product results obtained in these cycles are used to calculate the four pixels of the fifth row of the filtered image.
By analogy, the above steps are repeated: rows 6 to 9 of the image, together with their corresponding filter rows 2 to 5, are each preprocessed as described, and each row takes three operation cycles. After 15 operation cycles of in-row processing, 68 operation cycles in total, the pixels of the fifth row of the filtered image are obtained. The pixels in columns 1 to 4 of all five rows of the filtered image are thus available, and the filtering of columns 1 to 4 of the filtered image is complete.
Once the pixels in columns 1 to 4 of the filtered image have been obtained, the whole procedure of preprocessing, parallel processing and in-row processing (if any) is repeated. As shown in FIG. 9, taking a filtered image of height 4 as an example, the pixels in columns 5 to 8, columns 9 to 12 and columns 13 to 16 of the filtered image are obtained in turn, which completes the filtering. The width of the filtered image is M_o = P_read × ceil((M_w - F_w + 1)/P_read) and its height is N_o = N_h - F_h + 1, where M_w is the width of the image. When P_read is 4, M_w is 20, N_h is 8, and F_w and F_h are both 5, the filtered image measures 16×4, that is, its width M_o is 16 and its height N_o is 4.
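The size formulas can be checked with a few lines of Python. The function name `filtered_size` and its parameter names are illustrative assumptions; only the two formulas themselves come from the text.

```python
import math


def filtered_size(m_w, n_h, f_w, f_h, p_read):
    """Return (width M_o, height N_o) of the filtered image."""
    m_o = p_read * math.ceil((m_w - f_w + 1) / p_read)  # width rounded up to a multiple of P_read
    n_o = n_h - f_h + 1
    return m_o, n_o


print(filtered_size(m_w=20, n_h=8, f_w=5, f_h=5, p_read=4))  # (16, 4)
```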
The above example describes the image processing apparatus of this embodiment with the number of multipliers N_calc equal to 8, the number of original pixels read per operation cycle P_read equal to 4, the number of filter coefficients N equal to 2, the image width M_w equal to 20, and the filter width F_w and height F_h both equal to 5. A person skilled in the art will appreciate, however, that the parameters are not limited to these values and may take other values. With other values the image processing apparatus is similar to that described above, and its specific structure will be apparent to a person skilled in the art.
In the image processing apparatus of this embodiment, the operation time for the P_read pixels of the first row of the filtered image is:
T_pre + F_w × (F_h - 1) × cycle
where T_pre is the operation time of the preprocessing described above, F_w and F_h are the width and height of the filter, respectively, and cycle is one operation cycle. The preprocessing time is T_pre = (1 + ceil((F_w - 1)/N)) × cycle.
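As a quick numerical check of these two expressions, the sketch below counts time in operation cycles. The helper names `t_pre` and `first_row_cycles` are hypothetical and chosen only for this illustration.

```python
import math


def t_pre(f_w, n):
    """Preprocessing time in cycles: 1 + ceil((F_w - 1) / N)."""
    return 1 + math.ceil((f_w - 1) / n)


def first_row_cycles(f_w, f_h, n):
    """Cycles needed for the P_read pixels of the first filtered row."""
    return t_pre(f_w, n) + f_w * (f_h - 1)


print(t_pre(5, 2), first_row_cycles(5, 5, 2))  # 3 23
```

For the 5×5 filter with N = 2 this reproduces the 3 preprocessing cycles and the 23 cycles after which the first filtered row is available.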
From the second row of the filtered image onward, the operation time for the P_read pixels of each row during parallel processing is:
F_w × F_h × cycle
If the filtering includes in-row processing, so that the last r rows of the filtered image are obtained by in-row processing, the operation time for the P_read pixels of each of those r rows is T_pre × F_h.
In the image processing apparatus of this embodiment, when the height N_o of the filtered image is divisible by the number N of filter coefficients, that is, when there is no in-row processing, the operation time for the N_o × P_read pixels from the first row to the last row of the filtered image is:
T_pre + F_w × F_h × (N_o/N) × cycle
where N_o is the height of the filtered image, N_o = N_h - F_h + 1, and N_h is the height of the image.
When the height N_o of the filtered image is not divisible by the number N of filter coefficients, that is, when there is in-row processing, the operation time for the N_o × P_read pixels from the first row to the last row of the filtered image is:
T_pre + F_w × F_h × ((N_o - r)/N) × cycle + T_r
where r is the remainder of N_o divided by N, that is, the number of rows of the filtered image that must be obtained by in-row processing, and T_r is the operation time of the in-row processing, given by T_r = r × T_pre × F_h.
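Both cases can be evaluated together, since r = 0 makes T_r vanish. The sketch below counts time in operation cycles; the function name `strip_cycles` is an assumption made for this illustration.

```python
import math


def strip_cycles(n_h, f_w, f_h, n):
    """Cycles needed for all N_o x P_read pixels of one strip of the filtered image."""
    n_o = n_h - f_h + 1                      # height of the filtered image
    pre = 1 + math.ceil((f_w - 1) / n)       # T_pre in cycles
    r = n_o % n                              # rows left for in-row processing
    t_r = r * pre * f_h                      # T_r in cycles
    return pre + f_w * f_h * ((n_o - r) // n) + t_r


print(strip_cycles(8, 5, 5, 2), strip_cycles(9, 5, 5, 2))  # 53 68
```

The two results match the worked examples: 53 cycles for an image of height 8 and 68 cycles for an image of height 9.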
In the above description, the original pixels are read in order, that is, from top to bottom in the height direction of the image, from the first row to the last, and from left to right within each row. The pixels of the filtered image are likewise output in order. When reading in order, reading P_read original pixels of one row of the image includes:
Starting from the second row of the image, F_w groups of original pixels are read in turn in each row, where the n-th group consists of the P_read original pixels in columns n to n + P_read - 1, with 1 ≤ n ≤ F_w.
In the above example, the original pixels of each row are read from top to bottom in the height direction of the image, and five groups of original pixels are read in turn in each row. The present disclosure is not limited to this, however. The order in which the original pixels are read is in fact unrestricted: they may be read in order, in reverse order, or with skips. Filtering is complete as long as all original pixels of the image and all filter coefficients are traversed and the M_w × N_h × F_w × F_h multiply-accumulate operations are performed. With reverse or skipped reading, only the output order of the pixels of the filtered image differs.
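The claim that the read order does not affect the result follows from the fact that each filtered pixel is a plain sum of products. The following sketch illustrates this on a small integer example; `accumulate`, the 6×6 image and the 3×3 filter are assumptions chosen for brevity and are not the dimensions used in the embodiment.

```python
import itertools
import random


def accumulate(image, filt, terms):
    """Add filt[i][k] * image[r+i][c+k] into out[r][c], in the given term order."""
    n_o = len(image) - len(filt) + 1
    m_o = len(image[0]) - len(filt[0]) + 1
    out = [[0] * m_o for _ in range(n_o)]
    for r, c, i, k in terms:
        out[r][c] += filt[i][k] * image[r + i][c + k]
    return out


image = [[r * 6 + c for c in range(6)] for r in range(6)]
filt = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
terms = list(itertools.product(range(4), range(4), range(3), range(3)))
in_order = accumulate(image, filt, terms)
random.Random(0).shuffle(terms)              # an arbitrary "skipping" order
shuffled = accumulate(image, filt, terms)
print(in_order == shuffled)  # True
```

Integer pixels are used so that the comparison is exact; with floating-point data the results could differ in the last bits because of rounding.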
In the above examples, P_read original pixels of one row of the image are read from the storage unit each time, but the present disclosure is not limited to this. In each operation cycle, after the P_read original pixels of one row of the image have been read from the storage unit, they may also be stored in the internal storage unit of the vector processing unit. In a later operation cycle, if some of the P_read original pixels to be read are already in the internal storage unit, those pixels are read directly from it, and only the remaining pixels, which the internal storage unit does not hold, need to be read from the external storage unit. Specifically, reading P_read original pixels of one row of the image includes:
reading some of the P_read original pixels from the internal storage unit, these pixels having been stored in an earlier operation cycle;
reading the remaining original pixels of the P_read original pixels from the external storage unit, thereby obtaining the P_read original pixels;
storing the remaining original pixels in the internal storage unit.
This reduces the amount of data that the vector processing unit reads from off-chip memory and saves bandwidth.
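The saving can be illustrated with a small model of sequential reading within one row. The function `read_groups` and the dictionary standing in for the internal storage unit are assumptions made for this sketch; the real unit consists of the registers R0-R7 and is not modelled in detail.

```python
def read_groups(row, f_w, p_read=4):
    """Read F_w sliding groups of P_read pixels, keeping read pixels internally."""
    internal = {}          # models the internal storage unit: column -> pixel
    external_reads = 0     # pixels fetched from the external storage unit
    groups = []
    for n in range(f_w):                       # group n covers columns n .. n+p_read-1
        group = []
        for col in range(n, n + p_read):
            if col not in internal:            # not stored yet: read externally, then store
                internal[col] = row[col]
                external_reads += 1
            group.append(internal[col])        # stored pixels are read internally
        groups.append(group)
    return groups, external_reads


groups, reads = read_groups(list(range(100, 108)), f_w=5)
print(reads)  # 8
```

Without the internal storage unit the same five groups would require 5 × 4 = 20 pixel reads from external storage, so in this model the external traffic for the row drops from 20 pixels to 8.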
In the above examples, multiplication and accumulation are both performed in every operation cycle: the filter coefficients are multiplied by the original pixels, and the products are added to the accumulated result of the previous operation cycle. The present disclosure is not limited to this, however. The multiplications may instead be performed first, and the accumulation carried out only after all product results needed to calculate a pixel of the filtered image have been obtained.
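The two alternatives give the same pixel value, as the following minimal sketch shows. The coefficient and pixel values are arbitrary illustrative numbers.

```python
coeffs = [1, 2, 3]
pixels = [4, 5, 6]

# Variant 1: multiply and accumulate in every operation cycle.
acc = 0
for a, b in zip(coeffs, pixels):
    acc += a * b

# Variant 2: compute all products first, then accumulate them.
products = [a * b for a, b in zip(coeffs, pixels)]
total = sum(products)

print(acc, total)  # 32 32
```

The second variant needs storage for the intermediate products, whereas the first only needs one accumulator per output pixel.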
A further embodiment of the present disclosure provides a mobile device that includes the image processing apparatus described in the preceding embodiment. The mobile device is at least one of a portable mobile terminal, an unmanned aerial vehicle, a handheld gimbal, and a remote controller.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional modules above is given only as an example. In practical applications, the functions may be assigned to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or some of the functions described above. For the specific operation of the apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and some or all of their technical features may be replaced by equivalents; where no conflict arises, the features of the embodiments of the present disclosure may be combined in any manner; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims (38)

  1. An image processing method, wherein the method is applied to a vector processing unit that includes multipliers, the method comprising:
    reading P_read original pixels of an image, wherein the value of P_read is determined according to the memory access bit width of the vector processing unit;
    reading N coefficients of a filter, wherein the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used to filter the image;
    multiplying, by the multipliers, each of the N coefficients by each of the P_read original pixels to obtain a plurality of product results, the product results being used to calculate pixel values of pixels in the filtered image.
  2. The image processing method according to claim 1, wherein
    N = N_calc / P_read
    where N_calc is the number of multipliers of the vector processing unit.
  3. The image processing method according to claim 1, wherein the value of P_read is equal to the maximum number of original pixels that the vector processing unit can read in each operation cycle.
  4. The image processing method according to claim 3, wherein the maximum number is equal to the quotient of the memory access bit width and the bit width of an original pixel.
  5. The image processing method according to claim 1, wherein the N coefficients are located in the same column of N adjacent rows of the filter.
  6. The image processing method according to claim 5, wherein reading the P_read original pixels of the image comprises:
    starting from the second row of the image, reading F_w groups of original pixels in turn in each row, wherein the n-th group of original pixels consists of the P_read original pixels in columns n to n + P_read - 1, 1 ≤ n ≤ F_w, and F_w is the width of the filter.
  7. The image processing method according to claim 6, wherein,
    from the second row of the filtered image onward, the operation time for the P_read pixels of each row is:
    F_w × F_h × cycle
    where F_w and F_h are the width and height of the filter, respectively, and cycle is one operation cycle.
  8. The image processing method according to claim 1, wherein when the $P_{read}$ original pixels that are read are located in the first row of the image, the following preprocessing is performed:
    reading one coefficient or the N coefficients of the first row of the filter;
    multiplying, by the multiplier, the one coefficient by the $P_{read}$ original pixels, or multiplying each of the N coefficients by the $P_{read}$ original pixels respectively, to obtain the product results.
  9. The image processing method according to claim 8, wherein
    the operation time for the $P_{read}$ pixels of the first row of the filtered image is:
    $T_{pre} + F_w \times (F_h - 1) \times \text{cycle}$
    where $T_{pre}$ is the operation time of the preprocessing, $F_w$ and $F_h$ are the width and height of the filter, respectively, and cycle is one operation cycle.
  10. The image processing method according to claim 1, wherein when the height of the filtered image is not divisible by N, and what is read is a row of $P_{read}$ original pixels of the image used to calculate pixels in the last r rows of the filtered image, r being the remainder of the height of the filtered image divided by N, the following in-row processing is performed:
    reading one coefficient or the N coefficients in one row of the filter;
    multiplying, by the multiplier, the one coefficient by the $P_{read}$ original pixels, or multiplying each of the N coefficients by the $P_{read}$ original pixels respectively, to obtain the product results.
  11. The image processing method according to claim 10, wherein
    the operation time for the $P_{read}$ pixels of each of the last r rows of the filtered image is:
    $T_{pre} \times F_h$
    where $T_{pre}$ is the operation time of the preprocessing and $F_h$ is the height of the filter.
  12. The image processing method according to claim 1, wherein when the height of the filtered image is divisible by N, the operation time for the $N_o \times P_{read}$ pixels from the first row to the last row of the filtered image is:
    $T_{pre} + F_w \times F_h \times (N_o / N) \times \text{cycle}$
    where $N_o$ is the number of rows of the filtered image, $N_o = N_h - F_h + 1$, $N_h$ is the height of the image, $F_w$ and $F_h$ are the width and height of the filter, respectively, cycle is one operation cycle, and $T_{pre}$ is the operation time of the preprocessing.
  13. The image processing method according to claim 1, wherein when the height of the filtered image is not divisible by N, the operation time for the $N_o \times P_{read}$ pixels from the first row to the last row of the filtered image is:
    $T_{pre} + F_w \times F_h \times ((N_o - r) / N) \times \text{cycle} + T_r$
    where $N_o$ is the number of rows of the filtered image, $N_o = N_h - F_h + 1$, $N_h$ is the height of the image, $F_w$ and $F_h$ are the width and height of the filter, respectively, r is the remainder of $N_h$ divided by N, $T_{pre}$ is the operation time of the preprocessing, and $T_r$ is the operation time of the in-row processing.
  14. The image processing method according to claim 13, wherein the operation time $T_r$ of the in-row processing is:
    $r \times T_{pre} \times F_h$.
  15. The image processing method according to any one of claims 9 and 11 to 14, wherein the operation time $T_{pre}$ of the preprocessing is:
    $(1 + \lceil (F_w - 1) / N \rceil) \times \text{cycle}$.
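The timing formulas of claims 12 to 15 can be combined into one small cost model. The sketch below is an illustration only: the function names and sample dimensions are hypothetical, time is counted in units of one cycle by default, and r is taken as the remainder of the filtered-image height divided by N, following the definition in claim 10 (claim 13 words r differently), so that (N_o - r)/N is a whole number.

```python
import math

def t_pre(f_w, n, cycle=1):
    """Preprocessing time from claim 15: (1 + ceil((F_w - 1) / N)) * cycle."""
    return (1 + math.ceil((f_w - 1) / n)) * cycle

def total_time(n_h, f_w, f_h, n, cycle=1):
    """Time for all N_o rows of P_read output pixels (claims 12 to 14)."""
    n_o = n_h - f_h + 1            # rows in the filtered image
    r = n_o % n                    # assumption: remainder of N_o, as in claim 10
    pre = t_pre(f_w, n, cycle)
    t_r = r * pre * f_h            # claim 14: in-row processing time
    return pre + f_w * f_h * ((n_o - r) // n) * cycle + t_r

even = total_time(n_h=10, f_w=3, f_h=3, n=4)    # N_o = 8, divisible by N
uneven = total_time(n_h=11, f_w=3, f_h=3, n=4)  # N_o = 9, r = 1
```

With r = 0 the in-row term vanishes and the expression reduces to the formula of claim 12.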
  16. The image processing method according to claim 1, wherein the width of the filtered image is $P_{read} \times \lceil (M_w - F_w + 1) / P_{read} \rceil$ and the height is $N_o$;
    where $M_w$ is the width of the image and $F_w$ is the width of the filter.
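The width expression in claim 16 rounds the number of valid output columns, M_w - F_w + 1, up to a whole multiple of P_read. A minimal sketch with hypothetical sample sizes; `filtered_width` is an illustrative name.

```python
import math

def filtered_width(m_w, f_w, p_read):
    """Output width padded up to a multiple of P_read, as stated in claim 16."""
    return p_read * math.ceil((m_w - f_w + 1) / p_read)

padded = filtered_width(m_w=100, f_w=3, p_read=16)  # 98 valid columns
exact = filtered_width(m_w=18, f_w=3, p_read=16)    # 16 valid columns
```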
  17. The image processing method according to claim 1, wherein the vector processing unit further comprises an internal storage unit, and the image is stored in an external storage unit;
    reading the $P_{read}$ original pixels of the image comprises:
    reading a part of the $P_{read}$ original pixels from the internal storage unit, the part of the original pixels having been stored in a previous operation cycle;
    reading the remaining original pixels of the $P_{read}$ original pixels from the external storage unit to obtain the $P_{read}$ original pixels;
    storing the remaining original pixels in the internal storage unit.
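The reuse scheme of claim 17 can be sketched as a small cache in front of the external image row. Everything here is a hypothetical model: the class `PixelReader`, the dictionary that stands in for the internal storage unit, and the read counter are illustrative and do not describe the actual hardware.

```python
class PixelReader:
    """Serve P_read pixels, taking overlapping ones from an internal cache."""

    def __init__(self, external_row):
        self.external = external_row  # stands in for the external storage unit
        self.cache = {}               # stands in for internal storage: column -> pixel
        self.external_reads = 0

    def read(self, start, p_read):
        out = []
        for col in range(start, start + p_read):
            if col not in self.cache:
                # Remaining pixels come from external storage and are then kept.
                self.cache[col] = self.external[col]
                self.external_reads += 1
            out.append(self.cache[col])
        return out

reader = PixelReader(list(range(10)))
first = reader.read(0, 4)   # all four pixels fetched externally
second = reader.read(1, 4)  # three pixels reused, one fetched externally
```

The cache in this sketch is never evicted; a real internal storage unit would be bounded.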
  18. The image processing method according to claim 5, wherein the vector processing unit further comprises an internal storage unit;
    when the N adjacent rows wrap around from the tail to the head in the height direction of the filter, the method further comprises: reading another $P_{read}$ original pixels from the internal storage unit, the other $P_{read}$ original pixels having been read from the image in a previous operation cycle and stored in the internal storage unit;
    each of the N coefficients located at the tail of the filter in the height direction is multiplied by the $P_{read}$ original pixels read from the image to obtain the product results;
    each of the N coefficients located at the head of the filter in the height direction is multiplied by the other $P_{read}$ original pixels read from the internal storage unit to obtain the product results.
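The wrap-around case of claim 18 can be illustrated by looking only at the filter row indices. The sketch below assumes 0-based row indices and N no larger than F_h; the helper names are hypothetical. Rows before the wrap are the "tail" part, paired with pixels freshly read from the image, and rows after the wrap are the "head" part, paired with pixels kept in internal storage.

```python
def coefficient_rows(start, n, f_h):
    """Indices of N adjacent filter rows from `start`, wrapping past the last row."""
    return [(start + i) % f_h for i in range(n)]

def split_wrapped(rows):
    """Split row indices into (tail part before the wrap, head part after it)."""
    for i in range(1, len(rows)):
        if rows[i] < rows[i - 1]:  # index dropped: the rows wrapped around
            return rows[:i], rows[i:]
    return rows, []

rows = coefficient_rows(start=2, n=2, f_h=3)  # rows 2 and 0 of a 3-row filter
tail, head = split_wrapped(rows)
```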
  19. An image processing apparatus, comprising:
    an external storage unit storing an image and a filter;
    a vector processing unit comprising a multiplier;
    wherein the vector processing unit is configured to read $P_{read}$ original pixels of the image, the value of $P_{read}$ being determined according to a memory access bit width corresponding to the vector processing unit, and to read N coefficients of the filter, the value of N being determined according to the number of multipliers of the vector processing unit, the filter being used to filter the image;
    the multiplier is configured to multiply each of the N coefficients by the $P_{read}$ original pixels respectively to obtain a plurality of product results, the product results being used to calculate pixel values of pixels in a filtered image.
  20. The image processing apparatus according to claim 18, wherein
    $N = N_{calc} / P_{read}$
    where $N_{calc}$ is the number of the vector processing units.
  21. The image processing apparatus according to claim 19, wherein the value of $P_{read}$ is equal to the maximum number of original pixels that the vector processing unit can read in each operation cycle.
  22. The image processing apparatus according to claim 20, wherein the maximum number is equal to the quotient of the memory access bit width and the bit width of the original pixels.
  23. The image processing apparatus according to claim 19, wherein the N coefficients are located in the same column of N adjacent rows of the filter.
  24. The image processing apparatus according to claim 23, wherein the vector processing unit, starting from the second row of the image, sequentially reads $F_w$ groups of original pixels in each row, wherein the n-th group of original pixels comprises the $P_{read}$ original pixels in columns n to $n + P_{read} - 1$, where $1 \le n \le F_w$ and $F_w$ is the width of the filter.
  25. The image processing apparatus according to claim 24, wherein
    starting from the second row of the filtered image, the operation time for the $P_{read}$ pixels of each row is:
    $F_w \times F_h \times \text{cycle}$
    where $F_w$ and $F_h$ are the width and height of the filter, respectively, and cycle is one operation cycle.
  26. The image processing apparatus according to claim 19, wherein when the $P_{read}$ original pixels read by the vector processing unit are located in the first row of the image, the vector processing unit performs the following preprocessing:
    the vector processing unit reads one coefficient or the N coefficients of the first row of the filter;
    the multiplier multiplies the one coefficient by the $P_{read}$ original pixels, or multiplies each of the N coefficients by the $P_{read}$ original pixels respectively, to obtain the product results.
  27. The image processing apparatus according to claim 26, wherein
    the operation time for the $P_{read}$ pixels of the first row of the filtered image is:
    $T_{pre} + F_w \times (F_h - 1) \times \text{cycle}$
    where $T_{pre}$ is the operation time of the preprocessing, $F_w$ and $F_h$ are the width and height of the filter, respectively, and cycle is one operation cycle.
  28. The image processing apparatus according to claim 19, wherein when the height of the filtered image is not divisible by N, and what the vector processing unit reads is a row of $P_{read}$ original pixels of the image used to calculate pixels in the last r rows of the filtered image, r being the remainder of the height of the filtered image divided by N, the vector processing unit performs the following in-row processing:
    the vector processing unit reads one coefficient or the N coefficients in one row of the filter;
    the multiplier multiplies the one coefficient by the $P_{read}$ original pixels, or multiplies each of the N coefficients by the $P_{read}$ original pixels respectively, to obtain the product results.
  29. The image processing apparatus according to claim 28, wherein
    the operation time for the $P_{read}$ pixels of each of the last r rows of the filtered image is:
    $T_{pre} \times F_h$
    where $T_{pre}$ is the operation time of the preprocessing and $F_h$ is the height of the filter.
  30. The image processing apparatus according to claim 19, wherein when the height of the filtered image is divisible by N, the operation time for the $N_o \times P_{read}$ pixels from the first row to the last row of the filtered image is:
    $T_{pre} + F_w \times F_h \times (N_o / N) \times \text{cycle}$
    where $N_o$ is the number of rows of the filtered image, $N_o = N_h - F_h + 1$, $N_h$ is the height of the image, $F_w$ and $F_h$ are the width and height of the filter, respectively, cycle is one operation cycle, and $T_{pre}$ is the operation time of the preprocessing.
  31. The image processing apparatus according to claim 19, wherein when the height of the filtered image is not divisible by N, the operation time for the $N_o \times P_{read}$ pixels from the first row to the last row of the filtered image is:
    $T_{pre} + F_w \times F_h \times ((N_o - r) / N) \times \text{cycle} + T_r$
    where $N_o$ is the number of rows of the filtered image, $N_o = N_h - F_h + 1$, $N_h$ is the height of the image, $F_w$ and $F_h$ are the width and height of the filter, respectively, r is the remainder of $N_h$ divided by N, $T_{pre}$ is the operation time of the preprocessing, and $T_r$ is the operation time of the in-row processing.
  32. The image processing apparatus according to claim 31, wherein the operation time $T_r$ of the in-row processing is:
    $r \times T_{pre} \times F_h$.
  33. The image processing apparatus according to any one of claims 27 and 29 to 32, wherein the operation time $T_{pre}$ of the preprocessing is:
    $(1 + \lceil (F_w - 1) / N \rceil) \times \text{cycle}$.
  34. The image processing apparatus according to claim 19, wherein the width of the filtered image is $P_{read} \times \lceil (M_w - F_w + 1) / P_{read} \rceil$ and the height is $N_o$;
    where $M_w$ is the width of the image and $F_w$ is the width of the filter.
  35. The image processing apparatus according to claim 19, wherein the vector processing unit further comprises an internal storage unit;
    the vector processing unit reads a part of the $P_{read}$ original pixels from the internal storage unit, the part of the original pixels having been stored in a previous operation cycle;
    the vector processing unit reads the remaining original pixels of the $P_{read}$ original pixels from the external storage unit to obtain the $P_{read}$ original pixels, and stores the remaining original pixels in the internal storage unit.
  36. The image processing apparatus according to claim 23, wherein the vector processing unit further comprises an internal storage unit; when the N adjacent rows wrap around from the tail to the head in the height direction of the filter,
    the vector processing unit reads another $P_{read}$ original pixels from the internal storage unit, the other $P_{read}$ original pixels having been read from the image in a previous operation cycle and stored in the internal storage unit;
    the vector processing unit multiplies each of the N coefficients located at the tail of the filter in the height direction by the $P_{read}$ original pixels read from the image to obtain the product results;
    the vector processing unit multiplies each of the N coefficients located at the head of the filter in the height direction by the other $P_{read}$ original pixels read from the internal storage unit to obtain the product results.
  37. A mobile device, comprising the image processing apparatus according to any one of claims 19 to 36.
  38. The mobile device according to claim 37, wherein the mobile device is at least one of a portable mobile terminal, an unmanned aerial vehicle, a handheld gimbal, and a remote controller.
PCT/CN2019/107299 2019-09-23 2019-09-23 Image processing method and apparatus, and mobile device WO2021056143A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/107299 WO2021056143A1 (en) 2019-09-23 2019-09-23 Image processing method and apparatus, and mobile device
CN201980033764.1A CN112154475A (en) 2019-09-23 2019-09-23 Image processing method and device and mobile equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/107299 WO2021056143A1 (en) 2019-09-23 2019-09-23 Image processing method and apparatus, and mobile device

Publications (1)

Publication Number Publication Date
WO2021056143A1 true WO2021056143A1 (en) 2021-04-01

Family

ID=73891508

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/107299 WO2021056143A1 (en) 2019-09-23 2019-09-23 Image processing method and apparatus, and mobile device

Country Status (2)

Country Link
CN (1) CN112154475A (en)
WO (1) WO2021056143A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1964490A (en) * 2005-11-09 2007-05-16 松下电器产业株式会社 A filter and filtering method
CN101072019A (en) * 2007-04-19 2007-11-14 华为技术有限公司 Wave filter and its filtering method
CN103218201A (en) * 2012-01-19 2013-07-24 联发科技(新加坡)私人有限公司 Digital signal processor and processing method
US20160205342A1 (en) * 2013-08-20 2016-07-14 Keisoku Giken Co., Ltd. Image processing apparatus and image processing method
US20180144452A1 (en) * 2016-11-18 2018-05-24 Canon Kabushiki Kaisha Image processing circuit with multipliers allocated based on filter coefficients
CN108416730A (en) * 2017-02-09 2018-08-17 深圳市中兴微电子技术有限公司 A kind of image processing method and device

Also Published As

Publication number Publication date
CN112154475A (en) 2020-12-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 19946313; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 19946313; Country of ref document: EP; Kind code of ref document: A1