WO2021056143A1

WO2021056143A1 - Image processing method and apparatus, and mobile device

Info

Publication number: WO2021056143A1
Application number: PCT/CN2019/107299
Authority: WO
Inventors: 仇晓颖; 韩彬; 吴迪
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2019-09-23
Filing date: 2019-09-23
Publication date: 2021-04-01
Also published as: CN112154475A

Abstract

An image processing method and apparatus, and a mobile device. The image processing method is applied to a vector processing unit, the vector processing unit comprising a multiplier, and the method comprising: reading P_read original pixels of an image, the value of P_read being determined according to an access bit width corresponding to the vector processing unit; reading N coefficients of a filter, the value of N being determined according to the number of multipliers of the vector processing unit, and the filter being used for filtering the image; and by means of the multiplier, multiplying each of the N coefficients by the P_read original pixels to obtain multiple product results, the product results being used for calculating pixel values of pixels in the filtered image.

Description

Image processing method, device and mobile equipment

Technical field

The present disclosure relates to the field of image processing, and in particular to an image processing method, device and mobile equipment.

Background art

Filtering is widely used in the field of image processing. When the image processing device executes the filtering algorithm, it first reads the original pixels of the image from the off-chip memory, and then uses the arithmetic unit to filter the original pixels.

In the prior art, in each operation cycle, several original pixels are usually read, and the original pixels are processed by several operation units. In this way, when the number of arithmetic units is several times the number of original pixels read each time, some arithmetic units will be idle during the filtering process and cannot be fully utilized.

For example, suppose an image processing device has eight arithmetic units. If only four original pixels can be read per operation cycle, only four arithmetic units take part in the operation and the remaining four cannot be used, which limits the overall image filtering performance and reduces the efficiency of image processing.

Summary of the invention

The present disclosure provides an image processing method applied to a vector processing unit that includes a multiplier. The method includes: reading P_read original pixels of an image, where the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit; reading N coefficients of a filter, where the value of N is determined according to the number of multipliers of the vector processing unit and the filter is used to filter the image; and multiplying, through the multiplier, each of the N coefficients by the P_read original pixels to obtain multiple product results, where the product results are used to calculate the pixel values of the pixels in the filtered image.

The present disclosure also provides an image processing device, which includes: an external storage unit that stores an image and a filter; and a vector processing unit that includes a multiplier. The vector processing unit is used to read P_read original pixels of the image, where the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit, and to read N coefficients of the filter, where the value of N is determined according to the number of multipliers of the vector processing unit and the filter is used to filter the image. The multiplier is used to multiply each of the N coefficients by the P_read original pixels to obtain multiple product results, where the product results are used to calculate the pixel values of the pixels in the filtered image.

The present disclosure also provides a mobile device, which includes: the above-mentioned image processing device.

The present disclosure reads N coefficients of the filter in each operation cycle, where the number N of coefficients read is determined according to the number of multipliers of the vector processing unit, and multiplies each of the N coefficients by the original pixels to obtain product results. Compared with the prior art, more multipliers take part in the filtering operation and the computing resources are fully utilized, which improves the overall performance of image filtering and the efficiency of image processing.

Description of the drawings

The accompanying drawings are used to provide a further understanding of the present disclosure and constitute a part of the specification. Together with the following specific embodiments, they are used to explain the present disclosure, but do not constitute a limitation to the present disclosure. In the drawings:

FIG. 1 is a flowchart of an image processing method according to an embodiment of the disclosure.

(a), (b), and (c) of FIG. 2 are schematic diagrams of the operations in the 1st to 3rd operation cycles, respectively.

(a), (b), (c), (d), and (e) of FIG. 3 are schematic diagrams of the operations in the 4th to 8th operation cycles, respectively.

(a), (b), (c), (d), and (e) of FIG. 4 are schematic diagrams of the operations in the 9th to 18th operation cycles, respectively.

(a) and (b) of FIG. 5 are schematic diagrams of the operations in the 19th and 23rd operation cycles, respectively.

(a) and (b) of FIG. 6 are schematic diagrams of the operations in the 24th and 28th operation cycles, respectively.

FIG. 7 is a schematic diagram of a process of an image processing method according to an embodiment of the disclosure.

(a), (b), and (c) of FIG. 8 are schematic diagrams of the operations in the 54th to 56th operation cycles, respectively.

FIG. 9 is a schematic diagram of a filtered image of the image processing method according to an embodiment of the disclosure.

FIG. 10 is a schematic structural diagram of an image processing apparatus according to an embodiment of the disclosure.

FIG. 11 is a schematic diagram of the structure when the arithmetic unit of the image processing device of the embodiment of the disclosure performs preprocessing.

FIG. 12 is a schematic diagram of the structure when the arithmetic unit of the image processing device of the embodiment of the disclosure performs parallel processing.

Detailed description

The technical solutions of the present disclosure will be clearly and completely described below in conjunction with the embodiments and the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present disclosure, rather than all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present disclosure.

An embodiment of the present disclosure provides an image processing method, which is applied to a vector processing unit, and the vector processing unit includes a multiplier. As shown in FIG. 1, the method includes:

Step S101: Read P_read original pixels of the image, where the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit;

Step S102: Read N coefficients of the filter, where the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image;

Step S103: Through the multiplier, multiply each of the N coefficients by the P_read original pixels to obtain multiple product results, where the product results are used to calculate the pixel values of the pixels in the filtered image.

After the product results are obtained, the product results that belong to the same pixel of the filtered image are added together to give the pixel value of that pixel. In this way the entire filtered image is obtained.
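The summation described above can be written compactly. The formula below is a sketch inferred from the cycle-by-cycle example later in this description, not a formula stated in the source: A denotes the filter coefficients, B the (padded) original pixels, C the filtered pixels, and F_w and F_h the filter width and height. The indexing convention (no flipping of the filter) is an assumption read off from that example.

```latex
C_{i,j} \;=\; \sum_{m=0}^{F_h-1} \sum_{n=0}^{F_w-1} A_{m,n}\, B_{i+m,\; j+n}
```

Each term A_{m,n} B_{i+m,j+n} corresponds to one product result, and the double sum is the accumulation performed over the operation cycles.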

The image processing method of this embodiment is run by an image processing device, and the image processing device includes: an external storage unit and a processor. The processor can be any type of chip with vector processing capabilities such as CPU, DSP, FPGA, etc.

Taking DSP as an example, there is a vector processing unit inside the DSP, and the vector processing unit can perform filtering and other processing on the image. For the vector processing unit, the external storage unit is its off-chip memory. The external storage unit stores the image to be processed and the filter. The vector processing unit uses the filter to filter the image, and the filtered image obtained can also be stored in an external storage unit.

The vector processing unit of the DSP includes multiple multiply-accumulate units (MAC, Multiply and ACcumulate). Each MAC includes one multiplier and one adder, which perform the multiplication and accumulation operations in filtering.

In step S101, P_read original pixels of the image are read, where the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit.

The entire filtering process takes multiple operation cycles to complete. In each operation cycle, the vector processing unit needs to read P_read original pixels of the image. To improve the efficiency of the image processing method, as many original pixels as possible should be read in each operation cycle. That is, the number P_read of original pixels read is equal to the maximum number of original pixels that the vector processing unit can read in each operation cycle.

When the vector processing unit reads the P_read original pixels from the storage unit, the maximum number that can be read depends on the memory access bit width of the data bus and the bit width of the original pixels. The pixel bit width is the number of bits used to represent each original pixel; for example, it can be 8 bits, 16 bits, 32 bits, and so on. The memory access bit width is the number of bit lines of the data bus, that is, the number of bits the data bus can transfer at a time, and is generally a multiple of 8; for example, it can be 8 bits, 16 bits, 32 bits, 64 bits, and so on. When the original pixel bit width is 16 bits and the memory access bit width is 64 bits, the vector processing unit can read 64/16 = 4 original pixels from the storage unit in each operation cycle. That is, the maximum number in this embodiment equals the quotient of the memory access bit width and the original pixel bit width.
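The quotient described above can be sketched in a few lines of Python. The function name `pixels_per_read` and its parameter names are hypothetical and chosen only for illustration; integer division is assumed so that only whole pixels are counted.

```python
def pixels_per_read(access_bit_width: int, pixel_bit_width: int) -> int:
    """Maximum number of whole pixels that fit in one bus transfer (P_read)."""
    if access_bit_width <= 0 or pixel_bit_width <= 0:
        raise ValueError("bit widths must be positive")
    # Quotient of the memory access bit width and the pixel bit width.
    return access_bit_width // pixel_bit_width


# 64-bit bus, 16-bit pixels, as in the example in the text.
print(pixels_per_read(64, 16))  # 4
```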

After reading the original pixels, step S102 reads the N coefficients of the filter, the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image.

In each operation cycle, the vector processing unit reads the N coefficients of the filter corresponding to the P_read original pixels. To improve the overall performance of image filtering and the efficiency of image processing, in this embodiment the number N of coefficients read each time is determined according to the number of multipliers. Specifically, if the vector processing unit includes N_calc multipliers, the number of coefficients is

N = N_calc / P_read.

The number of multipliers is usually an integer multiple of the number of original pixels read in each operation cycle, and that multiple is the number of filter coefficients read from the external storage unit in each operation cycle. The number of multipliers can be 8, 16, 32, and so on. When the vector processing unit reads 4 original pixels in each operation cycle and the number of multipliers is 8, N = 8/4 = 2, that is, 2 coefficients of the filter are read in each operation cycle.

If the number of multipliers is not an integer multiple of the number of original pixels read in each operation cycle, the value of N can be N_calc / P_read rounded up. For example, if the number of multipliers is 10 and the vector processing unit reads 4 original pixels per operation cycle, 10/4 = 2.5, so 3 coefficients are read. Any 2 of the coefficients are each multiplied by the 4 original pixels, which uses 8 multipliers; the remaining coefficient is multiplied by any two of the 4 original pixels, which uses the remaining 2 multipliers. In this way all 10 multipliers are used, and more multiplications are performed in one operation cycle.
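The choice of N and the way the multipliers are shared among the coefficients can be sketched as follows. The helper names `coefficients_per_cycle` and `multiplier_allocation` are hypothetical; the allocation order (full groups first, then the partial group) is one possible reading of the example, which allows any coefficients to be chosen.

```python
def coefficients_per_cycle(n_calc: int, p_read: int) -> int:
    """N = N_calc / P_read, rounded up when the division is not exact."""
    return -(-n_calc // p_read)  # integer ceiling division


def multiplier_allocation(n_calc: int, p_read: int) -> list[int]:
    """Number of pixels each coefficient is multiplied by in one cycle."""
    full_groups, leftover = divmod(n_calc, p_read)
    allocation = [p_read] * full_groups      # coefficients that use P_read multipliers
    if leftover:
        allocation.append(leftover)          # last coefficient uses the remaining multipliers
    return allocation


print(coefficients_per_cycle(8, 4), multiplier_allocation(8, 4))    # 2 [4, 4]
print(coefficients_per_cycle(10, 4), multiplier_allocation(10, 4))  # 3 [4, 4, 2]
```

In both cases the entries of the allocation add up to the number of multipliers, which is the sense in which every multiplier is kept busy.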

After reading the original pixels and the filter coefficients, step S103 uses the multiplier to multiply each of the N coefficients by the P_read original pixels to obtain multiple product results. The product results are used to calculate the pixel values of the pixels in the filtered image. The image processing method of this embodiment includes preprocessing and parallel processing.

When the vector processing unit reads P_read original pixels in the first line of the image, preprocessing is performed. In the preprocessing, the vector processing unit reads the P_read original pixels in the first line of the image and either one coefficient or N coefficients in the first line of the filter, and multiplies the one coefficient by the P_read original pixels, or multiplies each of the N coefficients by the P_read original pixels, to obtain product results.

Starting from the second line of the image, that is, when the vector processing unit reads P_read original pixels in the lines after the first line of the image, parallel processing is performed. In parallel processing, the vector processing unit reads the P_read original pixels in a line of the image and N coefficients of the filter, where the N coefficients are located in the same column of N adjacent rows of the filter. The N_calc multipliers multiply each of the N coefficients by the P_read original pixels to obtain the product results.

In the following, in conjunction with the accompanying drawings, taking the number of multipliers as 8, reading 4 original pixels and 2 coefficients of the filter in each operation cycle as an example, the above processing procedure will be described.

Assume that the original size of the image is 16×4. To make the size of the filtered image the same as the original size of the image, this embodiment may pad the image. The size of the padded image is 20×8, that is, the width M_w is 20 and the height N_h is 8. Compared with the original image, the padded image has additional elements outside the original image boundary. The padding can simply copy the elements adjacent to the boundary into the area outside the boundary of the original image, or it can use preset padding elements. This is only an exemplary description and does not limit the padding method of the present disclosure. The size of the filter is 5×5, that is, both the width F_w and the height F_h are 5. The filter performs filtering operations on the padded image. The following takes two-dimensional convolution as an example to describe the filtering operation in detail.
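The sizes in this example can be checked with a short sketch. It assumes that the padding adds (F − 1)/2 = 2 elements on every side by copying the nearest boundary element, and that the filtered size is the padded size minus the filter size plus one; both are assumptions consistent with the numbers given here rather than statements from the source. The function names are hypothetical.

```python
def pad_replicate(image: list[list[int]], pad: int) -> list[list[int]]:
    """Pad a 2D list on all sides by copying the nearest boundary element."""
    height, width = len(image), len(image[0])
    padded = []
    for r in range(-pad, height + pad):
        src_r = min(max(r, 0), height - 1)   # clamp row index to the image
        padded.append([image[src_r][min(max(c, 0), width - 1)]
                       for c in range(-pad, width + pad)])
    return padded


def filtered_size(padded_w: int, padded_h: int, f_w: int, f_h: int) -> tuple[int, int]:
    """Width and height of the filtered image when no further padding is applied."""
    return padded_w - f_w + 1, padded_h - f_h + 1


image = [[r * 16 + c for c in range(16)] for r in range(4)]  # 16 wide, 4 high
padded = pad_replicate(image, pad=2)
print(len(padded[0]), len(padded))   # 20 8
print(filtered_size(20, 8, 5, 5))    # (16, 4)
```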

In two-dimensional convolution, the original pixels of some lines of the image are used only once when the filtering results are calculated. When the product operations of the filtering are performed on the original pixels of other lines, the original pixels of these lines are not reused. Therefore, the original pixels of these lines can be calculated by in-line preprocessing.

In the above example, the first line of the image is such a line. When the filtering results are calculated, the original pixels in the first line are used only once, and they are not multiplexed when the product operations of the filtering are performed on the original pixels in the other lines. Therefore, the original pixels of the first line are calculated by preprocessing. The preprocessing of the first line of the image is described below with reference to FIG. 2 and FIG. 7. The preprocessing process includes:

As shown in Figure 2(a), in the first operation cycle (cycle 1):

Read the 4 original pixels of the first line of the image, namely B_{0,0}, B_{0,1}, B_{0,2}, B_{0,3};

Read one coefficient A_{0,0} of the first line of the filter;

Four multipliers multiply the coefficient A_{0,0} by the four original pixels B_{0,0}, B_{0,1}, B_{0,2}, and B_{0,3} of the image, respectively, to obtain the product results of the coefficient A_{0,0}.

As shown in Figure 2(b), in the second operation cycle (cycle 2):

Read the 4 original pixels of the first line of the image, namely B_{0,2}, B_{0,3}, B_{0,4}, B_{0,5};

Read the 2 coefficients A_{0,1} and A_{0,2} of the first line of the filter;

Four multipliers multiply the coefficient A_{0,1} by the four original pixels B_{0,1}, B_{0,2}, B_{0,3}, and B_{0,4} of the image, respectively, to obtain the product results of the coefficient A_{0,1}. The other four multipliers multiply the coefficient A_{0,2} by the four original pixels B_{0,2}, B_{0,3}, B_{0,4}, and B_{0,5} of the image, respectively, to obtain the product results of the coefficient A_{0,2}. The two sets of product results are accumulated onto the product results of cycle 1.

As shown in Figure 2(c), in the third operation cycle (cycle 3):

Read the 4 original pixels of the first line of the image, namely B_{0,4}, B_{0,5}, B_{0,6}, B_{0,7};

Read the 2 coefficients A_{0,3} and A_{0,4} of the first line of the filter;

Four multipliers multiply the coefficient A_{0,3} by the four original pixels B_{0,3}, B_{0,4}, B_{0,5}, and B_{0,6} of the image, respectively, to obtain the product results of the coefficient A_{0,3}. The other four multipliers multiply the coefficient A_{0,4} by the four original pixels B_{0,4}, B_{0,5}, B_{0,6}, and B_{0,7} of the image, respectively, to obtain the product results of the coefficient A_{0,4}. The two sets of product results are accumulated onto the accumulation result of cycle 2.

After 3 calculation cycles, the preprocessing is completed, and the product result obtained by the preprocessing is used to calculate the four pixels in the first line of the filtered image.
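The three preprocessing cycles can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the device's implementation: the function name `preprocess_first_row` is hypothetical, the image and filter rows are plain Python lists, and the grouping "one coefficient in cycle 1, then N coefficients per cycle" follows the walkthrough above.

```python
def preprocess_first_row(b0: list[int], a0: list[int], p_read: int = 4, n: int = 2):
    """Accumulate the first-row products for the first p_read filtered pixels.

    b0 is the first image row and a0 the first filter row. Coefficient a0[c]
    is multiplied by pixels b0[c] .. b0[c + p_read - 1], as in cycles 1 to 3.
    Returns the accumulators and the number of operation cycles used.
    """
    acc = [0] * p_read
    # Cycle 1 handles one coefficient; each later cycle handles up to n.
    groups = [[0]] + [list(range(k, min(k + n, len(a0)))) for k in range(1, len(a0), n)]
    for cols in groups:                      # one group per operation cycle
        for c in cols:
            for j in range(p_read):
                acc[j] += a0[c] * b0[j + c]  # one multiplier per (coefficient, pixel)
    return acc, len(groups)


b0 = list(range(1, 21))        # hypothetical first row of the padded image
a0 = [1, 2, 3, 4, 5]           # hypothetical first row of the 5x5 filter
acc, cycles = preprocess_first_row(b0, a0)
print(cycles)                  # 3
```

The accumulators hold only the contribution of the first image line; the contributions of the following lines are added by the parallel processing.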

The preprocessing of the first line is described above. By analogy, the above steps can be repeated from the second line of the image, performing in-line preprocessing on the other lines of the image to obtain the product results used to calculate the pixels of each line of the filtered image. Adding the product results that belong to the same pixel of the filtered image gives the pixel value of that pixel. Performing two-dimensional filtering on the image with in-line preprocessing improves the computational efficiency of the two-dimensional filtering, the overall performance of image filtering, and the efficiency of image processing.

As mentioned earlier, in two-dimensional convolution the original pixels of some lines of the image are used only once and are not multiplexed. For other lines of the image, however, the original pixels are used multiple times when the filtering results are calculated, so the original pixels of these lines can be reused when the product operations of the filtering are performed. Therefore, the original pixels of these lines can be processed by in-line parallel processing.

In the above example, this is the case starting from the second line of the image. In this embodiment, parallel processing can be performed on other rows after the first row, and the parallel processing can further improve the operation efficiency of two-dimensional filtering and the overall performance of image filtering. Please refer to Figure 3 and Figure 7 together to introduce the parallel processing process. The parallel processing process includes:

As shown in Figure 3(a), in the fourth operation cycle (cycle 4):

Read the 4 original pixels of the second line of the image, namely B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3};

Read the two coefficients A_{0,0} and A_{1,0} in the first column of the first row and the second row of the filter, which are two adjacent rows;

Eight multipliers multiply the coefficients A_{0,0} and A_{1,0} by the four original pixels B_{1,0}, B_{1,1}, B_{1,2}, and B_{1,3} of the image, respectively, to obtain the product results of the coefficients A_{0,0} and A_{1,0}. The 4 product results of the coefficient A_{1,0} are used to calculate the four pixels in the first line of the filtered image and are added to the accumulation result of cycle 3; the 4 product results of the coefficient A_{0,0} are used to calculate the four pixels in the second line of the filtered image.

As shown in Figure 3(b), in the fifth operation cycle (cycle 5):

Read the 4 original pixels of the second line of the image, namely B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4};

Read the two coefficients A_{0,1} and A_{1,1} in the second column of the first row and the second row of the filter, which are two adjacent rows;

Eight multipliers multiply the coefficients A_{0,1} and A_{1,1} by the four original pixels B_{1,1}, B_{1,2}, B_{1,3}, and B_{1,4} of the image, respectively, to obtain the product results of the coefficients A_{0,1} and A_{1,1}. The 4 product results of the coefficient A_{1,1} are used to calculate the four pixels in the first line of the filtered image and are added to the accumulation result of cycle 4; the 4 product results of the coefficient A_{0,1} are used to calculate the four pixels in the second line of the filtered image and are added to the product results of cycle 4.

As shown in Figure 3(c), in the sixth cycle (cycle 6):

Read the 4 original pixels of the second line of the image, namely B_{1,2}, B_{1,3}, B_{1,4}, B_{1,5};

Read the two coefficients A_{0,2} and A_{1,2} in the third column of the first row and the second row of the filter, which are two adjacent rows;

Eight multipliers multiply the coefficients A_{0,2} and A_{1,2} by the four original pixels B_{1,2}, B_{1,3}, B_{1,4}, and B_{1,5} of the image, respectively, to obtain the product results of the coefficients A_{0,2} and A_{1,2}. The 4 product results of the coefficient A_{1,2} are used to calculate the four pixels in the first line of the filtered image and are added to the accumulation result of cycle 5; the 4 product results of the coefficient A_{0,2} are used to calculate the four pixels in the second line of the filtered image and are added to the accumulation result of cycle 5.

As shown in Figure 3(d), in the seventh cycle (cycle 7):

Read the 4 original pixels of the second line of the image, namely B_{1,3}, B_{1,4}, B_{1,5}, B_{1,6};

Read the two coefficients A_{0,3} and A_{1,3} in the fourth column of the first row and the second row of the filter, which are two adjacent rows;

Eight multipliers multiply the coefficients A_{0,3} and A_{1,3} by the four original pixels B_{1,3}, B_{1,4}, B_{1,5}, and B_{1,6} of the image, respectively, to obtain the product results of the coefficients A_{0,3} and A_{1,3}. The 4 product results of the coefficient A_{1,3} are used to calculate the four pixels in the first line of the filtered image and are added to the accumulation result of cycle 6; the 4 product results of the coefficient A_{0,3} are used to calculate the four pixels in the second line of the filtered image and are added to the accumulation result of cycle 6.

As shown in Figure 3(e), in the eighth cycle (cycle 8):

Read the 4 original pixels of the second line of the image, namely B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7};

Read the two coefficients A_{0,4} and A_{1,4} in the fifth column of the first row and the second row of the filter, which are two adjacent rows;

Eight multipliers multiply the coefficients A_{0,4} and A_{1,4} by the four original pixels B_{1,4}, B_{1,5}, B_{1,6}, and B_{1,7} of the image, respectively, to obtain the product results of the coefficients A_{0,4} and A_{1,4}. The 4 product results of the coefficient A_{1,4} are used to calculate the four pixels in the first line of the filtered image and are added to the accumulation result of cycle 7; the 4 product results of the coefficient A_{0,4} are used to calculate the four pixels in the second line of the filtered image and are added to the accumulation result of cycle 7.

After 5 operation cycles, the parallel processing of the 8 original pixels in the second line of the image is completed. In each of these 5 operation cycles, 8 multipliers multiply 2 coefficients of the filter by 4 original pixels in parallel, so that the product results used to calculate the 4 pixels in the first line and the 4 pixels in the second line of the filtered image are obtained in parallel.
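The five parallel cycles for one image line can be sketched as below. The function name `parallel_row` and its argument names are hypothetical, and the model treats each loop iteration over a filter column as one operation cycle in which 2 × 4 = 8 multiplications take place, following the walkthrough above.

```python
def parallel_row(b_row, a_upper, a_lower, acc_prev, acc_next, p_read=4):
    """Process one image line against two adjacent filter rows.

    Products with a_lower (the lower filter row) are accumulated into
    acc_prev, the earlier line of the filtered image; products with a_upper
    are accumulated into acc_next, the later line. Returns the cycle count.
    """
    for col in range(len(a_upper)):          # one operation cycle per filter column
        for j in range(p_read):
            pixel = b_row[j + col]           # each pixel is shared by both coefficients
            acc_prev[j] += a_lower[col] * pixel
            acc_next[j] += a_upper[col] * pixel
    return len(a_upper)


b1 = list(range(20))                         # hypothetical second line of the image
a_row0 = [1, 2, 3, 4, 5]                     # hypothetical first row of the filter
a_row1 = [6, 7, 8, 9, 10]                    # hypothetical second row of the filter
acc_line0, acc_line1 = [0] * 4, [0] * 4
cycles = parallel_row(b1, a_row0, a_row1, acc_line0, acc_line1)
print(cycles)                                # 5
```

Reading each pixel once and using it for two coefficients is what lets 8 multipliers stay busy even though only 4 pixels arrive per cycle.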

The parallel processing of the third line of the image is similar to the parallel processing of the second line above. The difference is that the 4 original pixels of the third line of the image are read, together with the coefficients in the same column of the second row and the third row of the filter, which are two adjacent rows. Refer to FIG. 4(a) to FIG. 4(e) and FIG. 7; the specific process is not repeated here. By analogy, when the original pixels of the fifth line of the image are read, the parallel processing is as follows:

As shown in Figure 5(a) and Figure 7, in the 19th operation cycle (cycle 19):

Read the 4 original pixels of the fifth line of the image, namely B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3};

Read the two coefficients A_{3,0} and A_{4,0} in the first column of the fourth row and the fifth row of the filter, which are two adjacent rows;

Eight multipliers multiply the coefficients A_{3,0} and A_{4,0} by the four original pixels B_{4,0}, B_{4,1}, B_{4,2}, and B_{4,3} of the image, respectively, to obtain the product results of the coefficients A_{3,0} and A_{4,0}. The 4 product results of the coefficient A_{4,0} are used to calculate the four pixels in the first line of the filtered image and are added to the accumulation result of cycle 18; the 4 product results of the coefficient A_{3,0} are used to calculate the four pixels in the second line of the filtered image and are added to the accumulation result of cycle 18.

By analogy, the calculation process of cycle 20, 21, and 22 is similar to cycle 19. As shown in Figure 5(b) and Figure 7, in the 23rd operation cycle (cycle 23):

Read the 4 original pixels of the fifth line of the image, namely B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7};

Read the 2 coefficients A_{3,4} and A_{4,4} in the fifth column of the fourth row and the fifth row of the filter, which are two adjacent rows;

Eight multipliers multiply the coefficients A_{3,4} and A_{4,4} by the four original pixels B_{4,4}, B_{4,5}, B_{4,6}, and B_{4,7} of the image, respectively, to obtain the product results of the coefficients A_{3,4} and A_{4,4}. The four product results of the coefficient A_{4,4} are used to calculate the four pixels in the first line of the filtered image and are added to the accumulation result of cycle 22. The accumulation result of cycle 23 is the four pixels C_{0,0}-C_{0,3} of the first line of the filtered image. The four product results of the coefficient A_{3,4} are used to calculate the four pixels in the second line of the filtered image and are added to the accumulation result of cycle 22.

It can be seen that after 23 calculation cycles, four pixels in the first row of the filtered image are obtained.

The filtering process continues. Because the coefficients of the fourth and fifth rows of the filter are read in cycle 23, in the next operation cycle the two adjacent rows of the filter wrap around in the height direction. In this case the original pixels of the next line read from the image cannot be multiplexed, and the two coefficients of the filter operate on original pixels of different lines. Parallel processing is performed as follows:

Another P_read original pixels are read from the internal storage unit of the vector processing unit; these other P_read original pixels were read from the image in a previous operation cycle and stored in the internal storage unit.

Each of the N coefficients that is located at the tail of the filter in the height direction is multiplied by the P_read original pixels of a line read from the image to obtain product results;

Each of the N coefficients that is located at the head of the filter in the height direction is multiplied by the other P_read original pixels read from the internal storage unit to obtain product results.

The following takes the 24th to 28th operation cycle as an example to describe the above process.

As shown in Figure 6(a) and Figure 7, in the 24th operation cycle (cycle 24):

Read the 4 original pixels of the sixth line of the image from the external storage unit, namely B_{5,0}, B_{5,1}, B_{5,2}, B_{5,3}, and read the 4 original pixels of the third line of the image from the internal storage unit, namely B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3}. The four original pixels B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3} were read from the image in cycle 9 and stored in the internal storage unit.

Read the two coefficients A_{4,0} and A_{0,0} in the first column of the fifth row and the first row of the filter. Four multipliers multiply the filter head coefficient A_{0,0} by the four original pixels B_{2,0}, B_{2,1}, B_{2,2}, and B_{2,3} of the image, respectively, to obtain the product results of the coefficient A_{0,0}; the 4 product results of the coefficient A_{0,0} are used to calculate the four pixels in the third line of the filtered image. The other four multipliers multiply the filter tail coefficient A_{4,0} by the four original pixels B_{5,0}, B_{5,1}, B_{5,2}, and B_{5,3} of the image, respectively, to obtain the product results of the coefficient A_{4,0}; the 4 product results of the coefficient A_{4,0} are used to calculate the four pixels in the second line of the filtered image.

By analogy, the calculation process of cycle 25, 26, and 27 is similar to cycle 24. As shown in Figure 6(b) and Figure 7, in the 28th operation cycle (cycle 28):

Read the 4 original pixels of the fifth row of the image from the external storage unit, namely B ₅ , 4, B ₅ , ₅ , B ₅ , 6, B 5, 7, read the 4 of the third row of the image from the internal storage unit Original pixels, namely B _2,4 , B _2,5 , B _2,6 , B _2,7 . Among them, the _{4 original pixels B 2,4} , B _2,5 , B _2,6 , B _{2,7 are} read from the image in cycle 13 and stored in the internal storage unit.

_{Read the two coefficients A 4} , 4, A ₀ , 4 of the fifth row of the filter and the fifth column of two adjacent rows of the first row _{; the four multipliers divide the coefficients A 0} , 4 of the filter head with the coefficients A 0, 4 respectively 4 original image pixels B _{_{_{2,4, B 2,5, B 2,6,}}} B 2,7 multiplied by the multiplication results of the coefficients a _0,4, the coefficient a 4 _0,4-pieces of multiplied results for Calculate the four pixels in the third line of the filtered image and add them to the accumulation result of cycle 27; the other four multipliers combine the filter tail coefficients A ₄ , 4 with the four original pixels B ₅ , 4, B of the image, respectively _{_{_5,5,}} B _5,6, B _5,6 multiplied by the coefficient a _{4 and 4} multiplication results, coefficients a 4 _4,4 multiplication results for the second row of the four pixels is calculated after the filtered image , And accumulate to the accumulation result of cycle 27. The accumulation result of cycle 28 is the four pixel points C _1,0 -C _{1,3 in the} second line of the filtered image.

It can be seen that after 28 operation cycles, the four pixels C_{0,0} to C_{0,3} in the first row and the four pixels C_{1,0} to C_{1,3} in the second row of the filtered image are obtained.

The above parallel processing is executed repeatedly. Starting from cycle 28, after another 24 operation cycles the four pixels C_{2,0} to C_{2,3} of the third row of the filtered image are obtained. After one more operation cycle, that is, 25 operation cycles after cycle 28 and 53 operation cycles in total, the four pixels C_{3,0} to C_{3,3} of the fourth row are obtained, which gives the pixels in columns 1 to 4 of all four rows of the filtered image.

In the above description, the height N_h of the image is 8 and the height F_h of the filter is 5, so the height N_o of the filtered image is 4. The number N of filter coefficients read in each operation cycle is 2, and the height N_o of the filtered image is divisible by N. In this case, after the above 53 operation cycles, the filtering of the pixels in the first to fourth columns of the filtered image is complete. If the height N_o of the filtered image is not evenly divisible by the number N of filter coefficients, the filtering of the pixels in the first to fourth columns of the filtered image is not yet complete, and the image must be processed further to obtain the pixels of the remaining lines of the filtered image. Because the remaining lines of the filtered image cannot be processed in parallel, the following in-line processing is performed:

Read one line of P_read original pixels of the image that is used to calculate the last r lines of pixels of the filtered image, where r is the remainder of the filtered image height N_o divided by N;

Read one coefficient or the N coefficients in a row of the filter;

Through the multiplier, the one coefficient is multiplied by the P_read original pixels, or each of the N coefficients is multiplied by the P_read original pixels, to obtain a product result.

For example, assuming that the height N_h of the image is 9 and the filter height is still 5, the height N_o of the corresponding filtered image is 5. Because N_o is not divisible by N, the filtering of the pixels in columns 1 to 4 of the filtered image is not yet complete, and the image must be processed further to obtain the pixels in the fifth row of the filtered image. Because only one row of pixels remains, namely the fifth row of the filtered image, it cannot be processed in parallel and is processed in-line instead.

In the in-line processing, the above-mentioned preprocessing process is performed once for each of the 5th to 9th rows of the image. That is, for each of the 5th to 9th rows of the image, read one or N coefficients in a row of the filter, and multiply the one or N coefficients with each original pixel of the row to obtain the product result. In the in-line processing process, take the fifth line of the image as an example, the processing process specifically includes:

As shown in Figure 8(a), in the 54th operation cycle (cycle 54):

Read one coefficient A_{0,0} of the first line of the filter;

The four multipliers multiply the coefficient A_{0,0} by the four original pixels B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3} of the image, respectively, to obtain the product result of coefficient A_{0,0}.

As shown in Figure 8(b), in the 55th operation cycle (cycle 55):

Read the 4 original pixels of the fifth line of the image, namely B _4,2 , B _4,3 , B _4,4 , B _4,5 ;

Read the 2 coefficients A_{0,1} and A_{0,2} of the first line of the filter;

Four multipliers multiply the coefficient A_{0,1} by the four original pixels B_{4,1}, B_{4,2}, B_{4,3}, B_{4,4} of the image, respectively, to obtain the product result of coefficient A_{0,1}. The other four multipliers multiply the coefficient A_{0,2} by the four original pixels B_{4,2}, B_{4,3}, B_{4,4}, B_{4,5} of the image, respectively, to obtain the product result of coefficient A_{0,2}. The two product results are accumulated onto the product result of cycle 54.

As shown in Figure 8(c), in the 56th operation cycle (cycle 56):

Read the 2 coefficients A_{0,3} and A_{0,4} of the first line of the filter;

Four multipliers multiply the coefficient A_{0,3} by the four original pixels B_{4,3}, B_{4,4}, B_{4,5}, B_{4,6} of the image to obtain the product result of coefficient A_{0,3}. The other four multipliers multiply the coefficient A_{0,4} by the four original pixels B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7} of the image to obtain the product result of coefficient A_{0,4}. The two product results are accumulated onto the accumulation result of cycle 55.

After these three calculation cycles, the preprocessing of the fifth line of the image is complete, and the product results obtained in each of the foregoing calculation cycles are used to calculate the four pixels of the fifth line of the filtered image.

By analogy, the above steps are repeated: the above preprocessing is performed on the 6th to 9th rows of the image with the corresponding 2nd to 5th rows of the filter, and the preprocessing of each row requires 3 operation cycles. After 15 operation cycles of in-line operation, that is, 68 operation cycles in total, the pixels of the fifth row of the filtered image are obtained. The pixels in columns 1 to 4 of all five rows of the filtered image are then available, and the filtering of columns 1 to 4 of the filtered image is complete.

When the target elements in the first to fourth columns of the filtered image are obtained, the entire process of the above preprocessing, parallel processing, and inline processing (if any) is repeated continuously, as shown in Figure 9, taking the height of the filtered image as 4 as an example, the pixels in the fifth column to the eighth column, the ninth column to the 12th column, and the 13th column to the 16th column of the filtered image are sequentially obtained, thereby completing the entire filtering process. The width of the filtered image M _o =P _read *ceil((M _w -F _w +1)/P _read ), the height N _o =N _h -F _h +1; where M _w is the width of the image. When P _read is 4, M _w is 20, N _h is 8, F _w and F _h are 5, the image size after filtering is 16×4, that is, the width M _o is 16 and the height N _o is 4.
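The size formulas above can be checked with a few lines of Python. This is only a sketch: the function name `filtered_size` and its argument names are illustrative and do not appear in the disclosure.

```python
import math


def filtered_size(m_w, n_h, f_w, f_h, p_read):
    """Return (M_o, N_o), the width and height of the filtered image.

    M_o = P_read * ceil((M_w - F_w + 1) / P_read)
    N_o = N_h - F_h + 1
    """
    m_o = p_read * math.ceil((m_w - f_w + 1) / p_read)
    n_o = n_h - f_h + 1
    return m_o, n_o


# Values from the example in the text: P_read = 4, M_w = 20, N_h = 8, F_w = F_h = 5.
print(filtered_size(m_w=20, n_h=8, f_w=5, f_h=5, p_read=4))
```

With the example values this yields a width of 16 and a height of 4, matching the 16×4 filtered image stated above.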

In the above example, the image processing method of this embodiment is described with the number of vector processing units N_calc being 8, the number of original pixels P_read read in each operation cycle being 4, the number of filter coefficients N being 2, the image width M_w being 20, and the filter width F_w and height F_h both being 5. However, it should be clear to those skilled in the art that the values of the above parameters are not limited to these. The above parameters in this embodiment can also take other values. When they do, the image processing method is similar to the above description, and those skilled in the art should be fully aware of the specific process of the image processing method.

In the image processing method of this embodiment, the calculation time of the P_read pixels in the first row of the filtered image is:

T_pre + F_w × (F_h - 1) × cycle

where T_pre is the calculation time of the above preprocessing, F_w and F_h are the width and height of the filter, respectively, and cycle is one calculation cycle. The preprocessing calculation time T_pre is (1 + ceil((F_w - 1)/N)) × cycle.

In the above example, the width F_w and the height F_h of the filter are both 5, N is 2, and the preprocessing calculation time T_pre is 3 calculation cycles, so the calculation time is 3 + 5 × 4 = 23 calculation cycles. In other words, after 23 operation cycles, the 4 pixels in the first row of the filtered image are obtained.
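As a quick arithmetic check of the two formulas for T_pre and the first-row calculation time, the following sketch evaluates them in units of operation cycles. The function names `t_pre` and `first_row_cycles` are illustrative only.

```python
import math


def t_pre(f_w, n):
    """Preprocessing time in cycles: 1 + ceil((F_w - 1) / N)."""
    return 1 + math.ceil((f_w - 1) / n)


def first_row_cycles(f_w, f_h, n):
    """Cycles until the first row of the filtered image is ready:
    T_pre + F_w * (F_h - 1)."""
    return t_pre(f_w, n) + f_w * (f_h - 1)


# Example from the text: F_w = F_h = 5, N = 2.
print(t_pre(5, 2), first_row_cycles(5, 5, 2))
```

For the example parameters this gives 3 preprocessing cycles and 23 cycles in total for the first row.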

Starting from the second line of the filtered image, the calculation time of the P_read pixels in each line in the parallel processing is:

F _w ×F _h ×cycle

In the above example, the width F _w and the height F _h of the filter are both 5, and the calculation time is 25 calculation cycles. In other words, 4 pixels in each row from the second row to the fourth row of the filtered image require 25 operation cycles.

If the filtering process includes intra-line operations, and the last r lines of the filtered image are obtained through them, then the calculation time of the P_read pixels in each of the last r lines of the filtered image is T_pre × F_h.

In the above example, assuming that the height N_h of the image is 9, the last line of the filtered image, that is, the fifth line, is obtained by intra-line calculation. The height F_h of the filter is still 5, so the calculation time is 3 × 5 = 15 calculation cycles. In other words, the 4 pixels in the fifth row of the filtered image are obtained in 15 operation cycles.

In the image processing method of this embodiment, when the height N_o of the filtered image is divisible by the number N of filter coefficients, that is, when there is no in-line processing, the calculation time of the N_o × P_read pixels from the first line to the last line of the filtered image is:

T_pre + F_w × F_h × (N_o / N) × cycle

where N_o is the height of the filtered image, N_o = N_h - F_h + 1, and N_h is the height of the image.

In the above example, the width F _w and the height F _h of the filter are both 5, and the preprocessing calculation time T _pre is 3 calculation cycles. If the height N _h of the image is 8, the height N _o of the filtered image is 4. Therefore, the calculation time of 4×4=16 pixels from the first row to the fourth row of the filtered image is 53 calculation cycles.

When the height N_o of the filtered image is not divisible by the number N of filter coefficients, that is, when there is in-line processing, the calculation time of the N_o × P_read pixels from the first line to the last line of the filtered image is:

T_pre + F_w × F_h × ((N_o - r) / N) × cycle + T_r

where r is the remainder of N_o divided by N, that is, the number of lines of the filtered image that need to be processed in-line, and T_r is the calculation time of the in-line processing. The in-line calculation time T_r is r × T_pre × F_h.

In the above example, the width F_w and the height F_h of the filter are both 5, and the preprocessing calculation time T_pre is 3 calculation cycles. If the height N_h of the image is 9, the height N_o of the filtered image is 5 and r is 1. Substituting into the above formula, the calculation time of the 5×4=20 pixels from the first row to the fifth row of the filtered image is 3 + 25 × 2 + 15 = 68 calculation cycles.
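The two total-time formulas (with and without in-line processing) can be combined into one small function, since the in-line term vanishes when r = 0. This is a sketch for checking the arithmetic only; `total_cycles` and its helper are illustrative names, and times are counted in operation cycles.

```python
import math


def t_pre(f_w, n):
    """Preprocessing time in cycles: 1 + ceil((F_w - 1) / N)."""
    return 1 + math.ceil((f_w - 1) / n)


def total_cycles(n_h, f_w, f_h, n):
    """Cycles needed for one P_read-wide column block of the filtered image.

    T_pre + F_w * F_h * ((N_o - r) / N) + T_r, with T_r = r * T_pre * F_h
    and r = N_o mod N (r = 0 means no in-line processing).
    """
    n_o = n_h - f_h + 1           # height of the filtered image
    r = n_o % n                   # rows left over for in-line processing
    pre = t_pre(f_w, n)
    parallel = f_w * f_h * ((n_o - r) // n)
    t_r = r * pre * f_h
    return pre + parallel + t_r


print(total_cycles(n_h=8, f_w=5, f_h=5, n=2))  # image height 8, N_o = 4
print(total_cycles(n_h=9, f_w=5, f_h=5, n=2))  # image height 9, N_o = 5
```

For image heights 8 and 9 this evaluates to 53 and 68 cycles respectively, in line with the figures quoted in the text.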

In the above description, the original pixels in this embodiment are read in order, that is, from top to bottom in the image height direction, from the first row to the last row, and from left to right within each row. The pixels of the filtered image are also output in order. When reading sequentially, reading the P_read original pixels of one line of the image includes:

Starting from the second row of the image, sequentially read F_w original pixel groups in each row, where the n-th original pixel group comprises the P_read original pixels from column n to column n + P_read - 1, and 1 ≤ n ≤ F_w.

For example, in the above example, the original pixels of each row are read from top to bottom in the height direction of the image. In each row, 5 groups of original pixels are sequentially read. The first group of original pixels includes 4 columns from column 1 to column 4. Original pixels, such as B _1,0 , B _1,1 , B _1,2 , B _1,3 ; the second group of original pixels includes 4 original pixels from column 2 to column 5, such as B _1,1 , B _1,2 , B _1,3 , B _1,4 ; the third group of original pixels includes 4 original pixels from the third column to the sixth column, such as B _1,2 , B _1,3 , B _1,4 , B _1,5 ; The 4th group of original pixels include 4 original pixels from the 4th column to the 7th column, such as B _1,3 , B _1,4 , B _1,5 , B _1,6 ; the 5th group of original pixels Including the 4 original pixels in the fifth column to the eighth column, for example, B _1,4 , B _1,5 , B _1,6 , B _1,7 .
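The grouping rule can be sketched as follows. The function name `pixel_groups` is illustrative; the returned column indices are 0-based so that they match the second subscript of B (for example, index 0 corresponds to column 1 and to B_{1,0}).

```python
def pixel_groups(f_w, p_read):
    """Return the F_w groups of 0-based column indices read in one row.

    Group n (1-based in the text) covers columns n .. n + P_read - 1,
    which is 0-based indices n - 1 .. n + P_read - 2.
    """
    return [list(range(n, n + p_read)) for n in range(f_w)]


# Example from the text: F_w = 5, P_read = 4.
for group in pixel_groups(5, 4):
    print(group)
```

The first group is columns 0 to 3 and the fifth group is columns 4 to 7, so each group is shifted by one column relative to the previous one.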

However, the present disclosure is not limited to this. In fact, the order of reading the original pixels of the image is not limited, and reading can be sequential, reversed, or skipped. As long as it can traverse all the original pixels of the image, traverse all the filter coefficients, and complete the M*N*F _w *F _h multiplication and accumulation operations, the filtering can be completed. It's just that in the case of reverse or skip reading, the output order of the pixels of the filtered image is different.

In the above example, P_read original pixels of the image are read from the external storage unit each time. But the present disclosure is not limited to this. After the P_read original pixels of the image are read from the external storage unit in an operation cycle, they can be stored in the internal storage unit of the vector processing unit. In this way, in a subsequent operation cycle, if some of the P_read original pixels that need to be read have already been stored in the internal storage unit, that part can be read directly from the internal storage unit, and only the remaining part, which is not stored in the internal storage unit, needs to be read from the external storage unit. Specifically, reading the P_read original pixels of one line of the image includes:

Read part of the P_read original pixels from the internal storage unit, this part having been stored in a previous operation cycle;

Read the remaining part of the P_read original pixels from the external storage unit to obtain the P_read original pixels;

Store the remaining part of the original pixels in the internal storage unit.

For example, as shown in Figure 7, after B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3} are read in cycle 4, they are stored in the internal storage unit of the vector processing unit. In this way, in cycle 5, only B_{1,4} needs to be read from the external storage unit, and B_{1,1}, B_{1,2}, B_{1,3} can be read from the internal storage unit. This reduces the amount of data read by the vector processing unit from the off-chip memory and saves bandwidth.
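The bandwidth saving can be illustrated with a small simulation. This is a sketch under an assumption not stated in the text: the internal storage is modelled as a set large enough to hold every pixel of the current row that has been read so far. The name `external_reads` is illustrative.

```python
def external_reads(groups):
    """Count, per cycle, how many pixels must come from external storage.

    `groups` is a list of column-index lists, one per operation cycle.
    Pixels already seen are assumed to stay in internal storage
    (assumption: the internal storage never evicts within one row).
    """
    cached = set()
    counts = []
    for group in groups:
        missing = [col for col in group if col not in cached]
        counts.append(len(missing))   # pixels fetched from external storage
        cached.update(missing)        # store the newly fetched pixels
    return counts


# Second image row, cycles 4 to 8: five groups of four pixels, shifted by one.
row_groups = [list(range(n, n + 4)) for n in range(5)]
counts = external_reads(row_groups)
print(counts, sum(counts))
```

In this model only the first cycle of a row fetches four pixels externally and each later cycle fetches one, so 8 pixels are read from external storage instead of 20.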

In the above example, the multiplication and accumulation operations are performed in each operation cycle, that is, the filter coefficients are multiplied by the original pixels, and the result of the multiplication is accumulated to the accumulation result of the previous operation cycle. However, the present disclosure is not limited to this. It is also possible to perform multiplication first to obtain all the product results used to calculate the pixel points of the filtered image and then perform the accumulation.

It can be seen that the present disclosure reads N coefficients of the filter in each operation cycle, where the number N of coefficients read is determined according to the number of multipliers that execute the image processing method, and each of the N coefficients is multiplied by each of the original pixels to obtain the product results. Compared with the prior art, more vector processing units are involved in the filtering operation. Compared with the preprocessing method, the calculation resources are fully utilized, and the overall performance of image filtering and the image processing efficiency are greatly improved.

Another embodiment of the present disclosure provides an image processing device, as shown in FIG. 10, including:

The external storage unit stores images and filters.

The vector processing unit includes: a multiplier; the vector processing unit is used to read P_read original pixels of the image, wherein the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit, and to read N coefficients of the filter, wherein the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image;

The multiplier is used to multiply each of the N coefficients by the P_read original pixels to obtain multiple product results, and the product results are used to calculate the pixel values of the pixels in the filtered image.

The vector processing unit can be an arithmetic unit in the processor, which can perform processing such as filtering on the image. The processor can be any type of chip with vector processing capabilities such as CPU, DSP, FPGA, etc. For the vector processing unit, the external storage unit is its off-chip memory. The external storage unit stores the image to be processed and the filter. The vector processing unit uses the filter to filter the image, and the filtered image obtained can also be stored in an external storage unit.

The vector processing unit of the DSP includes a plurality of MACs, and each MAC includes a multiplier and an adder, which are used to perform multiplication and accumulation operations in filtering.

It should be noted that FIG. 10 only schematically shows the structure of the image processing apparatus. In this embodiment, there may be one or more external storage units, and the images and filters may be stored in one or multiple external storage units.

In each operation cycle, the vector processing unit needs to read P _read original pixels of a line of the image. In order to improve the efficiency of the image processing method, the number of original pixels P _read to be read is equal to the maximum number of original pixels that can be read in each operation cycle.

When the vector processing unit reads the P _read original pixels from the storage unit, the maximum number that can be read depends on the access bit width of the data bus and the bit width of the original pixels. That is, the maximum number in this embodiment is equal to the quotient of the memory access bit width and the original pixel bit width.

In each operation cycle, the vector processing unit reads the N coefficients of the filter corresponding to the P_read original pixels from the external storage unit. In order to improve the overall performance of image filtering and improve image processing efficiency, in this embodiment, the number N of coefficients read each time is determined according to the number of multipliers. Specifically, if the image processing device includes N_calc multipliers, the number of coefficients is

N = N_calc / P_read.

The number of multipliers is usually a multiple of the number of original pixels read in each operation cycle, and that multiple is exactly the number of filter coefficients read from the storage unit. The number of multipliers can be 8, 16, 32, and so on.
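The relationship between the access bit width, P_read, and N can be sketched as below. The bit widths used in the example call (a 32-bit access width and 8-bit pixels) are hypothetical values chosen only so that P_read comes out as 4; they are not given in the text, and the function name `read_parameters` is illustrative.

```python
def read_parameters(access_bits, pixel_bits, n_calc):
    """Derive (P_read, N) from the access width and the multiplier count.

    P_read = access bit width / original pixel bit width
    N      = N_calc / P_read
    """
    p_read = access_bits // pixel_bits
    if p_read == 0 or n_calc % p_read != 0:
        # The text assumes N_calc is a multiple of P_read.
        raise ValueError("N_calc must be a positive multiple of P_read")
    return p_read, n_calc // p_read


# Hypothetical widths: 32-bit access, 8-bit pixels, 8 multipliers.
print(read_parameters(access_bits=32, pixel_bits=8, n_calc=8))
```

Under these assumed widths the result is P_read = 4 and N = 2, the configuration used in the worked example.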

After the original pixels and the filter coefficients are read, the multipliers multiply each of the N coefficients by each original pixel to obtain the product results.

When the vector processing unit reads P_read original pixels in the first row of the image, preprocessing is performed. In the preprocessing, the vector processing unit reads the P_read original pixels in the first line of the image and one or N coefficients of the first line of the filter, and the one or N coefficients are each multiplied by each of the original pixels of the first line of the image to obtain the product result.

Starting from the second line of the image, that is, when the vector processing unit reads P_read original pixels in the lines after the first line of the image, parallel processing is performed. In parallel processing, the vector processing unit reads the P_read original pixels in one row of the image and N coefficients of the filter, where the N coefficients are located in the same column of N adjacent rows of the filter. The N_calc vector processing units respectively multiply the N coefficients by each of the P_read original pixels to obtain a product result.

In the following, with reference to the accompanying drawings, taking the number of vector processing units as 8, reading 4 original pixels and 2 coefficients of the filter in each operation cycle as an example, the above processing process will be described.

Assume that the original size of the image is 16×4. The size of the filled image is 20×8, that is, the width M_w is 20 and the height N_h is 8. The size of the filter is 5×5, that is, both the width F_w and the height F_h are 5. The filter performs filtering operations on the filled image. The following takes two-dimensional convolution as an example to describe the filtering operation in detail.

In two-dimensional convolution, the original pixels of some lines of the image are used only once when the filtering results are calculated: when the product operations of the filtering are performed on the original pixels of other lines, the original pixels of these lines are not reused. Therefore, the original pixels of these lines can be preprocessed in-line.

In the above example, the first line of the image is in the above situation. When calculating the filtering result of the original pixels, the original pixels in the first row are used only once, and the original pixels in the first row will not be multiplexed when performing product operations in the filtering on the original pixels in the other rows. Therefore, the original pixels of the first row are calculated by preprocessing.

When preprocessing is performed, the image processing device is as shown in Figure 11, which shows the multipliers and adders of 8 MACs (MAC0-MAC7). The vector processing unit also includes:

Input buffers: group A buffers A1 and A2, and group B buffers B1, B2, B3, B4, B5, used to buffer the read filter coefficients and the original pixels of the image.

Output buffers: ACC0, ACC1, ACC2, ACC3.

Three multiplexers: MUX1, MUX2, MUX3.

The multiplier of each MAC is connected to the group B buffer, the multipliers of the first 4 MACs are connected to the buffer A1, and the multipliers of the last 4 MACs are connected to the buffer A2.

In the first operation cycle (cycle 1):

Read the 4 original pixels of the first line of the image, namely B_{0,0}, B_{0,1}, B_{0,2}, B_{0,3}; B_{0,0} is buffered to B1 through MUX2, B_{0,1}, B_{0,2}, B_{0,3} are cached to B2, B3, B4, and 0 is cached to B5 through MUX3;

Read one coefficient A_{0,0} of the first line of the filter and buffer it to A1; 0 is buffered to A2 through MUX1;

The multipliers of MAC0-MAC3 respectively multiply the coefficient A_{0,0} by the original pixels B_{0,0}, B_{0,1}, B_{0,2}, B_{0,3} to obtain the product results of coefficient A_{0,0}, which are cached to ACC0, ACC1, ACC2, ACC3 respectively.

In the second operation cycle (cycle 2):

Read the 4 original pixels of the first line of the image, namely B _0,2 , B _0,3 , B _0,4 , B _0,5 ; B2 first buffers the buffered original pixel B _0,1 to B1 via MUX2 , B2 caches the read original pixel B _0,2 ; B3 caches the read original pixel B _0,3 , B4 caches the read original pixel B _0,4 , and B5 caches the read original pixel B _0,5 ;

Read the two coefficients A_{0,1} and A_{0,2} of the first line of the filter; A_{0,1} is cached to A1, and A_{0,2} is cached to A2;

The multipliers of MAC0-MAC3 multiply the coefficient A_{0,1} by the four original pixels B_{0,1}, B_{0,2}, B_{0,3}, B_{0,4} of the image, respectively, to obtain the product result of coefficient A_{0,1}. The multipliers of MAC4-MAC7 multiply the coefficient A_{0,2} by the four original pixels B_{0,2}, B_{0,3}, B_{0,4}, B_{0,5} of the image to obtain the product result of coefficient A_{0,2}. The accumulators of MAC0-MAC7 accumulate the two product results to ACC0, ACC1, ACC2, ACC3.

In the third operation cycle (cycle 3):

Read the 4 original pixels of the first line of the image, namely B _0,4 , B _0,5 , B _0,6 , B _0,7 ; B3 first buffers the buffered original pixels B _0,3 to B1 via MUX2 , B3 caches the read original pixel B _0,5 ; B2 caches the read original pixel B _0,4 , B4 caches the read original pixel B _0,6 , and B5 caches the read original pixel B _0,7 ;

Read the two coefficients A_{0,3} and A_{0,4} of the first line of the filter; A_{0,3} is cached to A1, and A_{0,4} is cached to A2;

The multipliers of MAC0-MAC3 multiply the coefficient A_{0,3} by the four original pixels B_{0,3}, B_{0,4}, B_{0,5}, B_{0,6} of the image to obtain the product result of coefficient A_{0,3}. The multipliers of MAC4-MAC7 multiply the coefficient A_{0,4} by the four original pixels B_{0,4}, B_{0,5}, B_{0,6}, B_{0,7} of the image, respectively, to obtain the product result of coefficient A_{0,4}. The accumulators of MAC0-MAC7 accumulate the two product results to ACC0, ACC1, ACC2, ACC3.
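The three preprocessing cycles can be modelled in software as follows. This is a behavioural sketch, not a description of the hardware: buffers and multiplexers are not modelled, the accumulators ACC0 to ACC3 are represented by a plain list, and the function name `preprocess_row` and the sample data are illustrative.

```python
def preprocess_row(coeffs, pixels, p_read=4, n=2):
    """Accumulate one filter row against one image row.

    Cycle 1 uses a single coefficient (the second coefficient slot is fed 0);
    every later cycle uses up to N coefficients, each applied to P_read
    pixels shifted by the coefficient's column index.
    Returns (accumulators, number_of_cycles).
    """
    acc = [0] * p_read

    # Cycle 1: coefficient 0 times pixels 0 .. P_read - 1.
    for k in range(p_read):
        acc[k] += coeffs[0] * pixels[k]
    cycles = 1

    # Later cycles: N coefficients per cycle.
    j = 1
    while j < len(coeffs):
        for jj in range(j, min(j + n, len(coeffs))):
            for k in range(p_read):
                acc[k] += coeffs[jj] * pixels[jj + k]
        cycles += 1
        j += n
    return acc, cycles


coeffs = [1, 2, 3, 4, 5]       # stands in for A_{0,0} .. A_{0,4}
pixels = list(range(8))        # stands in for B_{0,0} .. B_{0,7}
acc, cycles = preprocess_row(coeffs, pixels)
direct = [sum(coeffs[j] * pixels[j + k] for j in range(5)) for k in range(4)]
print(acc, cycles, acc == direct)
```

The accumulated values agree with a direct row-wise sum of products, and the cycle count of 3 matches 1 + ceil((F_w - 1)/N) for F_w = 5 and N = 2.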

In the above example, this is the case starting from the second line of the image. In this embodiment, parallel processing can be performed on other rows after the first row, and the parallel processing can further improve the operation efficiency of two-dimensional filtering and the overall performance of image filtering.

When parallel processing is performed, the image processing device is as shown in FIG. 12. The vector processing unit also includes:

Output buffers ACC4, ACC5, ACC6, ACC7.

The internal storage units R0, R1, R2, R3, R4, R5, R6, R7; R0-R7 are the on-chip memories of the vector processing unit.

In the fourth operation cycle (cycle 4):

Read the 4 original pixels of the second line of the image, namely B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3}; B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3} are cached to B1, B2, B3, and B4 respectively;

Read the two coefficients A_{0,0} and A_{1,0} in the first column of the first row and the second row of the filter, which are two adjacent rows; A_{1,0} is cached to A1, and A_{0,0} is cached to A2;

The multipliers of MAC0-MAC3 multiply the coefficient A_{1,0} by the four original pixels B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3} of the image to obtain the product result of coefficient A_{1,0}, and the accumulators of MAC0-MAC3 accumulate the product results to ACC0, ACC1, ACC2, and ACC3; the multipliers of MAC4-MAC7 multiply the coefficient A_{0,0} by the four original pixels B_{1,0}, B_{1,1}, B_{1,2}, B_{1,3} of the image to obtain the product result of coefficient A_{0,0}, and the accumulators of MAC4-MAC7 accumulate the product result to ACC4, ACC5, ACC6, ACC7.

In the fifth operation cycle (cycle 5):

Read the 4 original pixels of the second line of the image, namely B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4}; B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4} are cached to B1, B2, B3, and B4 respectively;

Read the 2 coefficients A_{0,1} and A_{1,1} in the second column of the first row and the second row of the filter, which are two adjacent rows; A_{1,1} is cached to A1, and A_{0,1} is cached to A2;

The multipliers of MAC0-MAC3 multiply the coefficient A_{1,1} by the four original pixels B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4} of the image, respectively, to obtain the product result of coefficient A_{1,1}, and the accumulators of MAC0-MAC3 accumulate the product results to ACC0, ACC1, ACC2, and ACC3; the multipliers of MAC4-MAC7 multiply the coefficient A_{0,1} by the four original pixels B_{1,1}, B_{1,2}, B_{1,3}, B_{1,4} of the image, respectively, to obtain the product result of coefficient A_{0,1}, and the accumulators of MAC4-MAC7 accumulate the product result to ACC4, ACC5, ACC6, ACC7.

By analogy, in the eighth computing cycle (cycle 8):

Read the 4 original pixels of the second line of the image, namely B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7}; B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7} are cached to B1, B2, B3, and B4 respectively;

Read the 2 coefficients A_{0,4} and A_{1,4} in the fifth column of the first row and the second row of the filter, which are two adjacent rows; A_{1,4} is cached to A1, and A_{0,4} is cached to A2;

The multipliers of MAC0-MAC3 multiply the coefficient A_{1,4} by the four original pixels B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7} of the image to obtain the product result of coefficient A_{1,4}, and the accumulators of MAC0-MAC3 accumulate the product results to ACC0, ACC1, ACC2, and ACC3; the multipliers of MAC4-MAC7 multiply the coefficient A_{0,4} by the four original pixels B_{1,4}, B_{1,5}, B_{1,6}, B_{1,7} of the image, respectively, to obtain the product result of coefficient A_{0,4}, and the accumulators of MAC4-MAC7 accumulate the product result to ACC4, ACC5, ACC6, ACC7.

The parallel processing of the other lines of the image is similar to the parallel processing of the second line described above. After 23 calculation cycles, four pixels in the first row of the filtered image are obtained, and ACC0, ACC1, ACC2, and ACC3 are cleared (clear0).

When the N coefficients span the tail and the head of the filter in the height direction, that is, when they come from the last row and the first row of the filter rather than from two adjacent rows, the original pixels of the newly read image line cannot be shared by both coefficients. The two filter coefficients are then multiplied by original pixels of different lines, and the parallel processing proceeds as follows:

The vector processing unit reads another P_read original pixels from the internal storage unit; these other P_read original pixels were read from the image in a previous operation cycle and stored in the internal storage unit.

Each of the N coefficients at the tail of the filter in the height direction is multiplied by the P_read original pixels read from the image to obtain the product result;

For example, in the 24th operation cycle (cycle 24):

Read the 4 original pixels of the sixth row of the image from the external storage unit, namely B_{5,0}, B_{5,1}, B_{5,2}, B_{5,3}; B_{5,0}, B_{5,1}, B_{5,2}, B_{5,3} are cached to B1, B2, B3, and B4 respectively; read the 4 original pixels of the third row of the image from the internal storage units R0, R1, R2, R3, namely B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3}. The four original pixels B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3} were read from the image in cycle 9 and stored in the internal storage units R0, R1, R2, R3.

Read the two coefficients A_{4,0} and A_{0,0} in the first column of the fifth row and the first row of the filter; A_{0,0} is buffered to A1, and A_{4,0} is buffered to A2;

The multipliers of MAC0-MAC3 multiply the coefficient A_{0,0} by the four original pixels B_{2,0}, B_{2,1}, B_{2,2}, B_{2,3} of the image to obtain the product result of coefficient A_{0,0}, and the accumulators of MAC0-MAC3 accumulate the product results to ACC0, ACC1, ACC2, ACC3; the multipliers of MAC4-MAC7 multiply the coefficient A_{4,0} by the four original pixels B_{5,0}, B_{5,1}, B_{5,2}, B_{5,3} of the image, respectively, to obtain the product result of coefficient A_{4,0}, and the accumulators of MAC4-MAC7 accumulate the product result to ACC4, ACC5, ACC6, ACC7.

By analogy, after 28 operation cycles, the four pixels C_{0,0} to C_{0,3} in the first row and the four pixels C_{1,0} to C_{1,3} in the second row of the filtered image are obtained, and ACC4, ACC5, ACC6, and ACC7 are cleared (clear1).
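The idea of computing two rows of the filtered image from shared pixel reads can be sketched in software. This simplified model is an assumption-laden illustration: it uses two accumulator banks (standing in for ACC0 to ACC3 and ACC4 to ACC7), it does not model the buffers, the internal storage units, or the head/tail pairing of cycle 24, and it makes no claim about cycle counts. The names `filter_two_rows` and `direct_pixel` and the sample data are illustrative.

```python
def filter_two_rows(image, filt, r, col0=0, p_read=4):
    """Compute filtered rows r and r + 1 for P_read columns starting at col0.

    Each group of P_read pixels from image row i is read once and, where
    applicable, multiplied by filt[i - r][j] (for row r) and by
    filt[i - r - 1][j] (for row r + 1), i.e. two vertically adjacent
    coefficients in the same filter column.
    """
    f_h, f_w = len(filt), len(filt[0])
    acc_a = [0] * p_read   # output row r
    acc_b = [0] * p_read   # output row r + 1
    for i in range(r, r + f_h + 1):
        for j in range(f_w):
            pixels = image[i][col0 + j: col0 + j + p_read]
            if i - r < f_h:            # coefficient row for output row r
                for k in range(p_read):
                    acc_a[k] += filt[i - r][j] * pixels[k]
            if i - r - 1 >= 0:         # coefficient row for output row r + 1
                for k in range(p_read):
                    acc_b[k] += filt[i - r - 1][j] * pixels[k]
    return acc_a, acc_b


def direct_pixel(image, filt, row, col):
    """Reference value: plain sum of products over the filter window."""
    return sum(filt[u][v] * image[row + u][col + v]
               for u in range(len(filt)) for v in range(len(filt[0])))


# Illustrative 8-row by 20-column image and 5x5 filter with small integers.
image = [[(3 * i + 5 * j) % 7 for j in range(20)] for i in range(8)]
filt = [[(u + 2 * v) % 3 + 1 for v in range(5)] for u in range(5)]
row0, row1 = filter_two_rows(image, filt, r=0)
print(row0 == [direct_pixel(image, filt, 0, c) for c in range(4)],
      row1 == [direct_pixel(image, filt, 1, c) for c in range(4)])
```

Each image row between the first and last of the pair contributes to both accumulator banks from a single read, which is the reuse that the parallel processing relies on.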

If the height N_o of the filtered image is not evenly divisible by the number N of filter coefficients, the filtering of the pixels in the first to fourth columns of the filtered image is not yet complete, and the image must be processed further to obtain the pixels of the remaining lines of the filtered image. Because the remaining lines of the filtered image cannot be processed in parallel, the following in-line processing is performed:

Read one coefficient or the N coefficients in a row of the filter;

The multiplier multiplies the one coefficient by the P_read original pixels, or multiplies each coefficient of the N coefficients by the P_read original pixels, to obtain the product result.

In the 54th operation cycle (cycle 54):

Read the 4 original pixels of the fifth row of the image, namely B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3}; B_{4,0} is buffered to B1 through MUX2, B_{4,1}, B_{4,2}, B_{4,3} are cached to B2, B3, B4, and 0 is cached to B5 through MUX3;

The multipliers of MAC0-MAC3 multiply the coefficient A_{0,0} by the four original pixels B_{4,0}, B_{4,1}, B_{4,2}, B_{4,3} of the image, respectively, to obtain the product results of coefficient A_{0,0}. The product results are cached to ACC0, ACC1, ACC2, and ACC3, respectively.

In the 55th operation cycle (cycle 55):

Read the 4 original pixels of the fifth row of the image, namely B_{4,2}, B_{4,3}, B_{4,4}, B_{4,5}; B2 first passes its buffered original pixel B_{4,1} to B1 via MUX2 and then buffers the read original pixel B_{4,2}; B3 buffers the read original pixel B_{4,3}, B4 buffers the read original pixel B_{4,4}, and B5 buffers the read original pixel B_{4,5};

The multipliers of MAC0-MAC3 multiply the coefficient A_{0,1} by the four original pixels B_{4,1}, B_{4,2}, B_{4,3}, B_{4,4} of the image, respectively, to obtain the product results of coefficient A_{0,1}. The multipliers of MAC4-MAC7 multiply the coefficient A_{0,2} by the four original pixels B_{4,2}, B_{4,3}, B_{4,4}, B_{4,5} of the image, respectively, to obtain the product results of coefficient A_{0,2}. The accumulators of MAC0-MAC7 accumulate both sets of product results into ACC0, ACC1, ACC2, ACC3.

In the 56th operation cycle (cycle 56):

Read the 4 original pixels of the fifth row of the image, namely B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7}; B3 first passes its buffered original pixel B_{4,3} to B1 via MUX2 and then caches the read original pixel B_{4,5}; B2 caches the read original pixel B_{4,4}, B4 caches the read original pixel B_{4,6}, and B5 caches the read original pixel B_{4,7};

The multipliers of MAC0-MAC3 multiply the coefficient A_{0,3} by the four original pixels B_{4,3}, B_{4,4}, B_{4,5}, B_{4,6} of the image, respectively, to obtain the product results of coefficient A_{0,3}. The multipliers of MAC4-MAC7 multiply the coefficient A_{0,4} by the four original pixels B_{4,4}, B_{4,5}, B_{4,6}, B_{4,7} of the image, respectively, to obtain the product results of coefficient A_{0,4}. The accumulators of MAC0-MAC7 accumulate both sets of product results into ACC0, ACC1, ACC2, ACC3.

By analogy, the above steps are repeated: the same processing is performed on the 6th to 9th rows of the image with the corresponding 2nd to 5th rows of the filter. The processing of each row requires 3 operation cycles. After 15 operation cycles of in-line processing, 68 operation cycles in total, the pixels of the fifth row of the filtered image are obtained. At this point the pixels of the first to fourth columns of all five rows of the filtered image have been obtained, and the filtering of the first to fourth columns of the filtered image is complete.
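As a rough illustration of the in-line processing above, the following Python sketch computes P_read pixels of one filtered row using one coefficient in the first cycle of each filter row and up to N coefficients in each later cycle. The helper names `inline_row` and `direct_row` and the test data are hypothetical; the sketch models only the arithmetic and the cycle count, not the buffers B1-B5 or the multiplexers.

```python
def inline_row(image, filt, out_row, col0=0, p_read=4, n=2):
    """Return (pixels, cycles) for p_read pixels of filtered row out_row."""
    f_h, f_w = len(filt), len(filt[0])
    acc = [0] * p_read  # models ACC0-ACC3
    cycles = 0
    for k in range(f_h):
        src = image[out_row + k]
        # First cycle of the row: a single coefficient filt[k][0].
        for i in range(p_read):
            acc[i] += filt[k][0] * src[col0 + i]
        cycles += 1
        # Later cycles: up to n coefficients, all accumulated into acc.
        for j0 in range(1, f_w, n):
            for j in range(j0, min(j0 + n, f_w)):
                for i in range(p_read):
                    acc[i] += filt[k][j] * src[col0 + j + i]
            cycles += 1
    return acc, cycles

def direct_row(image, filt, out_row, col0=0, p_read=4):
    """Reference result: plain sum of coefficient-pixel products."""
    return [sum(filt[k][j] * image[out_row + k][col0 + i + j]
                for k in range(len(filt)) for j in range(len(filt[0])))
            for i in range(p_read)]

# Made-up 9x20 image and 5x5 filter, matching the sizes of the example.
image = [[(3 * r + 5 * c) % 11 for c in range(20)] for r in range(9)]
filt = [[k - j + 1 for j in range(5)] for k in range(5)]
pixels, cycles = inline_row(image, filt, out_row=4)
```

With a 5×5 filter and N = 2, each filter row takes 1 + 2 = 3 cycles in this model, so the five filter rows take 15 cycles, in line with the count given in the text.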

In the above example, the image processing apparatus of this embodiment has been described with the number of multipliers N_calc equal to 8, the number of original pixels P_read read in each operation cycle equal to 4, the number of filter coefficients N equal to 2, the width of the image M_w equal to 20, and the width F_w and height F_h of the filter both equal to 5. However, it should be clear to those skilled in the art that the above parameters are not limited to these values and can take other values in this embodiment. When the parameters take other values, the image processing device operates in a manner similar to the above description, and those skilled in the art should fully understand its specific structure.
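The quantities derived from these parameters can be checked with a short Python sketch. The function name `derive_parameters` is hypothetical, and the image height N_h = 9 is an assumption inferred from the example (image rows 1 to 9, five filtered rows); the relations N = N_calc / P_read, N_o = N_h - F_h + 1, and the filtered width P_read × ceil((M_w - F_w + 1) / P_read) are the ones stated in the claims.

```python
import math

def derive_parameters(n_calc, p_read, m_w, n_h, f_w, f_h):
    """Derive the quantities used in the example from the basic parameters."""
    n = n_calc // p_read                                      # coefficients per cycle
    n_o = n_h - f_h + 1                                       # rows of the filtered image
    out_width = p_read * math.ceil((m_w - f_w + 1) / p_read)  # filtered width
    r = n_o % n                                               # rows left for in-line processing
    return {"N": n, "N_o": n_o, "out_width": out_width, "r": r}

# n_h=9 is an assumed value inferred from the example, not stated explicitly.
params = derive_parameters(n_calc=8, p_read=4, m_w=20, n_h=9, f_w=5, f_h=5)
```

Under these assumptions the filtered image has 5 rows, of which one row (r = 1) cannot be paired with another and falls to the in-line processing.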

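In the image processing device of this embodiment, the calculation time of the P_read pixels in the first row of the filtered image is:

T_pre + F_w × (F_h - 1) × cycle

Starting from the second row of the filtered image, the calculation time of the P_read pixels in each row is:

F_w × F_h × cycle

where T_pre is the operation time of the preprocessing, F_w and F_h are the width and height of the filter, respectively, and cycle is one operation cycle.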

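In the image processing device of this embodiment, when the height N_o of the filtered image is divisible by the number N of filter coefficients, that is, when there is no in-line processing, the calculation time of the N_o × P_read pixels from the first row to the last row of the filtered image is:

T_pre + F_w × F_h × (N_o / N) × cycle

When the height N_o of the filtered image is not divisible by N, the calculation time of the N_o × P_read pixels is:

T_pre + F_w × F_h × ((N_o - r) / N) × cycle + T_r

where r is the remainder of dividing N_o by N, and T_r is the operation time of the in-line processing.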
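The timing expressions of this embodiment can be evaluated numerically. The sketch below is a minimal Python illustration with hypothetical function names; it takes T_pre = (1 + ceil((F_w - 1) / N)) × cycle and T_r = r × T_pre × F_h from the claims and counts time in units of one operation cycle.

```python
import math

def t_pre(f_w, n):
    """Preprocessing time in cycles: 1 + ceil((F_w - 1) / N)."""
    return 1 + math.ceil((f_w - 1) / n)

def total_cycles(n_o, n, f_w, f_h):
    """Cycles needed for N_o x P_read pixels of the filtered image."""
    r = n_o % n                       # rows left over for in-line processing
    t_r = r * t_pre(f_w, n) * f_h     # in-line processing time, 0 when r == 0
    return t_pre(f_w, n) + f_w * f_h * ((n_o - r) // n) + t_r

example = total_cycles(n_o=5, n=2, f_w=5, f_h=5)   # 3 + 50 + 15
divisible = total_cycles(n_o=4, n=2, f_w=5, f_h=5)  # no in-line processing
```

For the example parameters (N_o = 5, N = 2, F_w = F_h = 5) this gives 68 cycles, which agrees with the total stated for the worked example; when N_o is divisible by N the T_r term vanishes and the expression reduces to the first formula.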

For example, in the above example, the original pixels of each row are read from top to bottom in the height direction of the image, and in each row, 5 groups of original pixels are read sequentially. However, the present disclosure is not limited to this. In fact, the order in which the original pixels of the image are read is not restricted: they can be read in order, in reverse order, or with skips. As long as all the original pixels of the image and all the filter coefficients are traversed and the M_w × N_h × F_w × F_h multiply-accumulate operations are performed, the filtering can be completed. The only difference is that, with reverse or skip reading, the pixels of the filtered image are output in a different order.
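The claim that the read order does not affect the result can be illustrated with a small Python sketch. It is a simplified model with hypothetical names and made-up data: every coefficient-pixel product is listed as a term (r, c, k, j), and the terms are accumulated in sequential, reversed, and shuffled order. Integer arithmetic is used so that the comparison is exact.

```python
import random

def accumulate(image, filt, terms):
    """Accumulate filt[k][j] * image[r+k][c+j] into out[r][c] in the given order."""
    f_h, f_w = len(filt), len(filt[0])
    out = [[0] * (len(image[0]) - f_w + 1)
           for _ in range(len(image) - f_h + 1)]
    for r, c, k, j in terms:
        out[r][c] += filt[k][j] * image[r + k][c + j]
    return out

# Made-up 6x8 image and 3x3 filter.
image = [[(r * 7 + c * 3) % 10 for c in range(8)] for r in range(6)]
filt = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

# One term per multiply-accumulate needed for the 4x6 filtered image.
terms = [(r, c, k, j)
         for r in range(4) for c in range(6)
         for k in range(3) for j in range(3)]
shuffled = terms[:]
random.Random(0).shuffle(shuffled)

sequential = accumulate(image, filt, terms)
backwards = accumulate(image, filt, list(reversed(terms)))
skipping = accumulate(image, filt, shuffled)
```

All three orders give the same filtered image here because each output pixel is a plain sum of products; only the moment at which each output pixel becomes complete changes.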

In the above example, P_read original pixels of one row of the image are read from the storage unit each time. However, the present disclosure is not limited to this. After the P_read original pixels of a row of the image are read from the storage unit in an operation cycle, they can also be stored in the internal storage unit of the vector processing unit. In this way, if some of the P_read original pixels needed in a subsequent operation cycle are already stored in the internal storage unit, that part can be read directly from the internal storage unit, and only the other part, which is not stored in the internal storage unit, needs to be read from the external storage unit. Specifically, reading the P_read original pixels of one row of the image includes:

Read part of the P_read original pixels from the internal storage unit, this part having been stored in a previous operation cycle;

Read the remaining part of the P_read original pixels from the external storage unit to obtain the P_read original pixels;

Store the remaining part of the original pixels in the internal storage unit.

This reduces the amount of data read by the vector processing unit from the off-chip memory and saves bandwidth.
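The bandwidth saving can be made concrete with a small Python sketch. It is a simplified, hypothetical model: the internal storage unit is represented by an unbounded dictionary keyed by (row, column), whereas the real unit has limited capacity, and the read pattern (columns 0-3, 2-5, 4-7 of one row) mirrors the three in-line cycles of the example.

```python
def read_pixels(image, row, cols, internal, stats):
    """Return the requested pixels, preferring the internal storage model."""
    pixels = []
    for c in cols:
        key = (row, c)
        if key in internal:
            stats["internal"] += 1          # already stored in an earlier cycle
        else:
            internal[key] = image[row][c]   # read externally, then store
            stats["external"] += 1
        pixels.append(internal[key])
    return pixels

# Made-up image: pixel value encodes its position as 10 * row + column.
image = [[10 * r + c for c in range(8)] for r in range(5)]
internal, stats = {}, {"internal": 0, "external": 0}

# Three consecutive cycles reading overlapping groups of row 4.
for cols in (range(0, 4), range(2, 6), range(4, 8)):
    last = read_pixels(image, 4, cols, internal, stats)
```

In this model the three cycles need 12 pixels in total, but only 8 are fetched from the external storage unit; the other 4 come from the internal storage model.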

In the above example, the multiplication and accumulation operations are performed every operation cycle, that is, the filter coefficients are multiplied by the original pixels, and the multiplication result is accumulated to the accumulation result of the previous operation cycle. However, the present disclosure is not limited to this, and the multiplication operation may be performed first to obtain all the product results used to calculate the pixel points of the filtered image and then the accumulation is performed.

Yet another embodiment of the present disclosure provides a mobile device that includes the image processing apparatus described in the previous embodiment. The mobile device is at least one of a portable mobile terminal, a drone, a handheld gimbal, and a remote controller.

Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the above division into functional modules is used as an example. In practical applications, the above functions can be assigned to different functional modules as required; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. For the specific working process of the device described above, reference may be made to the corresponding process in the foregoing method embodiment, which will not be repeated here.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced; where there is no conflict, the features in the embodiments of the present disclosure can be combined arbitrarily; and these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present disclosure.

Claims

1. An image processing method, characterized in that the method is applied to a vector processing unit, the vector processing unit includes a multiplier, and the method includes:

Reading P_read original pixels of an image, where the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit;

Reading N coefficients of a filter, where the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image;

Multiplying, by the multiplier, each of the N coefficients by the P_read original pixels to obtain multiple product results, the product results being used to calculate the pixel values of pixels in the filtered image.
2. The image processing method according to claim 1, wherein:

N = N_calc / P_read

Wherein N_calc is the number of multipliers of the vector processing unit.
3. The image processing method of claim 1, wherein the value of P_read is equal to the maximum number of original pixels that can be read by the vector processing unit in each operation cycle.
4. The image processing method of claim 3, wherein the maximum number is equal to the quotient of the memory access bit width and the original pixel bit width.
5. The image processing method according to claim 1, wherein the N coefficients are located in the same column of N adjacent rows of the filter.
6. The image processing method of claim 5, wherein reading the P_read original pixels of the image comprises:

Starting from the second row of the image, sequentially reading F_w groups of original pixels in each row, wherein the n-th group of original pixels comprises the P_read original pixels from the n-th column to the (n + P_read - 1)-th column, where 1 ≤ n ≤ F_w and F_w is the width of the filter.
7. The image processing method according to claim 6, wherein:

Starting from the second row of the filtered image, the calculation time of the P_read pixels in each row is:

F_w × F_h × cycle

Wherein F_w and F_h are the width and height of the filter, respectively, and cycle is one operation cycle.
8. The image processing method of claim 1, wherein when the P_read original pixels read are located in the first row of the image, the following preprocessing is performed:

Reading one coefficient or the N coefficients of the first row of the filter;

Multiplying, by the multiplier, the one coefficient by the P_read original pixels, or each coefficient of the N coefficients by the P_read original pixels, to obtain the product result.
9. The image processing method according to claim 8, wherein:

The calculation time of the P_read pixels in the first row of the filtered image is:

T_pre + F_w × (F_h - 1) × cycle

Wherein T_pre is the operation time of the preprocessing, F_w and F_h are the width and height of the filter, respectively, and cycle is one operation cycle.
10. The image processing method according to claim 1, wherein when the height of the filtered image is not divisible by N, and the P_read original pixels read are in a row of the image used to calculate the last r rows of pixels of the filtered image, where r is the remainder of dividing the height of the filtered image by N, the following in-line processing is performed:

Reading one coefficient or the N coefficients in a row of the filter;

Multiplying, by the multiplier, the one coefficient by the P_read original pixels, or each coefficient of the N coefficients by the P_read original pixels, to obtain the product result.
11. The image processing method according to claim 10, wherein:

The calculation time of the P_read pixels in each of the last r rows of the filtered image is:

T_pre × F_h

Wherein T_pre is the operation time of the preprocessing, and F_h is the height of the filter.
12. The image processing method of claim 1, wherein when the height of the filtered image is divisible by N, the calculation time of the N_o × P_read pixels from the first row to the last row of the filtered image is:

T_pre + F_w × F_h × (N_o / N) × cycle

Wherein N_o is the number of rows of the filtered image, N_o = N_h - F_h + 1, N_h is the height of the image, F_w and F_h are the width and height of the filter, respectively, cycle is one operation cycle, and T_pre is the operation time of the preprocessing.
13. The image processing method of claim 1, wherein when the height of the filtered image is not divisible by N, the calculation time of the N_o × P_read pixels from the first row to the last row of the filtered image is:

T_pre + F_w × F_h × ((N_o - r) / N) × cycle + T_r

Wherein N_o is the number of rows of the filtered image, N_o = N_h - F_h + 1, N_h is the height of the image, F_w and F_h are the width and height of the filter, respectively, r is the remainder of dividing N_o by N, T_pre is the operation time of the preprocessing, and T_r is the operation time of the in-line processing.
14. The image processing method according to claim 13, wherein the operation time T_r of the in-line processing is:

r × T_pre × F_h.
15. The image processing method according to any one of claims 9 and 11-14, wherein the operation time T_pre of the preprocessing is:

(1 + ceil((F_w - 1) / N)) × cycle.
16. The image processing method according to claim 1, wherein the filtered image has a width of P_read × ceil((M_w - F_w + 1) / P_read) and a height of N_o;

Wherein M_w is the width of the image, and F_w is the width of the filter.
17. The image processing method according to claim 1, wherein the vector processing unit further comprises an internal storage unit, and the image is stored in an external storage unit;

Reading the P_read original pixels of the image includes:

Reading part of the P_read original pixels from the internal storage unit, the part of the original pixels having been stored in a previous operation cycle;

Reading the remaining part of the P_read original pixels from the external storage unit, to obtain the P_read original pixels;

Storing the remaining part of the original pixels in the internal storage unit.
18. The image processing method of claim 5, wherein the vector processing unit further comprises an internal storage unit;

When the N adjacent rows wrap around from the tail to the head of the filter in the height direction, the method further includes: reading another P_read original pixels from the internal storage unit, the other P_read original pixels having been read from the image in a previous operation cycle and stored in the internal storage unit;

Each coefficient among the N coefficients that is located at the tail of the filter in the height direction is multiplied by the P_read original pixels read from the image to obtain the product result;

Each coefficient among the N coefficients that is located at the head of the filter in the height direction is multiplied by the other P_read original pixels read from the internal storage unit to obtain the product result.
19. An image processing device, characterized in that it comprises:

An external storage unit, which stores an image and a filter;

A vector processing unit, including a multiplier;

The vector processing unit is used to read P_read original pixels of the image, wherein the value of P_read is determined according to the memory access bit width corresponding to the vector processing unit, and to read N coefficients of the filter, wherein the value of N is determined according to the number of multipliers of the vector processing unit, and the filter is used for filtering the image;

The multiplier is used to multiply each of the N coefficients by the P_read original pixels to obtain multiple product results, and the product results are used to calculate the pixel values of pixels in the filtered image.
20. The image processing device according to claim 19, wherein:

N = N_calc / P_read

Wherein N_calc is the number of multipliers of the vector processing unit.
21. The image processing device of claim 19, wherein the value of P_read is equal to the maximum number of original pixels that can be read by the vector processing unit in each operation cycle.
22. The image processing device of claim 21, wherein the maximum number is equal to the quotient of the memory access bit width and the original pixel bit width.
23. The image processing device according to claim 19, wherein the N coefficients are located in the same column of N adjacent rows of the filter.
24. The image processing device according to claim 23, wherein the vector processing unit, starting from the second row of the image, sequentially reads F_w groups of original pixels in each row, wherein the n-th group of original pixels includes the P_read original pixels from the n-th column to the (n + P_read - 1)-th column, where 1 ≤ n ≤ F_w and F_w is the width of the filter.
25. The image processing device according to claim 24, wherein:

Starting from the second row of the filtered image, the calculation time of the P_read pixels in each row is:

F_w × F_h × cycle

Wherein F_w and F_h are the width and height of the filter, respectively, and cycle is one operation cycle.
26. The image processing device according to claim 19, wherein when the P_read original pixels read by the vector processing unit are located in the first row of the image, the vector processing unit performs the following preprocessing:

The vector processing unit reads one coefficient or the N coefficients of the first row of the filter;

The multiplier multiplies the one coefficient by the P_read original pixels, or multiplies each of the N coefficients by the P_read original pixels, to obtain the product result.
27. The image processing device according to claim 26, wherein:

The calculation time of the P_read pixels in the first row of the filtered image is:

T_pre + F_w × (F_h - 1) × cycle

Wherein T_pre is the operation time of the preprocessing, F_w and F_h are the width and height of the filter, respectively, and cycle is one operation cycle.
28. The image processing device according to claim 19, wherein when the height of the filtered image is not divisible by N, and the P_read original pixels read by the vector processing unit are in a row of the image used to calculate the last r rows of pixels of the filtered image, where r is the remainder of dividing the height of the filtered image by N, the vector processing unit performs the following in-line processing:

The vector processing unit reads one coefficient or the N coefficients in a row of the filter;

The multiplier multiplies the one coefficient by the P_read original pixels, or multiplies each of the N coefficients by the P_read original pixels, to obtain the product result.
29. The image processing device according to claim 28, wherein:

The calculation time of the P_read pixels in each of the last r rows of the filtered image is:

T_pre × F_h

Wherein T_pre is the operation time of the preprocessing, and F_h is the height of the filter.
30. The image processing device according to claim 19, wherein when the height of the filtered image is divisible by N, the calculation time of the N_o × P_read pixels from the first row to the last row of the filtered image is:

T_pre + F_w × F_h × (N_o / N) × cycle

Wherein N_o is the number of rows of the filtered image, N_o = N_h - F_h + 1, N_h is the height of the image, F_w and F_h are the width and height of the filter, respectively, cycle is one operation cycle, and T_pre is the operation time of the preprocessing.
31. The image processing device of claim 19, wherein when the height of the filtered image is not divisible by N, the calculation time of the N_o × P_read pixels from the first row to the last row of the filtered image is:

T_pre + F_w × F_h × ((N_o - r) / N) × cycle + T_r

Wherein N_o is the number of rows of the filtered image, N_o = N_h - F_h + 1, N_h is the height of the image, F_w and F_h are the width and height of the filter, respectively, r is the remainder of dividing N_o by N, T_pre is the operation time of the preprocessing, and T_r is the operation time of the in-line processing.
32. The image processing device according to claim 31, wherein the operation time T_r of the in-line processing is:

r × T_pre × F_h.
33. The image processing device according to any one of claims 27 and 29-32, wherein the operation time T_pre of the preprocessing is:

(1 + ceil((F_w - 1) / N)) × cycle.
34. The image processing device according to claim 19, wherein the filtered image has a width of P_read × ceil((M_w - F_w + 1) / P_read) and a height of N_o;

Wherein M_w is the width of the image, and F_w is the width of the filter.
35. The image processing device of claim 19, wherein the vector processing unit further comprises an internal storage unit;

The vector processing unit reads part of the P_read original pixels from the internal storage unit, the part of the original pixels having been stored in a previous operation cycle;

The vector processing unit reads the remaining part of the P_read original pixels from the external storage unit to obtain the P_read original pixels, and stores the remaining part of the original pixels in the internal storage unit.
36. The image processing device according to claim 23, wherein the vector processing unit further comprises an internal storage unit; when the N adjacent rows wrap around from the tail to the head of the filter in the height direction,

The vector processing unit reads another P_read original pixels from the internal storage unit, the other P_read original pixels having been read from the image in a previous operation cycle and stored in the internal storage unit;

The vector processing unit multiplies each coefficient among the N coefficients that is located at the tail of the filter in the height direction by the P_read original pixels read from the image to obtain the product result;

The vector processing unit multiplies each coefficient among the N coefficients that is located at the head of the filter in the height direction by the other P_read original pixels read from the internal storage unit to obtain the product result.
37. A mobile device, comprising: the image processing device according to any one of claims 19 to 36.
38. The mobile device according to claim 37, wherein the mobile device is at least one of a portable mobile terminal, a drone, a handheld gimbal, and a remote controller.