WO2022206556A1 - Matrix operation method, apparatus, device and storage medium for image data - Google Patents

Matrix operation method, apparatus, device and storage medium for image data

Info

Publication number
WO2022206556A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
matrix
instruction
sliding window
calculation result
Prior art date
Application number
PCT/CN2022/082811
Inventor
陈仲华
李峰
刘程浩
刘毅
艾通
李昊沅
陈其锋
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP22778739.7A priority Critical patent/EP4227886A4/en
Publication of WO2022206556A1 publication Critical patent/WO2022206556A1/zh
Priority to US17/976,185 priority patent/US20230049471A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of computer technology, and in particular, to a matrix operation method, apparatus, device and storage medium for image data.
  • the computer needs to store and calculate the image data in the form of a matrix. Since the operation process of the matrix is often very computationally intensive and has a serious delay, it is necessary to optimize the matrix operation to improve the operation efficiency.
  • the related art provides an optimization method for the rearrangement of matrix data, which rearranges the source matrix data, for example, arranges the data in the NC4HW4 format.
  • the rearrangement algorithm not only takes extra time because of the large-scale adjustment of the data, but also adds an extra channel filling step for any matrix whose number of channels is not divisible by 4.
  • for large matrices, the overall efficiency gain from the rearrangement algorithm can offset the additional cost of channel filling; for small matrices, however, the additional cost of channel filling has a great impact on the efficiency of matrix operations.
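  • The channel filling cost described above can be made concrete with a short sketch. It assumes, as the name NC4HW4 suggests, that channels are packed in groups of 4, so a channel count is rounded up to the next multiple of 4. The function names `padded_channels` and `padding_overhead` are hypothetical and do not come from the application.

```python
def padded_channels(channels: int, pack: int = 4) -> int:
    """Round the channel count up to the next multiple of `pack`."""
    return -(-channels // pack) * pack


def padding_overhead(channels: int, pack: int = 4) -> float:
    """Fraction of the packed buffer that holds filler channels."""
    total = padded_channels(channels, pack)
    return (total - channels) / total


# A 3-channel image is padded to 4 channels, so a quarter of the buffer is filler.
rgb_padded = padded_channels(3)
rgb_overhead = padding_overhead(3)
```

  • Under this assumption a channel count that is already a multiple of 4 pays nothing, while a small count such as 3 or 5 pays proportionally the most, which is consistent with the observation that small matrices suffer most from channel filling.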
  • Embodiments of the present application provide a matrix operation method, apparatus, device, and storage medium for image data.
  • the technical solution is as follows:
  • a matrix operation method for image data, executed by a computer device, the method including:
  • calculating the column data in the matrix data by using a single calculation instruction corresponding to the image operator to obtain an intermediate calculation result, the intermediate calculation result being in the form of a row;
  • calculating, by using the single calculation instruction, the matrix elements of the target column in the N rows of cached data to obtain the calculation result of the matrix data under the single calculation instruction, the target column containing N matrix elements of the intermediate calculation result;
  • outputting the calculation result as an image processing result of the matrix data by the image operator.
  • a matrix operation device for image data comprising:
  • a reading module for reading matrix data in the image data based on the matrix size of the image operator, M rows and N columns, where M and N are positive integers;
  • a calculation module configured to use a single calculation instruction corresponding to the image operator to calculate the column data in the matrix data to obtain an intermediate calculation result, and the intermediate calculation result is in the form of a row;
  • a multiplexing module for multiplexing and rearranging the intermediate calculation results into N rows of cached data;
  • the calculation module is further configured to use the single calculation instruction to calculate the matrix elements of the target column in the N rows of cached data to obtain the calculation result of the matrix data under the single calculation instruction, the target column containing N matrix elements of the intermediate calculation result;
  • An output module configured to output the calculation result as an image processing result of the matrix data by the image operator.
  • a computer device comprising a processor and a memory, the memory having at least one instruction stored therein, the instruction being loaded and executed by the processor to implement the matrix operation method for image data provided by various aspects of the present application.
  • a computer-readable storage medium having stored therein at least one instruction, the instruction being loaded and executed by a processor to implement the matrix operation method for image data provided by various aspects of the present application.
  • a computer program product or computer program comprising computer instructions stored in a computer readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above-mentioned matrix operation method for image data.
  • the operation result of the matrix data is obtained by multiplexing and rearranging the intermediate calculation result of the matrix data, and operating the intermediate calculation result after the multiplexing and rearrangement.
  • the method does not need channel filling during data rearrangement, thus avoiding the resource consumption of channel filling caused by the fact that the number of matrix channels cannot be divisible by 4 in the related art rearrangement algorithm.
  • although the non-target columns in the cached data of this solution consume some storage resources, for a matrix with a smaller number of channels the non-target columns consume less, so the operation efficiency of small matrices can be significantly improved.
  • FIG. 1 is a schematic diagram of a matrix with 3 rows and 3 columns provided by an exemplary embodiment of the present application
  • FIG. 2 is a schematic diagram of a coefficient matrix with 3 rows and 3 columns provided by an exemplary embodiment of the present application
  • FIG. 3 is a structural block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a matrix operation architecture of image data provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 6 is a schematic diagram of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 7 is a schematic diagram of multiplexing and rearranging an intermediate calculation result provided by an exemplary embodiment of the present application.
  • FIG. 8 is a schematic diagram of multiplexing and rearranging an intermediate calculation result provided by an exemplary embodiment of the present application.
  • FIG. 9 is a schematic diagram of multiplexing and rearranging an intermediate calculation result provided by an exemplary embodiment of the present application.
  • FIG. 10 is a flowchart of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 11 is a schematic diagram of an example of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 12 is a schematic diagram of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 13 is a schematic diagram of a special case of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 14 is a schematic diagram of a special case of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 15 is a schematic diagram of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 16 is a comparison diagram of the optimization effect of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • FIG. 17 is a structural block diagram of an image data matrix operation apparatus provided by an exemplary embodiment of the present application.
  • Image operator: a matrix operation used for image processing.
  • Matrix data: computer data stored in matrix form.
  • Matrix element: each item of data that makes up the matrix.
  • the matrix elements are stored in the memory of the computer device, and each matrix element has a corresponding storage address.
  • the computer device can obtain the matrix element by accessing the storage address of the matrix element.
  • a single calculation instruction including at least one of the following: a summation instruction, a maximum value instruction, a minimum value instruction, and a product instruction.
  • the convolution instruction instructs multiplying each element in the matrix by the coefficient at the corresponding position in the coefficient matrix, adding the products, and dividing the sum by the number of matrix elements.
  • the coefficient matrix is the matrix of 3 rows and 3 columns shown in Figure 2
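  • The convolution instruction as worded above can be sketched in a few lines. The coefficient values of Figure 2 are not reproduced in this text, so the all-ones coefficient matrix below is a placeholder assumption, and the function name `convolution_instruction` is hypothetical.

```python
def convolution_instruction(matrix, coefficients):
    """Multiply elements by matching coefficients, add, divide by the element count."""
    products = [
        value * weight
        for matrix_row, coefficient_row in zip(matrix, coefficients)
        for value, weight in zip(matrix_row, coefficient_row)
    ]
    return sum(products) / len(products)


matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# Placeholder coefficients (assumption): all ones, which reduces to the mean.
ones = [[1, 1, 1] for _ in range(3)]
result = convolution_instruction(matrix, ones)
```

  • With all-ones coefficients the instruction coincides with mean filtering, which is one way to see how the five instructions relate to each other.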
  • the matrix rearrangement algorithm of the related art performs large-scale rearrangement of the source data, which takes a lot of extra time; and for a matrix whose number of channels is not divisible by 4, the channels need to be filled in each operation, which is very expensive, so the optimization effect is poor.
  • the method proposed in this application only needs to fine-tune and rearrange the intermediate calculation results of the matrix data read from the image data. For a matrix with a smaller convolution kernel, the additional cost is smaller, so the method has a better optimization effect for small matrices.
  • FIG. 3 shows a schematic structural diagram of a computer device provided by an exemplary embodiment of the present application.
  • the device includes: a bus 101 , a processor 102 , and a memory 103 .
  • the processor 102 includes one or more processing cores, and the processor 102 executes various functional applications and information processing by running software programs and modules.
  • the memory 103 is connected to the processor 102 through the bus 101 .
  • the memory 103 may be configured to store at least one instruction, and the processor 102 may be configured to execute the at least one instruction to implement various steps in the following method embodiments.
  • the memory 103 also includes one or more registers 104 .
  • the register 104 can be used to store the data read through the single instruction multiple data stream (Single Instruction Multiple Data, SIMD) instruction, the intermediate calculation result of the matrix data operation, and the data in the sliding window in the sliding window processing.
  • the memory 103 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, including but not limited to: magnetic or optical disks, electrically erasable programmable Read Only Memory (Electrically-Erasable Programmable Read Only Memory, EEPROM), Erasable Programmable Read Only Memory (EPROM), Static Random Access Memory (SRAM), Read Only Memory (Read-Only Memory, ROM), magnetic memory, flash memory, programmable read-only memory (Programmable Read-Only Memory, PROM).
  • the computer device in the embodiment of the present application may be a smart phone, a tablet computer, a personal computer, a wearable device, a vehicle-mounted terminal, a server, or the like, which is not limited in the embodiment of the present application.
  • an application with requirements for image processing, image recognition, image segmentation, etc. is installed in the computer device, and the application needs to perform matrix operations on image data during the running process.
  • FIG. 4 shows a schematic diagram of a matrix operation architecture of image data provided by an exemplary embodiment of the present application.
  • the operation architecture includes: input data 30 , an algorithm processing module 40 , and output data 50 .
  • the input data 30 is at least one input matrix with M rows and N columns read by the computer equipment from the image data, and the output data 50 is the image processing result obtained by the computer equipment through the operation of the algorithm processing module 40 .
  • the algorithm processing module 40 may represent an algorithmic whole comprising multiple functions or multiple modules; alternatively, the algorithm processing module 40 is only a single function or a single module.
  • FIG. 5 shows a flow chart of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • the operation method of the matrix data is performed by the computer device shown in FIG. 3 .
  • the method includes:
  • Step 220 Based on the matrix size of the image operator, M rows and N columns, read the matrix data in the image data, where M and N are positive integers;
  • the image to be processed is stored in the computer device in the form of a matrix according to the values of its pixels.
  • an image operator such as a convolution kernel
  • for example, an image operator with a matrix size of 3 rows and 3 columns is used to analyze and process the image to be processed, and matrix data of 3 rows and 3 columns is read from the image data, with elements a0, a1, a2, b0, b1, b2, c0, c1, and c2; the matrix data obtained is shown as matrix 10 in FIG. 6.
  • Step 240 use a single calculation instruction corresponding to the image operator to calculate the column data in the matrix data to obtain an intermediate calculation result, and the intermediate calculation result is in the form of a row;
  • the single calculation instruction corresponding to the image operator is determined to be one of five single calculation instructions. According to the single calculation instruction, each column of data in the matrix data of M rows and N columns is calculated separately, and the intermediate calculation result in the form of a row is obtained. The method is as follows:
  • mean filtering is performed on each column of matrix elements in the matrix data
  • the maximum value is obtained for each column of matrix elements in the matrix data
  • the minimum value is obtained for each column of matrix elements in the matrix data
  • each column of matrix elements in the matrix data is multiplied by the respective corresponding coefficient values and then added.
  • the computer device calculates each column of data in the matrix data according to the single calculation instruction, and obtains one row of intermediate calculation results.
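  • The column step can be sketched as a single reduction applied to every column, which turns M rows into one row of intermediate results. This is a minimal illustration in plain Python rather than SIMD code; `reduce_columns` and `mean` are hypothetical helper names.

```python
def reduce_columns(matrix, op):
    """Apply `op` to every column of `matrix` and return one row of results."""
    return [op(column) for column in zip(*matrix)]


def mean(values):
    """Arithmetic mean, standing in for per-column mean filtering."""
    return sum(values) / len(values)


matrix = [
    [1, 5, 2],   # a0, a1, a2
    [7, 0, 3],   # b0, b1, b2
    [4, 6, 9],   # c0, c1, c2
]
column_sums = reduce_columns(matrix, sum)
column_maxima = reduce_columns(matrix, max)
column_minima = reduce_columns(matrix, min)
column_means = reduce_columns(matrix, mean)
```

  • Each call yields a 1-row, 3-column result that plays the role of s0, s1, s2 in the later steps.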
  • Step 260 Multiplexing and rearranging the intermediate calculation results into N lines of cached data
  • one way to multiplex and rearrange the intermediate calculation results into N rows of cached data is to move the intermediate calculation result to the left i times and to the right N-1-i times, obtaining N rows of cached data, where i is an integer greater than or equal to 0 and less than or equal to N-1. That is, when i is 0, the computer device moves the intermediate calculation result to the right N-1 times in succession to obtain N rows of cached data; when i is N-1, the computer device moves the intermediate calculation result to the left N-1 times in succession to obtain N rows of cached data; when i is greater than 0 and less than N-1, the computer device moves the intermediate calculation result both to the left and to the right.
  • the present application does not limit the multiplexing and rearranging manner of the intermediate calculation results.
  • an intermediate calculation result of 1 row and 3 columns, s0, s1, and s2, is used as an example of multiplexing and rearranging.
  • in FIG. 7, the intermediate calculation result 12 is moved to the right twice, giving the intermediate calculation result 13 after one move and the intermediate calculation result 14 after two moves.
  • in FIG. 8, the intermediate calculation result 12 is moved to the left twice, giving the intermediate calculation result 16 after one move and the intermediate calculation result 17 after two moves.
  • in FIG. 9, the intermediate calculation result is moved once to the left and once to the right, giving the intermediate calculation result 19 after the left move and the intermediate calculation result 20 after the right move.
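  • The three rearrangement variants just described can be reproduced with a small sketch. It models a move as a shift that leaves `None` in vacated slots; that marker, and the names `shift` and `rearrange`, are assumptions made for illustration.

```python
def shift(row, offset, fill=None):
    """Move `row` right by `offset` places (left when negative); vacated slots get `fill`."""
    width = len(row)
    return [row[i - offset] if 0 <= i - offset < width else fill
            for i in range(width)]


def rearrange(row, n, left_moves):
    """Return n cached rows: the original, `left_moves` left moves, n-1-left_moves right moves."""
    if not 0 <= left_moves <= n - 1:
        raise ValueError("left_moves must be between 0 and n-1")
    rows = [list(row)]
    rows += [shift(row, -k) for k in range(1, left_moves + 1)]
    rows += [shift(row, k) for k in range(1, n - left_moves)]
    return rows


s = ["s0", "s1", "s2"]
right_only = rearrange(s, 3, 0)   # i = 0: two moves to the right
left_only = rearrange(s, 3, 2)    # i = N-1: two moves to the left
mixed = rearrange(s, 3, 1)        # one move each way
```

  • In every variant exactly one column of the three cached rows holds s0, s1, and s2 together: the last column for `right_only`, the first for `left_only`, and the middle one for `mixed`.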
  • Step 280 Use a single calculation instruction to calculate the matrix elements of the target column in the N rows of cache data to obtain the calculation result of the matrix data under the single calculation instruction;
  • the target column contains N matrix elements of the intermediate calculation result; that is, a column of the N rows of cached data in which N matrix elements of the intermediate calculation result are all present is a target column.
  • in this example, the target column is the column in which s0, s1, and s2 coexist.
  • the computer device uses the single calculation instruction to calculate the matrix elements in the target column, and obtains the calculation result of the matrix data under the single calculation instruction. For example, calculation result 15 is obtained by calculating s2, s1, and s0 in FIG. 7; calculation result 18 is obtained by calculating s0, s1, and s2 in FIG. 8; and calculation result 21 is obtained by calculating s2, s1, and s0 in FIG. 9. The matrix elements in the target column are calculated as follows:
  • the matrix elements in the target column are added, and then divided by the number of matrix elements to obtain the calculation result;
  • the maximum value is obtained for the matrix elements in the target column, and the calculation result is obtained;
  • the minimum value is obtained for the matrix elements in the target column, and the calculation result is obtained;
  • the matrix elements in the target column are added, and then divided by the number of matrix elements to obtain the calculation result.
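  • Selecting the target column and reducing it can be sketched as follows. The cached rows are written out literally, with `None` as an assumed marker for slots vacated by a move; `target_columns` and `reduce_target` are hypothetical names.

```python
def target_columns(rows):
    """Indices of columns in which every cached row holds a real value."""
    return [index for index, column in enumerate(zip(*rows))
            if all(value is not None for value in column)]


def reduce_target(rows, op):
    """Apply `op` to each target column and return the results in column order."""
    columns = list(zip(*rows))
    return [op(columns[index]) for index in target_columns(rows)]


# An intermediate row 12, 11, 14 moved to the right twice.
cached = [
    [12, 11, 14],
    [None, 12, 11],
    [None, None, 12],
]
total = reduce_target(cached, sum)     # summation instruction
largest = reduce_target(cached, max)   # maximum value instruction
```

  • Only the last column qualifies here, so each instruction produces a single value for the 3 x 3 matrix; the other two columns are the non-target columns that occupy storage without contributing a result.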
  • Step 300 Output the calculation result as the image processing result of the matrix data by the image operator.
  • the matrix data is read in the image to be processed, as shown in matrix 10 in FIG. 6 .
  • the matrix operation instruction is a summation instruction
  • the summation operation is performed on each column of matrix data to obtain the intermediate calculation result:
  • s0 = a0 + b0 + c0
  • s1 = a1 + b1 + c1
  • s2 = a2 + b2 + c2.
  • the matrix operation instruction is an average filter instruction
  • the matrix operation instruction is the maximum value instruction
  • the matrix operation instruction is a convolution instruction
  • the matrix data is read from the image data, the matrix elements of each column in the matrix data are calculated separately, the intermediate calculation results obtained are then multiplexed and rearranged, and the multiplexed and rearranged data is calculated to obtain the calculation result of the matrix data under a single calculation instruction.
  • This embodiment reduces the repeated calculation of single element data in the matrix, improves the concurrency of the matrix operation, and further improves the matrix operation efficiency of the image data.
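  • The claim that the column-first scheme gives the same answer as operating on every window directly can be checked with a small sketch for the maximum value instruction. This is an illustrative pure-Python comparison, not the SIMD implementation of the embodiment; both function names are hypothetical.

```python
def window_max_naive(image, top, left, size=3):
    """Maximum of the size x size window whose top-left corner is (top, left)."""
    return max(image[top + dy][left + dx]
               for dy in range(size) for dx in range(size))


def window_max_columns_first(image, top, size=3):
    """Maxima of all size x size windows in the band of rows starting at `top`.

    Step 1 reduces each column once; step 2 combines `size` neighbouring
    column results, so each column maximum is reused by up to `size` windows.
    """
    width = len(image[0])
    column_maxima = [max(image[top + dy][x] for dy in range(size))
                     for x in range(width)]
    return [max(column_maxima[x:x + size]) for x in range(width - size + 1)]


image = [
    [3, 1, 4, 1, 5],
    [9, 2, 6, 5, 3],
    [5, 8, 9, 7, 9],
    [3, 2, 3, 8, 4],
]
band0 = window_max_columns_first(image, 0)
band1 = window_max_columns_first(image, 1)
```

  • The reuse in step 2 is where the saving comes from: a column maximum is computed once and then shared by every window that covers that column.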
  • FIG. 10 shows a flowchart of a matrix operation method for image data provided by an exemplary embodiment of the present application.
  • This embodiment takes the NEON instructions available on ARM processors as an example.
  • a load/store single instruction multiple data (Single Instruction Multiple Data, SIMD) instruction reads/writes 16 uint8_t values at one time.
  • the matrix operation method of the image data is performed by the computer device shown in FIG. 3 .
  • the method includes:
  • Step 320 Based on the matrix size of the image operator, M rows and N columns, read the matrix data in the image data, where M and N are positive integers;
  • an image operator that is a matrix with 3 rows and 3 columns is taken as an example.
  • the black solid-line box in FIG. 11 represents the original image 1101 , and the area outside the solid-line box is the edge-enlarged area 1102 extended to accommodate the matrix operation overflowing the effective area of the image.
  • the width of the original image 1101 is n × simd_width + tail
  • simd_width is the amount of data read and written by the SIMD instruction at one time
  • tail is less than simd_width
  • n is a positive integer.
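  • The decomposition of the image width into full SIMD blocks plus a tail is plain integer division. The sketch below assumes simd_width is 16, matching the 16 uint8_t values per load/store in this embodiment; `split_width` is a hypothetical helper name.

```python
def split_width(width: int, simd_width: int = 16):
    """Return (n, tail) such that width == n * simd_width + tail and tail < simd_width."""
    n, tail = divmod(width, simd_width)
    return n, tail


blocks, tail = split_width(100)   # 100 = 6 * 16 + 4
```

  • A tail of 0 corresponds to the case in which the image width is an exact multiple of the amount of data handled by one SIMD read/write.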
  • a NEON instruction is used to read 16 pieces of data from the original image at one time, and a matrix of 3 rows and 3 columns is obtained from the read data.
  • the dotted box in FIG. 12 represents a matrix with 3 rows and 3 columns, and the matrix elements in the matrix are a0, a1, a2, b0, b1, b2, c0, c1, and c2, respectively.
  • Step 340 Use a single calculation instruction corresponding to the image operator to calculate the column data in the matrix data to obtain an intermediate calculation result
  • Step 362 adopt the processing instruction based on the sliding window, store the data in the sliding window in the j-th register, and the initial value of j is 0;
  • the SIMD instruction reads/writes 16 pieces of data at one time as an example, so the window size is 16.
  • the intermediate calculation result needs to be rearranged and reused.
  • the data of each column in the intermediate calculation result obtained by the matrix with M rows and N columns are arranged in the same column, and then the final matrix calculation result is obtained through a single calculation instruction.
  • in the initial step of sliding window processing, the initial position of the window is determined first, and the data in the window is put into the register t0.
  • a sliding window with a window size of 16 is selected, and the starting position of the sliding window is located at the starting point of the intermediate calculation result of 1 row and N columns.
  • Step 364 After sliding the sliding window, store the data in the sliding window into the j+1th register;
  • the sliding window processing method is adopted, in order to arrange the data of each column in the intermediate calculation result obtained by the matrix of M rows and N columns into the same column, so as to realize the purpose of the matrix operation instruction.
  • the sliding window processing needs to move the sliding window at least N-1 times.
  • the sliding method may be to slide the window from the starting point of the intermediate calculation result of 1 row and N columns to the left N-1 times in succession; or to slide it from that starting point to the right N-1 times in succession; or to slide it from that starting point to the left i times in succession and, again from that starting point, to the right N-1-i times in succession, where i is an integer greater than or equal to 0 and less than or equal to N-1.
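  • The three sliding strategies differ only in which window offsets are visited. A minimal sketch, assuming offsets are counted in elements relative to the initial window position (negative for left, positive for right); `window_offsets` is a hypothetical name.

```python
def window_offsets(n: int, left_slides: int):
    """Offsets of the n window positions: the start, then left slides, then right slides."""
    if not 0 <= left_slides <= n - 1:
        raise ValueError("left_slides must be between 0 and n-1")
    left = [-k for k in range(1, left_slides + 1)]
    right = [k for k in range(1, n - left_slides)]
    return [0] + left + right


all_right = window_offsets(3, 0)   # slide right N-1 times
all_left = window_offsets(3, 2)    # slide left N-1 times
both = window_offsets(3, 1)        # slide left once, right once
```

  • Whichever strategy is chosen, the window is moved N-1 times and N window positions are cached, one per register.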
  • the black thick solid line frame in Fig. 11 represents the original image 1101, and the area outside the black thick solid line frame is the expanded border area 1102 expanded to accommodate the matrix operation overflowing the effective area of the image.
  • the starting position of the operation is located in the upper left corner of the black thick solid-line box.
  • the size of the image is much larger than 3*3.
  • the register tcurr is used to store the value of the area that currently needs to be operated
  • the register tprev is used to store the value of the area adjacent to the left of the area that currently needs to be operated
  • the register tnext is used to store the value of the area adjacent to the right of the area that currently needs to be operated.
  • the data is read through the SIMD instruction, put into the register tcurr, and successive data are fetched forward and backward respectively, and put into the register tprev and the register tnext respectively.
  • the calculation involving the column where the data in the register tprev is located is skipped. Since there is no content on the left side of the original image 1101, that is, the register tprev involves the No. 1 area indicated by the diagonally downward stripes in Figure 11, so the register tprev is empty, the result cannot be calculated, and the operation in this column is skipped. That is, when the column position x of the center point of the calculated matrix is less than or equal to half of the number of columns of the matrix, since the register tprev is empty, the result cannot be obtained, and the operation of this column is skipped.
  • the black solid-line box in the row of register t1 in Figure 13 represents the last data element sf in the register tprev.
  • the register tprev is empty, so the calculation result cannot be obtained; that is, the operation of the first column is skipped.
  • the boundary of the matrix is expanded. That is, when the calculated row position y of the center point of the matrix is less than or equal to half the row size of the matrix, or, when y is greater than or equal to the image height minus half the row size of the matrix, the operation of the matrix involves the image in Figure 11 Area No. 2 is indicated by the diagonal square stripes above and No. 3 area is indicated by the horizontal stripes at the bottom of the image. At this time, the boundaries of the matrix are expanded.
  • the way to expand the boundary can be predefined: for example, the expanded area is filled with 0; or the expanded area is filled with 1; or the values in the expanded area are the same as the values of the adjacent pixels in the matrix.
  • the present application does not limit the manner of edge expansion and filling.
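  • The three predefined filling options can be sketched as one padding helper. The mode names "zero", "one", and "replicate" are labels chosen here for illustration, and reading the third option as copying the nearest image pixel is an interpretation of the text rather than something it states precisely.

```python
def expand_border(image, pad, mode="zero"):
    """Return `image` surrounded by `pad` extra rows/columns filled according to `mode`."""
    height, width = len(image), len(image[0])
    expanded = []
    for y in range(-pad, height + pad):
        row = []
        for x in range(-pad, width + pad):
            if 0 <= y < height and 0 <= x < width:
                row.append(image[y][x])          # inside the original image
            elif mode == "zero":
                row.append(0)
            elif mode == "one":
                row.append(1)
            elif mode == "replicate":
                nearest_y = min(max(y, 0), height - 1)
                nearest_x = min(max(x, 0), width - 1)
                row.append(image[nearest_y][nearest_x])
            else:
                raise ValueError(f"unknown mode: {mode}")
        expanded.append(row)
    return expanded


image = [[5, 6], [7, 8]]
zero_padded = expand_border(image, 1, "zero")
replicated = expand_border(image, 1, "replicate")
```

  • For a 3 x 3 operator a padding of 1 is enough for windows centred on the first and last rows to stay inside the expanded buffer.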
  • the operation of the matrix is not affected. Since the width of the No. 4 area is the same as the amount of data in one read/write operation performed by the SIMD instruction, the register tnext will involve the value of the No. 5 area indicated by the gray filling on the right side of Figure 11 and/or the No. 6 area indicated by the square filling. . Since the picture itself has a widening area, the operation can be performed.
  • the black solid-line box in the row of register t2 in Figure 13 represents the first data element s0 in the register tnext
  • the intermediate calculation result s0 is calculated from the data a0, b0, c0 in the register tnext, and a0, b0, c0 are within the range of the edge expansion area, so the operation of the matrix is not affected.
  • the operation result of the matrix is obtained in a scalar manner; that is, the operation is performed directly on the source data.
  • the start of the register tnext is at an unknown address outside the image edge extension.
  • the first 4 columns in the register tcurr are the data in the image
  • the 5th column is the filling data of the edge expansion area
  • the intermediate calculation results s1 to s5 are obtained, and the sliding window processing method is used for the intermediate results.
  • R1 to R4 can be obtained, and the result of R5 cannot be obtained.
  • the scalar method is required, that is, the size of the source data in the matrix is directly compared.
  • if the width of the image is an integer multiple of the number of data elements read/written at one time by the SIMD instruction, that is, there is no tail region No. 5 in FIG. 11, then for the last element in tcurr a scalar calculation method is used, which directly compares the values of the source data in the matrix.
  • Step 366a determine whether the current register is the N-1th register
  • the computer device determines whether the current register is the N-1th register. If the current register is the N-1th register, the computer device completes the sliding window processing and executes step 380; if the current register is not the N-1th register, the computer device repeats step 364 to continue sliding the window and caching the selected values.
  • Step 366b Repeat step 364 until N lines of cache data stored in N registers are obtained;
  • a column in which N matrix elements of the intermediate calculation result coexist is a target column for performing subsequent operations.
  • Step 380 Use a single calculation instruction to calculate the matrix elements of the target column in the N rows of cache data to obtain the calculation result of the matrix data under the single calculation instruction;
  • the target column is the column in which the N matrix elements in the intermediate calculation result are simultaneously located in the N rows of cached data.
  • the matrix elements in the target column are added, and then divided by the number of matrix elements to obtain the calculation result;
  • the maximum value is obtained for the matrix elements in the target column, and the calculation result is obtained;
  • the minimum value is obtained for the matrix elements in the target column, and the calculation result is obtained;
  • the matrix elements in the target column are added, and then divided by the number of matrix elements to obtain the calculation result.
  • Step 400 Output the calculation result as the image processing result of the matrix data by the image operator.
  • FIG. 15 takes as an example finding the maximum value of a matrix with 3 rows and 3 columns through sliding-window processing.
  • a sliding window of size 16 is selected, and its initial position starts at the start of register tcurr in row s.
  • the data in the initial sliding window is stored in register t0; that is, in FIG. 15, the 16 data items in register tcurr are stored in register t0.
  • the sliding window is slid one position to the right from the initial position; the resulting window position is shown by the black solid-line box in FIG. 15.
  • the data in the sliding window is stored in register t1; that is, the last 15 data items in register tcurr and the first data item in register tnext are stored in register t1. The current register t1 is determined not to be the (N-1)th register (that is, register t2), so the sliding-window operation continues.
  • the sliding window is slid one position to the left from the initial position, and the data in the sliding window is stored in register t2; that is, the last data item in register tprev and the first 15 data items in register tcurr are stored in register t2.
  • the current register t2 is determined to be the (N-1)th register;
  • the target column is determined, and the single calculation instruction is used to calculate the matrix elements of the target column in the cache data.
  • taking the maximum over the target columns of registers t0, t1, and t2 implements the maximum instruction for the matrix with 3 rows and 3 columns.
  • for example, the result of taking the maximum over columns s0, s1, and s2 is the maximum value of the matrix shown by the dashed box in FIG. 12.
  • the matrix operation process shown in this embodiment for an image operator with 3 rows and 3 columns can be extended by analogy to other small image operators, such as 5 rows and 5 columns or 7 rows and 7 columns.
  • when the convolution kernel matrix has 3 rows and 3 columns,
  • the intermediate calculation result is multiplexed and rearranged into 3 rows of cache data through the sliding window, and the storage resources of at most 2 non-target columns are consumed; when the convolution kernel matrix has 5 rows and 5 columns, the intermediate calculation result is multiplexed and rearranged into 5 rows of cache data through the sliding window, and the storage resources of at most 4 non-target columns are consumed.
  • the calculation result of the matrix data is obtained.
  • the intermediate calculation result is multiplexed by means of sliding-window processing, and in combination with SIMD instructions the calculation results of multiple matrices can be obtained at one time, which reduces repeated calculation on individual matrix elements, increases the concurrency of the matrix operation, and thereby improves the matrix operation efficiency for image data.
  • FIG. 16 compares the time taken to perform matrix operations on image data with this method and with a conventional method.
  • for data of type uint8_t, a 3×3 image operator is selected to perform matrix operations on the image data.
  • across different resolutions, the speed-up ratio of this method relative to the OpenCV method is approximately 1.8 to 2.1, where the speed-up ratio is the ratio of the time taken by the conventional OpenCV method to the time taken by this method. It can be seen that this method greatly improves the matrix operation efficiency for image data.
  • FIG. 17 is a structural block diagram of an image data matrix operation apparatus provided by an exemplary embodiment of the present application.
  • the device includes:
  • a reading module 500 configured to read matrix data in the image data based on the matrix size of the image operator, M rows and N columns, where M and N are positive integers;
  • the calculation module 520 is configured to use a single calculation instruction corresponding to the image operator to calculate the column data in the matrix data to obtain an intermediate calculation result, and the intermediate calculation result is in the form of a row;
  • a multiplexing module 540 configured to multiplex and rearrange the intermediate calculation results into N lines of cached data
  • the calculation module 520 is further configured to use the single calculation instruction to calculate the matrix elements of the target column in the N rows of cache data to obtain the calculation result of the matrix data under the single calculation instruction, where the target column contains the N matrix elements of the intermediate calculation result;
  • the output module 560 is configured to output the calculation result as an image processing result of the matrix data by the image operator.
  • In a possible design, the multiplexing module 540 is configured to use a processing instruction based on a sliding window to store the data in the sliding window into the jth register, where the data in the sliding window includes part or all of the 1 row of intermediate calculation result and the initial value of j is 0; and, when j has not reached N, after sliding the sliding window, to store the data in the sliding window into the (j+1)th register, until N rows of cache data stored in N registers are obtained, where the data in the sliding window includes part or all of the 1 row of intermediate calculation result.
  • the processing instruction is a single instruction multiple data stream instruction, and the processing instruction supports simultaneous processing of K pieces of data;
  • the multiplexing module 540 is configured to use a processing instruction based on the sliding window to store the K data items in the sliding window into the jth register, where the data in the sliding window includes part or all of the intermediate calculation result; and, after sliding the sliding window, to store the K data items in the sliding window into the (j+1)th register.
  • the multiplexing module 540 is configured to store the K data items in the sliding window into the (j+1)th register after sliding the sliding window to the left by one position; or, the multiplexing module 540 is configured to store the K data items in the sliding window into the (j+1)th register after sliding the sliding window to the right by one position.
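  • at least one of the following holds in the N rows of cache data: the matrix element at row j, column i is the same as the matrix element at row j+1, column i-1; the matrix element at row t, column i is the same as the matrix element at row t-1, column i+1.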
  • the single calculation instruction is a mean filter instruction or a convolution instruction;
  • the calculation module 520 is configured to add the N matrix elements of the target column in the cache data to obtain a matrix element sum, divide the matrix element sum by the number of matrix elements, and output the calculation result of the matrix data under the single calculation instruction, where the number of matrix elements is equal to M times N.
  • a computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the matrix operation method for image data provided by the above method embodiments.
  • a computer program product or computer program comprises computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the matrix operation method for image data described in the above aspects.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)

Abstract

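A matrix operation method, apparatus, and device for image data, and a storage medium, relating to the field of computer technologies. The method includes: reading matrix data in the image data based on the matrix size of an image operator, M rows and N columns (220); calculating column data in the matrix data by using a single calculation instruction corresponding to the image operator, to obtain an intermediate calculation result (240); multiplexing and rearranging the intermediate calculation result into N rows of cache data (260); calculating the matrix elements of a target column in the N rows of cache data by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction (280); and outputting the calculation result as the image processing result of the image operator for the matrix data (300). The present application improves the efficiency of matrix operations on image data.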

Description

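Matrix operation method, apparatus, and device for image data, and storage medium
This application claims priority to Chinese Patent Application No. 202110349762.2, filed on March 31, 2021 and entitled "Matrix operation method, apparatus, and device for image data, and storage medium", the entire contents of which are incorporated by reference into the embodiments of this application.
Technical Field
This application relates to the field of computer technologies, and in particular to a matrix operation method, apparatus, and device for image data, and a storage medium.
Background
In scenarios where the neurons of a neural network are used to perform calculations on image data, the neurons are all operators in matrix form, so the computer needs to store and calculate the image data in matrix form. Because matrix operations usually involve a large amount of computation and severe latency, they need to be optimized to improve operation efficiency.
The related art provides an optimization method based on rearranging matrix data, in which the source matrix data is rearranged, for example into the NC4HW4 format. This rearrangement algorithm not only incurs extra time because of the large-scale data adjustment, but also adds an extra channel-padding step during operation for matrices whose channel count is not divisible by 4. For large matrices, the overall efficiency gain that the rearrangement algorithm brings to the matrix operation can offset the extra cost of channel padding; for small matrices, however, the extra cost of channel padding has a very large impact on matrix operation efficiency.
In scenarios where image data is calculated, how to improve the operation efficiency of small matrices is a technical problem that urgently needs to be solved.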
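Summary
The embodiments of this application provide a matrix operation method, apparatus, and device for image data, and a storage medium. The technical solutions are as follows:
According to one aspect of this application, a matrix operation method for image data is provided. The method is performed by a computer device and includes:
reading matrix data in the image data based on the matrix size of an image operator, M rows and N columns, where M and N are positive integers;
calculating column data in the matrix data by using a single calculation instruction corresponding to the image operator, to obtain an intermediate calculation result, where the intermediate calculation result is in row form;
multiplexing and rearranging the intermediate calculation result into N rows of cache data;
calculating the matrix elements of a target column in the N rows of cache data by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction, where the target column contains the N matrix elements of the intermediate calculation result; and
outputting the calculation result as the image processing result of the image operator for the matrix data.
According to another aspect of this application, a matrix operation apparatus for image data is provided. The apparatus includes:
a reading module, configured to read matrix data in the image data based on the matrix size of an image operator, M rows and N columns, where M and N are positive integers;
a calculation module, configured to calculate column data in the matrix data by using a single calculation instruction corresponding to the image operator, to obtain an intermediate calculation result, where the intermediate calculation result is in row form;
a multiplexing module, configured to multiplex and rearrange the intermediate calculation result into N rows of cache data;
the calculation module being further configured to calculate the matrix elements of a target column in the N rows of cache data by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction, where the target column contains the N matrix elements of the intermediate calculation result; and
an output module, configured to output the calculation result as the image processing result of the image operator for the matrix data.
According to another aspect of this application, a computer device is provided. The computer device includes a processor and a memory, the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the operation method for matrix data provided in the aspects of this application.
According to another aspect of this application, a computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to implement the matrix operation method for image data provided in the aspects of this application.
According to one aspect of this application, a computer program product or computer program is provided. The computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the above matrix operation method for image data.
In scenarios where image data is calculated, the intermediate calculation result of the matrix data is multiplexed and rearranged, and the rearranged intermediate calculation result is operated on to obtain the operation result of the matrix data. This method requires no channel padding during data rearrangement, and therefore avoids the resource cost of channel padding that the related-art rearrangement algorithm incurs when the number of matrix channels is not divisible by 4. Although the non-target columns in the cache data of this solution consume some storage resources, the smaller the number of channels of the matrix, the smaller this cost, so the operation efficiency of small matrices can be significantly improved.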
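Brief Description of the Drawings
FIG. 1 is a schematic diagram of a matrix with 3 rows and 3 columns according to an exemplary embodiment of this application;
FIG. 2 is a schematic diagram of a coefficient matrix with 3 rows and 3 columns according to an exemplary embodiment of this application;
FIG. 3 is a structural block diagram of a computer device according to an embodiment of this application;
FIG. 4 is a schematic diagram of a matrix operation architecture for image data according to an embodiment of this application;
FIG. 5 is a flowchart of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 6 is a schematic diagram of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 7 is a schematic diagram of multiplexing and rearranging an intermediate calculation result according to an exemplary embodiment of this application;
FIG. 8 is a schematic diagram of multiplexing and rearranging an intermediate calculation result according to an exemplary embodiment of this application;
FIG. 9 is a schematic diagram of multiplexing and rearranging an intermediate calculation result according to an exemplary embodiment of this application;
FIG. 10 is a flowchart of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 11 is a schematic diagram of an example of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 12 is a schematic diagram of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 13 is a schematic diagram of a special case of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 14 is a schematic diagram of a special case of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 15 is a schematic diagram of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 16 is a comparison diagram of the optimization effect of a matrix operation method for image data according to an exemplary embodiment of this application;
FIG. 17 is a structural block diagram of a matrix operation apparatus for image data according to an exemplary embodiment of this application.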
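Detailed Description
Image operator: a matrix operation used for image processing.
Matrix data: computer data stored in matrix form.
Matrix element: each item of data that makes up a matrix. Matrix elements are stored in the memory of the computer device, and each matrix element has a corresponding storage address. The computer device can obtain a matrix element by accessing its storage address.
Single calculation instruction: includes at least one of the following: a sum instruction, a mean filter instruction, a maximum instruction, a minimum instruction, and a convolution (product) instruction.
The sum instruction instructs adding the elements of the matrix to obtain a sum. Taking the matrix with 3 rows and 3 columns in FIG. 1 as an example, the result output by the sum instruction is R0=a0+a1+a2+b0+b1+b2+c0+c1+c2;
The mean filter instruction instructs performing a mean filter operation on the elements of the matrix, that is, adding the elements of the matrix and dividing by the number of matrix elements. Taking the matrix with 3 rows and 3 columns in FIG. 1 as an example, the result output by the mean filter instruction is R0=(a0+a1+a2+b0+b1+b2+c0+c1+c2)/9;
The maximum instruction instructs comparing the elements of the matrix to obtain the maximum value. Taking the matrix with 3 rows and 3 columns in FIG. 1 as an example, the result output by the maximum instruction is R0=Max(a0,a1,a2,b0,b1,b2,c0,c1,c2);
The minimum instruction instructs comparing the elements of the matrix to obtain the minimum value. Taking the matrix with 3 rows and 3 columns in FIG. 1 as an example, the result output by the minimum instruction is R0=Min(a0,a1,a2,b0,b1,b2,c0,c1,c2);
The convolution instruction instructs multiplying each element of the matrix by the coefficient at the corresponding position in a coefficient matrix, adding the products, and dividing by the number of matrix elements to obtain the product result. Taking the matrix with 3 rows and 3 columns in FIG. 1 as an example, with the coefficient matrix being the matrix with 3 rows and 3 columns shown in FIG. 2, the result output by the convolution instruction is R0=(a0*k00+a1*k01+a2*k02+b0*k10+b1*k11+b2*k12+c0*k20+c1*k21+c2*k22)/9.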
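The five single calculation instructions defined above can be sketched in plain Python. This is only an illustration: the 3×3 values standing in for a0..c2 (FIG. 1) and k00..k22 (FIG. 2) are hypothetical, and the function names are not taken from the application.

```python
# Hypothetical values standing in for a0..c2 (FIG. 1) and k00..k22 (FIG. 2).
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[1, 0, 1],
          [0, 2, 0],
          [1, 0, 1]]


def flatten(m):
    """Return the elements of a matrix as one flat list, row by row."""
    return [v for row in m for v in row]


def sum_instruction(m):
    return sum(flatten(m))


def mean_filter_instruction(m):
    values = flatten(m)
    return sum(values) / len(values)


def max_instruction(m):
    return max(flatten(m))


def min_instruction(m):
    return min(flatten(m))


def convolution_instruction(m, k):
    # As defined in the text: multiply element-wise, add the products,
    # then divide by the number of matrix elements.
    products = [a * b for a, b in zip(flatten(m), flatten(k))]
    return sum(products) / len(products)
```

Each function works on the whole matrix at once; the method described in the rest of the document obtains the same values in two passes, first per column and then across columns.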
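The related-art matrix rearrangement algorithm rearranges the source data on a large scale, which takes considerable extra time. In addition, for matrices whose channel count is not divisible by 4, channels must be padded in every operation, which is costly, so the optimization gain for small-matrix operations is poor. The method proposed here only needs to fine-tune and rearrange the intermediate calculation result of the matrix data read from the image data; the smaller the convolution kernel, the smaller the extra cost, so the optimization effect is better for small matrices.
FIG. 3 is a schematic structural diagram of a computer device according to an exemplary embodiment of this application. The device includes a bus 101, a processor 102, and a memory 103.
The processor 102 includes one or more processing cores, and executes various functional applications and information processing by running software programs and modules.
The memory 103 is connected to the processor 102 through the bus 101.
The memory 103 can be used to store at least one instruction, and the processor 102 is used to execute the at least one instruction to implement the steps in the following method embodiments.
Optionally, the memory 103 further includes one or more registers 104. The registers 104 can be used to store data read by single instruction multiple data (SIMD) instructions and intermediate calculation results of matrix data operations, and to hold the data in the sliding window during sliding-window processing.
In addition, the memory 103 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, including but not limited to: a magnetic disk or optical disc, electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), read-only memory (ROM), magnetic memory, flash memory, and programmable read-only memory (PROM).
The computer device in the embodiments of this application may be a smartphone, a tablet computer, a personal computer, a wearable device, an in-vehicle terminal, a server, or the like, which is not limited in the embodiments of this application. In some embodiments, an application that requires image processing, image recognition, image segmentation, or similar functions is installed on the computer device, and the application needs to perform matrix operations on image data while running.
FIG. 4 is a schematic diagram of a matrix operation architecture for image data according to an exemplary embodiment of this application. The architecture includes input data 30, an algorithm processing module 40, and output data 50.
The input data 30 is at least one input matrix with M rows and N columns read by the computer device from the image data, and the output data 50 is the image processing result computed by the computer device through the algorithm processing module 40. The algorithm processing module 40 may represent a whole algorithm containing multiple functions or multiple modules; alternatively, the algorithm processing module 40 is only a single function or a single module.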
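FIG. 5 is a flowchart of a matrix operation method for image data according to an exemplary embodiment of this application. Exemplarily, the operation method is performed by the computer device shown in FIG. 3. The method includes:
Step 220: Read matrix data in the image data based on the matrix size of the image operator, M rows and N columns, where M and N are positive integers;
In artificial intelligence (AI) applications that process images, the image to be processed is stored in the computer device in matrix form according to its pixel values. When the image is processed by an image operator in a neural network (such as a convolution kernel), the value of each pixel in the region covered by the image operator needs to be read.
Exemplarily, an image operator with a matrix size of 3 rows and 3 columns is used to analyze the image to be processed, and matrix data with 3 rows and 3 columns is read from the image data, namely a0, a1, a2, b0, b1, b2, c0, c1, and c2, as shown by matrix 10 in FIG. 6.
Step 240: Calculate column data in the matrix data by using the single calculation instruction corresponding to the image operator, to obtain an intermediate calculation result, where the intermediate calculation result is in row form;
The single calculation instruction corresponding to the image operator is one of the five single calculation instructions. The column data of the matrix data with M rows and N columns is calculated column by column according to the single calculation instruction, to obtain an intermediate calculation result in row form, as follows:
in response to the single calculation instruction corresponding to the image operator being a sum instruction, the matrix elements of each column of the matrix data are added;
in response to the single calculation instruction corresponding to the image operator being a mean filter instruction, mean filtering is performed on the matrix elements of each column of the matrix data;
in response to the single calculation instruction corresponding to the image operator being a maximum instruction, the maximum of the matrix elements of each column of the matrix data is taken;
in response to the single calculation instruction corresponding to the image operator being a minimum instruction, the minimum of the matrix elements of each column of the matrix data is taken;
in response to the single calculation instruction corresponding to the image operator being a convolution instruction, the matrix elements of each column of the matrix data are multiplied by their corresponding coefficient values and then added.
In some embodiments, the computer device calculates each column of the matrix data according to the single calculation instruction to obtain 1 row of intermediate calculation result.
Exemplarily, for the image operator 10 with 3 rows and 3 columns shown in FIG. 6, the matrix elements of each column of the matrix data are calculated as indicated by the single calculation instruction to obtain 1 row of intermediate calculation result 12.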
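As a minimal sketch of step 240, the function below reduces every column of a small matrix to one value, giving the single row of intermediate results. The function name, the string op codes, and the sample values are illustrative assumptions. For the mean filter instruction the sketch takes column sums and leaves the division to the final step, which is what the worked example later in the description does.

```python
def column_reduce(matrix, op, kernel=None):
    """Reduce each column of `matrix` to one value (the intermediate row)."""
    columns = list(zip(*matrix))  # transpose: one tuple per column
    if op in ("sum", "mean"):
        # Mean filter: column sums here; the division happens in the last step.
        return [sum(col) for col in columns]
    if op == "max":
        return [max(col) for col in columns]
    if op == "min":
        return [min(col) for col in columns]
    if op == "conv":
        kernel_columns = list(zip(*kernel))
        return [sum(a * k for a, k in zip(col, kcol))
                for col, kcol in zip(columns, kernel_columns)]
    raise ValueError(f"unknown op: {op}")


# Hypothetical 3x3 data and coefficients.
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
kernel = [[1, 0, 1],
          [0, 2, 0],
          [1, 0, 1]]
```

With these values, `column_reduce(matrix, "sum")` plays the role of s0, s1, s2 in the description.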
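Step 260: Multiplex and rearrange the intermediate calculation result into N rows of cache data;
One way to multiplex and rearrange the intermediate calculation result into N rows of cache data is to shift the intermediate calculation result to the left i times and to the right N-1-i times, giving N rows of cached results in total, where i is an integer greater than or equal to 0 and less than or equal to N-1. That is, when i is 0, the computer device shifts the intermediate calculation result to the right N-1 times in succession to obtain the N rows of cache data; when i is N-1, the computer device shifts the intermediate calculation result to the left N-1 times in succession to obtain the N rows of cache data; when i is an integer greater than 0 and less than N-1, the computer device needs to shift the intermediate calculation result both to the left and to the right. This application does not limit the manner in which the intermediate calculation result is multiplexed and rearranged.
In this embodiment, the intermediate calculation result is s0, s1, s2 in 1 row and 3 columns, and this intermediate calculation result is multiplexed and rearranged.
Exemplarily, as shown in FIG. 7, the intermediate calculation result 12 is shifted to the right twice, giving the intermediate calculation result 13 after one shift and the intermediate calculation result 14 after two shifts.
Exemplarily, as shown in FIG. 8, the intermediate calculation result 12 is shifted to the left twice, giving the intermediate calculation result 16 after one shift and the intermediate calculation result 17 after two shifts.
Exemplarily, as shown in FIG. 9, the intermediate calculation result is shifted once to the left and once to the right, giving the intermediate calculation result 19 after the left shift and the intermediate calculation result 20 after the right shift.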
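The three shifting variants can be sketched as follows. This is an illustrative model rather than the application's implementation: the helper names are hypothetical, and vacated positions are marked with `None` so that the target column (the one where every row holds a real value) can be found.

```python
def shift(row, k):
    """Shift `row` right by k positions (left if k is negative).

    Vacated positions are filled with None.
    """
    n = len(row)
    return [row[c - k] if 0 <= c - k < n else None for c in range(n)]


def rearrange(row, n, left_shifts):
    """Return n cached rows: `left_shifts` left shifts, the row itself,
    and n - 1 - left_shifts right shifts."""
    return [shift(row, k) for k in range(-left_shifts, n - left_shifts)]


def target_columns(rows):
    """Indices of the columns in which every cached row holds a value."""
    return [c for c in range(len(rows[0]))
            if all(r[c] is not None for r in rows)]


s = ["s0", "s1", "s2"]
right_only = rearrange(s, 3, left_shifts=0)  # the variant of FIG. 7
left_only = rearrange(s, 3, left_shifts=2)   # the variant of FIG. 8
both = rearrange(s, 3, left_shifts=1)        # the variant of FIG. 9
```

In each variant exactly one column contains s0, s1, and s2 together, and only its position differs.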
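Step 280: Calculate the matrix elements of the target column in the N rows of cache data by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction;
The target column contains the N matrix elements of the intermediate calculation result; that is, a column of the N rows of cache data that contains all N matrix elements of the intermediate calculation result is a target column. For example, in FIG. 7, FIG. 8, and FIG. 9, the target column is the column in which s0, s1, and s2 are all present.
The computer device calculates the matrix elements in the target column by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction. For example, in FIG. 7 the calculation result 15 is obtained by calculating s2, s1, and s0; in FIG. 8 the calculation result 18 is obtained by calculating s0, s1, and s2; in FIG. 9 the calculation result 21 is obtained by calculating s2, s1, and s0. The matrix elements in the target column are calculated as follows to obtain the calculation result of the matrix data under the single calculation instruction:
in response to the single calculation instruction corresponding to the image operator being a sum instruction, the matrix elements in the target column are added to obtain the calculation result;
in response to the single calculation instruction corresponding to the image operator being a mean filter instruction, the matrix elements in the target column are added and then divided by the number of matrix elements to obtain the calculation result;
in response to the single calculation instruction corresponding to the image operator being a maximum instruction, the maximum of the matrix elements in the target column is taken to obtain the calculation result;
in response to the single calculation instruction corresponding to the image operator being a minimum instruction, the minimum of the matrix elements in the target column is taken to obtain the calculation result;
in response to the single calculation instruction corresponding to the image operator being a convolution instruction, the matrix elements in the target column are added and then divided by the number of matrix elements to obtain the calculation result.
Step 300: Output the calculation result as the image processing result of the image operator for the matrix data.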
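Exemplarily, processing image data with an image operator with 3 rows and 3 columns is taken as an example:
Based on the matrix size of the image operator, 3 rows and 3 columns, matrix data is read from the image to be processed, as shown by matrix 10 in FIG. 6.
When the matrix operation instruction is a sum instruction, each column of the matrix data is summed to obtain the intermediate calculation result:
s0=a0+b0+c0, s1=a1+b1+c1, s2=a2+b2+c2. The intermediate calculation result is converted into N rows of cache data by any of the multiplexing and rearrangement manners described in step 260, the target column in which s0, s1, and s2 are all located is determined, and the data of the target column is added to obtain the operation result of the matrix data, that is, R0=s0+s1+s2, which is output as the image processing result of the image operator for the matrix data.
When the matrix operation instruction is a mean filter instruction, each column of the matrix data is summed to obtain the intermediate calculation result: s0=a0+b0+c0, s1=a1+b1+c1, s2=a2+b2+c2. The intermediate calculation result is converted into N rows of cache data by any of the multiplexing and rearrangement manners described in step 260, the target column in which s0, s1, and s2 are all located is determined, and the data of the target column is added and then divided by the number of elements in the matrix to obtain the operation result of the matrix data, that is, R0=(s0+s1+s2)/9, which is output as the image processing result of the image operator for the matrix data.
When the matrix operation instruction is a maximum instruction, the maximum of each column of the matrix data is taken to obtain the intermediate calculation result: s0=max(a0,b0,c0), s1=max(a1,b1,c1), s2=max(a2,b2,c2). The intermediate calculation result is converted into N rows of cache data by any of the multiplexing and rearrangement manners described in step 260, the target column in which s0, s1, and s2 are all located is determined, and the maximum of the target column is taken to obtain the operation result of the matrix data, that is, R0=max(s0,s1,s2), which is output as the image processing result of the image operator for the matrix data.
When the matrix operation instruction is a minimum instruction, the minimum of each column of the matrix data is taken to obtain the intermediate calculation result: s0=min(a0,b0,c0), s1=min(a1,b1,c1), s2=min(a2,b2,c2). The intermediate calculation result is converted into N rows of cache data by any of the multiplexing and rearrangement manners described in step 260, the target column in which s0, s1, and s2 are all located is determined, and the minimum of the target column is taken to obtain the operation result of the matrix data, that is, R0=min(s0,s1,s2), which is output as the image processing result of the image operator for the matrix data.
When the matrix operation instruction is a convolution instruction, a convolution operation is performed on each column of the matrix data to obtain the intermediate calculation result: s0=a0*k00+b0*k10+c0*k20, s1=a1*k01+b1*k11+c1*k21, s2=a2*k02+b2*k12+c2*k22. The intermediate calculation result is converted into N rows of cache data by any of the multiplexing and rearrangement manners described in step 260, the target column in which s0, s1, and s2 are all located is determined, and the data of the target column is added and then divided by the number of elements in the matrix to obtain the operation result of the matrix data, that is, R0=(s0+s1+s2)/9, which is output as the image processing result of the image operator for the matrix data.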
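The worked examples above rely on the column-then-row decomposition giving the same answer as operating on all M×N elements directly. The sketch below checks this for the sum, maximum, and minimum instructions on a strip that is wider than N, so that several window results come out at once. The function names and sample data are hypothetical, and the left-shift variant of the rearrangement is assumed.

```python
import operator
from functools import reduce


def separable_window_op(rows, n, combine):
    """rows: M image rows of equal length. Returns one result per full M x n window."""
    # Step 240: reduce every column to one intermediate value.
    s = [reduce(combine, col) for col in zip(*rows)]
    # Step 260: n cached rows; cached row k is s shifted left by k positions.
    cached = [s[k:] for k in range(n)]
    # Step 280: combine the n entries of every target column.
    width = len(s) - n + 1
    return [reduce(combine, (cached[k][c] for k in range(n)))
            for c in range(width)]


def direct_window_op(rows, n, combine):
    """Reference: combine all M x n source elements of every window directly."""
    width = len(rows[0]) - n + 1
    return [reduce(combine, (v for r in rows for v in r[c:c + n]))
            for c in range(width)]


# Hypothetical 3-row strip of pixel values.
rows = [[3, 1, 4, 1, 5],
        [9, 2, 6, 5, 3],
        [5, 8, 9, 7, 9]]
sums = separable_window_op(rows, 3, operator.add)
```

Each column value is computed once and reused by up to N neighbouring windows, which is the reduction in repeated calculation that the description refers to.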
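In summary, this embodiment reads matrix data from the image data, calculates the matrix elements of each column of the matrix data separately, multiplexes and rearranges the resulting intermediate calculation result, and calculates the rearranged data to obtain the calculation result of the matrix data under the single calculation instruction. This embodiment reduces repeated calculation on individual matrix elements, increases the concurrency of the matrix operation, and thereby improves the matrix operation efficiency for image data.
FIG. 10 is a flowchart of a matrix operation method for image data according to an exemplary embodiment of this application. This embodiment takes NEON instructions on ARM processors as an example: for data of type uint8_t, a load/store single instruction multiple data (SIMD) instruction reads or writes 16 uint8_t values at a time. Exemplarily, the matrix operation method for image data is performed by the computer device shown in FIG. 3. The method includes:
Step 320: Read matrix data in the image data based on the matrix size of the image operator, M rows and N columns, where M and N are positive integers;
For the implementation of this step, refer to step 220 above; details are not repeated here.
In this embodiment, the image operator used for the operation is a matrix with 3 rows and 3 columns.
Exemplarily, the black solid-line box in FIG. 11 represents the original image 1101, and the region outside the solid-line box is the border-extension region 1102, which is added to accommodate matrix operations that overflow the valid image area. The width of the original image 1101 is n×simd_width+tail, where simd_width is the amount of data read or written by one SIMD instruction, tail is less than simd_width, and n is a positive integer.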
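A small sketch of the width decomposition, assuming 16 lanes as in the NEON uint8_t example; the function name is hypothetical.

```python
def split_row(width, simd_width=16):
    """Return (n, tail) with width == n * simd_width + tail and 0 <= tail < simd_width."""
    return divmod(width, simd_width)


n, tail = split_row(100)  # six full 16-lane blocks, then a 4-pixel tail
```

A width such as 1920 leaves no tail, which is the case the description treats separately when it discusses the last element of tcurr.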
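The operation starts from the upper-left corner of the original image 1101, represented by the black solid-line box in FIG. 11. A pixel in the original image 1101 is selected as the center point of a matrix with 3 rows and 3 columns, and the data of this matrix is read.
Exemplarily, a NEON instruction is used to read 16 data items of the original image at a time, and a matrix with 3 rows and 3 columns is obtained from the data that has been read. For example, the dashed box in FIG. 12 represents a matrix with 3 rows and 3 columns, whose matrix elements are a0, a1, a2, b0, b1, b2, c0, c1, and c2.
Step 340: Calculate column data in the matrix data by using the single calculation instruction corresponding to the image operator, to obtain an intermediate calculation result;
For the implementation of this step, refer to step 240 above; details are not repeated here.
The remaining columns in the registers are handled by analogy, giving row s in FIG. 12, which represents the intermediate calculation result.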
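Step 362: Use a processing instruction based on a sliding window to store the data in the sliding window into the jth register, where the initial value of j is 0;
In this embodiment, a SIMD instruction reads or writes 16 data items at a time, so the window size is 16.
To fulfil the matrix operation instruction by means of the intermediate calculation result, the intermediate calculation result needs to be rearranged and multiplexed. Sliding-window processing is used to arrange the column data of the intermediate calculation result obtained from the matrix with M rows and N columns into the same column, and the final matrix calculation result is then obtained with the single calculation instruction. As the initial step of sliding-window processing, the start position of a window is determined and the window data is placed in register t0.
Exemplarily, a sliding window of size 16 is selected, and the start position of the sliding window is at the start of the intermediate calculation result with 1 row and N columns. The data in the sliding window is stored in the initial register t0.
Step 364: After sliding the sliding window, store the data in the sliding window into the (j+1)th register;
Sliding-window processing arranges the column data of the intermediate calculation result obtained from the matrix with M rows and N columns into the same column, and thereby accomplishes the matrix operation instruction. This requires moving the sliding window at least N-1 times.
Exemplarily, the sliding window may be slid to the left N-1 times in succession, starting from the start of the intermediate calculation result with 1 row and N columns; or slid to the right N-1 times in succession from that start; or slid to the left i times in succession from that start and to the right N-1-i times in succession from that start, where i is an integer greater than or equal to 0 and less than or equal to N-1.
After the sliding window is slid, the data in the sliding window is stored in the (j+1)th register, and the value of j is updated as j=j+1, where the initial value of j is 0.
While the sliding window is being slid, it may go beyond the range of the matrix operation. The computer device can extend the boundary of the matrix, or calculate the matrix with a scalar method, or skip the operation.
Taking FIG. 11 as an example, the thick black solid-line box in FIG. 11 represents the original image 1101, and the region outside it is the border-extension region 1102, which is added to accommodate matrix operations that overflow the valid image area. The operation starts at the upper-left corner of the thick black solid-line box. A matrix with 3 rows and 3 columns is operated on through sliding-window processing, and the image is much larger than 3*3. Register tcurr stores the values of the region currently being operated on, register tprev stores the values of the adjacent region to its left, and register tnext stores the values of the adjacent region to its right. Data is read by SIMD instructions within the original image 1101 represented by the thick black solid-line box in FIG. 11 and placed in register tcurr, and the contiguous data before and after it is placed in register tprev and register tnext respectively.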
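The role of tprev, tcurr, and tnext can be modelled as taking a K-element window from the three registers laid end to end. The sketch below uses 4 lanes for readability and a hypothetical `slide` helper. On NEON, extracting a window that spans two adjacent vectors is the kind of operation the `vextq_u8` intrinsic performs, though the description does not say which instruction the application uses.

```python
def slide(tprev, tcurr, tnext, offset):
    """Return the len(tcurr)-element window that starts `offset` positions
    to the right of the start of tcurr (negative offsets slide left)."""
    k = len(tcurr)
    line = tprev + tcurr + tnext  # the three registers laid end to end
    start = k + offset
    return line[start:start + k]


# Hypothetical 4-lane registers holding intermediate results.
tprev = ["p0", "p1", "p2", "p3"]
tcurr = ["c0", "c1", "c2", "c3"]
tnext = ["n0", "n1", "n2", "n3"]

t0 = slide(tprev, tcurr, tnext, 0)   # initial window
t1 = slide(tprev, tcurr, tnext, 1)   # window slid right by one position
t2 = slide(tprev, tcurr, tnext, -1)  # window slid left by one position
```

Because the neighbouring registers supply the elements that slide in, every lane of t0, t1, and t2 holds valid data as long as tprev and tnext themselves are valid.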
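Exemplarily, when the start of register tcurr is at the first column of the original image 1101, the calculation for the column that involves data in register tprev is skipped. No content exists to the left of the original image 1101; that is, register tprev would cover region No. 1, shown with diagonal downward stripes in FIG. 11. Register tprev is therefore empty, no result can be calculated, and the operation for that column is skipped. In other words, when the column position x of the center point of the matrix being calculated is less than or equal to half the number of columns of the matrix, register tprev is empty, no result can be obtained, and the operation for that column is skipped. For example, the black solid-line box in the register t1 row in FIG. 13 represents the last data item sf in register tprev; when the start of register tcurr is at the first column, register tprev is empty and no calculation result can be obtained, so the operation for the first column is skipped.
Exemplarily, when the start of register tcurr is at the first row or the last row of the original image 1101, the boundary of the matrix is extended. That is, when the row position y of the center point of the matrix being calculated is less than or equal to half the number of rows of the matrix, or y is greater than or equal to the image height minus half the number of rows of the matrix, the matrix operation involves region No. 2 above the image, shown with diagonal checkered stripes in FIG. 11, and region No. 3 below the image, shown with horizontal stripes. In this case, the boundary of the matrix is extended. The extension manner can be predefined: for example, all values in the border-extension region are 0; or all values in the border-extension region are 1; or each value in the border-extension region equals the value of the adjacent pixel inside the matrix. This application does not limit the manner in which the border is padded.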
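The three padding choices mentioned above (all 0, all 1, or copying the adjacent image pixel) can be sketched as follows. The function name and the mode strings are illustrative assumptions.

```python
def pad_border(image, border, mode="zero"):
    """Return `image` surrounded by `border` extra rows and columns on each side.

    mode "zero" fills with 0, "one" fills with 1, and "replicate" copies
    the nearest pixel inside the image.
    """
    height, width = len(image), len(image[0])

    def pixel(y, x):
        if 0 <= y < height and 0 <= x < width:
            return image[y][x]
        if mode == "zero":
            return 0
        if mode == "one":
            return 1
        # replicate: clamp the coordinates to the image area
        return image[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    return [[pixel(y, x) for x in range(-border, width + border)]
            for y in range(-border, height + border)]


image = [[1, 2],
         [3, 4]]
replicated = pad_border(image, 1, mode="replicate")
zeros = pad_border(image, 1, mode="zero")
```

For a 3×3 operator a border of one pixel is enough, since the window reaches one pixel beyond its center in each direction.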
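Exemplarily, when the start of register tcurr is in region No. 4 and border-extension region No. 6 is not involved, the matrix operation is not affected. Because the width of region No. 4 equals the number of data items read or written by one SIMD instruction, register tnext takes values from region No. 5, shown with grey fill on the right side of FIG. 11, and/or region No. 6, shown with checkered fill. Since the picture itself has a border-extension region, the operation can proceed. For example, the black solid-line box in the register t2 row in FIG. 13 represents the first data item s0 in register tnext. This intermediate calculation result s0 is calculated from the data a0, b0, and c0 in register tnext, and a0, b0, and c0 lie within the border-extension region, so the matrix operation is not affected.
Exemplarily, when the start of register tcurr is in region No. 4 and border-extension region No. 6 is involved, or when the start of register tcurr is in region No. 5, shown with grey fill, the result of the matrix operation is obtained in scalar fashion, that is, by operating directly on the source data. In this case, the start of register tnext lies at an unknown address beyond the image border extension. As shown in FIG. 14, the first 4 columns in register tcurr are image data and the 5th column is padding data of the border-extension region, giving intermediate calculation results s1 to s5. Applying sliding-window processing to these intermediate results can only produce R1 to R4 and cannot produce R5. R5 must be obtained in scalar fashion, that is, by directly comparing the source data in the matrix.
Exemplarily, when the width of the image is an integer multiple of the number of data items read or written by one SIMD instruction, that is, when tail region No. 5 in FIG. 11 does not exist, the last element in tcurr is calculated in scalar fashion, that is, by directly comparing the source data in the matrix.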
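A minimal sketch of the scalar fallback for the maximum instruction: the result for one pixel is computed straight from the source data, without tprev or tnext. The function name, its parameters, and the sample strip are hypothetical, and the input is assumed to be already padded so that the window stays in range.

```python
def scalar_window_max(padded, y, x, m=3, n=3):
    """Maximum of the m x n window centered on (y, x) of an already padded image."""
    top, left = y - m // 2, x - n // 2
    return max(padded[r][c]
               for r in range(top, top + m)
               for c in range(left, left + n))


# Hypothetical 3 x 4 strip; the last column plays the role of border padding.
padded = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]
last_result = scalar_window_max(padded, 1, 2)
```

Only the few positions that the sliding window cannot cover take this path, so its cost stays small relative to the vectorized part.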
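Step 366a: Determine whether the current register is the (N-1)th register;
In a possible implementation, when j has not reached N, the computer device slides the sliding window and then stores the data in the sliding window into the (j+1)th register, until N rows of cache data stored in N registers are obtained.
The computer device determines whether the current register is the (N-1)th register. If the current register is the (N-1)th register, the computer device completes the sliding-window processing and executes step 380; if the current register is not the (N-1)th register, the computer device repeats step 364 and continues sliding the window to cache values.
Step 366b: Repeat step 364 until N rows of cache data stored in N registers are obtained;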
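The loop formed by steps 362, 364, and 366a/366b can be sketched as below. The list of window offsets is an assumption made for illustration (the description allows left slides, right slides, or a mix), and `fill_registers` is a hypothetical name.

```python
def fill_registers(line, start, k, offsets):
    """Store one k-element window per offset; offsets[0] is the initial position."""
    n = len(offsets)
    first = start + offsets[0]
    registers = [line[first:first + k]]   # step 362: register j = 0
    j = 0
    while j != n - 1:                     # step 366a: is register j the (N-1)th?
        shifted = start + offsets[j + 1]  # step 364: slide the window
        registers.append(line[shifted:shifted + k])
        j += 1
    return registers


line = list(range(12))  # stand-in for the intermediate result row
registers = fill_registers(line, start=4, k=4, offsets=[0, 1, -1])
```

The loop ends once register N-1 has been written, which leaves N registers, one per cached row.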
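In the N rows of cache data obtained, a column in which the N matrix elements of the intermediate calculation result are all present is a target column used for the subsequent operations.
Step 380: Calculate the matrix elements of the target column in the N rows of cache data by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction;
The target column is the column of the N rows of cache data in which the N matrix elements of the intermediate calculation result are all located.
The matrix elements in the target column are calculated as follows to obtain the calculation result of the matrix data under the single calculation instruction:
in response to the single calculation instruction corresponding to the image operator being a sum instruction, the matrix elements in the target column are added to obtain the calculation result;
in response to the single calculation instruction corresponding to the image operator being a mean filter instruction, the matrix elements in the target column are added and then divided by the number of matrix elements to obtain the calculation result;
in response to the single calculation instruction corresponding to the image operator being a maximum instruction, the maximum of the matrix elements in the target column is taken to obtain the calculation result;
in response to the single calculation instruction corresponding to the image operator being a minimum instruction, the minimum of the matrix elements in the target column is taken to obtain the calculation result;
in response to the single calculation instruction corresponding to the image operator being a convolution instruction, the matrix elements in the target column are added and then divided by the number of matrix elements to obtain the calculation result.
Step 400: Output the calculation result as the image processing result of the image operator for the matrix data.
Exemplarily, as shown in FIG. 15, finding the maximum value of a matrix with 3 rows and 3 columns through sliding-window processing is taken as an example. A sliding window of size 16 is selected, and its initial position starts at the start of register tcurr in row s. The data in the initial sliding window is stored in register t0; that is, in FIG. 15, the 16 data items in register tcurr are stored in register t0. The sliding window is slid one position to the right from the initial position, and the resulting window position is shown by the black solid-line box in FIG. 15. The data in the sliding window is stored in register t1; that is, the last 15 data items in register tcurr and the first data item in register tnext are stored in register t1. The current register t1 is determined not to be the (N-1)th register (that is, register t2), so the sliding-window operation continues. The sliding window is slid one position to the left from the initial position, and the data in the sliding window is stored in register t2; that is, the last data item in register tprev and the first 15 data items in register tcurr are stored in register t2. The current register t2 is determined to be the (N-1)th register, the target column is determined, and the single calculation instruction is used to calculate the matrix elements of the target column in the cache data. Taking the maximum over the target columns of registers t0, t1, and t2 implements the maximum instruction for the matrix with 3 rows and 3 columns. For example, the result of taking the maximum over columns s0, s1, and s2 is the maximum value of the matrix represented by the dashed box in FIG. 12.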
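The FIG. 15 walk-through can be mimicked with plain lists in place of 16-lane registers. The names `sprev`, `scurr`, and `snext` (the row-s contents of tprev, tcurr, and tnext) and the sample data are illustrative assumptions; on real hardware each list comprehension would be a vector instruction.

```python
K = 16  # lanes per register in the uint8_t NEON example


def window_max_3(sprev, scurr, snext):
    """Given the column maxima (row s) held in tprev, tcurr and tnext,
    return the 16 results of the 3x3 maximum for the lanes of tcurr."""
    line = sprev + scurr + snext
    t0 = line[K:2 * K]          # initial window: all of tcurr
    t1 = line[K + 1:2 * K + 1]  # slid right: last 15 of tcurr + first of tnext
    t2 = line[K - 1:2 * K - 1]  # slid left: last of tprev + first 15 of tcurr
    # Every lane is a target column, so one element-wise maximum of the
    # three registers yields 16 results at once.
    return [max(a, b, c) for a, b, c in zip(t0, t1, t2)]


# Hypothetical intermediate results for 48 consecutive columns.
s_row = [(7 * i) % 31 for i in range(3 * K)]
results = window_max_3(s_row[:K], s_row[K:2 * K], s_row[2 * K:])
```

Lane i of the result is the maximum of the three intermediate values centered on column i of tcurr, and therefore the maximum of the 3×3 matrix centered there.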
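Exemplarily, the matrix operation process shown in this embodiment for an image operator with 3 rows and 3 columns can be extended by analogy to other small image operators, such as 5 rows and 5 columns or 7 rows and 7 columns. As shown in the above embodiment, when the convolution kernel matrix has 3 rows and 3 columns, the intermediate calculation result is multiplexed and rearranged into 3 rows of cache data through the sliding window, and the storage resources of at most 2 non-target columns are consumed; when the convolution kernel matrix has 5 rows and 5 columns, the intermediate calculation result is multiplexed and rearranged into 5 rows of cache data through the sliding window, and the storage resources of at most 4 non-target columns are consumed; by analogy, when the convolution kernel matrix has 7 rows and 7 columns, the intermediate calculation result is multiplexed and rearranged into 7 rows of cache data through the sliding window, and the storage resources of at most 6 non-target columns are consumed. It can be seen that the smaller the number of channels of the matrix, the fewer storage resources are spent on non-target columns, so this method significantly improves the operation efficiency of small image operators.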
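The overhead figures above follow a simple pattern: at most N-1 non-target columns for an operator with N columns. The sketch below tabulates this and, as an added assumption not stated in the source, the number of target columns left in one 16-lane register when no neighbouring registers supply the elements that slide in.

```python
def non_target_columns(n):
    """Upper bound on wasted columns for an operator with n columns."""
    return n - 1


def usable_results(k, n):
    """Target columns left in one k-lane register when the window can only
    shift within that register (an assumption made for illustration)."""
    return max(k - (n - 1), 0)


overhead = {n: non_target_columns(n) for n in (3, 5, 7)}
```

With 16 lanes this leaves 14, 12, and 10 target columns for the 3×3, 5×5, and 7×7 operators respectively.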
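In summary, this embodiment calculates each column of the matrix data separately, applies sliding-window processing to the resulting intermediate calculation result, and calculates the data that has been multiplexed and rearranged by the sliding-window processing to obtain the calculation result of the matrix data. This embodiment multiplexes the intermediate calculation result through sliding-window processing and, in combination with SIMD instructions, obtains the calculation results of multiple matrices at one time, which reduces repeated calculation on individual matrix elements, increases the concurrency of the matrix operation, and thereby improves the matrix operation efficiency for image data.
FIG. 16 compares the time taken to perform matrix operations on image data with this method and with a conventional method. For data of type uint8_t, a 3×3 image operator is selected to perform matrix operations on the image data. Across different resolutions, the speed-up ratio of this method relative to the OpenCV method is approximately between 1.8 and 2.1, where the speed-up ratio is the ratio of the time taken by the conventional OpenCV method to the time taken by this method. It can be seen that this method greatly improves the matrix operation efficiency for image data.
The following are apparatus embodiments of this application, which can be used to perform the method embodiments of this application. For details not disclosed in the apparatus embodiments of this application, refer to the method embodiments of this application.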
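FIG. 17 is a structural block diagram of a matrix operation apparatus for image data according to an exemplary embodiment of this application. The apparatus includes:
a reading module 500, configured to read matrix data in the image data based on the matrix size of an image operator, M rows and N columns, where M and N are positive integers;
a calculation module 520, configured to calculate column data in the matrix data by using a single calculation instruction corresponding to the image operator, to obtain an intermediate calculation result, where the intermediate calculation result is in row form;
a multiplexing module 540, configured to multiplex and rearrange the intermediate calculation result into N rows of cache data;
the calculation module 520 being further configured to calculate the matrix elements of a target column in the N rows of cache data by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction, where the target column contains the N matrix elements of the intermediate calculation result; and
an output module 560, configured to output the calculation result as the image processing result of the image operator for the matrix data.
In a possible design, the multiplexing module 540
is configured to use a processing instruction based on a sliding window to store the data in the sliding window into the jth register, where the data in the sliding window includes part or all of the 1 row of intermediate calculation result and the initial value of j is 0; and, when j has not reached N, after sliding the sliding window, to store the data in the sliding window into the (j+1)th register, until N rows of cache data stored in N registers are obtained, where the data in the sliding window includes part or all of the 1 row of intermediate calculation result.
In a possible design, the processing instruction is a single instruction multiple data instruction, and the processing instruction supports processing K data items simultaneously;
the multiplexing module 540 is configured to use the processing instruction based on the sliding window to store the K data items in the sliding window into the jth register, where the data in the sliding window includes part or all of the intermediate calculation result; and, after sliding the sliding window, to store the K data items in the sliding window into the (j+1)th register.
In a possible design, the multiplexing module 540 is configured to store the K data items in the sliding window into the (j+1)th register after sliding the sliding window to the left by one position; or, the multiplexing module 540 is configured to store the K data items in the sliding window into the (j+1)th register after sliding the sliding window to the right by one position.
In a possible design, at least one of the following holds in the N rows of cache data: the matrix element at row j, column i is the same as the matrix element at row j+1, column i-1; the matrix element at row t, column i is the same as the matrix element at row t-1, column i+1.
In a possible design, the single calculation instruction is a mean filter instruction or a convolution instruction, and the calculation module 520 is configured to add the N matrix elements of the target column in the cache data to obtain a matrix element sum, divide the matrix element sum by the number of matrix elements, and output the calculation result of the matrix data under the single calculation instruction, where the number of matrix elements is equal to M times N.
In an exemplary embodiment, a computer-readable storage medium is further provided. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the matrix operation method for image data provided by the above method embodiments.
In an exemplary embodiment, a computer program product or computer program is further provided. The computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the matrix operation method for image data described in the above aspects.
A person of ordinary skill in the art can understand that all or some of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only optional embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the protection scope of this application.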

Claims (16)

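  1. A matrix operation method for image data, the method being performed by a computer device and comprising:
    reading matrix data in the image data based on the matrix size of an image operator, M rows and N columns, where M and N are positive integers;
    calculating column data in the matrix data by using a single calculation instruction corresponding to the image operator, to obtain an intermediate calculation result, where the intermediate calculation result is in row form;
    multiplexing and rearranging the intermediate calculation result into N rows of cache data;
    calculating the matrix elements of a target column in the N rows of cache data by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction, where the target column contains the N matrix elements of the intermediate calculation result; and
    outputting the calculation result as the image processing result of the image operator for the matrix data.
  2. The method according to claim 1, wherein the multiplexing and rearranging the one row of intermediate calculation result into N rows of cache data comprises:
    using a processing instruction based on a sliding window to store the data in the sliding window into the jth register, where the data in the sliding window includes part or all of the intermediate calculation result and the initial value of j is 0; and
    when j has not reached N, after sliding the sliding window, storing the data in the sliding window into the (j+1)th register, until N rows of cache data stored in N registers are obtained, where the data in the sliding window includes part or all of the intermediate calculation result.
  3. The method according to claim 2, wherein the processing instruction is a single instruction multiple data instruction, and the processing instruction supports processing K data items simultaneously;
    the using a processing instruction based on a sliding window to store the data in the sliding window into the jth register comprises:
    using the processing instruction based on the sliding window to store the K data items in the sliding window into the jth register, where the data in the sliding window includes part or all of the intermediate calculation result; and
    the storing, after sliding the sliding window, the data in the sliding window into the (j+1)th register comprises:
    after sliding the sliding window, storing the K data items in the sliding window into the (j+1)th register.
  4. The method according to claim 3, wherein the storing, after sliding the sliding window, the K data items in the sliding window into the (j+1)th register comprises:
    after sliding the sliding window to the left by one position, storing the K data items in the sliding window into the (j+1)th register;
    or,
    after sliding the sliding window to the right by one position, storing the K data items in the sliding window into the (j+1)th register.
  5. The method according to any one of claims 2 to 4, wherein at least one of the following holds in the N rows of cache data:
    the matrix element at row j, column i is the same as the matrix element at row j+1, column i-1;
    the matrix element at row t, column i is the same as the matrix element at row t-1, column i+1.
  6. The method according to any one of claims 1 to 4, wherein the single calculation instruction includes at least one of the following:
    a sum instruction;
    a mean filter instruction;
    a maximum instruction;
    a minimum instruction;
    a convolution instruction.
  7. The method according to any one of claims 1 to 4, wherein the single calculation instruction is a mean filter instruction or a convolution instruction, and the calculating the matrix elements of the target column in the N rows of cache data by using the single calculation instruction and outputting the calculation result of the matrix data under the single calculation instruction comprises:
    adding the N matrix elements of the target column in the cache data to obtain a matrix element sum;
    dividing the matrix element sum by the number of matrix elements, and outputting the calculation result of the matrix data under the single calculation instruction;
    where the number of matrix elements is equal to M times N.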
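  8. A matrix operation apparatus for image data, the apparatus comprising:
    a reading module, configured to read matrix data in the image data based on the matrix size of an image operator, M rows and N columns, where M and N are positive integers;
    a calculation module, configured to calculate column data in the matrix data by using a single calculation instruction corresponding to the image operator, to obtain an intermediate calculation result, where the intermediate calculation result is in row form;
    a multiplexing module, configured to multiplex and rearrange the intermediate calculation result into N rows of cache data;
    the calculation module being further configured to calculate the matrix elements of a target column in the N rows of cache data by using the single calculation instruction, to obtain the calculation result of the matrix data under the single calculation instruction, where the target column contains the N matrix elements of the intermediate calculation result; and
    an output module, configured to output the calculation result as the image processing result of the image operator for the matrix data.
  9. The apparatus according to claim 8, wherein
    the multiplexing module is configured to use a processing instruction based on a sliding window to store the data in the sliding window into the jth register, where the data in the sliding window includes part or all of the 1 row of intermediate calculation result and the initial value of j is 0; and, when j has not reached N, after sliding the sliding window, to store the data in the sliding window into the (j+1)th register, until N rows of cache data stored in N registers are obtained, where the data in the sliding window includes part or all of the intermediate calculation result.
  10. The apparatus according to claim 9, wherein the processing instruction is a single instruction multiple data instruction, and the processing instruction supports processing K data items simultaneously;
    the multiplexing module is configured to use the processing instruction based on the sliding window to store the K data items in the sliding window into the jth register, where the data in the sliding window includes part or all of the intermediate calculation result; and, after sliding the sliding window, to store the K data items in the sliding window into the (j+1)th register.
  11. The apparatus according to claim 10, wherein
    the multiplexing module is configured to store the K data items in the sliding window into the (j+1)th register after sliding the sliding window to the left by one position;
    or,
    the multiplexing module is configured to store the K data items in the sliding window into the (j+1)th register after sliding the sliding window to the right by one position.
  12. The apparatus according to any one of claims 9 to 11, wherein at least one of the following holds in the N rows of cache data:
    the matrix element at row j, column i is the same as the matrix element at row j+1, column i-1;
    the matrix element at row t, column i is the same as the matrix element at row t-1, column i+1.
  13. The apparatus according to any one of claims 8 to 11, wherein the single calculation instruction is a mean filter instruction or a convolution instruction, and the calculation module is configured to add the N matrix elements of the target column in the cache data to obtain a matrix element sum, divide the matrix element sum by the number of matrix elements, and output the calculation result of the matrix data under the single calculation instruction, where the number of matrix elements is equal to M times N.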
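  14. A computer device, comprising a processor, a memory connected to the processor, and program instructions stored in the memory, where the processor, when executing the program instructions, implements the matrix operation method for image data according to any one of claims 1 to 7.
  15. A computer-readable storage medium storing program instructions, characterized in that the program instructions, when executed by a processor, implement the matrix operation method for image data according to any one of claims 1 to 7.
  16. A computer program product or computer program, comprising computer instructions stored in a computer-readable storage medium, where a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the matrix operation method for image data according to any one of claims 1 to 7.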
PCT/CN2022/082811 2021-03-31 2022-03-24 Matrix operation method, apparatus, and device for image data, and storage medium WO2022206556A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22778739.7A EP4227886A4 (en) 2021-03-31 2022-03-24 MATRIX OPERATING METHOD AND APPARATUS FOR IMAGE DATA, APPARATUS AND STORAGE MEDIUM
US17/976,185 US20230049471A1 (en) 2021-03-31 2022-10-28 Method and apparatus for operating image data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110349762.2 2021-03-31
CN202110349762.2A CN112991142B (zh) 2021-03-31 2021-03-31 Matrix operation method, apparatus, and device for image data, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/976,185 Continuation US20230049471A1 (en) 2021-03-31 2022-10-28 Method and apparatus for operating image data

Publications (1)

Publication Number Publication Date
WO2022206556A1 true WO2022206556A1 (zh) 2022-10-06

Family

ID=76338653

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/082811 WO2022206556A1 (zh) 2021-03-31 2022-03-24 Matrix operation method, apparatus, and device for image data, and storage medium

Country Status (4)

Country Link
US (1) US20230049471A1 (zh)
EP (1) EP4227886A4 (zh)
CN (1) CN112991142B (zh)
WO (1) WO2022206556A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862725B (zh) * 2021-03-12 2023-10-27 上海壁仞智能科技有限公司 用于计算的方法、计算设备和计算机可读存储介质
CN112991142B (zh) * 2021-03-31 2023-06-16 腾讯科技(深圳)有限公司 图像数据的矩阵运算方法、装置、设备及存储介质
CN114061591B (zh) * 2021-11-18 2022-07-12 东南大学 一种基于滑动窗数据回溯的等值线匹配方法
CN116740262A (zh) * 2022-03-04 2023-09-12 华为技术有限公司 数据处理方法及装置、电子设备和存储介质
CN115859011B (zh) * 2022-11-18 2024-03-15 上海天数智芯半导体有限公司 矩阵运算方法、装置及单元、电子设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000175043A (ja) * 1998-12-08 2000-06-23 Canon Inc 画像処理装置及び画像処理方法
CN103368890A (zh) * 2012-04-01 2013-10-23 京信通信系统(中国)有限公司 一种信号处理方法及装置
CN110246078A (zh) * 2019-05-31 2019-09-17 北京航空航天大学 一种基于嵌入式gpu和卷积计算的图像处理方法和装置
CN110263909A (zh) * 2018-03-30 2019-09-20 腾讯科技(深圳)有限公司 图像识别方法及装置
CN111582467A (zh) * 2020-05-14 2020-08-25 上海商汤智能科技有限公司 人工智能加速器和电子设备
CN112991142A (zh) * 2021-03-31 2021-06-18 腾讯科技(深圳)有限公司 图像数据的矩阵运算方法、装置、设备及存储介质

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06342450A (ja) * 1993-06-01 1994-12-13 Fujitsu Ltd 行列乗算装置
US7873812B1 (en) * 2004-04-05 2011-01-18 Tibet MIMAR Method and system for efficient matrix multiplication in a SIMD processor architecture
WO2009037684A2 (en) * 2007-09-19 2009-03-26 Provost Fellows And Scholars Of The College Of The Holy And Undivided Trinity Of Queen Elizabeth Near Dublin Sparse matrix by vector multiplication
GB2470780B (en) * 2009-06-05 2014-03-26 Advanced Risc Mach Ltd A data processing apparatus and method for performing a predetermined rearrangement operation
JP5951570B2 (ja) * 2013-09-13 2016-07-13 株式会社東芝 行列演算装置
CN111859273A (zh) * 2017-12-29 2020-10-30 华为技术有限公司 矩阵乘法器
CN110415157B (zh) * 2018-04-26 2024-01-30 华为技术有限公司 一种矩阵乘法的计算方法及装置
KR102555057B1 (ko) * 2018-05-09 2023-07-12 에스케이하이닉스 주식회사 웨이트 매트릭스를 포맷하는 방법, 포맷된 데이터를 사용하는 가속기 및 이를 포함하는 시스템
CN110147347B (zh) * 2019-03-18 2023-01-06 腾讯科技(深圳)有限公司 用于矩阵处理的芯片、矩阵处理方法、装置及存储介质
CN109948790A (zh) * 2019-03-27 2019-06-28 苏州浪潮智能科技有限公司 一种神经网络处理方法、装置、设备及存储介质
CN110580324B (zh) * 2019-07-23 2020-11-17 珠海格力电器股份有限公司 图像矩阵运算方法、装置、计算机设备和存储介质
CN112446007A (zh) * 2019-08-29 2021-03-05 上海华为技术有限公司 一种矩阵运算方法、运算装置以及处理器
CN112149694B (zh) * 2020-08-28 2024-04-05 特斯联科技集团有限公司 一种基于卷积神经网络池化模块的图像处理方法、系统、存储介质及终端
CN112069460A (zh) * 2020-09-18 2020-12-11 Oppo广东移动通信有限公司 数据处理方法、装置以及电子设备

Also Published As

Publication number Publication date
CN112991142A (zh) 2021-06-18
EP4227886A1 (en) 2023-08-16
EP4227886A4 (en) 2024-07-03
US20230049471A1 (en) 2023-02-16
CN112991142B (zh) 2023-06-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22778739

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022778739

Country of ref document: EP

Effective date: 20230511

NENP Non-entry into the national phase

Ref country code: DE