CN113344768A - Method for realizing image matrix convolution, computing equipment and storage medium


Info

Publication number
CN113344768A
Authority
CN
China
Prior art keywords
matrix
image matrix
convolution
expanded
elements
Prior art date
Legal status
Granted
Application number
CN202110878366.9A
Other languages
Chinese (zh)
Other versions
CN113344768B (en)
Inventor
王正阳
张勇
刘明航
Current Assignee
Chengdu Tongxin Software Technology Co ltd
Original Assignee
Chengdu Tongxin Software Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Tongxin Software Technology Co ltd filed Critical Chengdu Tongxin Software Technology Co ltd
Priority to CN202110878366.9A priority Critical patent/CN113344768B/en
Priority to CN202111094262.5A priority patent/CN113724127B/en
Publication of CN113344768A publication Critical patent/CN113344768A/en
Application granted granted Critical
Publication of CN113344768B publication Critical patent/CN113344768B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method for realizing image matrix convolution, a computing device and a storage medium. The method comprises: acquiring an image matrix to be convolved and a convolution kernel; expanding the image matrix to be convolved into a row matrix according to the size of the convolution kernel to obtain a first expanded image matrix; converting the first expanded image matrix into a column matrix to obtain a second expanded image matrix; expanding the convolution kernel into a matrix whose number of columns equals that of the second expanded image matrix and whose number of rows equals the size of the convolution kernel, to obtain a third expanded matrix, wherein the size of each row of data in the second expanded image matrix and in the third expanded matrix is the size of a vector register; and performing a convolution operation on the second expanded image matrix and the third expanded matrix to obtain a feature matrix of the image matrix. The invention uses the vector register of the CPU in the computing device to execute multiple floating point data operations at the same time, which significantly improves the convolution computation speed of the image matrix.

Description

Method for realizing image matrix convolution, computing equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method for implementing image matrix convolution, a computing device, and a storage medium.
Background
In the information age, the requirements for real-time and efficient data processing (for example, of image data) keep increasing, and image matrix convolution is a very important data processing means, so improving the computational efficiency of image matrix convolution is a major current concern.
In the prior art, the computational efficiency of image matrix convolution is improved by various algorithms, such as the im2col algorithm. The im2col algorithm expands a two-dimensional image matrix into rows in advance, from left to right and from top to bottom, according to the size of the convolution kernel, and the new rows are allocated contiguously in memory, which is equivalent to expanding the original matrix into a row matrix. When the convolution operation is performed, the contiguous memory layout greatly reduces the probability of cache misses and significantly improves the speed of memory access. However, such image matrix convolution relies only on software, that is, only on the compiler, so the computation speed is limited; if the computational efficiency is to be improved further, additional hardware has to be attached. The conventional image matrix convolution is therefore complex to implement and inefficient.
For this reason, it is desirable to provide an efficient implementation of image matrix convolution to solve the above problems.
Disclosure of Invention
To this end, the present invention provides a method of implementing a convolution of an image matrix in an attempt to solve, or at least alleviate, the problems identified above.
According to one aspect of the invention, there is provided a method for implementing convolution of an image matrix, the method being executed in a computing device and comprising:
acquiring an image matrix to be convolved and a convolution kernel, wherein each element of the image matrix to be convolved is a pixel value of an image;
expanding the image matrix to be convolved into a row matrix to obtain a first expanded image matrix;
converting the first expanded image matrix into a column matrix to obtain a second expanded image matrix, wherein the size of each row of data in the second expanded image matrix is the size of a vector register of a CPU (central processing unit) in the computing device;
expanding the convolution kernel into a matrix with the column number of the second expanded image matrix as the column number and the size of the convolution kernel as the row number to obtain a third expanded matrix, wherein the size of each row of data in the third expanded matrix is the size of the vector register;
and performing convolution operation on the second expanded image matrix and the third expanded matrix to obtain a convolution result matrix of the image matrix, wherein the convolution result matrix is a characteristic matrix of the image matrix.
Optionally, the step of expanding the image matrix to be convolved into a row matrix to obtain a first expanded image matrix includes:
and according to the size of the convolution kernel, expanding the image matrix to be convolved into a row matrix through an im2col algorithm to be used as a first expanded image matrix.
Optionally, the method further comprises the steps of:
and normalizing the image matrix to be convolved so that the size of each element in the resulting image matrix to be convolved is the floating point data size for which the vector register supports floating point data operations.
Optionally, the method further comprises the steps of:
allocating continuous memory addresses to the first expanded image matrix as a first memory space;
and storing the first expanded image matrix into a first memory space.
Optionally, the step of converting the first expanded image matrix into a column matrix to obtain a second expanded image matrix includes:
determining the number of times that the vector register can simultaneously execute floating-point data operations, and taking this number as a first numerical value;
taking elements of each column in adjacent rows of the first expanded image matrix as a group of data, and taking the group of data as a row of elements of the matrix, wherein the row number of the adjacent rows of the first expanded image matrix is equal to the first numerical value;
and when the remaining number of rows of the first expanded image matrix is less than the first numerical value, filling missing elements in the first expanded image matrix by zero elements until the remaining number of rows is equal to the first numerical value, and obtaining a second expanded image matrix.
Optionally, the method further comprises the steps of:
distributing continuous memory addresses for the second expanded image matrix to serve as a second memory space;
and storing the second expanded image matrix into a second memory space.
Optionally, the method further comprises the steps of:
forcibly converting each element in the convolution kernel to the floating point data size for which the vector register supports floating point data operations;
wherein, the step of unfolding the convolution kernel into a matrix with the column number of the second unfolded image matrix as the column number and the convolution kernel size as the row number to obtain a third unfolded matrix comprises:
copying each element in the converted convolution kernel, according to a first preset rule, a number of times equal to the number of columns of the second expanded image matrix to obtain a group of identical data, wherein the first preset rule is an order from left to right and from top to bottom;
and taking the obtained group of same data as a row of elements of the matrix in sequence according to the copying sequence to obtain a third expansion matrix.
Optionally, the method further comprises the steps of:
allocating continuous memory addresses to the third expansion matrix as a third memory space;
and saving the third expansion matrix to a third memory space.
Optionally, the step of performing convolution operation on the second expanded image matrix and the third expanded matrix to obtain a convolution result matrix of the image matrix includes:
determining the size of a convolution result matrix according to the image matrix to be convolved and the convolution kernel;
dividing the second expanded image matrix into a plurality of groups of matrixes according to the size of the third expanded matrix, wherein the number of rows and columns of each group of matrixes is the same as that of the third expanded matrix, and elements of each row in each group of matrixes are not overlapped;
performing convolution operation on each group of matrixes and the third expansion matrix to obtain a convolution result of each element in each group of matrixes;
and writing the convolution result of the last row of elements of each group of matrixes into the convolution result matrix of the image matrix according to a first preset rule.
Optionally, the step of writing the convolution result of the last row of elements of each group of matrices into the convolution result matrix of the image matrix according to a first preset rule includes:
and if the total number of the convolution results of the last row of elements of each group of matrixes is larger than the size of the convolution result matrix, selecting the convolution results with the size of the former convolution kernel from the convolution results of the last row of elements of each group of matrixes, and writing the convolution results into the convolution result matrix of the image matrix according to a first preset rule.
Optionally, the step of performing convolution operation on each group of matrices and the third expansion matrix to obtain a convolution result of each element in each group of matrices includes:
reading a row of elements from any one of the plurality of sets of matrices as a first set of elements;
reading a row of elements equal to the number of rows of the first set of elements from the third expanded matrix as a second set of elements;
and multiplying each element in the first group of elements by the element at the corresponding position in the second group of elements, and summing each product with the convolution result of the element in the previous row having the same column number, to obtain the convolution result of each element.
Optionally, the floating point data operation supported by the vector register has a floating point data size of any one of 16 bits, 32 bits, and 64 bits.
Optionally, the CPU of the computing device supports a SIMD instruction set, the CPU of the computing device comprising vector registers.
According to an aspect of the present invention, there is provided a computing device comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method as described above.
According to an aspect of the present invention, there is provided a readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the method as described above.
According to the technical scheme of the invention, a method for implementing image matrix convolution is provided. The method first obtains an image matrix to be convolved and a convolution kernel, expands the image matrix to be convolved into a row matrix according to the size of the convolution kernel to obtain a first expanded image matrix, and converts the first expanded image matrix into a column matrix to obtain a second expanded image matrix, where the size of each row of data in the second expanded image matrix is the size of a vector register of the CPU in the computing device. The convolution kernel is then expanded into a matrix whose number of columns equals that of the second expanded image matrix and whose number of rows equals the size of the convolution kernel, to obtain a third expanded matrix. Finally, a convolution operation is performed on each row of data in the second expanded image matrix and the corresponding row of data in the third expanded matrix to obtain the convolved image matrix, and the data storage addresses of the second expanded image matrix and the third expanded matrix are consecutive.
According to the method for implementing matrix convolution of the present invention, data equal in size to the vector register is read from the second expanded image matrix and from the third expanded matrix in a single access, so that the vector register of the CPU in the computing device is fully utilized, multiple floating point data operations can be executed at the same time, and SIMD instructions can be used.
Secondly, in the convolution operation process, because the data storage addresses of the second expanded image matrix and the third expanded matrix are continuous, the data reading and writing speed from the second expanded image matrix and the third expanded matrix is high, and the convolution operation speed of the image matrix is further improved. In addition, in the process of inverse normalization, because the data storage addresses of the convolution result matrix of the image matrix are continuous, the speed of reading the convolution result matrix of the image matrix is high, and the speed of convolution calculation of the image matrix is further improved.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of a computing device 100, according to one embodiment of the invention;
FIG. 2 shows a flow diagram of a method 200 for implementing image matrix convolution according to one embodiment of the present invention;
FIG. 3 shows a schematic diagram of a normalized image matrix to be convolved according to an embodiment of the invention;
FIG. 4 shows a schematic diagram of a preprocessed convolution kernel according to one embodiment of the present invention;
FIG. 5 shows a schematic diagram of a first unfolded image matrix according to one embodiment of the invention;
FIG. 6 illustrates a first unfolded image matrix formed after filling in 3 rows of elements of size 0, according to one embodiment of the present invention;
FIG. 7 shows a schematic diagram of a second unfolded image matrix according to one embodiment of the invention;
FIG. 8 shows a schematic diagram of a third expansion matrix according to one embodiment of the invention; and
FIG. 9 shows a schematic diagram of a convolution result matrix of an image matrix according to one embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Image matrix convolution is a very effective and fast tool for achieving basic image processing effects. In image processing, an image is convolved in matrix form, that is, the image matrix is convolved. The process of image matrix convolution can be regarded as a "sliding window" process: a smaller matrix, used as the convolution kernel, slides over the entire matrix to be convolved, and at each position a matrix inner product with the convolution kernel is performed (that is, the sub-matrix of the convolved matrix covered by the sliding window is multiplied element-wise with the convolution kernel and the products are accumulated); the results of these operations form a brand new matrix.
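For reference, the following is a minimal sketch (not taken from the patent) of this sliding-window view of image matrix convolution, assuming stride 1 and no padding; the function and parameter names are illustrative only:
#include <cstddef>
#include <vector>

// Slides a k x k kernel over an inH x inW image and performs a multiply-accumulate
// (matrix inner product) at every window position.
std::vector<float> naiveConvolve(const std::vector<float>& img, int inH, int inW,
                                 const std::vector<float>& kernel, int k) {
    const int outH = inH - k + 1;
    const int outW = inW - k + 1;
    std::vector<float> out(static_cast<size_t>(outH) * outW, 0.0f);
    for (int y = 0; y != outH; ++y) {
        for (int x = 0; x != outW; ++x) {
            float sum = 0.0f;
            for (int j = 0; j != k; ++j) {
                for (int i = 0; i != k; ++i) {
                    sum += img[(y + j) * inW + (x + i)] * kernel[j * k + i];
                }
            }
            out[y * outW + x] = sum;
        }
    }
    return out;
}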
In the prior art, the computational efficiency of image matrix convolution is improved through algorithms such as the im2col algorithm, the FFT algorithm, and the Winograd algorithm. All of these algorithms accelerate the convolution computation; the im2col algorithm is taken as an example here to describe the computation process of matrix convolution. The im2col algorithm expands a two-dimensional image matrix into rows in advance, from left to right and from top to bottom, according to the size of the convolution kernel, and the new rows are allocated contiguously in memory, which is equivalent to expanding the original matrix into a row matrix; because the memory layout is contiguous when the convolution operation is performed, the cache miss probability is greatly reduced and the memory access speed is significantly improved. However, such image matrix convolution relies only on software, that is, only on the compiler, so the computation speed is limited; if the computational efficiency is to be improved further, additional hardware has to be attached. The conventional image matrix convolution is therefore complicated to implement and has low computational efficiency.
To this end, the invention proposes a method for implementing image matrix convolution, the method being implemented in a computing device. FIG. 1 shows a block diagram of a computing device 100 according to one embodiment of the invention. As shown in FIG. 1, in a basic configuration 102, the computing device 100 typically includes a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, application 122 may be arranged to operate with program data 124 on an operating system. The program data 124 comprises instructions, and in the computing device 100 according to the invention the program data 124 comprises instructions for performing the method 200 for implementing a convolution of an image matrix.
The computing device 100 also includes a storage device 132, the storage device 132 including removable storage 136 and non-removable storage 138, the removable storage 136 and the non-removable storage 138 each connected to the storage interface bus 134. In the present invention, the data related to each event occurring during the program execution process and the time information indicating the occurrence of each event may be stored in the storage device 132, and the operating system 120 is adapted to manage the storage device 132. The storage device 132 may be a magnetic disk.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes an image processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as a server, such as a file server, a database server, an application server, a WEB server, etc., or as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless WEB-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer including both desktop and notebook computer configurations. In some embodiments, the computing device 100 is configured to perform the method 200 for implementing image matrix convolution according to the present invention.
FIG. 2 shows a flow diagram of a method 200 for implementing image matrix convolution according to an embodiment of the present invention. The method 200 is suitable for execution in a computing device 100, such as the computing device 100 described above. The method of the present invention can be applied to the convolution calculation process of the convolutional neural network in the field of image processing (including image recognition (e.g., AI image recognition), image enhancement, image segmentation, etc.), and can also be integrated into the convolution algorithm of image matrix convolution to increase the speed of convolution calculation.
It should be noted that the CPU of the computing device in the present invention supports a SIMD instruction set and includes vector registers; all CPUs satisfying these conditions fall within the scope of the present invention, and the invention is not limited in this respect. For example, the CPU of the computing device may be a Loongson 3A4000 CPU, which supports the MIPS MSA instruction set; the MIPS MSA instruction set belongs to the SIMD (single instruction, multiple data) class, meaning that a single CPU instruction operates on multiple data items at the same time. The Loongson 3A4000 CPU includes vector registers, and the size of a vector register may be 128 bits, 256 bits, 512 bits, and so on, which is not limited in this invention.
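As an illustration only (not part of the patent), the following sketch shows how the number of floating point operations per vector instruction follows from the register width and the element width, assuming a 128-bit vector register and 32-bit floating point elements as in the Loongson 3A4000 example:
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t registerBits = 128;                  // size of one vector register
    const uint32_t elementBits = 32;                    // element size after normalization
    const uint32_t lanes = registerBits / elementBits;  // 4 floats per vector instruction
    std::printf("one vector instruction processes %u floating point elements\n", lanes);
    return 0;
}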
As shown in fig. 2, the method 200 includes steps S201 to S210. The method 200 starts in step S201, in which an image matrix to be convolved and a preset convolution kernel are obtained. The present invention is mainly directed to the convolution operation of an image matrix; the image matrix may be an image matrix in image processing (such as image recognition, e.g. AI image recognition, image enhancement, or image segmentation), and each element of the image matrix to be convolved is a pixel value of the image. The invention does not limit the size of the image matrix to be convolved or the size of the convolution kernel; for example, the size of the image matrix to be convolved may be 5X5 and the size of the convolution kernel 3X3.
After the image matrix to be convolved is obtained, step S202 is executed to normalize the image matrix to be convolved and obtain the normalized image matrix to be convolved. In one embodiment, each element in the image matrix to be convolved is processed into a uniform size during normalization, and the uniform size is the floating point data size for which the vector register supports floating point data operations. That size is one of 16 bits, 32 bits, and 64 bits, that is, each element in the image matrix to be convolved can be uniformly processed into 16 bits, 32 bits, or 64 bits during normalization. For example, fig. 3 shows a schematic diagram of a normalized image matrix to be convolved according to an embodiment of the present invention; the size of the normalized image matrix to be convolved in fig. 3 is 5X5.
Then, step S203 is executed to preprocess the convolution kernel, specifically, to forcibly convert each element of the convolution kernel (for example, by means of the compiler) to the floating point data size for which the vector register supports floating point data operations. This size is the same as the uniform size determined in step S202: if the uniform size is 32 bits, each element of the convolution kernel is forcibly converted to 32 bits in step S203, and if the uniform size is 64 bits, each element is forcibly converted to 64 bits. For example, FIG. 4 shows a schematic diagram of a preprocessed convolution kernel according to one embodiment of the invention; the preprocessed convolution kernel in FIG. 4 has size 3X3.
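The following is a minimal sketch (not the patent's code) of steps S202 and S203 under the assumption that normalization converts each 8-bit pixel to a 32-bit float, with the scaling to [0, 1] shown only as one possible convention, and that the convolution kernel is force-cast element by element to the same float size; the function names are illustrative:
#include <cstddef>
#include <cstdint>
#include <vector>

// Step S202: normalize the image matrix so every element has the 32-bit float size.
std::vector<float> normalizeImage(const std::vector<uint8_t>& pixels) {
    std::vector<float> out(pixels.size());
    for (size_t i = 0; i != pixels.size(); ++i) {
        out[i] = static_cast<float>(pixels[i]) / 255.0f;  // assumed scaling convention
    }
    return out;
}

// Step S203: forcibly convert each kernel element to the same 32-bit float size.
std::vector<float> castKernel(const std::vector<double>& kernel) {
    std::vector<float> out(kernel.size());
    for (size_t i = 0; i != kernel.size(); ++i) {
        out[i] = static_cast<float>(kernel[i]);
    }
    return out;
}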
After the image matrix to be convolved is normalized and the convolution kernel is preprocessed, step S204 is executed to expand the normalized image matrix to be convolved into a row matrix to obtain a first expanded image matrix. In one embodiment, the normalized image matrix to be convolved is expanded into a row matrix by the im2col algorithm according to the size of the convolution kernel. Specifically, the normalized image matrix is expanded, window by window, into the rows of the first expanded image matrix by sliding the window one element to the right at a time, in order from left to right and from top to bottom, according to the size of the convolution kernel.
Taking the normalized size of the image matrix to be convolved as 5X5 (as shown in fig. 3), the size of the convolution kernel as 3X3 (as shown in fig. 4), the size of each element in the normalized image matrix to be convolved as 32 bits, and the size of each element in the preprocessed convolution kernel as 32 bits as an example, a process of obtaining the first expanded image matrix will be described. According to the size of the convolution kernel, taking the first row and first column of elements of the normalized image matrix to be convolved as the starting position, selecting the elements of the size of the convolution kernel, such as the elements included in the square frame 3-1 in fig. 3, and expanding the elements in the square frame 3-1 into a row of elements (i.e., 1, 1, 1, 0, 1, 0, 1, 0, 0) from left to right and from top to bottom, so as to obtain the first row of elements of the matrix shown in fig. 5, and fig. 5 shows a schematic diagram of the first expanded image matrix according to an embodiment of the present invention.
Sliding an element to the right, taking the first row and the second column of the normalized image matrix to be convolved as the starting positions, selecting the elements with the size of the convolution kernel, such as the elements included in the square frame 3-2 in fig. 3, and expanding the elements in the square frame 3-2 into a row of elements (i.e. 1, 1, 1, 0, 1, 0, 0, 0, 1) from left to right and from top to bottom, so as to obtain the second row of elements of the matrix shown in fig. 5. Then, sliding an element to the right, taking the third column element in the first row of the normalized image matrix to be convolved as the starting position, selecting the elements with the convolution kernel size, such as the elements included in the square box 3-3 in fig. 3, and expanding the elements in the square box 3-3 into a row of elements (i.e. 1, 1, 1, 0, 1, 1, 0) from left to right in the order from top to bottom, so as to obtain the third row of elements of the matrix shown in fig. 5.
After the first row of elements of the normalized image matrix to be convolved is expanded, one element is slid down from the first row and the first column of elements of the normalized image matrix to be convolved, that is, the second row and the first column of elements of the normalized image matrix to be convolved are expanded into the fourth row of elements (0, 1, 0, 1, 0, 0, 1, 0, 1) of the matrix shown in fig. 5 according to the above process with the second row and the first column of elements of the normalized image matrix to be convolved as the starting positions, such as the elements included in the square boxes 3-4 in fig. 3. And so on, a first expanded image matrix corresponding to the normalized image matrix to be convolved is obtained (as shown in fig. 5).
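The following is a minimal im2col sketch (not the patent's code) of step S204, assuming stride 1 and no padding; each output row holds one kernel-sized window read from left to right and from top to bottom, so a 5X5 image with a 3X3 kernel yields 9 rows of 9 elements:
#include <cstddef>
#include <vector>

// Expands an inH x inW image into the first expanded image matrix: one row per
// sliding-window position, each row holding the k x k window read left to right,
// top to bottom.
std::vector<std::vector<float>> im2col(const std::vector<float>& img,
                                       int inH, int inW, int k) {
    const int outH = inH - k + 1;
    const int outW = inW - k + 1;
    std::vector<std::vector<float>> rows;
    rows.reserve(static_cast<size_t>(outH) * outW);
    for (int y = 0; y != outH; ++y) {
        for (int x = 0; x != outW; ++x) {
            std::vector<float> row;
            row.reserve(static_cast<size_t>(k) * k);
            for (int j = 0; j != k; ++j) {
                for (int i = 0; i != k; ++i) {
                    row.push_back(img[(y + j) * inW + (x + i)]);
                }
            }
            rows.push_back(row);
        }
    }
    return rows;
}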
After the first expanded image matrix is obtained, step S205 is executed to allocate consecutive memory addresses to the first expanded image matrix as a first memory space and store the first expanded image matrix in the first memory space. The starting address of the consecutive memory addresses is an integer multiple of the byte size corresponding to the vector register (i.e., the size of the vector register divided by 8). When the elements of the first expanded image matrix are read and written later, the read and write speed is high because the memory addresses of the first memory space are consecutive, thereby improving the speed of the convolution calculation of the image matrix.
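A possible sketch of such an aligned, contiguous allocation is shown below (not the patent's code; it uses std::aligned_alloc, assuming C++17, with the 16-byte alignment corresponding to a 128-bit vector register, and the returned block must later be released with std::free):
#include <cstddef>
#include <cstdlib>

// Allocates a contiguous block of floats whose start address is a multiple of
// 16 bytes (128 bits / 8).
float* allocAligned(size_t elementCount) {
    const size_t alignment = 16;
    size_t bytes = elementCount * sizeof(float);
    bytes = (bytes + alignment - 1) / alignment * alignment;  // std::aligned_alloc requires a size that is a multiple of the alignment
    return static_cast<float*>(std::aligned_alloc(alignment, bytes));
}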
Then, step S206 is executed to convert the first expanded image matrix into a column matrix and obtain a second expanded image matrix. Specifically, the number of times that the vector register can simultaneously perform floating point data operations is determined and taken as a first numerical value. The elements of each column within a group of adjacent rows of the first expanded image matrix are taken as one group of data, and that group of data is taken as one row of elements of the new matrix, where the number of adjacent rows in a group equals the first numerical value; if the first numerical value is 4, the elements of each column within 4 adjacent rows of the first expanded image matrix form one row of elements of the new matrix. When the remaining number of rows of the first expanded image matrix is less than the first numerical value, the missing elements are filled with zero elements until the remaining number of rows equals the first numerical value (for example, filled up to 4 rows when the first numerical value is 4), which yields the second expanded image matrix.
Since the elements of one column across a group of first-numerical-value rows are taken as one row of the second expanded image matrix each time, and the first numerical value is the number of floating-point data operations that the vector register can execute simultaneously while each element of the first expanded image matrix has the floating point data size supported by the vector register, the size of each row of elements in the second expanded image matrix is exactly the size of the vector register, and the number of columns of the second expanded image matrix is the number of floating-point data operations that the vector register can execute simultaneously.
The process of obtaining the second expanded image matrix is described using the first expanded image matrix shown in fig. 5 as an example, on a Loongson 3A4000 CPU running the UOS V201030 operating system. As noted, the Loongson 3A4000 CPU includes 128-bit vector registers that support 32-bit floating point data operations (in this case, each element of the first expanded image matrix and of the preprocessed convolution kernel is 32 bits). Because the vector register is 128 bits and supports 32-bit floating point operations, the first numerical value is 128/32 = 4. That is, the elements in the first column of the first group of 4 adjacent rows of the first expanded image matrix shown in fig. 5 are taken as the first row of elements of the second expanded image matrix, i.e., the elements in the rectangular box 5-1 in fig. 5, giving the first row of elements (1, 1, 1, 0) of the second expanded image matrix shown in fig. 7; fig. 7 shows a schematic diagram of the second expanded image matrix according to an embodiment of the present invention. In other words, 128 bits of data are read at a time as one row of elements of the second expanded image matrix.
Then, sliding one element to the right, the elements in the second column of the first group of 4 adjacent rows are taken as the second row of elements of the second expanded image matrix, i.e., the elements in the rectangular box 5-2 in fig. 5, giving the second row of elements (1, 1, 1, 1) shown in fig. 7, and so on until every column of the first group of 4 rows has been taken as one row of the second expanded image matrix. The 5th to 8th rows in fig. 5 are then taken as the second group of 4 adjacent rows, and the elements in the first column of this second group, i.e., the elements in the rectangular box 5-3 in fig. 5, are taken as the 10th row of elements of the second expanded image matrix, giving the 10th row of elements (1, 0, 1, 0) shown in fig. 7, until every column of the second group of 4 rows has been taken as one row of the second expanded image matrix.
Since the matrix in fig. 5 then has only one row of elements left unread, which does not meet the requirement of 4 rows per group, in one embodiment 3 rows of zero elements are filled in to obtain the matrix shown in fig. 6; fig. 6 shows the matrix formed after the first expanded image matrix is filled with 3 rows of zero elements according to one embodiment of the present invention, where the filled 3 rows of zero elements belong to the third group of 4 rows, as shown by the box 6-1 in fig. 6. Similarly, the elements in the first column of the third group of 4 adjacent rows are taken as the 19th row of elements of the second expanded image matrix, giving the 19th row of elements (0, 0, 0, 0) shown in fig. 7, until every column of the third group of 4 rows has been taken as one row of elements of the second expanded image matrix, which yields the complete second expanded image matrix shown in fig. 7.
The following is an example of key code for converting the first expanded image matrix into a column matrix to obtain the second expanded image matrix:
// matWithPad, padSize, copyBuffer, colMat and kernelSize are declared by the surrounding routine
float *src = matWithPad.data();                     // padded source image, stored row-major
uint32_t padHeight = matWithPad.height();
uint32_t padWidth = matWithPad.width();
uint32_t counter = 0;
size_t vecPos = 0;
for (uint32_t y = padSize; y != padHeight - padSize; ++y) {
    for (uint32_t x = padSize; x != padWidth - padSize; ++x) {
        // expand the kernel-sized window centered at (x, y) into one row of copyBuffer (im2col)
        float *k = &(copyBuffer[counter++][0]);
        for (int32_t j = -padSize; j != padSize + 1; ++j) {
            for (int32_t i = -padSize; i != padSize + 1; ++i) {
                *k = src[(y + j) * padWidth + x + i];
                ++k;
            }
        }
        // every 4 expanded rows, interleave them column by column into the column
        // matrix colMat, so that each colMat row holds 4 floats (128 bits)
        if (counter == 4) {
            counter = 0;
            for (uint32_t p = 0; p != kernelSize; ++p) {
                for (uint32_t q = 0; q != 4; ++q) {
                    colMat[vecPos + p][q] = copyBuffer[q][p];
                }
            }
            vecPos += kernelSize;
        }
    }
}
Then, step S207 is executed to allocate consecutive memory addresses to the second expanded image matrix as a second memory space and store the second expanded image matrix in the second memory space; the starting address of the consecutive memory addresses is an integer multiple of the byte size corresponding to the vector register (i.e., the size of the vector register divided by 8). Step S208 is then executed to expand the convolution kernel preprocessed in step S203 into a matrix whose number of columns equals that of the second expanded image matrix and whose number of rows equals the size of the convolution kernel, obtaining a third expanded matrix. Specifically, each element of the converted convolution kernel is copied, according to a first preset rule, a number of times equal to the number of columns of the second expanded image matrix to obtain a group of identical data, where the first preset rule is an order from left to right and from top to bottom, and each group of identical data is taken, in copying order, as one row of elements of the matrix, which yields the third expanded matrix. Taking the convolution kernel shown in fig. 4 as an example, the resulting third expanded matrix is shown in fig. 8; fig. 8 is a schematic diagram of the third expanded matrix according to an embodiment of the present invention. Because each element of the convolution kernel has the floating point data size supported by the vector register, and the number of expanded columns equals the number of columns of the second expanded image matrix, which is the number of floating point data operations the vector register can execute simultaneously, the size of each row of elements in the third expanded matrix is also the size of the vector register.
After the third expanded matrix corresponding to the expanded convolution kernel is obtained, step S209 is executed to allocate consecutive memory addresses to the third expanded matrix as a third memory space and store the third expanded matrix in the third memory space; similarly, the starting address of the consecutive memory addresses is an integer multiple of the byte size corresponding to the vector register. An example of key code that expands the convolution kernel into the third expanded matrix and assigns consecutive memory addresses to it is as follows:
void *truePtr = malloc(sizeof(v4f32) * kernelSize + 16);   // over-allocate so the start can be aligned
// round the start address up to a 16-byte boundary (the byte size of the 128-bit vector register)
v4f32 *kernelVec = (v4f32*)(((size_t)truePtr + 15) & ~(size_t)15);
for (size_t i = 0; i != kernelSize; ++i) {
    // broadcast each kernel element into all four 32-bit lanes of one vector
    kernelVec[i][0] = kernelVec[i][1] = kernelVec[i][2] = kernelVec[i][3] = kernel[i];
}
Then, step S210 is executed to perform a convolution operation on the second expanded image matrix and the third expanded matrix to obtain the convolution result matrix of the image matrix, where the convolution result matrix is the feature matrix of the image matrix. In one embodiment, the size of the convolution result matrix is determined from the image matrix to be convolved and the convolution kernel, and the second expanded image matrix is divided into a plurality of group matrices according to the size of the third expanded matrix, where each group matrix has the same numbers of rows and columns as the third expanded matrix and the rows of the different group matrices do not overlap. A convolution operation is then performed between each group matrix and the third expanded matrix to obtain the convolution result of each element in each group matrix. If the total number of convolution results of the last row of elements of the group matrices is greater than the size of the convolution result matrix, convolution results equal in number to the size of the convolution kernel are selected from the front of those last-row results and written into the convolution result matrix of the image matrix according to the first preset rule, which is as described above and is not repeated here. If the total number of convolution results of the last row of elements of the group matrices equals the size of the convolution result matrix, the convolution results of the last row of elements of each group matrix are written into the convolution result matrix of the image matrix according to the first preset rule.
Taking the normalized image matrix to be convolved (5X5) shown in fig. 3 and the preprocessed convolution kernel (3X3) shown in fig. 4 as an example, the size of the convolution result matrix is calculated as 5 - round(3/2) = 3, where round means rounding to the nearest integer, so the size of the final convolution result matrix is determined to be 3X3.
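For reference, a minimal sketch of this size computation using the standard formula for a stride-1 convolution without padding (rows_out = rows_in - k + 1), which gives the same 3X3 result for this example, is as follows:
#include <cstdio>

int main() {
    const int n = 5;                 // rows (and columns) of the image matrix
    const int k = 3;                 // rows (and columns) of the convolution kernel
    const int out = n - k + 1;       // = 3, so the convolution result matrix is 3X3
    std::printf("convolution result matrix: %dx%d\n", out, out);
    return 0;
}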
In one embodiment, the step of performing a convolution operation between each group matrix and the third expanded matrix to obtain the convolution result of each element in each group matrix includes: reading one row of elements from any one of the group matrices as a first group of elements; reading the row of elements with the same row number from the third expanded matrix as a second group of elements; multiplying each element in the first group of elements by the element at the corresponding position in the second group of elements; and summing each product with the convolution result of the element in the previous row having the same column number, to obtain the convolution result of each element. This process is repeated for every row of every group matrix, finally yielding the convolution results of all elements in the second expanded image matrix.
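The following scalar sketch (not the patent's vectorized code; types and names are assumptions) illustrates this accumulation rule for one group matrix: every row is multiplied element-wise with the row of the third expanded matrix having the same index, the products are added to the sums carried over from the previous row, and the last row of sums is that group's set of convolution results:
#include <cstddef>
#include <vector>

// One group matrix and the third expanded matrix have the same shape (here 9 x 4).
std::vector<float> convolveGroup(const std::vector<std::vector<float>>& group,
                                 const std::vector<std::vector<float>>& thirdMat) {
    const size_t lanes = group.empty() ? 0 : group[0].size();  // e.g. 4 for a 128-bit register
    std::vector<float> acc(lanes, 0.0f);                       // results carried over from the previous row
    for (size_t r = 0; r != group.size(); ++r) {
        for (size_t c = 0; c != lanes; ++c) {
            acc[c] += group[r][c] * thirdMat[r][c];            // product plus previous-row result
        }
    }
    return acc;                                                // last-row sums are the group's results
}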
In the above process, since the data storage addresses of the second expanded image matrix and the third expanded matrix are continuous, the speed of reading and writing data from and into the second expanded image matrix and the third expanded matrix is fast, thereby further improving the speed of convolution calculation of the image matrix.
The procedure of the convolution calculation of the image matrix will be described in detail by taking the second expansion image matrix shown in fig. 7 and the third expansion matrix shown in fig. 8 as examples.
Read the first 9 rows of elements from the second expanded image matrix of fig. 7 as the first group matrix, i.e., the elements shown by box 7-1 in fig. 7. Read the first row of elements (1, 1, 1, 0) from the first group matrix as the first group of elements, and read the first row of elements (0, 0, 0, 0) from the third expanded matrix as the second group of elements. Multiply each element in the first group by the element at the corresponding position in the second group (i.e., 1 × 0, 1 × 0, 1 × 0, 0 × 0) to obtain 0, 0, 0, 0. Since this is the first row, the convolution result of the previous-row element with the same column number is taken as 0, and each product is summed with it to give 0 + 0 = 0, yielding the convolution results 0, 0, 0, 0 for the first row of elements of the first group matrix.
Then read the second row of elements (1, 1, 1, 1) from the first group matrix and the second row of elements (1, 1, 1, 1) from the third expanded matrix, and multiply each element of the former by the element at the corresponding position of the latter (i.e., 1 × 1) to obtain 1, 1, 1, 1. Each product is summed with the convolution result of the previous-row element with the same column number (each of which is 0), giving 1 + 0 = 1, so the convolution results of the second row of elements of the first group matrix are 1, 1, 1, 1. By analogy, the convolution results of the first group matrix of the second expanded image matrix are (0, 0, 0, 0), (1, 1, 1, 1), (1, 2, 1, 2), (2, 2, 2, 2), (2, 3, 3, 2), (2, 3, 4, 2), respectively.
Then read the 10th to 18th rows of elements from the second expanded image matrix of fig. 7 as the second group matrix, i.e., the elements shown by box 7-2 in fig. 7. Read the first row of elements (1, 0, 1, 0) from the second group matrix and the first row of elements (0, 0, 0, 0) from the third expanded matrix, and multiply each element of the former by the element at the corresponding position of the latter (i.e., 1 × 0, 0 × 0) to obtain 0, 0, 0, 0. Since this is again the first row, the convolution result of the previous-row element with the same column number is 0, and each product summed with it gives 0 + 0 = 0, yielding the convolution results 0, 0, 0, 0 for the first row of elements of the second group matrix. By analogy, the convolution results of the second group matrix of the second expanded image matrix are (0, 0, 0, 0), (0, 1, 1, 0), (0, 2, 1, 1), (1, 2, 2, 1), (2, 2, 2, 2), respectively.
Then read the 19th to 27th rows of elements from the second expanded image matrix of fig. 7 as the third group matrix, as shown by box 7-3 in fig. 7. Read the first row of elements (0, 0, 0, 0) from the third group matrix and the first row of elements (0, 0, 0, 0) from the third expanded matrix, and multiply each element of the former by the element at the corresponding position of the latter (i.e., 0 × 0) to obtain 0, 0, 0, 0. As before, the convolution result of the previous-row element is 0, and each product summed with it gives 0 + 0 = 0, yielding the convolution results 0, 0, 0, 0 for the first row of elements of the third group matrix. By analogy, the convolution results of the third group matrix of the second expanded image matrix are (0, 0, 0, 0), (1, 0, 0, 0), (2, 0, 0, 0), (3, 0, 0, 0), (4, 0, 0, 0), respectively.
Then the convolution results of the last row of the first group matrix (2, 3, 4, 2), of the last row of the second group matrix (2, 2, 2, 2), and of the last row of the third group matrix (4, 0, 0, 0) are extracted, giving 12 extracted convolution results. Since the size of the convolution result matrix for the image matrix to be convolved (5X5) shown in fig. 3 and the convolution kernel (3X3) shown in fig. 4 has already been calculated to be 3X3, i.e., the convolution result matrix contains 9 elements, and the 12 extracted convolution results exceed the size 9 of the convolution result matrix, the convolution results of the last row of the first group matrix are written into the convolution result matrix from left to right and from top to bottom. Specifically, the first convolution result (2) in the last-row results of the first group matrix is taken as the element in the first row and first column of the convolution result matrix, the second convolution result (3) as the element in the first row and second column, and so on, with the fourth convolution result (2) of the first group matrix taken as the element in the second row and first column of the convolution result matrix.
Then, starting from the second row and second column of the convolution result matrix and likewise in order from left to right and from top to bottom, the first convolution result (2) in the last-row results of the second group matrix is taken as the element in the second row and second column, the second convolution result (2) as the element in the second row and third column, and so on, with the fourth convolution result (2) of the second group matrix taken as the element in the third row and second column of the convolution result matrix. Finally, starting from the third row and third column, the first convolution result (4) in the last-row results of the third group matrix is taken as the element in the third row and third column of the convolution result matrix, which yields the convolution result matrix of the image matrix shown in fig. 9; fig. 9 is a schematic diagram of the convolution result matrix of the image matrix according to an embodiment of the present invention.
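The following sketch (not the patent's code; names are illustrative) shows this write-back rule: the last-row results of the groups are written into the convolution result matrix in left-to-right, top-to-bottom order, and any surplus results are discarded once the matrix is full, so the 12 candidate values above fill exactly 9 positions:
#include <cstddef>
#include <vector>

// lastRowPerGroup holds the last-row convolution results of each group matrix,
// in left-to-right, top-to-bottom order; resultElements is the number of
// elements of the convolution result matrix (here 3 * 3 = 9).
std::vector<float> gatherResults(const std::vector<std::vector<float>>& lastRowPerGroup,
                                 size_t resultElements) {
    std::vector<float> result;
    result.reserve(resultElements);
    for (const std::vector<float>& groupResults : lastRowPerGroup) {
        for (float v : groupResults) {
            if (result.size() == resultElements) {
                return result;                  // surplus results are discarded
            }
            result.push_back(v);                // filled row by row, left to right
        }
    }
    return result;
}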
The following is an example of key code for performing a convolution operation on the second expanded image matrix and the third expanded matrix to obtain the convolution result:
// colMatSize, colMat, kernelVec, kernelSize, convResult and mat come from the surrounding routine
v4f32 *resultTemp = (v4f32*)malloc(colMatSize);       // one 4-float vector per group of window positions
v4f32 *pResultTemp = resultTemp;
for (size_t i = 0; i != colMatSize / sizeof(v4f32); i += kernelSize) {
    v4f32 colMulAddTemp = {0, 0, 0, 0};
    for (size_t j = 0; j != kernelSize; ++j) {
        // fmadd.w computes wd + ws * wt, so the accumulator is passed first
        colMulAddTemp = __msa_fmadd_w(colMulAddTemp, kernelVec[j], colMat[i + j]);
    }
    *pResultTemp = colMulAddTemp;                     // the four convolution results for this group of window positions
    pResultTemp++;
}
float *pos = &(resultTemp[0][0]);
float *newPos = convResult;
for (size_t j = 0; j != mat.size(); ++j) {            // copy the flattened lane results into the output buffer
    *newPos = pos[j];
    newPos++;
}
because the size of each row element in the second expanded image matrix is the size of the vector register, that is, data with the size of the vector register is respectively read from the second expanded image matrix and the third expanded matrix once, which is equivalent to the data size during each convolution operation being the size of the vector register, that is, the vector register of the CPU in the computing device is fully utilized, and multiple floating point data operations can be executed at the same time, therefore, the characteristics of the CPU can be fully utilized, the computing efficiency is maximized, and no additional equipment is needed to be added in the above process, so that the implementation is simple. In addition, in the convolution operation process, because the data storage addresses of the second expanded image matrix and the third expanded matrix are continuous, the data reading and writing speed from the second expanded image matrix and the third expanded matrix is high, and the convolution operation speed of the image matrix is further improved.
Finally, a contiguous memory space is allocated for the convolution result matrix as a fourth memory space. The convolution result matrix is read from the fourth memory space, inverse normalization is performed on it to obtain the convolution result, and the convolution result is written back into the fourth memory space. Because the data storage addresses of the convolution result matrix are contiguous, reading the convolution result matrix during inverse normalization is fast, which further increases the speed of the convolution calculation of the image matrix.
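As an illustration only, assuming the forward normalization simply converted the 8-bit pixel values to the floating point width supported by the vector register (the patent does not fix a particular normalization formula), the inverse normalization of the convolution results could look like the following sketch, in which the buffer name resultPixels is hypothetical:
/* Hypothetical inverse normalization: convert the float convolution results back
   to 8-bit pixel values, clamping to the valid range [0, 255]. */
unsigned char resultPixels[9];
for (size_t k = 0; k != convResultSize; ++k) {
    float v = convResult[k];
    if (v < 0.0f)   v = 0.0f;
    if (v > 255.0f) v = 255.0f;
    resultPixels[k] = (unsigned char)(v + 0.5f);
}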
As can be seen from the above, in the method for implementing convolution of an image matrix according to the present invention, data equal in size to the vector register is read from the second expanded image matrix and from the third expanded matrix in a single access, so the vector register of the CPU in the computing device is fully utilized and multiple floating point operations are executed at the same time, i.e., the SIMD (single instruction, multiple data) capability of the CPU is exploited.
The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as removable hard disks, USB flash drives, floppy disks, CD-ROMs, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code, and the processor is configured to perform the method of the present invention for implementing image matrix convolution according to the instructions in the program code stored in the memory.
By way of example, and not limitation, readable media may comprise readable storage media and communication media. Readable storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (10)

1. A method for implementing image matrix convolution, executed in a computing device, the method comprising:
acquiring an image matrix to be convolved and a convolution kernel, wherein each element of the image matrix to be convolved is a pixel value of an image;
expanding the image matrix to be convolved into a row matrix to obtain a first expanded image matrix;
converting the first expanded image matrix into a column matrix to obtain a second expanded image matrix, wherein the size of each row of data in the second expanded image matrix is the size of a vector register of a CPU (Central Processing Unit) in the computing device;
expanding the convolution kernel into a matrix with the number of columns of the second expanded image matrix as the number of columns and the size of the convolution kernel as the number of rows to obtain a third expanded matrix, wherein the size of each row of data in the third expanded matrix is the size of the vector register;
and performing convolution operation on the second expanded image matrix and the third expanded matrix to obtain a convolution result matrix of the image matrix, wherein the convolution result matrix is a characteristic matrix of the image matrix.
2. The method of claim 1, wherein the step of expanding the image matrix to be convolved into a row matrix to obtain a first expanded image matrix comprises:
and according to the size of the convolution kernel, expanding the image matrix to be convolved into a row matrix through an im2col algorithm to serve as the first expanded image matrix.
3. The method of claim 1, further comprising the step of:
and normalizing the image matrix to be convolved to enable the size of each element in the obtained image matrix to be convolved to be the size of floating point data of the vector register supporting floating point data operation.
4. The method of claim 1, wherein converting the first unfolded image matrix into a column matrix to obtain a second unfolded image matrix comprises:
determining the times of floating point data operation which can be simultaneously executed by the vector register, and taking the times of executing the floating point data operation as a first numerical value;
taking the elements of each column in the adjacent rows of the first expanded image matrix as a group of data, and taking the group of data as a row of elements of the matrix, wherein the number of rows in the adjacent rows of the first expanded image matrix is equal to the first numerical value;
when the remaining number of rows of the first expanded image matrix is less than the first numerical value, the missing elements in the first expanded image matrix are filled up through zero elements until the remaining number of rows is equal to the first numerical value, and the second expanded image matrix is obtained.
5. The method of claim 1, further comprising the step of:
forcibly converting the size of each element in the convolution kernel into the size of floating point data of which the vector register supports floating point data operation;
wherein the step of unfolding the convolution kernel into a matrix with the number of columns of the second unfolded image matrix as the number of columns and the size of the convolution kernel as the number of rows to obtain a third unfolded matrix comprises:
copying each element in the converted convolution kernel into a plurality of columns of the second unfolded image matrix according to a first preset rule to obtain a group of same data, wherein the first preset rule is a sequential rule from left to right and from top to bottom;
and sequentially using the obtained group of same data as a row of elements of the matrix according to the copying sequence to obtain the third expansion matrix.
6. The method of claim 5, wherein the step of convolving the second unwrapped image matrix with the third unwrapped matrix to obtain a convolution result matrix for the image matrix comprises:
determining the size of the convolution result matrix according to the image matrix to be convolved and the convolution kernel;
dividing the second expanded image matrix into a plurality of groups of matrixes according to the size of the third expanded matrix, wherein the number of rows and the number of columns of each group of matrixes are the same as those of the third expanded matrix, and elements of each row in each group of matrixes are not overlapped;
performing convolution operation on each group of matrixes and the third expansion matrix to obtain a convolution result of each element in each group of matrixes;
and writing the convolution result of the last row of elements of each group of matrixes into the convolution result matrix of the image matrix according to the first preset rule.
7. The method of claim 6, wherein writing the convolution result of the last row of elements of each set of matrices into the convolution result matrix of the image matrix according to the first predetermined rule comprises:
and if the total number of the convolution results of the last row of elements of each group of matrices is greater than the size of the convolution result matrix, selecting the first convolution results, equal in number to the size of the convolution kernel, from the convolution results of the last row of elements of each group of matrices, and writing them into the convolution result matrix of the image matrix according to the first preset rule.
8. The method of claim 6, wherein convolving each set of matrices with the third unfolded matrix to obtain the convolution result for each element in each set of matrices comprises:
reading a row of elements from any one of the plurality of sets of matrices as a first set of elements;
reading a row of elements from the third expanded matrix equal to the number of rows of the first set of elements as a second set of elements;
and multiplying each element in the first group of elements by an element at a corresponding position in the second group of elements, and summing the product result and the convolution results of the elements in the same column number and the previous row of the elements to obtain the convolution result of each element.
9. A computing device, comprising:
at least one processor; and
a memory storing program instructions, wherein the program instructions are configured to be adapted to be executed by the at least one processor, the program instructions comprising instructions for performing the method of any of claims 1-8.
10. A readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the method of any of claims 1-8.
CN202110878366.9A 2021-08-02 2021-08-02 Method for realizing image matrix convolution, computing equipment and storage medium Active CN113344768B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110878366.9A CN113344768B (en) 2021-08-02 2021-08-02 Method for realizing image matrix convolution, computing equipment and storage medium
CN202111094262.5A CN113724127B (en) 2021-08-02 2021-08-02 Method for realizing image matrix convolution, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110878366.9A CN113344768B (en) 2021-08-02 2021-08-02 Method for realizing image matrix convolution, computing equipment and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202111094262.5A Division CN113724127B (en) 2021-08-02 2021-08-02 Method for realizing image matrix convolution, computing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113344768A true CN113344768A (en) 2021-09-03
CN113344768B CN113344768B (en) 2021-10-15

Family

ID=77480509

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110878366.9A Active CN113344768B (en) 2021-08-02 2021-08-02 Method for realizing image matrix convolution, computing equipment and storage medium
CN202111094262.5A Active CN113724127B (en) 2021-08-02 2021-08-02 Method for realizing image matrix convolution, computing equipment and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202111094262.5A Active CN113724127B (en) 2021-08-02 2021-08-02 Method for realizing image matrix convolution, computing equipment and storage medium

Country Status (1)

Country Link
CN (2) CN113344768B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388541B (en) * 2016-04-22 2020-12-11 安徽寒武纪信息科技有限公司 Convolution operation device and method
EP3447653A4 (en) * 2016-04-22 2019-11-13 Cambricon Technologies Corporation Limited Submatrix operation device and method
CN112241509B (en) * 2020-09-29 2024-03-12 格兰菲智能科技有限公司 Graphics processor and acceleration method thereof

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
USH2184H1 (en) * 2004-05-07 2007-03-06 Tektronix, Inc. Wide dynamic range vector data raster display
CN106970896A (en) * 2017-03-30 2017-07-21 中国人民解放军国防科学技术大学 The vectorization implementation method of the two-dimensional matrix convolution of vector processor-oriented
CN109997154A (en) * 2017-10-30 2019-07-09 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN108171327A (en) * 2017-12-25 2018-06-15 郑州云海信息技术有限公司 A kind of matrix method for transformation, device and medium based on convolution algorithm
CN108205702A (en) * 2017-12-29 2018-06-26 中国人民解放军国防科技大学 Parallel processing method for multi-input multi-output matrix convolution
CN109086244A (en) * 2018-07-11 2018-12-25 中国人民解放军国防科技大学 Matrix convolution vectorization implementation method based on vector processor
WO2020077232A1 (en) * 2018-10-12 2020-04-16 Cambridge Cancer Genomics Limited Methods and systems for nucleic acid variant detection and analysis
CN112748956A (en) * 2019-10-29 2021-05-04 脸谱公司 High throughput matrix processor supporting simultaneous processing of multiple matrices
CN111639699A (en) * 2020-05-28 2020-09-08 山东云海国创云计算装备产业创新中心有限公司 Method, system and equipment for extracting image features and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HOSSEIN AMIRI et al.: "High performance implementation of 2D convolution using Intel's advanced vector extensions", 2017 Artificial Intelligence and Signal Processing Conference (AISP) *
CAO KUN: "Design and Implementation of a ResNet Convolutional Network on a Multi-core Vector Processor", China Master's Theses Full-text Database (Master), Information Science and Technology *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952388A (en) * 2023-03-13 2023-04-11 南京砺算科技有限公司 Convolution operation method, device, processor and medium based on image data processing

Also Published As

Publication number Publication date
CN113724127B (en) 2023-05-05
CN113724127A (en) 2021-11-30
CN113344768B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
US10942986B2 (en) Hardware implementation of convolutional layer of deep neural network
US20210216871A1 (en) Fast Convolution over Sparse and Quantization Neural Network
US7844630B2 (en) Method and structure for fast in-place transformation of standard full and packed matrix data formats
CN111079081B (en) Matrix multiplier, data processing method, integrated circuit device and processor
CN109597647B (en) Data processing method and device
US8554013B2 (en) Selectively transforming a multi-dimensional array
CN113344768B (en) Method for realizing image matrix convolution, computing equipment and storage medium
US20210350230A1 (en) Data dividing method and processor for convolution operation
Fan et al. DT-CGRA: Dual-track coarse-grained reconfigurable architecture for stream applications
CN113885936A (en) Solution method for software package dependence in customized mirror image
CN111210004B (en) Convolution calculation method, convolution calculation device and terminal equipment
US7657587B2 (en) Multi-dimensional fast fourier transform
WO2019141160A1 (en) Data processing method and apparatus
US7895420B2 (en) System and method for eliminating common subexpressions in a linear system
Geng et al. MacSim: a MAC-enabled high-performance low-power SIMD architecture
CN111931937B (en) Gradient updating method, device and system of image processing model
CN111861920B (en) Median filtering method and system
CN110766150A (en) Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
US20050071408A1 (en) Method and structure for producing high performance linear algebra routines using composite blocking based on L1 cache size
US8423597B1 (en) Method and system for adaptive matrix trimming in an inverse discrete cosine transform (IDCT) operation
US20210248475A1 (en) Electronic apparatus for performing deconvolution calculation and controlling method thereof
CN111754393B (en) Image processing method, system, electronic device, and medium
Sherry et al. IMPAIR: massively parallel deconvolution on the GPU
US8458441B2 (en) Vector extensions to an interpreted general expression evaluator in a database system
CN113282621A (en) Processing method of cache data and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant