WO2023103551A1

WO2023103551A1 - Image data processing method and apparatus, device, and storage medium

Info

Publication number: WO2023103551A1
Application number: PCT/CN2022/122544
Authority: WO
Inventors: 胡宇; 姬彬斐; 刘嘉超; 刘兰个川
Original assignee: 广州小鹏自动驾驶科技有限公司
Priority date: 2021-12-06
Filing date: 2022-09-29
Publication date: 2023-06-15
Also published as: CN114283314A

Abstract

The present application relates to an image data processing method and apparatus, a device, and a storage medium. The method comprises: acquiring a first image matrix A_{m×p} and a second image matrix B_{p×n}; determining a first feature map corresponding to the first image matrix A_{m×p}; determining X convolution kernels corresponding to the second image matrix B_{p×n}, wherein the size of each convolution kernel is f × f × p, f is an integer greater than 1, X is an integer greater than or equal to 1, and each convolution kernel comprises one or more columns of elements in the second image matrix B_{p×n}; and performing convolution on the first feature map and the X convolution kernels to obtain a second feature map, wherein the second feature map corresponds to a third image matrix, and the third image matrix is a result obtained by performing matrix multiplication of the first image matrix A_{m×p} and the second image matrix B_{p×n}. The solution provided by the present application can improve the flexibility of a Transformer module.

Description

Image data processing method, device, equipment and storage medium

This application claims priority to Chinese patent application No. 2021114776272, titled "Image Data Processing Method and Apparatus", filed with the State Intellectual Property Office on December 6, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of image processing, and in particular to an image data processing method, device, equipment and storage medium.

Background art

Current mainstream machine translation is based mainly on neural networks. Such methods use an encoder-decoder architecture: the encoder encodes the source-language sequence and extracts information, and the decoder then converts that information into the target language, completing the translation. The deep self-attention (Transformer) model, designed on the encoder-decoder architecture, has become the mainstream model in machine translation because of its superior performance, and has had a major impact on the field of deep learning.

In a neural network whose main module is a Transformer, matrix multiplication is performed on pairs of two-dimensional data tensors. In some solutions, the Transformer model is therefore deployed on a chip that supports matrix multiplication.

However, some chips on the market support only convolution, so Transformer modules that require matrix multiplication cannot be deployed on them. This restricts where Transformer modules can be used and reduces their flexibility.

Summary of the invention

In order to solve or partially solve the problems existing in the related technologies, the present application provides an image data processing method, device, equipment and storage medium, which can improve the flexibility of the Transformer module.

A first aspect of the present application provides an image data processing method, including: acquiring a first image matrix A_{m×p} and a second image matrix B_{p×n}; determining a first feature map corresponding to the first image matrix A_{m×p}; determining X convolution kernels corresponding to the second image matrix B_{p×n}; and convolving the first feature map with the X convolution kernels to obtain a second feature map, where the second feature map corresponds to a third image matrix, and the third image matrix is the result of matrix multiplication of the first image matrix A_{m×p} and the second image matrix B_{p×n}.

In one embodiment, the size of each convolution kernel is f×f×p, where f is an integer greater than 1 and X is an integer greater than or equal to 1.

In one embodiment, each convolution kernel includes one or more columns of elements of the second image matrix B_{p×n}.

In one embodiment, determining the X convolution kernels corresponding to the second image matrix B_{p×n} includes: determining the values of n groups of target elements in the X convolution kernels, where the values of the i-th group of target elements correspond to the values of the i-th column of elements in the second image matrix, i is an integer from 1 to n, and each convolution kernel contains one or more of the n groups of target elements; and setting the values of all elements in the X convolution kernels other than the n groups of target elements to 0.

In one embodiment, each convolution kernel contains one group of target elements Z_1, Z_2, ..., Z_p from the n groups of target elements, and for each convolution kernel, the target elements Z_1, Z_2, ..., Z_p are respectively located at row j, column k of channels 1, 2, ..., p of the convolution kernel;

Determining the first feature map corresponding to the first image matrix A_{m×p} includes: determining m data modules corresponding to the m row vectors of the first image matrix A_{m×p}, where each data module has a length of 1, a width of 1 and a depth of p; splicing the m data modules according to a preset splicing rule to obtain a third feature map with a length of a, a width of b and a depth of p, where a×b=m; and inserting zero data modules at the upper edge, lower edge, left edge and right edge of the third feature map to obtain the first feature map, where each zero data module has a length of 1, a width of 1 and a depth of p, and the values of all elements in the zero data module are zero.

In one embodiment, each convolution kernel includes u groups of target elements from the n groups of target elements, where u is an integer greater than 1 and less than or equal to f, and for each convolution kernel, the p target elements of the v-th group contained in the kernel are respectively located at row j_v, column k_v of channels 1, 2, ..., p of the convolution kernel, where v is an integer from 1 to u;

Determining the first feature map corresponding to the first image matrix A_{m×p} includes: determining m data modules corresponding to the m row vectors of the first image matrix A_{m×p}, where each data module has a length of 1, a width of 1 and a depth of p; splicing the m data modules according to a preset splicing rule to obtain a third feature map with a length of a, a width of b and a depth of p, where a×b=m; and inserting zero data modules at target positions of the third feature map to obtain the first feature map, where the target positions include internal positions and edge positions and are related to the positions of the multiple groups of target elements in the convolution kernel, each zero data module has a length of 1, a width of 1 and a depth of n, and the values of all elements in the zero data module are zero.

A second aspect of the present application provides an image data processing device, including: an acquisition module, used to acquire a first image matrix A_{m×p} and a second image matrix B_{p×n}; a first determination module, used to determine a first feature map corresponding to the first image matrix A_{m×p}; a second determination module, used to determine X convolution kernels corresponding to the second image matrix B_{p×n}; and a convolution module, used to convolve the first feature map with the X convolution kernels to obtain a second feature map, where the second feature map corresponds to a third image matrix, and the third image matrix is the result of matrix multiplication of the first image matrix A_{m×p} and the second image matrix B_{p×n}.

In one embodiment, among the X convolution kernels determined by the second determination module, the size of each convolution kernel is f×f×p, where f is an integer greater than 1 and X is an integer greater than or equal to 1, and each convolution kernel contains one or more columns of elements of the second image matrix B_{p×n}.

In one embodiment, the second determination module includes: a first determination unit, configured to determine the values of n groups of target elements in the X convolution kernels, where the values of the i-th group of target elements correspond to the values of the i-th column of elements in the second image matrix, i is an integer from 1 to n, and each convolution kernel contains one or more of the n groups of target elements; and a second determination unit, configured to set the values of all elements in the X convolution kernels other than the n groups of target elements to 0.

In one embodiment, the first determination module includes: a third determination unit, configured to determine m data modules corresponding to the m row vectors of the first image matrix A_{m×p}, where each data module has a length of 1, a width of 1 and a depth of p; a first splicing unit, configured to splice the m data modules according to a preset splicing rule to obtain a third feature map with a length of a, a width of b and a depth of p, where a×b=m; and a first zero padding unit, used to insert zero data modules at the upper edge, lower edge, left edge and right edge of the third feature map to obtain the first feature map, where each zero data module has a length of 1, a width of 1 and a depth of p, and the values of all elements in the zero data module are zero.

In one embodiment, the first determination module includes: a fourth determination unit, configured to determine m data modules corresponding to the m row vectors of the first image matrix A_{m×p}, where each data module has a length of 1, a width of 1 and a depth of p; a second splicing unit, configured to splice the m data modules according to a preset splicing rule to obtain a third feature map with a length of a, a width of b and a depth of p, where a×b=m; and a second zero padding unit, used to insert zero data modules at target positions of the third feature map to obtain the first feature map, where the target positions include internal positions and edge positions and are related to the positions of the multiple groups of target elements in the convolution kernel, each zero data module has a length of 1, a width of 1 and a depth of n, and the values of all elements in the zero data module are zero.

A third aspect of the present application provides an electronic device, including: a processor; and a memory on which executable code is stored, where the executable code, when executed by the processor, causes the processor to execute the above-mentioned method.

A fourth aspect of the present application provides a computer-readable storage medium, on which executable code is stored, and when the executable code is executed by a processor of an electronic device, the processor is caused to execute the above-mentioned method.

The technical solution of this application acquires the first image matrix A_{m×p} and the second image matrix B_{p×n}, determines the first feature map corresponding to A_{m×p} and the X convolution kernels of size f×f×p corresponding to B_{p×n}, and then convolves the first feature map with these X convolution kernels to obtain the feature map corresponding to the third image matrix, where f is an integer greater than 1 and the third image matrix is the product of the first image matrix and the second image matrix. That is, in the embodiments of the present application a convolution operation replaces the matrix multiplication operation to obtain the corresponding feature map. This solves the problem that the Transformer module cannot be deployed on a chip that supports only convolution, and improves the flexibility of the Transformer module.
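The replacement of matrix multiplication by convolution can be made concrete with a short derivation. This is a sketch under stated assumptions rather than text from the application: the convolution is taken as stride-1 cross-correlation, each kernel holds a single column of B at row j, column k of every channel, and the symbols F (first feature map), K^{(c)} (the c-th kernel) and Y (second feature map) are introduced here only for illustration.

```latex
% K^{(c)}_{d,e,t} = b_{t,c} if (d,e) = (j,k), and 0 otherwise.
\begin{aligned}
Y_{r,s,c}
  &= \sum_{d=1}^{f}\sum_{e=1}^{f}\sum_{t=1}^{p}
     F_{r+d-1,\;s+e-1,\;t}\; K^{(c)}_{d,e,t} \\
  &= \sum_{t=1}^{p} F_{r+j-1,\;s+k-1,\;t}\; b_{t,c}.
\end{aligned}
% If the zero padding places row i of A at position (r+j-1, s+k-1) of F,
% i.e. F_{r+j-1, s+k-1, t} = a_{i,t}, then
Y_{r,s,c} = \sum_{t=1}^{p} a_{i,t}\, b_{t,c} = (AB)_{i,c}.
```

Under these assumptions every spatial position of the second feature map holds one row of the product, and the X = n output channels hold its n columns.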

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Brief description of the drawings

The above and other objects, features and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments with reference to the accompanying drawings, in which the same reference numerals generally represent the same parts.

FIG. 1 is a schematic flow diagram of an image data processing method shown in an embodiment of the present application;

FIG. 2 is a schematic diagram of a third feature map in the image data processing method shown in the embodiment of the present application;

Fig. 3 is a schematic diagram of the first feature map in the image data processing method shown in the embodiment of the present application;

Fig. 4 is another schematic diagram of the first feature map in the image data processing method shown in the embodiment of the present application;

Fig. 5 is another schematic diagram of the first feature map in the image data processing method shown in the embodiment of the present application;

FIG. 6 is a schematic diagram of X convolution kernels in the image data processing method shown in the embodiment of the present application;

FIG. 7 is another schematic diagram of X convolution kernels in the image data processing method shown in the embodiment of the present application;

FIG. 8 is a schematic structural diagram of an image data processing device shown in an embodiment of the present application;

FIG. 9 is another schematic structural diagram of an image data processing device shown in an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an electronic device shown in an embodiment of the present application.

Detailed description

Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. Although embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that this application will be thorough and complete, and will fully convey the scope of this application to those skilled in the art.

The terminology used in this application is for the purpose of describing particular embodiments only, and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. It should be understood that although the terms "first", "second", "third" and so on may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the present application, "plurality" means two or more, unless otherwise specifically defined.

In the related art, the use of the Transformer module is limited and the flexibility is poor. In view of the above problems, the embodiment of the present application provides an image data processing method, which can improve the flexibility of the Transformer module.

For ease of understanding, some terms involved in the embodiments of the present application are introduced below.

Image matrix: Digital image data can be represented by a matrix, so matrix theory and matrix algorithms can be used to analyze and process digital images. Since a digital image can be expressed in the form of a matrix, in a computer digital image processing program, a two-dimensional array is usually used to store image data.

Convolution kernel: In image processing, given an input image, each pixel of the output image is a weighted average of the pixels in a small region of the input image. The weights are defined by a function, and this function is called the convolution kernel.

Feature map: In the convolutional layer of a neural network, the data exists in three-dimensional form, which can be regarded as many two-dimensional pictures stacked together, each of which is called a feature map. In the input layer, a grayscale image has only one feature map, while a color image generally has 3 feature maps (red, green, and blue). There are several convolution kernels between layers, and convolving the feature maps of the previous layer with each convolution kernel generates one feature map of the next layer.

Data module: In the image field, it is usually a three-dimensional array, which is used to represent the pixel value of the image, the length represents the height of the image, the width represents the width of the image, and the depth represents the number of color channels of the image.

It should be noted that the image data processing method in this embodiment can be used for image processing of a neural network including a Transformer module, and can also be used for other neural networks that require matrix multiplication, which is not specifically limited in this embodiment.

It should be noted that the image data processing apparatus in this embodiment may include a Transformer module, or other modules that need to perform matrix multiplication, which is not specifically limited in this embodiment.

The method of the present application can be applied to a processor, and the processor can be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor for performing artificial intelligence computation. Artificial intelligence computation may include machine learning operations, brain-like operations, and the like. Machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like. The artificial intelligence processor includes, for example, one or a combination of an artificial intelligence chip processor, a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processor), and a Field-Programmable Gate Array (FPGA) chip.

The artificial intelligence processor may be a processor applied in an artificial intelligence chip. The artificial intelligence chip may be, for example, a neural network chip or another chip, and the neural network chip may be, for example, a convolutional neural network inference chip, an ASIC chip, or the like. The present application does not limit the specific type of the processor.

In a possible implementation, the processor mentioned in this application may include a plurality of processing units, and each processing unit may independently run the tasks assigned to it, such as convolution tasks, pooling tasks or fully connected tasks. The application does not limit the processing units or the tasks they run. Multiple processing units in the processor can share part of the storage space, for example part of the RAM storage space and the register file, while also having their own storage space.

The technical solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic flowchart of an image data processing method shown in an embodiment of the present application. The method can be applied to a processor, and the processor can include a general-purpose processor, an artificial intelligence processor, etc., wherein the artificial intelligence processor can be an artificial intelligence chip processor or a GPU, etc.

Referring to FIG. 1, the image data processing method in this embodiment includes:

101. An image data processing apparatus acquires a first image matrix and a second image matrix.

The image data processing device acquires the first image matrix A_{m×p} and the second image matrix B_{p×n} that need to be multiplied. The first image matrix A_{m×p} or the second image matrix B_{p×n} may be an original image matrix, or an image matrix obtained after a matrix operation, which is not limited in this embodiment.

102. The image data processing apparatus determines a first feature map corresponding to the first image matrix.

After the image data processing device acquires the first image matrix A_{m×p}, it generates a first feature map corresponding to the first image matrix A_{m×p} according to a first preset rule.

For example, the image data processing device may generate the first feature map corresponding to the first image matrix A_{m×p} in the following manner:

S1. Determine m data modules corresponding to the m row vectors of the first image matrix A_{m×p}.

For each row vector of the first image matrix A_{m×p}, the image data processing device generates a data module corresponding to the row vector. The data module has a length of 1, a width of 1 and a depth of p, and its p elements correspond one-to-one to the p elements of the row vector. For example, the image data processing device may generate the data module corresponding to the row vector through a matrix transformation function, or through other methods, which is not specifically limited in this embodiment.

S2. Splicing the m data modules according to a preset splicing rule to obtain a third feature map.

After the image data processing device generates the m data modules, it splices them according to a preset splicing rule to obtain a third feature map with a length of a, a width of b and a depth of p, where a×b=m.

For example, the splicing rule includes a splicing order. The order may be top to bottom, then left to right: the first a data modules are spliced from top to bottom to form the first column of the third feature map, and the 2nd, 3rd, ..., b-th columns are then formed in the same top-to-bottom order to obtain the third feature map. The splicing order may also be top to bottom, then right to left; bottom to top, then left to right; bottom to top, then right to left; or another order, which is not limited in this embodiment.

For example, the first image matrix is

$$A_{4\times 3} = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} \\ a_{2,1} & a_{2,2} & a_{2,3} \\ a_{3,1} & a_{3,2} & a_{3,3} \\ a_{4,1} & a_{4,2} & a_{4,3} \end{bmatrix}$$

with a=2 and b=2, and the splicing order is top to bottom, then left to right. The image data processing device generates four data modules M_1, M_2, M_3 and M_4 corresponding to the four row vectors [a_{1,1} a_{1,2} a_{1,3}], [a_{2,1} a_{2,2} a_{2,3}], [a_{3,1} a_{3,2} a_{3,3}] and [a_{4,1} a_{4,2} a_{4,3}] of A_{4×3}. It splices M_2 below M_1 to obtain the first column, then splices M_3 to the right of M_1, and then splices M_4 below M_3 to obtain the second column, as shown in FIG. 2.
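Steps S1 and S2 can be sketched with NumPy as follows. This is an illustrative sketch, not the applicant's implementation: the function name `splice_column_major` is hypothetical, the feature map is stored with shape (length, width, depth), and only the top-to-bottom, then left-to-right splicing order is covered.

```python
import numpy as np

def splice_column_major(A, a, b):
    """Arrange the m rows of A (m x p) as an a x b x p feature map.

    Each row of A acts as a 1 x 1 x p data module. Modules fill the
    grid top to bottom, then left to right, so row i of A (zero-based)
    lands at grid position (i % a, i // a).
    """
    A = np.asarray(A)
    m, p = A.shape
    if a * b != m:
        raise ValueError("a * b must equal the number of rows m")
    # reshape groups the rows into b columns of a modules each;
    # transpose moves the within-column index to the first axis.
    return A.reshape(b, a, p).transpose(1, 0, 2)

# Stand-in values for A_{4x3}: rows are M_1..M_4.
A = np.arange(12).reshape(4, 3)
third_map = splice_column_major(A, a=2, b=2)
```

With these stand-in values, `third_map[1, 0]` holds the second row of A (M_2, below M_1) and `third_map[0, 1]` holds the third row (M_3, at the top of the second column).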

S3. Insert a zero data module into the third feature map according to a preset zero padding rule to obtain the first feature map.

It should be understood that, in this embodiment, each zero data module has a length of 1, a width of 1 and a depth of n, and the values of all elements in the zero data module are zero. The zero padding rule is related to the number of column vectors of the second image matrix B_{p×n} contained in each convolution kernel and to the positions of those column vectors in the convolution kernel.

As an optional way, each convolution kernel corresponding to the second image matrix B_{p×n} contains one column vector of the second image matrix B_{p×n}. In that case the image data processing device can insert the zero data modules as follows: it inserts zero data modules at the upper edge, lower edge, left edge and right edge of the third feature map to obtain the first feature map.

For example, if each convolution kernel contains one column vector, f is an odd number, and the column vector is located at the center of the convolution kernel, then the image data processing device can insert (f-1)/2 rows of zero data modules at each of the upper edge and lower edge of the third feature map, and (f-1)/2 columns of zero data modules at each of the left edge and right edge, to obtain the first feature map. That is, in channels 1 to n of the first feature map, the values of the first (f-1)/2 columns, the last (f-1)/2 columns, the first (f-1)/2 rows and the last (f-1)/2 rows are all zero. For example, with f=3 and the first image matrix A_{4×3}, the third feature map obtained by the image data processing device is shown in FIG. 2. The device then inserts one row of zero data modules at each of the upper edge and lower edge of the third feature map, and one column of zero data modules at each of the left edge and right edge, to obtain the first feature map, as shown in FIG. 3.
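The edge padding for the centered case can be sketched with `numpy.pad`. The sketch assumes a padding width of (f-1)/2 on every edge, which agrees with the f=3 example above (one row or column per edge); the function name `pad_for_center_tap` is hypothetical, and the feature map is assumed to be stored as (length, width, depth).

```python
import numpy as np

def pad_for_center_tap(third_map, f):
    """Zero-pad the two spatial axes by (f - 1) // 2 on every edge.

    The depth axis is left untouched, so each inserted zero module
    spans the full depth of the feature map.
    """
    if f < 3 or f % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size f >= 3")
    q = (f - 1) // 2
    return np.pad(third_map, ((q, q), (q, q), (0, 0)), mode="constant")

third_map = np.ones((2, 2, 3))          # stand-in 2 x 2 x 3 third feature map
first_map = pad_for_center_tap(third_map, f=3)
```

For a 2×2×3 input and f=3 the result is 4×4×3, with the original data in the central 2×2 block and zeros on the border.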

If each convolution kernel contains one column vector, f is an odd number, and the column vector is not located at the center of the convolution kernel, the image data processing device can insert L_1 rows of zero data modules at the upper edge of the third feature map, L_2 rows at the lower edge, L_3 columns at the left edge and L_4 columns at the right edge to obtain the first feature map, where the values of L_1, L_2, L_3 and L_4 are related to the position of the column vector in the convolution kernel. For example, with f=3 and the column vector located at row 1, column 1 of the convolution kernel, the image data processing device inserts 1 row of zero data modules at the upper edge, 2 rows at the lower edge, 1 column at the left edge and 2 columns at the right edge, as shown in FIG. 4.

As an optional way, the system sets each convolution kernel corresponding to the second image matrix B_{p×n} to contain multiple column vectors of the second image matrix B_{p×n}. In that case the image data processing device can insert the zero data modules as follows: it inserts zero data modules at target positions of the third feature map to obtain the first feature map, where the target positions include internal positions and edge positions and are related to the positions of the multiple column vectors in the convolution kernel.

For example, with f=3, each convolution kernel contains two column vectors W_1 and W_2, where W_1 is located at row 2, column 2 of the convolution kernel and W_2 is located at row 2, column 3. The image data processing device inserts one column of zero data modules between adjacent columns of the third feature map, one row of zero data modules at each of the upper edge and lower edge, and one column of zero data modules at each of the left edge and right edge, to obtain the first feature map, as shown in FIG. 5.

It should be understood that the zero padding rule is related to the number of column vectors contained in the convolution kernel and the corresponding position of each column vector in the convolution kernel. Specifically, the user can set the number of column vectors and the corresponding position of the column vector in the convolution kernel. It can also be set in other ways, which is not limited in this embodiment.

103. The image data processing apparatus determines X convolution kernels corresponding to the second image matrix.

After the image data processing device acquires the second image matrix B_{p×n}, it generates X convolution kernels of size f×f×p corresponding to the second image matrix B_{p×n} according to a second preset rule. Each convolution kernel contains one or more column vectors of the second image matrix B_{p×n}, that is, one or more columns of elements of B_{p×n}; f is an integer greater than 1, and X is an integer greater than or equal to 1.

For example, the image data processing device can determine the X convolution kernels corresponding to the second image matrix B_{p×n} in the following manner: it determines the values of n groups of target elements in the X convolution kernels, and sets the values of all other elements in the X convolution kernels to 0. Each group of target elements corresponds to one column vector of the second image matrix B_{p×n}: the values of the i-th group of target elements correspond one-to-one to the values of the i-th column of elements of B_{p×n}, where i is an integer from 1 to n, and each convolution kernel contains one or more of the n groups of target elements.

It should be noted that the size of the convolution kernel is preset, that is, the value of f is a preset value. The system can also set the number of column vectors contained in each convolution kernel, that is, how many columns of elements of the second image matrix B_{p×n} each convolution kernel contains. Further, the system can also set the position within the convolution kernel of each column vector that the kernel contains.

In some embodiments, the system sets each convolution kernel to contain one column vector; then X=n, and each convolution kernel contains one group of target elements. For each convolution kernel, the target elements Z_1, Z_2, ..., Z_p contained in the kernel are respectively located at row j, column k of channels 1, 2, ..., p of the kernel. For example, if the system sets the column vector to be located at the center of the convolution kernel and f is an odd number, then the target elements Z_1, Z_2, ..., Z_p are respectively located at row (f+1)/2, column (f+1)/2 of channels 1, 2, ..., p of the kernel. For example, with f=3 and the second image matrix

$$B_{3\times 3} = \begin{bmatrix} b_{1,1} & b_{1,2} & b_{1,3} \\ b_{2,1} & b_{2,2} & b_{2,3} \\ b_{3,1} & b_{3,2} & b_{3,3} \end{bmatrix}$$

the image data processing device determines that the second image matrix B_{3×3} corresponds to three groups of target elements, b_{1,1}, b_{2,1}, b_{3,1}; b_{1,2}, b_{2,2}, b_{3,2}; and b_{1,3}, b_{2,3}, b_{3,3}, and sets the values of all elements other than the target elements in each convolution kernel to 0. Each convolution kernel contains one group of target elements, located at row 2, column 2 of channels 1, 2 and 3 of the kernel, as shown in FIG. 6.
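The single-column, centered case can be checked end to end with a small NumPy sketch. It is an illustration under stated assumptions, not the applicant's implementation: the helper names are hypothetical, the convolution is written as a naive stride-1 cross-correlation loop rather than a hardware kernel, the splicing order is top to bottom then left to right, and random integer matrices stand in for real image data.

```python
import numpy as np

def build_center_kernels(B, f):
    """One f x f x p kernel per column of B, non-zero only at the center."""
    B = np.asarray(B)
    p, n = B.shape
    c = (f - 1) // 2                      # zero-based center index
    kernels = np.zeros((n, f, f, p), dtype=B.dtype)
    kernels[:, c, c, :] = B.T             # kernel i holds column i of B
    return kernels

def conv2d_valid(feature_map, kernels):
    """Naive stride-1 cross-correlation without extra padding."""
    H, W, _ = feature_map.shape
    X, f = kernels.shape[0], kernels.shape[1]
    out = np.zeros((H - f + 1, W - f + 1, X),
                   dtype=np.result_type(feature_map, kernels))
    for r in range(H - f + 1):
        for s in range(W - f + 1):
            window = feature_map[r:r + f, s:s + f, :]
            # Contract the window against every kernel at once.
            out[r, s, :] = np.tensordot(kernels, window,
                                        axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
m, p, n, a, b, f = 4, 3, 3, 2, 2, 3
A = rng.integers(-5, 5, size=(m, p))
B = rng.integers(-5, 5, size=(p, n))

third_map = A.reshape(b, a, p).transpose(1, 0, 2)    # splice rows of A
q = (f - 1) // 2
first_map = np.pad(third_map, ((q, q), (q, q), (0, 0)))  # zero edges
second_map = conv2d_valid(first_map, build_center_kernels(B, f))

# Undo the splicing to read the product back as an m x n matrix.
C = second_map.transpose(1, 0, 2).reshape(m, n)
matches = np.array_equal(C, A @ B)
```

Because only the center tap of each kernel is non-zero, flipping the kernel (true convolution versus cross-correlation) makes no difference in this case, and `matches` confirms that the second feature map carries the same values as A·B.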

In some embodiments, the system presets that each convolution kernel includes u column vectors, and u is an integer greater than 1 and less than or equal to f.

If n is divisible by u, then X=n/u, and each convolution kernel contains u groups of target elements. For example, for each of the X convolution kernels, the p target elements of the v-th group contained in the convolution kernel are respectively located in row j _v , column k _v of the 1st, 2nd, ..., p-th channels of the convolution kernel, where v is an integer from 1 to u;

If n is not divisible by u, then X is the smallest integer greater than n/u, the 1st to (X-1)-th convolution kernels among the X convolution kernels each contain u groups of target elements, and the X-th convolution kernel contains n-u(X-1) groups of target elements. For each of the 1st to (X-1)-th convolution kernels, the v-th group of the u groups of target elements is located in row j _v , column k _v of the 1st, 2nd, ..., p-th channels of the convolution kernel. For the X-th convolution kernel, the r-th group of the n-u(X-1) groups of target elements is located in row j _r , column k _r of the 1st, 2nd, ..., p-th channels of the convolution kernel, where r is an integer from 1 to n-u(X-1).

Exemplarily, f=3 and each convolution kernel contains two column vectors W ₁ and W ₂ , where W ₁ is located in row 2, column 2 of the convolution kernel and W ₂ is located in row 2, column 3 of the convolution kernel. The second image matrix is

$$B_{3\times 3}=\begin{pmatrix} b_{1,1} & b_{1,2} & b_{1,3} \\ b_{2,1} & b_{2,2} & b_{2,3} \\ b_{3,1} & b_{3,2} & b_{3,3} \end{pmatrix}$$

The image data processing device determines that the second image matrix B _3×3 corresponds to three groups of target elements: b _1,1 , b _2,1 , b _3,1 ; b _1,2 , b _2,2 , b _3,2 ; b _1,3 , b _2,3 , b _3,3 ; and determines that the values of the elements other than the target elements in each convolution kernel are 0. The first convolution kernel contains the first-column elements b _1,1 , b _2,1 , b _3,1 and the second-column elements b _1,2 , b _2,2 , b _3,2 of the second image matrix B _3×3 , and the second convolution kernel contains the third-column elements b _1,3 , b _2,3 , b _3,3 of the second image matrix B _3×3 , as shown in FIG. 7 .
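The grouping of columns into kernels can be sketched as follows. This is an illustrative sketch only: the helper name `grouped_kernels`, the `positions` argument and the array layout are assumptions, and X is computed as the ceiling of n/u, which covers both the divisible and the non-divisible case.

```python
import math
import numpy as np

def grouped_kernels(B, positions, f=3):
    """Pack the n columns of B (p x n) into X = ceil(n / u) kernels.

    positions: list of u zero-based (row, col) slots inside the f x f kernel.
    Slot v of a kernel holds the v-th column assigned to that kernel
    (an assumed assignment order: columns are taken left to right).
    """
    p, n = B.shape
    u = len(positions)
    X = math.ceil(n / u)
    kernels = np.zeros((X, f, f, p), dtype=B.dtype)
    for i in range(n):
        x, v = divmod(i, u)  # kernel index, slot index within that kernel
        j, k = positions[v]
        kernels[x, j, k, :] = B[:, i]
    return kernels

B = np.arange(1, 10).reshape(3, 3)  # placeholder values for b_1,1 ... b_3,3
# W1 at row 2, column 2 and W2 at row 2, column 3 (one-based), as in the example.
K = grouped_kernels(B, positions=[(1, 1), (1, 2)])
```

With n=3 and u=2 this yields X=2 kernels: the first holds columns 1 and 2 of `B`, and the second holds only column 3, that is, n-u(X-1)=1 group.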

104. The image data processing device convolves the first feature map with X convolution kernels to obtain a second feature map.

After the image data processing device generates the first feature map and the X convolution kernels, the first feature map is convolved with the X convolution kernels to obtain the second feature map. The second feature map corresponds to the third image matrix C _m×n , and the third image matrix C _m×n is the result of matrix multiplication of the first image matrix A _m×p and the second image matrix B _p×n .

In some embodiments, f is an odd number, each convolution kernel contains one column vector of the second image matrix, and the column vector is located at the center of the convolution kernel. In this case, the second feature map obtained by the image data processing device convolving the first feature map with the n convolution kernels has length a, width b, and depth n. Each 1×1×n data module of the second feature map in the depth direction corresponds to a row vector of the third image matrix C _m×n , and the order in which these vectors are distributed in the second feature map corresponds to the splicing order of the third feature map; that is, the second feature map maps to C _m×n in the same way that the third feature map maps to A _m×p .
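The equivalence between this convolution and the matrix product can be checked numerically. The sketch below uses NumPy only and makes several assumptions that the text does not fix: rows of A are spliced row-major, the zero padding is (f-1)/2 modules on every edge, the convolution is the stride-1 sliding-window product used in neural networks, and the function name `matmul_by_conv` is hypothetical.

```python
import numpy as np

def matmul_by_conv(A, B, a, b, f=3):
    """Compute A @ B with a stride-1 'valid' convolution.

    Assumptions: rows of A are spliced row-major into an a x b x p map,
    the map is zero padded by (f - 1) // 2 on every edge, and column i of B
    sits at the center of kernel i.
    """
    m, p = A.shape
    n = B.shape[1]
    assert a * b == m and f % 2 == 1
    c = (f - 1) // 2
    third = A.reshape(a, b, p)                        # third feature map
    first = np.pad(third, ((c, c), (c, c), (0, 0)))   # first feature map
    kernels = np.zeros((n, f, f, p), dtype=A.dtype)
    kernels[:, c, c, :] = B.T                         # one column of B per kernel
    second = np.zeros((a, b, n), dtype=A.dtype)       # second feature map
    for y in range(a):
        for x in range(b):
            window = first[y:y + f, x:x + f, :]
            # Sum of elementwise products between each kernel and the window.
            second[y, x, :] = np.tensordot(kernels, window, axes=3)
    return second.reshape(m, n)                       # one row of C per module

rng = np.random.default_rng(0)
A = rng.integers(-5, 5, size=(6, 4))
B = rng.integers(-5, 5, size=(4, 3))
C = matmul_by_conv(A, B, a=2, b=3)
```

Because the only nonzero kernel entries sit at the center, each output module is the dot product of one row of A with every column of B, so `C` equals `A @ B`.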

In some embodiments, each convolution kernel contains a plurality of column vectors of the second image matrix. After the image data processing device convolves the first feature map with the X convolution kernels to obtain the second feature map, it rearranges the data modules in the second feature map according to a preset rearrangement rule to obtain a fourth feature map. The length of the fourth feature map is a, the width is b, and the depth is n. Each 1×1×n data module of the fourth feature map in the depth direction corresponds to a row vector of the third image matrix C _m×n , and the order in which these vectors are distributed in the fourth feature map corresponds to the splicing order of the third feature map.
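One way such a zero insertion and rearrangement could work is sketched below for f=3 and two columns per kernel. The interleaving pattern, the rearrangement rule and the function name `matmul_by_conv_two_columns` are hypothetical choices made for this sketch; the preset rules of the application are not reproduced here.

```python
import math
import numpy as np

def matmul_by_conv_two_columns(A, B, a, b):
    """A @ B with 3 x 3 x p kernels that each hold up to two columns of B.

    Assumed layout: the first column of a kernel sits at zero-based (1, 1),
    the second at (1, 2). Row modules of A are interleaved with zero modules
    so that a window never multiplies both columns with real data at once.
    """
    m, p = A.shape
    n = B.shape[1]
    assert a * b == m
    X = math.ceil(n / 2)
    kernels = np.zeros((X, 3, 3, p), dtype=A.dtype)
    for i in range(n):
        kernels[i // 2, 1, 1 + i % 2, :] = B[:, i]
    # First feature map: zero rows on top and bottom, data modules at the
    # even columns 2, 4, ..., 2b, zero modules everywhere else.
    first = np.zeros((a + 2, 2 * b + 2, p), dtype=A.dtype)
    first[1:a + 1, 2:2 * b + 1:2, :] = A.reshape(a, b, p)
    # Stride-1 valid convolution gives a second feature map of size a x 2b x X.
    second = np.zeros((a, 2 * b, X), dtype=A.dtype)
    for y in range(a):
        for x in range(2 * b):
            second[y, x, :] = np.tensordot(kernels, first[y:y + 3, x:x + 3, :], axes=3)
    # Rearrangement: odd output columns hold products with first-slot columns
    # of B, even output columns hold products with second-slot columns.
    fourth = np.zeros((a, b, n), dtype=A.dtype)
    fourth[:, :, 0::2] = second[:, 1::2, :]
    fourth[:, :, 1::2] = second[:, 0::2, :n // 2]
    return fourth.reshape(m, n)

rng = np.random.default_rng(1)
A = rng.integers(-5, 5, size=(6, 4))
B = rng.integers(-5, 5, size=(4, 3))
C = matmul_by_conv_two_columns(A, B, a=2, b=3)
```

Under these assumptions the fourth feature map has size a×b×n and its modules reproduce the rows of `A @ B`, while the number of kernels drops from n to the ceiling of n/2.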

In some embodiments, each convolution kernel contains one column vector of the second image matrix, and the column vector is not located at the center of the convolution kernel. After the image data processing device convolves the first feature map with the n convolution kernels to obtain the second feature map, it deletes the zero data modules at preset positions to obtain a fifth feature map. The length of the fifth feature map is a, the width is b, and the depth is n. Each 1×1×n data module of the fifth feature map in the depth direction corresponds to a row vector of the third image matrix C _m×n , and the order in which these vectors are distributed in the fifth feature map corresponds to the splicing order of the third feature map.
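The off-center case can be sketched in the same style. The padding width of f-1 modules on every edge and the cropping rule below are assumptions chosen so that every product survives the convolution; the function name `matmul_by_conv_offcenter` is hypothetical.

```python
import numpy as np

def matmul_by_conv_offcenter(A, B, a, b, j, k, f=3):
    """A @ B where each kernel holds one column of B at zero-based (j, k).

    Assumption: the spliced map is zero padded by f - 1 on every edge, so the
    products land at an offset that depends on (j, k), and the surrounding
    zero data modules can be deleted afterwards.
    """
    m, p = A.shape
    n = B.shape[1]
    assert a * b == m and 0 <= j < f and 0 <= k < f
    P = f - 1
    first = np.pad(A.reshape(a, b, p), ((P, P), (P, P), (0, 0)))
    kernels = np.zeros((n, f, f, p), dtype=A.dtype)
    kernels[:, j, k, :] = B.T
    H, W = a + f - 1, b + f - 1  # output size of a stride-1 valid convolution
    second = np.zeros((H, W, n), dtype=A.dtype)
    for y in range(H):
        for x in range(W):
            second[y, x, :] = np.tensordot(kernels, first[y:y + f, x:x + f, :], axes=3)
    # Delete the zero data modules: keep the a x b block that holds products.
    fifth = second[P - j:P - j + a, P - k:P - k + b, :]
    return fifth.reshape(m, n)

rng = np.random.default_rng(2)
A = rng.integers(-5, 5, size=(6, 4))
B = rng.integers(-5, 5, size=(4, 3))
C = matmul_by_conv_offcenter(A, B, a=2, b=3, j=0, k=2)
```

The position (j, k) only shifts where the products appear in the second feature map, so cropping at the matching offset recovers `A @ B` for any slot in the kernel.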

In the technical solution of this application, the first image matrix A _m×p and the second image matrix B _p×n are acquired, the first feature map corresponding to the first image matrix A _m×p and the X convolution kernels of size f×f×p corresponding to the second image matrix B _p×n are determined, and the first feature map is then convolved with these X convolution kernels to obtain the feature map corresponding to the third image matrix, where f is an integer greater than 1 and the third image matrix is the result of matrix multiplication of the first image matrix and the second image matrix. That is to say, in this embodiment, the convolution operation can be used in place of the matrix multiplication operation to obtain the corresponding feature map, which solves the problem that the Transformer module cannot be deployed on a chip that only supports convolution calculation, and improves the flexibility of the Transformer module.

Secondly, the embodiment of the present application provides multiple ways of determining the first feature map and multiple ways of determining the convolution kernel, which improves the flexibility of the solution.

Corresponding to the aforementioned embodiments of the method for implementing application functions, the present application also provides an image data processing device, electronic equipment, and corresponding embodiments.

FIG. 8 is a schematic structural diagram of an image data processing device shown in an embodiment of the present application.

Referring to FIG. 8 , the image data processing apparatus 800 in this embodiment includes: an acquisition module 801 , a first determination module 802 , a second determination module 803 , and a convolution module 804 .

An acquisition module 801, configured to acquire a first image matrix A _m×p and a second image matrix B _p×n ;

The first determining module 802 is configured to determine the first feature map corresponding to the first image matrix A _m×p ;

The second determination module 803 is configured to determine X convolution kernels corresponding to the second image matrix B _p×n , the size of each convolution kernel is f×f×p, where f is an integer greater than 1, X is an integer greater than or equal to 1, and each convolution kernel contains one or more columns of elements in the second image matrix B _p×n ;

The convolution module 804 is configured to convolve the first feature map with the X convolution kernels to obtain a second feature map, where the second feature map corresponds to the third image matrix, and the third image matrix is the result of matrix multiplication of the first image matrix A _m×p and the second image matrix B _p×n .

In the technical solution of this application, the acquisition module 801 can acquire the first image matrix A _m×p and the second image matrix B _p×n , the first determination module 802 can determine the first feature map corresponding to the first image matrix A _m×p , the second determination module 803 can determine the X convolution kernels of size f×f×p corresponding to the second image matrix B _p×n , and the convolution module 804 can then convolve the first feature map with the X convolution kernels to obtain the feature map corresponding to the third image matrix, where f is an integer greater than 1 and the third image matrix is the result of matrix multiplication of the first image matrix and the second image matrix. That is to say, in this embodiment, the convolution operation can be used in place of the matrix multiplication operation to obtain the corresponding feature map, which solves the problem that the Transformer module cannot be deployed on a chip that only supports convolution calculation, and improves the flexibility of the Transformer module.

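For ease of understanding, the image data processing device in this application is described in detail below. Referring to FIG. 9, the image data processing device includes: an acquisition module 901, a first determining module 902, a second determining module 903, and a convolution module 904 .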

An acquisition module 901, configured to acquire a first image matrix A _m×p and a second image matrix B _p×n ;

The first determining module 902 is configured to determine the first feature map corresponding to the first image matrix A _m×p ;

The second determining module 903 is configured to determine X convolution kernels corresponding to the second image matrix B _p×n , the size of each convolution kernel is f×f×p, where f is an integer greater than 1, X is an integer greater than or equal to 1, and each convolution kernel contains one or more columns of elements in the second image matrix B _p×n ;

The convolution module 904 is configured to convolve the first feature map with the X convolution kernels to obtain a second feature map, where the second feature map corresponds to the third image matrix, and the third image matrix is the result of matrix multiplication of the first image matrix A _m×p and the second image matrix B _p×n .

The second determination module 903 includes: a first determination unit 9031 and a second determination unit 9032 .

The first determination unit 9031 is configured to determine the values of n groups of target elements in the X convolution kernels, the value of the i-th group of target elements in the n groups of target elements corresponds to the value of the i-th column element in the second image matrix, i is an integer from 1 to n, and each convolution kernel contains one or more sets of target elements in the n sets of target elements;

The second determination unit 9032 is configured to determine that values of other elements in the X convolution kernels are 0 except for the n groups of target elements.

Optionally, each convolution kernel contains a group of target elements Z ₁ , Z ₂ ,...,Z _p in n groups of target elements, and for each convolution kernel, the convolution kernel contains target elements Z ₁ , Z ₂ ,..., Z _p are respectively located in row j and column k of the 1st, 2nd,...,p channels of the convolution kernel.

The first determining module 902 includes: a third determining unit 9021 , a first splicing unit 9022 , and a first zero padding unit 9023 .

The third determination unit 9021 is configured to determine m data modules corresponding to the m row vectors of the first image matrix A _m×p , the length of these m data modules is 1, the width is 1, and the depth is p;

The first splicing unit 9022 is configured to splice m data modules according to preset splicing rules to obtain a third feature map, the length of the third feature map is a, the width is b, and the depth is p, and a×b=m ;

The first zero padding unit 9023 is configured to insert zero data modules at the upper edge, lower edge, left edge and right edge of the third feature map respectively to obtain the first feature map, where the length of the zero data module is 1, the width is 1, and the depth is p, and all elements in the zero data module have the value zero.
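The work of these three units can be sketched in NumPy. The row-major splicing rule, the `pad` parameter and the function name `first_feature_map` are assumptions for illustration; the preset splicing rule itself is not specified in this passage.

```python
import numpy as np

def first_feature_map(A, a, b, pad=1):
    """Turn A (m x p) into a zero-padded feature map.

    Each row of A becomes a 1 x 1 x p data module, the m modules are spliced
    row-major (an assumed rule) into an a x b x p map, and zero data modules
    are inserted along the upper, lower, left and right edges.
    """
    m, p = A.shape
    if a * b != m:
        raise ValueError("a * b must equal the number of rows m")
    third = A.reshape(a, b, p)  # third feature map
    return np.pad(third, ((pad, pad), (pad, pad), (0, 0)))

A = np.arange(24).reshape(6, 4)  # m = 6 rows, p = 4
F = first_feature_map(A, a=2, b=3)
```

With one ring of zero modules, a 2×3×4 third feature map becomes a 4×5×4 first feature map, which a 3×3 kernel at stride 1 maps back to a 2×3 output.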

Optionally, each convolution kernel contains u groups of target elements among the n groups of target elements, where u is an integer greater than 1 and less than or equal to f. For each convolution kernel, the p target elements of the v-th group contained in the convolution kernel are respectively located in row j _v , column k _v of the 1st, 2nd, ..., p-th channels of the convolution kernel, where v is an integer from 1 to u.

The first determination module 902 includes: a fourth determination unit 9024 , a second splicing unit 9025 , and a second zero padding unit 9026 .

The fourth determination unit 9024 is configured to determine m data modules corresponding to the m row vectors of the first image matrix A _m×p , the length of these m data modules is 1, the width is 1, and the depth is p;

The second splicing unit 9025 is configured to splice m data modules according to a preset splicing rule to obtain a third feature map, where the length of the third feature map is a, the width is b, and the depth is p, where a×b=m;

The second zero padding unit 9026 is configured to insert zero data modules at target positions of the third feature map to obtain the first feature map. The target positions include internal positions and edge positions, and are related to the positions of the multiple groups of target elements in the convolution kernel. The length of the zero data module is 1, the width is 1, and the depth is n, and all elements in the zero data module have the value zero.

In the technical solution of this application, the acquisition module 901 can acquire the first image matrix A _m×p and the second image matrix B _p×n , the first determination module 902 can determine the first feature map corresponding to the first image matrix A _m×p , the second determination module 903 can determine the X convolution kernels of size f×f×p corresponding to the second image matrix B _p×n , and the convolution module 904 can then convolve the first feature map with the X convolution kernels to obtain the feature map corresponding to the third image matrix, where f is an integer greater than 1 and the third image matrix is the result of matrix multiplication of the first image matrix and the second image matrix. That is to say, in this embodiment, the convolution operation can be used in place of the matrix multiplication operation to obtain the corresponding feature map, which solves the problem that the Transformer module cannot be deployed on a chip that only supports convolution calculation, and improves the flexibility of the Transformer module.

Secondly, the embodiment of the present application provides ways to determine the first feature map and the convolution kernel in multiple ways, which improves the flexibility of the solution.

Regarding the apparatus in the above embodiments, the specific manner in which each module executes operations has been described in detail in the embodiments related to the method, and will not be described in detail here.

Referring to FIG. 10 , an electronic device 1000 includes a memory 1010 and a processor 1020 .

The processor 1020 can be a central processing unit (Central Processing Unit, CPU), and can also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The memory 1010 may include various types of storage units such as system memory, read-only memory (ROM), and a persistent storage device. The ROM may store static data or instructions required by the processor 1020 or other modules of the computer. The persistent storage device may be a readable and writable storage device, and may be a non-volatile storage device that does not lose stored instructions and data even if the computer is powered off. In some embodiments, the persistent storage device is a mass storage device (such as a magnetic or optical disk, or flash memory). In some other embodiments, the persistent storage device may be a removable storage device (such as a floppy disk or an optical drive). The system memory may be a readable and writable storage device or a volatile readable and writable storage device, such as dynamic random access memory. The system memory can store some or all of the instructions and data that the processor needs at runtime. In addition, the memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (such as DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be used. In some embodiments, the memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a super density disc, a flash memory card (such as an SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or by wire.

Executable codes are stored in the memory 1010 , and when the executable codes are processed by the processor 1020 , the processor 1020 may execute part or all of the methods mentioned above.

In addition, the method according to the present application can also be implemented as a computer program or computer program product, the computer program or computer program product including computer program code instructions for executing some or all of the steps in the above method of the present application.

Alternatively, the present application may also be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium), on which executable code (or computer program or computer instruction code) is stored, When the executable code (or computer program or computer instruction code) is executed by the processor of the electronic device (or server, etc.), the processor is made to perform part or all of the steps of the above-mentioned method according to the present application.

Having described various embodiments of the present application above, the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and alterations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principle of each embodiment, practical application or improvement of technology in the market, or to enable other ordinary skilled in the art to understand each embodiment disclosed herein.

Claims

An image data processing method, characterized in that, comprising:

acquiring a first image matrix A m×p and a second image matrix B p×n ;

determining a first feature map corresponding to the first image matrix A m×p ;

determining X convolution kernels corresponding to the second image matrix B p×n ;

convolving the first feature map with the X convolution kernels to obtain a second feature map, wherein the second feature map corresponds to a third image matrix, and the third image matrix is the result of matrix multiplication of the first image matrix A m×p and the second image matrix B p×n .
The image data processing method according to claim 1, characterized in that:

The size of each convolution kernel is f×f×p, wherein, the f is an integer greater than 1, and the X is an integer greater than or equal to 1.
The image data processing method according to claim 2, characterized in that:

Each convolution kernel contains one or more columns of elements in the second image matrix B p×n .
The image data processing method according to claim 1, wherein the determining the X convolution kernels corresponding to the second image matrix B p×n comprises:

determining the values of n groups of target elements in the X convolution kernels, wherein the value of the i-th group of target elements in the n groups of target elements corresponds to the value of the i-th column of elements in the second image matrix, i is an integer from 1 to n, and each convolution kernel contains one or more groups of target elements in the n groups of target elements;

determining that the values of the elements in the X convolution kernels other than the n groups of target elements are 0.
The image data processing method according to claim 4, characterized in that:

Each convolution kernel contains a group of target elements Z 1 , Z 2 ,..., Z p in the n groups of target elements, and for each convolution kernel, the convolution kernel contains target elements Z 1 , Z 2 , ..., Z p are respectively located in row j, column k of the 1st, 2nd, ..., p channels of the convolution kernel;

The determining the first feature map corresponding to the first image matrix A m×p includes:

Determine m data modules corresponding to the m row vectors of the first image matrix A m×p , the length of the data modules is 1, the width is 1, and the depth is p;

splicing the m data modules according to a preset splicing rule to obtain a third feature map, the length of the third feature map is a, the width is b, and the depth is p, and the a×b=m;

Inserting zero data modules into the upper edge, lower edge, left edge and right edge of the third feature map respectively to obtain the first feature map, the length of the zero data module is 1, the width is 1, and the depth is p, All elements in the zero data module have a value of zero.
The image data processing method according to claim 4, characterized in that:

Each convolution kernel contains u groups of target elements among the n groups of target elements, where u is an integer greater than 1 and less than or equal to f; for each convolution kernel, the p target elements of the v-th group contained in the convolution kernel are respectively located in row j v , column k v of the 1st, 2nd, ..., p-th channels of the convolution kernel, where v is an integer from 1 to u;

The determining the first feature map corresponding to the first image matrix A m×p includes:

Determine m data modules corresponding to the m row vectors of the first image matrix A m×p , the length of the data modules is 1, the width is 1, and the depth is p;

splicing the m data modules according to a preset splicing rule to obtain a third feature map, the length of the third feature map is a, the width is b, and the depth is p, and the a×b=m;

inserting zero data modules at target positions of the third feature map to obtain the first feature map, wherein the target positions include internal positions and edge positions and are related to the positions of the multiple groups of target elements in the convolution kernel, the length of the zero data module is 1, the width is 1, and the depth is n, and the values of all elements in the zero data module are zero.
An image data processing device, characterized in that it comprises:

An acquisition module, configured to acquire the first image matrix A m×p and the second image matrix B p×n ;

A first determining module, configured to determine a first feature map corresponding to the first image matrix A m×p ;

A second determining module, configured to determine X convolution kernels corresponding to the second image matrix B p×n ;

A convolution module, configured to convolve the first feature map with the X convolution kernels to obtain a second feature map, wherein the second feature map corresponds to a third image matrix, and the third image matrix is the result of matrix multiplication of the first image matrix A m×p and the second image matrix B p×n .
The image data processing device according to claim 7, characterized in that:

Among the X convolution kernels determined by the second determination module, the size of each convolution kernel is f×f×p, wherein f is an integer greater than 1, X is an integer greater than or equal to 1, and each convolution kernel contains one or more columns of elements in the second image matrix B p×n .
The image data processing device according to claim 8, wherein the second determining module comprises:

The first determination unit is configured to determine the values of n groups of target elements in the X convolution kernels, wherein the value of the i-th group of target elements in the n groups of target elements corresponds to the value of the i-th column of elements in the second image matrix, i is an integer from 1 to n, and each convolution kernel contains one or more groups of target elements in the n groups of target elements;

The second determination unit is configured to determine that values of other elements in the X convolution kernels except the n groups of target elements are 0.
The image data processing device according to claim 9, characterized in that:

Each convolution kernel contains a group of target elements Z 1 , Z 2 ,..., Z p in the n groups of target elements, and for each convolution kernel, the convolution kernel contains target elements Z 1 , Z 2 , ..., Z p are respectively located in row j, column k of the 1st, 2nd, ..., p channels of the convolution kernel;

The first determination module includes:

The third determination unit is configured to determine m data modules corresponding to the m row vectors of the first image matrix A m×p , the length of the data modules is 1, the width is 1, and the depth is p;

The first splicing unit is configured to splice the m data modules according to a preset splicing rule to obtain a third feature map, the length of the third feature map is a, the width is b, and the depth is p, and the a ×b=m;

The first zero padding unit is used to respectively insert zero data modules into the upper edge, lower edge, left edge and right edge of the third feature map to obtain the first feature map, the length of the zero data module is 1, The width is 1, the depth is p, and the values of all elements in the zero data module are zero.
The image data processing device according to claim 9, characterized in that:

Each convolution kernel contains u groups of target elements among the n groups of target elements, where u is an integer greater than 1 and less than or equal to f; for each convolution kernel, the p target elements of the v-th group contained in the convolution kernel are respectively located in row j v , column k v of the 1st, 2nd, ..., p-th channels of the convolution kernel, where v is an integer from 1 to u;

The first determination module includes:

The fourth determination unit is configured to determine m data modules corresponding to the m row vectors of the first image matrix A m×p , the length of the data modules is 1, the width is 1, and the depth is p;

The second splicing unit is configured to splice the m data modules according to a preset splicing rule to obtain a third feature map, the length of the third feature map is a, the width is b, and the depth is p, and the a ×b=m;

The second zero padding unit is configured to insert zero data modules at target positions of the third feature map to obtain the first feature map, wherein the target positions include internal positions and edge positions and are related to the positions of the multiple groups of target elements in the convolution kernel, the length of the zero data module is 1, the width is 1, and the depth is n, and the values of all elements in the zero data module are zero.
An electronic device, characterized in that it comprises:

processor; and

a memory on which executable code is stored, wherein the executable code, when executed by the processor, causes the processor to execute the method according to any one of claims 1-6.
A computer-readable storage medium on which executable code is stored, wherein the executable code, when executed by a processor of an electronic device, causes the processor to execute the method according to any one of claims 1-6.