CN111639652A - Image processing method and device and computer storage medium - Google Patents
- Publication number
- CN111639652A (application number CN202010354847.5A)
- Authority
- CN
- China
- Prior art keywords
- initial
- feature map
- channel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- Biodiversity & Conservation Biology (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image processing method, an image processing apparatus, and a computer storage medium. The image processing method includes: acquiring an initial feature map of an image; performing convolution processing on the initial feature map to obtain a feature vector for each channel of the initial feature map, where the convolution parameters used in the convolution processing are determined by the height and width of the initial feature map; performing a Softmax operation on the feature vectors of the channels to obtain an attention value for each channel; and obtaining a new feature map from the initial feature map and the attention values of the channels. The method, apparatus, and storage medium adaptively learn an effective expression of the input feature map in an end-to-end manner, so that the learned features are discriminative and representative for the task at hand, which improves both processing efficiency and accuracy.
Description
Technical Field
The present invention relates to the field of image processing, and in particular, to an image processing method, an image processing apparatus, and a computer storage medium.
Background
The human visual attention mechanism is a means of rapidly screening high-value information from a large amount of information using limited attention resources, and it improves the efficiency and accuracy of visual information processing. The attention mechanism in deep learning borrows this human mode of attention; its core aim is to select, from a mass of information, the information most relevant to the current task. It has been widely applied to many different types of tasks, such as image classification and detection, and has achieved remarkable results. A typical channel attention structure, represented by SENet, first performs global average pooling on each channel of a feature map, then obtains an attention value (i.e., a degree of attention) representing the importance of each channel through Softmax, and finally multiplies the attention values with the original feature map. The principle of this structure is to strengthen important features and weaken unimportant ones by controlling the magnitude of the attention values, so that the extracted features have stronger directivity. Although the channel attention mechanism achieves good results in different tasks, its global average pooling relies only on a prior assumption to obtain the representation of each channel: the contribution of every pixel to the whole feature map is treated as equal. The expression features obtained in this way are neither discriminative nor representative, that is, they are a non-optimal expression, so the processing accuracy remains to be improved.
Disclosure of Invention
The invention aims to provide an image processing method, an image processing device and a computer storage medium, which can improve the processing efficiency and accuracy.
To achieve this purpose, the technical solution of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides an image processing method, where the image processing method includes:
acquiring an initial feature map of an image;
performing convolution processing on the initial feature map to obtain a feature vector for each channel of the initial feature map; wherein the convolution parameters used by the convolution processing are determined by the height and width of the initial feature map;
performing a Softmax operation on the feature vectors of the channels to obtain an attention value for each channel;
and obtaining a new feature map according to the initial feature map and the attention values of the channels.
As an embodiment, performing convolution processing on the initial feature map to obtain a feature vector for each channel of the initial feature map includes:
performing feature enhancement processing on the initial feature map;
and performing convolution processing on the feature-enhanced initial feature map to obtain the feature vector of each channel of the initial feature map.
As an embodiment, performing the feature enhancement processing on the initial feature map includes:
performing convolution processing on the initial feature map with a 1 x 1 convolution kernel to enhance the features of the initial feature map.
As an embodiment, performing convolution processing on the feature-enhanced initial feature map to obtain a feature vector for each channel includes:
performing convolution processing on each channel of the feature-enhanced initial feature map with a convolution kernel of size H x W to obtain the feature vector of each channel; wherein H represents the height of the initial feature map and W represents the width of the initial feature map.
As an embodiment, before convolving each channel of the feature-enhanced initial feature map with the H x W convolution kernel to acquire the feature vector of each channel, the method further includes:
normalizing the feature-enhanced initial feature map.
As an embodiment, before performing the Softmax operation on the feature vector of each channel to acquire the attention value of each channel, the method further includes:
normalizing the feature vectors of the channels.
As an embodiment, obtaining a new feature map according to the initial feature map and the attention values of the channels includes:
multiplying the initial feature map by the attention value of each channel to obtain the new feature map.
In a second aspect, an embodiment of the present invention provides an image processing apparatus, including a processor and a memory for storing a program; when the program is executed by the processor, the program causes the processor to implement the image processing method according to the first aspect.
In a third aspect, an embodiment of the present invention provides a computer storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the image processing method according to the first aspect.
Embodiments of the present invention provide an image processing method, an image processing apparatus, and a computer storage medium. The image processing method includes: acquiring an initial feature map of an image; performing convolution processing on the initial feature map to obtain a feature vector for each channel, where the convolution parameters are determined by the height and width of the initial feature map; performing a Softmax operation on the feature vectors to obtain an attention value for each channel; and obtaining a new feature map from the initial feature map and the attention values. In this way, the initial feature map is convolved based on its own size parameters to obtain a feature vector for each channel, and the new feature map is then obtained from those feature vectors. That is, an effective expression of the input feature map is adaptively learned in an end-to-end manner, so that the learned features are discriminative and representative for the task at hand, effectively improving processing efficiency and accuracy.
Drawings
- FIG. 1 is a block diagram of an SENet unit;
- FIG. 2 is a schematic diagram of the SENet network structure;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a GAPNet network according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further elaborated by combining the drawings and the specific embodiments in the specification. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Conventionally, in image fields such as image classification, object detection, and semantic segmentation, the mean-pooling-based channel attention mechanism represented by SENet is widely used; FIG. 1 shows the SENet Block unit. Ftr in FIG. 1 is a conventional convolution structure, and X and U are the input (C' x H' x W') and output (C x H x W) of Ftr, both of which already exist in the conventional structure. The part SENet adds is the structure after U: first, global average pooling is performed on U (Fsq(.) in FIG. 1, the Squeeze step); the resulting 1 x 1 x C data then passes through two fully connected stages (Fex(.) in FIG. 1, the Excitation step); finally, a Softmax (self-gating mechanism) limits each output to the range [0, 1]. This value is taken as a scale (equivalent to the attention value mentioned in the Background) and multiplied onto the C channels of U, which then serve as the input data of the next stage. The network structure is shown in FIG. 2. Suppose the input feature map x has size c x h x w, where c is the number of channels and h and w are the height and width respectively. For each channel x_k (k in 1, ..., c), conventional global average pooling treats the contribution of every pixel as equal, and finally obtains the 1 x 1 feature map expression:

z_k = (1 / (h * w)) * SUM_{i=1..h} SUM_{j=1..w} x_k(i, j)
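The global average pooling step just described can be sketched in plain Python (a minimal illustration of the c x h x w layout; `global_average_pool` and the sample data are hypothetical, not part of the patent):

```python
def global_average_pool(feature_map):
    """Collapse each h x w channel of a c x h x w feature map to one scalar.

    Every pixel contributes equally with weight 1 / (h * w), which is the
    prior assumption the patent argues against.
    """
    pooled = []
    for channel in feature_map:
        h = len(channel)
        w = len(channel[0])
        total = sum(sum(row) for row in channel)
        pooled.append(total / (h * w))
    return pooled

# A 2-channel 2x2 feature map: the channel means are 2.5 and 10.0.
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 10.0], [10.0, 10.0]],
]
print(global_average_pool(x))  # [2.5, 10.0]
```

Note that the pooled value discards where in the channel the activation occurred, which is exactly the loss of discriminability the following paragraph describes.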
however, in practice, the contribution and effect of each pixel is different in a specific task, so that the expression features obtained in this way have no discriminant or representativeness, i.e. non-optimal expression, and further influence the processing efficiency and accuracy.
To solve the above problem, referring to FIG. 3, an embodiment of the present invention provides an image processing method, which may be executed by the image processing apparatus provided in an embodiment of the present invention. The image processing apparatus may be implemented in software and/or hardware; in a specific application it may be a vehicle-mounted terminal such as a car head unit, or a mobile terminal such as a smartphone. Taking application to a vehicle-mounted terminal as an example, the image processing method includes the following steps:
step S101: acquiring an initial characteristic map of an image;
step S102: performing convolution processing on the initial characteristic diagram to obtain characteristic vectors of all channels of the initial characteristic diagram; wherein the convolution parameters used by the convolution process are determined by the height and width of the initial feature map;
step S103: performing Softmax operation on the characteristic vectors of the channels to obtain the attention values of the channels;
step S104: and acquiring a new characteristic diagram according to the initial characteristic diagram and the attention values of all channels.
Here, performing convolution processing on the initial feature map to obtain the feature vector of each channel may include: performing feature enhancement processing on the initial feature map, and then performing convolution processing on the feature-enhanced initial feature map to obtain the feature vector of each channel. That is, feature enhancement is applied first, and the contribution value of each pixel is learned afterwards. Preferably, the feature enhancement processing consists of convolving the initial feature map with a 1 x 1 convolution kernel to enhance its features. It should be noted that, besides a 1 x 1 kernel, other kernel sizes such as 2 x 2 or 1 x 2 may be used according to actual needs. Preferably, performing convolution processing on the feature-enhanced initial feature map to obtain the feature vector of each channel includes: convolving each channel of the feature-enhanced initial feature map with a convolution kernel of size H x W to obtain the feature vector of that channel, where H is the height of the initial feature map and W is its width.
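Because the H x W kernel spans the entire channel, the per-channel convolution reduces to a learned weighted sum of that channel's pixels. A minimal sketch in plain Python (the weights here are illustrative placeholders for parameters the network would learn):

```python
def channel_descriptor(channel, weights, bias=0.0):
    """Apply an H x W convolution kernel to one H x W channel.

    Since the kernel size equals the channel size, the output is a single
    scalar: a learned weighted sum of all pixels, replacing the uniform
    averaging of global average pooling.
    """
    out = bias
    for row_px, row_w in zip(channel, weights):
        for px, w in zip(row_px, row_w):
            out += px * w
    return out

channel = [[1.0, 2.0], [3.0, 4.0]]
uniform = [[0.25, 0.25], [0.25, 0.25]]  # recovers plain average pooling
corner  = [[1.0, 0.0], [0.0, 0.0]]      # attends only to the top-left pixel
print(channel_descriptor(channel, uniform))  # 2.5
print(channel_descriptor(channel, corner))   # 1.0
```

With uniform weights the result equals the global average, so global average pooling is a special case of what the learned kernel can express.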
Here, by convolving each channel of the feature-enhanced initial feature map with an H x W kernel, the contribution of each pixel is reflected in the feature vector of each channel.
Understandably, performing a Softmax operation on the feature vector of each channel yields an attention value for each channel in the range (0, 1). In one embodiment, obtaining a new feature map from the initial feature map and the attention values of the channels includes multiplying the initial feature map by the attention value of each channel. In addition, a weight coefficient may be set for each channel according to its characteristics, and the new feature map may then be obtained from the initial feature map, the attention values of the channels, and the weight coefficients of the channels.
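The Softmax-and-rescale step described above can be sketched as follows (an illustration only; `softmax` and `rescale` are hypothetical helper names):

```python
import math

def softmax(vec):
    """Numerically stable softmax; outputs lie in (0, 1) and sum to 1."""
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    s = sum(exps)
    return [e / s for e in exps]

def rescale(feature_map, attention):
    """Multiply every pixel of channel k by its attention value a_k."""
    return [
        [[px * a for px in row] for row in channel]
        for channel, a in zip(feature_map, attention)
    ]

att = softmax([1.0, 1.0])          # equal scores -> [0.5, 0.5]
x = [[[2.0, 4.0]], [[8.0, 8.0]]]   # 2 channels, each 1 x 2
print(rescale(x, att))             # [[[1.0, 2.0]], [[4.0, 4.0]]]
```

Channels with larger descriptor scores receive attention values closer to 1 and are preserved; channels with smaller scores are suppressed, which is the strengthening/weakening behavior described in the Background.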
In summary, in the image processing method provided by the above embodiment, the initial feature map is convolved based on its size parameters to obtain a feature vector for each channel, and a new feature map is then obtained from those feature vectors. That is, an effective expression of the input feature map is adaptively learned in an end-to-end manner, so that the learned features are discriminative and representative for the task at hand, effectively improving processing efficiency and accuracy.
In one embodiment, to speed convergence and further improve processing efficiency, before convolving each channel with the H x W kernel to obtain its feature vector, the method further includes normalizing the feature-enhanced initial feature map. Likewise, before performing the Softmax operation on the feature vectors to obtain the attention values, the method may further include normalizing the feature vectors of the channels.
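The normalization mentioned above can be illustrated as standardizing a feature vector to zero mean and near-unit variance (a simplified sketch; real batch normalization also learns a scale and a shift parameter, which are omitted here):

```python
import math

def normalize(vec, eps=1e-5):
    """Standardize a feature vector: zero mean, (near-)unit variance.

    eps keeps the division stable when the variance is close to zero.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

print(normalize([2.0, 4.0, 6.0]))  # roughly [-1.22, 0.0, 1.22]
```

Keeping the intermediate values in a standard range prevents any one channel's descriptor from dominating the subsequent Softmax purely because of scale, which is why normalization helps convergence.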
Based on the same inventive concept as the foregoing embodiments, this embodiment elaborates the above technical solution through a specific example. To solve the problem that the feature vector obtained by global average pooling in the SENet Block is a non-optimal expression, the network structure implementing the image processing method of this embodiment is called GAPNet; referring to FIG. 4, it can adaptively learn an effective expression of the input feature map end to end. Suppose the input feature map x has size c x h x w, where c is the number of channels and h and w are the height and width. A two-layer convolutional network adaptively learns the contribution of each pixel of each feature map. The first layer uses a 1 x 1 convolution to achieve feature enhancement, followed by a BN layer for normalization; an h x w convolution layer then learns the contribution value of each pixel, yielding the 1 x 1 feature map expression:

y_k = SUM_{i=1..h} SUM_{j=1..w} w_k(i, j) * x_k(i, j)

where w_k is the learned h x w kernel for channel k.
then, the BN layer is connected to realize normalization, and finally, the attention value of each channel is obtained through the Softmax layer.
In addition, the GAPNet processing flow in the embodiment of the invention can be summarized as the following steps: input the feature map; process it with the Global Attention Pooling model; compute the Softmax attention; apply the attention to the original features; and generate the new feature map.
Overall, the advantages of GAPNet are as follows:
1) an effective expression of the feature map can be learned adaptively end to end, improving accuracy on the current task;
2) compared with a fully connected layer, the fully convolutional structure can accept an input feature map of any scale;
3) like the SENet structure, it does not change the input dimensionality, so it can be plugged into any network structure and is convenient to use in classification, detection, and semantic segmentation tasks;
4) the parameters of the structure number c x 1 x 1 + c x h x w = c(1 + h x w). In the original structure, the global pooling layer has 0 parameters and the two fully connected layers have on the order of 2c^2/r parameters (r being the reduction ratio), so the difference in parameter count is approximately c(1 + h x w) - 2c^2/r. The number of network parameters can thus be computed conveniently, and the structure can be chosen according to the size of the input image.
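The parameter accounting in point 4 can be made concrete in a short sketch (assuming, as an illustration, that SENet's excitation path uses two fully connected layers with reduction ratio r = 16, a common configuration; this figure is an assumption, not stated in the text, and biases are ignored throughout):

```python
def gapnet_params(c, h, w):
    """GAPNet attention block: a depthwise 1x1 convolution (c weights)
    plus one h x w kernel per channel (c * h * w weights)."""
    return c * 1 * 1 + c * h * w

def senet_params(c, r=16):
    """SENet squeeze-excitation: global pooling (0 parameters) plus two
    fully connected layers c -> c/r -> c."""
    return c * (c // r) + (c // r) * c

c, h, w = 256, 7, 7
print(gapnet_params(c, h, w))  # 256 + 256 * 49 = 12800
print(senet_params(c))         # 2 * 256 * 16  = 8192
```

The comparison shows the trade-off: GAPNet's cost grows with the spatial size h x w of the feature map, while SENet's grows with the square of the channel count, which is why the text says the structure can be chosen according to the size of the input image.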
In this way, the effective expression of the input feature map is adaptively learned in an end-to-end manner, so that the learned features are discriminative and representative based on the task at hand rather than on an assumed prior. Meanwhile, the structure can be plugged into a deep learning network and improves processing efficiency and accuracy in tasks such as image classification, object detection, and semantic segmentation.
Based on the same inventive concept as the foregoing embodiments, an embodiment of the present invention provides an image processing apparatus, which may be a vehicle-mounted terminal, a mobile terminal, or a cloud server. As shown in FIG. 5, the apparatus includes: a processor 110 and a memory 111 for storing computer programs capable of running on the processor 110. The single processor 110 illustrated in FIG. 5 does not indicate the number of processors; it merely indicates the processor's position relative to the other components, and in practical applications there may be one or more processors 110. Similarly, the memory 111 illustrated in FIG. 5 only indicates its position relative to the other components, and in practical applications there may be one or more memories 111. The processor 110 is configured to implement the image processing method described above when the computer program is executed.
The apparatus may further comprise: at least one network interface 112. The various components of the device are coupled together by a bus system 113. It will be appreciated that the bus system 113 is used to enable communications among the components. The bus system 113 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled as bus system 113 in FIG. 5.
The memory 111 may be a volatile memory, a nonvolatile memory, or include both volatile and nonvolatile memories. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 111 described in the embodiments of the invention is intended to comprise, without being limited to, these and any other suitable types of memory.
The memory 111 in embodiments of the present invention is used to store various types of data to support the operation of the device. Examples of such data include: any computer program for operating on the device, such as operating systems and application programs; contact data; telephone book data; a message; a picture; video, etc. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs may include various application programs such as a Media Player (Media Player), a Browser (Browser), etc. for implementing various application services. Here, the program that implements the method of the embodiment of the present invention may be included in an application program.
Based on the same inventive concept as the foregoing embodiments, this embodiment further provides a computer storage medium storing a computer program. The computer storage medium may be a memory such as a ferromagnetic random access memory (FRAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); or it may be a device including one or any combination of the above memories, such as a mobile phone, computer, tablet device, or personal digital assistant. The computer program stored in the computer storage medium, when executed by a processor, implements the image processing method applied to the above apparatus. For the specific steps performed when the computer program is executed by the processor, please refer to the description of the embodiment shown in FIG. 3, which is not repeated here.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
As used herein, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, including not only those elements listed, but also other elements not expressly listed.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (9)
1. An image processing method, characterized in that the method comprises:
acquiring an initial feature map of an image;
performing convolution processing on the initial feature map to obtain a feature vector for each channel of the initial feature map; wherein the convolution parameters used by the convolution processing are determined by the height and width of the initial feature map;
performing a Softmax operation on the feature vectors of the channels to obtain an attention value for each channel;
and obtaining a new feature map according to the initial feature map and the attention values of the channels.
2. The method according to claim 1, wherein the convolving of the initial feature map to obtain the feature vector of each channel of the initial feature map comprises:
performing feature enhancement processing on the initial feature map;
and performing convolution processing on the feature-enhanced initial feature map to obtain the feature vector of each channel of the initial feature map.
3. The method according to claim 2, wherein the performing of the feature enhancement processing on the initial feature map comprises:
performing convolution processing on the initial feature map with a 1 x 1 convolution kernel to enhance the features of the initial feature map.
4. The method according to claim 2 or 3, wherein the convolving of the feature-enhanced initial feature map to obtain the feature vector of each channel comprises:
performing convolution processing on each channel of the feature-enhanced initial feature map with a convolution kernel of size H x W to obtain the feature vector of each channel; wherein H represents the height of the initial feature map and W represents the width of the initial feature map.
5. The method according to claim 4, wherein before convolving each channel with the H x W convolution kernel to obtain its feature vector, the method further comprises:
normalizing the feature-enhanced initial feature map.
6. The method according to claim 1, wherein before performing the Softmax operation on the feature vector of each channel to obtain the attention value of each channel, the method further comprises:
normalizing the feature vectors of the channels.
7. The method according to claim 1, wherein the obtaining of a new feature map according to the initial feature map and the attention values of the channels comprises:
multiplying the initial feature map by the attention value of each channel to obtain the new feature map.
8. An image processing apparatus, characterized in that the apparatus comprises a processor and a memory for storing a program; when the program is executed by the processor, it causes the processor to implement the image processing method of any one of claims 1 to 7.
9. A computer storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the image processing method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010354847.5A CN111639652A (en) | 2020-04-28 | 2020-04-28 | Image processing method and device and computer storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111639652A true CN111639652A (en) | 2020-09-08 |
Family
ID=72330963
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010354847.5A Pending CN111639652A (en) | 2020-04-28 | 2020-04-28 | Image processing method and device and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111639652A (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9740949B1 (en) * | 2007-06-14 | 2017-08-22 | Hrl Laboratories, Llc | System and method for detection of objects of interest in imagery |
US20170293804A1 (en) * | 2016-04-06 | 2017-10-12 | Nec Laboratories America, Inc. | Deep 3d attention long short-term memory for video-based action recognition |
CN108985317A (en) * | 2018-05-25 | 2018-12-11 | 西安电子科技大学 | Image classification method based on separable convolution and an attention mechanism
CN109658401A (en) * | 2018-12-14 | 2019-04-19 | 上海商汤智能科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN110210278A (en) * | 2018-11-21 | 2019-09-06 | 腾讯科技(深圳)有限公司 | A kind of video object detection method, device and storage medium |
US20190311223A1 (en) * | 2017-03-13 | 2019-10-10 | Beijing Sensetime Technology Development Co., Ltd. | Image processing methods and apparatus, and electronic devices |
CN110781893A (en) * | 2019-09-24 | 2020-02-11 | 浙江大华技术股份有限公司 | Feature map processing method, image processing method, device and storage medium |
CN110796634A (en) * | 2019-09-10 | 2020-02-14 | 中国三峡建设管理有限公司 | Dam state detection method and device, computer equipment and readable storage medium |
CN110827129A (en) * | 2019-11-27 | 2020-02-21 | 中国联合网络通信集团有限公司 | Commodity recommendation method and device |
CN110866455A (en) * | 2019-10-25 | 2020-03-06 | 南京理工大学 | Pavement water body detection method |
CN111062438A (en) * | 2019-12-17 | 2020-04-24 | 大连理工大学 | Weak supervision fine-grained image classification algorithm based on graph propagation of correlation learning |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114278281A (en) * | 2021-12-24 | 2022-04-05 | 北京西华益昌技术开发有限责任公司 | Method, device, equipment and storage medium for optimizing measurement resolution of measuring device |
CN114278281B (en) * | 2021-12-24 | 2023-11-21 | 北京西华益昌技术开发有限责任公司 | Measurement resolution optimization method, device and equipment of measurement device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114549913B (en) | Semantic segmentation method and device, computer equipment and storage medium | |
CN111754404B (en) | Remote sensing image space-time fusion method based on multi-scale mechanism and attention mechanism | |
CN111709415B (en) | Target detection method, device, computer equipment and storage medium | |
CN113536003A (en) | Feature extraction model training method, image retrieval method, device and equipment | |
CN111639652A (en) | Image processing method and device and computer storage medium | |
CN111966473B (en) | Operation method and device of linear regression task and electronic equipment | |
CN111639654B (en) | Image processing method, device and computer storage medium | |
CN114155388B (en) | Image recognition method and device, computer equipment and storage medium | |
CN116071279A (en) | Image processing method, device, computer equipment and storage medium | |
CN115620017A (en) | Image feature extraction method, device, equipment and storage medium | |
CN110837596B (en) | Intelligent recommendation method and device, computer equipment and storage medium | |
CN110443746B (en) | Picture processing method and device based on generation countermeasure network and electronic equipment | |
CN113963236A (en) | Target detection method and device | |
CN113919476A (en) | Image processing method and device, electronic equipment and storage medium | |
CN113591936A (en) | Vehicle attitude estimation method, terminal device and storage medium | |
CN113360744A (en) | Recommendation method and device of media content, computer equipment and storage medium | |
CN113361703B (en) | Data processing method and device | |
CN115761239B (en) | Semantic segmentation method and related device | |
CN115620013B (en) | Semantic segmentation method and device, computer equipment and computer readable storage medium | |
CN116630629B (en) | Domain adaptation-based semantic segmentation method, device, equipment and storage medium | |
CN112650922B (en) | Two-in-one fusion platform and method | |
CN111626305B (en) | Target detection method, device and equipment | |
CN110866431B (en) | Training method of face recognition model, and face recognition method and device | |
CN114972090A (en) | Training method of image processing model, image processing method and device | |
CN116805382A (en) | Ship complex environment target feature enhancement method, system and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||