CN118397298B - Self-attention space pyramid pooling method based on mixed pooling and related components

Info

Publication number: CN118397298B
Application number: CN202410854329.8A
Authority: CN
Inventors: 张红亮; 陈梅; 杨小娜; 马娜; 魏祥; 肖凤超; 魏巍
Original assignee: Hangzhou AIMS Intelligent Technology Co Ltd
Current assignee: Hangzhou AIMS Intelligent Technology Co Ltd
Priority date: 2024-06-28
Filing date: 2024-06-28
Publication date: 2024-09-06
Anticipated expiration: 2044-06-28
Also published as: CN118397298A

Abstract

The invention discloses a self-attention spatial pyramid pooling method based on mixed pooling, together with related components, applied in the technical field of visual analysis. To address the insufficient feature extraction of existing spatial pyramid pooling, the method divides the feature map group of a graph to be analyzed into a first feature map group and a second feature map group; divides the first feature map group into a first sub-feature map group and a second sub-feature map group through a channel allocation layer; pools the first sub-feature map group with a maximum pooling method and the second sub-feature map group with an average pooling method; splices the resulting maximum pooled feature map group and average pooled feature map group; applies a self-attention model to the spliced pooled feature map group to expand the receptive field, obtaining an extended feature map group; and splices the extended feature map group with the second feature map group to obtain the spatial pyramid pooling result. The method enlarges the global receptive field and improves visual analysis performance.

Description

Self-attention space pyramid pooling method based on mixed pooling and related components

Technical Field

The present invention relates to the field of visual analysis technologies, and in particular to a self-attention spatial pyramid pooling method and apparatus based on hybrid pooling, an electronic device, and a computer-readable storage medium.

Background

Image semantic segmentation essentially classifies each pixel in an image. Existing semantic segmentation algorithms are mainly designed on an encoder-decoder framework: in the encoder stage, a series of convolution and pooling operations turns the image into a feature map rich in high-level semantic information; in the decoder stage, the feature map is gradually upsampled to produce a prediction of the same size as the input image. Because this scheme requires frequent downsampling and upsampling, a large amount of critical information is lost. There are currently two mainstream solutions to this problem:

One is to reduce the number of downsampling operations in the feature extraction model, as in ENet (a semantic segmentation algorithm), which, in pursuit of a compact architecture, discards the downsampling of the model's last stage and thereby reduces the loss of key pixels caused by downsampling.

The other is multi-scale fusion. Low-level features let the model see texture details, which can be quite accurate for a small patch of the image, but the model cannot see the whole object; high-level information lets the model see the whole object, but after many downsampling steps information such as edge detail becomes blurred. Combining the feature information of each scale is therefore important for solving the above problem. For example, the Feature Pyramid Network (FPN) algorithm builds a top-down hierarchy with lateral connections to fuse semantic features across scales. Spatial Pyramid Pooling (SPP) proposes a very efficient multi-resolution strategy: an SPP layer is added after the output of the last convolutional layer. This layer divides the feature map into blocks at several scales and performs a pooling operation on each block. The output therefore has a fixed size and can be fed into a fully connected layer for classification, regression and other tasks.
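The fixed-size property of a classic SPP layer can be illustrated with a minimal NumPy sketch. The function name `spp_max` and the grid levels (1, 2, 4) are illustrative assumptions and are not taken from this document.

```python
import numpy as np

def spp_max(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map over an n x n grid for each level n
    and concatenate everything into one fixed-length vector.
    Assumes H and W are at least as large as the largest level."""
    c, h, w = feature_map.shape
    outputs = []
    for n in levels:
        # Integer bin edges that together cover the whole axis.
        row_edges = np.linspace(0, h, n + 1).astype(int)
        col_edges = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                block = feature_map[:, row_edges[i]:row_edges[i + 1],
                                    col_edges[j]:col_edges[j + 1]]
                outputs.append(block.max(axis=(1, 2)))  # one value per channel
    return np.concatenate(outputs)

# Inputs of different spatial size give vectors of identical length.
a = spp_max(np.random.rand(3, 13, 17))
b = spp_max(np.random.rand(3, 8, 9))
print(a.shape, b.shape)
```

The output length depends only on the channel count and the chosen levels, which is why such a layer can sit directly in front of a fully connected layer.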

In deep networks, the size of the receptive field largely determines how much contextual information the deep learning model can capture. Although SPP is a multi-scale fusion method, it extracts features using maximum pooling alone, so its feature extraction is insufficient. In addition, SPP performs further local feature extraction using only convolution and does not expand the receptive field as far as possible, which limits the accuracy of visual analysis based on existing spatial pyramid pooling methods.

In view of this, how to provide a spatial pyramid pooling method, apparatus, electronic device and computer readable storage medium capable of expanding receptive fields is a problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The embodiment of the invention aims to provide a self-attention spatial pyramid pooling method, device, electronic equipment and computer-readable storage medium based on mixed pooling, which in use can enlarge the global receptive field, reduce the amount of computation and help improve visual analysis performance.

In order to solve the technical problems, the embodiment of the invention provides the following technical scheme:

in one aspect, the invention provides a self-attention space pyramid pooling method based on mixed pooling, which comprises the following steps:

Acquiring a feature image group aiming at a graph to be analyzed, and dividing the feature image group into a first feature image group and a second feature image group;

dividing the first characteristic image group into a first sub-characteristic image group and a second sub-characteristic image group through a channel distribution layer;

carrying out maximum pooling treatment on the first sub-feature image group by adopting a maximum pooling method to obtain a maximum pooling feature image group carrying the maximum feature;

carrying out average pooling treatment on the second sub-feature image group by adopting an average pooling method to obtain an average pooled feature image group carrying average features;

Splicing the maximum pooling feature image group and the average pooling feature image group, and performing expanding receptive field processing on the spliced pooling feature image group by adopting a self-attention model to obtain a processed expanding feature image group;

And splicing the extended feature map set with the second feature map set to obtain a spatial pyramid pooling result so as to perform visual analysis based on the spatial pyramid pooling result.

In an exemplary embodiment, the expanding receptive field processing is performed on the spliced pooled feature map set by using a self-attention model to obtain a processed expanded feature map set, which includes:

dividing each pooling feature image in the pooled feature image groups into a plurality of sub-pooling feature image groups according to the size of a preset pooling window;

And aiming at each sub-pooling feature image group, adopting a corresponding self-attention unit to perform expansion receptive field processing on the sub-pooling feature image groups to obtain expansion feature images corresponding to each sub-pooling feature image in the sub-pooling feature image groups.

In an exemplary embodiment, the expanding receptive field processing is performed on the sub-pooled feature map set by using a corresponding self-attention unit to obtain an expanded feature map corresponding to each sub-pooled feature map in the sub-pooled feature map set, including:

processing the sub-pooling feature image group by adopting a first full-connection layer in the self-attention unit to obtain a query feature matrix;

processing the sub-pooling feature image group by adopting a second full-connection layer in the self-attention unit to obtain a key feature matrix;

Processing the sub-pooling feature image group by adopting a third full-connection layer in the self-attention unit to obtain a value feature matrix;

calculating based on the query feature matrix, the key feature matrix and the value feature matrix by adopting a self-attention calculation relation to obtain a self-attention calculation result;

And obtaining an expansion feature map corresponding to each sub-pooling feature map in the sub-pooling feature map group based on the self-attention calculation result.

In an exemplary embodiment, the self-attention calculation relationship is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

wherein $\mathrm{Attention}(Q, K, V)$ denotes the self-attention calculation result, $Q$ denotes the query feature matrix, $K$ denotes the key feature matrix, $V$ denotes the value feature matrix, $\mathrm{softmax}$ is the weight calculation function, $d_k$ denotes the output dimension of the fully connected layer, and $T$ denotes the matrix transpose operation.

In an exemplary embodiment, the dividing each pooled feature map in the pooled feature map set into multiple sub-pooled feature map sets according to a preset pooled window size includes:

dividing each pooled feature map in the pooled feature map group into a 1×1 feature map group, a (2N+1)×(2N+1) feature map group and a (2N+2)×(2N+2) feature map group according to a preset pooling window size, wherein N is an integer not less than 1.

In an exemplary embodiment, the dividing, by the channel allocation layer, the first feature map group into a first sub-feature map group and a second sub-feature map group includes:

dividing the first characteristic image group into a first sub-characteristic image group and a second sub-characteristic image group according to the current distribution proportion through a channel distribution layer; the current distribution proportion is obtained after updating based on current loss in the visual analysis process.

In an exemplary embodiment, the stitching the extended feature map set with the second feature map set to obtain a spatial pyramid pooling result includes:

randomly scrambling the sequence of each extended feature map of the extended feature map group;

splicing the scrambled expansion feature images with the second feature image group after channel splicing to obtain a spliced integral feature image group;

and carrying out information interaction processing on each feature map in the spliced integral feature map group, and taking each feature map after the information interaction processing as a spatial pyramid pooling result.

In another aspect, the present invention provides a self-attention space pyramid pooling device based on hybrid pooling, including:

The acquisition module is used for acquiring a characteristic image group aiming at the graph to be analyzed and dividing the characteristic image group into a first characteristic image group and a second characteristic image group;

The distribution module is used for dividing the first characteristic image group into a first sub-characteristic image group and a second sub-characteristic image group through a channel distribution layer;

The maximum pooling module is used for carrying out maximum pooling treatment on the first sub-feature image group by adopting a maximum pooling method to obtain a maximum pooling feature image group carrying the maximum feature;

the average pooling module is used for carrying out average pooling treatment on the second sub-feature image group by adopting an average pooling method to obtain an average pooled feature image group carrying average features;

The self-attention module is used for splicing the largest pooling feature image group and the average pooling feature image group, and adopting a self-attention model to perform expanding receptive field processing on the spliced pooling feature image group to obtain a processed expanding feature image group;

and the splicing module is used for splicing the extended feature image group and the second feature image group to obtain a spatial pyramid pooling result so as to perform visual analysis based on the spatial pyramid pooling result.

Another aspect of the present invention provides an electronic device, including:

a memory for storing a computer program;

A processor for implementing the steps of the hybrid pooling-based self-attention space pyramid pooling method as described above when executing the computer program.

Another aspect of the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a hybrid pooling-based self-attention space pyramid pooling method as described above.

From the above technical solutions, the embodiment of the present invention has the following advantages:

the embodiment of the invention provides a self-attention space pyramid pooling method based on mixed pooling, which comprises the following steps: acquiring a feature image group aiming at a graph to be analyzed, and dividing the feature image group into a first feature image group and a second feature image group; dividing the first characteristic image group into a first sub-characteristic image group and a second sub-characteristic image group through a channel distribution layer; carrying out maximum pooling treatment on the first sub-feature image group by adopting a maximum pooling method to obtain a maximum pooling feature image group carrying the maximum feature; carrying out average pooling treatment on the second sub-feature image group by adopting an average pooling method to obtain an average pooled feature image group carrying average features; splicing the largest pooled feature image group and the average pooled feature image group, and adopting a self-attention model to perform expanded receptive field treatment on the spliced pooled feature image group to obtain a treated expanded feature image group; and splicing the extended feature map set with the second feature map set to obtain a spatial pyramid pooling result so as to perform visual analysis based on the spatial pyramid pooling result.

Therefore, in the embodiment of the application, the feature map group of the graph to be analyzed is divided into two groups. The first feature map group is divided by the channel allocation layer into a first sub-feature map group and a second sub-feature map group, which are pooled by the maximum pooling method and the average pooling method respectively. The resulting maximum pooled feature map group and average pooled feature map group are spliced, and a self-attention model performs extended receptive field processing on the spliced pooled feature map group to obtain the extended feature map group. The extended feature map group is then spliced with the second feature map group to obtain the spatial pyramid pooling result.

In addition, the invention provides a corresponding implementation device, electronic equipment and a computer readable storage medium for the self-attention space pyramid pooling method based on the mixed pooling, so that the method has more practicability, and the device, the electronic equipment and the computer readable storage medium have corresponding advantages.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the prior art and the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method for pooling a self-attention space pyramid based on hybrid pooling according to an embodiment of the present invention;

FIG. 2 is a diagram of an overall structure of a visual task according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a self-attention space pyramid pooling architecture based on hybrid pooling according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a self-attention layer according to an embodiment of the present invention;

fig. 5 is a schematic structural diagram of a self-attention space pyramid pooling device based on hybrid pooling according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of a computer readable storage medium according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a self-attention spatial pyramid pooling method, device, electronic equipment and computer-readable storage medium based on mixed pooling, which in use can enlarge the global receptive field, reduce the amount of computation and help improve visual analysis performance.

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1, fig. 1 is a flow chart of a self-attention space pyramid pooling method based on hybrid pooling according to an embodiment of the present invention. The spatial pyramid pooling method comprises the following steps:

S110: acquiring a feature image group aiming at a graph to be analyzed, and dividing the feature image group into a first feature image group and a second feature image group;

It should be noted that, as shown in fig. 2, spatial pyramid pooling is only one component of the overall visual task (such as object detection or semantic segmentation) and is usually connected after the feature extraction network (such as ResNet, DenseNet or VGGNet). The method provided in the embodiment of the present invention is applied to the spatial pyramid pooling module and optimizes the spatial pyramid pooling performed there. In practical application, the graph to be analyzed is passed through the feature extraction network to obtain a feature map group comprising a plurality of feature maps. It can be understood that dividing the feature map group into the first feature map group and the second feature map group reduces the amount of computation while preserving the pyramid pooling effect, striking a balance between feature extraction quality and computational efficiency.
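As a minimal sketch of the split in S110, the following assumes that the feature map group is a `(C, H, W)` array and that the division is made along the channel axis at a fixed fraction. The text does not state how the split point is chosen, so `split_feature_group` and `first_fraction` are hypothetical names.

```python
import numpy as np

def split_feature_group(features, first_fraction=0.5):
    """Split a (C, H, W) feature map group along the channel axis.
    The split rule (a fixed fraction of channels) is an assumption."""
    c = features.shape[0]
    k = int(round(c * first_fraction))
    return features[:k], features[k:]

features = np.arange(8 * 4 * 4, dtype=float).reshape(8, 4, 4)
first_group, second_group = split_feature_group(features)
print(first_group.shape, second_group.shape)
```

Only `first_group` goes through pooling and self-attention; `second_group` is carried forward unchanged, which is where the saving in computation would come from.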

S120: dividing the first characteristic image group into a first sub-characteristic image group and a second sub-characteristic image group through a channel distribution layer;

To use average pooling and maximum pooling as efficiently as possible, greatly reduce the amount of computation, and prevent the model from leaning too heavily toward one pooling method and thereby wasting resources, the embodiment of the invention adds a channel allocation layer before the maximum pooling layer and the average pooling layer. The channel allocation layer has a learnable parameter that allocates the input channels between average pooling and maximum pooling.

Specifically, after the first feature map set in the embodiment of the present invention enters the channel allocation layer, the channel allocation layer may divide the first feature map set into a first sub-feature map set and a second sub-feature map set based on a current parameter (for example, a current allocation proportion).

The current parameter (such as the current allocation proportion) is obtained after updating the last allocation proportion based on the current loss in the whole visual analysis process.
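One possible shape for such a channel allocation layer is sketched below. The text only says that the allocation proportion is learnable and updated from the current loss; the sigmoid parameterisation, the class name `ChannelAllocator` and the externally supplied gradient are assumptions made for illustration.

```python
import math
import numpy as np

class ChannelAllocator:
    """Hypothetical channel allocation layer with one learnable scalar."""

    def __init__(self, init_logit=0.0):
        self.logit = float(init_logit)  # learnable parameter

    @property
    def ratio(self):
        # Sigmoid keeps the allocation proportion strictly inside (0, 1).
        return 1.0 / (1.0 + math.exp(-self.logit))

    def split(self, features):
        """Return (channels for max pooling, channels for average pooling)."""
        c = features.shape[0]
        # Keep at least one channel on each branch (requires c >= 2).
        k = max(1, min(c - 1, round(c * self.ratio)))
        return features[:k], features[k:]

    def update(self, grad, lr=0.1):
        # Plain gradient step; `grad` is assumed to come from the training loss.
        self.logit -= lr * grad

allocator = ChannelAllocator()
x = np.zeros((8, 4, 4))
max_branch, avg_branch = allocator.split(x)
print(max_branch.shape[0], avg_branch.shape[0])
allocator.update(grad=-20.0)  # a negative gradient shifts channels to max pooling
max_branch, avg_branch = allocator.split(x)
print(max_branch.shape[0], avg_branch.shape[0])
```

Because the channel count is rounded to an integer, a real implementation would need some way to pass gradients through the split; that mechanism is not described in the text and is left out here.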

S130: carrying out maximum pooling treatment on the first sub-feature image group by adopting a maximum pooling method to obtain a maximum pooling feature image group carrying the maximum feature;

It may be understood that, in the embodiment of the present invention, maximum pooling is applied to each first sub-feature map in the first sub-feature map group to obtain a max-pooled feature map carrying the maximum feature; together these form the maximum pooled feature map group, that is, the group contains one max-pooled feature map for each first sub-feature map. Because maximum pooling outputs the maximum value within each local area, it retains the most salient feature information, so applying it to the first sub-feature map group captures key features better.
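The behaviour of maximum pooling can be shown on a small array. The sketch below uses non-overlapping windows for simplicity; the window layout actually used by the embodiment is not specified here, and `max_pool2d` is an illustrative helper.

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping k x k max pooling on a (C, H, W) array.
    H and W are assumed to be multiples of k."""
    c, h, w = x.shape
    # Index [c, i, a, j, b] of the reshaped array is x[c, i*k + a, j*k + b].
    blocks = x.reshape(c, h // k, k, w // k, k)
    return blocks.max(axis=(2, 4))

x = np.array([[[1, 2, 0, 0],
               [3, 4, 0, 9],
               [0, 0, 5, 5],
               [0, 7, 5, 5]]], dtype=float)
print(max_pool2d(x, 2))  # each 2 x 2 block is reduced to its largest value
```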

S140: carrying out average pooling treatment on the second sub-feature image group by adopting an average pooling method to obtain an average pooled feature image group carrying average features;

It may be understood that, in the embodiment of the present invention, average pooling is applied to each second sub-feature map in the second sub-feature map group to obtain an average-pooled feature map carrying average features; together these form the average pooled feature map group, that is, the group contains one average-pooled feature map for each second sub-feature map. Because average pooling averages all feature values within the local area and thus takes every feature of that area into account, it is robust to small variations or noise in the input data and helps the model cope better with varied scenarios.
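The robustness argument can be checked numerically. In this sketch (non-overlapping windows again, with `avg_pool2d` as an illustrative helper), a single noisy pixel is damped by average pooling, whereas maximum pooling would pass it through unchanged.

```python
import numpy as np

def avg_pool2d(x, k):
    """Non-overlapping k x k average pooling on a (C, H, W) array.
    H and W are assumed to be multiples of k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

x = np.ones((1, 4, 4))
x[0, 0, 0] = 5.0              # one noisy pixel in an otherwise flat map
pooled = avg_pool2d(x, 2)
print(pooled[0, 0, 0])        # (5 + 1 + 1 + 1) / 4
print(pooled[0, 1, 1])        # untouched blocks keep their value
```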

S150: splicing the largest pooled feature image group and the average pooled feature image group, and adopting a self-attention model to perform expanded receptive field treatment on the spliced pooled feature image group to obtain a treated expanded feature image group;

In the embodiment of the invention, the maximum pooled feature map group and the average pooled feature map group are fused by channel splicing. Compared with the conventional point-by-point addition fusion, channel splicing retains detail features better.
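A short sketch of the difference between the two fusion modes follows. The array shapes and the helper name `fuse_by_concat` are chosen only for illustration.

```python
import numpy as np

def fuse_by_concat(max_feats, avg_feats):
    """Channel splicing: stack the two groups along the channel axis."""
    return np.concatenate([max_feats, avg_feats], axis=0)

rng = np.random.default_rng(0)
max_feats = rng.random((3, 4, 4))   # 3 max-pooled channels
avg_feats = rng.random((5, 4, 4))   # 5 average-pooled channels
fused = fuse_by_concat(max_feats, avg_feats)
print(fused.shape)                  # channel counts add up

# Both inputs remain recoverable from the fused tensor.
print(np.array_equal(fused[:3], max_feats), np.array_equal(fused[3:], avg_feats))
```

Point-by-point addition would require both groups to have the same channel count and would merge each pair of values into one number, so the separate maximum and average responses could no longer be told apart.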

Specifically, after channel splicing, the self-attention model can further perform extended receptive field processing on the spliced pooled feature map set to obtain the processed extended feature map set. The self-attention mechanism enlarges the global receptive field and effectively increases the model's capacity to acquire global information.

For example, in S150, the process of performing extended receptive field processing on the spliced pooled feature map set by using the self-attention model to obtain the processed extended feature map set may include:

Dividing each pooled feature map in the spliced pooled feature map set into multiple sub-pooled feature map sets according to the preset pooling window size;

For each sub-pooled feature map set, performing extended receptive field processing on the sub-pooled feature map set by using the corresponding self-attention unit to obtain an extended feature map corresponding to each sub-pooled feature map in the sub-pooled feature map set.

It should be noted that, as shown in fig. 3, three different pooling windows may be adopted in the embodiment of the present invention, so that the pooled feature maps have sizes 1×1, (2N+1)×(2N+1) and (2N+2)×(2N+2). Each pooled feature map in the pooled feature map set can therefore be divided into multiple sub-pooled feature map sets according to the preset pooling window size: feature maps of size 1×1 form the 1×1 feature map group, feature maps of size (2N+1)×(2N+1) form the (2N+1)×(2N+1) feature map group, and feature maps of size (2N+2)×(2N+2) form the (2N+2)×(2N+2) feature map group. Here N is an integer not less than 1 (1, 2, and so on) whose specific value can be determined according to the actual situation; N should not be too large, however, because a large N increases the amount of computation.
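The grouping rule can be sketched as follows, reading the text literally as grouping pooled feature maps by their spatial size. The helper names `preset_sizes` and `group_by_size` are hypothetical, and square feature maps are assumed.

```python
import numpy as np

def preset_sizes(n):
    """The three preset sizes 1, 2N+1 and 2N+2 for an integer N >= 1."""
    if n < 1:
        raise ValueError("N must be an integer not less than 1")
    return [1, 2 * n + 1, 2 * n + 2]

def group_by_size(pooled_maps, n):
    """Group square feature maps by side length into the three preset groups."""
    groups = {size: [] for size in preset_sizes(n)}
    for fmap in pooled_maps:
        h, w = fmap.shape[-2:]
        if h != w or h not in groups:
            raise ValueError(f"unexpected feature map size {h}x{w}")
        groups[h].append(fmap)
    return groups

maps = [np.zeros((1, 1)), np.zeros((3, 3)), np.zeros((4, 4)), np.zeros((3, 3))]
groups = group_by_size(maps, n=1)
print({size: len(members) for size, members in groups.items()})
```

With N = 1 the three groups are 1×1, 3×3 and 4×4; each group would then be handed to its own self-attention unit.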

Specifically, the self-attention model comprises self-attention units corresponding to the size of each preset pooling window, after each sub-pooling feature image group is determined, the corresponding self-attention units can be adopted to perform expanding receptive field processing on each sub-pooling feature image in the sub-pooling feature image group to obtain a corresponding expanding feature image, and the expanding feature image contains full-image feature information, so that the global receptive field can be improved, and the capability of the model for acquiring global information can be improved.

Further, the process of performing the extended receptive field processing on the sub-pooled feature map sets by using the corresponding self-attention units to obtain an extended feature map corresponding to each sub-pooled feature map in the sub-pooled feature map sets may include:

processing the sub-pooling feature map group by adopting a first full-connection layer in the self-attention unit to obtain a query feature matrix;

Processing the sub-pooling feature map set by adopting a second full-connection layer in the self-attention unit to obtain a key feature matrix;

processing the sub-pooling feature map group by adopting a third full-connection layer in the self-attention unit to obtain a value feature matrix;

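calculating based on the query feature matrix, the key feature matrix and the value feature matrix by adopting a self-attention calculation relation to obtain a self-attention calculation result;

obtaining, based on the self-attention calculation result, an extended feature map corresponding to each sub-pooling feature map in the sub-pooling feature map group.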

It should be noted that each self-attention unit in the embodiment of the present invention includes three fully connected layers. Taking one self-attention unit as an example, as shown in fig. 4: for an input sub-pooling feature map group, the first fully connected layer 1 processes the group to obtain a query feature matrix Q, the second fully connected layer 2 processes it to obtain a key feature matrix K, and the third fully connected layer 3 processes it to obtain a value feature matrix V. The self-attention calculation relation is then applied to the query feature matrix, the key feature matrix and the value feature matrix to obtain the self-attention calculation result. The self-attention calculation relation is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

wherein $\mathrm{softmax}$ is the weight calculation function, $d_k$ denotes the output dimension of the fully connected layer, and $T$ denotes the matrix transpose operation.

In the embodiment of the invention, after the self-attention calculation result corresponding to the sub-pooling feature map group is obtained, the expansion feature map corresponding to each sub-pooling feature map in the sub-pooling feature map group can be further obtained.
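A minimal NumPy sketch of one self-attention unit is given below. It assumes the standard scaled dot-product form with a softmax weight function, treats each feature map in the group as one token flattened to a vector of length H·W, and omits biases; these choices, and the class name `SelfAttentionUnit`, are assumptions rather than details confirmed by the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SelfAttentionUnit:
    """Three fully connected layers (Q, K, V) followed by scaled dot-product
    attention. `dim` must equal H * W of the incoming feature maps."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.w_q = rng.normal(0.0, scale, (dim, dim))  # first FC layer
        self.w_k = rng.normal(0.0, scale, (dim, dim))  # second FC layer
        self.w_v = rng.normal(0.0, scale, (dim, dim))  # third FC layer

    def __call__(self, group):
        c, h, w = group.shape
        x = group.reshape(c, h * w)          # one token per feature map
        q, k, v = x @ self.w_q, x @ self.w_k, x @ self.w_v
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out = weights @ v                    # every output mixes all tokens
        return out.reshape(c, h, w), weights

rng = np.random.default_rng(0)
unit = SelfAttentionUnit(dim=9, rng=rng)
extended, weights = unit(rng.normal(size=(6, 3, 3)))
print(extended.shape, weights.shape)
```

Each row of `weights` sums to 1, so every extended feature map is a weighted combination of the value vectors of all feature maps in the group, which is one way to read the statement that the extended feature map carries full-map information.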

It can be understood that the self-attention mechanism adopted in the embodiment of the invention can improve the effect of global feature extraction and further enlarge the receptive field range of the model.

S160: and splicing the extended feature map set with the second feature map set to obtain a spatial pyramid pooling result so as to perform visual analysis based on the spatial pyramid pooling result.

It should be noted that, in the embodiment of the present invention, the extended feature maps together form the extended feature map set. To prevent the model from becoming biased toward one feature map set, which would lower resource utilization during training, the extended feature map set may be channel-spliced with the second feature map set obtained in S110 to fuse their feature information. The fused feature map set serves as the spatial pyramid pooling result, on which the subsequent visual analysis is performed to obtain a visual analysis result. The current loss is then computed from the visual analysis result; if the current loss does not meet the preset requirement, the parameters of the whole model (such as the channel allocation proportion of the channel allocation layer and the parameters of each fully connected layer in the self-attention model) are updated according to the current loss before the next training round begins, and so on.

Further, in the step S160, the process of splicing the extended feature map set and the second feature map set to obtain the spatial pyramid pooling result may include:

Randomly shuffling the order of the extended feature maps in the extended feature map group;

It should be noted that, as shown in fig. 3, after the extended feature map set is obtained, the order of its extended feature maps may be randomly shuffled. The shuffled extended feature maps are channel-spliced with one another and then channel-spliced with the second feature map set to obtain the spliced overall feature map set. To further strengthen the correlation among the feature information, information interaction processing may be performed on the feature maps in the spliced overall feature map set; specifically, a 1×1 convolution may be used to exchange information among the feature maps. The feature maps after information interaction processing are taken as the spatial pyramid pooling result, which helps improve the performance of visual tasks such as semantic segmentation.
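The shuffle, splice and 1×1 convolution sequence can be sketched in NumPy as below. The sketch assumes that the extended feature maps and the second feature map set share the same spatial size and writes the 1×1 convolution as a per-pixel linear map across channels; the function name `shuffle_concat_mix` and the weight shape are illustrative.

```python
import numpy as np

def shuffle_concat_mix(extended, second, weight, rng):
    """Shuffle the extended channels, splice with the second group,
    then mix channels with a 1x1 convolution.

    extended: (C1, H, W), second: (C2, H, W), weight: (C_out, C1 + C2)
    """
    perm = rng.permutation(extended.shape[0])        # random channel order
    merged = np.concatenate([extended[perm], second], axis=0)
    # A 1x1 convolution applies the same channel-mixing matrix at every pixel.
    return np.einsum("oc,chw->ohw", weight, merged)

rng = np.random.default_rng(0)
extended = rng.normal(size=(4, 5, 5))
second = rng.normal(size=(4, 5, 5))
weight = rng.normal(size=(6, 8))                     # 8 input, 6 output channels
result = shuffle_concat_mix(extended, second, weight, rng)
print(result.shape)
```

Because every output channel is a weighted sum over all input channels, the 1×1 convolution lets information from the attention branch and from the untouched second group interact at each spatial position.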

The invention also provides a corresponding device for the spatial pyramid pooling method, which makes the method more practical. The device may be described from the perspective of functional modules and from the perspective of hardware. The spatial pyramid pooling device described below is used to implement the spatial pyramid pooling method provided by the present invention. In this embodiment, the spatial pyramid pooling device may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the spatial pyramid pooling method disclosed in the foregoing embodiment. A program module in the present invention refers to a series of computer program instruction segments capable of performing a particular function, and is better suited than the program itself to describing how the spatial pyramid pooling device executes in the storage medium. The following description details the function of each program module in this embodiment; the spatial pyramid pooling device described below and the spatial pyramid pooling method described above may be cross-referenced.

From the perspective of functional modules, referring to fig. 5, which is a schematic diagram of a spatial pyramid pooling device provided in an embodiment of the present invention, the device may include:

An obtaining module 11, configured to obtain a feature map set of a graph to be analyzed, and divide the feature map set into a first feature map set and a second feature map set;

An allocation module 12, configured to divide the first feature map set into a first sub-feature map set and a second sub-feature map set through a channel allocation layer;

A maximum pooling module 13, configured to perform maximum pooling processing on the first sub-feature map set by using a maximum pooling method, so as to obtain a maximum pooled feature map set carrying maximum features;

An average pooling module 14, configured to perform average pooling processing on the second sub-feature map set by using an average pooling method, so as to obtain an average pooled feature map set carrying average features;

A self-attention module 15, configured to splice the maximum pooled feature map set and the average pooled feature map set, and perform expanded receptive field processing on the spliced pooled feature map set by using a self-attention model to obtain a processed extended feature map set;

A splicing module 16, configured to splice the extended feature map set and the second feature map set to obtain a spatial pyramid pooling result, so that visual analysis can be performed based on the spatial pyramid pooling result.
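As a rough illustration of modules 11 to 14 and the first splicing step of module 15, the following NumPy sketch splits a feature map set by channel, applies maximum pooling to one sub-set and average pooling to the other, and splices the two results. The stride-1 pooling that preserves the spatial size, the zero padding used for the average, the window size `k`, and the names `pool_same` and `mixed_pooling` are assumptions for illustration and are not specified by the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pool_same(x, k, mode):
    """Stride-1 pooling over a k x k window (k odd) that keeps H and W. x: (C, H, W)."""
    p = k // 2
    fill = -np.inf if mode == "max" else 0.0
    padded = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=fill)
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))  # (C, H, W, k, k)
    if mode == "max":
        return windows.max(axis=(-2, -1))
    return windows.mean(axis=(-2, -1))  # zero padding counts toward the mean

def mixed_pooling(features, split_first, split_sub, k=5):
    # Module 11: divide into the first and second feature map sets.
    first, second = features[:split_first], features[split_first:]
    # Module 12: divide the first set into two sub-feature map sets.
    sub_max, sub_avg = first[:split_sub], first[split_sub:]
    # Modules 13 and 14: maximum pooling and average pooling, then channel splicing.
    pooled = np.concatenate(
        [pool_same(sub_max, k, "max"), pool_same(sub_avg, k, "avg")], axis=0
    )
    return pooled, second

features = np.arange(4 * 6 * 6, dtype=float).reshape(4, 6, 6)
pooled, second = mixed_pooling(features, split_first=2, split_sub=1, k=3)
print(pooled.shape, second.shape)
```

The second feature map set is returned untouched so that it can later be spliced with the extended feature map set produced by the self-attention model.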

In one exemplary embodiment, the self-attention module 15 includes:

A first grouping unit, configured to divide the pooled feature maps in the spliced pooled feature map set into a plurality of sub-pooled feature map sets according to a preset pooling window size;

An expansion unit, configured to, for each sub-pooled feature map set, perform expanded receptive field processing on the sub-pooled feature map set by using the corresponding self-attention unit, so as to obtain an extended feature map corresponding to each sub-pooled feature map in the sub-pooled feature map set.

In an exemplary embodiment, an expansion unit includes:

A first processing subunit, configured to process the sub-pooled feature map set by using a first fully connected layer in the self-attention unit to obtain a query feature matrix;

A second processing subunit, configured to process the sub-pooled feature map set by using a second fully connected layer in the self-attention unit to obtain a key feature matrix;

A third processing subunit, configured to process the sub-pooled feature map set by using a third fully connected layer in the self-attention unit to obtain a value feature matrix;

A computing subunit, configured to perform a calculation based on the query feature matrix, the key feature matrix and the value feature matrix by using the self-attention calculation relation to obtain a self-attention calculation result;

A determining subunit, configured to obtain, based on the self-attention calculation result, an extended feature map corresponding to each sub-pooled feature map in the sub-pooled feature map set.
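The five subunits above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the text does not state: each sub-pooled feature map is flattened into one row vector, the three fully connected layers have no bias and an output dimension equal to H×W so the result can be reshaped back into feature maps, and the self-attention calculation relation is the standard scaled dot-product form. The name `self_attention_unit` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_unit(tokens, w_q, w_k, w_v):
    """tokens: (N, D_in), one row per sub-pooled feature map; weights: (D_in, d)."""
    q = tokens @ w_q  # first fully connected layer: query feature matrix
    k = tokens @ w_k  # second fully connected layer: key feature matrix
    v = tokens @ w_v  # third fully connected layer: value feature matrix
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))  # assumed scaled dot-product form
    return weights @ v, weights

rng = np.random.default_rng(0)
group = rng.standard_normal((3, 4, 4))   # hypothetical sub-pooled feature map set
tokens = group.reshape(3, -1)            # flatten each 4x4 map into 16 values
w_q, w_k, w_v = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out, weights = self_attention_unit(tokens, w_q, w_k, w_v)
extended_maps = out.reshape(group.shape)  # one extended feature map per input map
print(extended_maps.shape)
```

Each output row is a weighted combination of all value rows in the set, which is how every extended feature map can incorporate information from the other feature maps in the same sub-pooled feature map set.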

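In one exemplary embodiment, the self-attention computation relationship is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

where Attention(Q, K, V) denotes the self-attention calculation result, Q denotes the query feature matrix, K denotes the key feature matrix, V denotes the value feature matrix, softmax is a weight calculation function, d denotes the output dimension of the fully connected layer, and T denotes the matrix transposition operation.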

In an exemplary embodiment, the first grouping unit is configured to:

In an exemplary embodiment, the allocation module 12 is configured to:

dividing the first feature map set into a first sub-feature map set and a second sub-feature map set according to the current allocation ratio through the channel allocation layer, wherein the current allocation ratio is obtained by updating based on the current loss in the visual analysis process.
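A ratio-driven channel split of this kind can be sketched as below. The text only states that the allocation ratio is updated based on the current loss; the clamped gradient step in `update_ratio`, the rounding rule, and the rule of keeping at least one channel per branch are hypothetical choices made for illustration, as are both function names.

```python
def split_channels(num_channels, ratio):
    """Return (n_max, n_avg): channel counts for the max and average pooling branches."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if num_channels < 2:
        return num_channels, 0
    n_max = int(round(num_channels * ratio))
    # Assumption: keep at least one channel in each branch.
    n_max = min(max(n_max, 1), num_channels - 1)
    return n_max, num_channels - n_max

def update_ratio(ratio, grad, lr=0.01):
    """Hypothetical update: one gradient step on the loss, clamped to [0, 1]."""
    return max(0.0, min(1.0, ratio - lr * grad))

ratio = 0.5
print(split_channels(64, ratio))
ratio = update_ratio(ratio, grad=1.0, lr=0.1)
print(split_channels(64, ratio))
```

In a trained network the gradient would come from the loss of the visual analysis task; here it is passed in as a plain number to keep the sketch self-contained.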

In one exemplary embodiment, the splice module 16 includes:

A shuffling unit, configured to randomly shuffle the order of the extended feature maps in the extended feature map set;

A splicing unit, configured to perform channel splicing on the shuffled extended feature maps and then splice them with the second feature map set to obtain a spliced overall feature map set;

A processing unit, configured to perform information interaction processing on the feature maps in the spliced overall feature map set, and use the feature maps after information interaction processing as the spatial pyramid pooling result.

It should be noted that the spatial pyramid pooling device in the embodiment of the present application has the same beneficial effects as the spatial pyramid pooling method provided in the above embodiments. For a specific description of the spatial pyramid pooling method, refer to the above embodiments; it is not repeated here.

The spatial pyramid pooling device has been described above from the perspective of functional modules. The application further provides an electronic device, which is described from the perspective of hardware. Fig. 6 is a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device includes:

A memory 20 for storing a computer program;

A processor 21 for implementing the steps of the spatial pyramid pooling method according to the embodiments described above when executing the computer program.

The electronic device provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like.

The processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may integrate a GPU (Graphics Processing Unit), which renders and draws the content to be displayed on the display screen. In some embodiments, the processor 21 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.

The memory 20 may include one or more computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the memory 20 may be an internal storage unit of the electronic device, such as a hard disk of a server. In other embodiments, the memory 20 may be an external storage device of the electronic device, such as a plug-in hard disk provided on a server, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. Further, the memory 20 may include both an internal storage unit and an external storage device of the electronic device. The memory 20 may be used not only to store application software installed in the electronic device and various types of data, such as the code of the program that executes the spatial pyramid pooling method, but also to temporarily store data that has been output or is to be output. In this embodiment, the memory 20 is at least used for storing a computer program 201 which, when loaded and executed by the processor 21, implements the relevant steps of the spatial pyramid pooling method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may further include an operating system 202, data 203, and the like, and the storage may be transient or permanent. The operating system 202 may include Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, data corresponding to the spatial pyramid pooling results.

In some embodiments, the electronic device may further include a display 22, an input-output interface 23, a communication interface 24, a power supply 25, and a communication bus 26. The display 22 and the input-output interface 23, such as a keyboard, belong to the user interface, which may optionally also include a standard wired interface, a wireless interface, and the like. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the electronic device and for displaying a visual user interface. The communication interface 24 may optionally include a wired interface and/or a wireless interface, such as a Wi-Fi interface or a Bluetooth interface, and is typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 26 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or one type of bus.

Those skilled in the art will appreciate that the structure shown in fig. 6 is not limiting of the electronic device and may include more or fewer components than shown.

It will be appreciated that the spatial pyramid pooling method of the above embodiments, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application may be embodied, in whole or in part, in the form of a software product that is stored in a storage medium and performs all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, a magnetic disk, or an optical disk.

Based on this, as shown in fig. 7, an embodiment of the present invention further provides a computer readable storage medium 30 on which a computer program 31 is stored; when executed by a processor, the computer program 31 implements the steps of the spatial pyramid pooling method described above.

In the present specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for pooling a self-attention space pyramid based on hybrid pooling, comprising:

Splicing the extended feature map set and the second feature map set to obtain a spatial pyramid pooling result so as to perform visual analysis based on the spatial pyramid pooling result; wherein:

The performing expanded receptive field processing on the spliced pooled feature map set by using the self-attention model to obtain a processed extended feature map set comprises:

For each sub-pooled feature map set, performing expanded receptive field processing on the sub-pooled feature map set by using a corresponding self-attention unit to obtain an extended feature map corresponding to each sub-pooled feature map in the sub-pooled feature map set;

The performing expanded receptive field processing on the sub-pooled feature map set by using a corresponding self-attention unit to obtain an extended feature map corresponding to each sub-pooled feature map in the sub-pooled feature map set comprises:

2. The method of spatial pyramid pooling according to claim 1, wherein the self-attention calculation relation is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

wherein Attention(Q, K, V) denotes the self-attention calculation result, Q denotes the query feature matrix, K denotes the key feature matrix, V denotes the value feature matrix, softmax is a weight calculation function, d denotes the output dimension of the fully connected layer, and T denotes the matrix transposition operation.

3. The method for pooling the spatial pyramid as set forth in claim 1, wherein the dividing each pooled feature map in the pooled feature map set into a plurality of sub-pooled feature map sets according to a preset pooled window size includes:

4. The method of spatial pyramid pooling according to claim 1, wherein the dividing the first feature map group into a first sub-feature map group and a second sub-feature map group by a channel allocation layer includes:

5. The method for pooling spatial pyramids according to any one of claims 1 to 4, wherein the stitching the extended feature map set with the second feature map set to obtain a spatial pyramid pooling result comprises:

6. A self-attention space pyramid pooling device based on hybrid pooling, comprising:

the splicing module is used for splicing the extended feature image group and the second feature image group to obtain a spatial pyramid pooling result so as to perform visual analysis based on the spatial pyramid pooling result; wherein:

The self-attention module includes:

The expansion unit is used for expanding the sub-pooling feature image groups by adopting the corresponding self-attention units aiming at each sub-pooling feature image group to obtain expansion feature images corresponding to each sub-pooling feature image in the sub-pooling feature image groups;

The expansion unit comprises:

And the determining subunit is used for obtaining an expansion feature map corresponding to each sub-pooling feature map in the sub-pooling feature map group based on the self-attention calculation result.

7. An electronic device, comprising:

a memory for storing a computer program;

A processor for implementing the steps of the spatial pyramid pooling method according to any of claims 1 to 5 when executing said computer program.

8. A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the spatial pyramid pooling method according to any of the claims 1 to 5.