CN112488126A - Feature map processing method, device, equipment and storage medium

Feature map processing method, device, equipment and storage medium

Info

Publication number
CN112488126A
Authority
CN
China
Prior art keywords
convolution
pooling
processing window
size
feature map
Prior art date
2020-11-30
Legal status
Pending
Application number
CN202011371411.3A
Other languages
Chinese (zh)
Inventor
宫延河
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
2020-11-30
Filing date
2020-11-30
Publication date
2021-03-12
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011371411.3A
Publication of CN112488126A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a feature map processing method, device, equipment and storage medium, and relates to artificial intelligence technologies such as computer vision, augmented reality and deep learning. One embodiment of the method comprises: determining a processing window, wherein the size of the processing window is larger than the size of the convolution kernel; sliding the processing window over the input feature map according to the convolution step size to determine the current region within the processing window; convolving at least part of the current region with a convolution kernel to obtain a convolution element, and pooling at least part of the current region to obtain a pooling element; calculating the sum of the convolution element and the pooling element as an output element; and, when the sliding of the processing window ends, generating the output feature map based on all the output elements. Because this embodiment convolves and pools the feature map simultaneously, it can extract both high-level semantic features and detail features, so the extracted features are richer.

Description

Feature map processing method, device, equipment and storage medium
Technical Field
The embodiments of the application relate to the field of computer technology, in particular to artificial intelligence technologies such as computer vision, augmented reality and deep learning, and specifically to a feature map processing method, device, equipment and storage medium.
Background
The application of convolution and pooling in deep learning has achieved great success, and records on classification datasets are continually being broken. However, detection differs from classification: classification only needs to extract high-level semantics to judge the class to which a target belongs, whereas detection must both classify the target and locate it in the picture. The extracted features therefore need to be not only invariant, so that the class does not change under target rotation, scaling and the like, but also equivariant, so that they move as the target position moves.
At present, convolutions include ordinary convolution, deformable convolution, dilated (hole) convolution, grouped convolution, depthwise separable convolution and the like, and pooling includes average pooling, maximum pooling, random pooling, spatial pyramid pooling and the like. Ordinary convolution slides a convolution kernel of size K × K (K a positive integer) over the feature map with step size S (S a positive integer), and has neither spatial nor scale invariance. Deformable convolution additionally learns an offset so that the sampling points of the convolution kernel on the input feature map are shifted, focusing on the region of interest and the object. Grouped convolution and depthwise separable convolution are both aimed at reducing the amount of convolution computation and do not substantially improve feature extraction capability. Average pooling takes the average of the inputs within an S × S grid as the output. Maximum pooling takes the maximum value within the S × S grid as the output. Random pooling randomly selects one number in the grid as the output. In short, pooling has some spatial invariance, but at the cost of positioning accuracy, so it is rarely used in current object detection models. Spatial pyramid pooling adopts step sizes of different magnitudes for inputs of different sizes to guarantee outputs of the same size, but its two rounds of quantization restrict further improvement of detection accuracy. In summary, neither convolution nor pooling can currently satisfy both requirements of detection: preserving spatial features and possessing spatial invariance.
Disclosure of Invention
The embodiment of the application provides a feature map processing method, a feature map processing device, feature map processing equipment and a storage medium.
In a first aspect, an embodiment of the present application provides a feature map processing method, including: determining a processing window, wherein the size of the processing window is larger than the size of the convolution kernel; sliding the processing window over the input feature map according to the convolution step size to determine the current region within the processing window; convolving at least part of the current region with a convolution kernel to obtain a convolution element, and pooling at least part of the current region to obtain a pooling element; calculating the sum of the convolution element and the pooling element as an output element; and, when the sliding of the processing window ends, generating the output feature map based on all the output elements.
In a second aspect, an embodiment of the present application provides a feature map processing apparatus, including: a determination module configured to determine a processing window, wherein the size of the processing window is larger than the size of the convolution kernel; a sliding module configured to slide the processing window over the input feature map according to the convolution step size and determine the current region within the processing window; a processing module configured to convolve at least part of the current region with a convolution kernel to obtain a convolution element, and to pool at least part of the current region to obtain a pooling element; a calculation module configured to calculate the sum of the convolution element and the pooling element as an output element; and a generation module configured to generate an output feature map based on all output elements when the sliding of the processing window ends.
In a third aspect, an embodiment of the present application provides a target detection method, including: acquiring an image to be detected; and inputting the image to be detected into a target detection model to obtain the category and the position of a target in the image to be detected, wherein the target detection model comprises a convolution pooling layer, and the convolution pooling layer executes the method described in any one of the implementations of the first aspect.
In a fourth aspect, an embodiment of the present application provides an object detection apparatus, including: an acquisition module configured to acquire an image to be detected; a detection module configured to input the image to be detected into a target detection model, to obtain the category and the position of the target in the image to be detected, wherein the target detection model comprises a convolution pooling layer, and the convolution pooling layer performs the method as described in any implementation manner of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described in any one of the implementations of the first aspect or the method described in any one of the implementations of the third aspect.
In a sixth aspect, embodiments of the present application propose a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method described in any one of the implementations of the first aspect or the method described in any one of the implementations of the third aspect.
According to the feature map processing method, device, equipment and storage medium provided by the embodiments of the application, the determined processing window first slides over the input feature map according to the convolution step size, determining the current region within the processing window; then at least part of the current region is convolved with a convolution kernel to obtain a convolution element, and at least part of the current region is pooled to obtain a pooling element; the sum of the convolution element and the pooling element is then calculated as an output element; and finally, when the sliding of the processing window ends, an output feature map is generated based on all the output elements. Because the feature map is convolved and pooled at the same time, both high-level semantic features and detail features can be extracted, so the extracted features are richer.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow diagram for one embodiment of a feature map processing method according to the present application;
FIG. 2 is a flow diagram of yet another embodiment of a feature map processing method according to the present application;
FIG. 3 is a schematic diagram of non-center pooling convolution;
FIG. 4 is yet another schematic diagram of non-center pooling convolution;
FIG. 5 is a flow diagram of one embodiment of a target detection method according to the present application;
FIG. 6 is a schematic block diagram of one embodiment of a feature map processing apparatus according to the present application;
FIG. 7 is a schematic block diagram of one embodiment of an object detection device according to the present application;
fig. 8 is a block diagram of an electronic device for implementing the feature map processing method according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, made with reference to the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 illustrates a flow 100 of one embodiment of a feature map processing method according to the present application. The feature map processing method comprises the following steps:
Step 101, determining a processing window.
In this embodiment, the execution subject of the feature map processing method may determine the processing window. The processing window is a matrix over which convolution and pooling are performed on the data simultaneously; it is used to extract feature information from the input data, and its size is larger than that of the convolution kernel.
In some optional implementations of this embodiment, the execution subject may determine an initial processing window based on a preset size; if the preset size is larger than the size of the convolution kernel, the initial processing window is taken as the processing window; if the preset size is smaller than the size of the convolution kernel, a black border is extended around the initial processing window to generate the processing window. Here, the size of the initial processing window is equal to the preset size, the size of the processing window is larger than the size of the convolution kernel, and the pixel value of the extended black border is equal to 0.
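As a rough illustration of this optional implementation, the following Python sketch extends a zero-valued ("black") border around an initial window that is too small; the function name and the border-width rule are assumptions, since the embodiment only requires the result to be larger than the kernel:

```python
import numpy as np

def ensure_processing_window(initial: np.ndarray, kernel_size: int) -> np.ndarray:
    """Minimal sketch: use the initial window if it already exceeds the
    convolution kernel, otherwise extend a black (zero-valued) border."""
    win = initial.shape[0]
    if win > kernel_size:
        return initial                  # already a valid processing window
    pad = (kernel_size - win) // 2 + 1  # assumed border width; just wide
                                        # enough to exceed the kernel size
    return np.pad(initial, pad, mode="constant", constant_values=0)
```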
Step 102, sliding the processing window over the input feature map according to the convolution step size and determining the current region within the processing window.
In this embodiment, the execution subject may slide the processing window over the input feature map according to the convolution step size to determine the current region within the processing window. The sliding direction may be from left to right and from top to bottom. The current region within the processing window is the region of the input feature map currently covered by the processing window.
Step 103, convolving at least part of the current region with a convolution kernel to obtain a convolution element, and pooling at least part of the current region to obtain a pooling element.
In this embodiment, the execution subject may convolve at least part of the current region with a convolution kernel to obtain a convolution element, and pool at least part of the current region to obtain a pooling element. The convolution element is a detail feature of the feature map, and the pooling element is a high-level semantic feature.
Pooling may include, but is not limited to, at least one of: average pooling, maximum pooling, random pooling, pyramid pooling, and so on. Average pooling (mean-pooling) averages all values in the local receptive field, taking the average of the inputs within an S × S grid as the output. Maximum pooling (max-pooling) takes the point with the largest value in the local receptive field, i.e., the maximum value within the S × S grid, as the output. Random pooling (stochastic pooling) can be regarded as normalizing the feature map values within a pooling window and randomly sampling according to the normalized probability values, so that larger element values are selected with larger probability; one number in the grid is thus randomly chosen as the output. The significance of spatial pyramid pooling is that feature maps of arbitrary size can be converted into feature vectors of fixed size, by adopting step sizes of different magnitudes for inputs of different sizes to guarantee outputs of the same size.
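To make these pooling variants concrete, here is a minimal Python illustration over a single toy 2 × 2 pooling window (the values are arbitrary, and random pooling assumes non-negative inputs, e.g. after a ReLU):

```python
import numpy as np

rng = np.random.default_rng(0)
window = np.array([[1.0, 3.0],
                   [2.0, 4.0]])   # values inside one pooling window

avg_pool = window.mean()          # average pooling -> 2.5
max_pool = window.max()           # maximum pooling -> 4.0

# Random (stochastic) pooling: normalize the values in the window and
# sample one of them, so larger values are chosen with larger probability.
probs = window.ravel() / window.sum()
rand_pool = rng.choice(window.ravel(), p=probs)
```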
Step 104, calculating the sum of the convolution element and the pooling element as an output element.
In this embodiment, the execution subject may calculate the sum of the convolution element and the pooling element as the output element. In this way, the output element carries both detail features and high-level semantic features.
Step 105, when the sliding of the processing window ends, generating an output feature map based on all output elements.
In this embodiment, when the sliding of the processing window ends, the execution subject may generate the output feature map based on all the output elements. The pixels of the output feature map are the output elements. Since the output feature map is obtained by simultaneously convolving and pooling the input feature map, its size is smaller than that of the input feature map.
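As a toy numeric example of steps 101-105 (the values, the 1 × 1 kernel, and the choice of max pooling are illustrative assumptions), a single window position could be processed as follows:

```python
import numpy as np

# Current 3 x 3 region inside the processing window (toy values).
region = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0]])
kernel = np.array([[0.5]])  # 1 x 1 convolution kernel, smaller than the window

conv_element = float((region[1:2, 1:2] * kernel).sum())  # convolve part: 2.5
pool_element = float(region.max())                       # pool part (max): 9.0
output_element = conv_element + pool_element             # step 104: 11.5
```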
In the feature map processing method provided by this embodiment of the application, the determined processing window first slides over the input feature map according to the convolution step size to determine the current region within the processing window; then at least part of the current region is convolved with a convolution kernel to obtain a convolution element, and at least part of the current region is pooled to obtain a pooling element; the sum of the convolution element and the pooling element is then calculated as an output element; and finally, when the sliding of the processing window ends, an output feature map is generated based on all the output elements. Because the feature map is convolved and pooled at the same time, both high-level semantic features and detail features can be extracted, so the extracted features are richer.
With further reference to FIG. 2, a flow 200 of yet another embodiment of a feature map processing method according to the present application is illustrated. The feature map processing method comprises the following steps:
step 201, determining a processing window based on the size of the convolution kernel and the convolution step size.
In this embodiment, the execution subject may determine the processing window based on the size of the convolution kernel and the convolution step size, such that the size of the processing window is larger than the size of the convolution kernel.
Assuming the size of the convolution kernel is K (K ≥ 3) and the convolution step size is S (S is a positive integer), the size of the processing window may be (2 × S + K − 2) × (2 × S + K − 2).
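As a quick sanity check of this formula in Python (the concrete sizes are those used later in FIG. 4):

```python
K, S = 3, 2              # convolution kernel size and step size, as in FIG. 4
window = 2 * S + K - 2   # 2*2 + 3 - 2 = 5, i.e. a 5 x 5 processing window
assert window > K        # the processing window exceeds the kernel, as required
```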
Step 202, sliding the processing window over the input feature map according to the convolution step size and determining the current region within the processing window.
In this embodiment, the specific operation of step 202 has been described in detail in step 102 in the embodiment shown in fig. 1, and is not described herein again.
Step 203, convolving the central region of the current region with a convolution kernel to obtain a convolution element.
In this embodiment, since the details of a target are usually in its central region, the execution subject may convolve the central region of the current region with the convolution kernel to obtain the convolution element. The central region is centered at the center of the current region, and its size is equal to the size of the convolution kernel. The convolution element is a detail feature used to locate the target. The amount of convolution computation is (K − 2) × (K − 2), where K is the size of the convolution kernel.
Step 204, pooling the region surrounding the central region to obtain a pooling element.
In this embodiment, the category of a target can still be accurately identified even if its central part is occluded, so the execution subject pools the region surrounding the central region to obtain the pooling element. The pooling element is a high-level semantic feature used to classify the target. The amount of pooling computation is (2 × S + K − 2) × 4 − 4, where S is the convolution step size.
Step 205, calculating the sum of the convolution element and the pooling element as an output element.
Step 206, when the sliding of the processing window ends, generating an output feature map based on all output elements.
In this embodiment, the specific operations of steps 205-206 have been described in detail in steps 104-105 of the embodiment shown in FIG. 1, and are not described herein again.
As can be seen from FIG. 2, compared with the embodiment corresponding to FIG. 1, the flow 200 of the feature map processing method in this embodiment highlights the convolution pooling step. The scheme described in this embodiment is inspired by biological vision: the category of a target can still be accurately identified when its central part is covered, so the central region is convolved and the surrounding region is pooled. This improves the accuracy of both the extracted high-level semantic features and the extracted detail features, and thereby the accuracy of target classification and positioning.
This convolution is also called non-center pooling convolution: the central region is convolved and the peripheral region is pooled. Non-center pooling convolution can extract high-level semantic features from the peripheral region while preserving the detail features of the central region, can satisfy the invariance and equivariance requirements of detection, and, when applied to a target detection model, can improve the model's precision.
For ease of understanding, FIG. 3 shows one schematic of non-center pooling convolution, and FIG. 4 shows yet another. In FIG. 3, the size of the processing window is 3 × 3 and the size of the convolution kernel is 1 × 1; the central gray area is convolved and the surrounding white areas are pooled. In FIG. 4, the size of the processing window is 5 × 5 and the size of the convolution kernel is 3 × 3; again, the central gray area is convolved and the surrounding white areas are pooled.
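Combining FIG. 3/FIG. 4 with steps 201-206, a minimal single-channel Python sketch of non-center pooling convolution might look as follows. The function and variable names are assumptions, and average pooling is chosen for the periphery for concreteness, although the embodiment also permits maximum, random, or pyramid pooling:

```python
import numpy as np

def non_center_pool_conv(feature_map: np.ndarray,
                         kernel: np.ndarray,
                         stride: int) -> np.ndarray:
    """Slide a (2*S + K - 2)-sized processing window over the input,
    convolve its K x K center, average-pool the surrounding ring, and
    sum the two results into one output element."""
    k = kernel.shape[0]
    win = 2 * stride + k - 2                # processing window size
    assert win > k, "parameters must leave a non-empty pooling ring"
    off = (win - k) // 2                    # offset of the central block
    h, w = feature_map.shape
    out_h = (h - win) // stride + 1
    out_w = (w - win) // stride + 1

    ring = np.ones((win, win), dtype=bool)  # mask selecting the periphery
    ring[off:off + k, off:off + k] = False

    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            region = feature_map[r:r + win, c:c + win]  # current region
            center = region[off:off + k, off:off + k]
            conv_element = (center * kernel).sum()      # detail feature
            pool_element = region[ring].mean()          # high-level feature
            out[i, j] = conv_element + pool_element     # output element
    return out

# Example mirroring FIG. 4: a 3 x 3 kernel with stride 2 gives a 5 x 5 window.
x = np.arange(121.0).reshape(11, 11)
y = non_center_pool_conv(x, np.ones((3, 3)) / 9.0, stride=2)
print(y.shape)  # (4, 4): the output feature map is smaller than the input
```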
With further reference to FIG. 5, a flow 500 of one embodiment of a target detection method according to the present application is shown. The target detection method comprises the following steps:
and step 501, acquiring an image to be detected.
In this embodiment, the execution subject of the target detection method may acquire an image to be detected, in which a target is present. The target may be any tangible object that objectively exists in nature, including but not limited to a person, an animal, a plant, an item, a building, and so forth.
Step 502, inputting the image to be detected into a target detection model to obtain the category and position of the target in the image to be detected.
In this embodiment, the execution subject may input the image to be detected to the target detection model, so as to obtain the category and the position of the target in the image to be detected.
The target detection model here includes a convolution pooling layer, which has the functions of both a convolution layer and a pooling layer: it convolves at least part of the current region with a convolution kernel and pools at least part of the current region. The convolution pooling layer performs the feature map processing method described above, the feature map being an intermediate product of the model's processing of the image. The target detection model may include multiple convolution pooling layers. Taking a target detection model with two convolution pooling layers as an example: the image is input into the first convolution pooling layer, which performs a convolution pooling operation on it to obtain a first feature map; the first feature map is input into the second convolution pooling layer, which performs a convolution pooling operation on it to obtain a second feature map; the second feature map is then input into a fully connected layer, which integrates its features to obtain the category and the position of the target in the image.
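As a rough size check for this two-layer example (the input size and layer parameters below are hypothetical), the spatial dimensions shrink at each convolution pooling layer before the fully connected layer:

```python
def out_size(n: int, k: int, s: int) -> int:
    """Output side length after one convolution pooling layer that uses a
    (2*s + k - 2)-sized processing window and no extra padding."""
    return (n - (2 * s + k - 2)) // s + 1

n0 = 224                      # hypothetical input image side length
n1 = out_size(n0, k=3, s=2)   # after the first convolution pooling layer: 110
n2 = out_size(n1, k=3, s=2)   # after the second convolution pooling layer: 53
# The n2 x n2 second feature map would then be flattened into the fully
# connected layer that outputs the target's category and position.
```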
According to the target detection method provided by this embodiment of the application, image detection is performed using a target detection model that includes a convolution pooling layer. The convolution pooling layer can extract both high-level semantic features and detail features, so the extracted features are richer. The high-level semantic features can be used to determine the category of the target in the image, enabling classification of the target; they are invariant, so the category does not change under target rotation, scaling, and the like. The detail features can be used to determine the position of the target in the image, enabling localization of the target; they are equivariant and move with the movement of the target position.
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present application provides an embodiment of a feature map processing apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 6, the feature map processing apparatus 600 of the present embodiment may include: a determination module 601, a sliding module 602, a processing module 603, a calculation module 604, and a generation module 605. Wherein the determining module 601 is configured to determine a processing window, wherein the size of the processing window is larger than the size of the convolution kernel; a sliding module 602 configured to slide the processing window on the input feature map according to the convolution step size, and determine a current region in the processing window; a processing module 603 configured to perform convolution on at least a part of the current region using a convolution kernel to obtain convolution elements, and perform pooling on at least a part of the current region to obtain pooled elements; a calculation module 604 configured to calculate a sum of the convolution element and the pooling element as an output element; a generating module 605 configured to generate an output feature map based on all output elements when the processing window sliding ends.
In the present embodiment, in the feature map processing apparatus 600: for the specific processing of the determining module 601, the sliding module 602, the processing module 603, the calculating module 604 and the generating module 605 and its technical effects, reference may be made to the related descriptions of steps 201-205 in the embodiment corresponding to FIG. 2, which are not repeated herein.
In some optional implementations of this embodiment, the determining module 601 is further configured to: based on the size of the convolution kernel and the convolution step size, a processing window is determined.
In some optional implementations of this embodiment, the determining module 601 is further configured to: determining an initial processing window based on a preset size; and if the preset size is smaller than the size of the convolution kernel, expanding a black edge around the initial processing window to generate the processing window.
In some optional implementations of this embodiment, the processing module 603 is further configured to: performing convolution on a central area of the current area by using a convolution kernel to obtain convolution elements, wherein the central area takes the center of the current area as the center and has the size equal to the size of the convolution kernel; pooling the peripheral region of the central region to obtain pooled elements.
In some optional implementations of this embodiment, pooling includes at least one of: average pooling, maximum pooling, random pooling, and pyramid pooling.
With further reference to fig. 7, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an object detection apparatus, which corresponds to the embodiment of the method shown in fig. 5, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the object detection apparatus 700 of the present embodiment may include: an acquisition module 701 and a detection module 702. The acquiring module 701 is configured to acquire an image to be detected; a detection module 702 configured to input the image to be detected into a target detection model, resulting in the category and the position of the target in the image to be detected, wherein the target detection model comprises a convolution pooling layer, and the convolution pooling layer performs the method as described in any implementation manner of the first aspect.
In the present embodiment, in the object detection apparatus 700: for the specific processing of the obtaining module 701 and the detecting module 702 and its technical effects, reference may be made to the related descriptions of steps 501 and 502 in the embodiment corresponding to FIG. 5, which are not repeated herein.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 8 is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 8, the electronic apparatus includes: one or more processors 801, memory 802, and interfaces for connecting the various components, including a high speed interface and a low speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 8 illustrates an example of a processor 801.
The memory 802 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the feature map processing method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the feature map processing method provided by the present application.
The memory 802, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the feature map processing methods in the embodiments of the present application (e.g., the determining module 601, the sliding module 602, the processing module 603, the calculating module 604, and the generating module 605 shown in fig. 6, as well as the obtaining module 701 and the detecting module 702 shown in fig. 7, for example). The processor 801 executes various functional applications of the server and data processing by running non-transitory software programs, instructions, and modules stored in the memory 802, that is, implements the feature map processing method in the above-described method embodiment.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the electronic device of the feature map processing method, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the electronic device of the feature map processing method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the feature map processing method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 8.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the feature map processing method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that remedies the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
According to the technical scheme of the application, the determined processing window first slides over the input feature map according to the convolution step size, determining the current region within the processing window; then at least part of the current region is convolved with a convolution kernel to obtain a convolution element, and at least part of the current region is pooled to obtain a pooling element; the sum of the convolution element and the pooling element is then calculated as an output element; and finally, when the sliding of the processing window ends, an output feature map is generated based on all the output elements. Because the feature map is convolved and pooled at the same time, both high-level semantic features and detail features can be extracted, so the extracted features are richer.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the application is not limited in this respect as long as the desired results of the technical solutions disclosed herein can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method of feature map processing, comprising:
determining a processing window, wherein the size of the processing window is larger than the size of the convolution kernel;
sliding the processing window on the input feature map according to the convolution step size to determine the current region in the processing window;
convolving at least part of the current region with a convolution kernel to obtain a convolution element, and pooling at least part of the current region to obtain a pooling element;
calculating a sum of the convolution element and the pooling element as an output element;
and when the sliding of the processing window ends, generating an output feature map based on all output elements.
2. The method of claim 1, wherein the determining a processing window comprises:
determining the processing window based on the size of the convolution kernel and the convolution step size.
3. The method of claim 1, wherein the determining a processing window comprises:
determining an initial processing window based on a preset size;
and if the preset size is smaller than the size of the convolution kernel, extending a black border around the initial processing window to generate the processing window.
4. The method of claim 1, wherein the convolving at least part of the current region with a convolution kernel to obtain a convolution element and pooling at least part of the current region to obtain a pooling element comprises:
convolving the central area of the current area by using a convolution kernel to obtain the convolution element, wherein the central area takes the center of the current area as the center, and the size of the central area is equal to the size of the convolution kernel;
pooling the surrounding area of the central area to obtain the pooled elements.
5. The method of claim 1, wherein pooling comprises at least one of: average pooling, maximum pooling, random pooling, and pyramid pooling.
6. A method of target detection, comprising:
acquiring an image to be detected;
inputting the image to be detected into a target detection model to obtain the category and the position of a target in the image to be detected, wherein the target detection model comprises a convolution pooling layer, and the convolution pooling layer executes the method of one of claims 1 to 5.
7. A feature map processing apparatus comprising:
a determination module configured to determine a processing window, wherein a size of the processing window is greater than a size of the convolution kernel;
a sliding module configured to slide the processing window on the input feature map according to the convolution step size, and determine a current region in the processing window;
a processing module configured to convolve at least part of the current region with a convolution kernel to obtain a convolution element, and to pool at least part of the current region to obtain a pooling element;
a calculation module configured to calculate a sum of the convolution element and the pooling element as an output element;
a generation module configured to generate an output feature map based on all output elements when the processing window sliding ends.
8. The apparatus of claim 7, wherein the determination module is further configured to:
determining the processing window based on the size of the convolution kernel and the convolution step size.
9. The apparatus of claim 7, wherein the determination module is further configured to:
determining an initial processing window based on a preset size;
and if the preset size is smaller than the size of the convolution kernel, extending a black border around the initial processing window to generate the processing window.
10. The apparatus of claim 7, wherein the processing module is further configured to:
convolving the central area of the current area by using a convolution kernel to obtain the convolution element, wherein the central area takes the center of the current area as the center, and the size of the central area is equal to the size of the convolution kernel;
pooling the surrounding area of the central area to obtain the pooled elements.
11. The apparatus of claim 7, wherein pooling comprises at least one of: average pooling, maximum pooling, random pooling, and pyramid pooling.
12. An object detection device comprising:
an acquisition module configured to acquire an image to be detected;
a detection module configured to input the image to be detected into a target detection model, resulting in a category and a position of a target in the image to be detected, wherein the target detection model comprises a convolution pooling layer, and the convolution pooling layer performs the method of one of claims 1 to 5.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5 or to perform the method of claim 6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5 or to perform the method of claim 6.
CN202011371411.3A 2020-11-30 2020-11-30 Feature map processing method, device, equipment and storage medium Pending CN112488126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011371411.3A CN112488126A (en) 2020-11-30 2020-11-30 Feature map processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011371411.3A CN112488126A (en) 2020-11-30 2020-11-30 Feature map processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112488126A 2021-03-12

Family

ID=74937224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011371411.3A Pending CN112488126A (en) 2020-11-30 2020-11-30 Feature map processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112488126A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052173A (en) * 2021-03-25 2021-06-29 北京百度网讯科技有限公司 Sample data feature enhancement method and device
CN113344200A (en) * 2021-06-17 2021-09-03 阿波罗智联(北京)科技有限公司 Method for training separable convolutional network, road side equipment and cloud control platform

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization
WO2018036146A1 (en) * 2016-08-26 2018-03-01 东方网力科技股份有限公司 Convolutional neural network-based target matching method, device and storage medium
US20190102640A1 (en) * 2017-09-29 2019-04-04 Infineon Technologies Ag Accelerating convolutional neural network computation throughput
CN110263809A (en) * 2019-05-16 2019-09-20 华南理工大学 Pond characteristic pattern processing method, object detection method, system, device and medium
CN110473137A (en) * 2019-04-24 2019-11-19 华为技术有限公司 Image processing method and device
US20200372648A1 (en) * 2018-05-17 2020-11-26 Tencent Technology (Shenzhen) Company Limited Image processing method and device, computer apparatus, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018036146A1 (en) * 2016-08-26 2018-03-01 东方网力科技股份有限公司 Convolutional neural network-based target matching method, device and storage medium
CN106779060A (en) * 2017-02-09 2017-05-31 武汉魅瞳科技有限公司 A kind of computational methods of the depth convolutional neural networks for being suitable to hardware design realization
US20190102640A1 (en) * 2017-09-29 2019-04-04 Infineon Technologies Ag Accelerating convolutional neural network computation throughput
US20200372648A1 (en) * 2018-05-17 2020-11-26 Tencent Technology (Shenzhen) Company Limited Image processing method and device, computer apparatus, and storage medium
CN110473137A (en) * 2019-04-24 2019-11-19 华为技术有限公司 Image processing method and device
CN110263809A (en) * 2019-05-16 2019-09-20 华南理工大学 Pond characteristic pattern processing method, object detection method, system, device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗会兰 (Luo Huilan); 张云 (Zhang Yun): "Semantic segmentation combining contextual features and CNN multi-layer feature fusion" (结合上下文特征与CNN多层特征融合的语义分割), Journal of Image and Graphics (中国图象图形学报), no. 12 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052173A (en) * 2021-03-25 2021-06-29 北京百度网讯科技有限公司 Sample data feature enhancement method and device
CN113344200A (en) * 2021-06-17 2021-09-03 阿波罗智联(北京)科技有限公司 Method for training separable convolutional network, road side equipment and cloud control platform
CN113344200B (en) * 2021-06-17 2024-05-28 阿波罗智联(北京)科技有限公司 Method for training separable convolutional network, road side equipment and cloud control platform

Similar Documents

Publication Publication Date Title
US11854237B2 (en) Human body identification method, electronic device and storage medium
CN111767379A (en) Image question-answering method, device, equipment and storage medium
CN111753961A (en) Model training method and device, and prediction method and device
CN113378770B (en) Gesture recognition method, device, equipment and storage medium
CN111259671A (en) Semantic description processing method, device and equipment for text entity
CN111695519B (en) Method, device, equipment and storage medium for positioning key point
CN113591573A (en) Training and target detection method and device for multi-task learning deep network model
CN112241716B (en) Training sample generation method and device
CN112001248B (en) Active interaction method, device, electronic equipment and readable storage medium
JP7242994B2 (en) Video event identification method, apparatus, electronic device and storage medium
CN111582477A (en) Training method and device of neural network model
CN111967297A (en) Semantic segmentation method and device for image, electronic equipment and medium
CN112529180A (en) Method and apparatus for model distillation
CN111275827B (en) Edge-based augmented reality three-dimensional tracking registration method and device and electronic equipment
CN114386503A (en) Method and apparatus for training a model
CN112488126A (en) Feature map processing method, device, equipment and storage medium
CN112036315A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN111833391B (en) Image depth information estimation method and device
CN116453222B (en) Target object posture determining method, training device and storage medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN116167426A (en) Training method of face key point positioning model and face key point positioning method
CN112558810B (en) Method, apparatus, device and storage medium for detecting fingertip position
CN112200169B (en) Method, apparatus, device and storage medium for training a model
CN111339344B (en) Indoor image retrieval method and device and electronic equipment
CN113989562A (en) Model training and image classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination