CN116129101A - Target detection method, target detection device, electronic equipment and storage medium

Info

Publication number
CN116129101A
CN116129101A
Authority
CN
China
Prior art keywords
feature map
position vector
feature
target detection
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310275105.7A
Other languages
Chinese (zh)
Inventor
陈子亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310275105.7A priority Critical patent/CN116129101A/en
Publication of CN116129101A publication Critical patent/CN116129101A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V 10/23 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on positionally close patterns or neighbourhood relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The disclosure relates to the technical field of artificial intelligence, in particular to the fields of computer vision, image processing, and deep learning, and specifically to a target detection method, a target detection device, electronic equipment and a storage medium. The specific implementation scheme is as follows: inputting a first feature map into an encoder of a target detection model, and performing position encoding according to coordinate information of the first feature map through the encoder to obtain a corresponding first position vector; sequentially inputting the first feature map and the corresponding first position vector into each stage of encoding module of the encoder for encoding processing to obtain a second feature map and a second position vector; and performing decoding processing according to the second feature map and the second position vector to obtain a detection result of the image to be detected. In this way, the feature information is aligned with the position information, and the accuracy of target detection is improved.

Description

Target detection method, target detection device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the fields of computer vision, image processing, and deep learning, and specifically to a target detection method, a target detection device, electronic equipment and a storage medium.
Background
Object detection is a fundamental and widely used task in computer vision. Most typical object detectors are based on CNNs (Convolutional Neural Networks), and CNN-based object detectors have made significant progress in recent years. In the last two years, researchers have proposed a Transformer-based end-to-end object detector (DETR, DEtection TRansformer) that eliminates the manually designed anchor components and performs comparably to anchor-based detectors (such as Faster R-CNN). However, the existing DETR still suffers from inaccurate target detection.
Disclosure of Invention
Aiming at the technical problem in the prior art that the features learned by the Transformer structure are not aligned with the position information, the disclosure provides a target detection method, a target detection device, electronic equipment and a storage medium.
According to a first aspect of the present disclosure, there is provided a target detection method including:
acquiring a first feature map corresponding to an image to be detected;
inputting the first feature map into an encoder of a target detection model, and performing position encoding according to coordinate information of the first feature map through the encoder to obtain a corresponding first position vector;
sequentially inputting the first feature map and the corresponding first position vector into each stage of encoding module of the encoder for encoding processing to obtain a second feature map and a second position vector;
and decoding according to the second feature map and the second position vector to obtain a detection result of the image to be detected.
According to a second aspect of the present disclosure, there is provided an object detection apparatus including:
the acquisition module is configured to acquire a first feature map corresponding to the image to be detected;
the encoding module is configured to input the first feature map into an encoder of a target detection model, and perform position encoding according to coordinate information of the first feature map through the encoder to obtain a corresponding first position vector;
the encoding module is further configured to sequentially input the first feature map and the corresponding first position vector into each stage of encoding module of the encoder for encoding processing to obtain a second feature map and a second position vector;
and the decoding module is configured to perform decoding processing according to the second feature map and the second position vector to obtain a detection result of the image to be detected.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above technical solutions.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of any one of the above-mentioned technical solutions.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the above technical solutions.
The disclosure provides a target detection method, a target detection device, electronic equipment and a storage medium, which realize characteristic and position information alignment and improve the accuracy of target detection.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of steps of a target detection method in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a prior art Transformer encoder;
FIG. 3 is a schematic diagram of the Transformer encoder in an embodiment of the present disclosure;
FIG. 4 is a flow chart of target detection with the DETR architecture in the prior art;
FIG. 5 is a convolution schematic diagram of a standard convolution kernel of the prior art;
FIG. 6 is a convolution schematic diagram of a hole convolution kernel used in embodiments of the present disclosure;
FIG. 7 is a functional block diagram of an object detection device in an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of an example electronic device in an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Compared with anchor-based detectors, DETR casts object detection as a set prediction problem, uses only 100 queries to predict the class information and position coordinates of objects, and requires no complicated post-processing such as non-maximum suppression (NMS), so DETR is a more efficient object detection method. However, DETR also has problems such as misalignment between positions and features, resulting in inaccurate target detection results.
The current DETR detector suffers from feature misalignment: the object detector needs to detect the position information of objects, but the Transformer structure loses the spatial information of the features, so DETR encodes the position information with trigonometric functions so that the learned features are related to positions. However, the encoded position information is fixed while the learned features keep changing, which makes the learned features inconsistent with the encoded position information and thus degrades the target detection effect.
Aiming at the technical problem in the prior art that the features learned by the Transformer structure are not aligned with the position information, the disclosure provides a target detection method, as shown in FIG. 1, comprising the following steps:
step S101, a first feature map (feature map) corresponding to the image to be detected is acquired. The first feature map may be obtained by feature extraction by CNN (convolutional neural network).
Step S102, the first feature map is input into an encoder of the target detection model, and position encoding is performed by the encoder according to the coordinate information of the first feature map to obtain a corresponding first position vector. The Transformer structure mainly comprises a CNN, an encoder and a decoder; the conventional Transformer encoder structure is shown in FIG. 2, where the encoder consists of multiple serial encoder modules. Because the feature is flattened from a two-dimensional (width by height) feature map into a one-dimensional feature vector before being fed to the first-stage encoder module, the spatial width-height position information of the feature is lost. The coordinate information corresponding to the features can therefore be encoded into a position vector, and each Transformer encoder layer feeds the position information and the feature information together into the global attention computation, so that the position information corresponding to the features is preserved. However, the feature_i+1 output by each encoder layer is the result of global attention over all the other feature_i, while the position vector passed through the multi-layer cascade stays fixed. Since the encoder modules learn features globally rather than locally, the positions of the attended feature points change during the global attention computation, and the fixed position vector can no longer represent the spatial position of feature_i+1. In short, if the position stays unchanged, the feature_i+1 (second feature map) processed by the encoder cannot be aligned with the initial position (first position vector), which causes the learned features and the spatial positions to be misaligned and thus affects the accuracy of target detection.
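To make this coordinate-based position encoding concrete, the following is a minimal PyTorch sketch of a DETR-style two-dimensional sine encoding of the feature map coordinates; the function name, the hidden size of 256 and the temperature of 10000 are illustrative assumptions rather than values stated in this disclosure:

    import torch

    def sine_position_encoding(h, w, dim=256, temperature=10000.0):
        # Encode every (x, y) grid coordinate of an h x w feature map into a
        # dim-dimensional vector: half the channels encode y, half encode x.
        ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                                torch.arange(w, dtype=torch.float32), indexing="ij")
        half = dim // 2
        freqs = temperature ** (2 * (torch.arange(half) // 2) / half)
        y_embed, x_embed = ys[..., None] / freqs, xs[..., None] / freqs   # (h, w, half)
        # Interleave sine and cosine over the frequency axis.
        y_embed = torch.stack((y_embed[..., 0::2].sin(), y_embed[..., 1::2].cos()), -1).flatten(-2)
        x_embed = torch.stack((x_embed[..., 0::2].sin(), x_embed[..., 1::2].cos()), -1).flatten(-2)
        pos = torch.cat((y_embed, x_embed), dim=-1)   # (h, w, dim)
        return pos.flatten(0, 1)                      # (h*w, dim): one vector per flattened token

In the prior art structure of FIG. 2, this first position vector would stay fixed across all encoder stages, which is precisely the misalignment discussed above.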
Step S103, the first feature map and the corresponding first position vector are sequentially input into each stage of encoding module of the encoder for encoding processing to obtain a second feature map and a second position vector. In this embodiment, the Transformer encoder structure is shown in FIG. 3: the first-stage encoding module encoder1 encodes the feature to obtain a position vector, and at the same time the position offset before and after feature extraction is calculated. For example, the coordinate information before the feature is input into encoder1 is (x0, y0), and the coordinate information actually corresponding to the feature after it passes through encoder1 changes to (x1, y1); therefore, besides learning the feature, the position change also needs to be learned, so that the feature and the position are aligned and the accuracy of target detection is improved.
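A minimal sketch of one such encoding stage is given below, in PyTorch. Predicting the offset directly in the embedding space of the position vector through a fully connected layer is an illustrative simplification of this disclosure's scheme, which describes the offset in coordinate terms such as (x0, y0) changing to (x1, y1):

    import torch
    from torch import nn

    class OffsetAwareEncoderLayer(nn.Module):
        # One encoder stage: global self-attention over feature + position, plus
        # a fully connected head that predicts how far each token's effective
        # position drifted during the attention computation.
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            self.offset_head = nn.Linear(dim, dim)   # fully connected layer predicting the offset

        def forward(self, feat, pos):
            q = k = feat + pos                       # attention sees feature and position together
            attn_out, _ = self.attn(q, k, feat)
            feat = self.norm1(feat + attn_out)
            feat = self.norm2(feat + self.ffn(feat))
            offset = self.offset_head(feat)          # learned position change of each token
            return feat, pos + offset                # the next stage receives the shifted position

Because the offset comes out of a linear layer it is a float, which is what later allows the sub-integer precision discussed below.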
Step S104, decoding processing is performed according to the second feature map and the second position vector to obtain a detection result of the image to be detected. The features and positions output by the encoder are predicted by the decoder module to obtain the detection result.
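Step S104 can be sketched with a standard Transformer decoder and learned object queries; the 100-query setting follows the DETR description given earlier, and the tensor shapes and prediction heads are illustrative assumptions:

    import torch
    from torch import nn

    feat2 = torch.randn(1, 400, 256)   # second feature map: 400 tokens of a 20x20 map (assumed)
    pos2 = torch.randn(1, 400, 256)    # second position vector, one vector per token (assumed)

    layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True)
    decoder = nn.TransformerDecoder(layer, num_layers=6)
    queries = torch.zeros(1, 100, 256)           # 100 object queries, as in DETR

    hs = decoder(queries, feat2 + pos2)          # decode against aligned features + positions
    classes = nn.Linear(256, 92)(hs)             # illustrative class head (91 classes + no-object)
    boxes = nn.Linear(256, 4)(hs).sigmoid()      # illustrative box head: (cx, cy, w, h) in [0, 1]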
The target detection method in this embodiment may be applied to the DETR structure. DETR is a vision version of the Transformer and may be used for target detection as well as panoptic segmentation. DETR is an end-to-end framework with a very simple network structure that can be divided into three parts: the first part is a traditional CNN used to extract high-dimensional features of the picture; the second part is a Transformer structure that extracts bounding boxes through an encoder and a decoder; finally, the network is trained using a bipartite matching loss function (Bipartite matching loss). The target detection flow of DETR is shown in FIG. 4: the picture to be detected is input into a network whose backbone is a CNN to extract the picture features, and the picture features are then combined with the position information and input into the encoder and decoder of the Transformer model to obtain the detection results; each result is a box, where each box represents a tuple comprising the class of the object and the position of the detection frame.
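The three-part pipeline just described can be condensed into the following sketch. It assumes a ResNet-50 backbone and PyTorch's stock nn.Transformer, omits the position encoding and the bipartite matching loss for brevity, and all names and sizes are illustrative rather than taken from this disclosure:

    import torch
    from torch import nn
    import torchvision

    class DetrLikeDetector(nn.Module):
        # Part 1: CNN backbone; part 2: Transformer encoder/decoder; part 3,
        # the bipartite matching loss, is applied at training time only.
        def __init__(self, num_classes=91, dim=256, num_queries=100):
            super().__init__()
            resnet = torchvision.models.resnet50(weights=None)
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # keep conv feature map
            self.proj = nn.Conv2d(2048, dim, kernel_size=1)                # project to model width
            self.transformer = nn.Transformer(d_model=dim, batch_first=True)
            self.queries = nn.Embedding(num_queries, dim)
            self.class_head = nn.Linear(dim, num_classes + 1)              # +1 for "no object"
            self.box_head = nn.Linear(dim, 4)                              # (cx, cy, w, h)

        def forward(self, images):                     # images: (B, 3, H, W)
            feat = self.proj(self.backbone(images))    # (B, dim, h, w) feature map
            src = feat.flatten(2).transpose(1, 2)      # flatten to (B, h*w, dim) tokens
            tgt = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
            hs = self.transformer(src, tgt)            # (B, num_queries, dim)
            return self.class_head(hs), self.box_head(hs).sigmoid()

Each of the num_queries outputs corresponds to one predicted box tuple (object class plus detection frame position), matching the flow of FIG. 4.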
Compared with conventional target detection methods, DETR effectively eliminates the need for many manually designed components, such as the non-maximum suppression (NMS) procedure, anchor point (anchor) generation, and so on. However, since DETR has the above-mentioned problem of misalignment between positions and features, the target detection method in this embodiment improves on DETR target detection: it retains the advantages of DETR, namely a simple overall flow without complex post-processing, while also aligning positions and features in DETR target detection.
As an optional implementation manner, step S101 includes the steps of: acquiring an image to be detected; inputting the image to be detected into a convolutional neural network of the target detection model, and extracting features through the convolutional neural network to obtain the first feature map. The convolutional neural network performs feature extraction through a hole convolution kernel or a deformable convolution kernel to obtain the first feature map.
In conventional CNN networks, a convolution operator is typically used to extract features, and the size of the convolution operator determines the size of the receptive field: the larger the convolution kernel, the larger the receptive field. In FIG. 5, the blank lattice portion is the feature map and the hatched portion is the convolution kernel; FIG. 5 shows a conventional convolution kernel, while the hole convolution shown in FIG. 6 enlarges the spatial range covered by the convolution, increasing the effectiveness of the features. The deformable convolution follows the same principle as the hole convolution: the essential spatial position is unchanged, and enlarging the receptive field improves feature learning during target detection and thereby the accuracy of target detection. In this embodiment, improving the convolution kernel of the CNN part and combining it with the learning of the position vector can further improve the target detection effect.
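For illustration, the sketch below contrasts a standard 3x3 convolution with its dilated (hole) counterpart in PyTorch; the channel counts are arbitrary:

    import torch
    from torch import nn

    # The same nine weights cover a 3x3 area normally, but a 5x5 area with
    # dilation=2: the receptive field grows with no extra parameters.
    standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)              # 3x3 receptive field
    dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # 5x5 receptive field

    x = torch.randn(1, 64, 32, 32)
    assert standard(x).shape == dilated(x).shape == x.shape   # spatial size preserved in both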
As an optional implementation manner, step S103 of inputting the first feature map and the corresponding first position vector into each stage of encoding module of the encoder in turn to perform encoding processing to obtain a second feature map and a second position vector includes:
encoding the input first feature map to obtain a second feature map, and calculating the position offset of the first feature map generated after the encoding process of the current encoding module;
and adjusting the first position vector according to the position offset to obtain a second position vector.
As shown in FIG. 3, in the Transformer encoder we no longer keep the predefined position vector fixed; instead, each stage of the encoder module feeds both the position information and the feature information into the global attention computation. Each stage of the encoder module outputs features and correspondingly outputs the learned position offset between before and after the features were input into that encoder module, and the original position plus the learned position offset is used as the position information for the next cascaded encoder. In this way the features and positions output by each encoder layer are aligned, and the network obtains a better target detection effect. For example, the coordinate information of the feature before it passes through encoder1 is (x0, y0); after feature extraction by encoder1 the coordinate information changes, the encoder module calculates the position offset through global attention, and the original coordinate information (x0, y0) plus the offset gives the new position vector (x1, y1), so that the position vector changes along with the feature.
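Reusing the OffsetAwareEncoderLayer sketched earlier, the cascade can be written as the following loop; the depth of six stages mirrors common DETR configurations and is an assumption, not a value from this disclosure:

    from torch import nn

    # Every stage returns updated features plus positions shifted by its
    # learned offset, so feature and position stay aligned layer by layer.
    layers = nn.ModuleList(OffsetAwareEncoderLayer() for _ in range(6))

    def encode(feat, pos):
        for layer in layers:
            feat, pos = layer(feat, pos)   # pos_i+1 = pos_i + learned offset
        return feat, pos                   # the aligned second feature map and second position vector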
As an alternative implementation manner, the position offset corresponding to the first feature map is calculated through a fully connected layer of each stage of encoding module. In this embodiment, a fully connected layer is added to each stage of the encoder module so that the position offset can be computed with decimal precision: for example, a conventional offset calculation can only produce an integer such as 1, whereas in this embodiment the fully connected layer allows the offset to be computed as, say, 1.11. The offset calculation in this embodiment is therefore more accurate than the conventional one, and the position vector is adjusted with higher precision.
The present disclosure also provides an object detection apparatus 700, as shown in fig. 7, comprising:
the acquiring module 701 is configured to acquire a first feature map corresponding to an image to be detected. The target detection device can be applied to a transducer structure and mainly comprises CNN, encoder, decoder parts, and the first characteristic diagram can be obtained by characteristic extraction through a convolutional neural network.
The encoding module 702 is configured to input the first feature map into an encoder of the target detection model, and perform position encoding according to the coordinate information of the first feature map through the encoder to obtain a corresponding first position vector. The conventional Transformer encoder structure is shown in FIG. 2, where the encoder consists of multiple serial encoder modules. Because the feature is flattened from a two-dimensional (width by height) feature map into a one-dimensional feature vector before being fed to the first-stage encoder module, the spatial width-height position information of the feature is lost. The coordinate information corresponding to the features can therefore be encoded into a position vector, and each Transformer encoder layer feeds the position information and the feature information together into the global attention computation, so that the position information corresponding to the features is preserved. However, the feature_i+1 output by each encoder layer is the result of global attention over the other feature_i, the position vector passed through the multi-layer cascade stays fixed, and since the encoder modules learn features globally rather than locally, the positions of the attended feature points change during the global attention computation; the fixed position vector can then no longer represent the spatial position of feature_i+1, which causes the learned features and the spatial positions to be misaligned and affects the accuracy of target detection.
The encoding module 702 is further configured to sequentially input the first feature map and the corresponding first position vector into each stage of encoding module of the encoder for encoding processing to obtain a second feature map and a second position vector. In this embodiment, the Transformer encoder structure is shown in FIG. 3: the first-stage encoding module encoder1 encodes the feature to obtain a position vector, and at the same time the position offset before and after feature extraction is calculated. For example, the coordinate information before the feature is input into encoder1 is (x0, y0), and the coordinate information actually corresponding to the feature after it passes through encoder1 changes to (x1, y1); therefore, besides learning the feature, the position change also needs to be learned, so that the feature and the position are aligned and the accuracy of target detection is improved.
The decoding module 703 is configured to perform decoding processing according to the second feature map and the second position vector to obtain a detection result of the image to be detected. The features and positions output by the encoder are predicted by the decoder module to obtain the detection result.
The object detection device in this embodiment may be applied to the DETR structure. DETR is a vision version of the Transformer and may be used for target detection as well as panoptic segmentation. DETR is an end-to-end framework with a very simple network structure that can be divided into three parts: the first part is a traditional CNN used to extract high-dimensional features of the picture; the second part is a Transformer structure that extracts bounding boxes through an encoder and a decoder; finally, the network is trained using a bipartite matching loss function (Bipartite matching loss). The target detection flow of DETR is shown in FIG. 4: first the picture to be detected is input into a network whose backbone is a CNN to extract the picture features, then the picture features are combined with the position information and input into the encoder and decoder of the Transformer model to obtain the detection results; each result is a box, where each box represents a tuple comprising the class of the object and the position of the detection frame.
Compared with conventional target detection methods, DETR effectively eliminates the need for many manually designed components, such as the non-maximum suppression (NMS) procedure, anchor generation, and so on. However, since DETR has the above-mentioned problem of misalignment between positions and features, the target detection device in this embodiment improves on DETR target detection: it retains the advantages of DETR, namely a simple overall flow without complex post-processing, while also aligning positions and features in DETR target detection.
As an alternative embodiment, the obtaining module 701 includes:
and an image acquisition unit configured to acquire an image to be detected.
The feature extraction unit is configured to input the image to be detected into a convolutional neural network of the target detection model, and perform feature extraction through the convolutional neural network to obtain the first feature map. The convolutional neural network performs feature extraction through a hole convolution kernel or a deformable convolution kernel to obtain the first feature map.
In conventional CNN networks, a convolution operator is typically used to extract features, and the size of the convolution operator determines the size of the receptive field: the larger the convolution kernel, the larger the receptive field. In FIG. 5, the blank lattice portion is the feature map and the hatched portion is the convolution kernel; FIG. 5 shows a conventional convolution kernel, while the hole convolution shown in FIG. 6 enlarges the spatial range covered by the convolution simply by inserting spaces (zeros) between the convolution kernel elements, increasing the effectiveness of the features. The deformable convolution follows the same principle as the hole convolution: the essential spatial position is unchanged, and enlarging the receptive field improves feature learning during target detection and thereby the accuracy of target detection. In this embodiment, improving the convolution kernel of the CNN part and combining it with the learning of the position vector can further improve the target detection effect.
As an alternative embodiment, the encoding module 702 includes:
and the feature coding unit is configured to code the input first feature map to obtain a second feature map.
And the calculating unit is configured to calculate the position offset of the first feature map generated after the encoding processing of the current encoding module.
And the adjusting unit is configured to adjust the first position vector according to the position offset to obtain a second position vector.
As shown in FIG. 3, in the Transformer encoder we no longer keep the predefined position vector fixed; instead, each stage of the encoder module feeds both the position information and the feature information into the global attention computation. Each stage of the encoder module outputs features and correspondingly outputs the learned position offset between before and after the features were input into that encoder module, and the original position plus the learned position offset is used as the position information for the next cascaded encoder. In this way the features and positions output by each encoder layer are aligned, and the network obtains a better target detection effect. For example, the coordinate information of the feature before it passes through encoder1 is (x0, y0); after feature extraction by encoder1 the coordinate information changes, the encoder module calculates the position offset through global attention, and the original coordinate information (x0, y0) plus the offset gives the new position vector (x1, y1), so that the position vector changes along with the feature.
As an alternative embodiment, each calculating unit comprises a fully connected layer for calculating the position offset: the position offset corresponding to the first feature map is calculated by a fully connected layer provided in each stage of encoding module. In this embodiment, adding a fully connected layer to each stage of the encoder module allows the position offset to be computed with decimal precision: for example, a conventional offset calculation can only produce an integer such as 1, whereas in this embodiment the fully connected layer allows the offset to be computed as, say, 1.11. The offset calculation is therefore more accurate than the conventional one, and the position vector is adjusted with higher precision.
In the technical scheme of the disclosure, the acquisition, storage, and application of the user personal information involved all conform to the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as the target detection method. For example, in some embodiments, the target detection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the target detection method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the target detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (15)

1. A target detection method comprising:
acquiring a first feature map corresponding to an image to be detected;
inputting the first feature map into an encoder of a target detection model, and performing position encoding according to coordinate information of the first feature map through the encoder to obtain a corresponding first position vector;
sequentially inputting the first feature map and the corresponding first position vector into each stage of encoding module of the encoder for encoding processing to obtain a second feature map and a second position vector;
and decoding according to the second feature map and the second position vector to obtain a detection result of the image to be detected.
2. The method of claim 1, wherein the acquiring the first feature map corresponding to the image to be detected includes:
acquiring the image to be detected;
inputting the image to be detected into a convolutional neural network of the target detection model, and extracting features through the convolutional neural network to obtain the first feature map.
3. The method of claim 2, wherein the feature extraction by the convolutional neural network to obtain the first feature map comprises:
and the convolutional neural network performs feature extraction through a cavity convolutional kernel or a deformable convolutional kernel to obtain the first feature map.
4. The method according to any one of claims 1 to 3, wherein sequentially inputting the first feature map and the corresponding first position vector into each stage of encoding module of the encoder for encoding processing to obtain a second feature map and a second position vector comprises:
the input first feature map is encoded to obtain the second feature map, and the position offset of the first feature map generated after the encoding process of the current encoding module is calculated;
and adjusting the first position vector according to the position offset to obtain the second position vector.
5. The method of claim 4, wherein the encoding the input first feature map and calculating the position offset of the first feature map generated after the processing by the current encoding module comprises:
and encoding the input first feature map through a global attention mechanism, and calculating the position offset generated after the first feature map is processed by the current encoding module.
6. The method of claim 4 or 5, wherein the calculating the position offset of the first feature map after processing by the current encoding module comprises:
and calculating the position offset corresponding to the first feature map through the full connection layer of each stage of the coding module.
7. An object detection apparatus comprising:
the acquisition module is configured to acquire a first feature map corresponding to the image to be detected;
the encoding module is configured to input the first feature map into an encoder of a target detection model, and perform position encoding according to coordinate information of the first feature map through the encoder to obtain a corresponding first position vector;
the encoding module is further configured to sequentially input the first feature map and the corresponding first position vector into each stage of encoding module of the encoder for encoding processing to obtain a second feature map and a second position vector;
and the decoding module is configured to perform decoding processing according to the second feature map and the second position vector to obtain a detection result of the image to be detected.
8. The apparatus of claim 7, wherein the acquisition module comprises:
an image acquisition unit configured to acquire the image to be detected;
the feature extraction unit is configured to input the image to be detected into a convolutional neural network of the target detection model, and the first feature map is obtained by feature extraction through the convolutional neural network.
9. The apparatus of claim 8, wherein the convolutional neural network performs feature extraction by a hole convolutional kernel or a deformable convolutional kernel to obtain the first feature map.
10. The apparatus of any of claims 7-9, wherein the encoding module comprises:
the feature coding unit is configured to code the input first feature map to obtain the second feature map;
a calculating unit configured to calculate the position offset of the first feature map generated after the encoding processing of the current encoding module;
and the adjusting unit is configured to adjust the first position vector according to the position offset to obtain the second position vector.
11. The apparatus according to claim 10, wherein the calculating unit performs encoding processing on the input first feature map through a global attention mechanism, and calculates the position offset of the first feature map generated after the processing by the current encoding module.
12. The apparatus according to claim 10 or 11, wherein each of the calculation units includes a fully connected layer for calculating the positional offset.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-6.
CN202310275105.7A 2023-03-20 2023-03-20 Target detection method, target detection device, electronic equipment and storage medium Pending CN116129101A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310275105.7A CN116129101A (en) 2023-03-20 2023-03-20 Target detection method, target detection device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310275105.7A CN116129101A (en) 2023-03-20 2023-03-20 Target detection method, target detection device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116129101A true CN116129101A (en) 2023-05-16

Family

ID=86308385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310275105.7A Pending CN116129101A (en) 2023-03-20 2023-03-20 Target detection method, target detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116129101A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197737A (en) * 2023-09-08 2023-12-08 数字广东网络建设有限公司 Land use detection method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination