CN113283475A - Target detection method, device, equipment and storage medium - Google Patents

Target detection method, device, equipment and storage medium Download PDF

Info

Publication number
CN113283475A
Authority
CN
China
Prior art keywords
feature map
initial
target
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110459855.0A
Other languages
Chinese (zh)
Inventor
李彬
吴新桥
王昊
刘岚
蔡思航
郭晓斌
何超林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Digital Grid Technology Guangdong Co ltd
Original Assignee
Southern Power Grid Digital Grid Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southern Power Grid Digital Grid Research Institute Co Ltd filed Critical Southern Power Grid Digital Grid Research Institute Co Ltd
Priority to CN202110459855.0A
Publication of CN113283475A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target detection method, a target detection device, computer equipment and a storage medium. The method comprises: acquiring a first initial feature map, a second initial feature map and a third initial feature map whose feature extraction depths increase in sequence; adding the second initial feature map and the third initial feature map and splicing the result with the first initial feature map to obtain an intermediate feature map, which is then respectively added to the first initial feature map and spliced with the third initial feature map; and further fusing the feature maps at the three depths with the initial feature maps and with one another through bottom-up paths. The feature flow directions are therefore diverse, so a first target feature map, a second target feature map and a third target feature map carrying rich fused features are provided for the target detection module, and the target detection precision is improved.

Description

Target detection method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a target detection method, an apparatus, a computer device, and a storage medium.
Background
Target detection technology from the field of computer vision can monitor whether objects posing an external-damage hazard are present in a power transmission line corridor scene, and is an effective means of preventing external-damage accidents. With the rapid development of deep learning in recent years, detecting targets with a neural network has greatly improved detection precision compared with traditional hand-crafted-feature methods, at a lower cost in time and effort.
A current target detection neural network comprises a feature extraction module and a target detection module; the target detection module performs target detection based on the feature maps output by the feature extraction module. To improve detection accuracy, multi-scale fusion (i.e., fusion of features from different feature extraction depths) is usually applied to the feature maps output by the feature extraction module, and the fused feature maps are then input to the target detection module; the FPN (Feature Pyramid Network) is a typical module for this multi-scale fusion. However, in a conventional FPN the features flow along a single direction, making it difficult to provide richer fused features, so the multi-scale fusion effect cannot be further improved.
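For contrast with the scheme described below, a conventional top-down FPN can be sketched as follows; this is a minimal illustrative PyTorch sketch, with channel sizes and layer names assumed rather than taken from the patent or from any particular detector:

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Plain top-down FPN: features flow only from the deep levels to the shallow ones."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 convolutions bring every level to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # 3x3 convolutions smooth the merged maps
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c1, c2, c3):
        # c1, c2, c3: increasing feature-extraction depth, decreasing resolution
        p3 = self.lateral[2](c3)
        p2 = self.lateral[1](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p1 = self.lateral[0](c1) + F.interpolate(p2, scale_factor=2, mode="nearest")
        return self.smooth[0](p1), self.smooth[1](p2), self.smooth[2](p3)
```

In such a structure, information flows along a single top-down direction, which is the limitation the present application addresses.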
Disclosure of Invention
In view of the above, it is necessary to provide an object detection method, an apparatus, a computer device and a storage medium for solving the above technical problems.
A method of target detection, the method comprising:
acquiring a first initial feature map, a second initial feature map and a third initial feature map which are output after a feature extraction module extracts features of an image to be detected; the feature extraction depths corresponding to the first initial feature map, the second initial feature map and the third initial feature map increase in sequence;
adding the second initial feature map and the third initial feature map to obtain a first intermediate feature map, and splicing the first intermediate feature map and the first initial feature map to obtain a second intermediate feature map;
splicing a third intermediate feature map obtained by adding the second intermediate feature map and the first initial feature map with the first initial feature map to obtain a first target feature map;
adding the second intermediate feature map, the second initial feature map and the first target feature map to obtain a second target feature map;
fusing a fourth intermediate feature map obtained by splicing the second intermediate feature map and the third initial feature map with the third initial feature map, the second target feature map and the first target feature map to obtain a third target feature map;
and inputting the first target feature map, the second target feature map and the third target feature map into a target detection module so as to detect the target in the image to be detected.
In one embodiment, the adding the second initial feature map and the third initial feature map to obtain a first intermediate feature map includes:
assigning a first weight to the second initial feature map and a second weight to the third initial feature map;
adding the weighted second initial feature map and the weighted third initial feature map to obtain a fifth intermediate feature map, and assigning a third weight to the fifth intermediate feature map; the third weight is an inverse of the sum of the first weight and the second weight;
and performing feature extraction, by a convolution kernel, on the weighted fifth intermediate feature map to obtain the first intermediate feature map.
In one embodiment, before assigning the first weight to the second initial feature map and assigning the second weight to the third initial feature map, the method further includes:
and normalizing the first weight and the second weight.
In one embodiment, before normalizing the first weight and the second weight, the method further includes:
processing the first weight and/or the second weight with a linear rectification function so that the first weight and/or the second weight is greater than or equal to 0.
In one embodiment, the first initial feature map, the second initial feature map and the third initial feature map are respectively output by the last three residual sub-modules of the feature extraction module; any residual sub-module of the last three residual sub-modules comprises a first convolution layer and an MHSA layer for receiving a feature map output by the first convolution layer.
In one embodiment, any one of the last three residual sub-modules further comprises a second convolution layer receiving a feature map output by the MHSA layer; and the second convolution layer is used for adjusting the resolution of the feature map output by the MHSA layer to the resolution of the feature map input into the first convolution layer so as to perform residual summation on the feature map after resolution adjustment and the feature map input into the first convolution layer.
An object detection apparatus, the apparatus comprising:
the feature map acquisition unit is used for acquiring a first initial feature map, a second initial feature map and a third initial feature map which are output after the feature extraction module performs feature extraction on the image to be detected; the feature extraction depths corresponding to the first initial feature map, the second initial feature map and the third initial feature map increase in sequence;
a feature map processing unit, configured to add the second initial feature map and the third initial feature map to obtain a first intermediate feature map, and splice the first intermediate feature map and the first initial feature map to obtain a second intermediate feature map;
a first target feature map obtaining unit, configured to splice a third intermediate feature map obtained by adding the second intermediate feature map and the first initial feature map with the first initial feature map to obtain a first target feature map;
the second target feature map acquisition unit is used for adding the second intermediate feature map, the second initial feature map and the first target feature map to obtain a second target feature map;
a third target feature map obtaining unit, configured to fuse a fourth intermediate feature map obtained by splicing the second intermediate feature map and the third initial feature map with the third initial feature map, the second target feature map, and the first target feature map to obtain a third target feature map;
and the target detection unit is used for inputting the first target feature map, the second target feature map and the third target feature map into a target detection module so as to detect the target in the image to be detected.
In one embodiment, the feature map processing unit is further configured to assign a first weight to the second initial feature map and assign a second weight to the third initial feature map; add the weighted second initial feature map and the weighted third initial feature map to obtain a fifth intermediate feature map, and assign a third weight to the fifth intermediate feature map, the third weight being an inverse of the sum of the first weight and the second weight; and perform feature extraction, by a convolution kernel, on the weighted fifth intermediate feature map to obtain the first intermediate feature map.
A computer device comprising a memory storing a computer program and a processor implementing the method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method.
According to the target detection method, the target detection device, the computer equipment and the storage medium, a first initial feature map, a second initial feature map and a third initial feature map which are output after a feature extraction module extracts features of an image to be detected are obtained, the feature extraction depths corresponding to the three initial feature maps increasing in sequence; the second initial feature map and the third initial feature map are added to obtain a first intermediate feature map, and the first intermediate feature map and the first initial feature map are spliced to obtain a second intermediate feature map; a third intermediate feature map obtained by adding the second intermediate feature map and the first initial feature map is spliced with the first initial feature map to obtain a first target feature map; the second intermediate feature map, the second initial feature map and the first target feature map are added to obtain a second target feature map; a fourth intermediate feature map obtained by splicing the second intermediate feature map and the third initial feature map is fused with the third initial feature map, the second target feature map and the first target feature map to obtain a third target feature map; and the first target feature map, the second target feature map and the third target feature map are input into a target detection module so as to detect the target in the image to be detected. In this way, the feature maps at the different feature extraction depths are fused with one another, the feature flow directions are diverse, richer fused features are provided for the target detection module, the multi-scale fusion effect is improved, and the target detection precision is improved.
Drawings
FIG. 1 is a schematic flow chart of a method for object detection in one embodiment;
FIG. 2 is a schematic flow chart diagram of a method for object detection in one embodiment;
FIG. 3 is a diagram illustrating an embodiment of a conventional residual unit transformed into a residual unit of the present application;
FIG. 4 is a diagram of a transformer module structure in one embodiment;
FIG. 5 is a schematic illustration of embedding for two-dimensional position information in one embodiment;
FIG. 6 is a block diagram of an embodiment of an object detection device;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The target detection method can be applied to computer equipment. When the computer device implements the method, the steps executed when the computer device implements the method are described in conjunction with fig. 1 and fig. 2:
step S201, acquiring a first initial feature map, a second initial feature map and a third initial feature map which are output after a feature extraction module performs feature extraction on an image to be detected; the feature extraction depths corresponding to the first initial feature map, the second initial feature map and the third initial feature map increase in sequence;
step S202, adding the second initial feature map and the third initial feature map to obtain a first intermediate feature map, and splicing the first intermediate feature map and the first initial feature map to obtain a second intermediate feature map;
step S203, splicing a third intermediate feature map obtained by adding the second intermediate feature map and the first initial feature map with the first initial feature map to obtain a first target feature map;
step S204, adding the second intermediate feature map, the second initial feature map and the first target feature map to obtain a second target feature map;
step S205, a fourth intermediate feature map obtained by splicing the second intermediate feature map and the third initial feature map is fused with the third initial feature map, the second target feature map and the first target feature map to obtain a third target feature map;
step S206, inputting the first target feature map, the second target feature map and the third target feature map into a target detection module to detect a target in the image to be detected.
In FIG. 1, the legend symbols denote, respectively, a feature map, the addition processing layer (P1), the splicing (concatenation) processing layer (P2), and a conv block composed of conv + BN + LeakyReLU activation. The description below takes c1, c2 and c3 as the first initial feature map, the second initial feature map and the third initial feature map, respectively. In a target detection neural network such as YOLO-v3, the feature extraction module extracts feature maps at different feature extraction depths for the target detection module to perform target detection; in a network such as YOLO-v3, the 3 feature maps input to the target detection module have different resolutions, and the resolution of each feature map decreases in turn as the feature extraction depth increases. After the computer device obtains c1, c2 and c3 with sequentially increasing feature extraction depths, c3 and c2 are added by P1 to obtain the first intermediate feature map, and the first intermediate feature map and c1 are spliced by P2 to obtain the second intermediate feature map (c123).
Next, for the feature map fusion process on the c1 path: the computer device adds c123 and c1 using P1 to obtain a third intermediate feature map, and splices the third intermediate feature map and c1 using P2 to obtain the first target feature map (c11);
for the feature map fusion process on the c2 path: c123, c2 and c11 are added by P1 to obtain the second target feature map (c21);
for the feature map fusion process on the c3 path: c123 and c3 are spliced by P2 to obtain a fourth intermediate feature map, and the fourth intermediate feature map, c3, c21 and c11 are added by P1 to obtain the third target feature map (c31).
In this fusion scheme, therefore, the feature maps are also fused along paths from lower depth to higher depth (bottom-up), and the whole processing can be repeated multiple times to further integrate information at different depths (scales) of the feature maps. Then c11, c21 and c31 are input into an ASFF structure (adaptively spatial feature fusion) to complete the last step of multi-scale fusion, processed by CBL modules, and fed into the target detection module for target detection.
In each of the above path processing procedures, the computer device may further perform conv processing on the corresponding feature map, where conv processing is a combination of convolution, batch normalization and LeakyReLU activation.
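The fusion paths on the three depths, together with the conv processing just mentioned, can be sketched roughly as follows. This is an illustrative PyTorch sketch, not the patented implementation: it assumes that c1, c2 and c3 have already been projected to a common channel count, that resizing uses nearest-neighbour interpolation, and that a conv + BN + LeakyReLU block follows each splicing; those placements are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def resize_to(x, ref):
    """Resize x to the spatial size of ref (stand-in for the Resize operation)."""
    return F.interpolate(x, size=ref.shape[-2:], mode="nearest")

class ConvBL(nn.Module):
    """conv + batch normalization + LeakyReLU, as drawn on the connections of FIG. 1."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DiverseFusion(nn.Module):
    """Sketch of the fusion paths: c1, c2, c3 (increasing depth) -> c11, c21, c31."""
    def __init__(self, c=256):
        super().__init__()
        self.post_c123 = ConvBL(2 * c, c)   # after splicing, channels are doubled
        self.post_c11 = ConvBL(2 * c, c)
        self.post_cat3 = ConvBL(2 * c, c)
        self.out_c31 = ConvBL(c, c)

    def forward(self, c1, c2, c3):
        # P1: add c2 and the resized c3 -> first intermediate feature map
        inter1 = c2 + resize_to(c3, c2)
        # P2: splice with c1 -> second intermediate feature map (c123)
        c123 = self.post_c123(torch.cat([resize_to(inter1, c1), c1], dim=1))
        # c1 path: add c123 and c1, then splice with c1 again -> first target feature map
        inter3 = c123 + c1
        c11 = self.post_c11(torch.cat([inter3, c1], dim=1))
        # c2 path: add c123, c2 and c11 -> second target feature map
        c21 = resize_to(c123, c2) + c2 + resize_to(c11, c2)
        # c3 path: splice c123 with c3, then add c3, c21 and c11 -> third target feature map
        inter4 = self.post_cat3(torch.cat([resize_to(c123, c3), c3], dim=1))
        c31 = self.out_c31(inter4 + c3 + resize_to(c21, c3) + resize_to(c11, c3))
        return c11, c21, c31
```

The returned c11, c21 and c31 sit on the c1, c2 and c3 resolutions respectively and would then pass through the ASFF and CBL stages before the detection heads.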
In this target detection method, the feature maps at the different feature extraction depths are fused with one another, the feature flow directions are diverse, richer fused features are provided for the target detection module, the multi-scale fusion effect is improved, and the target detection precision is improved.
In an embodiment, when the computer device performs step S202, adding the second initial feature map and the third initial feature map to obtain a first intermediate feature map includes: assigning a first weight to the second initial feature map and a second weight to the third initial feature map; adding the weighted second initial feature map and the weighted third initial feature map to obtain a fifth intermediate feature map, and assigning a third weight to the fifth intermediate feature map, the third weight being an inverse of the sum of the first weight and the second weight; and performing feature extraction, by a convolution kernel, on the weighted fifth intermediate feature map to obtain the first intermediate feature map.
To express the relative importance of the different feature maps when they are added, a weight ω is introduced for each feature map taking part in an addition, so as to compute the proportion of information it contributes. The above embodiment takes the addition of the second initial feature map and the third initial feature map as an example; it is understood that weights may likewise be introduced for other feature maps that are added (for example c123 and c1). Taking the P1 node as an example, the processing can be characterized as follows:
P1(c2, c3) = Conv( (ω1 · c2 + ω2 · Resize(c3)) / (ω1 + ω2 + ε) )
where ω1 is the first weight and ω2 is the second weight. Since the feature extraction depths of c3 and c2 are different, their resolutions may also differ; to ensure that the two can be added normally, the resolution of c3 may first be expanded to the resolution of c2, i.e. a Resize operation is performed on c3. Furthermore, the third weight 1/(ω1 + ω2) may suffer from numerical oscillation, so ε is introduced to avoid it, specifically ε = 0.0001. After the convolution kernel performs feature extraction on the weighted fifth intermediate feature map, P1 may further apply batch normalization and LeakyReLU activation to the feature-extracted fifth feature map, obtaining the first intermediate feature map.
Further, before assigning a first weight to the second initial feature map and a second weight to the third initial feature map, the computer device may also normalize the first weight and the second weight, so as to avoid the training instability caused by multiplying the feature maps directly by unbounded weights. The normalization may use the following formula:
ωi ← ωi / (ε + Σj ωj)
Still further, before normalizing the first weight and the second weight, the computer device may also process the first weight and/or the second weight with a linear rectification function (ReLU function) so that the first weight and/or the second weight is greater than or equal to 0.
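Putting the ReLU clamp, the normalization and the weighted addition together, one such P1-style node could look roughly like the following sketch; the module name, the weight initialization and the LeakyReLU slope are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedAdd(nn.Module):
    """Weighted addition node: out = Conv((w1*c2 + w2*Resize(c3)) / (w1 + w2 + eps))."""
    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.eps = eps
        self.w = nn.Parameter(torch.ones(2))   # one learnable scalar weight per input
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, c2, c3):
        # linear rectification keeps the weights non-negative
        w = F.relu(self.w)
        # expand the deeper map to the shallower map's resolution before adding
        c3 = F.interpolate(c3, size=c2.shape[-2:], mode="nearest")
        # weighted sum scaled by the "third weight" 1/(w1 + w2), with eps against oscillation
        fused = (w[0] * c2 + w[1] * c3) / (w.sum() + self.eps)
        return self.conv(fused)
```

An instance such as WeightedAdd(256) applied to (c2, c3) would stand in for the P1 node between c2 and the resized c3.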
In one embodiment, the first initial feature map, the second initial feature map and the third initial feature map are respectively output by the last three residual sub-modules of the feature extraction module; any residual sub-module of the last three residual sub-modules comprises a first convolution layer and an MHSA layer for receiving a feature map output by the first convolution layer.
The target detection neural network YOLO-v3 is taken as an example. For the residual structure in the backbone network of YOLO-v3, as shown in fig. 3, the last 3 residual sub-modules of the backbone network (feature extraction module) respectively output feature maps at 3 resolutions (i.e., the first initial feature map, the second initial feature map and the third initial feature map). For the last residual sub-module of each cycle, the original 3x3 convolution layer is converted into an MHSA (multi-head self-attention) layer, forming the residual sub-module (convolution-MHSA residual structure) proposed by the present application, so as to provide global information for the output feature map. The first convolution layer of the above embodiment is the 1x1 convolution layer of the proposed residual sub-module that feeds the feature map into the MHSA layer.
Further, any residual sub-module of the last three residual sub-modules further comprises a second convolution layer for receiving the feature map output by the MHSA layer; and the second convolution layer is used for adjusting the resolution of the feature map output by the MHSA layer to the resolution of the feature map input into the first convolution layer so as to perform residual summation on the feature map after resolution adjustment and the feature map input into the first convolution layer.
YOLO-v3 is again taken as the example: the second convolution layer is the 1x1 convolution layer of the residual sub-module proposed in this application that receives the feature map output by the MHSA layer. In the above embodiment, a second convolution layer is added after the MHSA layer to adjust the resolution of the post-MHSA feature map, so as to facilitate the residual summation.
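A minimal sketch of such a convolution-MHSA residual sub-module is given below, using torch.nn.MultiheadAttention as a stand-in for the MHSA layer; the channel-reduction factor, head count and final activation are assumptions, and the patented MHSA additionally injects two-dimensional position information as described next.

```python
import torch
import torch.nn as nn

class ConvMHSAResidual(nn.Module):
    """Residual sub-module in which the 3x3 convolution is replaced by self-attention."""
    def __init__(self, channels, heads=4):
        super().__init__()
        inner = channels // 2
        # first (1x1) convolution layer feeding the MHSA layer
        self.conv1 = nn.Sequential(nn.Conv2d(channels, inner, 1, bias=False),
                                   nn.BatchNorm2d(inner), nn.LeakyReLU(0.1, inplace=True))
        self.mhsa = nn.MultiheadAttention(inner, heads, batch_first=True)
        # second (1x1) convolution layer restoring the shape of the block input
        self.conv2 = nn.Sequential(nn.Conv2d(inner, channels, 1, bias=False),
                                   nn.BatchNorm2d(channels))

    def forward(self, x):
        y = self.conv1(x)
        b, c, h, w = y.shape
        tokens = y.flatten(2).transpose(1, 2)        # (b, h*w, c): one token per position
        attn, _ = self.mhsa(tokens, tokens, tokens)  # global self-attention over positions
        y = attn.transpose(1, 2).reshape(b, c, h, w)
        y = self.conv2(y)
        return torch.relu(x + y)                     # residual summation with the input
```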
The processing of the MHSA layer is shown in fig. 4, which simply demonstrates the two-input, two-head case: an input X_1 is converted into corresponding q, k and v vectors through the W_Q, W_K and W_V matrices; q_11 is multiplied by all k_i1 to obtain the corresponding a_i1, and a softmax operation is performed on all a_i1 to obtain the normalized weights â_i1; each â_i1 is then multiplied by its corresponding v_i1 and the products are summed to give b_11. b_12 is obtained by the same steps using q_12 and all k_i2. b_11 and b_12 are spliced together and passed through a transformation matrix W_0 that adjusts the dimension, giving the output b_1 corresponding to X_1. Similarly, the output b_2 corresponding to the input X_2 can be obtained.
The vectors produced by the three matrices W_Q, W_K and W_V play different roles in the subsequent operations: W_Q and W_K are used for similarity matching, while W_V carries the feature information of the input to the output.
After the processing by each matrix, two vectors are obtained; for example, X_1 is converted by the W_Q matrix into a q_1 vector, and q_1 is multiplied by two different transformation matrices to obtain q_11 and q_12, which increases the number of information flow channels and improves the expressive capacity.
As for the dimension adjustment mentioned above: after the two vectors are spliced, the dimension is doubled, so the dimension needs to be adjusted again and reduced to the dimension before splicing.
In order for the position information of the inputs to conform to the characteristics of a two-dimensional image, so that the transformer module can capture the relative positional relationship between the inputs, the position information of the inputs needs to be added into the computation of the transformer module. The calculation of q and k is therefore refined as shown in fig. 5 (described here in a simple single-head form): when q and k are multiplied to compute a, q is also multiplied by the position information r to obtain qr, and qr is then added to qk, the result of multiplying q and k, to give a. r is obtained from the R_H and R_W vectors, which correspond respectively to the height-coordinate and width-coordinate information of the position of the input X.
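The decomposition a = softmax(qk + qr), with r built from the R_H and R_W vectors, can be written out in a single-head form as follows; the tensor shapes and the way the height and width embeddings are combined into r are assumptions of this sketch:

```python
import torch

def attention_with_2d_position(q, k, v, r_h, r_w):
    """Single-head attention with a position term: a = softmax(q.k + q.r).

    q, k, v : (n, d) tensors, one row per spatial position (n = H * W)
    r_h     : (H, d) embedding for each height (row) coordinate
    r_w     : (W, d) embedding for each width (column) coordinate
    """
    H, W = r_h.shape[0], r_w.shape[0]
    d = q.shape[-1]
    # per-position embedding r obtained from the height and width vectors
    r = (r_h[:, None, :] + r_w[None, :, :]).reshape(H * W, d)
    qk = q @ k.T                      # content term  q.k
    qr = q @ r.T                      # position term q.r
    a = torch.softmax(qk + qr, dim=-1)
    return a @ v                      # weighted sum of the value vectors

# tiny usage example on a 4x4 feature map with 8-dimensional vectors (illustrative sizes)
H, W, d = 4, 4, 8
q, k, v = (torch.randn(H * W, d) for _ in range(3))
out = attention_with_2d_position(q, k, v, torch.randn(H, d), torch.randn(W, d))
```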
In order to better understand the above method, an application example of the object detection method of the present application is described in detail below.
This application example takes the last residual sub-module before each of the three scale outputs of the backbone network in the YOLO-v3 neural network, converts its 3x3 convolution layer into a transformer module, and embeds position information related to the width and height of the image into the query of the transformer, so that it better matches the characteristics of a two-dimensional image. The original FPN module (feature pyramid module) of YOLO-v3 is also modified to provide more diverse fusion modes for the features of each scale, improving the multi-scale processing effect.
(1) The residual structure in the backbone network of YOLO-v3 is modified. As shown in fig. 3, in the backbone network of YOLO-v3 the last 3 residual sub-modules output feature maps at 3 resolutions, respectively. For the last residual sub-module of each cycle, this application example converts the 3x3 convolution layer therein into an MHSA layer, so as to provide global information for the output feature map.
The MHSA processing is shown in fig. 4 (here the simple two-input, two-head case is demonstrated): an input X_1 is converted into corresponding q, k and v vectors through the W_Q, W_K and W_V matrices; q_11 is multiplied by all k_i1 to obtain the corresponding a_i1, and a softmax operation is performed on all a_i1 to obtain the normalized weights â_i1; each â_i1 is then multiplied by its corresponding v_i1 and the products are summed to give b_11. b_12 is obtained by the same steps using q_12 and all k_i2. b_11 and b_12 are spliced together and passed through a transformation matrix W_0 that adjusts the dimension, giving the output b_1 corresponding to X_1. Similarly, the output b_2 corresponding to the input X_2 can be obtained.
In order to add the position information of the inputs into the computation of the transformer module, the calculation of q and k is refined here, as shown in fig. 5 (described in a simple single-head form): when q and k are multiplied to compute a, q is also multiplied by the position information r to obtain qr, and qr is then added to qk, the result of multiplying q and k, to give a. r is obtained from the R_H and R_W vectors, which correspond respectively to the height-coordinate and width-coordinate information of the position of the input X.
(2) The FPN module in YOLO-v3 is optimized, as shown in FIG. 1, where c1, c2 and c3 are the feature maps at three different resolutions output by the backbone network, each resolution being half of the previous one, and conv on a connection denotes a combination of convolution, batch normalization and LeakyReLU activation. In this application example the feature information does not flow only from top to bottom; the transfer and fusion are more complex and diverse. c3 is added to c2 and the result is spliced with c1; the feature map thus obtained is then respectively added to c1 and spliced with c3. The feature maps on the three resolution paths are then fused with the initial inputs and fused with one another through bottom-up paths. This multi-scale fusion module can be repeated multiple times to further integrate information of different sizes in the feature maps, and finally an ASFF structure is attached to complete the last step of multi-scale fusion before the target detection module of YOLO-v3.
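For reference, an ASFF-style node for one output level might be sketched as below; this is a rough sketch only: the published ASFF design includes channel-compression details that are omitted here, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFFNode(nn.Module):
    """Adaptively spatial feature fusion for one output level: per-pixel weights
    over the three scales are predicted and softmax-normalised."""
    def __init__(self, channels):
        super().__init__()
        # one 1x1 convolution per input level, producing a single-channel weight map
        self.weight_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(3)])
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, target, *others):
        # bring every level to the resolution of the target level
        feats = [target] + [F.interpolate(o, size=target.shape[-2:], mode="nearest")
                            for o in others]
        # per-pixel weight map for each level, softmax across the level dimension
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        alpha = torch.softmax(logits, dim=1)          # (b, 3, h, w)
        fused = sum(alpha[:, i:i + 1] * feats[i] for i in range(3))
        return self.out_conv(fused)
```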
In this application example, in order to represent the degree of importance of the different inputs during feature map fusion, a weight ω is introduced for each input during fusion, used to compute the proportion of information contributed by that input. Since multiplying the inputs directly by ω easily makes training unstable, the range of ω can be limited by normalization, as follows:
ωi ← ωi / (ε + Σj ωj)
Before the calculation, ω is processed with ReLU to ensure ω ≥ 0, and ε = 0.0001 is used to avoid numerical oscillation. Taking the P1 node in the figure as an example, the calculation formula is:
P1(c2, c3) = Conv( (ω1 · c2 + ω2 · Resize(c3)) / (ω1 + ω2 + ε) )
In this embodiment, the transformer modules placed in the backbone network effectively improve the network's grasp and use of global information: simply replacing the three residual sub-modules of YOLO-v3 with the transformer-based sub-modules improves the target detection accuracy. In the multi-scale fusion step, the optimized FPN module makes the flow and fusion of multi-scale information more diverse and richer, further improving the target detection precision.
It should be understood that, although the steps in the flowcharts of figs. 1 to 5 are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1 to 5 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided an object detection apparatus including:
a feature map obtaining unit 601, configured to obtain a first initial feature map, a second initial feature map and a third initial feature map, which are output after a feature extraction module performs feature extraction on an image to be detected; the feature extraction depths corresponding to the first initial feature map, the second initial feature map and the third initial feature map increase in sequence;
a feature map processing unit 602, configured to add the second initial feature map and the third initial feature map to obtain a first intermediate feature map, and splice the first intermediate feature map and the first initial feature map to obtain a second intermediate feature map;
a first target feature map obtaining unit 603, configured to splice a third intermediate feature map obtained by adding the second intermediate feature map and the first initial feature map with the first initial feature map to obtain a first target feature map;
a second target feature map obtaining unit 604, configured to add the second intermediate feature map, the second initial feature map, and the first target feature map to obtain a second target feature map;
a third target feature map obtaining unit 605, configured to fuse a fourth intermediate feature map obtained by splicing the second intermediate feature map and the third initial feature map with the third initial feature map, the second target feature map, and the first target feature map to obtain a third target feature map;
and an object detection unit 606, configured to input the first object feature map, the second object feature map, and the third object feature map into an object detection module, so as to detect an object in the image to be detected.
In an embodiment, the feature map processing unit 602 is further configured to assign a first weight to the second initial feature map and assign a second weight to the third initial feature map; add the weighted second initial feature map and the weighted third initial feature map to obtain a fifth intermediate feature map, and assign a third weight to the fifth intermediate feature map, the third weight being an inverse of the sum of the first weight and the second weight; and perform feature extraction, by a convolution kernel, on the weighted fifth intermediate feature map to obtain the first intermediate feature map.
In one embodiment, the apparatus further includes a normalization processing unit configured to normalize the first weight and the second weight.
In one embodiment, the apparatus further comprises a linear rectification processing unit for processing the first weight and/or the second weight by using a linear rectification function so that the first weight and/or the second weight is greater than or equal to 0.
In one embodiment, the first initial feature map, the second initial feature map and the third initial feature map are respectively output by the last three residual sub-modules of the feature extraction module; any residual sub-module of the last three residual sub-modules comprises a first convolution layer and an MHSA layer for receiving a feature map output by the first convolution layer.
In one embodiment, any one of the last three residual sub-modules further comprises a second convolution layer that receives a feature map output by the MHSA layer; and the second convolution layer is used for adjusting the resolution of the feature map output by the MHSA layer to the resolution of the feature map input into the first convolution layer so as to perform residual summation on the feature map after resolution adjustment and the feature map input into the first convolution layer.
For specific limitations of the target detection device, reference may be made to the above limitations of the target detection method, which are not described herein again. The modules in the target detection device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, the internal structure of which may be as shown in FIG. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing object detection data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of object detection.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the respective method embodiment as described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of object detection, the method comprising:
acquiring a first initial feature map, a second initial feature map and a third initial feature map which are output after a feature extraction module extracts features of an image to be detected; the feature extraction depths corresponding to the first initial feature map, the second initial feature map and the third initial feature map increase in sequence;
adding the second initial feature map and the third initial feature map to obtain a first intermediate feature map, and splicing the first intermediate feature map and the first initial feature map to obtain a second intermediate feature map;
splicing a third intermediate feature map obtained by adding the second intermediate feature map and the first initial feature map with the first initial feature map to obtain a first target feature map;
adding the second intermediate feature map, the second initial feature map and the first target feature map to obtain a second target feature map;
fusing a fourth intermediate feature map obtained by splicing the second intermediate feature map and the third initial feature map with the third initial feature map, the second target feature map and the first target feature map to obtain a third target feature map;
and inputting the first target feature map, the second target feature map and the third target feature map into a target detection module so as to detect the target in the image to be detected.
2. The method of claim 1, wherein said adding the second initial feature map and the third initial feature map to obtain a first intermediate feature map comprises:
assigning a first weight to the second initial feature map and a second weight to the third initial feature map;
adding the weighted second initial feature map and the weighted third initial feature map to obtain a fifth intermediate feature map, and assigning a third weight to the fifth intermediate feature map; the third weight is an inverse of the sum of the first weight and the second weight;
and performing feature extraction, by a convolution kernel, on the weighted fifth intermediate feature map to obtain the first intermediate feature map.
3. The method of claim 2, wherein prior to assigning a first weight to the second initial feature map and a second weight to the third initial feature map, the method further comprises:
and normalizing the first weight and the second weight.
4. The method of claim 3, wherein prior to normalizing the first weight and the second weight, the method further comprises:
processing the first weight and/or the second weight with a linear rectification function so that the first weight and/or the second weight is greater than or equal to 0.
5. The method according to any one of claims 1 to 4, wherein the first initial feature map, the second initial feature map and the third initial feature map are output by the last three residual sub-modules of the feature extraction module, respectively; any residual sub-module of the last three residual sub-modules comprises a first convolution layer and an MHSA layer for receiving a feature map output by the first convolution layer.
6. The method of claim 5, wherein any of the last three residual sub-modules further comprises receiving a second convolution layer of the feature map output by the MHSA layer; and the second convolution layer is used for adjusting the resolution of the feature map output by the MHSA layer to the resolution of the feature map input into the first convolution layer so as to perform residual summation on the feature map after resolution adjustment and the feature map input into the first convolution layer.
7. An object detection apparatus, characterized in that the apparatus comprises:
the feature map acquisition unit is used for acquiring a first initial feature map, a second initial feature map and a third initial feature map which are output after the feature extraction module performs feature extraction on the image to be detected; the feature extraction depths corresponding to the first initial feature map, the second initial feature map and the third initial feature map increase in sequence;
a feature map processing unit, configured to add the second initial feature map and the third initial feature map to obtain a first intermediate feature map, and splice the first intermediate feature map and the first initial feature map to obtain a second intermediate feature map;
a first target feature map obtaining unit, configured to splice a third intermediate feature map obtained by adding the second intermediate feature map and the first initial feature map with the first initial feature map to obtain a first target feature map;
the second target feature map acquisition unit is used for adding the second intermediate feature map, the second initial feature map and the first target feature map to obtain a second target feature map;
a third target feature map obtaining unit, configured to fuse a fourth intermediate feature map obtained by splicing the second intermediate feature map and the third initial feature map with the third initial feature map, the second target feature map, and the first target feature map to obtain a third target feature map;
and the target detection unit is used for inputting the first target feature map, the second target feature map and the third target feature map into a target detection module so as to detect the target in the image to be detected.
8. The apparatus according to claim 7, wherein the feature map processing unit is further configured to assign a first weight to the second initial feature map and assign a second weight to the third initial feature map; add the weighted second initial feature map and the weighted third initial feature map to obtain a fifth intermediate feature map, and assign a third weight to the fifth intermediate feature map, the third weight being an inverse of the sum of the first weight and the second weight; and perform feature extraction, by a convolution kernel, on the weighted fifth intermediate feature map to obtain the first intermediate feature map.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the method of any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 6.
CN202110459855.0A 2021-04-27 2021-04-27 Target detection method, device, equipment and storage medium Pending CN113283475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110459855.0A CN113283475A (en) 2021-04-27 2021-04-27 Target detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110459855.0A CN113283475A (en) 2021-04-27 2021-04-27 Target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113283475A 2021-08-20

Family

ID=77277413

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110459855.0A Pending CN113283475A (en) 2021-04-27 2021-04-27 Target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113283475A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060215908A1 (en) * 2005-03-24 2006-09-28 Konica Minolta Holdings, Inc. Image pickup apparatus and image processing method
CN110852116A (en) * 2019-11-07 2020-02-28 腾讯科技(深圳)有限公司 Non-autoregressive neural machine translation method, device, computer equipment and medium
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111709397A (en) * 2020-07-08 2020-09-25 哈尔滨工业大学 Unmanned aerial vehicle variable-size target detection method based on multi-head self-attention mechanism
CN111860683A (en) * 2020-07-30 2020-10-30 中国人民解放军国防科技大学 Target detection method based on feature fusion
CN111882002A (en) * 2020-08-06 2020-11-03 桂林电子科技大学 MSF-AM-based low-illumination target detection method
CN112215207A (en) * 2020-11-10 2021-01-12 中国人民解放军战略支援部队信息工程大学 Remote sensing image airplane target detection method combining multi-scale and attention mechanism
CN112528896A (en) * 2020-12-17 2021-03-19 长沙理工大学 SAR image-oriented automatic airplane target detection method and system
CN112669350A (en) * 2020-12-31 2021-04-16 广东电网有限责任公司电力科学研究院 Adaptive feature fusion intelligent substation human body target tracking method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
石先让 等: "一种新颖的单目视觉深度学习算法:H_SFPN" (A novel monocular vision deep learning algorithm: H_SFPN), 《计算机科学》 (Computer Science), 19 April 2021 (2021-04-19), pages 130-137 *

Similar Documents

Publication Publication Date Title
CN108846355B (en) Image processing method, face recognition device and computer equipment
CN111179419B (en) Three-dimensional key point prediction and deep learning model training method, device and equipment
CN111915660B (en) Binocular disparity matching method and system based on shared features and attention up-sampling
US11132392B2 (en) Image retrieval method, image retrieval apparatus, image retrieval device and medium
CN111797983A (en) Neural network construction method and device
CN115147598B (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN111899203B (en) Real image generation method based on label graph under unsupervised training and storage medium
CN111199516B (en) Image processing method, system and storage medium based on image generation network model
CN112183295A (en) Pedestrian re-identification method and device, computer equipment and storage medium
CN112529897B (en) Image detection method, device, computer equipment and storage medium
CN113066002A (en) Generation method of countermeasure sample, training method of neural network, training device of neural network and equipment
CN112184687B (en) Road crack detection method based on capsule feature pyramid and storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN114549913A (en) Semantic segmentation method and device, computer equipment and storage medium
CN112241646A (en) Lane line recognition method and device, computer equipment and storage medium
CN117636298A (en) Vehicle re-identification method, system and storage medium based on multi-scale feature learning
CN114119962A (en) Image key point detection method and device, electronic equipment and storage medium
CN113177546A (en) Target detection method based on sparse attention module
CN117496352A (en) Remote sensing change detection method, device and equipment based on gradual fusion of adjacent features
CN116206212A (en) SAR image target detection method and system based on point characteristics
CN113283475A (en) Target detection method, device, equipment and storage medium
CN117274392A (en) Camera internal parameter calibration method and related equipment
CN115239974A (en) Vision synchronous positioning and map construction closed-loop detection method integrating attention mechanism
CN113591637B (en) Training method and device for alignment model, computer equipment and storage medium
CN112669408A (en) Multi-mode live-action map image generation method and device, computer equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230417

Address after: Full Floor 14, Unit 3, Building 2, No. 11, Middle Spectra Road, Huangpu District, Guangzhou, Guangdong 510700

Applicant after: China Southern Power Grid Digital Grid Technology (Guangdong) Co.,Ltd.

Address before: Room 86, room 406, No.1, Yichuang street, Zhongxin Guangzhou Knowledge City, Huangpu District, Guangzhou City, Guangdong Province

Applicant before: Southern Power Grid Digital Grid Research Institute Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20210820