CN113628168A - Target detection method and device - Google Patents

Info

Publication number
CN113628168A
Authority
CN
China
Prior art keywords
feature
layers
layer
target detection
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110795519.3A
Other languages
Chinese (zh)
Inventor
黄诗盛 (Huang Shisheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Haiyi Zhixin Technology Co Ltd
Original Assignee
Shenzhen Haiyi Zhixin Technology Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion): 2021-07-14
Filing date: 2021-07-14
Publication date: 2021-11-09
Application filed by Shenzhen Haiyi Zhixin Technology Co Ltd
Priority to CN202110795519.3A (2021-07-14)
Publication of CN113628168A (2021-11-09)
Legal status: Pending

Classifications

    • G06T 7/0002 Image analysis: inspection of images, e.g. flaw detection
    • G06F 18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 Neural networks: combinations of networks
    • G06N 3/08 Neural networks: learning methods
    • G06T 9/002 Image coding using neural networks
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T 2207/20081 Training; learning

Abstract

A target detection method and device. The method comprises: acquiring an image to be detected; performing target detection on the image using a trained target detection model, and outputting a target detection result. The target detection model comprises an encoding portion and a decoding portion, the output of the encoding portion serving as the input of the decoding portion. The encoding portion comprises multiple feature layers, and the sizes of the feature maps they output decrease layer by layer; the decoding portion comprises multiple feature layers, the sizes of the feature maps they output increase layer by layer, and the feature map output by the last feature layer of the decoding portion is used for target detection. The method and device effectively avoid the time cost of performing target detection on multiple feature layers and, at the same time, address the poor detection of small targets.

Description

Target detection method and device
Technical Field
The present application relates to the field of target detection technologies, and in particular, to a target detection method and apparatus.
Background
Target detection is one of the important applications in the field of computer vision. Depending on whether anchors need to be set in advance for detection, target detection algorithms can be divided into anchor-based and anchor-free algorithms. Typical anchor-based representatives are Faster R-CNN, the Single Shot multibox Detector (SSD) and YOLO; typical anchor-free representatives are CornerNet, CenterNet and FCOS. By number of stages, target detection algorithms can be divided into one-stage and two-stage algorithms: typical one-stage representatives are SSD, YOLO and FCOS, while a typical two-stage representative is Faster R-CNN. One-stage algorithms are fast but less accurate; two-stage algorithms are accurate but slow. Anchor-based algorithms are easy to train but have many parameters to tune; anchor-free algorithms have few parameters to tune, high model precision and no need for post-processing such as Non-Maximum Suppression (NMS), but are harder to train and converge. Taking the SSD as an example, it detects quickly and trains easily, but because it needs multiple detection heads to detect targets of different sizes, setting up its anchors requires rich experience, and its detection of small targets on the shallow feature layers is not effective enough.
Disclosure of Invention
According to an aspect of the present application, there is provided a target detection method, the method comprising: acquiring an image to be detected; performing target detection on the image using a trained target detection model, and outputting a target detection result; wherein the target detection model comprises an encoding portion and a decoding portion, the output of the encoding portion serving as the input of the decoding portion, and wherein: the encoding portion comprises multiple feature layers, and the sizes of the feature maps output by these layers decrease layer by layer; the decoding portion comprises multiple feature layers, the sizes of the feature maps output by these layers increase layer by layer, and the feature map output by the last feature layer of the decoding portion is used for target detection.
In one embodiment of the present application, the number of feature layers in the decoding portion is equal to the number of feature layers in the encoding portion.
In one embodiment of the present application, the size of the feature map output by the last feature layer of the decoding portion is equal to the size of the image input to the target detection model.
In one embodiment of the present application, the convolution filters of the feature layers of the encoding portion have equal strides and equal convolution kernel sizes.
In one embodiment of the present application, the size of the feature map output by each feature layer of the decoding portion increases proportionally, and the size of the feature map output by the first feature layer of the decoding portion is equal to the size of the feature map output by the penultimate feature layer of the encoding portion.
In an embodiment of the present application, the encoding portion comprises five feature layers and the decoding portion comprises five feature layers; the feature layers of the encoding portion respectively output first to fifth feature maps, and the feature layers of the decoding portion respectively output sixth to tenth feature maps, wherein: the sixth feature map is the result of element-wise multiplication of the deconvolved fifth feature map with the fourth feature map; the seventh feature map is the result of element-wise multiplication of the deconvolved sixth feature map with the third feature map; the eighth feature map is the result of element-wise multiplication of the deconvolved seventh feature map with the second feature map; the ninth feature map is the result of element-wise multiplication of the deconvolved eighth feature map with the first feature map; the tenth feature map is the result of applying a 1 × 1 convolution to the deconvolved ninth feature map; and the tenth feature map is used for classification prediction and regression prediction.
In one embodiment of the present application, the method further comprises: after the image to be detected is acquired, preprocessing it and inputting the preprocessed image into the target detection model to obtain the target detection result.
In an embodiment of the present application, preprocessing the image to be detected comprises: normalizing and/or enhancing the image to be detected to obtain an image that meets the format requirements of the target detection model.
In an embodiment of the present application, image enhancement of the image to be detected comprises at least one of: flipping, rotating, color jittering and randomly scaling the image to be detected.
According to the target detection method and device of the embodiments of the present application, a target detection model comprising both an encoding portion and a decoding portion is adopted, and target detection is performed on the feature map output by the last feature layer of the decoding portion. This effectively avoids the time cost of performing target detection on multiple feature layers; at the same time, because detection is performed on the last feature layer, whose semantic information is rich and whose feature map is large, the problem of poor small-target detection is addressed and the reliability of small-target detection is greatly improved.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 shows a schematic block diagram of an example electronic device for implementing an object detection method and apparatus in accordance with embodiments of the present invention.
Fig. 2 shows a schematic flow diagram of a target detection method according to an embodiment of the present application.
Fig. 3 is a schematic diagram illustrating the feature maps output by each feature layer of a target detection model used in a target detection method according to an embodiment of the present application.
Fig. 4 shows a schematic block diagram of an object detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, exemplary embodiments according to the present application will be described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the application described in the application without inventive step, shall fall within the scope of protection of the application.
First, an exemplary electronic device 100 for implementing the object detection method and apparatus of the embodiment of the present invention is described with reference to fig. 1.
As shown in FIG. 1, electronic device 100 includes one or more processors 102, one or more memory devices 104, an input device 106, and an output device 108, which are interconnected via a bus system 110 and/or other form of connection mechanism (not shown). It should be noted that the components and structure of the electronic device 100 shown in fig. 1 are exemplary only, and not limiting, and the electronic device may have other components and structures as desired.
The processor 102 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 100 to perform desired functions.
The storage 104 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. On which one or more computer program instructions may be stored that may be executed by processor 102 to implement client-side functionality (implemented by the processor) and/or other desired functionality in embodiments of the invention described below. Various applications and various data, such as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like. The input device 106 may be any interface for receiving information.
The output device 108 may output various information (e.g., images or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, and the like. The output device 108 may be any other device having an output function.
By way of example, an electronic device for implementing the target detection method and apparatus according to the embodiments of the present invention may be a terminal such as a smartphone, a tablet computer or a camera.
Next, an object detection method 200 according to an embodiment of the present application will be described with reference to fig. 2. As shown in fig. 2, the target detection method 200 may include the steps of:
in step S210, an image to be detected is acquired.
In step S220, target detection is performed on the image using a trained target detection model, and a target detection result is output, wherein the target detection model comprises an encoding portion and a decoding portion, the output of the encoding portion serving as the input of the decoding portion, and wherein: the encoding portion comprises multiple feature layers, and the sizes of the feature maps output by these layers decrease layer by layer; the decoding portion comprises multiple feature layers, the sizes of the feature maps output by these layers increase layer by layer, and the feature map output by the last feature layer of the decoding portion is used for target detection.
In the embodiments of the present application, target detection is performed with a model comprising both an encoding portion and a decoding portion. The feature maps output by the feature layers of the encoding portion shrink layer by layer; the decoding portion takes the output of the encoding portion as input, its output feature maps grow layer by layer, and the feature map output by its last feature layer is used for target detection. The model therefore has only one detection head, which effectively avoids the time consumed by the original SSD in performing target detection on multiple feature layers. At the same time, because detection is performed on the last feature layer, its semantic information is rich; and because the feature maps output by the decoding portion grow layer by layer, the feature map output by the last feature layer of the decoding portion is necessarily larger than the output feature maps of the encoding portion. Detecting on the last feature map of the decoding portion thus effectively avoids small target objects being filtered out by a feature map that is too small, addresses the problem of poor small-target detection, and greatly improves the reliability of small-target detection.
In an embodiment of the present application, the number of feature layers in the decoding portion of the target detection model employed in step S220 may be equal to the number of feature layers in the encoding portion. In this embodiment, the decoding portion is as deep as the encoding portion, and target detection is performed on the last feature map of the decoding portion; detection therefore takes place on a deep feature layer, and the deeper the feature layer, the richer its semantic information, which further improves the reliability of small-target detection. In addition, because target detection is performed on the last feature layer of the decoding portion and the feature maps output by its layers grow layer by layer, the feature map finally used for detection grows with the number of decoding layers, which benefits the detection of small target objects. In one embodiment, the size of the feature map output by the last feature layer of the decoding portion may be equal to the size of the image input to the target detection model. In that case the feature map on which detection is finally performed matches the input image in size, so a small target (relative to the feature maps output by the encoding portion) appears at its largest, which best preserves the effect of small-target detection.
In other embodiments of the present application, the number of feature layers in the decoding portion of the target detection model employed in step S220 may be smaller than the number in the encoding portion; for example, the encoding portion may comprise five feature layers and the decoding portion four. Although the decoding portion then has fewer layers, it still takes the output of the encoding portion as input and its feature maps still grow layer by layer, so the feature map output by its last layer, which is finally used for target detection, is still large and semantically rich; compared with an SSD without a decoding portion, which detects on layer-by-layer shrinking feature maps, the small-target detection effect is still improved. In yet other embodiments, the decoding portion may have more feature layers than the encoding portion; the feature map finally used for detection is then even larger and semantically richer, which further favors small-target detection.
In the embodiments of the present application, the strides of the convolution filters in the feature layers of the encoding portion of the target detection model employed in step S220 may be equal or unequal, and likewise their convolution kernel sizes may be equal or unequal. In general, the convolution filters of the encoding layers may be set to the same stride and the same kernel size, so that the feature map size decreases proportionally at each layer and feature information is extracted more uniformly.
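As a quick check of this layer-by-layer reduction, the short sketch below computes the output size of a stride-2 convolution; the padding of 1 is our assumption, chosen because it reproduces the sizes shown in fig. 3 (320, 160, 80, 40, 20 from a 640 × 640 input).

    # Minimal sketch of the conv output-size arithmetic:
    # out = floor((in + 2*pad - kernel) / stride) + 1
    def conv_out_size(size, kernel=3, stride=2, pad=1):  # pad=1 is an assumption
        return (size + 2 * pad - kernel) // stride + 1

    size = 640
    for layer in range(1, 6):
        size = conv_out_size(size)
        print(f"C{layer}: {size} x {size}")  # prints 320, 160, 80, 40, 20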
Similarly, the strides of the convolution filters in the feature layers of the decoding portion of the target detection model employed in step S220 may be equal or unequal, and their convolution kernel sizes may be equal or unequal. In general, the convolution filters of the decoding layers may be set to the same stride and the same kernel size, so that the feature map size increases proportionally at each layer and semantic information is acquired more uniformly as the feature map grows.
In a specific embodiment of the present application, the encoding portion of the target detection model employed in step S220 may comprise five feature layers and the decoding portion five feature layers, the encoding layers respectively outputting a first, second, third, fourth and fifth feature map and the decoding layers respectively outputting a sixth, seventh, eighth, ninth and tenth feature map, wherein: the sixth feature map is the result of element-wise multiplication (Eltw Product, the element-wise product of the feature matrices) of the deconvolved fifth feature map with the fourth feature map; the seventh feature map is the result of element-wise multiplication of the deconvolved sixth feature map with the third feature map; the eighth feature map is the result of element-wise multiplication of the deconvolved seventh feature map with the second feature map; the ninth feature map is the result of element-wise multiplication of the deconvolved eighth feature map with the first feature map; the tenth feature map is the result of applying a 1 × 1 convolution to the deconvolved ninth feature map; and the tenth feature map is used for classification prediction and regression prediction. This is described below in conjunction with fig. 3.
Fig. 3 is a schematic diagram illustrating the feature maps output by each feature layer of a target detection model used in a target detection method according to an embodiment of the present application. As shown in fig. 3, the input of the model is 640 × 640 × 3 (where 640 × 640 is the size of the input image and 3 is its number of channels); the encoding portion consists of a 5-layer convolutional neural network, and the decoding portion likewise consists of 5 layers. In the encoding stage, the original image passes through a convolution filter with stride 2, kernel 3 × 3 and 32 channels to obtain feature map C1 of size 320 × 320 × 32; C1 passes through a filter with stride 2, kernel 3 × 3 and 64 channels to obtain C2 of size 160 × 160 × 64; C2 passes through a filter with stride 2, kernel 3 × 3 and 128 channels to obtain C3 of size 80 × 80 × 128; C3 passes through a filter with stride 2, kernel 3 × 3 and 256 channels to obtain C4 of size 40 × 40 × 256; and C4 passes through a filter with stride 2, kernel 3 × 3 and 256 channels to obtain C5 of size 20 × 20 × 256. In the decoding stage, C5 is deconvolved to a feature map whose size matches C4, and an Eltw Product with C4 gives C6; C6 is deconvolved to match C3, and an Eltw Product with C3 gives C7; C7 is deconvolved to match C2, and an Eltw Product with C2 gives C8; C8 is deconvolved to match C1, and an Eltw Product with C1 gives C9; finally, C9 is deconvolved and a 1 × 1 convolution is applied to obtain C10. Classification (Cls) and regression (Reg) predictions are made only on the C10 feature layer.
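To make the data flow of fig. 3 concrete, the following PyTorch sketch reconstructs the hourglass network described above. The channel widths, the stride-2 3 × 3 encoder convolutions and the deconvolution-plus-Eltw-Product fusion follow the figure; the padding values, the ReLU6 placements and the 1 × 1 prediction heads are our assumptions, so this is an illustrative reconstruction rather than the patent's exact network.

    import torch
    import torch.nn as nn

    def conv_block(c_in, c_out):
        # Stride-2 3x3 convolution: halves the spatial size (padding assumed to be 1).
        return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                             nn.ReLU6(inplace=True))

    def deconv_block(c_in, c_out):
        # Stride-2 deconvolution: doubles the spatial size (kernel 4 / padding 1 assumed).
        return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                             nn.ReLU6(inplace=True))

    class HourglassDetector(nn.Module):
        def __init__(self, num_classes=2, num_box_params=4):  # head widths are assumptions
            super().__init__()
            self.enc = nn.ModuleList(conv_block(a, b) for a, b in
                                     [(3, 32), (32, 64), (64, 128), (128, 256), (256, 256)])
            self.dec = nn.ModuleList(deconv_block(a, b) for a, b in
                                     [(256, 256), (256, 128), (128, 64), (64, 32), (32, 32)])
            self.fuse = nn.Conv2d(32, 32, 1)                  # the 1x1 convolution yielding C10
            self.cls_head = nn.Conv2d(32, num_classes, 1)     # classification (Cls) prediction
            self.reg_head = nn.Conv2d(32, num_box_params, 1)  # regression (Reg) prediction

        def forward(self, x):                    # x: (N, 3, 640, 640)
            skips = []
            for layer in self.enc:               # C1..C5, sizes 320, 160, 80, 40, 20
                x = layer(x)
                skips.append(x)
            for i, layer in enumerate(self.dec[:-1]):
                x = layer(x) * skips[-2 - i]     # deconvolve, then Eltw Product with the skip
            x = self.fuse(self.dec[-1](x))       # C10: (N, 32, 640, 640), matches the input
            return self.cls_head(x), self.reg_head(x)

    cls_map, reg_map = HourglassDetector()(torch.randn(1, 3, 640, 640))

Running the sketch on a 640 × 640 × 3 input yields 640 × 640 classification and regression maps, consistent with the statement that the C10 feature map matches the input image in size.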
In the example shown in fig. 3, an existing SSD comprises only the encoding portion and performs target detection on feature maps C3, C4 and C5. It therefore has multiple detection heads, which consumes considerable detection time; moreover, C3, C4 and C5 are all small, especially C5, in which small target objects may already have been filtered out and can no longer be detected, while the semantic information in C3 is not rich enough, so small-target detection is poor. In the embodiments of the present application, the target detection model instead adopts a convolutional neural network with an hourglass structure as the backbone for feature extraction: it first downsamples, then upsamples, with skip connections between corresponding layers. The feature map C10 output by its last feature layer matches the input image in size, and target detection is performed only on this layer, i.e. with a single detection head. This effectively avoids the time consumed by the original SSD in detecting on multiple feature layers; and whereas detecting on multiple feature layers requires different detection anchors to be set by experience, the method of the present application needs no additional anchor setup. At the same time, because detection is performed on the last feature layer, whose semantic information is rich and whose size is large (consistent with the input picture), the problem of poor small-target detection is solved.
Referring back to fig. 2, in an embodiment of the present application, the target detection method 200 may further include (not shown): after the image to be detected is acquired, preprocessing it and inputting the preprocessed image into the target detection model to obtain the target detection result. Illustratively, the preprocessing may include normalizing and/or enhancing the image to be detected to obtain an image that meets the format requirements of the target detection model; preprocessing the image before detection can further improve the efficiency and reliability of target detection. For example, image enhancement of the image to be detected may include at least one of: flipping, rotating, color jittering and randomly scaling the image.
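Purely as an illustration, such a preprocessing pipeline could be assembled with torchvision as below; every parameter value here is an assumption rather than a value from this application, and when augmenting detection training data the box annotations would have to be transformed together with the image.

    import torchvision.transforms as T

    # A possible preprocessing pipeline covering the steps named above;
    # all numeric parameters are illustrative assumptions.
    preprocess = T.Compose([
        T.RandomHorizontalFlip(p=0.5),                                # flipping
        T.RandomRotation(degrees=10),                                 # rotation
        T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),  # color jittering
        T.RandomResizedCrop(640, scale=(0.8, 1.0)),                   # random scaling to 640 x 640
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406],                       # normalization (ImageNet
                    std=[0.229, 0.224, 0.225]),                       # statistics, assumed)
    ])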
Similarly, when the target detection model employed in step S220 is trained, the collected images may be labeled and preprocessed so that they meet the format requirements of the neural network, and a training set and a test set are constructed from them. During training, the classification loss may be, for example, a focal loss function, the regression loss, for example, a smooth L1 loss function, and the activation function, for example, ReLU6; the anchor point sizes may be obtained by a clustering algorithm such as K-means. The model is trained until the loss no longer decreases on the training set and the accuracy no longer increases on the test set, at which point training stops and the resulting optimal model can be deployed.
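A minimal sketch of the two loss terms named above is given below, assuming a sigmoid-based binary focal loss; the weights alpha, gamma and beta are common defaults, not values taken from this application.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        # Focal loss: scales cross-entropy by (1 - p_t)^gamma so that
        # well-classified (easy) examples contribute less to training.
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()

    def detection_loss(cls_logits, cls_targets, reg_preds, reg_targets):
        # Total loss = classification (focal) + regression (smooth L1).
        return focal_loss(cls_logits, cls_targets) + \
               F.smooth_l1_loss(reg_preds, reg_targets, beta=1.0)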
The above exemplarily describes the target detection algorithm according to the embodiments of the present application. Based on the above description, the target detection method of the embodiments of the present application adopts a target detection model comprising both an encoding portion and a decoding portion and performs target detection on the feature map output by the last feature layer of the decoding portion. This effectively avoids the time cost of performing target detection on multiple feature layers as in the original SSD; at the same time, because detection is performed on the last feature layer, whose semantic information is rich and whose feature map is large, the problem of poor small-target detection is solved and the reliability of small-target detection is greatly improved.
An object detection apparatus provided in another aspect of the present application is described below with reference to fig. 4. Fig. 4 shows a schematic block diagram of an object detection apparatus 400 according to an embodiment of the present application. As shown in fig. 4, the object detection apparatus 400 according to the embodiment of the present application may include a memory 410 and a processor 420, where the memory 410 stores a computer program executed by the processor 420, and the computer program, when executed by the processor 420, causes the processor 420 to execute the object detection method according to the embodiment of the present application described above. Those skilled in the art can understand the specific operations of the object detection apparatus according to the embodiments of the present application in combination with the foregoing descriptions, and for the sake of brevity, specific details are not repeated here, and only some main operations of the processor 420 are described.
In one embodiment of the application, the computer program, when executed by the processor 420, causes the processor 420 to perform the following steps: acquiring an image to be detected; performing target detection on the image using a trained target detection model, and outputting a target detection result; wherein the target detection model comprises an encoding portion and a decoding portion, the output of the encoding portion serving as the input of the decoding portion, and wherein: the encoding portion comprises multiple feature layers, and the sizes of the feature maps output by these layers decrease layer by layer; the decoding portion comprises multiple feature layers, the sizes of the feature maps output by these layers increase layer by layer, and the feature map output by the last feature layer of the decoding portion is used for target detection.
In one embodiment of the present application, the number of feature layers in the decoding portion is equal to the number of feature layers in the encoding portion.
In one embodiment of the present application, the size of the feature map output by the last feature layer of the decoding portion is equal to the size of the image input to the target detection model.
In one embodiment of the present application, the convolution filters of the feature layers of the encoding portion have equal strides and equal convolution kernel sizes.
In one embodiment of the present application, the size of the feature map output by each feature layer of the decoding portion increases proportionally, and the size of the feature map output by the first feature layer of the decoding portion is equal to the size of the feature map output by the penultimate feature layer of the encoding portion.
In an embodiment of the present application, the encoding portion comprises five feature layers and the decoding portion comprises five feature layers; the feature layers of the encoding portion respectively output first to fifth feature maps, and the feature layers of the decoding portion respectively output sixth to tenth feature maps, wherein: the sixth feature map is the result of element-wise multiplication of the deconvolved fifth feature map with the fourth feature map; the seventh feature map is the result of element-wise multiplication of the deconvolved sixth feature map with the third feature map; the eighth feature map is the result of element-wise multiplication of the deconvolved seventh feature map with the second feature map; the ninth feature map is the result of element-wise multiplication of the deconvolved eighth feature map with the first feature map; the tenth feature map is the result of applying a 1 × 1 convolution to the deconvolved ninth feature map; and the tenth feature map is used for classification prediction and regression prediction.
In one embodiment of the application, the computer program, when executed by the processor 420, further causes the processor 420 to perform the following step: after the image to be detected is acquired, preprocessing it and inputting the preprocessed image into the target detection model to obtain the target detection result.
In an embodiment of the application, the computer program, when executed by the processor 420, causes the processor 420 to preprocess the image to be detected by normalizing and/or enhancing it to obtain an image that meets the format requirements of the target detection model.
In an embodiment of the application, the computer program, when executed by the processor 420, causes the processor 420 to perform image enhancement on the image to be detected, including at least one of: flipping, rotating, color jittering and randomly scaling the image to be detected.
Furthermore, according to an embodiment of the present application, there is also provided a storage medium on which program instructions are stored, which when executed by a computer or a processor are used for executing the corresponding steps of the object detection method of the embodiment of the present application. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
Based on the above description, the target detection method and apparatus according to the embodiments of the present application use a target detection model comprising both an encoding portion and a decoding portion and perform target detection on the feature map output by the last feature layer of the decoding portion. This effectively avoids the time cost of performing target detection on multiple feature layers as in the original SSD; at the same time, because detection is performed on the last feature layer, whose semantic information is rich and whose size is large, the problem of poor small-target detection is solved and the reliability of small-target detection is greatly improved.
Although the example embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above-described example embodiments are merely illustrative and are not intended to limit the scope of the present application thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present application. All such changes and modifications are intended to be included within the scope of the present application as claimed in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the present application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules according to embodiments of the present application. The present application may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
The above description is only for the specific embodiments of the present application or the description thereof, and the protection scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope disclosed in the present application, and shall be covered by the protection scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of object detection, the method comprising:
acquiring an image to be detected;
performing target detection on the image using a trained target detection model, and outputting a target detection result;
wherein the target detection model comprises an encoding portion and a decoding portion, an output of the encoding portion being an input of the decoding portion, wherein:
the encoding portion comprises a plurality of feature layers, and the sizes of the feature maps output by the feature layers of the encoding portion decrease layer by layer;
the decoding portion comprises a plurality of feature layers, the sizes of the feature maps output by the feature layers of the decoding portion increase layer by layer, and the feature map output by the last feature layer of the decoding portion is used for target detection.
2. The method of claim 1, wherein the number of feature layers in the decoding portion is equal to the number of feature layers in the encoding portion.
3. The method of claim 1, wherein the size of the feature map output by the last feature layer of the decoding portion is equal to the size of the image input to the target detection model.
4. The method of claim 1, wherein the convolution filters of the feature layers of the encoding portion have equal strides and equal convolution kernel sizes.
5. The method of claim 1, wherein the size of the feature map output by each feature layer of the decoding portion increases proportionally, and wherein the size of the feature map output by the first feature layer of the decoding portion is equal to the size of the feature map output by the penultimate feature layer of the encoding portion.
6. The method of claim 1, wherein the encoding portion comprises five feature layers and the decoding portion comprises five feature layers, the feature layers of the encoding portion respectively outputting first to fifth feature maps and the feature layers of the decoding portion respectively outputting sixth to tenth feature maps, wherein:
the sixth feature map is the result of element-wise multiplication of the deconvolved fifth feature map with the fourth feature map;
the seventh feature map is the result of element-wise multiplication of the deconvolved sixth feature map with the third feature map;
the eighth feature map is the result of element-wise multiplication of the deconvolved seventh feature map with the second feature map;
the ninth feature map is the result of element-wise multiplication of the deconvolved eighth feature map with the first feature map;
the tenth feature map is the result of applying a 1 × 1 convolution to the deconvolved ninth feature map;
the tenth feature map is used for classification prediction and regression prediction.
7. The method of claim 1, further comprising:
after the image to be detected is acquired, preprocessing the image to be detected, and inputting the preprocessed image into the target detection model to obtain the target detection result.
8. The method according to claim 7, wherein preprocessing the image to be detected comprises:
normalizing and/or enhancing the image to be detected to obtain an image that meets the format requirements of the target detection model.
9. The method according to claim 8, wherein image enhancement of the image to be detected comprises:
at least one of flipping, rotating, color jittering and randomly scaling the image to be detected.
10. An object detection apparatus, characterized in that the apparatus comprises a memory and a processor, the memory having stored thereon a computer program for execution by the processor, the computer program, when executed by the processor, causing the processor to perform the object detection method according to any one of claims 1-9.
CN202110795519.3A, priority date 2021-07-14, filing date 2021-07-14: Target detection method and device. Status: Pending. Publication: CN113628168A.

Priority Applications (1)

CN202110795519.3A, priority and filing date 2021-07-14: Target detection method and device

Publications (1)

CN113628168A, published 2021-11-09

Family

Family ID: 78379768

Family Applications (1)

CN202110795519.3A (pending): Target detection method and device

Country Status (1)

CN: CN113628168A

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019136623A1 (en) * 2018-01-10 2019-07-18 Nokia Technologies Oy Apparatus and method for semantic segmentation with convolutional neural network
KR20190119864A (en) * 2018-04-13 Inha University Industry-Academic Cooperation Foundation (인하대학교 산학협력단) Small object detection based on deep learning
KR102141302B1 (en) * 2019-03-04 2020-08-04 Ewha Womans University Industry-Academic Cooperation Foundation (이화여자대학교 산학협력단) Object detection method based on deep learning regression model and image processing apparatus
CN110501018A (en) * 2019-08-13 2019-11-26 Guangdong Xingyu Technology Co., Ltd. (广东星舆科技有限公司) Traffic sign information collection method for high-precision map production
CN112580561A (en) * 2020-12-25 2021-03-30 Shanghai Goldway Intelligent Transportation System Co., Ltd. (上海高德威智能交通系统有限公司) Target detection method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cheng-Yang Fu et al., "DSSD: Deconvolutional Single Shot Detector", arXiv:1701.06659v1 [cs.CV] *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination