CN113255699B - Small target object image detection method and device, electronic equipment and storage medium - Google Patents

Small target object image detection method and device, electronic equipment and storage medium

Info

Publication number
CN113255699B
CN113255699B (application CN202110645084.4A)
Authority
CN
China
Prior art keywords
network
small target
image data
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110645084.4A
Other languages
Chinese (zh)
Other versions
CN113255699A (en)
Inventor
陶家威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Huaray Technology Co Ltd
Original Assignee
Zhejiang Huaray Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Huaray Technology Co Ltd filed Critical Zhejiang Huaray Technology Co Ltd
Priority to CN202110645084.4A priority Critical patent/CN113255699B/en
Publication of CN113255699A publication Critical patent/CN113255699A/en
Application granted granted Critical
Publication of CN113255699B publication Critical patent/CN113255699B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The application discloses a small target object image detection method and apparatus, an electronic device, and a computer-readable storage medium. The method detects small target objects (such as dust) in image data in software, using an image detection model based on a convolutional neural network. When the convolutional neural network extracts image features, an attention mechanism is introduced to fuse the shallow features (local fine-grained features) of the backbone network with the deep features (overall coarse-grained semantic features) of the feature pyramid network, so that detail information is retained while semantic information is still extracted effectively, and both the presence and the specific position of a small target object in the input image data can be detected more accurately. No additional hardware is required, the method is simple and easy to implement, small target objects can be detected against different backgrounds, real-time performance is strong, and the applicable range is wide.

Description

Small target object image detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to a method and an apparatus for detecting an image of a small target object.
Background
If dust adheres to the camera sensor and the lens, white spots are generated in the captured image, which greatly affects the image quality. Therefore, it is necessary to detect dust on the camera sensor and the lens and to issue an alarm in time.
The existing dust detection methods are mainly classified into hardware-based detection methods and software-based detection methods.
The hardware-based detection method mainly judges whether the camera lens is covered with dust by detecting the voltage change at the output of an infrared receiving tube. However, this detection method is costly and not easy to implement.
The software-based detection method mainly uses image processing techniques to detect the white spots that dust produces in an image, and mainly comprises the following steps: photographing a white plane with the camera under test to obtain a test image; filtering the test image to obtain a filtered image, and computing a difference image from the test image and the filtered image; and calculating a threshold from the difference image and thresholding the difference image to detect dust. However, this method requires the camera to photograph a specific scene (a white plane), so dust cannot be detected in time while the camera is working, and the applicable range is therefore small.
Therefore, how to improve the software-based detection method so that the camera can detect dust in time while working remains an urgent technical problem.
Disclosure of Invention
In view of this, the applicant provides a small target object image detection method and apparatus, an electronic device, and a computer storage medium.
According to a first aspect of embodiments of the present application, there is provided a small target object image detection method, including: acquiring image data to be detected; detecting the image data to be detected by using a first image detection model to determine whether a small target object exists in the image data to be detected and obtain a first detection result, wherein the first image detection model is based on a convolutional neural network, the convolutional neural network comprises a backbone network and a feature pyramid network, and the convolutional neural network fuses shallow features of the backbone network and deep features of the feature pyramid network by using an attention mechanism; and returning the first detection result.
According to an embodiment of the present application, the shallow features of the backbone network include: a one-dimensional shallow feature obtained by applying dimensionality pooling to the shallow features extracted through downsampling in the backbone network.
According to an embodiment of the present application, the attention mechanism includes a spatial attention mechanism.
According to an embodiment of the present application, fusing the shallow features of the backbone network and the deep features of the feature pyramid network by using an attention mechanism includes: feeding back, through a skip connection under the attention mechanism, the shallow features extracted by downsampling in the backbone network into the downsampling process by which the feature pyramid network extracts its deep features, and thereby fusing the shallow features of the backbone network with the deep features of the feature pyramid network.
According to an embodiment of the present application, the process by which the feature pyramid network extracts deep features through downsampling includes N rounds of downsampling, where N is a natural number; correspondingly, fusing the shallow features of the backbone network with the deep features of the feature pyramid network includes: fusing the shallow features of the backbone network with the deep features obtained in the N-th round of downsampling of the feature pyramid network.
According to an embodiment of the present application, N is 1.
According to an embodiment of the present application, before detecting image data to be detected by using a first image detection model to determine whether a small target object exists in the image data to be detected to obtain a first detection result, the method further includes: constructing a machine learning model based on a convolutional neural network; acquiring an image data set with a label as training image data; and training the machine learning model by using the training image data to obtain a first image detection model.
According to a second aspect of the embodiments of the present application, there is provided a small target object image detection apparatus, the apparatus including: an image data acquisition module, configured to acquire image data to be detected; a small target object detection module, configured to detect the image data to be detected by using a first image detection model to determine whether a small target object exists in the image data to be detected and obtain a first detection result, wherein the first image detection model is based on a convolutional neural network, the convolutional neural network comprises a backbone network and a feature pyramid network, and the convolutional neural network fuses shallow features of the backbone network and deep features of the feature pyramid network by using an attention mechanism; and a detection result returning module, configured to return the first detection result.
According to a third aspect of the embodiments of the present application, there is provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; a memory for storing a computer program; and a processor for implementing the method steps of any one of the above-described small target object image detection methods when executing the program stored in the memory.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein a computer program, which when executed by a processor, implements the method steps of any one of the above-mentioned small target object image detection.
The embodiment of the application provides a small target object image detection method and apparatus. The method detects small target objects (such as dust) in image data in software, using an image detection model based on a convolutional neural network. When the convolutional neural network extracts image features, an attention mechanism is introduced to fuse the shallow features (local fine-grained features) of the backbone network with the deep features (overall coarse-grained semantic features) of the feature pyramid network, so that detail information is retained while semantic information is still extracted effectively, and both the presence and the specific position of a small target object in the input image data can be detected more accurately. No additional hardware is required, the method is simple and easy to implement, small target objects can be detected against different backgrounds, real-time performance is strong, and the applicable range is wide.
It is to be understood that an implementation of the present application need not achieve all of the above advantages at once; rather, a particular technical solution may achieve a particular technical effect, and other embodiments of the present application may achieve advantages not mentioned above.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic view of an implementation flow of an embodiment of a small target object image detection method according to the present application;
FIG. 2 is a schematic diagram of a neural network structure and a feature extraction flow according to another embodiment of the small target object image detection method of the present application;
fig. 3 is a schematic structural diagram of a small target object image detection apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present application more obvious and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
An existing image detection model often extracts features from image data by performing multiple convolution calculations based on a convolutional neural network. However, a small target object (such as dust, a scratch, a stain, or a fine defect) inherently contains very little pixel information, and its pixel proportion shrinks further in a picture with a large receptive field. Its feature information is therefore very likely to be lost after multiple convolutions, making small target objects difficult to detect.
Nevertheless, an image detection model based on a convolutional neural network can realize end-to-end image detection, is simple and feasible, can detect small target objects against different backgrounds, and offers strong real-time performance and a wide applicable range. Moreover, the technique of performing image detection with such a model is relatively mature and achieves high detection accuracy.
Therefore, the inventor of the present application does not abandon this approach. Instead, an attention mechanism is introduced into the convolutional neural network to fuse shallow features (local fine-grained features) with deep features (overall coarse-grained semantic features), so that detail information is retained while semantic information is still extracted effectively, and both the presence and the specific position of a small target object in the input image data can be detected more accurately.
Fig. 1 shows an implementation flow of an embodiment of the small target object image detection method according to the present application. Referring to fig. 1, the present embodiment provides a small target object image detection method, including: operation 110, acquiring image data to be detected; operation 120, detecting the image data to be detected by using a first image detection model to determine whether a small target object exists in it and obtain a first detection result, wherein the first image detection model is based on a convolutional neural network, the convolutional neural network comprises a backbone network and a feature pyramid network, and the convolutional neural network fuses shallow features of the backbone network and deep features of the feature pyramid network by using an attention mechanism; and operation 130, returning the first detection result.
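Operations 110 to 130 can be sketched as a minimal pipeline. All names below (`detect_small_targets`, `toy_detector`) are hypothetical illustrations, not part of the patent; the toy brightness-threshold detector merely stands in for the trained first image detection model.

```python
def detect_small_targets(image, detect_fn):
    # Operation 110: image data to be detected (here a 2-D list of pixel values).
    assert image and image[0], "image data must be non-empty"
    # Operation 120: the model reports whether a small target exists and where.
    boxes = detect_fn(image)
    first_result = {"has_small_target": len(boxes) > 0, "boxes": boxes}
    # Operation 130: return the first detection result.
    return first_result

def toy_detector(image, threshold=200):
    # Stand-in model: flag each pixel brighter than `threshold` as a 1x1 "dust" box.
    return [(r, c, 1, 1)
            for r, row in enumerate(image)
            for c, v in enumerate(row)
            if v > threshold]
```

For example, `detect_small_targets([[0, 255], [0, 0]], toy_detector)` reports one small target at row 0, column 1.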
In operation 110, the acquired image data to be detected may or may not contain a small target object. Here, the small target object is generally of a predetermined kind, for example, dust that may be present on a camera sensor or lens, scratches that may be present on glass, stains that may be present on cloth, or fine defects that may be present in a product under inspection.
In operation 120, the backbone network (Backbone) is the shallow convolutional network by which the convolutional neural network extracts features from the input image data.
A shallow feature is acquired through a smaller receptive field and focuses more on local fine-grained information, such as the detail texture of an image.
Because the shallow features of the backbone network are extracted directly from the image data with few transformation operations, they stay close to the original image data of the small target object. Small target objects are therefore easier to identify from the backbone network's shallow features.
Deep features are features extracted after multiple transformations and convolution calculations in the convolutional neural network. They are overall coarse-grained semantic features obtained after discarding less important details, are acquired through a large receptive field, and lend themselves to identifying large target objects. In the identification of a small target object, they help to locate it, that is, they serve as a reference for determining its specific position.
The Feature Pyramid Network (FPN) is a deeper convolutional network built on top of the backbone network, used to perform multi-scale feature enhancement on the features the backbone extracts. In other words, feature pyramids at different scales are constructed from the features extracted by the backbone, feature extraction is performed at each scale, and a multi-scale feature representation is generated in which the feature maps at every level, including the high-resolution shallow maps, carry strong semantic information.
The deep features in the feature pyramid network therefore often already incorporate part of the shallow features, but after repeated convolution and downsampling the pixel information those shallow features retain is small and may even be discarded entirely.
The attention mechanism is a network feedback mechanism that has attracted wide attention in recent years. Introducing it into a neural network lets the current decoder access the outputs of all encoders: those outputs are weighted and fed into the current decoder to influence its output, so that input-output alignment is achieved while context information close to the original data is fully exploited.
The inventor of the present application conceived that if the shallow features of the backbone network are fused with the deep features of the feature pyramid network through an attention mechanism, detail information can be fully retained while semantic information is effectively extracted, so that both the presence and the specific position of a small target object in the input image data can be detected more accurately.
Based on this idea, the inventor introduces an attention mechanism into the convolutional neural network of the image detection model to fuse the shallow features of the backbone network with the deep features of the feature pyramid network.
Unlike a conventional attention mechanism, however, the attention weights used in the present application are not learned from inter-feature relationships by a feedforward neural network. Instead, the shallow features of the backbone network are fed back directly to the deep features of the feature pyramid and fused in as the attention weights of those deep features. The shallow features of the backbone thus guide the deep features of the feature pyramid network, reinforcing the region of the small target in the deep feature map that repeated convolution and downsampling would otherwise dilute. In this way detail information is retained and semantic information is still extracted effectively.
When feeding the shallow features of a shallow convolutional layer back to a deep convolutional layer, any available feature feedback mechanism, such as a skip connection, may be used.
When fusing the attention weights into the deep features, any common fusion method may be used, such as weighting, element-wise multiplication, or another fusion function.
After operation 120 fuses the shallow and deep features into features that carry both detail and semantic information, the subsequent detection steps can accurately determine whether a small target object exists in the input image data and, if so, its specific position; this information is combined in a predetermined manner into the first detection result.
Then, the first detection result is returned through operation 130, and the detection of the small target object can be completed.
It should be noted that the embodiment shown in fig. 1 is only one basic embodiment of the small target object image detection method of the present application, and further refinement and expansion can be performed by an implementer on the basis of the embodiment.
According to an embodiment of the present application, the shallow features of the backbone network include: a one-dimensional shallow feature obtained by applying dimensionality pooling to the shallow features extracted through downsampling in the backbone network.
Dimensionality pooling into a one-dimensional shallow feature enhances and highlights the shallow feature, yields a better fusion effect, and greatly simplifies the subsequent feature fusion computation.
In addition, fusing the one-dimensional shallow feature with the deep features of the feature pyramid network (for example by dot multiplication or weighting) does not change the dimensions of the deep features, which further simplifies the calculation and improves the result.
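A minimal sketch of this fusion, under the assumption that the dimensionality pooling is a channel-wise max pool (the patent does not fix the pooling operator): the C-channel shallow feature collapses to a single-channel map, and element-wise multiplication with the deep feature leaves the deep feature's dimensions unchanged.

```python
def channel_max_pool(feat):
    """feat: list of C channel maps, each H x W -> one H x W map (max over C)."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[max(feat[c][i][j] for c in range(C)) for j in range(W)]
            for i in range(H)]

def fuse(weight_map, deep):
    """Multiply each channel of the deep feature by the single-channel weight map."""
    return [[[weight_map[i][j] * ch[i][j] for j in range(len(ch[0]))]
             for i in range(len(ch))] for ch in deep]

shallow = [[[1, 0], [0, 2]],      # channel 0 of the shallow feature
           [[0, 3], [1, 0]]]      # channel 1 of the shallow feature
deep    = [[[2, 2], [2, 2]]]      # one deep channel, same 2 x 2 spatial size

w = channel_max_pool(shallow)     # -> [[1, 3], [1, 2]]
fused = fuse(w, deep)             # -> [[[2, 6], [2, 4]]], deep shape unchanged
```

Note that `fused` has exactly as many channels as `deep`, illustrating why this style of fusion keeps the subsequent computation simple.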
According to an embodiment of the present application, the attention mechanism includes a spatial attention mechanism.
Image data typically contains spatial-domain features and channel-domain features. A spatial-domain feature is obtained by transforming the spatial information in the original image data into another space in which the key information is retained.
The spatial attention mechanism is introduced into this spatial transformation so that a trained spatial-domain feature transformer (spatial transformer) can find the regions of the picture that require attention. Because the spatial transformer can also perform rotation and scaling, the locally important information of the image data can be extracted through the transformation.
It can thus be seen that the spatial attention mechanism captures the locally important information of the image data well, thereby preserving the features of small target objects as far as possible.
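As a toy illustration of a spatial attention weight map (a sigmoid squashing is one common choice; the patent does not specify the exact function), high-response locations such as a dust spot receive weights near 1 while the background is suppressed:

```python
import math

def spatial_attention(response):
    # Squash a single-channel response map so each spatial location
    # gets a weight in (0, 1).
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in response]

# One strong response (a bright spot) amid weak background responses.
attn = spatial_attention([[-4.0, 4.0],
                          [ 0.0, -4.0]])
# attn[0][1] is close to 1; attn[0][0] and attn[1][1] are close to 0.
```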
According to an embodiment of the present application, fusing the shallow features of the backbone network and the deep features of the feature pyramid network by using an attention mechanism includes: feeding back, through a skip connection under the attention mechanism, the shallow features extracted by downsampling in the backbone network into the downsampling process by which the feature pyramid network extracts its deep features, and thereby fusing the shallow features of the backbone network with the deep features of the feature pyramid network.
Downsampling corresponds to shrinking an image; it is mainly used to reduce the amount of computation and enlarge the receptive field. Because a small target object carries little pixel information, its features are extremely easy to lose during downsampling, especially after several rounds.
A skip (jump) connection is a connection in which one layer of the neural network bypasses the next layer or several layers, creating inter-layer links between non-adjacent neurons.
When the backbone network downsamples, it retains more pixel information of the small target object than the downsampling performed inside the feature pyramid network. Feeding the backbone's downsampled shallow features, via a skip connection under the attention mechanism, into the downsampling process by which the feature pyramid network extracts deep features therefore lets those deep features absorb the backbone's shallow features, acquire more features of the small target object, and make the small target object easier to identify.
In addition, the skip connection aids back-propagation of the gradient and accelerates training, so the attention mechanism can take full effect.
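A skip connection can be sketched as follows; the layer functions here are arbitrary stand-ins, not the patent's convolutions:

```python
def layer_a(x):
    # Stands in for one intermediate layer (here: halve every activation).
    return [v * 0.5 for v in x]

def layer_b(x):
    # Stands in for a second intermediate layer (here: subtract 1).
    return [v - 1 for v in x]

def forward_with_skip(x):
    shallow = x                    # activation tapped at the shallow layer
    deep = layer_b(layer_a(x))     # activation after the two skipped layers
    # Inter-layer connection: fuse the skipped shallow activation back in.
    return [s + d for s, d in zip(shallow, deep)]

out = forward_with_skip([2.0, 4.0])   # [2.0 + 0.0, 4.0 + 1.0] = [2.0, 5.0]
```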
According to an embodiment of the present application, a process of obtaining deep features by downsampling and extracting a feature pyramid network includes N rounds of downsampling, where N is a natural number, and accordingly, a shallow feature of a backbone network and a deep feature of the feature pyramid network are fused, including: and fusing the shallow features of the backbone network and the deep features obtained by the N-th round of downsampling of the feature pyramid network.
Generally, the downsampling by which the feature pyramid network extracts deep features may proceed in multiple rounds; that is, rounds of upsampling and downsampling are repeated to keep screening out a smaller but more important set of features, which more readily yields a clearer, more definite detection result.
In this case, the attention mechanism may be introduced at any round of downsampling performed in the feature pyramid network, fusing the shallow features of the backbone network with the deep features obtained in that round.
According to an embodiment of the present application, N is 1.
If too many downsampling steps separate the backbone network's downsampling from the N-th downsampling round of the feature pyramid network, the scale difference becomes too large and the fusion of shallow and deep features suffers. When N is 1, that is, when the shallow features of the backbone network are fused with the deep features extracted in the first downsampling round of the feature pyramid network, the shallow feature map is twice the size of the deep one and the fusion effect is best.
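Under the N = 1 configuration the shallow map is twice the spatial size of the deep map, so one 2x2 pooling step (a plausible choice used purely for illustration) brings the two to the same resolution before fusion:

```python
def max_pool_2x2(fmap):
    # Halve each spatial dimension by taking the max over 2x2 blocks.
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, W, 2)]
            for i in range(0, H, 2)]

# A 4x4 shallow map, twice the size of a 2x2 deep map.
shallow_4x4 = [[1, 2, 0, 0],
               [3, 4, 0, 0],
               [0, 0, 5, 6],
               [0, 0, 7, 8]]
matched = max_pool_2x2(shallow_4x4)   # 2x2, now the deep map's resolution
```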
According to an embodiment of the present application, before detecting image data to be detected by using a first image detection model to determine whether a small target object exists in the image data to be detected to obtain a first detection result, the method further includes: constructing a machine learning model based on a convolutional neural network; acquiring an image data set with a label as training image data; and training the machine learning model by using the training image data to obtain a first image detection model.
The convolutional neural network on which the machine learning model is based is one that introduces an attention mechanism and fuses shallow and deep features through it. For its description, reference may be made to the description of the convolutional neural network in the image detection model, which is not repeated here.
In this embodiment, the annotated image dataset may be obtained by annotating an original image dataset and then splitting and preprocessing it, or it may be purchased or otherwise obtained from a third-party data provider.
When the machine learning model is trained by using the training image data, any suitable training method and tuning method can be used, and the method for detecting the small target object image is not limited in the application.
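The training stage described above (build a model, take a labelled dataset, fit it) can be sketched with a deliberately tiny stand-in: a one-weight logistic model on a single brightness feature replaces the convolutional network, since it is the loop structure (forward, loss gradient, update) that matters here, not the model.

```python
import math

def train(dataset, epochs=200, lr=0.5):
    # Plain stochastic gradient descent on a 1-D logistic model.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in dataset:                          # y: 1 if target present
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # forward pass
            grad = p - y                              # d(cross-entropy)/d(logit)
            w -= lr * grad * x                        # backward + update
            b -= lr * grad
    return w, b

# Labelled toy data: (max-brightness feature, label).
data = [(0.9, 1), (0.8, 1), (0.1, 0), (0.2, 0)]
w, b = train(data)

def predict(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5
```

After training, bright inputs (near the positive examples) are classified as containing a small target and dim inputs are not.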
The above embodiments are exemplary illustrations of how to further refine and expand on the basis of the basic embodiment shown in fig. 1, and an implementer may combine various implementations in the above embodiments to form a new embodiment according to specific implementation conditions and needs, so as to achieve a more ideal implementation effect.
Next, a specific implementation of the small target object image detection method according to another embodiment of the present application will be described with reference to fig. 2.
Fig. 2 shows the structure of a multilayer convolutional neural network according to another embodiment of the small target image detection method of the present application. This embodiment is applied to dust detection on a camera sensor or lens. As shown in fig. 2, the multilayer convolutional neural network of this embodiment includes a backbone network (the convolutional layers above the dashed box) and a feature pyramid network (the convolutional layers within the dashed box). After receiving the image data 200, the following steps are mainly performed to detect whether a dust image exists in the image data 200:
Step S2010, performing convolution calculation on the image data 200 to obtain an x-dimensional feature set 201 (shallow feature set), where x is a feature set dimension, generally an m-th power of 2, and m is a natural number;
step S2020, down-sampling the feature set 201 to obtain a 2 x-dimensional feature set 202 (shallow feature set);
step S2030, down-sampling the feature set 202 to obtain a 4 x-dimensional feature set 203 (shallow feature set);
step S2040, performing convolution calculation on the feature set 203 to obtain a 4 x-dimensional feature set 204 (shallow feature set);
step S2050, performing upsampling on the feature set 204, and combining the convolution result of the feature set 202 to obtain a 2 x-dimensional feature set 205;
step S2060, performing up-sampling on the feature set 205, and combining the convolution result of the feature set 201 to obtain an x-dimensional feature set 206;
step S2070, performing convolution calculation on the x-dimensional feature set 206 to obtain an x-dimensional feature set 207;
step S2080, down-sampling the feature set 207, weighting the down-sampled feature set (deep feature set) and feature set 201 (shallow feature set) by a spatial attention mechanism after down-sampling the one-dimensional shallow feature set subjected to dimensionality pooling, and fusing the obtained feature set with the feature set subjected to convolution of the 2 x-dimensional feature set 205;
When the feature set (deep feature set) and the feature set 201 (shallow feature set) obtained by down-sampling are subjected to channel pooling by an attention mechanism, the one-dimensional shallow feature set and the feature set after convolution of the 2 x-dimensional feature set 205 are fused, the following method is adopted.
Assuming that the guided layer is Pi in the bottom-up feature pyramid and the guiding layer is Pi-1 in the backbone network, the fusion is represented by the following formulas:

A^i = σ(Conv([P_max(D(F_B^(i-1))); P_avg(D(F_B^(i-1)))]))

F_Fb^i = A^i ⊗ D(F_Fb^(i-1)) + Conv(F_Fu^i)

wherein the superscript i represents the Pi-layer network and takes values in (3, 6); the subscript Fu denotes the top-down path of the feature pyramid; the subscript Fb denotes the bottom-up path of the feature pyramid; the subscript B denotes the backbone network; P_max(·) and P_avg(·) denote the maximum pooling channel compression operation and the average pooling channel compression operation, respectively; and D(·) denotes down-sampling.
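The attention-guided weighting described above can be sketched in PyTorch roughly as follows. This is a minimal illustration under CBAM-style assumptions — the module name `SpatialAttentionFusion`, the 7×7 convolution, and the sigmoid normalization are not specified in the text — and is not the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionFusion(nn.Module):
    """Weight a deep (bottom-up pyramid) feature with a spatial attention
    map derived from a shallow backbone feature: the shallow feature is
    channel-pooled (max + average), combined by a small convolution, and
    down-sampled to the deep feature's spatial size."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2 input channels: max-pooled and average-pooled channel compressions
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Channel compression of the shallow (backbone) feature set
        max_c, _ = shallow.max(dim=1, keepdim=True)   # max pooling over channels
        avg_c = shallow.mean(dim=1, keepdim=True)     # average pooling over channels
        attn = torch.sigmoid(self.conv(torch.cat([max_c, avg_c], dim=1)))
        # Down-sample the 1-channel attention map to the deep feature's size
        attn = F.interpolate(attn, size=deep.shape[-2:], mode="nearest")
        # Weight the deep feature by the spatial attention map
        return deep * attn

# Usage sketch: an x-dimensional shallow feature guides a down-sampled deep feature
fusion = SpatialAttentionFusion()
shallow = torch.randn(1, 64, 80, 80)   # backbone feature (x = 64)
deep = torch.randn(1, 128, 40, 40)     # pyramid feature after down-sampling
out = fusion(shallow, deep)
print(out.shape)  # torch.Size([1, 128, 40, 40])
```

The subsequent fusion with the convolved top-down feature (e.g., feature set 205) would then typically be an element-wise addition or concatenation; the text leaves that detail open.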
Since the feature set 207 is down-sampled in step S2080, and the feature set 201 originally has the same size as the feature set 207, the feature set 201 must also be down-sampled so that its size matches that of the down-sampled feature set 207 during fusion.
Therefore, after the feature set 201 is channel-pooled into a one-dimensional feature, it is down-sampled to the same size as the down-sampled feature set 207 before being used to weight it.
Step S2090, performing convolution calculation on the feature set obtained by the fusion in step S2080 (which incorporates the down-sampled x-dimensional feature set 201) to obtain a 2x-dimensional feature set 208;
step S2100, down-sampling the feature set 208; weighting the down-sampled feature set (deep feature set) through the spatial attention mechanism, using the feature set 202 (shallow feature set) after it has been channel-pooled into a one-dimensional shallow feature set and then down-sampled; and fusing the resulting feature set with the convolved 4x-dimensional feature set 204;
when the down-sampled feature set (deep feature set) is weighted through the attention mechanism by the channel-pooled feature set 202 (shallow feature set) and the result is fused with the convolved 4x-dimensional feature set 204, the method adopted is similar to that of step S2080, and details are not repeated here.
In step S2110, a convolution calculation is performed on the feature set obtained by the fusion in step S2100 (which incorporates the down-sampled 2x-dimensional feature set 202) to obtain a 4x-dimensional feature set 209.
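For orientation, the backbone and top-down portion of the data flow (steps S2010 through S2060) can be sketched as follows. The layer widths, kernel sizes, and helper names (`conv_in`, `down1`, `lat1`, and so on) are illustrative assumptions, and the attention-guided bottom-up steps S2080–S2110 are omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim = 64  # x: base feature dimension, a power of 2 (here 2**6)

conv_in  = nn.Conv2d(3, x_dim, 3, padding=1)                        # S2010
down1    = nn.Conv2d(x_dim, 2 * x_dim, 3, stride=2, padding=1)      # S2020
down2    = nn.Conv2d(2 * x_dim, 4 * x_dim, 3, stride=2, padding=1)  # S2030
conv_mid = nn.Conv2d(4 * x_dim, 4 * x_dim, 3, padding=1)            # S2040
lat2     = nn.Conv2d(4 * x_dim, 2 * x_dim, 1)  # channel reduction for merge
lat1     = nn.Conv2d(2 * x_dim, x_dim, 1)

image = torch.randn(1, 3, 160, 160)          # image data 200
f201 = conv_in(image)                        # x-dim shallow feature set 201
f202 = down1(f201)                           # 2x-dim shallow feature set 202
f203 = down2(f202)                           # 4x-dim shallow feature set 203
f204 = conv_mid(f203)                        # 4x-dim feature set 204
# S2050 / S2060: top-down path, upsampling plus lateral merges
f205 = F.interpolate(lat2(f204), scale_factor=2) + f202   # 2x-dim set 205
f206 = F.interpolate(lat1(f205), scale_factor=2) + f201   # x-dim set 206
print(f206.shape)  # torch.Size([1, 64, 160, 160])
```

The bottom-up path (feature sets 207–209) would then reuse these tensors together with the attention-guided weighting of steps S2080–S2110.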
Through the above operations, dust on the camera's acquisition device or lens can be detected without adding any hardware, which greatly saves hardware cost. The background does not need to be restricted to a white board, and dust can be detected against different backgrounds, so real-time dust detection can be performed while the camera is working, and an alarm can be raised in time so that the dust is removed promptly, preventing the dust from degrading image quality.
Furthermore, the inventors of the present application compared, on a private dataset, small target object detection using the existing scheme (i.e., the same convolutional neural network without the attention mechanism) with detection using the above steps: the average precision of the former is 0.887, while that of the latter is 0.903. Hence, after the attention mechanism is introduced, the small target object detection method of the present application improves the accuracy of small target object detection.
It should be noted that the embodiment shown in fig. 2 is only an exemplary illustration of the small target object image detection method of the present application, and is not a limitation to the application scenario or implementation of the embodiment of the present application.
Further, an embodiment of the present application also provides an apparatus for detecting an image of a small target object. As shown in fig. 3, the apparatus 30 includes: an image data obtaining module 301, configured to obtain image data to be detected; a small target object detection module 302, configured to detect the image data to be detected by using a first image detection model to determine whether a small target object exists in the image data to be detected, so as to obtain a first detection result, where the first image detection model is based on a convolutional neural network, the convolutional neural network includes a backbone network and a feature pyramid network, and the convolutional neural network fuses shallow features of the backbone network and deep features of the feature pyramid network by using an attention mechanism; and a detection result returning module 303, configured to return the first detection result.
According to an embodiment of the present application, the small target object detection module 302 is specifically configured to feed back, through a skip connection and by using the attention mechanism, the shallow features obtained by down-sampling extraction in the backbone network into the process by which the feature pyramid network obtains deep features through down-sampling extraction, so as to fuse the shallow features of the backbone network with the deep features of the feature pyramid network.
According to an embodiment of the present application, the process of obtaining the deep features by downsampling and extracting the feature pyramid network includes N rounds of downsampling, where N is a natural number, and accordingly, the small target object detection module 302 is specifically configured to fuse the shallow features of the backbone network and the deep features obtained by the N-th round of downsampling of the feature pyramid network.
According to an embodiment of the present application, the apparatus 30 further includes: the machine learning model building module is used for building a machine learning model based on a convolutional neural network; the training image data acquisition module is used for acquiring an image data set with labels as training image data; and the machine learning model training module is used for training the machine learning model by using the training image data to obtain a first image detection model.
According to a third aspect of the embodiments of the present application, there is provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus; a memory for storing a computer program; and a processor for implementing the method steps of any one of the above-described small target object image detection methods when executing the program stored in the memory.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the method steps of any one of the above-mentioned small target object image detection methods.
Here, it should be noted that: the above description on the embodiment of the small target object image detection apparatus, the above description on the embodiment of the electronic device, and the above description on the embodiment of the computer storage medium are similar to the description on the foregoing method embodiments, and have similar beneficial effects to the foregoing method embodiments, and therefore, no further description is given. For the technical details that have not been disclosed in the description of the embodiment of the small target object image detection apparatus, the description of the embodiment of the electronic device, and the description of the embodiment of the computer storage medium, please refer to the description of the foregoing method embodiments of the present application for understanding, and therefore, for brevity, will not be repeated.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of a unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another device, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable storage medium, a Read Only Memory (ROM), a magnetic disk, and an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a removable storage medium, a ROM, a magnetic disk, an optical disk, or the like, which can store the program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A small target object image detection method, characterized in that the method comprises:
acquiring image data to be detected;
detecting the image data to be detected by using a first image detection model to judge whether a small target object exists in the image data to be detected to obtain a first detection result,
wherein the first image detection model is based on a convolutional neural network, the convolutional neural network comprises a backbone network and a feature pyramid network, and the convolutional neural network uses an attention mechanism to fuse shallow features obtained by down-sampling extraction in the backbone network with deep features obtained by the N-th round of down-sampling of the feature pyramid network,
when fusing, the guided layer is Pi in the bottom-up feature pyramid and the guiding layer is Pi-1 in the backbone network, which is represented by the following formulas:

A^i = σ(Conv([P_max(D(F_B^(i-1))); P_avg(D(F_B^(i-1)))]))

F_Fb^i = A^i ⊗ D(F_Fb^(i-1)) + Conv(F_Fu^i)

wherein the superscript i represents the Pi-layer network and takes values in (3, 6); the subscript Fu denotes the top-down path of the feature pyramid; the subscript Fb denotes the bottom-up path of the feature pyramid; the subscript B denotes the backbone network; P_max(·) and P_avg(·) denote the maximum pooling channel compression operation and the average pooling channel compression operation, respectively; D(·) denotes down-sampling; and N and i are both natural numbers;
and returning the first detection result.
2. The method of claim 1, wherein the shallow features of the backbone network comprise: and performing dimensionality pooling on the shallow features obtained by the downsampling extraction of the backbone network to obtain one-dimensional shallow features.
3. The method of claim 1, wherein the attention mechanism comprises a spatial attention mechanism.
4. The method of claim 1, wherein fusing, by using an attention mechanism, the shallow features obtained by down-sampling extraction in the backbone network with the deep features obtained by the N-th round of down-sampling of the feature pyramid network comprises:
feeding back, through a skip connection and by using the attention mechanism, the shallow features obtained by down-sampling extraction in the backbone network into the process by which the feature pyramid network obtains the deep features through the N-th round of down-sampling extraction, and fusing the shallow features obtained by down-sampling extraction in the backbone network with the deep features obtained by the N-th round of down-sampling of the feature pyramid network.
5. The method of claim 4, wherein N is 1.
6. The method according to claim 1, wherein before the detecting the image data to be detected by using the first image detection model to determine whether the small target object exists in the image data to be detected to obtain the first detection result, the method further comprises:
constructing a machine learning model based on the convolutional neural network;
acquiring an image data set with a label as training image data;
and training the machine learning model by using the training image data to obtain the first image detection model.
7. An apparatus for detecting an image of a small target object, the apparatus comprising:
the image data acquisition module is used for acquiring image data to be detected;
the small target object detection module is configured to detect the image data to be detected by using a first image detection model to determine whether a small target object exists in the image data to be detected, so as to obtain a first detection result, wherein the first image detection model is based on a convolutional neural network, the convolutional neural network comprises a backbone network and a feature pyramid network, and the convolutional neural network uses an attention mechanism to fuse the shallow features obtained by down-sampling extraction in the backbone network with the deep features obtained by the N-th round of down-sampling of the feature pyramid network; when fusing, the guided layer is Pi in the bottom-up feature pyramid and the guiding layer is Pi-1 in the backbone network, which is represented by the following formulas:

A^i = σ(Conv([P_max(D(F_B^(i-1))); P_avg(D(F_B^(i-1)))]))

F_Fb^i = A^i ⊗ D(F_Fb^(i-1)) + Conv(F_Fu^i)

wherein the superscript i represents the Pi-layer network and takes values in (3, 6); the subscript Fu denotes the top-down path of the feature pyramid; the subscript Fb denotes the bottom-up path of the feature pyramid; the subscript B denotes the backbone network; P_max(·) and P_avg(·) denote the maximum pooling channel compression operation and the average pooling channel compression operation, respectively; D(·) denotes down-sampling; and N and i are both natural numbers;
and the detection result returning module is used for returning the first detection result.
8. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is configured to store a computer program; and the processor is configured to implement the method steps of any one of claims 1 to 6 when executing the program stored in the memory.
9. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-6.
CN202110645084.4A 2021-06-10 2021-06-10 Small target object image detection method and device, electronic equipment and storage medium Active CN113255699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110645084.4A CN113255699B (en) 2021-06-10 2021-06-10 Small target object image detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110645084.4A CN113255699B (en) 2021-06-10 2021-06-10 Small target object image detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113255699A CN113255699A (en) 2021-08-13
CN113255699B true CN113255699B (en) 2022-01-18

Family

ID=77187261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110645084.4A Active CN113255699B (en) 2021-06-10 2021-06-10 Small target object image detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113255699B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240172B (en) * 2022-07-12 2023-04-07 哈尔滨市科佳通用机电股份有限公司 Relieving valve loss detection method based on deep learning
CN116486230B (en) * 2023-04-21 2024-02-02 哈尔滨工业大学(威海) Image detection method based on semi-recursion characteristic pyramid structure and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN111242071A (en) * 2020-01-17 2020-06-05 陕西师范大学 Attention remote sensing image target detection method based on anchor frame
CN111797846A (en) * 2019-04-08 2020-10-20 四川大学 Feedback type target detection method based on characteristic pyramid network
CN111832509A (en) * 2020-07-21 2020-10-27 中国人民解放军国防科技大学 Unmanned aerial vehicle weak and small target detection method based on space-time attention mechanism
CN112784779A (en) * 2021-01-28 2021-05-11 武汉大学 Remote sensing image scene classification method based on feature pyramid multilevel feature fusion


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CT image pulmonary nodule detection fusing an attention mechanism and a feature pyramid network; Zhang Fuling et al.; Journal of Image and Graphics; 2021-05-19 (No. 9); pp. 1-13 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310051 8 / F, building a, 1181 Bin'an Road, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Zhejiang Huarui Technology Co.,Ltd.

Address before: 310051 8 / F, building a, 1181 Bin'an Road, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: ZHEJIANG HUARAY TECHNOLOGY Co.,Ltd.

GR01 Patent grant