CN112446239A - Neural network training and target detection method, device and storage medium


Info

Publication number
CN112446239A
CN112446239A (application CN201910806390.4A)
Authority
CN
China
Prior art keywords
image
feature map
training
neural network
training image
Prior art date
Legal status
Pending
Application number
CN201910806390.4A
Other languages
Chinese (zh)
Inventor
赵颖
刘殿超
张观良
付万豪
Current Assignee
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority: CN201910806390.4A
Publication: CN112446239A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V 20/10 — Scenes; Scene-specific elements; Terrestrial scenes
    • G06N 3/045 — Computing arrangements based on biological models; Neural networks; Architecture; Combinations of networks
    • G06N 3/08 — Neural networks; Learning methods
    • G06V 10/25 — Image preprocessing; Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/267 — Image preprocessing; Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds


Abstract

The present disclosure provides a neural network training and target detection method, apparatus, and storage medium. The training method comprises the following steps: receiving a training image containing a detection target, the training image being either an annotated source domain image or an unlabeled target domain image; generating a feature map of the training image based on its appearance characteristics and environment attribute information; generating an attention feature map of the training image based on the feature map; predicting, based on the attention feature map, whether the training image is a source domain image or a target domain image, and predicting the location of the detection target; and determining the overall loss of the neural network based on the domain prediction loss and the localization prediction loss, and updating the parameters of the neural network according to the overall loss.

Description

Neural network training and target detection method, device and storage medium
Technical Field
The present disclosure relates to image processing, and more particularly, to a neural network training method and apparatus for object detection, a neural network-based object detection method and apparatus, and a storage medium.
Background
In recent years, neural networks have been widely used in the field of image processing and have become an effective means of detecting targets in captured images. Before a neural network is applied to a target detection scenario, it usually must be trained with a large amount of training data in order to obtain a model with sufficient detection accuracy.
Existing methods for training neural networks for target detection usually require large, manually annotated training data sets. Such data sets are difficult to obtain and demand extensive manual labeling, making the training process time-consuming and labor-intensive. Moreover, the detection target may be located in different environments, and images of it captured in different environments may differ significantly; a neural network trained on images captured in one environment therefore typically generalizes poorly to images captured in a new environment and is not robust to it.
Therefore, there is a need for a neural-network-based target detection technique that reduces the manual labeling workload required to train the network and improves generalization to new environments.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a training method of a neural network for target detection, including: receiving a training image containing a detection target, the training image being either an annotated source domain image or an unlabeled target domain image; generating a feature map of the training image based on its appearance characteristics and environment attribute information; generating an attention feature map of the training image based on the feature map; predicting, based on the attention feature map, whether the training image is a source domain image or a target domain image, and predicting the location of the detection target; and determining an overall loss of the neural network based on the domain prediction loss and the localization prediction loss, and updating the parameters of the neural network according to the overall loss.
According to another aspect of the present disclosure, there is provided a training apparatus of a neural network for target detection, including: a processor; and a memory having computer program instructions stored therein, wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps of: receiving a training image containing a detection target, the training image being either an annotated source domain image or an unlabeled target domain image; generating a feature map of the training image based on its appearance characteristics and environment attribute information; generating an attention feature map of the training image based on the feature map; predicting, based on the attention feature map, whether the training image is a source domain image or a target domain image, and predicting the location of the detection target; and determining an overall loss of the neural network based on the domain prediction loss and the localization prediction loss, and updating the parameters of the neural network according to the overall loss.
According to another aspect of the present disclosure, there is provided a training apparatus of a neural network for target detection, including: an image receiving unit configured to receive a training image containing a detection target, the training image being either an annotated source domain image or an unlabeled target domain image; a feature map generation unit configured to generate a feature map of the training image based on its appearance characteristics and environment attribute information; an attention feature map generation unit configured to generate an attention feature map of the training image based on the feature map; a prediction unit configured to predict, based on the attention feature map, whether the training image is a source domain image or a target domain image, and to predict the location of the detection target; and a parameter updating unit configured to determine an overall loss of the neural network based on the domain prediction loss and the localization prediction loss, and to update the parameters of the neural network according to the overall loss.
According to another aspect of the present disclosure, there is provided a neural-network-based target detection method, including: receiving an input image; generating a feature map of the input image with the neural network based on appearance characteristics and environment attribute information of the input image; generating an attention feature map of the input image with the neural network based on the feature map; and locating the detection target in the input image with the neural network based on the attention feature map.
According to another aspect of the present disclosure, there is provided a neural-network-based target detection apparatus, including: a processor; and a memory having computer program instructions stored therein, wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps of: receiving an input image; generating a feature map of the input image with the neural network based on appearance characteristics and environment attribute information of the input image; generating an attention feature map of the input image with the neural network based on the feature map; and locating the detection target in the input image with the neural network based on the attention feature map.
According to another aspect of the present disclosure, there is provided a neural-network-based target detection apparatus, including: an image receiving unit configured to receive an input image; a feature map generation unit configured to generate a feature map of the input image with the neural network based on appearance characteristics and environment attribute information of the input image; an attention feature map generation unit configured to generate an attention feature map of the input image with the neural network based on the feature map; and a detection target locating unit configured to locate the detection target in the input image with the neural network based on the attention feature map.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the training method for a neural network for object detection described in the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the neural network-based object detection method described in the embodiments of the present disclosure.
Drawings
These and/or other aspects and advantages of the present disclosure will become more apparent and more readily appreciated from the following detailed description of the embodiments of the present disclosure, taken in conjunction with the accompanying drawings of which:
Fig. 1 illustrates a schematic scenario to which the object detection technique of an embodiment of the present disclosure may be applied.
Fig. 2 illustrates problems of existing target detection techniques in the specific scenario of solar panel detection.
Fig. 3 shows a flowchart of a method of training a neural network for target detection according to an embodiment of the present disclosure.
Fig. 4 shows a flowchart of an exemplary method of generating a feature map based on appearance characteristics and environment attribute information of a training image according to an embodiment of the present disclosure.
Fig. 5 shows a flowchart of an exemplary method of generating an attention feature map of a training image based on its feature map according to an embodiment of the present disclosure.
Fig. 6 illustrates an exemplary block diagram for generating an attention feature map according to an embodiment of the present disclosure.
Fig. 7 shows a flowchart of an exemplary method of updating neural network parameters based on losses according to an embodiment of the present disclosure.
Fig. 8 shows a flowchart of another exemplary method of generating an attention feature map based on the feature map of a training image according to an embodiment of the present disclosure.
Fig. 9 shows a flowchart of another exemplary method of updating neural network parameters based on losses according to an embodiment of the present disclosure.
Fig. 10 shows a flowchart of an exemplary method of making predictions based on global and local attention feature maps according to an embodiment of the present disclosure.
Fig. 11 shows a flowchart of an exemplary method of predicting the location of a detection target within a training image according to an embodiment of the present disclosure.
Fig. 12 shows a flowchart of another exemplary method of updating neural network parameters based on losses according to an embodiment of the present disclosure.
Fig. 13 shows a flowchart of a neural-network-based target detection method according to an embodiment of the present disclosure.
Fig. 14 shows a schematic hardware block diagram of a training apparatus for a neural network for target detection according to an embodiment of the present disclosure.
Fig. 15 shows a schematic structural block diagram of a training apparatus for a neural network for target detection according to an embodiment of the present disclosure.
Fig. 16 shows a schematic hardware block diagram of a neural-network-based target detection apparatus according to an embodiment of the present disclosure.
Fig. 17 shows a schematic structural block diagram of a neural-network-based target detection apparatus according to an embodiment of the present disclosure.
Detailed Description
For a better understanding of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
An exemplary scenario to which the object detection technique of the embodiments of the present disclosure may be applied is first described with reference to fig. 1. As shown in fig. 1, solar panels are among the most important devices of a photovoltaic power station, and their fault detection and maintenance are essential for the station's reliable operation. A common current practice is to perform fault detection and maintenance by analyzing images of the photovoltaic power station captured by automated equipment such as drones. Specifically, a camera is mounted on the front of the automated equipment, such as a drone, to capture images of the photovoltaic power station within its field of view. The location of the solar panels can then be determined by analyzing the captured images with various image processing techniques, in particular the neural network techniques that have developed rapidly in recent years; whether a panel is faulty, and if so the fault region, can further be detected, and the detection result fed back to a fault maintenance system for subsequent panel maintenance. However, as illustrated in fig. 2, two problems arise. First, when image analysis is performed with a neural network for target detection, the network must be trained in advance, and training requires a large number of manually annotated training images carrying foreground/background segmentation results or other kinds of annotations, which significantly increases the time and labor cost of training. Second, photovoltaic power stations may be built in different environments and on various complex terrains, such as water surfaces or mountains, whereas the network may be trained only on images from one or a few environments; during detection, the trained network may then fail to deliver the expected accuracy on images whose environments differ substantially from those of the training images, i.e., its generalization capability is poor.
In view of this, to reduce the manual annotation workload required for training and to give the neural network strong generalization capability across environments, the present disclosure provides a target detection technique based on deep learning with an attention mechanism. On the one hand, when training the neural network for target detection, not only annotated images but also unlabeled images can be used, reducing the need to manually annotate training images. On the other hand, when extracting features from a training image containing the detection target, both the appearance characteristics of the image and the environment attribute information at the time it was captured are taken into account, and the attention mechanism drives the network to learn features that are invariant across domains, enhancing its adaptability to different detection scenarios and thus providing higher generalization capability. The object detection technique of the present disclosure is introduced below in two parts: training of the neural network for target detection, and target detection based on the trained network.
It should be understood that, for convenience of explanation, the target detection technique of the present disclosure is described here and below using a solar panel as the example target; however, the technique may also be applied to other detection scenarios, such as railways or rooftops, to detect rails, rooftop solar panels, and the like, and the present disclosure is not limited in this respect.
Neural network training method
Fig. 3 shows a flowchart of a method of training a neural network for target detection according to an embodiment of the present disclosure. In the embodiments of the present disclosure, the parameters of the neural network to be trained may be initial values or values obtained after some degree of prior learning. To give the network stronger target detection performance, training and learning can be continued to further improve it. The training method is described in detail below with reference to fig. 3.
As shown in fig. 3, in step S101, a training image containing a detection target is received, the training image being either an annotated source domain image or an unlabeled target domain image. In the embodiments of the present disclosure, the training image may be obtained in advance in various forms. For example, it may be a still image captured by a camera mounted on a drone, or a frame of a video. It may be a grayscale image or a color image; no limitation is imposed here.
It should be noted that, in the embodiments of the present disclosure, the training images are divided into annotated source domain images and unlabeled target domain images. An annotated source domain image is one that has been manually annotated, for example with a foreground/background segmentation result, a target coordinate localization result, or a similar annotation; an unlabeled target domain image carries no manual annotation. Both kinds of images are captured under different detection scenarios and/or different capturing conditions; the essential difference between them is that source domain images are fully annotated, whereas target domain images are unannotated and therefore cannot be used in existing neural network training procedures.
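For illustration only, the following Python (PyTorch) sketch shows one way such mixed training data might be organized, with annotated source images carrying segmentation masks and unlabeled target images carrying only a domain flag. The class name, the sentinel mask convention, and the data layout are assumptions of this sketch, not details from the present disclosure.

```python
import torch
from torch.utils.data import Dataset

class SourceTargetDataset(Dataset):
    """Hypothetical mixed dataset: annotated source + unlabeled target images."""
    def __init__(self, source_pairs, target_images):
        # source_pairs: list of (image, mask) tensors; target_images: list of images
        self.items = [(img, mask, 0) for img, mask in source_pairs] \
                   + [(img, None, 1) for img in target_images]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        img, mask, domain = self.items[idx]
        if mask is None:                              # unlabeled target image
            mask = torch.full_like(img[:1], -1.0)     # sentinel: no annotation
        return img, mask, domain                      # domain: 0=source, 1=target
```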
In step S102, a feature map of the training image is generated based on its appearance characteristics and environment attribute information. As noted above, images of a detection target captured in different environments may differ significantly; these differences are mainly caused by different environmental conditions, such as time, temperature, and terrain, and/or different capturing conditions, such as tilt angle and exposure. In weighing the contribution of training images during training, the inventors observed that detection targets in a batch of images with similar environmental and/or capturing conditions resemble each other more closely, so such images exhibit stronger transferability between domains and are therefore better suited for network learning. Accordingly, in the embodiments of the present disclosure, the environment attribute information of the capturing environment is taken into account as a factor in network learning, in addition to the appearance characteristics of the training image itself.
In the embodiments of the present disclosure, the appearance characteristics may be characteristics of the image content itself (e.g., color and texture), and the environment attribute information may be the capturing condition information at the time the training image was taken (e.g., shooting angle and exposure) and/or information about the environment in which the detection target is located (e.g., terrain, temperature, and time/season). Both kinds of information, appearance characteristics and environment attribute information, are fully taken into account when generating the feature map of the training image, so that training images with similar content and similar environmental conditions can contribute more during training and yield a better training result, as described in more detail below. Various methods may be used to generate the feature map from the appearance characteristics and the environment attribute information; for completeness of explanation, an exemplary method is described below with reference to fig. 4.
As shown in fig. 4, in step S1021, image features are extracted from the training image to generate its appearance feature map. For example, the training image may be subjected to convolution, pooling, and normalization operations to extract the appearance feature map. Specifically, the training image may be convolved with a convolution kernel to obtain a convolution map; the convolution map may then be processed with a rectified linear unit and batch normalization to obtain a normalized convolution map; and max or average pooling may then be applied to obtain the appearance feature map. In addition, to obtain rich, multi-scale features, convolution kernels with different parameters may be applied to the training image repeatedly, with the above convolution, pooling, and normalization operations, to obtain multiple appearance feature maps.
In step S1022, the environment attribute information of the capturing environment of the training image is acquired, and an environment attribute feature map of the training image is generated from it. Various methods may be used. As an illustrative example, the environment attribute information may be converted into an environment attribute matrix, and the environment attribute feature map extracted from that matrix. The environment attribute information may be information stored at the time the training image was captured and may carry specific semantics, for example the environmental and/or imaging conditions described above. The stored semantic information may be converted into a multidimensional matrix by methods such as 2D embedding, word2vec, or combined natural language processing (NLP) and computer vision (CV) techniques, and the environment attribute feature map may then be extracted from the multidimensional matrix in the same manner as the appearance feature map, e.g., by convolution, pooling, and normalization.
In step S1023, the appearance feature map and the environment attribute feature map are combined to generate the feature map. They may be combined by concatenation, summation, or similar operations; for example, each appearance feature map of the training image may be concatenated with the environment attribute feature map along the channel dimension to obtain a corresponding feature map of the training image. Optionally, further features may be extracted from the combined result, e.g., by applying convolution, pooling, and normalization again, and the further extracted map used as the feature map of the training image.
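As a concrete illustration of steps S1021-S1023, the following sketch extracts an appearance feature map with a small convolutional stack, embeds categorical environment attributes into a per-pixel map, and concatenates the two along the channel dimension. The module name, attribute vocabulary size, and channel counts are illustrative assumptions; the present disclosure does not prescribe a particular architecture.

```python
import torch
import torch.nn as nn

class EnvAwareFeatureExtractor(nn.Module):
    """Sketch of steps S1021-S1023: appearance + environment-attribute features."""
    def __init__(self, num_env_attrs=16, env_dim=8, app_channels=32):
        super().__init__()
        # S1021: convolution + batch normalization + ReLU + pooling
        self.appearance = nn.Sequential(
            nn.Conv2d(3, app_channels, 3, padding=1),
            nn.BatchNorm2d(app_channels), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(app_channels, app_channels, 3, padding=1),
            nn.BatchNorm2d(app_channels), nn.ReLU(),
        )
        # S1022: embed categorical environment attributes (terrain, season, ...)
        self.env_embed = nn.Embedding(num_env_attrs, env_dim)

    def forward(self, image, env_attr_ids):
        app = self.appearance(image)                    # (B, C, H, W)
        env = self.env_embed(env_attr_ids).mean(dim=1)  # pool attribute vectors
        env = env[:, :, None, None].expand(-1, -1, app.shape[2], app.shape[3])
        # S1023: combine by channel-wise concatenation
        return torch.cat([app, env], dim=1)

feat = EnvAwareFeatureExtractor()(torch.rand(2, 3, 64, 64),
                                  torch.randint(0, 16, (2, 3)))
```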
Returning to fig. 3, in step S103, an attention feature map of the training image is generated based on the feature map. In the embodiments of the present disclosure, an attention mechanism from deep learning may be used to extract the attention feature map from the feature map of the training image. Such a mechanism resembles human selective visual attention: its core is to select, from many pieces of information, those most relevant to the current task. By extracting an attention feature map, more of the detailed information about the detection target that deserves focus can be extracted, while the learning of irrelevant information is weakened. An exemplary method of generating the attention feature map of a training image from its feature map is described below with reference to figs. 5 and 6.
As shown in fig. 5, in step S1031, a global attention map of the training image is generated based on the feature map. Various methods may be used; for completeness of explanation, an exemplary method is described below with reference to fig. 6. As shown in fig. 6, one or more feature maps of the training image are taken as input and subjected to convolution, normalization, and pooling operations to extract features; features at different levels may also be extracted by repeating these operations with different parameters. The extracted features are then activated with a nonlinear activation function to accumulate information across all channels of the input feature map(s). Finally, a SoftMax operation is applied spatially to the accumulated feature map, generating an attention map at the level of the entire training image. The generated attention map can be viewed as a weight map: its weight values reflect whether the corresponding portions of the feature map deserve focused attention, so that portions containing transferable features receive sufficient attention during the network's learning while portions containing irrelevant information are suitably attenuated.
In step S1032, a global attention feature map of the training image is generated based on the feature map and the generated global attention map. Referring again to fig. 6, after the attention map is generated from the input feature map, it may be applied to the input feature map(s) by element-wise multiplication (Hadamard product), yielding one or more global attention feature maps of the training image. Through this attention-based feature extraction, features likely to contain more details of interest contribute more to the overall learning.
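A minimal rendering of the pipeline of fig. 6 might look as follows: a 1×1 convolution scores each position, a nonlinear activation and a spatially applied SoftMax turn the scores into a weight map summing to one over all locations, and the weights are applied to the feature map by element-wise (Hadamard) multiplication. All layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttention(nn.Module):
    """Sketch of steps S1031-S1032: attention map + attention feature map."""
    def __init__(self, in_channels):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)  # per-pixel score

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, _, h, w = feat.shape
        s = torch.sigmoid(self.score(feat))       # nonlinear activation
        # SoftMax applied spatially: weights over all H*W positions
        attn = F.softmax(s.view(b, 1, h * w), dim=-1).view(b, 1, h, w)
        return feat * attn, attn                  # Hadamard product, weight map

att_feat, attn_map = GlobalAttention(40)(torch.rand(2, 40, 32, 32))
```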
Returning to fig. 3, in step S104, whether the training image is a source domain image or a target domain image is predicted based on the attention feature map, and the location of the detection target is predicted. In the embodiments of the present disclosure, after the global attention feature map reflecting image-level characteristics is generated as described above, it can be used, on the one hand, to predict whether the training image is a source domain or a target domain image and, on the other hand, to predict the location of the detection target within the training image.
For example, the neural network may classify the training image based on its global attention feature map to predict whether it is a source domain image or a target domain image. The concrete implementation of prediction in neural networks is well known in the art and is not described here. Likewise, the location of the detection target within the training image can be predicted from the global attention feature map. Depending on the localization requirement, e.g., coordinate localization of the target or foreground/background segmentation of the image containing it, this step may predict the concrete coordinate position of the target in the training image or its foreground segmentation result.
Optionally, since it is known in advance whether the training image is an annotated source domain image or an unlabeled target domain image, the two cases may be handled differently. For a source domain image, which carries an annotation related to the target's location, supervised learning can subsequently be performed from the loss between the localization prediction and the ground-truth location in the annotation; in step S104, therefore, both the domain of the image and the location of the target are predicted, the latter for subsequent comparison with the annotation. For a target domain image, which lacks a location annotation, an exact localization loss cannot subsequently be obtained for supervised learning; in step S104 only the domain of the image may therefore be predicted, the localization prediction being omitted, and in subsequent operations learning may proceed, for example, with a preset localization prediction loss, as described below.
In step S105, the overall loss of the neural network is determined based on the domain prediction loss and the localization prediction loss, and the parameters of the neural network are updated according to the overall loss. An exemplary method of updating the parameters based on these losses is described below with reference to fig. 7.
As shown in fig. 7, in step S1051, the global prediction loss of predicting whether the training image is a source domain image or a target domain image is calculated. Since it is known whether the image is an annotated source domain image or an unlabeled target domain image, the difference between this known domain and the domain predicted in step S104 can be computed as the global prediction loss.
In step S1052, the localization prediction loss of predicting the location of the detection target within the training image is calculated, for which various methods may be used based on the localization prediction of step S104. If the training image is an annotated source domain image, the difference between the manual annotation and the localization prediction of step S104 can be computed as the localization prediction loss. If the training image is an unlabeled target domain image, lacking an annotation for supervised learning, the accuracy of the prediction can be roughly estimated by comparing the prediction against the context similarity of the training image, and the localization prediction loss calculated therefrom.
Alternatively, as described for step S104, the localization prediction may be omitted for unlabeled target domain images, in which case the localization prediction loss for such images may simply be set to a preset value in step S1052.
In step S1053, the overall loss of the neural network is determined based on the global prediction loss and the localization prediction loss, and the parameters of the neural network are updated accordingly. To give the network good performance in target localization as well as in extracting and recognizing domain-invariant features, the overall loss is determined from both losses so as to comprehensively reflect the current performance of the network, and the parameters are updated on that basis so that performance keeps improving.
In the disclosed embodiments, the overall loss of the neural network is positively correlated with the localization prediction loss and negatively correlated with the domain prediction loss. Concretely, the obtained loss may be propagated back through the layers of the neural network, updating the parameters of the relevant layers, such as the convolutional layers involved in generating the feature map and the attention feature map, so as to minimize the overall loss. When the overall loss is minimized, e.g., reduced below a predetermined threshold, the network on the one hand achieves acceptable target localization performance and, on the other hand, becomes poor at domain prediction, which indicates precisely that it has learned domain-invariant features and can no longer distinguish the source domain from the target domain, and can therefore adapt to the needs of various detection scenarios.
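One standard way to realize an overall loss that is negatively correlated with the domain prediction loss is a gradient reversal layer, as in domain-adversarial training; the present disclosure does not name this technique, so the sketch below should be read as one possible realization rather than the disclosed method. The classifier architecture and λ value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in the
    backward pass, so the feature extractor is pushed to *confuse* the domain
    classifier while the classifier itself is trained normally."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

# Usage sketch: minimizing (localization loss + domain loss) then acts, for
# the feature extractor, like an overall loss negatively correlated with the
# domain prediction loss.
domain_classifier = nn.Sequential(nn.Flatten(), nn.LazyLinear(2))
features = torch.rand(4, 40, 32, 32, requires_grad=True)   # attention features
dom_logits = domain_classifier(GradReverse.apply(features, 1.0))
dom_loss = F.cross_entropy(dom_logits, torch.tensor([0, 0, 1, 1]))  # 0=source
```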
According to the above training method, both annotated source domain images and unlabeled target domain images can be used to train the neural network, relaxing the strict requirement of existing methods that all training images be manually annotated. Moreover, learning considers not only the appearance characteristics of the training images but also the environment attribute information at capture time, uses the attention mechanism to strengthen the network's attention to training images that are similar in both respects, and updates the network based on both the domain prediction loss and the localization prediction loss, so that the network learns transferable features that are invariant across domains and adapts better to different detection scenarios. This attention-based deep learning approach yields a neural network that both reduces the manual annotation workload of training and generalizes better to new environments.
In the embodiment described above, after the feature map of the training image is obtained in step S102, a global attention feature map is generated from it at the level of the entire image, and the network parameters are updated accordingly in subsequent steps S103 to S105. The disclosure is not limited to this: in the training process of a neural network according to another embodiment of the present disclosure, local characteristics of local positions of the training image may be considered instead of its global characteristics.
As mentioned above, a photovoltaic power station may be built in different environments, such as on water surfaces or in mountainous areas, and images of it contain both the solar panels, i.e., the target to be detected, and background regions such as the water surface or mountains. The target looks highly similar across images captured in different environments, so it is highly transferable in the adaptive learning of the neural network; the background regions, by contrast, differ greatly across environments and share no common characteristics, so they are not transferable in learning. Based on these considerations, the embodiments of the present disclosure provide a selection strategy for local regions containing transferable features, so that during learning the neural network focuses more on the more important regions of the image and learns domain-invariant features, improving the adaptability of the model to different scenarios. In this embodiment, the feature map of the training image may be generated as in steps S101 and S102 of fig. 3, and the network is updated in steps S103 to S105 according to the local characteristics of the training image. This embodiment is described in detail below with reference to figs. 8 and 9.
Specifically, for step S103 of fig. 3, the operation of generating the global attention feature map described in the previous embodiment with reference to fig. 5 may be replaced by the steps described below with reference to fig. 8.
As shown in fig. 8, in step S1031', after the feature map of the training image is obtained in step S102, the feature map is divided into local regions. It may be partitioned in various ways, e.g., based on thresholds, regions, or edges. As an example implementation, relatively important parts of the training image may be determined from its context similarity using a superpixel extraction technique, and the corresponding local regions determined on the feature map via the correspondence between the training image and the feature map generated from it; these serve as local regions that are more likely to contain domain-invariant features and are therefore important to learn.
In step S1032', a local attention map is generated for the portion of the feature map within each local region. As described above, one or more feature maps may be generated from the training image, each covering the full extent of the image; after the division of step S1031', each feature map thus has, for every local region, a portion falling within that region. Similarly to the method described with reference to figs. 5 and 6, for each local region the portions of the feature map(s) within it are taken as input and subjected to convolution and related operations to generate the local attention map of that region. As before, the generated local attention map can be viewed as a weight map reflecting the focal points within the region that deserve attention.
In step S1033', a local attention feature map is generated for each local region of the training image based on the portion of the feature map within that region and the generated local attention map. As in the method described with reference to figs. 5 and 6, for each local region the generated local attention map may be applied to the portions of the feature map(s) within the region, producing the corresponding local attention feature map(s) of that region, so that features within the region likely to contain more details of interest contribute more to the overall learning.
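To make steps S1031'-S1033' concrete, the sketch below divides the feature map into a fixed grid of local regions (a simplification of the superpixel-based, context-driven division described above) and applies an attention module, here the GlobalAttention module from the earlier sketch, to each region independently. The grid size and function name are assumptions.

```python
import torch
# GlobalAttention is the module defined in the earlier sketch.

def local_attention_maps(feat, attn_module, grid=4):
    """Sketch of S1031'-S1033': per-region attention over a fixed grid."""
    b, c, h, w = feat.shape
    rh, rw = h // grid, w // grid
    out = torch.zeros_like(feat)
    for i in range(grid):
        for j in range(grid):
            region = feat[:, :, i*rh:(i+1)*rh, j*rw:(j+1)*rw]
            att_region, _ = attn_module(region)   # local attention map applied
            out[:, :, i*rh:(i+1)*rh, j*rw:(j+1)*rw] = att_region
    return out

local_feat = local_attention_maps(torch.rand(2, 40, 32, 32), GlobalAttention(40))
```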
Subsequently, in contrast to the prediction based on the global attention feature map in step S104 of the previous embodiment, in this embodiment the local attention feature maps of the local regions are used: on the one hand, for each local region, whether the training image to which it belongs is a source domain image or a target domain image is predicted from its local attention feature map; on the other hand, the location of the detection target within the training image is predicted from the local attention feature maps. Prediction from local attention feature maps proceeds analogously to prediction from the global attention feature map described above, with two points to note. Domain prediction is performed per local region, yielding a local prediction result for each region. Localization prediction combines the local attention feature maps of all regions, and the localization result of the detection target, e.g., a prediction of its concrete coordinate position or of the foreground segmentation, is obtained by integration. As an exemplary implementation, the local attention feature maps may be assembled into a global attention feature map according to the positions of their regions, and the target's location predicted from that map.
Next, in contrast to the update based on the global prediction and localization prediction results in step S105 of the previous embodiment, in this embodiment the network parameters may be updated based on the obtained local prediction results and the localization prediction result. An exemplary method is described below with reference to fig. 9.
As shown in fig. 9, in step S1051', the local prediction loss of predicting, for each local region, whether the training image to which it belongs is a source domain image or a target domain image is calculated. Similarly to the loss calculation of step S1051 of fig. 7, the local prediction loss of each region may be computed from the known domain of the training image and the per-region domain prediction.
In step S1052', the localization prediction loss of predicting the location of the detection target within the training image is calculated, in a manner similar to that described for step S1052 of fig. 7; details are not repeated here.
In step S1053', the overall loss of the neural network is determined based on the local prediction losses and the localization prediction loss, and the parameters of the network are updated accordingly. In this embodiment, the overall loss is positively correlated with the localization prediction loss and negatively correlated with the sum of the per-region domain prediction losses, and the parameters may be updated so as to minimize the overall loss.
According to the training method of this embodiment, by selecting the more important local regions of the training image for focused learning, parts containing more transferable features contribute more during training, while regions without transferable features, such as background regions, contribute less; the network thus learns features that are invariant across domains, and its adaptability to different detection scenarios is enhanced.
In the two embodiments described above, after the feature map of the training image is obtained in step S102, steps S103 to S105 generate either a global attention feature map or local attention feature maps and update the network parameters according to the global or the local characteristics of the image. The disclosure is not limited to this: in the training process of a neural network according to another embodiment of the present disclosure, learning may be based on a multi-context attention mechanism that considers both the global characteristics of the training image and the local characteristics of local positions. In this embodiment, the feature map may be generated as in steps S101 and S102 of fig. 3, and the network updated in steps S103 to S105 according to both the global and the local characteristics. This embodiment is described in detail below with reference to figs. 10 to 12.
Specifically, whereas the two embodiments above generate only one of the global attention feature map and the local attention feature maps in step S103, in this embodiment both the global attention feature map of the training image and the local attention feature maps of its local regions are generated, using the methods described above with reference to figs. 5 and 8; details are not repeated here.
Next, whereas the two embodiments above predict from either the global or the local attention feature maps in step S104, in this embodiment the domains of the training image and the location of the detection target are predicted from the global attention feature map together with the local attention feature maps, as described below with reference to fig. 10.
As shown in fig. 10, in step S1041″, whether the training image is a source domain image or a target domain image is predicted from the global attention feature map. In step S1042″, whether the training image to which each local region belongs is a source domain image or a target domain image is predicted from each local attention feature map. The prediction methods of steps S1041″ and S1042″ are the same as those described for step S104 in the two embodiments above and are not repeated here.
In step S1043″, the location of the detection target within the training image may be predicted from the global attention feature map and the local attention feature maps. Various methods are possible; for completeness of explanation, an exemplary method is described below with reference to fig. 11.
As shown in fig. 11, in step S1043″-a, an overall attention feature map of the training image is generated from the local attention feature maps, e.g., by assembling them according to the positions of their respective local regions.
In step S1043″-b, an optimized attention feature map of the training image is generated from the global attention feature map and the generated overall attention feature map. As one possible implementation, the two maps may be fused by point-wise summation or by concatenation. Alternatively, several convolutional layers may first be applied to each map separately, and the results then fused into the optimized attention feature map. Optionally, attention feature extraction may be applied once more to the optimized map, so that the portions of the fused map likely to contain important information contribute more to network learning.
In step S1043″-c, the location of the detection target within the training image is predicted from the optimized attention feature map. As above, this step may predict the concrete coordinate position of the target in the training image or its foreground segmentation result.
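Steps S1043″-a and S1043″-b might be realized as in the following sketch, which fuses the global attention feature map with the overall map assembled from the local attention feature maps by point-wise summation or channel concatenation; the function name and the default choice of summation are assumptions.

```python
import torch

def fuse_attention(global_feat, overall_local_feat, mode="sum"):
    """Sketch of S1043''-b: fuse the global attention feature map with the
    overall map assembled from the local attention feature maps."""
    if mode == "sum":                                   # point-by-point sum
        return global_feat + overall_local_feat
    return torch.cat([global_feat, overall_local_feat], dim=1)  # concatenation

optimized = fuse_attention(torch.rand(2, 40, 32, 32), torch.rand(2, 40, 32, 32))
```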
Next, an exemplary method of updating the network parameters based on losses that consider both the global and the local characteristics of the training image is described with reference to fig. 12. As shown in fig. 12, in step S1051″, the global prediction loss of predicting whether the training image is a source domain image or a target domain image is calculated. In step S1052″, the local prediction loss of predicting, for each local region, whether the training image to which it belongs is a source domain image or a target domain image is calculated. In step S1053″, the localization prediction loss of predicting the location of the detection target within the training image is calculated. These losses may be computed by methods similar to those described above with reference to steps S1051-S1052 of fig. 7 and steps S1051'-S1052' of fig. 9, and are not repeated here.
In step S1054″, the overall loss of the neural network is determined based on the global prediction loss, the local prediction losses, and the localization prediction loss, and the parameters of the network are updated accordingly. In the embodiments of the present disclosure, the overall loss is positively correlated with the localization prediction loss and negatively correlated with the sum of the global prediction loss and the local prediction losses, and the parameters may be updated so as to minimize the overall loss.
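Putting the three loss terms together, the multi-context overall loss described here could be written as in the following sketch, where λ (lam) is an assumed weighting hyperparameter and the signs match the stated correlations (positive for localization, negative for domain prediction); in practice the negative terms are usually realized with gradient reversal, as in the earlier sketch, rather than by negating the loss directly.

```python
def multi_context_loss(loc_loss, global_dom_loss, local_dom_losses, lam=0.1):
    """Sketch of step S1054'': overall loss positively correlated with the
    localization loss and negatively correlated with the global and local
    domain prediction losses (lam is an assumed weighting factor)."""
    return loc_loss - lam * (global_dom_loss + sum(local_dom_losses))

total = multi_context_loss(0.8, 0.6, [0.5, 0.7])   # -> 0.8 - 0.1 * 1.8 = 0.62
```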
According to this training method of the present disclosure, the more important local regions of the training image are selected for focused learning, so that parts containing more transferable features contribute more during training; by combining the global and local characteristics of the training image, the extracted features fully reflect the correlation between the environment attribute information and the image appearance characteristics; and under this global-plus-local multi-context learning mechanism, the extracted features are further ensured to be domain-invariant, enhancing the network's adaptability to different detection scenarios.
Target detection method
Fig. 13 shows a flowchart of a neural network-based target detection method according to an embodiment of the present disclosure. In embodiments of the present disclosure, the neural network may be trained in advance, e.g., obtained through the training method described above, and thus possess a certain target detection capability. The neural network can be used to locate targets in various detection scenes, such as photovoltaic power stations, railways, and roofs.
As shown in fig. 13, in step S201, an input image is received. The received input image may be a still image captured by a camera or a video frame of a video sequence, and may be a grayscale image or a color image, which is not limited herein. In addition, the input image may have been captured under various environmental and/or shooting conditions, or in a scene different from that of the training images used during training; the detection method of the present disclosure retains a certain adaptability to images from such varied detection scenes.
In step S202, a feature map of the input image is generated using the neural network, based on the appearance characteristics and environmental attribute information of the input image. In the embodiment of the present disclosure, when generating a feature map of the input image for detection, not only the appearance characteristics of the input image itself but also the environmental attribute information at the time of capture are considered, in order to obtain a better detection result. The method of generating the feature map in this step is similar to the method of generating the feature map from the training image in step S102 of fig. 3, so it is only briefly described here. First, image feature extraction is performed on the input image to generate an appearance feature map. Then, the environment attribute information of the shooting environment of the input image is acquired, and an environment attribute feature map is generated from it. Finally, the appearance feature map and the environment attribute feature map are combined to generate the feature map.
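As a rough illustration of this step, the sketch below combines an appearance branch with an environment attribute branch; the toy two-layer backbone, the three-element attribute vector, and the class name FeatureMapGenerator are assumptions, not the architecture prescribed by the disclosure.

```python
# Minimal sketch of step S202 (assumptions: PyTorch, a toy CNN backbone,
# and an environment attribute vector such as [illumination, temperature,
# altitude] tiled into an environment attribute matrix).
import torch
import torch.nn as nn

class FeatureMapGenerator(nn.Module):
    def __init__(self, env_dim=3, channels=64):
        super().__init__()
        # Appearance branch: extracts an appearance feature map.
        self.appearance = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU())
        # Environment branch: extracts features from the attribute matrix.
        self.env = nn.Sequential(nn.Conv2d(env_dim, channels, 1), nn.ReLU())

    def forward(self, image, env_attrs):
        app_map = self.appearance(image)                     # B x C x H x W
        b, _, h, w = app_map.shape
        # Tile the attribute vector into an environment attribute matrix.
        env_matrix = env_attrs.view(b, -1, 1, 1).expand(-1, -1, h, w).contiguous()
        env_map = self.env(env_matrix)
        # Combine by channel-wise concatenation (summation also works).
        return torch.cat([app_map, env_map], dim=1)          # B x 2C x H x W

fmap = FeatureMapGenerator()(torch.randn(2, 3, 128, 128), torch.randn(2, 3))
```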
In step S203, an attention feature map of the input image is generated using the neural network, based on the feature map. In the embodiment of the present disclosure, an attention mechanism in deep learning may be adopted to extract an attention feature map from the feature map of the input image, so that more detailed information about the detection target is extracted while other, less useful information is suppressed. Corresponding to the various embodiments described for the training process above, this step may generate, from the feature map of the input image: (1) a global attention feature map of the input image; or (2) local attention feature maps of the local regions of the input image; or (3) both the global attention feature map of the input image and the local attention feature maps of its local regions; details are not repeated here.
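The three cases can be sketched with a single attention module applied globally or per local region; the 1x1-convolution-plus-sigmoid attention form and the 2x2 grid below are assumptions, as the disclosure does not fix them.

```python
# Minimal sketch of step S203 (assumptions: a single-channel spatial
# attention map and a 2x2 grid of local regions).
import torch
import torch.nn as nn

class AttentionExtractor(nn.Module):
    def __init__(self, channels, grid=2):
        super().__init__()
        self.grid = grid
        self.att = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def global_attention(self, fmap):
        # Case (1): weight the whole feature map by its attention map.
        return fmap * self.att(fmap)

    def local_attention(self, fmap):
        # Case (2): split into local regions and attend to each separately.
        out = []
        for row in fmap.chunk(self.grid, dim=2):
            for region in row.chunk(self.grid, dim=3):
                out.append(region * self.att(region))
        return out  # row-major list of local attention feature maps
```

Case (3) simply invokes both methods on the same feature map.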
In step S204, the detection target in the input image is located using the neural network, based on the attention feature map. The following describes target localization separately for each kind of attention feature map that may be generated in step S203.
Specifically, in case (1) above, the detection target within the input image may be located based on the global attention feature map of the input image, which reflects the global characteristics of the input image at the whole-image level. Depending on the positioning requirements, either the specific coordinate position of the detection target in the input image or a foreground segmentation result of the detection target may be obtained from the global attention feature map.
In case (2) above, the detection target within the input image may be located based on the local attention feature maps of the input image. In this case, the local characteristics of the input image are taken into account, and the input image is analyzed as a whole by combining the local attention feature maps of the respective local regions to locate the detection target, for example, to obtain its specific coordinate position in the input image or a foreground segmentation result. As an exemplary implementation, the local attention feature maps may be integrated into an overall attention feature map according to the corresponding positions of the local regions, and the detection target may then be located according to the overall attention feature map.
In case (3) above, the detection target within the input image may be located based on both the global attention feature map and the local attention feature maps of the input image, i.e., under a multi-context attention mechanism that considers both the global and local characteristics of the input image. An exemplary implementation is briefly described below. First, an overall attention feature map of the input image may be generated based on the respective local attention feature maps, for example by integrating the local attention feature maps according to the corresponding positions of the respective local regions. Then, an optimized attention feature map of the input image may be generated based on the global attention feature map and the generated overall attention feature map, for example by fusing the two through point-by-point summation or concatenation. Finally, the detection target may be located based on the optimized attention feature map, for example by obtaining its specific coordinate position in the input image or a foreground segmentation result.
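As an illustration of the final localization step, the sketch below derives both a coordinate prediction and a foreground segmentation logit from the optimized attention feature map; the single-box regression head and the class name LocalizationHead are assumptions.

```python
# Minimal sketch of localization from the optimized attention feature map
# (assumptions: one box per image and a 1x1-conv foreground mask head).
import torch
import torch.nn as nn

class LocalizationHead(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.box = nn.Linear(channels, 4)      # (x, y, w, h) coordinates
        self.mask = nn.Conv2d(channels, 1, 1)  # per-pixel foreground logit

    def forward(self, optimized_map):
        coords = self.box(self.pool(optimized_map).flatten(1))
        foreground = self.mask(optimized_map)
        return coords, foreground
```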
According to the neural-network-based detection method of the present disclosure, both the appearance characteristics of the input image to be detected and the environmental attribute information at the time of its capture are considered, so that the information potentially contained in the input image is fully exploited. In addition, the attention mechanism ensures that the parts of the image containing more important information receive sufficient attention during detection; the global and local characteristics of the input image can be considered jointly, and the multi-context attention mechanism enhances adaptability to different detection scenes.
Neural network training device
According to another aspect of the present disclosure, a training apparatus for a neural network for target detection is provided, and the training apparatus 1400 is described in detail below in conjunction with fig. 14.
FIG. 14 shows a hardware block diagram of a training apparatus according to an embodiment of the present disclosure. As shown in FIG. 14, training device 1400 includes a processor U1401 and a memory U1402.
The processor U1401 may be any device with processing capability that can implement the functions of the embodiments of the present disclosure; for example, it may be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Memory U1402 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, as well as other removable/non-removable, volatile/nonvolatile computer system storage media, such as hard drives, floppy disks, CD-ROMs, DVD-ROMs, or other optical storage media.
In this embodiment, computer program instructions are stored in the memory U1402, and the processor U1401 may execute them. When executed by the processor, the computer program instructions cause the processor to perform the neural network training method of the embodiments of the present disclosure, which is substantially the same as that described above with respect to figs. 3-10 and is therefore not repeated here. Examples of the training apparatus include computers, servers, workstations, and the like. After training, the training apparatus may provide the trained neural network to other devices so that those devices can use the network for target detection, or the training apparatus itself may use the network for target detection.
According to another aspect of the present disclosure, a training apparatus for a neural network for target detection is provided, and the training apparatus 1500 is described in detail below in conjunction with fig. 15.
Fig. 15 shows a block diagram of a training apparatus of a neural network for target detection according to an embodiment of the present disclosure. As shown in fig. 15, the training apparatus 1500 includes an image receiving unit U1501, a feature map generation unit U1502, an attention feature map generation unit U1503, a prediction unit U1504, and a neural network updating unit U1505. These components may respectively perform the steps/functions of the training method described above in connection with figs. 3-10; therefore, to avoid repetition, only a brief description of the apparatus is given below, and details already described are omitted.
The image receiving unit U1501 receives a training image containing a detection target, the training image belonging to one of an annotated source domain image and an unlabeled target domain image. The training image may be obtained in advance in various ways: for example, it may be a still image acquired by a camera mounted on a drone, or a video frame of a video sequence. It may also be a grayscale image or a color image, which is not limited herein.
The feature map generation unit U1502 generates a feature map of the training image based on its appearance characteristics and environment attribute information. In this embodiment, the feature map generation unit U1502 may perform image feature extraction on the training image to generate an appearance feature map. It may then acquire the environment attribute information of the shooting environment of the training image and generate an environment attribute feature map from it, for example by converting the environment attribute information into an environment attribute matrix and extracting the environment attribute feature map from that matrix. Finally, the feature map generation unit U1502 may combine the appearance feature map and the environment attribute feature map, for example by concatenation or summation, to generate the feature map.
The attention feature map generation unit U1503 generates an attention feature map of the training image based on the feature map. In one example, considering the global characteristics of the training image at the whole-image level, the attention feature map generation unit U1503 may generate a global attention map of the training image based on its feature map, and then generate the global attention feature map of the training image based on the feature map and the generated global attention map. In another example, considering the local characteristics of the training image, the attention feature map generation unit U1503 may divide the feature map of the training image into local regions, generate a local attention map for the part of the feature map within each local region, and then generate a local attention feature map for each local region based on that part of the feature map and the generated local attention map. In yet another example, the attention feature map generation unit U1503 may generate both the global attention feature map of the training image and the local attention feature maps of the respective local regions, taking into account both global and local characteristics.
The prediction unit U1504 predicts, based on the attention feature map, whether the training image is a source domain image or a target domain image, and predicts the location of the detection target. Depending on whether the attention feature map generation unit U1503 generates the global attention feature map, the local attention feature maps of the respective local regions, or both, the prediction unit U1504 makes the corresponding predictions based on the generated global and/or local attention feature maps, as in the examples described below.
In one example, the prediction unit U1504 may predict whether the training image is a source domain image or a target domain image based on the global attention feature map generated by the attention feature map generation unit U1503, and predict the location of the detection target within the training image, for example its specific coordinate position or a foreground segmentation result. In another example, the prediction unit U1504 may predict, based on each local attention feature map and in units of local regions, whether the training image to which each local region belongs is a source domain image or a target domain image, and predict the location of the detection target by combining the local attention feature maps, for example by integrating them into an overall attention feature map according to the corresponding positions of the local regions and predicting the location from that overall map. In yet another example, the prediction unit U1504 may predict whether the training image is a source domain image or a target domain image based on the global attention feature map, predict whether the training image to which each local region belongs is a source domain image or a target domain image based on each local attention feature map, and predict the location of the detection target within the training image based on both the global and local attention feature maps. In this last example, the prediction unit U1504 may generate an optimized attention feature map for the training image based on the global attention feature map and the overall attention feature map generated from the respective local attention feature maps, and then predict the location of the detection target based on the optimized attention feature map.
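A domain classifier of the kind the prediction unit U1504 relies on can be sketched as follows; the tiny convolutional architecture and the class name DomainClassifier are assumptions, since the disclosure leaves the classifier design open. One instance may be applied to the global attention feature map and one to each local attention feature map.

```python
# Minimal sketch of a domain classifier (assumption: a small conv network
# that maps an attention feature map to a single domain logit).
import torch
import torch.nn as nn

class DomainClassifier(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1))  # logit: source (0) vs. target (1)

    def forward(self, attention_feature_map):
        return self.net(attention_feature_map)
```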
The neural network updating unit U1505 determines the overall loss of the neural network based on the loss of predicting whether the training image is a source domain image or a target domain image and the loss of predicting the location of the detection target, and updates the parameters of the neural network according to the overall loss. After the prediction unit U1504 makes predictions based on the global and/or local characteristics of the training image, the neural network updating unit U1505 updates the network parameters based on the corresponding prediction results, as in the examples described below.
In one example, where the prediction unit U1504 predicts based on the global attention feature map, the neural network updating unit U1505 may calculate a global prediction loss for predicting whether the training image is a source domain image or a target domain image and a localization prediction loss for predicting the location of the detection target within the training image, determine the overall loss of the neural network from these two losses, and update the network parameters accordingly. In this example, the overall loss is positively correlated with the localization prediction loss and negatively correlated with the domain prediction loss, and the parameter update minimizes the overall loss.
In another example, where the prediction unit U1504 predicts based on the local attention feature maps, the neural network updating unit U1505 may calculate a local prediction loss for predicting whether the training image to which each local region belongs is a source domain image or a target domain image and a localization prediction loss for predicting the location of the detection target, determine the overall loss from these two losses, and update the network parameters accordingly. In this example, the overall loss is positively correlated with the localization prediction loss and negatively correlated with the sum of the per-region domain prediction losses, and the parameter update minimizes the overall loss.
In yet another example, where the prediction unit U1504 predicts based on both the global and local attention feature maps, the neural network updating unit U1505 may calculate a global prediction loss, the per-region local prediction losses, and a localization prediction loss as above, determine the overall loss from these three losses, and update the network parameters accordingly. In this example, the overall loss is positively correlated with the localization prediction loss and negatively correlated with the sum of the global prediction loss and the local prediction losses, and the parameter update minimizes the overall loss.
According to the training apparatus for a neural network for target detection of the present disclosure, the neural network can be trained with both annotated source domain images and unlabeled target domain images, which reduces the manual annotation workload required for training. Moreover, the training considers not only the appearance characteristics of the training images but also the environmental attribute information at the time they were captured. The attention mechanism strengthens the network's attention to training images that are similar in both respects, and the selection of more important local regions for focused learning takes the global and local characteristics of the training images fully into account, enabling the neural network to learn transferable, domain-invariant features and enhancing its adaptability to different detection scenes.
Object detection device
According to another aspect of the present disclosure, a neural network-based object detection apparatus is provided, and the object detection apparatus 1600 is described in detail below in conjunction with fig. 16.
Fig. 16 shows a hardware block diagram of an object detection device according to an embodiment of the present disclosure. As shown in fig. 16, the object detection apparatus 1600 includes a processor U1601 and a memory U1602.
The processor U1601 may be any device with processing capability that can implement the functions of the embodiments of the present disclosure; for example, it may be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Memory U1602 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, and may also include other removable/non-removable, volatile/nonvolatile computer system storage media, such as a hard drive, floppy disk, CD-ROM, DVD-ROM, or other optical storage media.
In this embodiment, computer program instructions are stored in the memory U1602, and the processor U1601 may execute them. When executed by the processor, the computer program instructions cause the processor to perform the neural-network-based object detection method of the embodiments of the present disclosure, which is substantially the same as that described above with respect to fig. 13 and is therefore not repeated here. Examples of the object detection device include a computer embedded in a drone, an on-board computer, a background server, and the like. During target detection, the images obtained by the drone may be analyzed in real time to locate the detection target; alternatively, the drone may send its images to a background server, which analyzes them with the trained neural network and returns the results for maintenance.
According to yet another aspect of the present disclosure, a neural network-based object detection apparatus is provided, and the object detection apparatus 1700 is described in detail below with reference to fig. 17.
Fig. 17 shows a block diagram of the structure of an object detection apparatus according to an embodiment of the present disclosure. As shown in fig. 17, the object detection apparatus 1700 includes an image receiving unit U1701, a feature map generation unit U1702, an attention feature map generation unit U1703, and a detected object localization unit U1704. The respective components may respectively perform the respective steps/functions of the neural network-based object detection method described above in conjunction with fig. 13, and thus, in order to avoid repetition, only a brief description of the apparatus will be given below, and a detailed description of the same details will be omitted.
The image receiving unit U1701 receives an input image, which may be a still image captured by a camera or a video frame of a video sequence, and may be a grayscale image or a color image, which is not limited herein. In addition, the input image may have been captured under various environmental and/or shooting conditions, or in a scene different from that of the training images used during training; the detection apparatus of the present disclosure retains a certain adaptability to images from such varied detection scenes.
The feature map generation unit U1702 generates a feature map of the input image using the neural network, based on the appearance characteristics and environment attribute information of the input image. It considers both the appearance characteristics of the input image itself and the environmental attribute information at the time of capture: for example, it may perform image feature extraction on the input image to generate an appearance feature map; then acquire the environment attribute information of the shooting environment and generate an environment attribute feature map from it; and finally combine the appearance feature map and the environment attribute feature map to generate the feature map.
The attention feature map generation unit U1703 generates an attention feature map of the input image using the neural network, based on the feature map. Similar to step S203 of fig. 13 described above, the attention feature map generation unit U1703 may generate, from the feature map of the input image: (1) a global attention feature map of the input image; or (2) local attention feature maps of the local regions of the input image; or (3) both the global attention feature map and the local attention feature maps of the local regions; details are not repeated here.
The detection target positioning unit U1704 locates the detection target in the input image using the neural network, based on the attention feature map. Depending on which kind of attention feature map the attention feature map generation unit U1703 generates, the detection target positioning unit U1704 locates the detection target accordingly, as follows.
For example, in case (1) above, the detection target positioning unit U1704 may locate the detection target within the input image based on the global attention feature map of the input image, which reflects the global characteristics of the input image at the whole-image level; for example, the specific coordinate position of the detection target in the input image or a foreground segmentation result may be obtained.
In case (2) above, the detection target positioning unit U1704 may locate the detection target within the input image based on the local attention feature maps of the input image, analyzing the input image as a whole according to its local characteristics by combining the local attention feature maps of the respective local regions. For example, the detection target positioning unit U1704 may integrate the local attention feature maps into an overall attention feature map according to the corresponding positions of the local regions and locate the detection target according to the overall attention feature map.
In case (3) above, the detection target positioning unit U1704 may locate the detection target within the input image based on both the global attention feature map and the local attention feature maps, i.e., under a multi-context attention mechanism considering both the global and local characteristics of the input image. As an exemplary implementation, the detection target positioning unit U1704 may generate an overall attention feature map of the input image based on the respective local attention feature maps, for example by integrating them according to the corresponding positions of the respective local regions; then generate an optimized attention feature map based on the global attention feature map and the generated overall attention feature map, for example by fusing the two through point-by-point summation or concatenation; and finally locate the detection target based on the optimized attention feature map, for example by obtaining its specific coordinate position in the input image or a foreground segmentation result.
According to the neural-network-based target detection apparatus of the present disclosure, both the appearance characteristics of the input image to be detected and the environmental attribute information at the time of its capture are considered, so that the information potentially contained in the input image is fully exploited. In addition, the attention mechanism ensures that the parts of the image containing more important information receive sufficient attention during detection; the global and local characteristics of the input image can be considered jointly, and the multi-context attention mechanism enhances adaptability to different detection scenes.
Computer readable storage medium
The neural network training method/apparatus and the neural network-based object detection method/apparatus according to the present disclosure may also be implemented by providing a computer program product containing program code implementing the method or apparatus, or by any storage medium storing such a computer program product.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including", "comprising", and "having" are open-ended words that mean "including, but not limited to" and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or", unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
Also, as used herein, "or" used in a list of items prefaced by "at least one of" indicates a disjunctive list, such that, for example, a list of "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.
It is also noted that in the apparatus and methods of the present disclosure, the components or steps may be broken down and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
It will be understood by those of ordinary skill in the art that all or any portion of the methods and apparatus of the present disclosure may be implemented in any computing device (including processors, storage media, etc.) or network of computing devices, in hardware, firmware, software, or any combination thereof. The hardware may be implemented with a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. The software may reside in any form of computer readable tangible storage medium. By way of example, and not limitation, such computer-readable tangible storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc.
Various changes, substitutions, and alterations to the techniques described herein may be made without departing from the teachings defined by the appended claims. Moreover, the scope of the claims of the present disclosure is not limited to the particular aspects of the process, machine, manufacture, composition of matter, means, methods, and acts described above. Processes, machines, manufacture, compositions of matter, means, methods, or acts presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or acts.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (19)

1. A method of training a neural network for target detection, comprising:
receiving a training image including a detection target, the training image belonging to one of an annotated source domain image and an unlabeled target domain image;
generating a feature map of a training image based on appearance characteristics and environment attribute information of the training image;
generating an attention feature map of the training image based on the feature map;
predicting whether the training image is a source domain image or a target domain image based on the attention feature map, and predicting the location of the detection target; and
determining an overall loss of the neural network based on a loss of prediction of whether the training image is a source domain image or a target domain image and a loss of prediction of localization of the detection target, and updating parameters of the neural network according to the overall loss.
2. The training method of claim 1, wherein generating a feature map of a training image based on appearance characteristics and environmental attribute information of the training image comprises:
performing image feature extraction on the training image to generate an appearance feature map of the training image;
acquiring environment attribute information of a shooting environment of the training image, and generating an environment attribute feature map of the training image according to the environment attribute information; and
combining the appearance feature map and the environment attribute feature map to generate the feature map.
3. The training method according to claim 2, wherein acquiring environment attribute information of a shooting environment of the training image, and generating an environment attribute feature map of the training image according to the environment attribute information includes:
converting the environment attribute information of the shooting environment of the training image into an environment attribute matrix, and extracting the environment attribute feature map from the environment attribute matrix.
4. The training method of claim 1, wherein generating an attention feature map of the training image based on the feature map comprises:
dividing the feature map into local regions;
generating respective local attention maps for the parts of the feature map within the respective local regions; and
generating local attention feature maps of the respective local regions of the training image based on the parts of the feature map within the respective local regions and the generated local attention maps.
5. The training method of claim 4, wherein predicting whether the training image is a source domain image or a target domain image based on the attention feature map and predicting the location of the detection target comprises:
predicting, based on each local attention feature map, whether the training image to which each local region belongs is a source domain image or a target domain image, and predicting the location of the detection target within the training image.
6. The training method of claim 5, wherein determining an overall loss of the neural network based on a loss of prediction of whether the training image is a source domain image or a target domain image and a loss of prediction of the location of the detection target, and updating parameters of the neural network according to the overall loss comprises:
calculating a local prediction loss for predicting whether the training image to which each local region belongs is a source domain image or a target domain image;
calculating a location prediction loss that predicts a location of a detection target within the training image; and
determining an overall loss of the neural network based on the local prediction loss and the location prediction loss, and updating parameters of the neural network according to the overall loss.
7. The training method of claim 4, wherein generating an attention feature map for the training image based on the feature map further comprises:
generating a global attention map of the training image based on the feature map; and
generating a global attention feature map for the training image based on the feature map and the generated global attention map.
8. The training method of claim 7, wherein predicting whether the training image is a source domain image or a target domain image based on the attention feature map, and predicting the location of the detection target comprises:
predicting whether the training image is a source domain image or a target domain image based on the global attention feature map;
predicting whether a training image to which each local region belongs is a source domain image or a target domain image based on each local attention feature map; and
predicting a location of a detection target within the training image based on the global attention feature map and the local attention feature map.
9. The training method of claim 8, wherein predicting a location of a detection target within the training image based on the global attention feature map and local attention feature map comprises:
generating an overall attention feature map of the training image based on each local attention feature map;
generating an optimized attention feature map of the training image based on the global attention feature map and the generated overall attention feature map; and
based on the optimized attention feature map, predicting a location of a detection target within the training image.
10. The training method of claim 9, wherein determining an overall loss of the neural network based on a loss of prediction of whether the training image is a source domain image or a target domain image and a loss of prediction of the location of the detection target, and updating parameters of the neural network according to the overall loss comprises:
calculating a global prediction loss predicting whether the training image is a source domain image or a target domain image;
calculating a local prediction loss for predicting whether the training image to which each local region belongs is a source domain image or a target domain image;
calculating a location prediction loss that predicts a location of a detection target within the training image; and
determining the overall loss of the neural network based on the global prediction loss, the local prediction loss, and the localization prediction loss, and updating the parameters of the neural network according to the overall loss.
11. The training method according to any one of claims 1-10, wherein the localization prediction loss of the training image is set to a preset value when the training image belongs to an unlabeled target domain image.
12. The training method of claim 1, wherein the overall loss of the neural network is positively correlated with the loss of the prediction of the localization of the detection target and negatively correlated with the loss of the prediction of whether the training image is a source domain image or a target domain image.
13. A training device for a neural network for target detection, comprising:
a processor; and
a memory having computer program instructions stored therein,
wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps of:
receiving a training image including a detection target, the training image belonging to one of an annotated source domain image and an unlabeled target domain image;
generating a feature map of a training image based on appearance characteristics and environment attribute information of the training image;
generating an attention feature map of the training image based on the feature map;
predicting whether the training image is a source domain image or a target domain image based on the attention feature map, and predicting the location of the detection target; and
determining an overall loss of the neural network based on a loss of prediction of whether the training image is a source domain image or a target domain image and a loss of prediction of localization of the detection target, and updating parameters of the neural network according to the overall loss.
14. A training device for a neural network for target detection, comprising:
an image receiving unit configured to receive a training image including a detection target, the training image belonging to one of an annotated source domain image and an unlabeled target domain image;
a feature map generation unit configured to generate a feature map of the training image based on appearance characteristics and environment attribute information of the training image;
an attention feature map generation unit configured to generate an attention feature map of the training image based on the feature map;
a prediction unit configured to predict whether the training image is a source domain image or a target domain image based on the attention feature map, and predict a location of the detection target; and
a parameter updating unit configured to determine an overall loss of the neural network based on a loss predicted on whether the training image is a source domain image or a target domain image and a loss predicted on a location of the detection target, and update a parameter of the neural network according to the overall loss.
15. A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps of:
receiving a training image including a detection target, the training image belonging to one of an annotated source domain image and an unlabeled target domain image;
generating a feature map of a training image based on appearance characteristics and environment attribute information of the training image;
generating an attention feature map of the training image based on the feature map;
predicting whether the training image is a source domain image or a target domain image based on the attention feature map, and predicting the location of the detection target; and
determining an overall loss of a neural network for target detection based on a loss of prediction of whether the training image is a source domain image or a target domain image and a loss of prediction of localization of the detection target, and updating parameters of the neural network according to the overall loss.
16. A neural network-based target detection method, comprising:
receiving an input image;
generating a feature map of an input image by using the neural network based on appearance characteristics and environmental attribute information of the input image;
generating an attention feature map of the input image by using the neural network based on the feature map; and
locating a detection target in the input image by using the neural network based on the attention feature map.
17. An object detection device based on a neural network, comprising:
a processor; and
a memory having computer program instructions stored therein,
wherein the computer program instructions, when executed by the processor, cause the processor to perform the steps of:
receiving an input image;
generating a feature map of an input image by using the neural network based on appearance characteristics and environmental attribute information of the input image;
generating an attention feature map of the input image by using the neural network based on the feature map; and
locating a detection target in the input image by using the neural network based on the attention feature map.
18. An object detection device based on a neural network, comprising:
an image receiving unit configured to receive an input image;
a feature map generation unit configured to generate a feature map of an input image using the neural network based on appearance characteristics and environmental attribute information of the input image;
an attention feature map generation unit configured to generate an attention feature map of the input image using the neural network based on the feature map; and
a detection target positioning unit configured to position a detection target in the input image by using the neural network based on the attention feature map.
19. A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the steps of:
receiving an input image;
generating a feature map of an input image by using a neural network based on appearance characteristics and environmental attribute information of the input image;
generating an attention feature map of the input image by using the neural network based on the feature map; and
locating a detection target in the input image by using the neural network based on the attention feature map.
CN201910806390.4A 2019-08-29 2019-08-29 Neural network training and target detection method, device and storage medium Pending CN112446239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910806390.4A CN112446239A (en) 2019-08-29 2019-08-29 Neural network training and target detection method, device and storage medium

Publications (1)

Publication Number Publication Date
CN112446239A (en) 2021-03-05

Family

ID=74740711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910806390.4A Pending CN112446239A (en) 2019-08-29 2019-08-29 Neural network training and target detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN112446239A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373962A1 (en) * 2017-06-27 2018-12-27 Canon Kabushiki Kaisha method and apparatus for determining similarity of objects in images
CN108399431A (en) * 2018-02-28 2018-08-14 国信优易数据有限公司 Disaggregated model training method and sorting technique
CN108765383A (en) * 2018-03-22 2018-11-06 山西大学 Video presentation method based on depth migration study
CN109117793A (en) * 2018-08-16 2019-01-01 厦门大学 Direct-push high Resolution Range Profile Identification of Radar method based on depth migration study
CN109523526A (en) * 2018-11-08 2019-03-26 腾讯科技(深圳)有限公司 Organize nodule detection and its model training method, device, equipment and system
CN109580215A (en) * 2018-11-30 2019-04-05 湖南科技大学 A kind of wind-powered electricity generation driving unit fault diagnostic method generating confrontation network based on depth
CN109523018A (en) * 2019-01-08 2019-03-26 重庆邮电大学 A kind of picture classification method based on depth migration study
CN110135579A (en) * 2019-04-08 2019-08-16 上海交通大学 Unsupervised field adaptive method, system and medium based on confrontation study
CN110059744A (en) * 2019-04-16 2019-07-26 腾讯科技(深圳)有限公司 Method, the method for image procossing, equipment and the storage medium of training neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SONG GUANGHUI: "Research on Image Annotation Methods Based on Transfer Learning and Deep Convolutional Features", China Doctoral Dissertations Full-text Database, pages 67-70 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160151A (en) * 2021-04-02 2021-07-23 浙江大学 Panoramic film dental caries depth identification method based on deep learning and attention mechanism
EP4099225A1 (en) * 2021-05-31 2022-12-07 Siemens Aktiengesellschaft Method for training a classifier and system for classifying blocks
WO2022253636A1 (en) * 2021-05-31 2022-12-08 Siemens Aktiengesellschaft Method for training a classifier and system for classifying blocks
CN113591736A (en) * 2021-08-03 2021-11-02 北京百度网讯科技有限公司 Feature extraction network, training method of living body detection model and living body detection method
CN116030418A (en) * 2023-02-14 2023-04-28 北京建工集团有限责任公司 Automobile lifting line state monitoring system and method
CN116030418B (en) * 2023-02-14 2023-09-12 北京建工集团有限责任公司 Automobile lifting line state monitoring system and method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination