CN113887423A - Target detection method, target detection device, electronic equipment and storage medium - Google Patents

Target detection method, target detection device, electronic equipment and storage medium

Info

Publication number
CN113887423A
Authority
CN
China
Prior art keywords
feature
target
mapping
dimension
categories
Prior art date
Legal status
Pending
Application number
CN202111163177.XA
Other languages
Chinese (zh)
Inventor
杨喜鹏
谭啸
孙昊
丁二锐
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111163177.XA
Publication of CN113887423A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 18/00 Pattern recognition
                    • G06F 18/20 Analysing
                        • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods

Abstract

The disclosure provides a target detection method, a target detection apparatus, an electronic device and a storage medium, relates to the field of artificial intelligence, in particular to computer vision and deep learning technology, and can be used in target detection and video analysis scenarios. The scheme is as follows: feature extraction is performed on a target image to obtain a target feature map; feature mapping is performed on the target feature map with a mapping network of a target recognition model to obtain mapping features of multiple dimensions; the similarity between the mapping feature of each dimension and the feature means of multiple categories is determined; according to the similarity, the mapping feature of each dimension is fused with the feature means of the multiple categories to obtain a fusion feature of each dimension; and target detection is performed according to the fusion features of all dimensions. Because the feature means of multiple categories are fused with the mapping feature of each dimension, the differences between categories can be enhanced, the accuracy of the target detection result is improved, and misclassification in the detection result can be avoided.

Description

Target detection method, target detection device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to computer vision and deep learning techniques, which may be used in target detection and video analysis scenarios, and in particular, to a target detection method, apparatus, electronic device, and storage medium.
Background
In smart-city, intelligent-transportation and video-analysis scenarios, accurately detecting targets such as vehicles, pedestrians and other objects in video can support tasks such as abnormal-event detection, fugitive tracking and vehicle counting. Therefore, how to detect targets in video is very important.
Disclosure of Invention
The disclosure provides a method, an apparatus, an electronic device and a storage medium for object detection.
According to an aspect of the present disclosure, there is provided an object detection method including: acquiring a target image and acquiring feature mean values of a plurality of categories; performing feature extraction on the target image to obtain a target feature map; performing feature mapping on the target feature map by adopting a mapping network of a target identification model to obtain mapping features of multiple dimensions; respectively determining similarity between the mapping characteristics of each dimension and the characteristic mean values of a plurality of categories; according to the similarity, fusing the mapping feature of each dimension with the feature mean values of the multiple categories to obtain a fusion feature of each dimension; and carrying out target detection according to the fusion characteristics of all dimensions.
According to another aspect of the present disclosure, there is provided an object detecting apparatus including: the acquisition module is used for acquiring a target image and acquiring the characteristic mean values of a plurality of categories; the extraction module is used for extracting the features of the target image to obtain a target feature map; the mapping module is used for performing feature mapping on the target feature map by adopting a mapping network of a target recognition model to obtain mapping features of multiple dimensions; the first determining module is used for respectively determining the similarity between the mapping characteristics of each dimension and the characteristic mean value of a plurality of categories; the fusion module is used for fusing the mapping feature of each dimension with the feature mean values of the multiple categories according to the similarity so as to obtain the fusion feature of each dimension; and the detection module is used for carrying out target detection according to the fusion characteristics of all dimensions.
According to still another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of object detection as set forth in the above aspect of the disclosure.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the object detection method set forth in the above-described aspect of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the object detection method set forth in the above-mentioned aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a target detection method according to a first embodiment of the disclosure;
fig. 2 is a schematic flowchart of a target detection method according to a second embodiment of the disclosure;
fig. 3 is a schematic flowchart of a target detection method according to a third embodiment of the disclosure;
fig. 4 is a schematic flowchart of a target detection method according to a fourth embodiment of the disclosure;
FIG. 5 is a schematic structural diagram of a target recognition model provided by an embodiment of the present disclosure;
fig. 6 is a schematic flowchart of a target detection method according to a fifth embodiment of the disclosure;
fig. 7 is a schematic flowchart of a target detection method according to a sixth embodiment of the disclosure;
FIG. 8 is a schematic diagram illustrating a fusion process of a target feature map and a location map provided by an embodiment of the present disclosure;
fig. 9 is a schematic flowchart of a target detection method according to a seventh embodiment of the disclosure;
FIG. 10 is a schematic diagram illustrating a process of mapping a mapping network of a target recognition model to a target feature map provided in an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an object detection apparatus according to an eighth embodiment of the present disclosure;
FIG. 12 shows a schematic block diagram of an example electronic device 1200, which can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Currently, targets in video frames are mainly detected with Fast R-CNN (fast region-based candidate-region detection) or Transformer-based approaches. Although both approaches combine features from the frames before and after the current frame, the number of frames used each time is limited, so the inter-frame feature enhancement they can provide is limited and the upper bound on the achievable accuracy is clearly constrained; moreover, the differences between classes are insufficiently modeled and the features cannot be further enhanced.
In view of the above problems, the present disclosure provides a target detection method, an apparatus, an electronic device, and a storage medium.
An object detection method, an apparatus, an electronic device, and a storage medium of the embodiments of the present disclosure are described below with reference to the drawings.
Fig. 1 is a schematic flow chart of a target detection method according to a first embodiment of the present disclosure.
The embodiments of the present disclosure are exemplified by the target detection method being configured in a target detection apparatus, which can be applied to any electronic device, so that the electronic device can perform a target detection function.
The electronic device may be any device with computing capability, for example, a personal computer, a mobile terminal, a server, and the like, and the mobile terminal may be a hardware device with various operating systems, touch screens, and/or display screens, such as an in-vehicle device, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and the like.
As shown in fig. 1, the target detection method may include the steps of:
step 101, obtaining a target image and obtaining feature mean values of a plurality of categories.
In the embodiment of the present disclosure, the target image is an image that needs to be subjected to target detection, and the target image may be an image acquired on line, for example, the target image to be detected may be acquired on line through a web crawler technology, or the target image may also be an image acquired off line, or the target image may also be an image acquired in real time, or the target image may also be an image synthesized by a human, and the like.
In addition, the target image may be a video frame of a certain frame in the video, and the target image may be extracted from the video.
In an embodiment of the present disclosure, the feature mean of the plurality of classes may be a feature mean of a plurality of classes in the target image.
As an example, the feature means of the multiple categories may form an M × C matrix, where M may represent the number of target categories in the target data set and C may be the feature dimension of a prediction box or target. During training of the target recognition model, a matching strategy determines which predicted features match the real targets labeled in the sample images, and the per-category feature means predicted in previous training rounds are updated by an exponential moving average, so that the features of all target categories are learned over the whole target data set and recorded as running averages. A target may be any object such as a vehicle, a person, an item or an animal, and the categories may include vehicle, person and the like.
As another example, the feature mean of the plurality of categories may be a set feature mean.
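The patent itself gives no code for maintaining these per-category feature means; the following is a minimal sketch of the M × C exponential-moving-average memory described in the example above, written in PyTorch-style Python. The class name, method names and momentum value are illustrative assumptions, not details taken from the disclosure.

```python
import torch


class ClassMeanMemory:
    """M x C matrix of per-category feature means, updated by exponential moving average.

    M is the assumed number of target categories in the data set, C the feature
    dimension of a predicted target; both are illustrative.
    """

    def __init__(self, num_classes: int, feat_dim: int, momentum: float = 0.9):
        self.momentum = momentum
        self.means = torch.zeros(num_classes, feat_dim)  # M x C feature means

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        """feats: N x C features of predictions matched to labeled targets;
        labels: N ground-truth category indices of those matches."""
        for cls in labels.unique():
            batch_mean = feats[labels == cls].mean(dim=0)  # mean feature of this category in the batch
            self.means[cls] = (self.momentum * self.means[cls]
                               + (1.0 - self.momentum) * batch_mean)
```

During training, update would be called with the features that the matching strategy has associated with the real targets labeled in the sample images; at inference the stored means are simply read.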
And 102, performing feature extraction on the target image to obtain a target feature map.
In the embodiment of the present disclosure, feature extraction may be performed on the target image to obtain a target feature map corresponding to the target image.
In a possible implementation manner of the embodiment of the present disclosure, in order to improve accuracy and reliability of a feature extraction result, feature extraction may be performed on a target image based on a deep learning technique to obtain a target feature map corresponding to the target image.
As an example, a mainstream backbone network may be used to extract features from the target image to obtain the target feature map. For example, the backbone network may be a residual network from the ResNet series (such as ResNet34, ResNet50 or ResNet101) or a network from the DarkNet series (an open-source neural-network framework written in C and CUDA, such as DarkNet19 or DarkNet53), and so forth.
In a possible implementation manner of the embodiment of the present disclosure, in order to balance the accuracy of the feature extraction result against resource consumption, a suitable backbone network may be selected for feature extraction according to the application scenario of the video service. For example, backbone networks may be divided into lightweight structures (e.g., ResNet18, ResNet34, DarkNet19), medium-sized structures (e.g., ResNet50, ResNeXt50 (ResNeXt combines the ResNet and Inception designs), DarkNet53) and heavy structures (e.g., ResNet101, ResNeXt152), and the specific network structure may be selected according to the application scenario.
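As a rough illustration of the feature-extraction step, the sketch below truncates a torchvision ResNet-50 so that it returns a spatial feature map instead of classification logits. The choice of ResNet-50, the input size and the variable names are assumptions for illustration only.

```python
import torch
import torchvision

# Drop the average-pooling and classification layers of a ResNet-50 so that the
# network returns a spatial feature map rather than class logits.
resnet = torchvision.models.resnet50(weights=None)  # pretrained weights could be loaded here
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

image = torch.randn(1, 3, 800, 800)      # dummy target image: (batch, channels, height, width)
target_feature_map = backbone(image)     # (1, 2048, 25, 25) for this input size
```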
And 103, performing feature mapping on the target feature map by adopting a mapping network of the target identification model to obtain mapping features of multiple dimensions.
As an example, the target feature map and the corresponding position map may be fused to obtain an input feature map, the input feature map is input into the target recognition model, and the decoded features of the multiple targets to be predicted output by the target recognition model may be used as the mapping features of multiple dimensions.
As another example, the target feature map may be input to an RPN network of the target recognition model to perform region-of-interest prediction, so as to obtain feature maps of multiple regions of interest, and the feature maps of the multiple regions of interest may be input to a pooling layer of the target recognition model to perform size adjustment, thereby determining mapping features of corresponding dimensions.
And 104, respectively determining the similarity between the mapping feature of each dimension and the feature mean value of a plurality of categories.
In the embodiment of the present disclosure, for a mapping feature of each dimension, a similarity algorithm may be used to determine the similarity between the mapping feature and the feature mean of multiple classes.
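The disclosure leaves the concrete similarity algorithm open; cosine similarity is one common choice, sketched below for an assumed N × C matrix of mapping features (one row per dimension) and an M × C matrix of category feature means.

```python
import torch
import torch.nn.functional as F

def class_similarity(mapping_feats: torch.Tensor, class_means: torch.Tensor) -> torch.Tensor:
    """mapping_feats: N x C (one row per dimension); class_means: M x C.
    Returns an N x M matrix of cosine similarities."""
    q = F.normalize(mapping_feats, dim=-1)
    k = F.normalize(class_means, dim=-1)
    return q @ k.t()  # entry (n, m): similarity of mapping feature n to category m
```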
And 105, fusing the mapping characteristics of each dimension with the characteristic mean values of a plurality of categories according to the similarity to obtain the fusion characteristics of each dimension.
As an example, the feature mean of the target category may be selected from feature means of multiple categories, and the feature mean of the target category and the one-dimensional mapping feature are fused, so as to obtain a fusion feature of each dimension.
As another example, the weight corresponding to the feature mean of each category may be determined, and then, according to the weight corresponding to the feature mean of each category, the feature mean of each category and the mapping feature of the dimension are fused to obtain the fusion feature of the dimension.
And 106, carrying out target detection according to the fusion characteristics of the dimensions.
In the embodiment of the present disclosure, target detection may be performed on the fusion features of each dimension to obtain a corresponding detection result, for example, target detection may be performed on the fusion features of each dimension based on a target detection algorithm to obtain a corresponding detection result. The detection result may include the predicted position of the prediction box and the prediction category to which the target in the prediction box belongs.
In a possible implementation manner of the embodiment of the present disclosure, in order to improve the accuracy and reliability of the target detection result, target detection may be performed on the fusion features of each dimension based on a deep learning technique to obtain a corresponding detection result.
In conclusion, a target image is acquired and the feature mean values of a plurality of categories are acquired; feature extraction is performed on the target image to obtain a target feature map; feature mapping is performed on the target feature map by a mapping network of the target recognition model to obtain mapping features of multiple dimensions; the similarity between the mapping feature of each dimension and the feature mean values of the plurality of categories is determined; according to the similarity, the mapping feature of each dimension is fused with the feature mean values of the plurality of categories to obtain a fusion feature of each dimension; and target detection is performed according to the fusion features of all dimensions. Therefore, the feature mean values of the plurality of categories are fused with the mapping feature of each dimension, so the differences between categories can be enhanced, the accuracy of the target detection result is improved, and misclassification in the target detection result can be avoided.
In order to clearly illustrate how the mapping features of each dimension are fused with the feature mean values of a plurality of categories according to the similarity in the above embodiments, the present disclosure also provides an object detection method.
Fig. 2 is a schematic flow chart of a target detection method provided in the second embodiment of the present disclosure.
As shown in fig. 2, the target detection method may include the steps of:
step 201, obtaining a target image, and obtaining feature mean values of a plurality of categories.
Step 202, performing feature extraction on the target image to obtain a target feature map.
And step 203, performing feature mapping on the target feature map by using a mapping network of the target identification model to obtain mapping features of multiple dimensions.
And 204, respectively determining the similarity between the mapping feature of each dimension and the feature mean of a plurality of categories.
Step 205, for any one-dimensional mapping feature, selecting a feature mean value of the target category from feature mean values of multiple categories according to similarity between the feature mean value and the feature mean values of multiple categories.
In the embodiment of the present disclosure, for any one-dimensional mapping feature, similarity between the one-dimensional mapping feature and feature mean values of multiple categories may be calculated by using a similarity algorithm, the feature mean value of the category with the highest similarity to the one-dimensional mapping feature is selected from the feature mean values of the multiple categories, and the feature mean value of the category is used as the feature mean value of the target category.
And step 206, fusing the feature mean value of the target category with the mapping feature of one dimension to obtain a fusion feature of one dimension.
In the embodiment of the disclosure, the feature mean of the target category may be added to the mapping feature of one dimension, and the addition result is used as the fusion feature of one dimension.
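A minimal sketch of this hard-selection fusion, reusing the similarity matrix from the earlier similarity sketch, might look as follows; the function name and tensor shapes are assumptions.

```python
import torch

def fuse_hard(mapping_feats: torch.Tensor, class_means: torch.Tensor,
              sim: torch.Tensor) -> torch.Tensor:
    """Add to each mapping feature the feature mean of its single most similar
    category (the 'target category'). mapping_feats: N x C, class_means: M x C,
    sim: N x M similarity matrix."""
    target_cls = sim.argmax(dim=-1)                  # most similar category per mapping feature
    return mapping_feats + class_means[target_cls]   # N x C fusion features
```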
And step 207, performing target detection according to the fusion characteristics of the dimensions.
It should be noted that the execution processes of steps 201 to 204 and step 207 may refer to the execution process of the foregoing embodiment, which is not described herein again.
In summary, for any one dimension of mapping features, according to the similarity between the mapping features and the feature mean values of multiple categories, the feature mean value of the target category is selected from the feature mean values of multiple categories, and the feature mean value of the target category and the mapping feature of one dimension are fused to obtain a fusion feature of one dimension.
In order to clearly illustrate how the mapping features of each dimension are fused with the feature mean values of a plurality of categories according to the similarity in the above embodiments, the present disclosure also provides an object detection method.
Fig. 3 is a schematic flow chart of a target detection method provided in the third embodiment of the present disclosure.
As shown in fig. 3, the target detection method may include the steps of:
step 301, obtaining a target image, and obtaining feature mean values of a plurality of categories.
Step 302, performing feature extraction on the target image to obtain a target feature map.
And 303, performing feature mapping on the target feature map by using a mapping network of the target identification model to obtain mapping features of multiple dimensions.
And 304, respectively determining the similarity between the mapping feature of each dimension and the feature mean of a plurality of categories.
Step 305, determining the weight corresponding to the feature mean value of each category according to the similarity between the mapping feature of any dimension and the feature mean values of a plurality of categories.
In this embodiment of the disclosure, for a mapping feature of any one dimension, the similarity between that mapping feature and the feature mean of each of the multiple categories may be calculated with a similarity algorithm, and the weight corresponding to each category's feature mean may be determined from that similarity. For example, the higher the similarity between the mapping feature and a category's feature mean, the larger the weight set for that category's feature mean. As another example, the similarity between the mapping feature and each category's feature mean may be used directly as the weight corresponding to that category's feature mean.
And step 306, fusing the feature mean value of each category with the mapping feature of one dimension according to the weight corresponding to the feature mean value of each category to obtain a fusion feature of one dimension.
Further, the feature mean of each category may be multiplied by its corresponding weight and the weighted means summed; the weighted sum is then added to the mapping feature of the dimension, and the addition result is used as the fusion feature of that dimension.
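A corresponding sketch of this weighted fusion is shown below. Normalizing the similarities with a softmax is one possible weighting scheme and is an assumption here; as noted above, the raw similarity values could also be used directly as weights.

```python
import torch

def fuse_weighted(mapping_feats: torch.Tensor, class_means: torch.Tensor,
                  sim: torch.Tensor) -> torch.Tensor:
    """Weight every category's feature mean by its (normalized) similarity,
    sum the weighted means and add the sum to the mapping feature.
    mapping_feats: N x C, class_means: M x C, sim: N x M."""
    weights = sim.softmax(dim=-1)           # one weight per category for each mapping feature
    weighted_means = weights @ class_means  # N x C weighted sum of category means
    return mapping_feats + weighted_means
```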
And 307, detecting the target according to the fusion characteristics of the dimensions.
It should be noted that the execution processes of steps 301 to 304 and step 307 may refer to the execution process of the foregoing embodiment, which is not described herein again.
In summary, for any one dimension of the mapping features, the weight corresponding to the feature mean of each category is determined according to the similarity between the mapping features and the feature mean of a plurality of categories; and fusing the feature mean value of each category with the mapping feature of one dimension according to the weight corresponding to the feature mean value of each category to obtain the fusion feature of one dimension. Therefore, according to the weight corresponding to the feature mean value of each category, the feature mean value of each category is fused with the mapping feature of one dimension, the fusion feature of each dimension can be enhanced, and the accuracy and the reliability of the target detection result are further improved.
In order to clearly illustrate how the target detection is performed according to the fusion features of the dimensions in the above embodiments, the present disclosure also provides a target detection method.
Fig. 4 is a schematic flow chart of a target detection method according to a fourth embodiment of the present disclosure.
As shown in fig. 4, the target detection method may include the steps of:
step 401, obtaining a target image, and obtaining feature mean values of a plurality of categories.
And step 402, performing feature extraction on the target image to obtain a target feature map.
And 403, performing feature mapping on the target feature map by using a mapping network of the target identification model to obtain mapping features of multiple dimensions.
And step 404, respectively determining similarity between the mapping feature of each dimension and the feature mean of a plurality of categories.
And 405, fusing the mapping feature of each dimension with the feature mean values of a plurality of categories according to the similarity to obtain the fusion feature of each dimension.
Step 406, inputting the fusion features of the dimensions into corresponding prediction layers in the target recognition model respectively for target detection, so as to determine the prediction position of the prediction frame and the prediction category to which the target in the prediction frame belongs.
It should be understood that the target recognition model can recognize a large number of targets, but the number of targets contained in an image or video frame is limited by the framed picture. In order to balance the accuracy of the target detection result against resource waste, the number of prediction layers may be determined according to the number of prediction dimensions; the number of prediction layers is the same as the number of prediction dimensions.
In the embodiment of the present disclosure, the fusion features of each dimension may be respectively input to the corresponding prediction layers to obtain the prediction positions of the prediction frames output by each prediction layer.
In the embodiment of the present application, the prediction category to which the target in the prediction frame output by the corresponding prediction layer belongs may be determined according to the prediction category predicted by each prediction layer.
As an example, take the target recognition model to be a model with a Transformer as its basic structure; the structure of the target recognition model may be as shown in fig. 5, and the prediction layers are FFNs (Feed-Forward Networks).
The target feature map is an H × W × C tensor. The tensor may be split into blocks to obtain a serialized sequence of feature vectors (i.e., the fused target feature map is converted into tokens, the elements of the feature map), that is, into H × W C-dimensional feature vectors. The serialized feature vectors are input to the encoder for attention learning (the attention mechanism achieves the inter-frame enhancement effect), and the resulting feature-vector sequence is input to the decoder; the decoder performs attention learning on the input feature-vector sequence, and the obtained decoded features are finally passed to the FFNs for target detection, i.e., the FFNs perform classification and regression prediction to obtain the detection result. The box output by an FFN is the predicted position of a prediction frame, from which the prediction frame can be determined; the class output by the FFN is the prediction category to which the target in the prediction frame belongs; and 'no object' indicates that no target is present. That is, a decoded feature may be input to an FFN, which performs regression prediction of the target to obtain the predicted position of the prediction frame and category prediction of the target to obtain the prediction category to which the target in the prediction frame belongs.
As an example, assuming that the number of fusion features is 4, as shown in fig. 5, category prediction of the targets can be performed by 4 FFNs to obtain 4 categories (class outputs).
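The disclosure does not spell out the internals of the FFN prediction layer; the sketch below shows one plausible form of such a head, with a classification branch that includes a 'no object' class and a regression branch for the box position. Layer sizes, the number of classes and the box parameterization are illustrative assumptions.

```python
import torch
from torch import nn

class PredictionHead(nn.Module):
    """One FFN-style prediction layer applied to a fused feature: a classification
    branch (including a 'no object' class) and a box-regression branch."""

    def __init__(self, feat_dim: int = 256, num_classes: int = 80):
        super().__init__()
        self.class_branch = nn.Linear(feat_dim, num_classes + 1)  # +1 for 'no object'
        self.box_branch = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 4),  # normalized (cx, cy, w, h) of the prediction box
        )

    def forward(self, fused_feat: torch.Tensor):
        return self.class_branch(fused_feat), self.box_branch(fused_feat).sigmoid()
```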
It should be noted that the execution process of steps 401 to 405 may refer to the execution process of the foregoing embodiment, and is not described herein again.
In summary, the fused features of the dimensions are respectively input into the corresponding prediction layers in the target recognition model to perform target detection, so as to determine the prediction position of the prediction frame and the prediction category to which the target in the prediction frame belongs, thereby accurately determining the prediction position of the prediction frame and the prediction category to which the target in the prediction frame belongs, and improving the accuracy and reliability of the target detection result.
In order to clearly illustrate how the feature mean corresponding to the actual category learns the feature mean of all categories, the present disclosure also proposes a target detection method.
Fig. 6 is a schematic flowchart of a target detection method provided in the fifth embodiment of the present disclosure.
As shown in fig. 6, the target detection method may include the steps of:
step 601, obtaining a target image and obtaining feature mean values of a plurality of categories.
Step 602, performing feature extraction on the target image to obtain a target feature map.
Step 603, performing feature mapping on the target feature map by using a mapping network of the target identification model to obtain mapping features of multiple dimensions.
And step 604, respectively determining similarity between the mapping feature of each dimension and the feature mean of a plurality of categories.
And 605, fusing the mapping feature of each dimension with the feature mean values of a plurality of categories according to the similarity to obtain a fusion feature of each dimension.
And 606, inputting the fusion characteristics of all dimensions into corresponding prediction layers in the target recognition model respectively to perform target detection so as to determine the prediction position of the prediction frame and the prediction category to which the target in the prediction frame belongs.
In step 607, a target prediction layer whose prediction type matches the actual type is determined from the prediction layers.
In the embodiment of the present disclosure, the prediction category output by each prediction layer may be matched with the actual category, and the prediction layer whose prediction category matches the actual category may be used as the target prediction layer. For example, if the decoder outputs 4 features (shown in different colors in fig. 5) and the prediction category obtained after one of these features passes through its FFN matches the actual category, the prediction layer corresponding to that matching prediction category is used as the target prediction layer.
Step 608, updating the feature mean corresponding to the actual category according to the fusion feature input by the target prediction layer.
Furthermore, the feature mean value corresponding to the actual category can be updated according to the fusion features input by the target prediction layer, and the actual category can be dynamically updated and maintained.
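Under the assumption that each prediction layer can be aligned with one labeled target (the matching strategy itself is not detailed here), the update described in steps 607 and 608 could look roughly as follows; it simply reuses the exponential-moving-average idea from the earlier memory sketch, and all names are illustrative.

```python
import torch

@torch.no_grad()
def update_matched_means(class_means: torch.Tensor, fused_feats: torch.Tensor,
                         pred_classes: torch.Tensor, actual_classes: torch.Tensor,
                         momentum: float = 0.9) -> None:
    """class_means: M x C memory; fused_feats: L x C, one fused feature per
    prediction layer; pred_classes / actual_classes: L category indices.
    Only layers whose prediction matches the actual category update the memory."""
    matched = pred_classes == actual_classes
    for feat, cls in zip(fused_feats[matched], actual_classes[matched]):
        class_means[cls] = momentum * class_means[cls] + (1.0 - momentum) * feat
```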
It should be noted that the execution process of steps 601 to 606 may refer to the execution process of the foregoing embodiment, which is not described herein again.
In summary, a target prediction layer with a prediction category matching the actual category is determined from the prediction layers, and the feature mean corresponding to the actual category is updated according to the fusion features input by the target prediction layer. Therefore, the actual categories can be dynamically updated and maintained, the target recognition model can learn the feature mean values of all the categories, the feature expression capability of the target recognition model is improved, and the accuracy and the reliability of the target detection result are improved.
In order to clearly illustrate how to perform feature mapping on the target feature map by using the mapping network of the target recognition model to obtain mapping features of multiple dimensions, the present disclosure also provides a target detection method.
Fig. 7 is a schematic flowchart of a target detection method according to a sixth embodiment of the disclosure.
As shown in fig. 7, the target detection method may include the steps of:
step 701, acquiring a target image and acquiring feature mean values of a plurality of categories.
Step 702, performing feature extraction on the target image to obtain a target feature map.
And 703, fusing the target feature map and the corresponding position map to obtain an input feature map.
In the embodiment of the present disclosure, each element in the position map corresponds to each element in the target feature map in a one-to-one manner, where each element in the position map is used to indicate the coordinate of the corresponding element in the target feature map in the target image.
In a possible implementation manner of the embodiment of the present disclosure, the target feature map and the corresponding position map may be spliced to obtain the input feature map.
As an example, taking the target recognition model as a model with a Transformer as its basic structure, the target detection principle of the present disclosure may be as shown in fig. 5, and the target feature map output by the CNN may be added to or concatenated with the position map to obtain the input feature map.
In a possible implementation manner of the embodiment of the present disclosure, the target feature map and the corresponding position map may be spliced to obtain a spliced feature map, and the spliced feature map is input into the convolution layer to be fused to obtain an input feature map.
As an example, the input feature map may be obtained by fusing the target feature map with the corresponding position map through a convolutional layer as shown in fig. 8. In fig. 8, the i component (i coordinate) in the position map refers to the X-axis component in the coordinates of each element in the target image, and the j component (j coordinate) refers to the Y-axis component in the coordinates of each element in the target image.
That is, the target feature map of size w × h × c may be concatenated with the i component and the j component of the corresponding position map to obtain a concatenated feature map of size w × h × (c+2), and the concatenated feature map may be input to the convolutional layer for fusion to obtain an input feature map of size w' × h' × c'. Here w, h and c are the width, height and channel (dimension) components of the target feature map, and w', h' and c' are the width, height and channel components of the input feature map.
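A compact sketch of this position-map fusion (concatenating the i/j coordinate channels and fusing them with a convolution, as in fig. 8) is given below; the function name and the shape of the convolutional layer are assumptions.

```python
import torch
from torch import nn

def fuse_with_position_map(feat: torch.Tensor, conv: nn.Conv2d) -> torch.Tensor:
    """feat: (B, c, h, w) target feature map. Builds the position map whose i/j
    components are the x/y coordinates of each element, concatenates it as two
    extra channels (c+2) and fuses the result with a convolutional layer."""
    b, c, h, w = feat.shape
    j, i = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([i, j]).float().expand(b, 2, h, w)  # i: x-axis component, j: y-axis component
    return conv(torch.cat([feat, pos], dim=1))            # (B, c', h', w') input feature map
```

The convolutional layer here is assumed to take c + 2 input channels, for example nn.Conv2d(c + 2, c_out, kernel_size=1).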
Step 704, inputting the input feature map into an encoder of the target recognition model for encoding to obtain encoding features.
Step 705, inputting the coding features into a decoder of the target identification model for decoding to obtain the decoding features of a plurality of targets to be predicted in the target image.
In a possible implementation manner of the embodiment of the present disclosure, an encoder in the target recognition model may be used to encode the input feature map to obtain an encoded feature, and a decoder in the target recognition model may be used to decode the encoded feature to obtain decoded features of a plurality of targets to be predicted in the target image. For example, matrix multiplication operation may be performed on the encoding features according to the model parameters in the decoder, so as to obtain Q, K, V components in the attention mechanism, and according to Q, K, V components, the decoding features of a plurality of targets to be predicted in the target image are determined.
Therefore, by adopting the structure of the encoder-decoder, the input feature map is processed, that is, feature interaction can be performed on the input feature map based on an attention mechanism, such as a self-attention mechanism (self-attention) and a multi-head attention mechanism (multi-head attention), and the enhanced features, that is, the decoding features, are output, so that the prediction effect of the model can be improved.
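For reference, the scaled dot-product attention that underlies this feature interaction can be written in a few lines; this is the standard formulation, not code taken from the disclosure, and the shapes are assumed.

```python
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (num_tokens, d) matrices obtained by multiplying the encoded
    features with the decoder's parameter matrices."""
    scores = q @ k.t() / (q.shape[-1] ** 0.5)  # pairwise token similarities
    return scores.softmax(dim=-1) @ v          # attention-weighted combination of the values
```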
Step 706, using the decoding features of the multiple targets to be predicted as the mapping features of multiple dimensions.
And then, the decoding characteristics of a plurality of targets to be predicted are used as the mapping characteristics of a plurality of dimensions.
And step 707, respectively determining similarity between the feature mean of the plurality of categories and the mapping feature of each dimension.
And 708, fusing the mapping feature of each dimension with the feature mean values of the multiple categories according to the similarity to obtain a fusion feature of each dimension.
And step 709, performing target detection according to the fusion characteristics of the dimensions.
It should be noted that the execution processes of steps 701 to 702 and steps 707 to 709 may refer to the execution process of the foregoing embodiment, which is not described herein again.
In summary, the target feature map and the corresponding position map are fused to obtain an input feature map; inputting the input characteristic diagram into an encoder of the target recognition model for encoding to obtain encoding characteristics; inputting the coding characteristics into a decoder of the target recognition model for decoding to obtain the decoding characteristics of a plurality of targets to be predicted in the target image; and taking the decoding characteristics of a plurality of targets to be predicted as mapping characteristics of a plurality of dimensions. Therefore, the mapping characteristics of multiple dimensions can be accurately determined by combining the position graph and the characteristic graph, and the accuracy of the target detection result can be improved.
In order to clearly illustrate how to perform feature mapping on the target feature map by using a mapping network of a target recognition model to obtain mapping features of multiple dimensions, the present disclosure also provides a target detection method.
Fig. 9 is a schematic flowchart of a target detection method according to a seventh embodiment of the disclosure.
As shown in fig. 9, the target detection method may include the steps of:
step 901, acquiring a target image, and acquiring feature mean values of a plurality of categories.
And step 902, performing feature extraction on the target image to obtain a target feature map.
And 903, inputting the target characteristic diagram into an RPN network of the target recognition model to predict the interested regions so as to obtain characteristic diagrams of a plurality of interested regions.
In the embodiment of the present disclosure, as shown in fig. 10, the target feature map may be input into the RPN (Region Proposal Network) of the target recognition model to perform region-of-interest prediction, so as to obtain feature maps of a plurality of regions of interest.
Step 904, inputting the feature maps of the multiple regions of interest into the pooling layer of the target identification model for size adjustment, so as to obtain a target feature map with each region of interest conforming to a fixed size.
And then, inputting the feature maps of the multiple interested areas into a pooling layer of the target identification model, wherein the pooling layer can perform maximum pooling on the feature maps of the interested areas with non-uniform sizes and perform size adjustment to obtain a target feature map of which each interested area accords with a fixed size. ROI Pooling in fig. 10 represents the Pooling layer of the target recognition model.
Step 905, determining mapping characteristics of corresponding dimensions according to the target characteristic diagram of each region of interest.
Further, the target feature map of each region of interest may be input to a fully-connected layer of the target recognition model, and the fully-connected layer may output the mapping features of the corresponding dimensions. FC in fig. 10 may represent a full connection layer.
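A minimal sketch of this RPN, pooling and fully-connected path using torchvision's roi_align is given below; the proposal boxes, feature-map size, pooled size and fully-connected dimension are all illustrative assumptions (a real RPN would generate the proposals from the feature map).

```python
import torch
from torchvision.ops import roi_align

# Assumed inputs: a 256-channel target feature map for an 800 x 800 image and
# two dummy region-of-interest proposals in image coordinates.
feature_map = torch.randn(1, 256, 50, 50)
proposals = [torch.tensor([[0.0, 0.0, 100.0, 100.0], [30.0, 40.0, 200.0, 160.0]])]

# ROI pooling/alignment resizes every region of interest to a fixed 7 x 7 grid.
pooled = roi_align(feature_map, proposals, output_size=(7, 7), spatial_scale=50 / 800)

fc = torch.nn.Linear(256 * 7 * 7, 1024)          # fully-connected layer (FC in fig. 10)
mapping_feats = fc(pooled.flatten(start_dim=1))  # one mapping feature per region of interest
```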
And step 906, respectively determining the similarity between the mapping feature of each dimension and the feature mean of a plurality of categories.
And 907, fusing the mapping feature of each dimension with the feature mean values of a plurality of categories according to the similarity to obtain a fusion feature of each dimension.
And 908, detecting the target according to the fusion characteristics of the dimensions.
It should be noted that the execution processes of steps 901 to 902 and steps 906 to 908 may refer to the execution processes of the above embodiments, which are not described herein again.
In conclusion, the target characteristic diagram is input into the RPN network of the target recognition model to predict the interested regions so as to obtain a plurality of characteristic diagrams of the interested regions; inputting the feature maps of the multiple interested areas into a pooling layer of the target recognition model for size adjustment to obtain a target feature map of which each interested area conforms to a fixed size; and determining the mapping characteristics of corresponding dimensions according to the target characteristic graph of each interested area. Therefore, the mapping characteristics of the corresponding dimensionality can be accurately determined, and the accuracy of the target detection result can be improved.
According to the target detection method, a target image is obtained, and the characteristic mean values of a plurality of categories are obtained; carrying out feature extraction on the target image to obtain a target feature map; performing feature mapping on the target feature map by adopting a mapping network of the target identification model to obtain mapping features of multiple dimensions; respectively determining similarity between the mapping characteristics of each dimension and the characteristic mean values of a plurality of categories; according to the similarity, fusing the mapping feature of each dimension with the feature mean values of a plurality of categories to obtain a fusion feature of each dimension; and carrying out target detection according to the fusion characteristics of all dimensions. Therefore, the feature mean values of a plurality of categories are fused with the mapping features of each dimension, so that the difference between the categories can be enhanced, the target detection accuracy is further improved, and the error classification of target detection results can be avoided.
Corresponding to the target detection method provided in the embodiments of fig. 1 to 10, the present disclosure also provides a target detection apparatus, and since the target detection apparatus provided in the embodiments of the present disclosure corresponds to the target detection method provided in the embodiments of fig. 1 to 10, the implementation of the target detection method is also applicable to the target detection apparatus provided in the embodiments of the present disclosure, and will not be described in detail in the embodiments of the present disclosure.
Fig. 11 is a schematic structural diagram of an object detection apparatus according to an eighth embodiment of the present disclosure.
As shown in fig. 11, the object detection apparatus 1100 may include: an acquisition module 1110, an extraction module 1120, a mapping module 1130, a first determination module 1140, a fusion module 1150, and a detection module 1160.
The obtaining module 1110 is configured to obtain a target image and obtain feature mean values of multiple categories; an extracting module 1120, configured to perform feature extraction on the target image to obtain a target feature map; the mapping module is used for performing feature mapping on the target feature map by adopting a mapping network of the target identification model to obtain mapping features of multiple dimensions; a first determining module 1140, configured to determine, for each dimension of the mapping feature, similarity between the mapping feature and a feature mean of a plurality of categories respectively; a fusion module 1150, configured to fuse the mapping feature of each dimension with the feature mean of multiple categories according to the similarity, so as to obtain a fusion feature of each dimension; and the detection module 1160 is used for detecting the target according to the fusion characteristics of the dimensions.
In a possible implementation manner of the embodiment of the present disclosure, the fusion module 1150 is configured to: aiming at any one-dimensional mapping feature, selecting a feature mean value of a target category from feature mean values of a plurality of categories according to the similarity between the feature mean value of the plurality of categories and the feature mean value of the plurality of categories; and fusing the feature mean value of the target category with the mapping feature of one dimension to obtain the fusion feature of one dimension.
In a possible implementation manner of the embodiment of the present disclosure, the fusion module 1150 is further configured to: aiming at any one-dimensional mapping feature, determining the weight corresponding to the feature mean value of each category according to the similarity between the mapping feature and the feature mean values of a plurality of categories; and fusing the feature mean value of each category with the mapping feature of one dimension according to the weight corresponding to the feature mean value of each category to obtain the fusion feature of one dimension.
In one possible implementation manner of the embodiment of the present disclosure, the detecting module 1160 is configured to: and respectively inputting the fusion characteristics of all dimensions into corresponding prediction layers in the target recognition model to perform target detection so as to determine the prediction position of the prediction frame and the prediction category to which the target in the prediction frame belongs.
In a possible implementation manner of the embodiment of the present disclosure, the target image is labeled with an actual category, and the target detection apparatus further includes: a second determination module and an update module.
The second determining module is used for determining a target prediction layer with a prediction category matched with the actual category from all prediction layers; and the updating module is used for updating the feature mean value corresponding to the actual category according to the fusion feature input by the target prediction layer.
In one possible implementation manner of the embodiment of the present disclosure, the mapping module 1130 is configured to: fusing the target feature map and the corresponding position map to obtain an input feature map, wherein each element in the position map corresponds to each element in the target feature map one to one, and the elements in the position map are used for indicating the coordinates of the corresponding elements in the target feature map in the target image; inputting the input characteristic diagram into an encoder of the target recognition model for encoding to obtain encoding characteristics; inputting the coding characteristics into a decoder of the target recognition model for decoding to obtain the decoding characteristics of a plurality of targets to be predicted in the target image; and taking the decoding characteristics of a plurality of targets to be predicted as mapping characteristics of a plurality of dimensions.
In a possible implementation manner of the embodiment of the present disclosure, the mapping module 1130 is further configured to: inputting the target characteristic diagram into an RPN network of a target recognition model to predict the interested regions so as to obtain a plurality of characteristic diagrams of the interested regions; inputting the feature maps of the multiple interested areas into a pooling layer of the target recognition model for size adjustment to obtain a target feature map of which each interested area conforms to a fixed size; and determining the mapping characteristics of corresponding dimensions according to the target characteristic graph of each interested area.
The target detection device of the embodiment of the disclosure acquires a target image and characteristic mean values of a plurality of categories; carrying out feature extraction on the target image to obtain a target feature map; performing feature mapping on the target feature map by adopting a mapping network of the target identification model to obtain mapping features of multiple dimensions; respectively determining similarity between the mapping characteristics of each dimension and the characteristic mean values of a plurality of categories; according to the similarity, fusing the mapping feature of each dimension with the feature mean values of a plurality of categories to obtain a fusion feature of each dimension; and carrying out target detection according to the fusion characteristics of all dimensions. Therefore, the feature mean values of a plurality of categories are fused with the mapping features of each dimension, so that the difference between the categories can be enhanced, the target detection accuracy is further improved, and the error classification of target detection results can be avoided.
To implement the above embodiments, the present disclosure also provides an electronic device, which may include at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the object detection method according to any one of the above embodiments of the disclosure.
In order to achieve the above embodiments, the present disclosure also provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the target detection method proposed by any one of the above embodiments of the present disclosure.
To achieve the above embodiments, the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the object detection method set forth in any of the above embodiments of the present disclosure.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device 1200, which can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the apparatus 1200 includes a computing unit 1201 which can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 1202 or a computer program loaded from a storage unit 1208 into a RAM (Random Access Memory) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An I/O (Input/Output) interface 1205 is also connected to the bus 1204.
Various components in the device 1200 are connected to the I/O interface 1205 including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing Unit 1201 include, but are not limited to, a CPU (Central Processing Unit), a GPU (graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing Units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable Processor, controller, microcontroller, and the like. The calculation unit 1201 performs the respective methods and processes described above, such as the object detection method. For example, in some embodiments, the object detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM1203 and executed by the computing unit 1201, one or more steps of the object detection method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the object detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that artificial intelligence is the discipline of studying how to make a computer simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; this is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method of target detection, comprising:
acquiring a target image and acquiring feature mean values of a plurality of categories;
performing feature extraction on the target image to obtain a target feature map;
performing feature mapping on the target feature map by adopting a mapping network of a target identification model to obtain mapping features of multiple dimensions;
respectively determining the similarity between the mapping feature of each dimension and the feature mean values of the plurality of categories;
according to the similarity, fusing the mapping feature of each dimension with the feature mean values of the plurality of categories to obtain a fusion feature of each dimension;
and performing target detection according to the fusion features of all dimensions.
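For illustration only (this is not claim language): a minimal PyTorch-style sketch of the claimed flow, assuming cosine similarity as the similarity measure and the soft weighting of claim 3 for the fusion step; the backbone, mapping network, detection head, and the element-wise addition used for fusion are all hypothetical names and choices, not prescribed by the claim.

```python
import torch.nn.functional as F

def detect(image, backbone, mapping_net, class_means, head):
    """End-to-end sketch: extract, map, compare with class means, fuse, detect.

    class_means: tensor of shape (num_classes, feat_dim), one stored mean per category.
    """
    feat_map = backbone(image)                     # target feature map
    mappings = mapping_net(feat_map)               # (num_dims, feat_dim) mapping features
    # similarity of each dimension's mapping feature to every class mean
    sims = F.cosine_similarity(mappings.unsqueeze(1), class_means.unsqueeze(0), dim=-1)
    weights = sims.softmax(dim=-1)                 # (num_dims, num_classes)
    fused = mappings + weights @ class_means       # fusion feature of each dimension
    return head(fused)                             # predicted boxes and categories
```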
2. The method according to claim 1, wherein the fusing the mapping feature of each dimension with the feature mean values of the plurality of categories according to the similarity to obtain a fused feature of each dimension comprises:
for the mapping feature of any one dimension, selecting a feature mean value of a target category from the feature mean values of the multiple categories according to the similarity between the mapping feature and the feature mean values of the multiple categories;
and fusing the feature mean value of the target category with the mapping feature of the dimension to obtain the fusion feature of the dimension.
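For illustration only: a sketch of this hard-selection fusion, assuming cosine similarity as the similarity measure and element-wise addition as the fusion operator; both are assumptions, since the claim fixes neither.

```python
import torch.nn.functional as F

def fuse_hard(mapping, class_means):
    """Pick the most similar category's mean and fuse it with the mapping feature.

    mapping:     (feat_dim,) mapping feature of one dimension
    class_means: (num_classes, feat_dim) stored feature means
    """
    sims = F.cosine_similarity(mapping.unsqueeze(0), class_means, dim=-1)  # (num_classes,)
    target_idx = sims.argmax()                  # target category
    return mapping + class_means[target_idx]    # fusion feature of this dimension
```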
3. The method according to claim 1, wherein the fusing the mapping feature of each dimension with the feature mean values of the plurality of categories according to the similarity to obtain a fused feature of each dimension comprises:
for the mapping feature of any one dimension, determining the weight corresponding to the feature mean value of each category according to the similarity between the mapping feature and the feature mean values of the categories;
and fusing the feature mean values of the various categories and the mapping feature of the dimension according to the weight corresponding to the feature mean value of the various categories to obtain the fusion feature of the dimension.
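For illustration only: a sketch of this weighted fusion, assuming the weights are a softmax over cosine similarities (the claim only requires that the weights be derived from the similarities) and that fusion is element-wise addition; the temperature parameter is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def fuse_soft(mapping, class_means, temperature=1.0):
    """Fuse the mapping feature of one dimension with all class means, weighted by similarity.

    mapping:     (feat_dim,)
    class_means: (num_classes, feat_dim)
    """
    sims = F.cosine_similarity(mapping.unsqueeze(0), class_means, dim=-1)  # (num_classes,)
    weights = torch.softmax(sims / temperature, dim=-1)                    # weight per category
    return mapping + weights @ class_means                                 # fusion feature
```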
4. The method according to any one of claims 1-3, wherein the performing target detection according to the fused features of each dimension comprises:
and respectively inputting the fusion features of all dimensions into corresponding prediction layers in the target recognition model to perform target detection so as to determine the prediction position of a prediction frame and the prediction category to which the target in the prediction frame belongs.
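For illustration only: one possible shape of such a prediction layer, assuming a simple fully connected head; the claim does not fix the head architecture, the box parameterization, or the number of classes.

```python
import torch.nn as nn

class PredictionLayer(nn.Module):
    """Maps one fusion feature to a predicted box position and class scores."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.box_head = nn.Linear(feat_dim, 4)            # predicted box position (x, y, w, h)
        self.cls_head = nn.Linear(feat_dim, num_classes)  # predicted category logits

    def forward(self, fused_feature):
        return self.box_head(fused_feature), self.cls_head(fused_feature)
```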
5. The method of claim 4, wherein the target image is labeled with an actual category; after the fusion features of each dimension are respectively input into the corresponding prediction layer in the target recognition model for target detection to determine the prediction position of the prediction frame and the prediction category to which the target in the prediction frame belongs, the method further includes:
determining a target prediction layer of which the prediction class is matched with the actual class from the prediction layers;
and updating the feature mean value corresponding to the actual category according to the fusion feature input to the target prediction layer.
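For illustration only: the claim requires that the feature mean of the actual category be updated from the fusion feature fed into the matched prediction layer, but does not fix the update rule; an exponential moving average is one plausible realization, and the momentum value here is an assumption.

```python
import torch

@torch.no_grad()
def update_class_mean(class_means, actual_class, fused_feature, momentum=0.99):
    """Refresh the stored mean of the ground-truth category from the matched fusion feature."""
    old = class_means[actual_class]
    class_means[actual_class] = momentum * old + (1.0 - momentum) * fused_feature
    return class_means
```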
6. The method according to any one of claims 1-3, wherein the feature mapping the target feature map using the mapping network of the target recognition model to obtain mapping features of a plurality of dimensions comprises:
fusing the target feature map and the corresponding position map to obtain an input feature map, wherein each element in the position map corresponds to each element in the target feature map one to one, and the elements in the position map are used for indicating the coordinates of the corresponding elements in the target feature map in the target image;
inputting the input feature map into an encoder of the target recognition model for encoding to obtain encoding features;
inputting the coding features into a decoder of the target recognition model for decoding to obtain decoding features of a plurality of targets to be predicted in the target image;
and taking the decoding features of the plurality of targets to be predicted as the mapping features of the plurality of dimensions.
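For illustration only: this variant reads like a DETR-style mapping network. The sketch below assumes the position map is a normalized (x, y) coordinate grid concatenated to the target feature map, and that the encoder/decoder are standard transformer blocks driven by learned queries, one per target to be predicted; none of these details are fixed by the claim.

```python
import torch
import torch.nn as nn

def build_position_map(feat_map):
    """Normalized (x, y) coordinates, one pair per element of the target feature map."""
    batch, _, h, w = feat_map.shape
    ys = torch.linspace(0, 1, h, device=feat_map.device)
    xs = torch.linspace(0, 1, w, device=feat_map.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy]).unsqueeze(0).expand(batch, -1, -1, -1)

class EncoderDecoderMapper(nn.Module):
    def __init__(self, in_channels, d_model=256, num_queries=100):
        super().__init__()
        self.input_proj = nn.Conv2d(in_channels + 2, d_model, kernel_size=1)
        self.transformer = nn.Transformer(d_model, batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))  # one per target to predict

    def forward(self, feat_map):
        x = torch.cat([feat_map, build_position_map(feat_map)], dim=1)  # fuse the position map
        x = self.input_proj(x).flatten(2).transpose(1, 2)               # (batch, H*W, d_model)
        queries = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        # the encoder consumes the input feature map; the decoder emits per-target decoding features
        return self.transformer(x, queries)                             # (batch, num_queries, d_model)
```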
7. The method according to any one of claims 1-3, wherein the feature mapping the target feature map using the mapping network of the target recognition model to obtain mapping features of a plurality of dimensions comprises:
inputting the target feature map into an RPN (Region Proposal Network) of the target recognition model to predict regions of interest so as to obtain feature maps of a plurality of regions of interest;
inputting the feature maps of the plurality of regions of interest into a pooling layer of the target recognition model for size adjustment to obtain a target feature map of a fixed size for each region of interest;
and determining the mapping features of the corresponding dimensions according to the target feature map of each region of interest.
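For illustration only: this matches a two-stage, Faster R-CNN-style reading of the claim. The sketch uses torchvision's roi_align as the fixed-size pooling step; the RPN is assumed to be a callable that already returns proposal boxes in image coordinates, and flattening the pooled map into the per-dimension mapping feature is an assumption.

```python
from torchvision.ops import roi_align

def map_with_rpn(feat_map, rpn, spatial_scale, output_size=7):
    """Predict regions of interest with an RPN, then pool each to a fixed size.

    feat_map: (batch, channels, H, W) target feature map
    rpn:      callable returning a list of (K_i, 4) proposal boxes per image, in image coordinates
    """
    proposals = rpn(feat_map)                                 # regions of interest
    roi_feats = roi_align(feat_map, proposals, output_size,   # fixed-size per-RoI feature maps
                          spatial_scale=spatial_scale)
    # each pooled RoI map yields the mapping feature of one dimension
    return roi_feats.flatten(start_dim=1)                     # (total_rois, channels * output_size**2)
```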
8. An object detection device comprising:
the acquisition module is used for acquiring a target image and acquiring the feature mean values of a plurality of categories;
the extraction module is used for extracting the features of the target image to obtain a target feature map;
the mapping module is used for performing feature mapping on the target feature map by adopting a mapping network of a target recognition model to obtain mapping features of multiple dimensions;
the first determining module is used for respectively determining the similarity between the mapping feature of each dimension and the feature mean values of the plurality of categories;
the fusion module is used for fusing the mapping feature of each dimension with the feature mean values of the multiple categories according to the similarity so as to obtain the fusion feature of each dimension;
and the detection module is used for performing target detection according to the fusion features of all dimensions.
9. The apparatus of claim 8, wherein the fusion module is to:
for the mapping feature of any one dimension, selecting a feature mean value of a target category from the feature mean values of the multiple categories according to the similarity between the mapping feature and the feature mean values of the multiple categories;
and fusing the feature mean value of the target category with the mapping feature of the dimension to obtain the fusion feature of the dimension.
10. The apparatus of claim 8, wherein the fusion module is further configured to:
for the mapping feature of any one dimension, determining the weight corresponding to the feature mean value of each category according to the similarity between the mapping feature and the feature mean values of the categories;
and fusing the feature mean values of the various categories and the mapping feature of the dimension according to the weight corresponding to the feature mean value of the various categories to obtain the fusion feature of the dimension.
11. The apparatus of any one of claims 8-10, wherein the detection module is to:
and respectively inputting the fusion features of all dimensions into corresponding prediction layers in the target recognition model to perform target detection so as to determine the prediction position of a prediction frame and the prediction category to which the target in the prediction frame belongs.
12. The apparatus of claim 11, wherein the target image is labeled with an actual category; the device further comprises:
a second determining module, configured to determine, from among the prediction layers, a target prediction layer whose prediction category matches the actual category;
and the updating module is used for updating the feature mean value corresponding to the actual category according to the fusion feature input to the target prediction layer.
13. The apparatus of any of claims 8-10, wherein the mapping module is to:
fusing the target feature map and the corresponding position map to obtain an input feature map, wherein each element in the position map corresponds to each element in the target feature map one to one, and the elements in the position map are used for indicating the coordinates of the corresponding elements in the target feature map in the target image;
inputting the input feature map into an encoder of the target recognition model for encoding to obtain encoding features;
inputting the coding features into a decoder of the target recognition model for decoding to obtain decoding features of a plurality of targets to be predicted in the target image;
and taking the decoding features of the plurality of targets to be predicted as the mapping features of the plurality of dimensions.
14. The apparatus of any of claims 8-10, wherein the mapping module is further configured to:
inputting the target feature map into an RPN (Region Proposal Network) of the target recognition model to predict regions of interest so as to obtain feature maps of a plurality of regions of interest;
inputting the feature maps of the plurality of regions of interest into a pooling layer of the target recognition model for size adjustment to obtain a target feature map of a fixed size for each region of interest;
and determining the mapping features of the corresponding dimensions according to the target feature map of each region of interest.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the method according to any one of claims 1-7.
CN202111163177.XA 2021-09-30 2021-09-30 Target detection method, target detection device, electronic equipment and storage medium Pending CN113887423A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111163177.XA CN113887423A (en) 2021-09-30 2021-09-30 Target detection method, target detection device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111163177.XA CN113887423A (en) 2021-09-30 2021-09-30 Target detection method, target detection device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113887423A true CN113887423A (en) 2022-01-04

Family

ID=79004849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111163177.XA Pending CN113887423A (en) 2021-09-30 2021-09-30 Target detection method, target detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113887423A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination