CN111488777A - Object identification method, object identification device and electronic equipment - Google Patents

Object identification method, object identification device and electronic equipment Download PDF

Info

Publication number
CN111488777A
CN111488777A (application CN201910447858.5A)
Authority
CN
China
Prior art keywords
feature map
normalized
neural network
final
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910447858.5A
Other languages
Chinese (zh)
Inventor
汪成
宋俍辰
张骞
王国利
黄畅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Publication of CN111488777A publication Critical patent/CN111488777A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

An object recognition method, an object recognition apparatus and an electronic device are disclosed. The object recognition method comprises the following steps: obtaining a first feature map from an input image through a first neural network, and obtaining a second feature map from the first feature map through a second neural network; downsampling the second feature map through a third neural network to obtain a first downsampled feature map; performing scale normalization on the second feature map and the first downsampled feature map to obtain a first normalized feature map and a second normalized feature map; determining a final feature map of the input image based on the first normalized feature map and the second normalized feature map; and identifying a target object in the input image based on the final feature map. In this way, high-level semantic features of the image may be better captured, thereby improving the effectiveness of object recognition.

Description

Object identification method, object identification device and electronic equipment
Technical Field
The present application relates to the field of deep learning technologies, and more particularly, to an object recognition method, an object recognition apparatus, and an electronic device.
Background
Currently, in the fields of computer vision, automatic driving, video target tracking, etc., identification of a predetermined object in an image is involved. For example, the ReID (Re-identification) system is used to identify predetermined objects from different images.
Pedestrian re-identification refers to identifying a target pedestrian from a pool of pedestrian images or video streams originating from multiple non-overlapping camera fields of view. Unlike ordinary pedestrian tracking under a single camera, pedestrian re-identification enables long-term tracking and monitoring of a specific pedestrian across different background environments and multi-camera settings, and therefore has broad application prospects in the surveillance field. In addition, in the new retail field, pedestrian re-identification technology can be used for pedestrian trajectory analysis, which plays a very important role in analyzing data of retail stores. Further, there is also a need to identify other objects from images, such as vehicles, buildings, road signs, and the like.
Therefore, an object recognition scheme that efficiently recognizes a predetermined target object from an image is desired.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. Embodiments of the present application provide an object recognition method, an object recognition apparatus, and an electronic device, which can better capture high-level semantic features of an image by down-sampling a feature map of an input image to obtain a final feature map for recognizing a target object, thereby improving the effectiveness of object recognition.
According to an aspect of the present application, there is provided an object recognition method including: obtaining a first feature map from an input image through a first neural network, and obtaining a second feature map from the first feature map through a second neural network; the second feature map is downsampled through a third neural network to obtain a first downsampling feature map; performing scale normalization on the second feature map and the first downsampling feature map to obtain a first normalized feature map and a second normalized feature map; determining a final feature map of the input image based on the first normalized feature map and the second normalized feature map; and identifying the target object in the input image based on the final feature map.
According to another aspect of the present application, there is provided an object recognition apparatus including: the characteristic map obtaining unit is used for obtaining a first characteristic map from the input image through a first neural network and obtaining a second characteristic map from the first characteristic map through a second neural network; the first downsampling unit is used for downsampling the second feature map through a third neural network to obtain a first downsampled feature map; the first normalization unit is used for carrying out scale normalization on the second feature map and the first downsampling feature map so as to obtain a first normalized feature map and a second normalized feature map; a feature map determination unit configured to determine a final feature map of the input image based on the first normalized feature map and the second normalized feature map; and a target recognition unit for recognizing a target object in the input image based on the final feature map
According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory in which are stored computer program instructions which, when executed by the processor, cause the processor to perform the object recognition method as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the object recognition method as described above.
The object recognition method, the object recognition apparatus and the electronic device provided by the present application obtain the final feature map for recognizing the target object by downsampling the feature map of the input image, so that the final feature map contains the features obtained by the downsampling. In this way, the high-level semantic features of the image can be better captured and the effectiveness of object recognition can be improved.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 illustrates a flow chart of an object recognition method according to an embodiment of the present application.
Fig. 2 illustrates a schematic diagram of a first example of a network structure according to an embodiment of the application.
Fig. 3 illustrates a flow chart of an example of a process of determining a final feature map according to an embodiment of the present application.
Fig. 4 illustrates a schematic diagram of a second example of a network structure according to an embodiment of the application.
Fig. 5 illustrates a schematic diagram of a third example of a network structure according to an embodiment of the application.
Fig. 6 illustrates a flow chart of an example of a process of slicing a feature map according to an embodiment of the present application.
FIG. 7 illustrates a schematic diagram of a feature map processing procedure according to an embodiment of the application.
Fig. 8 illustrates a flow chart of an example of a process of identifying a target object according to an embodiment of the present application.
Fig. 9 illustrates a flow chart of a training process of a neural network according to an embodiment of the present application.
Fig. 10 illustrates a flow chart of an example of a process of calculating a loss function according to an embodiment of the present application.
Fig. 11 illustrates a block diagram of a first example of an object recognition apparatus according to an embodiment of the present application.
Fig. 12 illustrates a block diagram of a first example of a first feature map determination unit according to an embodiment of the present application.
Fig. 13 illustrates a block diagram of a second example of an object recognition apparatus according to an embodiment of the present application.
Fig. 14 illustrates a block diagram of a third example of an object recognition apparatus according to an embodiment of the present application.
Fig. 15 illustrates a block diagram of a second example of the first feature map determination unit according to an embodiment of the present application.
Fig. 16 illustrates a block diagram of an object recognition unit according to an embodiment of the present application.
FIG. 17 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
As described above, in order to recognize an object in a task such as pedestrian re-identification, it is necessary to train a deep neural network using pedestrian data collected under a plurality of cameras as a training set, thereby obtaining a deep feature extractor.
Due to camera angles and detector behavior, the position of the object varies greatly across different pictures; for example, the object appears in the lower half of some pictures and in the upper half of others. As a result, the objects in the pictures are not aligned, and the neural network cannot effectively extract object features for object recognition.
In view of the above technical problems, the basic concept of the present application is to design a multi-scale multi-slice neural network, to better capture the high-level semantic features of an image by down-sampling the feature map of an input image, and to combine the down-sampled feature map and the original feature map after normalization to obtain a final feature map for target recognition.
Specifically, according to the object identification method, the object identification device and the electronic device provided by the present application, an input image is first passed through a first neural network to obtain a first feature map, and the first feature map is passed through a second neural network to obtain a second feature map. The second feature map is then downsampled through a third neural network to obtain a first downsampled feature map, and the second feature map and the first downsampled feature map are subjected to scale normalization to obtain a first normalized feature map and a second normalized feature map. A final feature map of the input image is then determined based on the first normalized feature map and the second normalized feature map, and finally a target object in the input image is identified based on the final feature map.
Therefore, the object recognition method, the object recognition device and the electronic equipment provided by the present application can learn part-level features of the input image at different scales by downsampling the feature map of the input image, so as to obtain detail features contained in the feature map that are not salient but are discriminative.
In addition, the object recognition method, the object recognition device and the electronic equipment provided by the application obtain the final feature map for recognizing the target object based on the downsampled feature map, so that the final feature map comprises the detail features obtained through downsampling, the high-level semantic features of the image can be better captured, and the effectiveness of object recognition is improved.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
Fig. 1 illustrates a flow chart of an object recognition method according to an embodiment of the present application.
As shown in fig. 1, an object recognition method according to an embodiment of the present application includes the following steps.
Step S110, obtaining a first feature map from the input image through a first neural network, and obtaining a second feature map from the first feature map through a second neural network. Here, objects to be recognized, such as pedestrians, vehicles, and other markers, are included in the input image.
Referring to fig. 2, fig. 2 illustrates a schematic diagram of a first example of a network structure according to an embodiment of the present application. As shown in fig. 2, the input image IN is passed through a first neural network N1 to obtain a first feature map F1, and the first feature map F1 is further passed through a second neural network N2 to obtain a second feature map F2. In the embodiment of the present application, the first neural network N1 and the second neural network N2 may be separate neural networks or may be parts of a larger neural network as a whole; for example, the first neural network N1 may be layer 1 of the ResNet50 neural network, and the second neural network N2 may be layers 2-4 of the ResNet50 neural network.
And step S120, downsampling the second feature map through a third neural network to obtain a first downsampled feature map. With continued reference to fig. 2, the second feature map F2 is downsampled by the third neural network N3 to obtain a first downsampled feature map FD1. Similarly, in the embodiment of the present application, the third neural network N3 may be a neural network independent of the first neural network N1 and the second neural network N2, or may be part of a larger neural network as a whole. For example, the third neural network N3 may be layers 5-9 of the ResNet50 neural network described above.
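As a rough sketch of how the first, second and third neural networks might be carved out of a shared ResNet50 backbone, the following Python snippet groups torchvision's resnet50 stages so that the channel counts match the example scales given in the next paragraph; the stage boundaries, the 384 × 128 input size and the variable names are assumptions for illustration, not the patent's exact layer split (the later networks N4 and N5 could be taken from the remaining stages in the same way).

```python
# Sketch only: one possible split of a torchvision ResNet50 into N1, N2, N3.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)

n1 = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                   backbone.maxpool)   # input image IN -> first feature map F1
n2 = backbone.layer1                   # F1 -> second feature map F2 (256 channels)
n3 = backbone.layer2                   # F2 -> first downsampled feature map FD1 (512 channels)

x = torch.randn(2, 3, 384, 128)        # two images at an assumed 384 x 128 input size
f1 = n1(x)
f2 = n2(f1)
fd1 = n3(f2)
print(f2.shape, fd1.shape)             # (2, 256, 96, 32) and (2, 512, 48, 16)
```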
For example, as shown in fig. 2, the second feature map F2 has a scale of N × 256 × 96 × 32, and the first downsampled feature map FD1 has a scale of N × 512 × 48 × 16. Here, N represents the number of input images, which may also be referred to as the batch size, and 256, 96 and 32 respectively represent the number of channels, the height and the width.
Step S130, performing scale normalization on the second feature map and the first downsampled feature map to obtain a first normalized feature map and a second normalized feature map. In the embodiment of the present application, the scale normalization mainly normalizes the number of channels. Thus, a scale normalization module SN performs scale normalization on the second feature map F2 and the first downsampled feature map FD1 to obtain a first normalized feature map FS1 and a second normalized feature map FS2, where the scale of the first normalized feature map FS1 is N × 512 × 96 × 32, and the scale of the second normalized feature map FS2 is N × 512 × 48 × 16.
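The text does not pin down the operator behind the scale normalization module SN; a minimal sketch, assuming a per-branch 1 × 1 convolution plus batch normalization that maps each feature map to the common channel count of 512, could look like this:

```python
# Sketch: scale normalization as a 1x1 projection to a common channel count.
# The 1x1-conv choice is an assumption; the text only states that the channel
# number is normalized.
import torch
import torch.nn as nn

class ScaleNorm(nn.Module):
    def __init__(self, in_channels, out_channels=512):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return self.bn(self.proj(x))

sn_f2 = ScaleNorm(256)                          # applied to F2
sn_fd1 = ScaleNorm(512)                         # applied to FD1
fs1 = sn_f2(torch.randn(2, 256, 96, 32))        # first normalized map, 2 x 512 x 96 x 32
fs2 = sn_fd1(torch.randn(2, 512, 48, 16))       # second normalized map, 2 x 512 x 48 x 16
```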
Step S140, determining a final feature map of the input image based on the first normalized feature map and the second normalized feature map. That is, since the second normalized feature map includes the non-salient but distinctive detail features of the input image by down-sampling the feature map of the input image, the final feature map determined based on the first and second normalized feature maps will also include these detail features that are advantageous for image recognition.
And step S150, identifying the target object in the input image based on the final feature map.
As described above, by using the detail feature included in the final feature map, which is not prominent but is distinctive, the recognition accuracy of the target object in the input image can be improved.
Fig. 3 illustrates a flow chart of an example of a process of determining a final feature map according to an embodiment of the present application.
As shown in fig. 3, the step S140 includes the following steps based on the embodiment shown in fig. 1.
Step S1401, upsampling based on the second normalized feature map to obtain a first upsampled feature map. Referring to fig. 2, the first normalized feature map FS1 has a scale of N × 512 × 96 × 32, and the second normalized feature map FS2 has a scale of N × 512 × 48 × 16. A first upsampled feature map FU1 may be obtained by upsampling, e.g., bilinear upsampling, the second normalized feature map FS2 through an upsampling module US; for example, the first upsampled feature map FU1 has a scale of N × 512 × 96 × 32.
Step S1402, combining the first normalized feature map and the first up-sampling feature map, to determine the final feature map. By upsampling, the first normalized feature map FS1 is of the same scale as the first upsampled feature map FU1, and the first normalized feature map FS1 may be combined with the first upsampled feature map FU1 by, for example, an adder to determine the final feature map F.
In this way, the accuracy of image recognition can be further improved by combining detail information of the image through upsampling.
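A minimal sketch of steps S1401 and S1402, assuming bilinear interpolation for the upsampling module US and element-wise addition for the adder (tensor shapes follow the running example):

```python
import torch
import torch.nn.functional as F

fs1 = torch.randn(2, 512, 96, 32)    # first normalized feature map FS1
fs2 = torch.randn(2, 512, 48, 16)    # second normalized feature map FS2

# Step S1401: upsample FS2 to the spatial size of FS1.
fu1 = F.interpolate(fs2, size=fs1.shape[-2:], mode="bilinear", align_corners=False)

# Step S1402: combine FS1 and FU1; the sum is the basis of the final feature map F.
combined = fs1 + fu1
```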
Fig. 4 illustrates a schematic diagram of a second example of a network structure according to an embodiment of the application. As shown in fig. 4, on the basis of the first example as shown in fig. 2, the first downsampled feature map FD1 is further downsampled by a fourth neural network N4 to obtain a second downsampled feature map FD2.
Similar to the first, second and third neural networks N1, N2 and N3 described above, the fourth neural network N4 may be a completely independent neural network or may be part of a larger neural network as a whole. For example, the fourth neural network N4 may be layers 10-19 of the ResNet50 neural network described above.
Next, the second downsampled feature map FD2 is subjected to scale normalization by the scale normalization module SN to obtain a third normalized feature map FS3 having the same number of channels as the first normalized feature map FS1 and the second normalized feature map FS2. Specifically, as shown in fig. 4, the scale of the first downsampled feature map FD1 is N × 512 × 48 × 16, and the scale of the second downsampled feature map FD2 obtained after downsampling is N × 1024 × 24 × 8. After the scale normalization, which mainly normalizes the number of channels, the third normalized feature map FS3 having a scale of N × 512 × 24 × 8 is obtained.
Then, a first up-sampled feature map FU1 is obtained based on the second normalized feature map FS2 and the third normalized feature map FS 3. In this way, the final feature map of the input image further includes the detail features of the second down-sampled feature map FD2 obtained by performing the second down-sampling, so that the detail features included in the final feature map of the input image are richer, and the accuracy of identifying the target object is increased.
Continuing with fig. 4, similar to the process of obtaining the final feature map described earlier with reference to fig. 2, the first upsampled feature map FU1 is also obtained by combining feature maps. Specifically, upsampling, such as bilinear upsampling, is first performed by the upsampling module US based on the third normalized feature map FS3 to obtain a second upsampled feature map FU2; for example, the third normalized feature map FS3 has a scale of N × 512 × 24 × 8, and the second upsampled feature map FU2 obtained after upsampling has a scale of N × 512 × 48 × 16. Then, the second normalized feature map FS2 having a scale of N × 512 × 48 × 16 is combined with the second upsampled feature map FU2 having a scale of N × 512 × 48 × 16, and the result is upsampled by the upsampling module US to obtain the first upsampled feature map FU1, as described earlier.
Similarly, the accuracy of image recognition may be improved because further detail information of the image is incorporated by upsampling.
Fig. 5 illustrates a schematic diagram of a third example of a network structure according to an embodiment of the application. As shown in fig. 5, on the basis of the second example as shown in fig. 4, the second downsampled feature map FD2 is further downsampled by a fifth neural network N5 to obtain a third downsampled feature map FD3.
Similar to the first, second, third and fourth neural networks N1, N2, N3 and N4 described above, the fifth neural network N5 may be a completely independent neural network or may be part of a larger neural network as a whole. For example, the fifth neural network N5 may be layers 20-50 of the ResNet50 neural network described above.
Next, the third downsampled feature map FD3 is subjected to scale normalization by the scale normalization module SN to obtain a fourth normalized feature map FS4 having the same number of channels as the first normalized feature map FS1, the second normalized feature map FS2 and the third normalized feature map FS3. Specifically, as shown in fig. 5, the second downsampled feature map FD2 has a scale of N × 1024 × 24 × 8, and the third downsampled feature map FD3 obtained after downsampling has a scale of N × 2048 × 12 × 4. After the scale normalization, which mainly normalizes the number of channels, the fourth normalized feature map FS4 having a scale of N × 512 × 12 × 4 is obtained.
Then, a second upsampled feature map FU2 is obtained based on the third normalized feature map FS3 and the fourth normalized feature map FS 4. In this way, the final feature map of the input image further includes the detail features of the third down-sampled feature map FD3 obtained by performing the third down-sampling, so that the detail features included in the final feature map of the input image are richer, and the accuracy of identifying the target object is increased.
Continuing with reference to fig. 5, similar to the process of obtaining the first upsampled feature map FU1 described earlier with reference to fig. 4, the second upsampled feature map FU2 is also obtained by combining feature maps. Specifically, upsampling, such as bilinear upsampling, is first performed by the upsampling module US based on the fourth normalized feature map FS4 to obtain a third upsampled feature map FU3; for example, the fourth normalized feature map FS4 has a scale of N × 512 × 12 × 4, and the third upsampled feature map FU3 obtained after upsampling has a scale of N × 512 × 24 × 8. Then, the third normalized feature map FS3 having a scale of N × 512 × 24 × 8 is combined with the third upsampled feature map FU3 having a scale of N × 512 × 24 × 8, and the result is upsampled by the upsampling module US to obtain the second upsampled feature map FU2, as described earlier.
Similarly, the accuracy of image recognition may be improved because further detail information of the image is incorporated by upsampling.
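Taken together, the cascades of figs. 4 and 5 form a top-down fusion over the three normalized maps. A sketch under the same assumptions as above (bilinear upsampling, element-wise addition, example shapes) follows:

```python
import torch
import torch.nn.functional as F

def up(x, size):
    """Bilinearly upsample x to the given spatial size."""
    return F.interpolate(x, size=size, mode="bilinear", align_corners=False)

fs2 = torch.randn(2, 512, 48, 16)    # normalized map derived from FD1
fs3 = torch.randn(2, 512, 24, 8)     # normalized map derived from FD2
fs4 = torch.randn(2, 512, 12, 4)     # normalized map derived from FD3

fu3 = up(fs4, (24, 8))               # third upsampled feature map FU3
fu2 = up(fs3 + fu3, (48, 16))        # second upsampled feature map FU2
fu1 = up(fs2 + fu2, (96, 32))        # first upsampled feature map FU1
```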
Fig. 6 illustrates a flow chart of an example of a process of slicing a feature map according to an embodiment of the present application.
As shown in fig. 6, step S1402 further includes the following steps based on the embodiment shown in fig. 3.
Step S14021, combining the first normalized feature map and the first up-sampled feature map to obtain a combined feature map. Referring to fig. 7, fig. 7 illustrates a schematic diagram of a feature map processing procedure according to an embodiment of the present application. As shown in fig. 7, the first normalized feature map FS1 and the first upsampled feature map FU1 obtained as described above are combined, for example by means of an adder, to obtain a combined feature map FC.
Step S14022, performing average segmentation on the combined feature map according to a preset direction to obtain a preset number of partial feature maps. For example, as shown in fig. 7, the combined feature map FC is trisected in the horizontal direction to obtain three partial feature maps FC1, FC2 and FC3. Here, in the embodiment of the present application, the preset direction may also be a direction other than the horizontal direction, for example, the vertical direction, and the number of average divisions may also be another number, for example, six equal divisions, and the like.
Step S14023, determining a final feature map based on the preset number of partial feature maps. In this way, by performing average segmentation on the feature map of the input image, the obtained final feature map can contain local details of the input image, so that the accuracy of image identification is increased.
In one example, the global average pooling may be performed on the preset number of partial feature maps first, and then the pooled partial feature maps of the preset number may be concatenated to obtain the final feature map. Therefore, in the embodiment of the present application, by obtaining the final feature map in a multi-slice manner, local features in the input image can be better utilized.
Further, when the pooled partial feature maps of the preset number are concatenated to obtain the final feature map, the pooled partial feature maps of the preset number may be concatenated to obtain a concatenated feature map, and then the concatenated feature map and the combined feature map are combined to obtain the final feature map. In this way, the final feature map includes not only local features obtained by average segmentation of the feature map of the input image, but also global features in the combined feature map that is not segmented, so that the accuracy of recognition can be improved by combining the local features with the global features.
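A sketch of the slicing and pooling of steps S14021 to S14023, assuming three horizontal strips and assuming that the un-sliced combined map is also globally pooled before being appended to the concatenated local descriptor (the text leaves the exact form of that last combination open):

```python
import torch
import torch.nn.functional as F

fc = torch.randn(2, 512, 96, 32)                          # combined feature map FC
fc1, fc2, fc3 = torch.chunk(fc, chunks=3, dim=2)          # trisect along the height
pooled = [F.adaptive_avg_pool2d(p, 1).flatten(1)          # global average pooling
          for p in (fc1, fc2, fc3)]                       # each part -> 2 x 512
local_desc = torch.cat(pooled, dim=1)                     # concatenated feature map, 2 x 1536
global_desc = F.adaptive_avg_pool2d(fc, 1).flatten(1)     # pooled global branch, 2 x 512
final_desc = torch.cat([local_desc, global_desc], dim=1)  # final descriptor, 2 x 2048
```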
Fig. 8 illustrates a flow chart of an example of a process of identifying a target object according to an embodiment of the present application.
As shown in fig. 8, on the basis of the embodiment shown in fig. 1, the step S150 further includes the following steps.
Step S1501, determining a target feature map corresponding to the target object and a reference feature map corresponding to a reference object in the final feature map. That is, in the actual recognition process, the reference image and all other images to be recognized, for example, the reference pedestrian image and all images in the pedestrian library, can be used as input, and the feature map of the image is extracted after the neural network processing as described above.
Step S1502, calculating a distance between the target feature map and the reference feature map. Here, the distance between the target feature map and the reference feature map may be a euclidean distance, a cosine distance, or the like.
And step S1503, performing similarity measurement of the target feature map to the reference feature map based on the distance. That is, if the distance between the target feature map and the reference feature map is short, the similarity of the target object and the reference object is considered to be high, and if the distance between the target feature map and the reference feature map is long, the similarity of the target object and the reference object is considered to be low.
Step S1504, identifying a target object in the input image based on the similarity metric. That is, if the similarity of the object and the reference object is high, the object may be identified as being the same as the reference object, for example, the object is identified as a pedestrian that needs to be retrieved from a pedestrian photo library.
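A sketch of the retrieval logic in steps S1501 to S1504, assuming Euclidean distance between pooled descriptors; the variable names and the gallery size are illustrative only:

```python
import torch

query = torch.randn(1, 2048)                     # descriptor of the reference (target) object
gallery = torch.randn(100, 2048)                 # descriptors of the images to be searched

dists = torch.cdist(query, gallery).squeeze(0)   # Euclidean distance to every gallery entry
ranking = torch.argsort(dists)                   # smaller distance = higher similarity
best_match = ranking[0].item()                   # index of the most likely same object
```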
Fig. 9 illustrates a flow chart of a training process of a neural network according to an embodiment of the present application.
Here, the training process of the neural network according to the embodiment of the present application is used to train the neural network as described above. For the first example shown in fig. 2, the first neural network N1, the second neural network N2, and the third neural network N3 are trained. For the second example as shown in fig. 4, the first neural network N1, the second neural network N2, the third neural network N3, and the fourth neural network N4 are trained. And for the third example shown in fig. 5, the first, second, third, fourth, and fifth neural networks N1, N2, N3, N4, and N5 are trained.
A training process of the neural network will be described below with a third example as shown in fig. 5, and as shown in fig. 9, the training process of the neural network according to the embodiment of the present application includes the following steps.
Step S210, inputting training images into the first neural network, the second neural network, the third neural network, the fourth neural network and the fifth neural network to obtain a training feature map. Referring to fig. 5, a training image is input into the first neural network, a feature map of the training image is obtained through the second neural network, downsampling is performed through the third neural network, the fourth neural network and the fifth neural network, and a final training feature map is determined through normalization. Of course, those skilled in the art will understand that the final training feature map may also be the feature map after being subjected to the average segmentation or further combined with the global features as described above.
Step S220, normalizing the training feature map and inputting it into a classification layer whose weights are normalized. That is, this step includes three parts: the first part is to normalize the training feature map obtained in step S210, the second part is to normalize the weights of the classification layer used for image recognition, and the third part is to input the normalized feature map into the weight-normalized classification layer. Here, the classification layer may be, for example, a fully connected layer.
Step S230, calculating a first loss function value by a softmax function based on the output of the classification layer. That is, a first loss function value corresponding to an output of the classification layer is calculated by a softmax function.
Step S240, training the first, second, third, fourth, and fifth neural networks based on the first loss function value. Based on the first loss function values, the first to fifth neural networks N1 to N5 described above may be updated by, for example, back propagation.
In this way, by training the first to fifth neural networks in the manner as described above, it is possible to enable the neural networks to learn patterns in an image using features in an input image, thereby efficiently recognizing a target object in the input image.
Fig. 10 illustrates a flow chart of an example of a process of calculating a loss function according to an embodiment of the present application.
As shown in fig. 10, step S230 includes the following steps on the basis of the embodiment shown in fig. 9.
Step S2301, calculating a second loss function value by a softmax function for the output of the classification layer. That is, conventionally, the second loss function value is calculated using the softmax function directly on the output of the classification layer.
Here, the calculation method of the softmax classification loss function used in the conventional deep neural network training is as follows:
$$L_{softmax} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i+b_{y_i}}}{\sum_{j=1}^{C}e^{W_{j}^{T}x_i+b_j}}$$

wherein $x_i \in \mathbb{R}^{d}$ is the feature of the i-th image, $d$ is the feature dimension, $y_i$ is the class label of the i-th image, $W_j$ is the j-th column of the classification layer parameter matrix $W \in \mathbb{R}^{d \times C}$, and $b_j$ is the offset of class $j$ of the classification layer. In the embodiment of the present application, as described above, the finally determined feature $x_i$ and the columns $W_j$ of the classification layer parameter matrix are normalized so that $\|x_i\|_2 = 1$ and $\|W_j\|_2 = 1$, thereby causing

$$W_{j}^{T}x_i = \cos\theta_{i,j}$$

where $\theta_{i,j}$ is the angle between the feature $x_i$ and the class vector $W_j$. Thus, the resulting second loss function value is:

$$L_{norm} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{i,y_i}}}{\sum_{j=1}^{C}e^{s\cos\theta_{i,j}}}$$

In the embodiment of the present application, the scale factor $s$ is set to 15, for example.
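A sketch of this second loss function value, assuming a bias-free fully connected classification layer whose rows play the role of the class vectors $W_j$; the identity count and feature dimension below are placeholders:

```python
import torch
import torch.nn.functional as F

def norm_softmax_loss(features, weight, labels, s=15.0):
    """features: N x d, weight: C x d (rows are class vectors W_j), labels: N."""
    x = F.normalize(features, dim=1)              # ||x_i||_2 = 1
    w = F.normalize(weight, dim=1)                # ||W_j||_2 = 1
    logits = s * x @ w.t()                        # s * cos(theta_ij)
    return F.cross_entropy(logits, labels)

feat = torch.randn(8, 2048)
weight = torch.randn(751, 2048, requires_grad=True)   # e.g. 751 identities (placeholder)
labels = torch.randint(0, 751, (8,))
l_norm = norm_softmax_loss(feat, weight, labels)
```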
Step S2302, calculating a mutual exclusion regular loss function value for the parameter matrix of the classification layer. That is, for the classification layer parameter matrix $W$ described above, $\|W^{T}\|_{1,2}$ is calculated as the regular loss.
In step S2303, the sum of the second loss function value and the product of the mutually exclusive regular loss function value and a weighting coefficient is calculated as the first loss function value, that is, the first loss function value is calculated as $L = L_{norm} + \lambda\|W^{T}\|_{1,2}$. Here, $\lambda$ is an adjustable parameter, set, for example, to $1 \times 10^{-6}$ in the present embodiment.
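Continuing the sketch above, the regular loss and the first loss function value might be computed as follows; reading $\|W^{T}\|_{1,2}$ as the L2 norm of the per-feature-dimension L1 norms is our assumption about the norm convention:

```python
def exclusive_reg(weight):
    """weight: C x d; L1 over classes per feature dimension, then L2 over dimensions.
    Assumed interpretation of the ||W^T||_{1,2} regular loss."""
    return (weight.abs().sum(dim=0) ** 2).sum().sqrt()

lam = 1e-6                                            # weighting coefficient lambda
first_loss = norm_softmax_loss(feat, weight, labels) + lam * exclusive_reg(weight)
first_loss.backward()                                 # gradients for back propagation
```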
Here, mutual exclusion regularization is used to strengthen the independence, i.e., sparsity, between the column vectors of the matrix, so that the elements of a feature vector become decorrelated and each represents a specific attribute of the target object, which can improve the generalization ability of the classification layer.
therefore, through the calculation process of the loss function and the training process of the corresponding neural network, the generalization capability of the neural network for identifying the target object can be improved, and the application range is expanded.
Exemplary devices
Fig. 11 illustrates a block diagram of a first example of an object recognition apparatus according to an embodiment of the present application.
As shown in fig. 11, a first example of an object recognition apparatus 300 according to an embodiment of the present application includes: a feature map obtaining unit 301, configured to obtain a first feature map from an input image through a first neural network, and obtain a second feature map from the first feature map through a second neural network; a first downsampling unit 302, configured to downsample the second feature map obtained by the feature map obtaining unit 301 through a third neural network to obtain a first downsampled feature map; a first normalization unit 303, configured to perform scale normalization on the second feature map obtained by the feature map obtaining unit 301 and the first downsampled feature map obtained by the first downsampling unit 302 to obtain a first normalized feature map and a second normalized feature map; a first feature map determining unit 304, configured to determine a final feature map of the input image based on the first normalized feature map and the second normalized feature map obtained by the first normalizing unit 303; and a target recognition unit 305 configured to recognize a target object in the input image based on the final feature map determined by the first feature map determination unit 304.
Fig. 12 illustrates a block diagram of a first example of a first feature map determination unit according to an embodiment of the present application.
As shown in fig. 12, based on the embodiment shown in fig. 11, the first feature map determining unit 304 includes:
a first upsampling subunit 3041, configured to perform upsampling based on the second normalized feature map obtained by the first normalization unit 303 to obtain a first upsampled feature map; and a first combining subunit 3042, configured to combine the first normalized feature map obtained by the first normalization unit 303 and the first upsampled feature map obtained by the first upsampling subunit 3041 to determine the final feature map.
Fig. 13 illustrates a block diagram of a second example of an object recognition apparatus according to an embodiment of the present application.
As shown in fig. 13, on the basis of the embodiment shown in fig. 11, the second example of the object recognition apparatus 300 further includes, in addition to all the units shown in fig. 11: a second downsampling unit 306, configured to downsample the first downsampled feature map obtained by the first downsampling unit 302 through a fourth neural network to obtain a second downsampled feature map; a second normalization unit 307, configured to perform the scale normalization on the second downsampled feature map obtained by the second downsampling unit 306 to obtain a third normalized feature map; and a second feature map determining unit 308, configured to obtain the first upsampled feature map based on the second normalized feature map obtained by the first normalization unit 303 and the third normalized feature map obtained by the second normalization unit 307.
In one example, the second feature map determining unit 308 includes: a second upsampling subunit, configured to perform upsampling on the basis of the third normalized feature map obtained by the second normalizing unit 307 to obtain a second upsampled feature map; and a second combining subunit, configured to combine the second normalized feature map obtained by the first normalizing unit 303 with the second upsampled feature map obtained by the second upsampling subunit and perform upsampling to obtain the first upsampled feature map.
Fig. 14 illustrates a block diagram of a third example of an object recognition apparatus according to an embodiment of the present application.
As shown in fig. 14, on the basis of the embodiment shown in fig. 13, the third example of the object recognition apparatus 300 further includes, in addition to all the units shown in figs. 11 and 13: a third downsampling unit 309, configured to downsample the second downsampled feature map obtained by the second downsampling unit 306 through a fifth neural network to obtain a third downsampled feature map; a third normalization unit 310, configured to perform the scale normalization on the third downsampled feature map obtained by the third downsampling unit 309 to obtain a fourth normalized feature map; and a third feature map determining unit 311, configured to obtain the second upsampled feature map based on the third normalized feature map obtained by the second normalization unit 307 and the fourth normalized feature map obtained by the third normalization unit 310.
In one example, the third feature map determining unit 311 includes: a third upsampling subunit, configured to perform upsampling based on the fourth normalized feature map obtained by the third normalizing unit 310 to obtain a third upsampled feature map; and a third combining subunit, configured to combine the third normalized feature map obtained by the second normalizing unit 307 with the third upsampled feature map obtained by the third upsampling subunit and perform upsampling to obtain the second upsampled feature map.
Fig. 15 illustrates a block diagram of a second example of the first feature map determination unit according to an embodiment of the present application.
As shown in fig. 15, on the basis of the embodiment shown in fig. 12, the first combining subunit 3042 includes: a feature map combining module 30421, configured to combine the first normalized feature map obtained by the first normalization unit 303 and the first upsampled feature map obtained by the first upsampling subunit 3041 to obtain a combined feature map; a feature map segmentation module 30422, configured to perform average segmentation on the combined feature map obtained by the feature map combining module 30421 according to a preset direction, to obtain a preset number of partial feature maps; and a feature map determining module 30423, configured to determine a final feature map based on the preset number of partial feature maps obtained by the feature map segmentation module 30422.
In one example, the feature map determination module 30423 is to: performing global average pooling on the preset number of partial feature maps; and connecting the pooled partial feature maps in series to obtain the final feature map.
In one example, the feature map determination module 30423 is further to: concatenating the pooled partial feature maps of the preset number to obtain a concatenated feature map, and combining the concatenated feature map with the combined feature map to obtain the final feature map.
Fig. 16 illustrates a block diagram of an object recognition unit according to an embodiment of the present application.
As shown in fig. 16, on the basis of the embodiment shown in fig. 11, the object recognition unit 305 includes: a reference determination subunit 3051, configured to determine a target feature map corresponding to the target object and a reference feature map corresponding to a reference object in the final feature map determined by the first feature map determining unit 304; a distance calculation subunit 3052, configured to calculate a distance between the target feature map determined by the reference determination subunit 3051 and the reference feature map; a similarity measure subunit 3053, configured to perform a similarity measure of the target feature map with respect to the reference feature map based on the distance calculated by the distance calculation subunit 3052; and a target identifying subunit 3054, configured to identify a target object in the input image based on the similarity metric performed by the similarity metric subunit 3053.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the above-described object recognition apparatus 300 have been described in detail in the above description of the object recognition method with reference to fig. 1 to 10, and thus, a repetitive description thereof will be omitted.
As described above, the object recognition apparatus 300 according to the embodiment of the present application may be implemented in various terminal devices, such as a camera for security, or an in-vehicle automatic driving system, etc. In one example, the object recognition apparatus 300 according to the embodiment of the present application may be integrated into a terminal device as one software module and/or hardware module. For example, the object recognition apparatus 300 may be a software module in an operating system of the terminal device, or may be an application developed for the terminal device; of course, the object recognition apparatus 300 may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the object recognition apparatus 300 and the terminal device may be separate devices, and the object recognition apparatus 300 may be connected to the terminal device through a wired and/or wireless network and transmit the interactive information according to an agreed data format.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 17.
FIG. 17 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 17, the electronic device 10 includes one or more processors 11 and a memory 12.
The processor 11 may be a Central Processing Unit (CPU) or another form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the object identification methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as a feature map of an input image, a feature map after down-sampling or up-sampling, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 13 may include, for example, a keyboard, a mouse, and the like.
The output device 14 can output various information including the recognition result of the target object in the input image to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for the sake of simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 17, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the object recognition methods according to the various embodiments of the present application described in the "exemplary methods" section of this specification, above.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the object recognition methods according to various embodiments of the present application described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, devices, systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "as used herein mean, and are used interchangeably with, the word" and/or, "unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (12)

1. An object recognition method, comprising:
obtaining a first feature map from an input image through a first neural network, and obtaining a second feature map from the first feature map through a second neural network;
the second feature map is downsampled through a third neural network to obtain a first downsampling feature map;
performing scale normalization on the second feature map and the first downsampling feature map to obtain a first normalized feature map and a second normalized feature map;
determining a final feature map of the input image based on the first normalized feature map and the second normalized feature map; and
and identifying the target object in the input image based on the final feature map.
2. The object recognition method of claim 1, wherein determining a final feature map of the input image based on the first and second normalized feature maps comprises:
upsampling based on the second normalized feature map to obtain a first upsampled feature map; and
and combining the first normalized feature map and the first up-sampling feature map to determine the final feature map.
3. The object recognition method of claim 2, further comprising:
the first downsampling feature map is downsampled through a fourth neural network to obtain a second downsampling feature map;
performing the scale normalization on the second downsampled feature map to obtain a third normalized feature map; and
obtaining the first up-sampled feature map based on the second normalized feature map and the third normalized feature map.
4. The object recognition method of claim 3, wherein obtaining the first up-sampled feature map based on the second normalized feature map and the third normalized feature map comprises:
upsampling based on the third normalized feature map to obtain a second upsampled feature map; and
and combining the second normalized feature map and the second up-sampling feature map and performing up-sampling to obtain the first up-sampling feature map.
5. The object recognition method of claim 4, further comprising:
the second downsampling feature map is downsampled through a fifth neural network to obtain a third downsampling feature map;
performing the scale normalization on the third down-sampled feature map to obtain a fourth normalized feature map; and
obtaining the second up-sampled feature map based on the third normalized feature map and the fourth normalized feature map.
6. The object recognition method of claim 5, wherein obtaining the second upsampled feature map based on the third normalized feature map and the fourth normalized feature map comprises:
upsampling based on the fourth normalized feature map to obtain a third upsampled feature map; and
combining the third normalized feature map and the third upsampled feature map and upsampling to obtain the second upsampled feature map.
7. The object recognition method of claim 2, wherein combining the first normalized feature map and the first upsampled feature map to determine the final feature map comprises:
combining the first normalized feature map and the first upsampled feature map to obtain a combined feature map;
evenly segmenting the combined feature map along a preset direction to obtain a preset number of partial feature maps; and
determining the final feature map based on the preset number of partial feature maps.
8. The object recognition method of claim 7, wherein determining the final feature map based on the preset number of partial feature maps comprises:
performing global average pooling on the preset number of partial feature maps; and
concatenating the pooled partial feature maps to obtain the final feature map.
9. The object recognition method of claim 8, wherein concatenating the pooled partial feature maps to obtain the final feature map comprises:
concatenating the pooled partial feature maps to obtain a concatenated feature map; and
combining the concatenated feature map with the combined feature map to obtain the final feature map.
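As an illustrative sketch of claims 7 to 9 only, the snippet below splits the combined feature map into equal stripes, global-average-pools each stripe, and concatenates the pooled parts. The choices of the height axis as the "preset direction", six parts, and reducing the combined map to a globally pooled vector before the final combination are all assumptions.

    import torch
    import torch.nn.functional as F

    def part_pooled_final_feature(combined, num_parts=6):
        # combined: (N, C, H, W) combined feature map from claim 7
        # evenly segment along the height axis (assumed "preset direction")
        parts = torch.chunk(combined, num_parts, dim=2)
        # global average pooling of each partial feature map
        pooled = [F.adaptive_avg_pool2d(p, 1).flatten(1) for p in parts]
        concat = torch.cat(pooled, dim=1)              # concatenated feature, (N, C * num_parts)
        # claim 9: combine the concatenated feature with the combined map;
        # here the map is reduced to a global vector before combination (an assumption)
        global_vec = F.adaptive_avg_pool2d(combined, 1).flatten(1)
        return torch.cat([concat, global_vec], dim=1)  # final descriptor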
10. The object recognition method of claim 1, wherein identifying the target object in the input image based on the final feature map comprises:
determining a target feature map corresponding to the target object and a reference feature map corresponding to a reference object in the final feature map;
calculating the distance between the target feature map and the reference feature map;
determining a similarity measure between the target feature map and the reference feature map based on the distance; and
identifying the target object in the input image based on the similarity measure.
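For illustration of claim 10 only: one common way to turn a feature distance into a similarity measure is to L2-normalize both descriptors, take the Euclidean distance, and map it into [0, 1]; the threshold value and this specific mapping are assumptions, not the claimed method.

    import torch
    import torch.nn.functional as F

    def identify(target_feat, reference_feat, threshold=0.5):
        # L2-normalize both descriptors so the Euclidean distance lies in [0, 2]
        t = F.normalize(target_feat, dim=-1)
        r = F.normalize(reference_feat, dim=-1)
        distance = torch.norm(t - r, dim=-1)    # distance between target and reference features
        similarity = 1.0 - distance / 2.0       # similarity measure derived from the distance
        return similarity > threshold           # identify the target when similar enough to the reference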
11. An object recognition apparatus comprising:
a feature map obtaining unit, configured to obtain a first feature map from an input image through a first neural network and obtain a second feature map from the first feature map through a second neural network;
a first downsampling unit, configured to downsample the second feature map obtained by the feature map obtaining unit through a third neural network to obtain a first downsampled feature map;
a first normalization unit, configured to perform scale normalization on the second feature map obtained by the feature map obtaining unit and the first downsampled feature map obtained by the first downsampling unit to obtain a first normalized feature map and a second normalized feature map;
a feature map determination unit configured to determine a final feature map of the input image based on the first normalized feature map and the second normalized feature map obtained by the first normalization unit; and
a target identification unit, configured to identify a target object in the input image based on the final feature map determined by the feature map determination unit.
12. An electronic device, comprising:
a processor; and
a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the object recognition method of any one of claims 1-10.
CN201910447858.5A 2019-01-28 2019-05-27 Object identification method, object identification device and electronic equipment Pending CN111488777A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910077562 2019-01-28
CN2019100775629 2019-01-28

Publications (1)

Publication Number Publication Date
CN111488777A true CN111488777A (en) 2020-08-04

Family

ID=71794159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910447858.5A Pending CN111488777A (en) 2019-01-28 2019-05-27 Object identification method, object identification device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111488777A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800995A (en) * 2021-02-04 2021-05-14 广州甄好数码科技有限公司 Intelligent particle size detection method using multi-scale feature weighting

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145908A (en) * 2017-05-08 2017-09-08 Jiangnan University A small-target detection method based on R-FCN
CN108447062A (en) * 2018-02-01 2018-08-24 Zhejiang University A method for segmenting atypical cells in pathological sections based on a multi-scale hybrid segmentation model
CN108764063A (en) * 2018-05-07 2018-11-06 Huazhong University of Science and Technology A feature-pyramid-based system and method for identifying time-critical targets in remote sensing images
CN108764164A (en) * 2018-05-30 2018-11-06 Huazhong University of Science and Technology A face detection method and system based on deformable convolutional networks
CN109086672A (en) * 2018-07-05 2018-12-25 Xiangyang Juzi Intelligent Technology Co., Ltd. A pedestrian re-identification method based on reinforcement learning with adaptive partitioning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KUN YUAN et al.: "SafeNet: Scale-normalization and Anchor-based Feature Extraction Network for Person Re-identification", Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence *
YIFAN SUN et al.: "Beyond Part Models: Person Retrieval with Refined Part Pooling", Proceedings of the European Conference on Computer Vision (ECCV) *
LUO Hao et al.: "Research Progress of Person Re-identification Based on Deep Learning", Acta Automatica Sinica *

Similar Documents

Publication Publication Date Title
CN109271878B (en) Image recognition method, image recognition device and electronic equipment
TWI684922B (en) Image-based vehicle damage determination method, device, electronic equipment and system
Dong et al. A non-target structural displacement measurement method using advanced feature matching strategy
CN111104867B (en) Recognition model training and vehicle re-recognition method and device based on part segmentation
WO2019001481A1 (en) Vehicle appearance feature identification and vehicle search method and apparatus, storage medium, and electronic device
KR20170122836A (en) Discovery of merchants from images
US20120301014A1 (en) Learning to rank local interest points
US20130251246A1 (en) Method and a device for training a pose classifier and an object classifier, a method and a device for object detection
WO2021217924A1 (en) Method and apparatus for identifying vehicle type at traffic checkpoint, and device and storage medium
JP5936561B2 (en) Object classification based on appearance and context in images
CN111127516A (en) Target detection and tracking method and system without search box
CN113343981A (en) Visual feature enhanced character recognition method, device and equipment
CN110717458A (en) Face recognition method and recognition device
CN115410100A (en) Small target detection method and system based on unmanned aerial vehicle image
Yang et al. Increaco: incrementally learned automatic check-out with photorealistic exemplar augmentation
CN111914841B (en) CT image processing method and device
CN111488777A (en) Object identification method, object identification device and electronic equipment
US9262443B2 (en) Classifying materials using texture
CN112241963A (en) Lane line identification method and system based on vehicle-mounted video and electronic equipment
Taş et al. Camera-based wildfire smoke detection for foggy environments
CN111027434A (en) Training method and device for pedestrian recognition model and electronic equipment
CN112241967B (en) Target tracking method, device, medium and equipment
Holliday et al. Scale-invariant localization using quasi-semantic object landmarks
WO2020054058A1 (en) Identification system, parameter value update method, and program
CN111325194A (en) Character recognition method, device and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200804)