CN108229492B - Method, device and system for extracting features - Google Patents

Info

Publication number
CN108229492B
Authority
CN
China
Prior art keywords
feature
features
region
image
regions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710195256.6A
Other languages
Chinese (zh)
Other versions
CN108229492A (en)
Inventor
伊帅
赵海宇
田茂清
闫俊杰
王晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201710195256.6A priority Critical patent/CN108229492B/en
Publication of CN108229492A publication Critical patent/CN108229492A/en
Application granted granted Critical
Publication of CN108229492B publication Critical patent/CN108229492B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application discloses a method, a device and a system for extracting features, wherein the method for extracting features comprises the following steps: generating initial image features corresponding to an object in an image; generating a plurality of region features corresponding respectively to a plurality of regions of the image; and fusing the initial image features and the plurality of region features to obtain target image features of the object. According to this technical scheme for extracting features, in the process of obtaining the features, feature extraction is performed not only on the whole image containing the object but also on the multiple regions within the image, so that the detail features within those regions are at least partially retained and the finally obtained object features are more discriminative.

Description

Method, device and system for extracting features
Technical Field
The application relates to the field of computer vision and image processing, in particular to a method, a device and a system for extracting features.
Background
With the development of computer vision technology and the increase of the amount of image information, image recognition technology is applied in more and more fields, such as pedestrian retrieval, video monitoring, video classification, and the like, and in the image recognition technology, feature extraction is of great importance.
In conventional image recognition technology, features are generally extracted from the picture as a whole to describe it, so the resulting overall features describe every object and the background in the picture at the same level of detail. Such an overall feature can be used for picture recognition, for example by comparing it with the feature of a target classification to determine whether the object in the picture belongs to that classification.
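As an illustration of this conventional whole-image comparison, the following minimal sketch (an assumption for illustration only, not taken from the patent; PyTorch is used merely for convenience) compares an overall feature vector against a target-class feature using cosine similarity:

    import torch
    import torch.nn.functional as F

    def matches_target(overall_feature: torch.Tensor,
                       target_feature: torch.Tensor,
                       threshold: float = 0.8) -> bool:
        # Compare a whole-image feature against a target-class feature by
        # cosine similarity; the threshold is purely illustrative.
        sim = F.cosine_similarity(overall_feature.flatten(),
                                  target_feature.flatten(), dim=0)
        return sim.item() >= threshold

    # Example with two random 256-dimensional feature vectors.
    print(matches_target(torch.randn(256), torch.randn(256)))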
Disclosure of Invention
The embodiment of the application provides a technical scheme for extracting features.
One aspect of the embodiments of the present application discloses a method for extracting features, the method comprising: generating initial image features corresponding to objects in an image; generating a plurality of region features corresponding to a plurality of regions of the image, respectively; fusing the plurality of region features into a fused feature, wherein in a first overlapping region of the plurality of region features, a feature with the highest identifiability is selected as a feature corresponding to the first overlapping region, and the fused feature is generated based on the feature corresponding to the first overlapping region; and fusing the fused feature with the initial image feature as a target image feature of the object.
According to this technical scheme for extracting features, in the process of obtaining the object features, feature extraction is performed not only on the whole image containing the object but also on the multiple regions within the image, so that the detail features within those regions are at least partially retained and the finally obtained object features are more discriminative.
Another aspect of the embodiments of the present application discloses an apparatus for extracting object features, the apparatus comprising: an image feature generation module that generates an initial image feature corresponding to an object in an image; a region feature generation module that generates a plurality of region features corresponding to a plurality of regions of the image, respectively; and a fusion module for fusing the initial image feature and the plurality of region features to obtain a target image feature of the object, the fusion module comprising: the first fusion submodule is used for fusing the plurality of region features into fusion features, wherein in a first overlapping region in the plurality of region features, the feature with the highest identifiability is selected as the feature corresponding to the first overlapping region, and the fusion features are generated based on the feature corresponding to the first overlapping region; and a second fusion submodule for fusing the fusion feature and the initial image feature into a target image feature of the object.
Another aspect of the embodiments of the present application further discloses a system for extracting object features, where the system includes: a memory storing executable instructions; one or more processors in communication with the memory to execute the executable instructions to: generating initial image features corresponding to objects in an image; generating a plurality of region features corresponding to a plurality of regions of the image, respectively; fusing the plurality of region features into a fused feature, wherein in a first overlapping region of the plurality of region features, a feature with the highest identifiability is selected as a feature corresponding to the first overlapping region, and the fused feature is generated based on the feature corresponding to the first overlapping region; and fusing the fused feature with the initial image feature as a target image feature of the object.
Yet another aspect of an embodiment of the present application discloses a non-transitory computer storage medium storing computer-readable instructions that, when executed, cause a processor to: generating initial image features corresponding to objects in an image; generating a plurality of region features corresponding to a plurality of regions of the image, respectively; and fusing the initial image features and the plurality of region features to obtain target image features of the object.
Drawings
In the following, exemplary and non-limiting embodiments of the present application are described with reference to the accompanying drawings. The figures are merely illustrative and generally do not represent exact proportions. The same or similar elements in different drawings are denoted by the same reference numerals.
FIG. 1 is a flow diagram illustrating a method 1000 of extracting features according to an embodiment of the present application;
FIG. 2 is a schematic diagram showing the joint points of a pedestrian, used to illustrate a method according to an embodiment of the present application;
FIG. 3 is a schematic diagram showing a feature map extraction process for explaining a method according to an embodiment of the present application;
FIG. 4 is a schematic diagram showing a feature map extraction process for explaining a method according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a fusion process according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a fusion process according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating an apparatus 700 for extracting features according to an embodiment of the present application; and
FIG. 8 is a schematic diagram of a computer system 800 suitable for implementing embodiments of the present application.
Detailed Description
Hereinafter, embodiments of the present application will be described in detail with reference to the detailed description and the accompanying drawings.
Fig. 1 is a flow chart illustrating a method 1000 of extracting features according to an embodiment of the present application. As shown in fig. 1, the method 1000 includes: step S1100, generating initial image features corresponding to objects in the image; step S1200 of generating a plurality of region features corresponding to a plurality of regions of the image, respectively; and step S1300, fusing the initial image feature and the plurality of region features to obtain the target image feature of the object.
In step S1100, the operation of generating initial image features corresponding to objects in the image may be implemented by a Convolutional Neural Network (CNN), which may include a plurality of perception modules. When an image including an object is input to the CNN, the plurality of perception modules perform convolution and pooling on the input image to obtain an overall feature of the image, where the overall feature is the initial image feature. Taking the case where the object is a pedestrian as an example, the initial image feature may be a feature that reflects the pedestrian as a whole. It should be noted that, in the embodiment of the present invention, the object in the image is not limited, and any other object, such as a vehicle or a human face, may also be used.
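A minimal sketch of such a CNN is given below; the "perception module" structure (one convolution followed by pooling) and all layer sizes are assumptions, since the patent does not specify them:

    import torch
    import torch.nn as nn

    class PerceptionModule(nn.Module):
        # One convolution + pooling block; the internal structure of the
        # patent's perception modules is assumed, not specified.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.pool = nn.MaxPool2d(2)

        def forward(self, x):
            return self.pool(torch.relu(self.conv(x)))

    # Several stacked perception modules convolve and pool the input image
    # into the whole-image (initial) feature.
    backbone = nn.Sequential(PerceptionModule(3, 32),
                             PerceptionModule(32, 64),
                             PerceptionModule(64, 128))
    image = torch.randn(1, 3, 192, 96)        # dummy pedestrian image, NCHW
    initial_image_feature = backbone(image)   # shape (1, 128, 24, 12)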
In step S1200, a plurality of region features corresponding respectively to a plurality of regions of the image may be generated. In this step S1200, region features for different regions in the image may be obtained: for example, the region features corresponding to different regions may be obtained by performing operations such as convolution and pooling on different regions of the image through a CNN, or feature extraction may be performed on the entire image first and the region features then taken from different regions of the overall image feature (e.g., the initial image feature); the present application is not limited in this respect. The plurality of regions may be extracted based on the structure of the object. As shown in fig. 2, in the case where the object is a pedestrian, a plurality of regions may be extracted in accordance with the human body structure; for example, three regions may be extracted, namely a region 201 including the head and shoulders of the pedestrian (hereinafter also referred to as the head-shoulder), a region 202 including the upper body of the pedestrian, and a region 203 including the lower body of the pedestrian, wherein the three regions 201-203 may partially overlap with each other.
In some embodiments, the plurality of regions of the image may be obtained by dividing the image into the plurality of regions according to extraction reference points. An extraction reference point may be a connection point of a main part of the object structure. When the object is a pedestrian, the extraction reference points may be human body localization points of the pedestrian, for example the joints of the human body. As shown in fig. 2, fourteen joint points 1 to 14 of the pedestrian may be selected as extraction reference points, and a plurality of regions may then be extracted based on the coordinates of the fourteen joint points 1 to 14. For example, the joint points included in the head-shoulder are joint points 1 to 4, whose coordinates are, for example, (5,5), (5,2), (9,1), and (1,1), respectively. The smallest vertical coordinate among joint points 1 to 4 (i.e., 1, from (9,1) and (1,1)) and the largest vertical coordinate (i.e., 5, from (5,5)) may be taken as the minimum and maximum vertical coordinates of the region 201 including the head-shoulder, and the smallest abscissa (i.e., 1, from (1,1)) and the largest abscissa (i.e., 9, from (9,1)) may be taken as the minimum and maximum abscissas of the region 201, so that the finally obtained region 201 including the head-shoulder is a rectangle with the coordinates (1,1), (9,1), (9,5), and (1,5) as its four corners; the regions 202 and 203 may be obtained in a similar manner. In the embodiment of the present invention, as shown in fig. 2, joint point 1 may be the vertex (head-top) joint point, joint point 2 the neck joint point, joint point 3 the left shoulder joint point, joint point 4 the right shoulder joint point, joint point 5 the left elbow joint point, joint point 6 the left wrist joint point, joint point 7 the right elbow joint point, joint point 8 the right wrist joint point, joint point 9 the left hip joint point, joint point 10 the right hip joint point, joint point 11 the left knee joint point, joint point 12 the left foot joint point, joint point 13 the right knee joint point, and joint point 14 the right foot joint point. The plurality of joint points can be selected manually or extracted by a convolutional neural network pre-trained with back-propagation: for example, when an image is input into the convolutional neural network, a convolutional layer in the network convolves the image so that the features of the pedestrian's joint areas are extracted into a feature map while the features of areas that are not joints of the pedestrian are removed or zeroed, and the network then outputs a feature map representing the positions of the joint points of the pedestrian. An illustrative computation of such a region from joint coordinates is sketched below.
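The following sketch reproduces the bounding-box computation described above for the head-shoulder region; the function name is hypothetical:

    def region_from_joints(joints):
        # Smallest axis-aligned rectangle (x_min, y_min, x_max, y_max)
        # enclosing the given joint coordinates.
        xs = [x for x, _ in joints]
        ys = [y for _, y in joints]
        return min(xs), min(ys), max(xs), max(ys)

    # Joint points 1-4 of the head-shoulder: (5,5), (5,2), (9,1), (1,1).
    print(region_from_joints([(5, 5), (5, 2), (9, 1), (1, 1)]))
    # (1, 1, 9, 5), i.e. the rectangle with corners (1,1), (9,1), (9,5), (1,5)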
For example, as shown in fig. 3, the initial image feature 310 may be generated from the image 300 using a convolutional neural network, and the region features respectively corresponding to the plurality of regions, i.e., region features 321-323, may then be generated from the initial image feature 310, where the region features 321-323 may be feature maps or feature vectors. This step may be performed by region-of-interest (ROI) pooling using a convolutional neural network: for example, the initial image feature 310 is a 96×96 feature map, which may be input into the convolutional neural network; the ROI pooling layer in the network then pools the portions of the initial image feature corresponding to the plurality of regions separately, for example into 24×24 features, which are the region features 321-323 respectively corresponding to the plurality of regions. It should be noted that the feature sizes mentioned above are merely exemplary. Alternatively, the region features 321-323 may also be generated directly from the corresponding regions of the image using a convolutional neural network, as described above, in which case they may contain more detailed features; the present application is not limited in this respect. An illustrative sketch of such region-of-interest pooling is given below.
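The sketch below uses torchvision's roi_pool for this step; the box coordinates and the 96×96 / 24×24 sizes are illustrative assumptions rather than values fixed by the patent:

    import torch
    from torchvision.ops import roi_pool

    # Assumed 96x96 initial image feature with 64 channels.
    initial_image_feature = torch.randn(1, 64, 96, 96)
    # Boxes as (batch_index, x1, y1, x2, y2) in feature-map coordinates,
    # roughly covering head-shoulder, upper body and lower body.
    boxes = torch.tensor([[0.,  0.,  0., 95., 40.],
                          [0.,  0., 25., 95., 70.],
                          [0.,  0., 55., 95., 95.]])
    # Pool each region to a fixed 24x24 grid.
    region_features = roi_pool(initial_image_feature, boxes, output_size=(24, 24))
    print(region_features.shape)   # torch.Size([3, 64, 24, 24])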
In another embodiment, step S1100 may include: performing convolution and pooling on the image to generate intermediate image features; and generating initial image features corresponding to objects in the image based on the intermediate image features. The operation of convolving and pooling the image to generate the intermediate image features may be implemented using a CNN that includes a plurality of perception modules. Specifically, an image of the object may be input into the CNN, and the plurality of perception modules then perform convolution and pooling on the input image to obtain an overall feature of the image, which is the intermediate image feature. The intermediate image feature may be a global feature of the image and may be a feature map or a feature vector. After obtaining the intermediate image feature, the CNN may be used to perform convolution and pooling on the intermediate image feature to generate the initial image feature corresponding to the object in the image; specifically, the intermediate image feature may be input into the CNN, and the plurality of perception modules then perform convolution and pooling on it to obtain its overall feature, which is the initial image feature. Since the initial image feature is obtained by convolving and pooling the intermediate image feature through the convolutional neural network, the initial image feature may contain finer features than the intermediate image feature.
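The two-stage extraction described in this embodiment might look as follows; the channel counts and spatial sizes are assumptions, chosen only to match the 24×24 and 12×12 examples given later:

    import torch
    import torch.nn as nn

    # Stage 1 produces the intermediate image feature; stage 2 convolves and
    # pools it further into the (finer) initial image feature.
    stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                           nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
    stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

    image = torch.randn(1, 3, 96, 96)
    intermediate_image_feature = stage1(image)                   # (1, 64, 24, 24)
    initial_image_feature = stage2(intermediate_image_feature)   # (1, 128, 12, 12)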
In this embodiment, step S1200 may include: extracting a plurality of regions from an image; pooling the intermediate image features based on the plurality of regions to generate a plurality of intermediate region features corresponding to the plurality of regions, respectively; and generating a plurality of region features respectively corresponding to the plurality of regions based on the plurality of intermediate region features.
For the step of extracting a plurality of regions from the image, the method described above with reference to fig. 2 may be used. Then, the intermediate image features may be pooled based on the plurality of regions to generate a plurality of intermediate region features respectively corresponding to the plurality of regions, and a plurality of region features respectively corresponding to the plurality of regions may be generated based on the plurality of intermediate region features. As shown in fig. 4, the pooling method described with reference to fig. 3 may be used to obtain a plurality of intermediate region features 331-333 from the intermediate image feature, and the intermediate region features 331-333 may then be convolved and pooled to generate the region features 321-323 described above. Since the region features 321-323 are generated by convolution and pooling on the basis of the intermediate region features 331-333, the region features 321-323 have finer or more discriminative features than the intermediate region features 331-333.
After obtaining the initial image feature and the plurality of region features respectively corresponding to the plurality of regions, in S1300 the initial image feature and the plurality of region features may be fused to obtain a target image feature of the object. For example, in the case where the object is a pedestrian, the region feature corresponding to the region including the head-shoulder and the head-shoulder portion of the initial image feature may be fused into the head-shoulder portion of the target image feature, the region feature corresponding to the region including the upper body and the upper-body portion of the initial image feature may be fused into the upper-body portion of the target image feature, and the region feature corresponding to the region including the lower body and the lower-body portion of the initial image feature may be fused into the lower-body portion of the target image feature. Since the region features at least partially contain finer or more discriminative features, obtaining the target image feature by fusing the region features and the initial image feature can effectively improve accuracy in applications such as image recognition. In the fusion process, a feature in a region feature may be used directly as the feature of the corresponding region in the target image feature, or only part of the region feature may be used while the remaining part of the corresponding region in the target image feature is taken from the initial image feature; for example, the head feature in the head-shoulder region feature may be used as the head feature in the target image feature, while the shoulder feature in the target image feature is taken from the initial image feature.
In one embodiment, fusing the initial image feature and the plurality of region features to obtain a target image feature of the object includes: fusing the plurality of region features into a fused feature, and fusing the fused feature with the initial image feature into a target image feature of the object. For example, as shown in fig. 5, in the case where the object is a pedestrian, a region feature 321 corresponding to a region including the head and shoulders, a region feature 322 corresponding to a region including the upper body, and a region feature 323 corresponding to a region including the lower body may be fused first into a fused feature 400, and then the fused feature 400 and the initial image feature 310 may be fused into a target image feature 500 of the object.
In one embodiment, fusing the plurality of region features into a fused feature comprises: selecting, in a first overlapping region of the plurality of region features, the feature with the highest identifiability as the feature corresponding to the first overlapping region; and generating the fused feature based on the feature corresponding to the first overlapping region. The plurality of region features may partially overlap each other, and an overlapped region may be referred to as a first overlapping region. For example, as shown in fig. 6, the region 201 including the head-shoulder of a pedestrian partially overlaps the region 202 including the upper body of the pedestrian, and the region 202 partially overlaps the region 203 including the lower body of the pedestrian; the region features respectively corresponding to the regions 201-203 are schematically represented in fig. 6 as 321-323, where the features are shown in numerical form and a feature with a larger value has higher identifiability, i.e., is more distinguishable from other features. In the process of fusing the region features 321-323, the features of two partially overlapping region features may be compared within their overlapping region (i.e., the first overlapping region), and the feature with the larger value is taken as the feature of that region. Taking the region features 321 and 322 as an example, the features inside the dashed frame of region feature 321 are clearly larger than those inside the dashed frame of region feature 322, so the features inside the dashed frame of region feature 321 are used as the features of that region in the fused feature, as shown in the dashed frame of the fused feature 400 in fig. 6. Using this method, the fused feature 400 shown in fig. 6 can be obtained. Because features with higher identifiability are retained and features with lower identifiability are discarded during fusion, this competition strategy makes the finally obtained target image feature of the object contain the more identifiable features. The fusion can be performed by a fusion unit in the convolutional neural network: the plurality of region features to be fused (or the fused feature and the initial image feature) are input into the fusion unit, which outputs, within the overlapping region of the input features, the larger feature as the feature of that region; the convolutional neural network may further include an inner product (fully connected) layer for converting the output fusion layer into a feature map usable for subsequent fusion. An illustrative sketch of this competition fusion is given below.
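In the sketch below, each region feature is pasted into a shared canvas at its region location, and wherever regions overlap the element-wise maximum is kept as the more identifiable feature. The layout, the use of nearest-neighbour resizing, and all sizes are assumptions made only to illustrate the competition rule:

    import torch
    import torch.nn.functional as F

    def fuse_region_features(region_features, boxes, canvas_hw):
        # Paste each (C, h, w) region feature into a (C, H, W) canvas at its
        # box location, keeping the element-wise maximum where boxes overlap.
        C = region_features[0].shape[0]
        H, W = canvas_hw
        fused = torch.full((C, H, W), float('-inf'))
        for feat, (x1, y1, x2, y2) in zip(region_features, boxes):
            resized = F.interpolate(feat.unsqueeze(0), size=(y2 - y1, x2 - x1),
                                    mode='nearest').squeeze(0)
            fused[:, y1:y2, x1:x2] = torch.maximum(fused[:, y1:y2, x1:x2], resized)
        # Positions not covered by any region stay zero.
        return torch.where(torch.isinf(fused), torch.zeros_like(fused), fused)

    # Three region features (cf. 321-323) and their boxes on a 12x12 canvas.
    feats = [torch.randn(64, 4, 12) for _ in range(3)]
    boxes = [(0, 0, 12, 5), (0, 3, 12, 9), (0, 7, 12, 12)]
    fused_feature = fuse_region_features(feats, boxes, (12, 12))   # (64, 12, 12)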
In one embodiment, fusing the fused feature with the initial image feature into a target image feature of the object comprises: selecting, in a second overlapping region in which the fused feature and the initial image feature overlap each other, the feature with the highest identifiability as the feature corresponding to the second overlapping region; and generating the target image feature of the object based on the feature corresponding to the second overlapping region. Since the fused feature is obtained from the region features, it overlaps the initial image feature, and the overlapped region may be referred to as a second overlapping region. The fused feature and the initial image feature may therefore be fused in the same way as described with reference to fig. 6: in the second overlapping region, the feature with the larger value (i.e., the higher identifiability) between the fused feature and the initial image feature is taken as the feature at that region in the resulting target image feature, and the target image feature obtained by this fusion may be used as the feature of the object.
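Continuing the sketch above, when the fused feature and the initial image feature share the same spatial extent this second fusion reduces to an element-wise maximum (again an assumption about the concrete layout):

    # Second fusion: keep, at every position, the more identifiable of the
    # fused region feature and the initial image feature.
    initial_image_feature = torch.randn(64, 12, 12)
    target_image_feature = torch.maximum(fused_feature, initial_image_feature)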
Compared with the conventional method, the method can obtain target image features that are finer and more identifiable: in the process of obtaining the target image features, features are extracted not only from the whole image containing the object but also from a plurality of regions within the image, so that the detail features in those regions are at least partially retained; in addition, the method of the embodiment of the present application adopts a competition strategy during fusion, i.e., features with high identifiability are retained and features with low identifiability are discarded, so the finally obtained target image features are more discriminative. This is advantageous for applications such as object recognition and picture retrieval. For example, in pedestrian retrieval, if the pedestrians in two pictures both wear a white shirt and black pants, a conventional feature extraction method that extracts features from the whole image is likely to ignore the slight differences between the two pedestrians. In the method according to the embodiment of the present application, because features are extracted from a plurality of regions of the picture, the detail features within those regions can be captured; for example, in the region including the pedestrian's face, the detail features of the face can be extracted, and the highly identifiable facial features are retained during fusion, so that the two pedestrians can be distinguished by the differences in their facial details and a correct result is obtained.
It should be noted that, in the present application, features, such as intermediate image features, initial image features, intermediate region features, fusion features, and the like, may be expressed in the form of feature maps and feature vectors. In addition, although the description is made using an example in which the object is a pedestrian in the present application, the present application is not limited thereto, and for example, the object may also be a human face, a vehicle, or the like.
A specific application of the method of extracting features according to the embodiment of the present application to a pedestrian will be described below.
When the object is a pedestrian, an intermediate image feature (e.g., 24 × 24 in size) may be generated by performing a convolution and pooling operation on an image of the pedestrian through the CNN in step S1100, and then an initial image feature (e.g., 12 × 12 in size) may be generated based on the intermediate image feature, for example, by performing a convolution and pooling operation on the intermediate image feature through the CNN. Next, in step S1200, a plurality of regions are extracted from the image of the pedestrian, for example, regions corresponding to the head-shoulder, upper-body, and lower-body of the pedestrian, and then, based on the plurality of regions corresponding to the head-shoulder, upper-body, and lower-body, the intermediate image features are pooled by using the CNN, and a plurality of intermediate region features (for example, 12 × 12 in size) corresponding to the head-shoulder, upper-body, and lower-body are generated. Then, region features corresponding to the head-shoulder, the upper-body, and the lower-body are generated based on the plurality of intermediate region features corresponding to the head-shoulder, the upper-body, and the lower-body, respectively, and specifically, the plurality of intermediate region features corresponding to the head-shoulder, the upper-body, and the lower-body may be convolved and pooled by the CNN to generate region features corresponding to the head-shoulder, the upper-body, and the lower-body. Then, in step S1300, the region feature corresponding to the upper body, the region feature corresponding to the lower body, and the region feature corresponding to the head and shoulder are fused into a fusion feature, and the fusion feature and the initial image feature are fused into a target image feature of the pedestrian.
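The paragraph above can be tied together into one end-to-end sketch; every module, box, and size below is an assumption made only to mirror the 24×24 and 12×12 examples, and the spatial placement of the fusion is simplified to element-wise maxima:

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_pool

    stem = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4))
    head = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
    region_conv = nn.Conv2d(64, 64, 3, padding=1)

    image = torch.randn(1, 3, 96, 96)
    intermediate = stem(image)      # S1100: intermediate image feature, 24x24
    initial = head(intermediate)    # S1100: initial image feature, 12x12

    # S1200: assumed head-shoulder / upper-body / lower-body boxes on the
    # 24x24 intermediate feature, pooled to 12x12 intermediate region features.
    boxes = torch.tensor([[0., 0.,  0., 23., 10.],
                          [0., 0.,  6., 23., 17.],
                          [0., 0., 14., 23., 23.]])
    mid_regions = roi_pool(intermediate, boxes, output_size=(12, 12))
    regions = torch.relu(region_conv(mid_regions))   # region features

    # S1300: competition fusion, simplified to element-wise maxima.
    fused = regions.max(dim=0, keepdim=True).values
    target_image_feature = torch.maximum(fused, initial)   # (1, 64, 12, 12)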
In another embodiment, the method for extracting features of a pedestrian may further include extracting and fusing features of the limbs of the pedestrian. Specifically, in step S1100, the image of the pedestrian may be convolved and pooled by the CNN to generate an intermediate image feature (e.g., 24 × 24 in size), which may then be convolved and pooled by the CNN to generate a convolved intermediate image feature (e.g., 12 × 12 in size); the convolved intermediate image feature may in turn be convolved and pooled to generate the initial image feature (e.g., 6 × 6 in size). Then, in step S1200, regions corresponding to the head-shoulder, the upper body, and the lower body of the pedestrian are extracted from the image of the pedestrian, and the intermediate image feature is pooled by the CNN based on these regions to generate a plurality of intermediate region features (e.g., 12 × 12 in size) corresponding to the head-shoulder, the upper body, and the lower body, respectively. This step may further include extracting from the image a plurality of regions corresponding respectively to the left arm, right arm, left leg, and right leg of the pedestrian, e.g., regions 204-207 in fig. 2; the region extraction may use the same method as described with reference to fig. 2. The convolved intermediate image feature is then pooled, for example by the CNN, based on the regions corresponding to the left arm, the right arm, the left leg, and the right leg, to generate a plurality of intermediate region features (e.g., 12 × 12 in size) corresponding to the left arm, the right arm, the left leg, and the right leg, respectively. Then, the intermediate region features corresponding to the head-shoulder, the upper body, the lower body, the left arm, the right arm, the left leg, and the right leg are each convolved and pooled by the CNN to generate convolved intermediate region features (e.g., 6 × 6 in size) corresponding to these parts. After obtaining the convolved intermediate region features corresponding to the head-shoulder, upper body, lower body, left arm, right arm, left leg, and right leg, the fusion method described in connection with fig. 6 may be used to fuse the convolved intermediate region features corresponding to the left leg and the right leg into a leg fusion feature; fuse the convolved intermediate region features corresponding to the left arm and the right arm into an arm fusion feature; fuse the arm fusion feature and the convolved intermediate region feature corresponding to the upper body into the region feature corresponding to the upper body; fuse the leg fusion feature and the convolved intermediate region feature corresponding to the lower body into the region feature corresponding to the lower body; and use the convolved intermediate region feature corresponding to the head-shoulder as the region feature corresponding to the head-shoulder.
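The hierarchical limb fusion described in this paragraph can be sketched as below; the fusion of two equally sized 6×6 part features is simplified to an element-wise maximum, and the dictionary keys are hypothetical names:

    import torch

    def fuse(a, b):
        # Competition fusion of two equally sized part features, simplified
        # to an element-wise maximum (spatial alignment omitted).
        return torch.maximum(a, b)

    # Convolved intermediate region features for the seven parts (assumed 64x6x6).
    parts = {name: torch.randn(64, 6, 6) for name in
             ['head_shoulder', 'upper_body', 'lower_body',
              'left_arm', 'right_arm', 'left_leg', 'right_leg']}

    leg_fusion = fuse(parts['left_leg'], parts['right_leg'])
    arm_fusion = fuse(parts['left_arm'], parts['right_arm'])
    upper_body_region = fuse(arm_fusion, parts['upper_body'])
    lower_body_region = fuse(leg_fusion, parts['lower_body'])
    head_shoulder_region = parts['head_shoulder']   # used as-is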
Then, in step S1300, the region feature corresponding to the upper body, the region feature corresponding to the lower body, and the region feature corresponding to the head and shoulder may be fused into a fusion feature by the fusion method described with reference to fig. 6, and the fusion feature and the initial image feature may be fused into the target image feature of the pedestrian.
Fig. 7 exemplarily shows an apparatus 700 for extracting object features according to an embodiment of the present application. The device includes: an image feature generation module 710 that generates initial image features corresponding to objects in an image; a region feature generation module 720 that generates a plurality of region features corresponding to a plurality of regions of the image, respectively; and a fusion module 730 for fusing the initial image feature and the plurality of region features to obtain a target image feature of the object.
In one embodiment, the image feature generation module 710 includes: an intermediate image feature generation submodule 711 for performing convolution and pooling on the image to generate an intermediate image feature; and an initial image feature generation sub-module 712 that generates initial image features corresponding to objects in the image based on the intermediate image features.
In one embodiment, the region feature generation module 720 includes: a region extraction sub-module 721 that extracts a plurality of regions from the image; an intermediate region feature generation sub-module 722 pools the intermediate image features based on the plurality of regions to generate a plurality of intermediate region features corresponding to the plurality of regions, respectively, and a region feature generation sub-module 723 generates a plurality of region features corresponding to the plurality of regions based on the plurality of intermediate region features, respectively.
In one embodiment, the fusion module 730 includes: a first fusion submodule 731 fusing the plurality of region features into a fusion feature; and a second fusion sub-module 732 that fuses the fused features and the initial image features into target image features of the object.
In one embodiment, the first fusion submodule 731 is further configured to: selecting the feature with the highest identification in a first overlapping area in the plurality of area features as the feature corresponding to the first overlapping area; and generating a fusion feature based on the feature corresponding to the first overlapping area.
In one embodiment, the second fusion submodule 732 is further configured to: selecting the feature with the highest identification as the feature corresponding to the second overlapping area in which the fused feature and the initial image feature are overlapped with each other; and generating a target image feature of the object based on the feature corresponding to the second overlapping area.
In one embodiment, the region extraction sub-module 721 is configured to: the image is divided into a plurality of regions according to the extraction reference points.
In one embodiment, the object comprises a pedestrian, a human face, or a vehicle.
In one embodiment, when the object is a pedestrian, the extracted reference points include human body localization points of the pedestrian.
In one embodiment, when the object is a pedestrian, the region extraction sub-module 721 is configured to: extracting regions corresponding to the head and shoulder, the upper body, and the lower body of the pedestrian, respectively, from the image; and a middle region feature generation submodule 722 for: the intermediate image features are pooled based on a plurality of regions corresponding to the head-shoulder, upper-body, and lower-body, respectively, to generate a plurality of intermediate region features corresponding to the head-shoulder, upper-body, and lower-body, respectively.
In one embodiment, when the object is a pedestrian, the region feature generation submodule 723 is configured to: regional features corresponding to the head-shoulder, upper-body, and lower-body are generated based on a plurality of intermediate regional features corresponding to the head-shoulder, upper-body, and lower-body, respectively.
In one embodiment, when the subject is a pedestrian, the first fusion sub-module 731 is configured to: a region feature corresponding to the upper half, a region feature corresponding to the lower half, and a region feature corresponding to the head-shoulder are fused into a fused feature.
In one embodiment, when the object is a pedestrian, the initial image feature generation sub-module 712 is configured to: convolving and pooling the intermediate image features to generate convolved intermediate image features; and convolving and pooling the convolved intermediate image features to generate initial image features.
In one embodiment, when the object is a pedestrian, the region extraction sub-module 721 is further configured to: a plurality of regions respectively corresponding to a left arm, a right arm, a left leg, and a right leg of a pedestrian are extracted from an image.
In one embodiment, when the object is a pedestrian, the region feature generation sub-module 723 is configured to pool the convolved intermediate image features based on a plurality of regions corresponding to the left arm, the right arm, the left leg, and the right leg, respectively, to generate a plurality of intermediate region features corresponding to the left arm, the right arm, the left leg, and the right leg, respectively; convolving and pooling a plurality of intermediate region features corresponding to the head-shoulder, the upper-body, the lower-body, the left-arm, the right-arm, the left-leg, and the right-leg, respectively, to generate convolved intermediate region features corresponding to the head-shoulder, the upper-body, the lower-body, the left-arm, the right-arm, the left-leg, and the right-leg; fusing the convolved mid-region features corresponding to the left leg and the right leg, respectively, into leg-fused features; fusing the convolved mid-region features corresponding to the left arm and the right arm, respectively, into arm fusion features; fusing the arm fusion feature and the convolved middle region feature corresponding to the upper torso into a region feature corresponding to the upper torso; fusing the leg fusion feature and the convolved middle region feature corresponding to the lower body into a region feature corresponding to the lower body; and using the convolved intermediate region features corresponding to the head-shoulder as the region features corresponding to the head-shoulder.
In one embodiment, the features include a feature map or a feature vector.
As will be appreciated by those skilled in the art, the apparatus 700 for extracting object features may be implemented in the form of an integrated circuit (IC), including but not limited to a digital signal processor, a graphics processing IC, an image processing IC, an audio processing IC, and the like. From the teaching provided in this application, those skilled in the art will know whether to implement the apparatus 700 for extracting object features in hardware or in software. For example, the present application may be implemented in the form of a storage medium storing computer-executable instructions which, when executed by a computer, implement the functions of the apparatus 700 for extracting object features described above. The apparatus 700 for extracting object features of the present application may also be implemented by a computer system comprising a memory storing computer-executable instructions and a processor in communication with the memory, the processor executing the executable instructions to realize the functions of the apparatus 700 described above with reference to fig. 7.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for implementing embodiments of the present application. The computer system 800 may include a processing unit (e.g., a central processing unit (CPU) 801, a graphics processing unit (GPU), etc.) that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the system 800 can also be stored. The CPU 801, ROM 802, and RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card (e.g., a LAN card, a modem, and the like). The communication section 809 may perform communication processing via a network such as the Internet. A drive 810 may also be connected to the I/O interface 805 as necessary, and a removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory may be mounted on the drive 810 so that a computer program read therefrom is installed into the storage section 808 as necessary.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules referred to in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor. The names of these units or modules should not be construed as limiting these units or modules.
The above description is only an exemplary embodiment of the present application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the present application is not limited to embodiments with the specific combination of the above-mentioned features, but also covers other embodiments formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept, for example embodiments in which the above features are replaced by technical features with similar functions disclosed in the present application.

Claims (29)

1. A method of extracting features, comprising:
generating initial image features corresponding to objects in an image;
generating a plurality of region features corresponding to a plurality of regions of the image, respectively;
fusing the plurality of region features into a fused feature, wherein in a first overlapping region of the plurality of region features, a feature with the highest identifiability is selected as a feature corresponding to the first overlapping region, and the fused feature is generated based on the feature corresponding to the first overlapping region; and
fusing the fused feature and the initial image feature as a target image feature of the object.
2. The method of claim 1, generating initial image features corresponding to objects in the image comprising:
performing convolution and pooling on the image to generate intermediate image features; and
generating initial image features corresponding to objects in the image based on the intermediate image features.
3. The method of claim 2, generating a plurality of region features corresponding to a plurality of regions of the image, respectively, comprising:
extracting a plurality of regions from the image;
pooling the intermediate image features based on the plurality of regions to generate a plurality of intermediate region features corresponding to the plurality of regions, respectively, an
Generating a plurality of region features respectively corresponding to the plurality of regions based on the plurality of intermediate region features.
4. The method of claim 1, wherein fusing the fused feature with the initial image feature to a target image feature of the object comprises:
selecting, in a second overlapping area in which the fused feature and the initial image feature overlap each other, the feature with the highest identifiability as the feature corresponding to the second overlapping area; and
generating a target image feature of the object based on the feature corresponding to the second overlapping area.
5. The method of claim 3, wherein extracting a plurality of regions from the image comprises:
dividing the image into a plurality of the regions according to the extraction reference points.
6. The method of claim 5, wherein the object comprises a pedestrian, a human face, or a vehicle.
7. The method of claim 6, wherein, when the object is a pedestrian, the extracted reference points comprise human body localization points of the pedestrian.
8. The method of claim 7, wherein, when the object is a pedestrian,
extracting a plurality of regions from the image comprises:
extracting regions corresponding to the head and shoulder, the upper body, and the lower body of the pedestrian, respectively, from the image; and
pooling the intermediate image features based on the plurality of regions, and generating a plurality of intermediate region features corresponding to the plurality of regions, respectively, comprises:
the intermediate image features are pooled based on a plurality of regions corresponding to the head-shoulder, the upper-body, and the lower-body, respectively, to generate a plurality of intermediate region features corresponding to the head-shoulder, the upper-body, and the lower-body, respectively.
9. The method of claim 8, wherein generating a plurality of region features respectively corresponding to the plurality of regions based on the plurality of intermediate region features when the object is a pedestrian comprises:
generating region features corresponding to the head-shoulder, the upper-body, and the lower-body based on a plurality of middle region features corresponding to the head-shoulder, the upper-body, and the lower-body, respectively.
10. The method of claim 9, wherein fusing the plurality of region features into a fused feature comprises, when the object is a pedestrian:
and fusing the region feature corresponding to the upper body, the region feature corresponding to the lower body, and the region feature corresponding to the head-shoulder portion into a fused feature.
11. The method of claim 10, wherein generating initial image features corresponding to objects in the image based on the intermediate image features comprises, when the object is a pedestrian:
convolving and pooling the intermediate image features to generate convolved intermediate image features; and
convolving and pooling the convolved intermediate image features to generate initial image features.
12. The method of claim 11, wherein, when the object is a pedestrian, extracting a plurality of regions from the image further comprises:
a plurality of regions respectively corresponding to a left arm, a right arm, a left leg, and a right leg of the pedestrian are extracted from the image.
13. The method of claim 12, wherein, when the object comprises a pedestrian, generating region features corresponding to the head-shoulder, the upper-body, and the lower-body based on a plurality of intermediate region features corresponding to the head-shoulder, the upper-body, and the lower-body, respectively, comprises:
pooling the convolved intermediate image features based on a plurality of regions corresponding to the left arm, the right arm, the left leg, and the right leg, respectively, to generate a plurality of intermediate region features corresponding to the left arm, the right arm, the left leg, and the right leg, respectively;
convolving and pooling a plurality of intermediate region features corresponding to the head-shoulder, the upper-body, the lower-body, the left-arm, the right-arm, the left-leg, and the right-leg, respectively, to generate convolved intermediate region features corresponding to the head-shoulder, the upper-body, the lower-body, the left-arm, the right-arm, the left-leg, and the right-leg;
fusing the convolved mid-region features corresponding to the left leg and the right leg, respectively, into leg-fused features;
fusing the convolved mid-region features corresponding to the left arm and the right arm, respectively, into arm fusion features;
fusing the arm fusion feature and the convolved middle region feature corresponding to the upper torso into a region feature corresponding to the upper torso;
fusing the leg fusion feature and the convolved middle region feature corresponding to the lower body into a region feature corresponding to the lower body; and
using the convolved intermediate region features corresponding to the head-shoulder as the region features corresponding to the head-shoulder.
14. The method of claim 1, wherein the feature comprises a feature map or a feature vector.
15. An apparatus for extracting features, comprising:
an image feature generation module that generates an initial image feature corresponding to an object in an image;
a region feature generation module that generates a plurality of region features corresponding to a plurality of regions of the image, respectively; and
a fusion module for fusing the initial image feature and the plurality of region features to obtain a target image feature of the object,
the fusion module includes:
a first fusion submodule, configured to fuse the plurality of region features into a fusion feature, where in a first overlapping region of the plurality of region features, a feature with a highest identifiability is selected as a feature corresponding to the first overlapping region, and the fusion feature is generated based on the feature corresponding to the first overlapping region; and
and the second fusion submodule fuses the fusion feature and the initial image feature into a target image feature of the object.
16. The apparatus of claim 15, wherein the image feature generation module comprises:
the intermediate image feature generation submodule is used for performing convolution and pooling on the image to generate intermediate image features; and
an initial image feature generation sub-module that generates initial image features corresponding to objects in the image based on the intermediate image features.
17. The apparatus of claim 16, wherein the regional feature generation module comprises:
a region extraction sub-module that extracts a plurality of regions from the image;
an intermediate region feature generation sub-module that pools the intermediate image features based on the plurality of regions to generate a plurality of intermediate region features respectively corresponding to the plurality of regions; and
a region feature generation sub-module that generates a plurality of region features respectively corresponding to the plurality of regions based on the plurality of intermediate region features.
18. The apparatus of claim 15, wherein the second fusion submodule is further to:
selecting, in a second overlapping area in which the fused feature and the initial image feature overlap each other, the feature with the highest identifiability as the feature corresponding to the second overlapping area; and
generating a target image feature of the object based on the feature corresponding to the second overlapping area.
19. The apparatus of claim 17, wherein the region extraction submodule is to:
dividing the image into a plurality of the regions according to the extraction reference points.
20. The apparatus of claim 19, wherein the object comprises a pedestrian, a human face, or a vehicle.
21. The apparatus of claim 20, wherein, when the object is a pedestrian, the extracted reference point comprises a human body localization point of the pedestrian.
22. The apparatus of claim 21, wherein, when the object is a pedestrian,
the region extraction submodule is configured to:
extracting regions corresponding to the head and shoulder, the upper body, and the lower body of the pedestrian, respectively, from the image; and
the middle region feature generation submodule is configured to:
the intermediate image features are pooled based on a plurality of regions corresponding to the head-shoulder, the upper-body, and the lower-body, respectively, to generate a plurality of intermediate region features corresponding to the head-shoulder, the upper-body, and the lower-body, respectively.
23. The apparatus of claim 22, wherein, when the object is a pedestrian, the region feature generation submodule is configured to:
generate region features corresponding to the head-shoulder, the upper body, and the lower body based on the plurality of intermediate region features respectively corresponding to the head-shoulder, the upper body, and the lower body.
24. The apparatus of claim 23, wherein, when the object is a pedestrian, the first fusion submodule is configured to:
fuse the region feature corresponding to the upper body, the region feature corresponding to the lower body, and the region feature corresponding to the head-shoulder into the fused feature.
25. The apparatus of claim 24, wherein, when the object is a pedestrian, the initial image feature generation submodule is configured to:
convolve and pool the intermediate image features to generate convolved intermediate image features; and
convolve and pool the convolved intermediate image features to generate the initial image feature.
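For claims 16 and 25, here is a minimal PyTorch-style sketch of the two-stage structure (image to intermediate image features, then two further convolution-and-pooling stages to the initial image feature); the channel counts, kernel sizes, and input resolution are arbitrary assumptions, and the patent does not prescribe any particular framework or layer configuration.

```python
import torch
import torch.nn as nn

intermediate_stage = nn.Sequential(                 # image -> intermediate image features
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

initial_stage = nn.Sequential(                      # intermediate -> initial image feature
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # convolved intermediate features
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))  # initial image feature

image = torch.randn(1, 3, 128, 64)                  # toy pedestrian crop, N x C x H x W
intermediate = intermediate_stage(image)            # (1, 64, 64, 32)
initial = initial_stage(intermediate)               # (1, 256, 16, 8)
print(intermediate.shape, initial.shape)
```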
26. The apparatus of claim 25, wherein, when the object is a pedestrian, the region extraction submodule is further configured to:
extract, from the image, a plurality of regions respectively corresponding to the left arm, the right arm, the left leg, and the right leg of the pedestrian.
27. The apparatus of claim 26, wherein, when the object is a pedestrian, the region feature generation submodule is configured to:
pool the convolved intermediate image features based on the plurality of regions respectively corresponding to the left arm, the right arm, the left leg, and the right leg, to generate a plurality of intermediate region features respectively corresponding to the left arm, the right arm, the left leg, and the right leg;
convolve and pool the plurality of intermediate region features respectively corresponding to the head-shoulder, the upper body, the lower body, the left arm, the right arm, the left leg, and the right leg, to generate convolved intermediate region features corresponding to the head-shoulder, the upper body, the lower body, the left arm, the right arm, the left leg, and the right leg;
fuse the convolved intermediate region features respectively corresponding to the left leg and the right leg into a leg fusion feature;
fuse the convolved intermediate region features respectively corresponding to the left arm and the right arm into an arm fusion feature;
fuse the arm fusion feature and the convolved intermediate region feature corresponding to the upper body into the region feature corresponding to the upper body;
fuse the leg fusion feature and the convolved intermediate region feature corresponding to the lower body into the region feature corresponding to the lower body; and
use the convolved intermediate region feature corresponding to the head-shoulder as the region feature corresponding to the head-shoulder.
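The fusion order in claim 27 can be summarized with the short sketch below, assuming every part feature has already been convolved to a common shape and again using the element-wise larger response as a stand-in for the unspecified identifiability score; all names and shapes are illustrative.

```python
import numpy as np

def fuse(a, b):
    """Keep the stronger of two aligned responses at each location."""
    return np.where(np.abs(a) >= np.abs(b), a, b)

part_names = ["head_shoulder", "upper_body", "lower_body",
              "left_arm", "right_arm", "left_leg", "right_leg"]
parts = {name: np.random.randn(128, 8, 4) for name in part_names}   # convolved intermediate region features

leg_fusion = fuse(parts["left_leg"], parts["right_leg"])            # leg fusion feature
arm_fusion = fuse(parts["left_arm"], parts["right_arm"])            # arm fusion feature
upper_body_region = fuse(arm_fusion, parts["upper_body"])           # region feature for the upper body
lower_body_region = fuse(leg_fusion, parts["lower_body"])           # region feature for the lower body
head_shoulder_region = parts["head_shoulder"]                       # used as-is per the claim
print(upper_body_region.shape, lower_body_region.shape, head_shoulder_region.shape)
```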
28. The apparatus of claim 15, wherein the feature comprises a feature map or a feature vector.
29. A system for extracting features, comprising:
a memory storing executable instructions;
one or more processors in communication with the memory to execute the executable instructions to:
generating initial image features corresponding to objects in an image;
generating a plurality of region features corresponding to a plurality of regions of the image, respectively;
fusing the plurality of region features into a fused feature, wherein in a first overlapping region of the plurality of region features, a feature with the highest identifiability is selected as a feature corresponding to the first overlapping region, and the fused feature is generated based on the feature corresponding to the first overlapping region; and
fusing the fused feature and the initial image feature into a target image feature of the object.
CN201710195256.6A 2017-03-29 2017-03-29 Method, device and system for extracting features Active CN108229492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710195256.6A CN108229492B (en) 2017-03-29 2017-03-29 Method, device and system for extracting features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710195256.6A CN108229492B (en) 2017-03-29 2017-03-29 Method, device and system for extracting features

Publications (2)

Publication Number Publication Date
CN108229492A CN108229492A (en) 2018-06-29
CN108229492B true CN108229492B (en) 2020-07-28

Family

ID=62657374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710195256.6A Active CN108229492B (en) 2017-03-29 2017-03-29 Method, device and system for extracting features

Country Status (1)

Country Link
CN (1) CN108229492B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109920016B (en) * 2019-03-18 2021-06-25 北京市商汤科技开发有限公司 Image generation method and device, electronic equipment and storage medium
US12094189B2 (en) 2019-05-13 2024-09-17 Nippon Telegraph And Telephone Corporation Learning method, learning program, and learning device to accurately identify sub-objects of an object included in an image
CN110705345A (en) * 2019-08-21 2020-01-17 重庆特斯联智慧科技股份有限公司 Pedestrian re-identification method and system based on deep learning
CN111444928A (en) * 2020-03-30 2020-07-24 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631413A (en) * 2015-12-23 2016-06-01 中通服公众信息产业股份有限公司 Cross-scene pedestrian searching method based on depth learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631413A (en) * 2015-12-23 2016-06-01 中通服公众信息产业股份有限公司 Cross-scene pedestrian searching method based on depth learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fusion of Global and Local Features Using KCCA for Automatic Target Recognition; Jiong Zhao et al.; 2009 Fifth International Conference on Image and Graphics; 2010-08-30; abstract and page 1, left column *
Good Practice in CNN Feature Transfer; Liang Zheng et al.; arXiv:1604.00133v1; 2016-04-01; entire document *

Also Published As

Publication number Publication date
CN108229492A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN110176027B (en) Video target tracking method, device, equipment and storage medium
CN110135455B (en) Image matching method, device and computer readable storage medium
CN108229492B (en) Method, device and system for extracting features
CN112052839B (en) Image data processing method, apparatus, device and medium
CN108875523B (en) Human body joint point detection method, device, system and storage medium
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
US11037325B2 (en) Information processing apparatus and method of controlling the same
US20230237666A1 (en) Image data processing method and apparatus
AU2018200164A1 (en) Forecasting human dynamics from static images
CN110084154B (en) Method and device for rendering image, electronic equipment and computer readable storage medium
CN110858414A (en) Image processing method and device, readable storage medium and augmented reality system
CN112508989B (en) Image processing method, device, server and medium
CN112651881A (en) Image synthesis method, apparatus, device, storage medium, and program product
CN114782911B (en) Image processing method, device, equipment, medium, chip and vehicle
CN114842035A (en) License plate desensitization method, device and equipment based on deep learning and storage medium
CN113837130A (en) Human hand skeleton detection method and system
CN108564058A (en) A kind of image processing method, device and computer readable storage medium
CN115761885B (en) Behavior recognition method for common-time and cross-domain asynchronous fusion driving
CN117529758A (en) Methods, systems, and media for identifying human collaborative activity in images and videos using neural networks
CN113544701B (en) Method and device for detecting associated object, electronic equipment and storage medium
CN107633498A (en) Image dark-state Enhancement Method, device and electronic equipment
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN112258435A (en) Image processing method and related product
CN112580544A (en) Image recognition method, device and medium and electronic equipment thereof
CN114565521B (en) Image restoration method, device, equipment and storage medium based on virtual reloading

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant