CN116188349A - Image processing method, device, electronic equipment and storage medium - Google Patents

Image processing method, device, electronic equipment and storage medium

Publication number
CN116188349A
Authority
CN
China
Prior art keywords
image
feature map
target area
information
image processing
Prior art date
Legal status
Pending
Application number
CN202111421144.0A
Other languages
Chinese (zh)
Inventor
李炜明
汪昊
何宝
金知姸
张现盛
洪性勋
王强
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to CN202111421144.0A (publication CN116188349A)
Priority to KR1020220122436A (publication KR20230078502A)
Priority to EP22209621.6A (publication EP4187483A1)
Priority to JP2022188155A (publication JP2023079211A)
Priority to US17/994,659 (publication US20230169755A1)
Publication of CN116188349A

Classifications

    • G06T7/11 Region-based segmentation
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06N3/08 Neural networks; Learning methods
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G06T5/80 Geometric correction
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06V10/40 Extraction of image or video features
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30244 Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image processing method, an apparatus, an electronic device, and a storage medium. The image processing method includes: acquiring a feature map of a first image, and detecting a target area in the first image based on the feature map; correcting the detected target area; and processing the object corresponding to the target area based on the corrected target area. In addition, the above-described image processing method may be performed using an artificial intelligence model.

Description

Image processing method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to an image processing method, an image processing device, an electronic device, and a storage medium.
Background
Currently, before processing an image, the entire image is corrected in order to ensure the object processing effect, and the object is then processed based on the corrected whole image. For example, a fisheye image captured by a fisheye camera contains significant distortion; the related art requires correcting the entire fisheye image before processing an object in it, and performs object processing (e.g., object recognition, segmentation, pose estimation, etc.) based on the corrected whole image. However, such an image processing method not only significantly stretches objects during whole-image correction, which degrades the subsequent object processing effect, but also slows down image processing. In view of this, a better image processing technique is needed to improve the object processing effect or the image processing speed.
Disclosure of Invention
The present disclosure provides an image processing method, apparatus, electronic device, and storage medium to solve at least the above-mentioned problems in the related art.
According to a first aspect of embodiments of the present disclosure, there is provided an image processing method including: acquiring a feature map of a first image, and detecting a target area in the first image based on the feature map; correcting the detected target area; and processing the object corresponding to the target area based on the corrected target area.
Optionally, the acquiring a feature map of the first image, and detecting the target area in the first image based on the feature map includes: extracting features of the first image on at least one scale to obtain at least one feature map of the first image; a target region in the first image is detected based on the at least one feature map.
Optionally, the extracting the features of the first image on at least one scale to obtain at least one feature map of the first image includes: and performing convolution operation on the first image on each scale of the at least one scale by using a convolution neural network to obtain a feature map of each scale, wherein the convolution neural network performs convolution operation on each position of at least one position on the first image by adopting a convolution kernel corresponding to each position.
Optionally, the performing a convolution operation on the first image by using a convolutional neural network to obtain a feature map of each scale includes: obtaining a sampling position of a convolution kernel corresponding to each of the at least one position on the first image, wherein the sampling position of the convolution kernel is determined from an imaging model of the first image; and performing convolution operation according to the sampling position of the convolution kernel corresponding to each position to obtain a feature map of each scale.
Optionally, the sampling position of the convolution kernel is determined by: determining the sampling position of a convolution kernel function of each position in a three-dimensional space according to the imaging model; a sampling position of the convolution kernel corresponding to each position on the first image is determined from the sampling positions of the convolution kernel in three-dimensional space and the imaging model.
Optionally, the at least one feature map is a plurality of feature maps, and the detecting the target area in the first image based on the at least one feature map includes: fusing feature maps of adjacent scales among the plurality of feature maps, and detecting the target area in the first image based on at least one fused feature map.
Optionally, the correcting the detected target area includes: determining, in the feature map of the first image, a first feature region corresponding to the detected target area as a first target area feature map; and spatially transforming the first target area feature map to generate a transformed first target area feature map, wherein the processing the object corresponding to the target area based on the corrected target area includes: processing the object corresponding to the target area based on the transformed first target area feature map.
Optionally, the spatially transforming the first target region feature map to generate a transformed first target region feature map includes: creating a virtual camera corresponding to a target area according to an imaging model of a first image and the detected target area; and performing spatial transformation on the first target area feature map by using the virtual camera to generate a transformed first target area feature map.
Optionally, the light rays corresponding to the optical axis of the virtual camera pass through the center of the detected target area after being refracted by the imaging model.
Optionally, the processing the object corresponding to the target area based on the transformed first target area feature map includes: obtaining first attribute information of an object corresponding to the target area based on the transformed first target area feature map; and processing the object corresponding to the target area according to the first attribute information.
Optionally, the image processing method further includes: acquiring a second image associated with the first image; and obtaining second attribute information of the object based on the second image, wherein the processing the object corresponding to the target area according to the first attribute information includes: processing the object corresponding to the target area according to the first attribute information and the second attribute information.
Optionally, the processing the object corresponding to the target area includes: at least one of object recognition, object segmentation, and object pose estimation is performed on the object.
Optionally, the first attribute information includes at least one of category information, mask information, key point information, and pose information of the object.
Optionally, the first attribute information includes first key point information and initial pose information of the object, the second attribute information includes second key point information of the object, and the processing the object corresponding to the target area according to the first attribute information and the second attribute information includes: final pose information of the object is estimated based on the initial pose information, the first keypoint information, and the second keypoint information.
Optionally, the first image is one of a left-eye image and a right-eye image, and the second image is the other of the left-eye image and the right-eye image.
Optionally, the obtaining the second attribute information of the object based on the second image includes: determining a target area corresponding to the object on a second image based on the initial pose information and parameters of a first camera generating the first image and a second camera generating the second image; and obtaining second key point information of the object based on a target area corresponding to the object on the second image.
Optionally, the determining, based on the initial pose information and parameters of the first camera that generates the first image and the second camera that generates the second image, a target area on the second image corresponding to the object includes: determining initial pose information of the object in the coordinate system of the first camera based on the initial pose information and the parameters of the first camera; determining initial pose information of the object in the coordinate system of the second camera based on the initial pose information of the object in the coordinate system of the first camera and the parameters of the second camera; and determining the target area corresponding to the object on the second image according to the initial pose information of the object in the coordinate system of the second camera.
Optionally, the obtaining the second keypoint information of the object based on the target area corresponding to the object on the second image includes: and correcting a target area corresponding to the object on the second image, and obtaining second key point information of the object based on the corrected target area.
Optionally, the correcting the target area corresponding to the object on the second image includes: acquiring a feature map of the second image; determining, on the feature map of the second image, a second feature region corresponding to the target area on the second image as a second target area feature map; spatially transforming the second target area feature map to generate a transformed second target area feature map; and obtaining the second key point information of the object based on the transformed second target area feature map.
According to a second aspect of the embodiments of the present disclosure, there is provided an image processing method, including: performing a convolution operation on the first image by using a convolution neural network to obtain a feature map of the first image, wherein the convolution neural network performs the convolution operation on each of at least one position on the first image by adopting a convolution kernel corresponding to the each position; and processing the object in the first image based on the characteristic map.
Optionally, the performing a convolution operation on the first image by using the convolutional neural network to obtain a feature map of the first image includes: obtaining a sampling position of a convolution kernel corresponding to each of the at least one position on the first image, wherein the sampling position of the convolution kernel is determined from an imaging model of the first image; and performing the convolution operation according to the sampling position of the convolution kernel corresponding to each position to obtain the feature map.
Optionally, the sampling position of the convolution kernel is determined by: determining the sampling position of a convolution kernel function of each position in a three-dimensional space according to the imaging model; a sampling position of the convolution kernel corresponding to each position on the first image is determined from the sampling positions of the convolution kernel in three-dimensional space and the imaging model.
According to a third aspect of the embodiments of the present disclosure, there is provided an image processing apparatus including: a detection unit configured to: acquiring a feature map of a first image, and detecting a target area in the first image based on the feature map; a correction unit configured to correct the detected target area; and an image processing unit configured to process an object corresponding to the target area based on the corrected target area.
Optionally, the detecting unit acquires a feature map of the first image, and detects the target area in the first image based on the feature map, including: extracting features of the first image on at least one scale to obtain at least one feature map of the first image; a target region in the first image is detected based on the at least one feature map.
Optionally, the detecting unit extracts features of the first image on at least one scale to obtain at least one feature map of the first image, including: and performing convolution operation on the first image on each scale of the at least one scale by using a convolution neural network to obtain a feature map of each scale, wherein the convolution neural network performs convolution operation on each position of at least one position on the first image by adopting a convolution kernel corresponding to each position.
Optionally, the detecting unit performs a convolution operation on the first image by using a convolutional neural network to obtain a feature map of each scale, including: obtaining a sampling position of a convolution kernel corresponding to each of the at least one position on the first image, wherein the sampling position of the convolution kernel is determined from an imaging model of the first image; and performing convolution operation according to the sampling position of the convolution kernel corresponding to each position to obtain a feature map of each scale.
Optionally, the sampling position of the convolution kernel is determined by: determining the sampling position of a convolution kernel function of each position in a three-dimensional space according to the imaging model; a sampling position of the convolution kernel corresponding to each position on the first image is determined from the sampling positions of the convolution kernel in three-dimensional space and the imaging model.
Optionally, the at least one feature map is a plurality of feature maps, and the detecting unit detects the target area in the first image based on the at least one feature map, including: fusing feature maps of adjacent scales among the plurality of feature maps, and detecting the target area in the first image based on at least one fused feature map.
Optionally, the correcting unit corrects the detected target area, including: determining, in the feature map of the first image, a first feature region corresponding to the detected target area as a first target area feature map; and spatially transforming the first target area feature map to generate a transformed first target area feature map, wherein the image processing unit processes the object corresponding to the target area based on the corrected target area, including: processing the object corresponding to the target area based on the transformed first target area feature map.
Optionally, the correcting unit performs spatial transformation on the first target region feature map to generate a transformed first target region feature map, including: creating a virtual camera corresponding to a target area according to an imaging model of a first image and the detected target area; and performing spatial transformation on the first target area feature map by using the virtual camera to generate a transformed first target area feature map.
Optionally, the light rays corresponding to the optical axis of the virtual camera pass through the center of the detected target area after being refracted by the imaging model.
Optionally, the image processing unit processes the object corresponding to the target area based on the transformed first target area feature map, including: obtaining first attribute information of an object corresponding to the target area based on the transformed first target area feature map; and processing the object corresponding to the target area according to the first attribute information.
Optionally, the image processing unit is further configured to: acquire a second image associated with the first image; and obtain second attribute information of the object based on the second image, wherein the processing the object corresponding to the target area according to the first attribute information includes: processing the object corresponding to the target area according to the first attribute information and the second attribute information.
Optionally, the image processing unit processes an object corresponding to a target area, including: at least one of object recognition, object segmentation, and object pose estimation is performed on the object.
Optionally, the first attribute information includes at least one of category information, mask information, key point information, and pose information of the object.
Optionally, the first attribute information includes first key point information and initial pose information of the object, the second attribute information includes second key point information of the object, and the image processing unit processes the object corresponding to the target area according to the first attribute information and the second attribute information, including: final pose information of the object is estimated based on the initial pose information, the first keypoint information, and the second keypoint information.
Optionally, the first image is one of a left-eye image and a right-eye image, and the second image is the other of the left-eye image and the right-eye image.
Optionally, the image processing unit obtains second attribute information of the object based on a second image, including: determining a target area corresponding to the object on a second image based on the initial pose information and parameters of a first camera generating the first image and a second camera generating the second image; and obtaining second key point information of the object based on a target area corresponding to the object on the second image.
Optionally, the image processing unit determines a target area corresponding to the object on the second image based on the initial pose information and parameters of a first camera generating the first image and a second camera generating the second image, including: determining initial pose information of the object in the coordinate system of the first camera based on the initial pose information and the parameters of the first camera; determining initial pose information of the object in the coordinate system of the second camera based on the initial pose information of the object in the coordinate system of the first camera and the parameters of the second camera; and determining the target area corresponding to the object on the second image according to the initial pose information of the object in the coordinate system of the second camera.
Optionally, the image processing unit obtains second key point information of the object based on a target area corresponding to the object on a second image, including: and correcting a target area corresponding to the object on the second image, and obtaining second key point information of the object based on the corrected target area.
Optionally, the correction unit corrects the target area corresponding to the object on the second image, including: acquiring a feature map of the second image; determining, on the feature map of the second image, a second feature region corresponding to the target area on the second image as a second target area feature map; spatially transforming the second target area feature map to generate a transformed second target area feature map; and obtaining the second key point information of the object based on the transformed second target area feature map.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an image processing apparatus including: an acquisition unit configured to: performing a convolution operation on the first image by using a convolution neural network to obtain a feature map of the first image, wherein the convolution neural network performs the convolution operation on each of at least one position on the first image by adopting a convolution kernel corresponding to the each position; and an image processing unit configured to process the object in the first image based on the feature map.
Optionally, the obtaining unit performs a convolution operation on the first image using a convolutional neural network to obtain a feature map of the first image, including: obtaining a sampling position of a convolution kernel corresponding to each of the at least one position on the first image, wherein the sampling position of the convolution kernel is determined from an imaging model of the first image; and performing the convolution operation according to the sampling position of the convolution kernel corresponding to each position to obtain the feature map.
Optionally, the sampling position of the convolution kernel is determined by: determining the sampling position of a convolution kernel function of each position in a three-dimensional space according to the imaging model; a sampling position of the convolution kernel corresponding to each position on the first image is determined from the sampling positions of the convolution kernel in three-dimensional space and the imaging model.
According to a fifth aspect of the embodiments of the present disclosure, there is provided an electronic device, including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the image processing method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions, which when executed by at least one processor, cause the at least one processor to perform an image processing method as described above.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects: according to the image processing method and the image processing apparatus of the present disclosure, when an object in an image is processed, the object processing effect and/or the image processing speed can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments consistent with the disclosure and, together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a flowchart of an image processing method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a schematic diagram showing a process of extracting features of a first image on a plurality of scales using a deformable CNN.
Fig. 3 is a schematic diagram illustrating the determination of the sampling locations of a deformable CNN convolution kernel.
Fig. 4 is a schematic diagram illustrating multi-scale feature fusion.
Fig. 5 is a schematic diagram of reverse-deformation ROI pooling.
Fig. 6 is a schematic diagram of estimating an object pose according to an exemplary embodiment of the present disclosure.
Fig. 7 is a schematic diagram illustrating an example of an image processing method according to an exemplary embodiment of the present disclosure.
Fig. 8 is a detailed view of the example shown in Fig. 7.
Fig. 9 is a diagram of a scene to which an image processing method according to an exemplary embodiment of the present disclosure is applied.
Fig. 10 is a flowchart illustrating an image processing method according to another exemplary embodiment of the present disclosure.
Fig. 11 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 12 is a block diagram of an image processing apparatus according to another exemplary embodiment of the present disclosure.
Fig. 13 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The embodiments described in the examples below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in the present disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "a combination of any of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Similarly, "at least one of step one and step two is executed" covers the following three parallel cases: (1) executing step one; (2) executing step two; (3) executing step one and step two.
As described in the Background section, the existing image processing method not only causes significant stretching of the object during whole-image correction, which degrades the subsequent object processing effect, but also slows down image processing. For this reason, the present disclosure proposes to detect a target area first and then correct only the detected target area. This not only reduces the stretching imposed on the object, which helps to improve the subsequent object processing effect, but also improves the image processing speed, since irrelevant areas are not corrected.
Hereinafter, an image processing method according to an exemplary embodiment of the present disclosure will be described with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating an image processing method according to an exemplary embodiment of the present disclosure. Referring to Fig. 1, in step S110, a feature map of a first image is acquired, and a target region in the first image is detected based on the feature map. Here, the target region may be a region of interest (ROI), for example, a candidate region. The first image may be a fisheye image, for example, one of a left fisheye image and a right fisheye image captured with a fisheye stereo camera, but is not limited thereto. In fact, the first image may be any image in which object deformation exists.
As an example, features of the first image may be extracted on at least one scale to obtain at least one feature map of the first image, and a target region in the first image may be detected based on the at least one feature map. According to the exemplary embodiments of the present disclosure, features of a first image are directly extracted at various scales without correcting the entire first image and then extracting features on the corrected image, and thus, not only is time-consuming correction of the entire first image avoided, but also the accuracy of the extracted features can be improved.
Preferably, a convolution operation is performed on the first image using a convolutional neural network to obtain a feature map at each scale, wherein the convolutional neural network performs the convolution operation at each of at least one position on the first image using a convolution kernel corresponding to that position. Hereinafter, such a convolutional neural network proposed according to an exemplary embodiment of the present disclosure may be referred to simply as a "deformable convolutional neural network (CNN)". As described above, a first image such as a fisheye image contains object distortion or deformation. A conventional convolutional neural network, whose convolution kernel is always fixed, suffers from sampling distortion on such an image, which makes feature extraction difficult or the extracted features inaccurate. In contrast, the deformable CNN according to the exemplary embodiment of the present disclosure performs the convolution operation with a convolution kernel corresponding to each of at least one position on the first image, avoiding the sampling distortion of a conventional CNN on an image in which object deformation exists. The extracted features are therefore more accurate, which helps to improve the effect of subsequent object processing based on these features. For example, the deformable CNN can accommodate the variation in image resolution between the center and the periphery of a fisheye image, so that features can be extracted more accurately, thereby improving the processing effect on objects in the fisheye image.
Fig. 2 is a schematic diagram showing a process of extracting features of a first image on a plurality of scales using a deformable CNN.
In the example of Fig. 2, it is assumed that the features of the first image are extracted on three scales and that the first image is a fisheye image; however, the number of scales is not limited to three and may be any positive integer, and the first image is not limited to a fisheye image and may be any image in which object deformation exists. As shown in Fig. 2, features of the fisheye image may be extracted at each of the three scales using a deformable CNN. Accordingly, the fisheye image is scaled down according to the scale, e.g., to 1/2, 1/4, and 1/8 of the original image size.
As described above, the deformable CNN performs a convolution operation with a convolution kernel corresponding to each of at least one position on the first image. That is, each of the at least one position on the first image corresponds to its own convolution kernel; the convolution kernel for each position is variable rather than always fixed. A position here may be a pixel point. Thus, according to an exemplary embodiment, when performing the convolution operation, the sampling position of the convolution kernel corresponding to each of the at least one position on the first image is first obtained, and then the convolution operation is performed according to the sampling position of the convolution kernel corresponding to each position, resulting in a feature map at each scale. Here, the sampling position of the convolution kernel is determined from an imaging model of the first image. For example, at each scale, the sampling position of the convolution kernel corresponding to each of the at least one position on the first image may be pre-calculated according to the imaging model of the first image, and the calculated sampling positions may be stored, for example, in a Look-Up Table (LUT). The LUT may be pre-stored and used when performing the convolution operation at each scale to obtain the feature map of that scale. Since, for each scale in the pyramid formed by the plurality of deformable CNNs shown in Fig. 2, the sampling position of the convolution kernel corresponding to each of the at least one position on the first image at that scale is calculated and stored in advance, the above-described sampling position of the convolution kernel corresponding to each position may be, for example, obtained from the pre-stored LUT.
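As a purely illustrative, non-limiting sketch (not the claimed implementation), the following Python/PyTorch code shows how such a precomputed LUT could drive the convolution: the feature map is bilinearly sampled at the per-position kernel sampling coordinates stored in the LUT, and the kernel weights are then applied to the sampled values. The tensor layout of the LUT (H, W, K*K, 2), the assumption that its coordinates are expressed in the coordinates of the map being convolved, and the function name are assumptions introduced for illustration.

```python
import torch
import torch.nn.functional as F

def lut_deformable_conv(feat, weight, lut):
    """feat: (1, C, H, W); weight: (C_out, C, K, K); lut: (H, W, K*K, 2) float (x, y) coords."""
    _, C, H, W = feat.shape
    C_out, _, K, _ = weight.shape
    # Normalize the LUT pixel coordinates to [-1, 1], the range expected by grid_sample.
    grid = lut.clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0   # x
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0   # y
    grid = grid.view(1, H, W * K * K, 2)                # one sampling grid for the whole map
    # Bilinearly sample the feature map at every kernel sampling position.
    samples = F.grid_sample(feat, grid, align_corners=True)   # (1, C, H, W*K*K)
    samples = samples.view(1, C, H, W, K * K)
    # Weighted sum over the K*K sampled neighbours: the deformable convolution output.
    w = weight.view(C_out, C, K * K)
    return torch.einsum('ock,bchwk->bohw', w, samples)         # (1, C_out, H, W)
```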
Fig. 3 is a schematic diagram showing the determination of the sampling position of the convolution kernel of the deformable CNN. The manner in which the sampling locations of the convolution kernel of the deformable CNN are determined is briefly described below with reference to fig. 3.
According to an exemplary embodiment, the sampling position of the convolution kernel may be determined by: determining the sampling position of a convolution kernel function of each position in a three-dimensional space according to the imaging model; a sampling position of the convolution kernel corresponding to each position on the first image is determined from the sampling positions of the convolution kernel in three-dimensional space and the imaging model.
In the example of Fig. 3, it is assumed that the first image is a fisheye image; accordingly, the above-described imaging model is a fisheye imaging model, hereinafter also referred to as a "fisheye camera model". As an example, the fisheye camera model may be a Kannala-Brandt model.
As shown in Fig. 3, specifically, first, each position on the fisheye image may be connected to the optical center (point Oc in Fig. 3) of the fisheye camera model to determine a ray (such as the straight line connecting Oc and pixel point A in Fig. 3); second, it is determined, from the parameters of the fisheye camera model (also referred to as the "intrinsic parameters of the fisheye camera"), from which incident ray passing through the optical center that ray is deflected. For example, in the case where the fisheye camera model is a Kannala-Brandt model, the incident ray may be determined according to the following equation:
θd = θ + k1·θ^3 + k2·θ^5 + k3·θ^7 + k4·θ^9
where θd is the angle between the line connecting the pixel position with the optical center and the optical axis of the fisheye camera model (the straight line along OcZc in Fig. 3), θ is the angle between the incident ray and the optical axis of the fisheye camera, and k1 to k4 are polynomial coefficients.
After the incident ray is determined, the intersection point (point B in Fig. 3) of the incident ray (ray OcP in Fig. 3) with the fisheye camera model may be determined. Finally, the sampling positions of the convolution kernel in three-dimensional space may be selected on a 3D local planar grid that passes through the intersection point and is tangent to the spherical surface of the fisheye camera model. For example, a set of sampling points may be selected on the 3D local planar grid by equidistant uniform sampling; as shown in Fig. 3, 9 sampling points are selected equidistantly around the intersection point, corresponding to the "convolution kernel sampling points in three-dimensional space" in Fig. 3.
After the sampling positions of the convolution kernel for each position are determined in three-dimensional space, the sampling positions of the convolution kernel in three-dimensional space may be mapped onto the fisheye image by ray projection according to the fisheye camera model, to determine the sampling positions of the convolution kernel corresponding to each position on the fisheye image. For example, as shown in Fig. 3, the 9 sampling points in three-dimensional space are each mapped onto the fisheye image plane by ray projection according to the Kannala-Brandt model, so that 9 corresponding sampling points are obtained on the fisheye image, corresponding to the "convolution kernel sampling points on the fisheye image plane" in Fig. 3. These 9 sampling points are the sampling positions of the convolution kernel corresponding to pixel point A when the convolution operation is subsequently performed.
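For illustration only, the following numpy sketch computes the sampling positions for a single pixel along the lines of Fig. 3: the pixel is back-projected to its incident ray by numerically inverting the Kannala-Brandt polynomial above, an equidistant grid is laid on the 3D plane tangent to the unit sphere at that ray, and the grid is projected back onto the fisheye image. The intrinsic parameters (f, cx, cy, k1 to k4), the grid step and the function names are placeholders, and the degenerate case of a pixel at the principal point is not handled.

```python
import numpy as np

def theta_from_theta_d(theta_d, k, iters=10):
    """Numerically invert theta_d = theta + k1*theta^3 + k2*theta^5 + k3*theta^7 + k4*theta^9."""
    theta = float(theta_d)
    for _ in range(iters):                      # Newton's method
        t2 = theta * theta
        f_val = theta * (1 + t2 * (k[0] + t2 * (k[1] + t2 * (k[2] + t2 * k[3])))) - theta_d
        f_der = 1 + t2 * (3 * k[0] + t2 * (5 * k[1] + t2 * (7 * k[2] + t2 * 9 * k[3])))
        theta -= f_val / f_der
    return theta

def kernel_sampling_points(u, v, f, cx, cy, k, kernel=3, step=0.02):
    """Return kernel*kernel sampling positions on the fisheye image for pixel (u, v)."""
    # 1) Back-project the pixel to the incident ray through the optical centre Oc.
    theta_d = np.hypot(u - cx, v - cy) / f
    phi = np.arctan2(v - cy, u - cx)
    theta = theta_from_theta_d(theta_d, k)
    ray = np.array([np.sin(theta) * np.cos(phi),
                    np.sin(theta) * np.sin(phi),
                    np.cos(theta)])             # intersection point B on the unit sphere
    # 2) Lay an equidistant grid on the 3D plane tangent to the sphere at the ray.
    x_axis = np.cross([0.0, 0.0, 1.0], ray); x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(ray, x_axis)
    offs = (np.arange(kernel) - kernel // 2) * step
    pts3d = [ray + dx * x_axis + dy * y_axis for dy in offs for dx in offs]
    # 3) Project each 3D grid point back through the Kannala-Brandt model onto the image.
    uv = []
    for p in pts3d:
        t = np.arccos(p[2] / np.linalg.norm(p))
        td = t + k[0] * t**3 + k[1] * t**5 + k[2] * t**7 + k[3] * t**9
        r = np.hypot(p[0], p[1])
        uv.append((cx + f * td * p[0] / r, cy + f * td * p[1] / r))
    return np.array(uv)                         # (kernel*kernel, 2), to be stored in the LUT
```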
The manner in which the sampling locations of the convolution kernel for each of the at least one location on the first image are determined has been described above in connection with fig. 3. According to an exemplary embodiment, at each scale, a convolution operation may be performed using the deformable CNN according to the sampling position of the convolution kernel corresponding to each position, resulting in a feature map for each scale.
As described above, the above-mentioned at least one feature map may be a plurality of feature maps, in which case, optionally, according to an exemplary embodiment, the above-mentioned detecting the target region in the first image based on the at least one feature map may include: and fusing the feature images of adjacent scales in the feature images, and detecting a target area in the first image based on at least one fused feature image. For example, the feature map may be input into a target region suggestion network to detect a target region in the first image. Here, the target area suggestion network may be a convolutional neural network learned in advance, but is not limited thereto. The target region suggestion network may be pre-learned to be able to detect a target region in the first image for the input feature map.
FIG. 4 is a schematic diagram illustrating multi-scale feature fusion.
Feature fusion across different scales is further performed on the multi-scale features extracted by the pyramid of deformable CNNs. Specifically, as shown in Fig. 4, the low-resolution feature map is upsampled and then fused (e.g., added pixel by pixel) with the feature map of the adjacent, higher-resolution scale, so that the fused feature map contains both the semantic information of the low-resolution feature map and the image detail information of the high-resolution feature map. In this case, detecting the target region in the first image based on the at least one fused feature map allows the target region to be detected more accurately.
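A minimal sketch of this adjacent-scale fusion, assuming the channel dimensions of neighbouring scales already match (e.g., via 1x1 lateral convolutions, which are not shown here), could look as follows in PyTorch:

```python
import torch.nn.functional as F

def fuse_pyramid(feature_maps):
    """feature_maps: list of (1, C, Hi, Wi) tensors, ordered from highest to lowest resolution."""
    fused = [feature_maps[-1]]                              # the coarsest map is used as-is
    for fmap in reversed(feature_maps[:-1]):
        up = F.interpolate(fused[0], size=fmap.shape[-2:], mode='nearest')  # upsample coarser map
        fused.insert(0, fmap + up)                          # pixel-wise addition with the finer map
    return fused                                            # same ordering as the input list
```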
After feature fusion, low-resolution features may be used, for example, for object region proposal, localization, and classification, to save computational cost. High-resolution features may be used, for example, to ensure accuracy when estimating fine-grained object attributes (e.g., key points, the object mask map, or the 6DoF (six degrees of freedom) pose).
As an example, only the relatively low resolution of the fused feature maps (e.g., feature maps 1 and 2 in fig. 4) may be used to detect the target region in the first image, which may further save computational costs. Whereas a relatively high resolution feature map of the fused feature map (e.g., feature map 3 of fig. 4) may be used to subsequently correct the detected target region (hereinafter also referred to as "reverse deformation target region pooling", which may also be referred to as "reverse deformation ROI pooling" if the target region is an ROI), for object pose estimation or the like (e.g., feature extraction and object keypoint prediction, object mask map or object pose estimation, etc.), as will be described below.
After the target region in the first image is detected through step S110, the detected target region may be corrected in step S120. Specifically, first, a first feature region corresponding to the detected target region may be determined in the feature map of the first image as the first target region feature map. The first target region feature map may then be spatially transformed to generate a transformed first target region feature map. The correction of the target region is achieved by a spatial transformation of the first target region feature map. As an example, as described above, if a plurality of feature maps are obtained on a plurality of scales and the feature maps are fused, a first feature region corresponding to a detected target region may be determined in a relatively high-resolution feature map (e.g., a highest-resolution feature map of the fused feature maps, i.e., a feature map of a fused maximum scale) among the fused feature maps. As in fig. 4, a first feature region corresponding to the detected target region may be determined in the feature map 3 as a first target region feature map. If a plurality of target areas are detected, a first target area feature map corresponding to each target area is determined for each target area. Then, each first target region feature map may be spatially transformed to achieve correction for each target region.
According to an exemplary embodiment, a virtual camera corresponding to a target region may be created according to an imaging model of a first image and the detected target region, and the first target region feature map may be spatially transformed by using the virtual camera to generate a transformed first target region feature map. In the present disclosure, a virtual camera corresponding to each detected target area is created for each detected target area, instead of employing the same virtual camera for the entire image or all target areas, whereby stretching of the shape of the object at the time of correction can be avoided. For example, this design avoids shape stretching that typically occurs at the edges of the field of view of a fisheye lens. Furthermore, after pooling of the inverse transformed target region, the first target region feature map is transformed into the same geometry as a conventional camera, which is also more advantageous for subsequent training or prediction of subsequent object processing models using such feature maps.
According to an exemplary embodiment, the light rays corresponding to the optical axis of the virtual camera, which are refracted by the imaging model, pass through the center of the detected target area. Furthermore, the optical axis of the virtual camera may be directed towards the optical center of the imaging model.
Fig. 5 shows a schematic diagram of reverse deformation target region pooling. Next, with reference to fig. 5, the reverse deformation target region pooling will be described.
In the example of Fig. 5, it is still assumed that the first image is a fisheye image and, accordingly, that the imaging model of the first image is a fisheye camera model. As shown in Fig. 5, the optical axis of the virtual camera created for the target area may be the straight line determined by connecting point Oc with point F, and the light ray corresponding to this straight line passes through the pixel center of the target area (point E in Fig. 5) after being refracted by the fisheye camera model (the sphere in Fig. 5). Furthermore, the image plane of the target-area virtual camera is tangent to the sphere of the fisheye camera model, with its image y-axis in the plane defined by Zc-Oc-P.
Specifically, when creating the virtual camera, the pixel center point E of the target area may first be connected to the optical center Oc of the fisheye camera model to determine a straight line; then, according to the parameters of the fisheye camera model, it is determined which incident ray passing through the optical center is deflected to produce the ray corresponding to that straight line. For example, in the case where the fisheye camera model is a Kannala-Brandt model, the incident ray may be determined using the equation given above in the description of Fig. 3, which is not repeated here. The straight line corresponding to the incident ray is the optical axis of the virtual camera. After the optical axis is determined, a plane perpendicular to the optical axis may be determined as the plane of the virtual camera; for example, the plane of the virtual camera may be a plane tangent to the spherical surface of the fisheye camera model, but is not limited thereto. As shown in Fig. 5, the focal length f of the virtual camera is the distance between the optical center Oc and the center F of the virtual camera plane. It should be noted that the focal length f of the virtual camera may be dynamically calculated according to the size of the target area, so as to ensure that the reverse-deformed target area feature map has a fixed size in image height H and image width W.
After creating its corresponding virtual camera for each target region, the created virtual camera may be utilized to spatially transform the first target region feature map for each target region to generate a respective transformed first target region feature map. Specifically, each feature point in the first target area feature map may be mapped onto a corresponding virtual camera plane according to the imaging model to obtain a transformed first target area feature map. For example, each feature point is connected with the optical center of the imaging model to determine a light ray, and the intersection point of the incident light ray corresponding to the light ray and the virtual camera plane is determined according to the parameters of the imaging model, so that a transformed first target area feature map is obtained according to the intersection points.
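Purely as an illustrative sketch of this spatial transformation (not the prescribed implementation), the following PyTorch code builds, for one target region, the sampling grid that maps each pixel of the fixed H x W virtual-camera plane back through the fisheye imaging model onto the feature map, and resamples the feature map bilinearly. The rotation R_v aligning the virtual camera's optical axis with the region center, the dynamically chosen focal length f_virtual, and the helper project_fisheye() (the Kannala-Brandt projection, as in the earlier sketch) are assumed inputs; the image-to-feature-map stride is taken as 1 for brevity.

```python
import torch
import torch.nn.functional as F

def inverse_deformation_roi_pool(feat, R_v, f_virtual, K_fish, dist, out_hw=(56, 56)):
    """feat: (1, C, Hf, Wf) feature map of the full fisheye image (stride 1 assumed)."""
    H, W = out_hw
    _, _, Hf, Wf = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    # Rays of the virtual pinhole camera, principal point at the centre of its H x W plane.
    rays = torch.stack([(xs - W / 2.0) / f_virtual,
                        (ys - H / 2.0) / f_virtual,
                        torch.ones((H, W))], dim=-1)
    rays = rays @ torch.as_tensor(R_v, dtype=torch.float32).T      # into the fisheye camera frame
    # Project every ray through the fisheye imaging model to get feature-map coordinates.
    uv = project_fisheye(rays.reshape(-1, 3).numpy(), K_fish, dist)  # (H*W, 2), assumed helper
    grid = torch.as_tensor(uv, dtype=torch.float32).view(1, H, W, 2)
    grid[..., 0] = 2 * grid[..., 0] / (Wf - 1) - 1                  # normalize for grid_sample
    grid[..., 1] = 2 * grid[..., 1] / (Hf - 1) - 1
    return F.grid_sample(feat, grid, align_corners=True)            # (1, C, H, W) corrected region
```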
As described above, the correction of the target region is achieved by the transformation of the first target region feature map.
Finally, in step S130, the object corresponding to the target area is processed based on the corrected target area. Specifically, the object corresponding to the target region may be processed based on the transformed first target region feature map. For example, first attribute information of the object corresponding to the target area may be obtained based on the transformed first target area feature map, and the object corresponding to the target area may be processed according to the first attribute information. As an example, the first attribute information of the object corresponding to the target region may be obtained using at least one convolutional neural network based on the transformed first target region feature map. For example, the first attribute information may include at least one of category information, mask information, key point information, and pose information of the object, but is not limited thereto. Accordingly, different processing may be performed on the object corresponding to the target area depending on the attribute information. For example, at least one of object recognition, object segmentation, and object pose estimation may be performed on the object. It should be noted that, although the pose information may be obtained using at least one convolutional neural network based on the transformed first target region feature map, the pose information of the object may also be determined after the key point information of the object is obtained, for example, using a Perspective-n-Point (PnP) algorithm.
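As a minimal example of the PnP step mentioned above (assuming OpenCV's solver; the present disclosure does not prescribe a particular one), the initial pose could be recovered from the predicted 2D key points as follows, where model_points are the known 3D key points in the object coordinate system, image_points are the 2D key points predicted on the corrected (virtual-camera) target region, and K_virtual is its pinhole intrinsic matrix, all placeholders:

```python
import cv2
import numpy as np

def estimate_initial_pose(model_points, image_points, K_virtual):
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(model_points, dtype=np.float64),
        np.asarray(image_points, dtype=np.float64),
        np.asarray(K_virtual, dtype=np.float64),
        distCoeffs=None,                     # the virtual camera is distortion-free
        flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)               # rotation vector -> rotation matrix
    return ok, R, tvec                       # pose in the virtual-camera coordinate system
```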
As described above, object deformation exists in the first image, and the result of performing the above object processing using only the first image may still not be accurate enough. Therefore, according to an exemplary embodiment of the present disclosure, the above image processing method may further include: acquiring a second image associated with the first image; and obtaining second attribute information of the object based on the second image. In this case, the processing of the object corresponding to the target area according to the first attribute information may include: processing the object corresponding to the target area according to the first attribute information and the second attribute information. In this way, the processing effect on the object can be further improved. The second image may also be an image in which deformation exists. As described above, the first image may be one of a left-eye image and a right-eye image, in which case the second image may be the other of the left-eye image and the right-eye image. In this way, the object can be processed more accurately based on both the left-eye image and the right-eye image; for example, the pose of the object can be estimated more accurately.
Fig. 6 shows a schematic diagram of estimating an object pose according to an exemplary embodiment of the present disclosure.
Specifically, according to an exemplary embodiment of the present disclosure, in order to perform pose estimation of an object more accurately, the first attribute information may include first key point information of the object (the "object two-dimensional key points (left view)" in the figure) and initial pose information (the "initial object pose" in the figure), and the second attribute information may include second key point information of the object (the "object two-dimensional key points (right view)" in the figure). Accordingly, processing the object corresponding to the target region according to the first attribute information and the second attribute information may include: estimating final pose information of the object based on the initial pose information, the first key point information, and the second key point information (the "stereoscopic 6DoF pose optimization" in the figure). Specifically, the second key point information of the object may be obtained by: determining a target area corresponding to the object on the second image based on the initial pose information and parameters of the first camera generating the first image and the second camera generating the second image; and obtaining the second key point information of the object based on the target area corresponding to the object on the second image. For example, as shown in Fig. 6, in the case where the first image and the second image are the left-eye image and the right-eye image, respectively, after the two-dimensional key points and the initial object pose are obtained by performing image feature extraction and object attribute prediction on the left-eye image, the target region corresponding to the object on the right-eye image may be determined from the initial object pose and the stereo camera parameters ("object region projection" in Fig. 6). Then, two-dimensional key point information of the object is obtained based on the corresponding target area on the right fisheye image. Finally, final pose information of the object is estimated (i.e., the initial pose information is optimized) based on the initial object pose, the two-dimensional key points obtained from the left fisheye image, and the two-dimensional key points obtained from the right fisheye image; for example, the 6DoF pose of the object is estimated.
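The disclosure does not specify a particular optimizer for the stereoscopic 6DoF pose optimization; the following is a hypothetical sketch that refines the initial pose by minimizing the reprojection error of the model key points against the 2D key points detected in the left and right views, using an axis-angle parameterization and scipy's least_squares. The project() helper (projection with the given intrinsics) and the stereo extrinsics (R_lr, T_lr) are assumed inputs.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def refine_pose_stereo(model_pts, kp_left, kp_right, K_left, K_right, R_lr, T_lr, R0, t0):
    """Refine the initial pose (R0, t0) by minimizing reprojection error in both views."""
    def residuals(x):
        R, _ = cv2.Rodrigues(x[:3].reshape(3, 1))
        t = x[3:].reshape(3, 1)
        pts_l = R @ model_pts.T + t                       # object -> left camera
        pts_r = R_lr @ pts_l + T_lr.reshape(3, 1)         # left camera -> right camera
        err_l = project(pts_l.T, K_left) - kp_left        # project(): assumed projection helper
        err_r = project(pts_r.T, K_right) - kp_right
        return np.concatenate([err_l.ravel(), err_r.ravel()])

    r0, _ = cv2.Rodrigues(np.asarray(R0, dtype=np.float64))
    x0 = np.concatenate([r0.ravel(), np.asarray(t0, dtype=np.float64).ravel()])
    sol = least_squares(residuals, x0)
    R_opt, _ = cv2.Rodrigues(sol.x[:3].reshape(3, 1))
    return R_opt, sol.x[3:]                               # refined 6DoF pose
```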
It should be noted that, for the object pose estimation task, the goal is to estimate the rotation and translation from the object coordinate system to the camera coordinate system, and thus pose information is closely related to the choice of camera coordinate system. When the object pose is estimated using the transformed first target region feature map generated with the virtual camera created for the target region as described above, the obtained initial pose information is still in the virtual camera coordinate system, and it must therefore be converted back to the real camera coordinate system (e.g., the fisheye camera coordinate system) before the estimated pose information is output. Likewise, if the initial pose information is to be used later in combination with the second key point information for more accurate pose estimation, it also needs to be converted into the real camera coordinate system.
Accordingly, the above-described determination of the target area on the second image corresponding to the object based on the initial pose information and parameters of the first camera that generates the first image and the second camera that generates the second image may include: determining initial pose information of the object under a coordinate system of the first camera based on the initial pose information and parameters of the first camera; determining initial pose information of the object in a coordinate system of a second camera based on the initial pose information of the object in the coordinate system of the first camera and parameters of the second camera; and determining a target area corresponding to the object on the second image according to the initial posture information of the object under the coordinate system of the second camera. That is, the initial pose information in the virtual camera coordinate system is converted into the initial pose information in the real camera coordinate system, and the target area corresponding to the object is determined on the second image using the initial pose information in the real camera coordinate system.
For example, in the case where the first image and the second image are the left-eye image and the right-eye image, respectively, the initial pose of the object estimated based on the left-eye image is projected onto the right-eye image to determine the corresponding candidate object region in the right-eye image. Specifically, assume that the parameter matrix of the target area virtual camera is Kv and that the camera coordinate system of the virtual camera is Ov-XvYvZv. Let Kc be the intrinsic matrix of the camera associated with the perspective-corrected fisheye image, and let Oc-XcYcZc be the camera coordinate system of the left-eye camera. The pose information estimated in the coordinate system of the target area virtual camera may be expressed as a rotation matrix Rv and a translation vector Tv, which may be converted into a rotation matrix Rc and a translation vector Tc in the Oc-XcYcZc coordinate system (where, in the following equations, inv() denotes matrix inversion):
Rc=inv(Kc)*Kv*Rv
Tc=inv(Kc)*Kv*Tv
Then, from the calibrated extrinsic parameters of the left and right fisheye cameras, the rotation and translation between the left fisheye camera coordinate system and the right fisheye camera coordinate system are known, so the object expressed in the left fisheye camera coordinate system can be rotated and translated into the right fisheye camera coordinate system. Using the intrinsic parameters of the right-eye camera (the fisheye image imaging model parameters), the object can then be projected onto the image plane of the right-eye image, thereby determining the target region corresponding to the object on the right-eye image.
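For illustration only, the conversion and projection described above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the implementation of the present disclosure: the helper project_fisheye stands in for the right camera's fisheye imaging model and is assumed to be provided by the caller, and R_lr, T_lr denote the calibrated left-to-right extrinsic transformation.

```python
import numpy as np

def virtual_to_fisheye_pose(Rv, Tv, Kv, Kc):
    # Convert a pose estimated in the target-area virtual camera frame (Rv, Tv)
    # into the left camera frame, following Rc = inv(Kc)*Kv*Rv and
    # Tc = inv(Kc)*Kv*Tv given above.
    Kc_inv = np.linalg.inv(Kc)
    return Kc_inv @ Kv @ Rv, Kc_inv @ Kv @ Tv

def project_object_to_right_image(Rc, Tc, R_lr, T_lr, model_points, project_fisheye):
    # R_lr, T_lr: calibrated extrinsics mapping the left fisheye camera frame
    #             to the right fisheye camera frame.
    # model_points: (N, 3) keypoints/vertices defined in the object frame.
    # project_fisheye: assumed callable implementing the right camera's fisheye
    #                  imaging model (3D points in the right camera frame -> 2D pixels).
    pts_left = (Rc @ model_points.T).T + Tc.reshape(1, 3)    # object -> left camera frame
    pts_right = (R_lr @ pts_left.T).T + T_lr.reshape(1, 3)   # left -> right camera frame
    uv = project_fisheye(pts_right)                          # (N, 2) pixels on the right image
    x_min, y_min = uv.min(axis=0)                            # bounding box of the projection
    x_max, y_max = uv.max(axis=0)                            # gives the candidate target region
    return x_min, y_min, x_max, y_max
```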
As described above, after the target area corresponding to the object on the second image is determined, the second keypoint information of the object may be obtained based on that target area. For example, the target region corresponding to the object on the second image may first be corrected, and the second keypoint information of the object may then be obtained based on the corrected target region. Optionally, the corresponding target region on the second image may be corrected in the same way as the target region in the first image (i.e., by the above-described inverse deformation target region pooling). Specifically, a feature map of the second image may be acquired first, and then a second feature region corresponding to the target region on the second image may be determined on the feature map of the second image as a second target region feature map. The second target region feature map is then spatially transformed to generate a transformed second target region feature map, and finally the second keypoint information of the object is obtained based on the transformed second target region feature map.
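As an illustrative sketch of this inverse deformation target region pooling (under assumptions, not the implementation of the present disclosure): each pixel of the rectified output patch is back-projected through the virtual camera created for the target region, rotated into the real camera frame, and re-projected through the fisheye imaging model to find where the distorted feature map should be sampled. Here project_fisheye is an assumed callable that returns coordinates at feature-map resolution, and K_virtual, R_virtual are the assumed intrinsics and rotation of the virtual camera.

```python
import torch
import torch.nn.functional as F

def undeform_roi_pool(feat, K_virtual, R_virtual, project_fisheye, out_hw=(64, 64)):
    # feat: (1, C, H, W) feature map of the second (fisheye) image.
    # K_virtual: 3x3 intrinsics of the virtual camera created for the target region.
    # R_virtual: 3x3 rotation from the virtual camera frame to the real camera frame
    #            (assumed given, as described above).
    # project_fisheye: assumed callable mapping 3D rays in the real camera frame to
    #                  (x, y) coordinates at feature-map resolution.
    H_out, W_out = out_hw
    _, _, H, W = feat.shape

    # Pixel grid of the rectified (virtual-camera) output patch, homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(H_out), torch.arange(W_out), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float().reshape(-1, 3)

    # Back-project virtual pixels to rays and rotate them into the real camera frame.
    Kv = torch.as_tensor(K_virtual, dtype=torch.float32)
    Rv = torch.as_tensor(R_virtual, dtype=torch.float32)
    rays = (Rv @ (torch.linalg.inv(Kv) @ pix.T)).T                 # (H_out*W_out, 3)

    # Forward-project the rays through the fisheye model to locate each virtual
    # pixel on the distorted feature map, then normalize for grid_sample.
    uv = project_fisheye(rays)                                     # (H_out*W_out, 2)
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.reshape(1, H_out, W_out, 2)

    # Bilinear sampling yields the transformed (rectified) target-region feature map.
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```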
For example, the feature map of the second image may be acquired in the same manner as the feature map of the first image, i.e., also using the deformable CNN, and this is not repeated here. Further, optionally, in order to reduce the amount of computation while still ensuring accurate extraction of the two-dimensional keypoint information, the feature map of the second image may be only a high-resolution feature map of the second image. For example, as shown in fig. 6, only high-resolution features may be extracted from the right-eye image to obtain a high-resolution feature map, thereby reducing the computation cost while ensuring accurate extraction of the two-dimensional keypoint features.
The method for generating the transformed second target region feature map by spatially transforming the second target region feature map may be the same as the method for generating the transformed first target region feature map by spatially transforming the first target region feature map, and will not be described again.
Also, after the transformed second target region feature map is obtained, as shown in fig. 6, second keypoint information of the object may be obtained based on the transformed second target region feature map using at least one convolutional neural network.
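A minimal sketch of such a keypoint network is given below; the channel widths, the number of keypoints, and the soft-argmax decoding are illustrative assumptions rather than the specific network of the present disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointHead(nn.Module):
    # Predicts one heatmap per keypoint from a transformed target-region feature
    # map and decodes 2D coordinates with a soft-argmax (illustrative choices).
    def __init__(self, in_channels=256, num_keypoints=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_keypoints, 1),
        )

    def forward(self, roi_feat):                          # roi_feat: (B, C, H, W)
        heatmaps = self.conv(roi_feat)                    # (B, K, H, W)
        B, K, H, W = heatmaps.shape
        probs = F.softmax(heatmaps.flatten(2), dim=-1).view(B, K, H, W)
        xs = torch.arange(W, device=roi_feat.device, dtype=torch.float32)
        ys = torch.arange(H, device=roi_feat.device, dtype=torch.float32)
        x = (probs.sum(dim=2) * xs).sum(dim=-1)           # expected x per keypoint, (B, K)
        y = (probs.sum(dim=3) * ys).sum(dim=-1)           # expected y per keypoint, (B, K)
        return torch.stack([x, y], dim=-1)                # (B, K, 2) keypoint coordinates
```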
After the second keypoint information is obtained, final pose information of the object may be estimated based on the initial pose information, the first keypoint information, and the second keypoint information. Specifically, the final pose information of the object may be determined by minimizing the sum of the two-dimensional keypoint re-projection errors of the object in the two images (the first image and the second image), i.e., by optimizing the initial pose information. The minimization may use a nonlinear optimization algorithm, such as the Levenberg-Marquardt (LM) algorithm. Specifically, in the case where the first image and the second image are the left-eye image and the right-eye image, estimating the final pose may be expressed as the following formula:
(R, T) = argmin over (R, T) of Σ_{i=1..N} [ || p_i^(l) − π(K^(l), R^(l)*P_i + T^(l)) ||^2 + || p_i^(r) − π(K^(r), R^(r)*P_i + T^(r)) ||^2 ]

wherein π(K, X) denotes projecting a three-dimensional point X onto the corresponding image plane using the camera parameter matrix K. Assuming that the rotation matrix R and the translation vector T representing the initial pose information are defined in the target area virtual camera coordinate system of the left fisheye camera, and that the transformation from the target area virtual camera coordinate system of the left fisheye camera to the left fisheye camera coordinate system is [R_vl-fl, T_vl-fl], the rotation matrix R^(l) and the translation vector T^(l) representing the initial pose information in the left fisheye camera coordinate system are:

R^(l) = R_vl-fl * R
T^(l) = R_vl-fl * T + T_vl-fl

Similarly, the transformation from the target area virtual camera coordinate system of the right fisheye camera to the right fisheye camera coordinate system is [R_vr-fr, T_vr-fr], and the rotation matrix R^(r) and the translation vector T^(r) representing the initial pose information in the right fisheye camera coordinate system are:

R^(r) = R_lr * R^(l)
T^(r) = R_lr * T^(l) + T_lr

where [R_lr, T_lr] is the rotation and translation from the left fisheye camera coordinate system to the right fisheye camera coordinate system obtained from the calibrated extrinsic parameters; that is, [R^(r), T^(r)] is the expression of the initial pose information in the right fisheye coordinate system obtained based on the rotation and translation parameters between the left and right fisheye cameras. P_i (i = 1, …, N) are the key points defined on the three-dimensional model of the object, where N is the number of key points on the object; p_i^(l) is the position of the i-th key point extracted in the target area virtual camera of the left fisheye image, and p_i^(r) is the position of the i-th key point extracted on the right fisheye image. In addition, K^(l) and K^(r) are the parameter matrices of the left fisheye camera and the right fisheye camera, respectively.
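A minimal sketch of this stereo refinement with an off-the-shelf Levenberg-Marquardt solver is given below. It uses a plain pinhole projection as a stand-in for the fisheye/virtual-camera projections described above, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def pinhole_project(K, pts):
    # pts: (N, 3) points in the camera frame -> (N, 2) pixel coordinates.
    uv = (K @ pts.T).T
    return uv[:, :2] / uv[:, 2:3]

def refine_pose_stereo(R0, T0, P, p_left, p_right, K_left, K_right, R_lr, T_lr):
    # R0, T0: initial pose (object -> left camera frame), e.g. converted from the
    #         virtual-camera estimate as described above.
    # P: (N, 3) keypoints on the object model; p_left, p_right: (N, 2) measured
    #    2D keypoints; K_left, K_right: camera matrices used for the two views;
    # R_lr, T_lr: extrinsics mapping the left camera frame to the right camera frame.
    def residuals(x):
        R = Rotation.from_rotvec(x[:3]).as_matrix()
        T = x[3:6]
        pts_l = (R @ P.T).T + T                     # object -> left camera frame
        pts_r = (R_lr @ pts_l.T).T + T_lr           # left -> right camera frame
        r_l = (pinhole_project(K_left, pts_l) - p_left).ravel()
        r_r = (pinhole_project(K_right, pts_r) - p_right).ravel()
        return np.concatenate([r_l, r_r])           # stacked re-projection errors

    x0 = np.concatenate([Rotation.from_matrix(R0).as_rotvec(), np.asarray(T0).ravel()])
    sol = least_squares(residuals, x0, method="lm") # Levenberg-Marquardt
    return Rotation.from_rotvec(sol.x[:3]).as_matrix(), sol.x[3:6]
```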
A specific image processing method according to an exemplary embodiment of the present disclosure, namely estimating the pose of an object, has been described above in connection with fig. 6; with the object pose estimation approach shown in fig. 6, the pose of an object can be estimated more accurately.
In the above, the image processing method according to the exemplary embodiments of the present disclosure and examples thereof have been described with reference to figs. 1 to 6; according to the method, both the image processing speed and the object processing effect can be improved.
In order to more clearly understand the above-described image processing method, the above-described image processing method is briefly described below with reference to the example of fig. 7.
Fig. 7 is a schematic diagram illustrating an example of an image processing method according to an exemplary embodiment of the present disclosure.
Fig. 8 is a detailed view of the example shown in fig. 7. In the example of fig. 7, the first image is a left fisheye image and the second image is a right fisheye image. A plurality of feature maps can be obtained from the left fisheye image using the pyramid constituted by the deformable CNN described above (the deformable pyramid network in the corresponding figure). For example, as shown in fig. 8, for the left fisheye image, three feature maps may be extracted based on the above-described deformable CNN and feature map fusion may be performed. The target region in the left fisheye image may then be detected using a target region suggestion network (e.g., a convolutional neural network) based on the two low-resolution feature maps. Subsequently, the detected target region may be corrected (referred to as "reverse deformation target region pooling" in fig. 7): a target region feature map corresponding to the detected target region is first determined from the fused high-resolution feature map, and the determined target region feature map is then spatially transformed, for example according to the fisheye intrinsic parameters (i.e., the fisheye image imaging model parameters), to generate a transformed target region feature map. Next, attribute information of the object, e.g., the object category, the object mask map, the two-dimensional keypoints (left view), and the initial pose, may be obtained using at least one CNN.
In order to make the estimated object pose more accurate, in the example of fig. 7, keypoint information (the "two-dimensional keypoints (right view)" in fig. 7) may be obtained based on the right fisheye image, and the 6DoF pose estimation is then performed in combination with the initial pose and the two-dimensional keypoints (left view) obtained from the left fisheye image. Specifically, as shown in fig. 7, a feature map of the right fisheye image may likewise be obtained using the deformable CNN; for example, as shown in fig. 8, a high-resolution feature map of the right fisheye image may first be obtained using the deformable CNN, and then object region projection may be performed, that is, the target region corresponding to the object on the right fisheye image is determined based on the initial pose information and the intrinsic and extrinsic parameters of the stereo fisheye cameras. The target region may then be corrected using inverse deformation target region pooling; specifically, a target region feature map corresponding to the target region is determined in the high-resolution feature map of the right fisheye image and spatially transformed to generate a transformed target region feature map. The transformed target region feature map may be input into at least one CNN, ultimately yielding the two-dimensional keypoints (right view).
Finally, the object pose is optimized based on the initial pose, the two-dimensional keypoints (left view) obtained from the left fisheye image, and the two-dimensional keypoints (right view) obtained from the right fisheye image.
In the examples of figs. 7 and 8, a pyramid is constructed for the left fisheye image, in which a low-resolution feature map is used for region-of-interest suggestion (i.e., target region prediction) and a high-resolution feature map is used to accurately extract the two-dimensional keypoints. For the right fisheye image, only the high-resolution features are computed, since the object region has already been predicted from the left fisheye image. In this way, the amount of computation is effectively reduced, while the pose estimation becomes more accurate because the two-dimensional keypoint information obtained from both fisheye images is further combined. Keypoint features are sparse and efficient for determining the 6DoF pose of the object; therefore, in the example of fig. 7, sparse keypoints are extracted from the left and right fisheye images, and the pose estimate is optimized by minimizing the keypoint re-projection error.
According to the above example, three-dimensional object segmentation and pose estimation can be performed quickly using stereoscopic fisheye images. The technique can be used for augmented reality and other task scenarios in which the pose of a three-dimensional object in the environment needs to be known and interacted with. For example, augmented reality technology provides the user with an enriched information experience by adding virtual content to the real scene in front of the user. In three-dimensional space, an augmented reality system needs to process and understand the three-dimensional state of surrounding objects accurately and in real time in order to achieve the high-quality virtual-real fusion effect presented to the user. Likewise, in scenarios such as automatic driving, it is also necessary to segment objects such as vehicles in the environment and estimate their poses.
Fig. 9 illustrates a scenario to which an image processing method of an exemplary embodiment of the present disclosure is applied. As shown in fig. 9, in a case where a user wears a stereoscopic fisheye camera, the three-dimensional pose of a real object (e.g., a table) in the stereoscopic fisheye images may be estimated based on the stereoscopic fisheye images (left and right fisheye images) according to the image processing method of the exemplary embodiment of the present disclosure. After the three-dimensional pose of the object is estimated, virtual three-dimensional graphic content (for example, an engine) can be superimposed on the surface of the real object according to the three-dimensional pose of the real object, thereby improving the augmented reality experience of the user.
It should be noted that, although in describing the image processing method shown in fig. 1 the feature map of the first image is obtained by extracting features of the first image using the deformable CNN, the target region in the first image is detected based on the obtained feature map, and object processing is performed based on the corrected target region after the target region is corrected, object processing may in fact be performed directly based on the feature map once the feature map of an image in which object deformation exists has been extracted using the deformable CNN. That is, the deformable CNN described above can be used alone for object processing. This avoids the time-consuming correction of the entire deformed image required in the prior art, and also avoids the sampling distortion that occurs when a conventional CNN extracts features from an image in which object deformation exists, so that the features of the image can be extracted more accurately and the subsequent object processing effect can be improved.
Thus, according to another exemplary embodiment of the present disclosure, an image processing method shown in fig. 10 may also be provided. Fig. 10 is a flowchart illustrating an image processing method according to another exemplary embodiment of the present disclosure.
Referring to fig. 10, in step S1010, a convolution operation is performed on a first image using a convolution neural network to acquire a feature map of the first image, wherein the convolution neural network performs a convolution operation with a convolution kernel corresponding to each of at least one location on the first image. Here, for example, the first image may be an image in which there is deformation of the object. Specifically, at step S1010, a sampling position of a convolution kernel corresponding to each of at least one position on the first image may be first obtained, where the sampling position of the convolution kernel is determined according to an imaging model of the first image. Then, a convolution operation may be performed according to the sampling position of the convolution kernel corresponding to each position, resulting in the feature map. For example, the sampling position of the convolution kernel may be determined by: determining the sampling position of a convolution kernel function of each position in a three-dimensional space according to the imaging model; a sampling position of the convolution kernel corresponding to each position on the first image is determined from the sampling positions of the convolution kernel in three-dimensional space and the imaging model. Since the above operations performed by the deformable CNN have been described in detail in the above description, they are not described in detail herein.
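As an illustrative sketch (not the implementation of the present disclosure), a convolution whose sampling positions are supplied per location, for example precomputed from the imaging model of the first image as described for step S1010, can be expressed with torchvision's deform_conv2d; how sampling_positions is obtained from the imaging model is assumed here.

```python
import torch
from torchvision.ops import deform_conv2d

def model_aware_conv(x, weight, sampling_positions):
    # x: (B, C_in, H, W) input image or feature map.
    # weight: (C_out, C_in, 3, 3) convolution kernel.
    # sampling_positions: (B, 9, H, W, 2) per-pixel (dy, dx) sampling locations of the
    #     nine kernel taps relative to the centre pixel, assumed to be precomputed
    #     from the imaging model of the first image (step S1010).
    B, _, H, W = x.shape
    kh, kw = weight.shape[-2:]

    # Regular 3x3 grid that a standard convolution would sample (dy, dx per tap).
    dy, dx = torch.meshgrid(torch.arange(kh) - kh // 2,
                            torch.arange(kw) - kw // 2, indexing="ij")
    base = torch.stack([dy, dx], dim=-1).reshape(-1, 2).float()       # (kh*kw, 2)

    # deform_conv2d expects offsets relative to the regular grid, stored as
    # (dy, dx) pairs per kernel tap along the channel dimension: (B, 2*kh*kw, H, W).
    offsets = sampling_positions - base.view(1, kh * kw, 1, 1, 2)
    offsets = offsets.permute(0, 1, 4, 2, 3).reshape(B, 2 * kh * kw, H, W)

    return deform_conv2d(x, offsets, weight, padding=(kh // 2, kw // 2))
```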
Finally, in step S1020, the object in the first image is processed based on the feature map. As described hereinabove, after the feature map is obtained, it is also possible to detect a target region in the first image based on the feature map, correct the detected target region, and process an object corresponding to the target region based on the corrected target region. Details concerning these operations are already mentioned above in the description of the image processing method shown in fig. 1, and are not repeated here.
According to the image processing method shown in fig. 10, not only can the time-consuming correction of the whole image be avoided, but also the sampling distortion that occurs when a conventional CNN extracts features from the image can be avoided, so that the features of the image can be extracted more accurately and the subsequent object processing effect can be further improved.
Fig. 11 is a block diagram illustrating an image processing apparatus according to an exemplary embodiment of the present disclosure.
Referring to fig. 11, the image processing apparatus 1100 may include a detection unit 1101, a correction unit 1102, and an image processing unit 1103. Specifically, the detection unit 1101 may be configured to acquire a feature map of the first image, and detect a target area in the first image based on the feature map. The correction unit 1102 may correct the detected target region. The image processing unit 1103 may process the object corresponding to the target area based on the corrected target area.
Since the image processing method shown in fig. 1 can be performed by the image processing apparatus 1100 shown in fig. 11, and the detection unit 1101, the correction unit 1102, and the image processing unit 1103 perform steps S110, S120, and S130, respectively, any relevant details concerning the operations performed by the units in fig. 11 can be referred to in the description related to fig. 1, and will not be repeated here.
Fig. 12 is a block diagram of an image processing apparatus according to another exemplary embodiment of the present disclosure.
Referring to fig. 12, an image processing apparatus 1200 may include an acquisition unit 1201 and an image processing unit 1202. Specifically, the acquiring unit 1201 may acquire the feature map of the first image by performing a convolution operation on the first image using a convolution neural network that performs a convolution operation with a convolution kernel corresponding to each of at least one position on the first image. The image processing unit 1202 processes an object in the first image based on the feature map.
Since the image processing method shown in fig. 10 can be performed by the image processing apparatus 1200 shown in fig. 12, and the acquisition unit 1201 and the image processing unit 1202 perform steps S1010 and S1020, respectively, any relevant details concerning the operations performed by the units in fig. 12 can be found in the description relating to fig. 10, and will not be repeated here.
Further, it should be noted that, although the image processing apparatus 1100 and the image processing apparatus 1200 are described above as being divided into units for performing the respective processes, it is apparent to those skilled in the art that the processes performed by the respective units described above may be performed without any specific division of units or without explicit demarcation between the units by the image processing apparatuses 1100 and 1200. In addition, the image processing apparatus 1100 and the image processing apparatus 1200 may further include other units, for example, a storage unit and the like.
Fig. 13 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Referring to fig. 13, an electronic device 1300 may include at least one memory 1301 storing computer-executable instructions and at least one processor 1302, wherein the computer-executable instructions, when executed by the at least one processor 1302, cause the at least one processor 1302 to perform an image processing method according to an embodiment of the present disclosure. The image processing method described above may be performed using an artificial intelligence model.
At least one of the above modules may be implemented by an AI model. The functions associated with the AI may be performed by a non-volatile memory, a volatile memory, and a processor.
The processor may include one or more processors. The one or more processors may be general-purpose processors such as a Central Processing Unit (CPU) or an Application Processor (AP), graphics-only processors such as a Graphics Processing Unit (GPU) or a Vision Processing Unit (VPU), and/or AI-dedicated processors such as a Neural Processing Unit (NPU).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operational rules or artificial intelligence models may be provided through training or learning. Here, providing by learning means that a predefined operation rule or AI model having a desired characteristic is formed by applying a learning algorithm to a plurality of learning data. Learning may be performed in the device itself performing AI according to an embodiment and/or may be implemented by a separate server/device/system.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data so that the target device can make, be allowed to make, or be controlled to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
According to the present invention, in the image processing method performed by the electronic device, the output image after processing the target area can be obtained by taking the input image as the input data of the artificial intelligence model.
The artificial intelligence model may be obtained through training. Herein, "obtaining by training" refers to training a basic artificial intelligence model having a plurality of training data by a training algorithm to obtain predefined operational rules or artificial intelligence models configured to perform a desired feature (or purpose).
As an example, the artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the neural network computation is performed by a calculation between the computation result of the previous layer and the plurality of weight values. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and Deep Q-Networks.
By way of example, the electronic device may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above-described set of instructions. Here, the electronic device need not be a single electronic device and may be any device or collection of circuits capable of executing the above-described instructions (or instruction sets), individually or in combination. The electronic device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces either locally or remotely (e.g., via wireless transmission).
In an electronic device, a processor may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor may execute instructions or code stored in the memory, wherein the memory may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory may be integrated with the processor, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory may include a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory and the processor may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., such that the processor is able to read files stored in the memory.
In addition, the electronic device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform an image processing method according to an exemplary embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, Hard Disk Drives (HDD), Solid State Disks (SSD), card memory (such as multimedia cards, Secure Digital (SD) cards, or extreme digital (XD) cards), magnetic tape, floppy disks, magneto-optical data storage, hard disks, solid state disks, and any other device configured to store computer programs and any associated data, data files, and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the programs. The instructions or computer programs in the computer-readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, server, etc.; further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (20)

1. An image processing method, comprising:
acquiring a feature map of a first image, and detecting a target area in the first image based on the feature map;
correcting the detected target area;
and processing the object corresponding to the target area based on the corrected target area.
2. The image processing method according to claim 1, wherein the acquiring a feature map of the first image and detecting the target area in the first image based on the feature map includes:
extracting features of the first image on at least one scale to obtain at least one feature map of the first image;
a target region in the first image is detected based on the at least one feature map.
3. The image processing method of claim 2, wherein the extracting features of the first image in at least one dimension to obtain at least one feature map of the first image comprises:
and performing convolution operation on the first image on each scale of the at least one scale by using a convolution neural network to obtain a feature map of each scale, wherein the convolution neural network performs convolution operation on each position of at least one position on the first image by adopting a convolution kernel corresponding to each position.
4. The image processing method as claimed in claim 3, wherein the performing a convolution operation on the first image using a convolutional neural network to obtain a feature map for each scale comprises:
obtaining a sampling position of a convolution kernel corresponding to each of the at least one position on the first image, wherein the sampling position of the convolution kernel is determined from an imaging model of the first image;
and performing convolution operation according to the sampling position of the convolution kernel corresponding to each position to obtain a feature map of each scale.
5. The image processing method of claim 4, wherein the sampling position of the convolution kernel is determined by:
Determining the sampling position of a convolution kernel function of each position in a three-dimensional space according to the imaging model;
a sampling position of the convolution kernel corresponding to each position on the first image is determined from the sampling positions of the convolution kernel in three-dimensional space and the imaging model.
6. The image processing method according to claim 1, wherein the correcting the detected target area includes: determining a first characteristic region corresponding to the detected target region in the characteristic map of the first image as a first target region characteristic map; spatially transforming the first target region feature map to generate a transformed first target region feature map,
the processing the object corresponding to the target area based on the corrected target area comprises the following steps: and processing the object corresponding to the target area based on the transformed first target area characteristic diagram.
7. The image processing method of claim 6, wherein spatially transforming the first target region feature map to generate a transformed first target region feature map comprises:
creating a virtual camera corresponding to a target area according to an imaging model of a first image and the detected target area;
And performing spatial transformation on the first target area feature map by using the virtual camera to generate a transformed first target area feature map.
8. The image processing method as claimed in claim 6, wherein the processing the object corresponding to the target region based on the transformed first target region feature map comprises:
obtaining first attribute information of an object corresponding to the target area based on the transformed first target area feature map;
and processing the object corresponding to the target area according to the first attribute information.
9. The image processing method of claim 8, further comprising:
acquiring a second image associated with the first image;
second attribute information of the object is obtained based on the second image,
the processing the object corresponding to the target area according to the first attribute information includes:
and processing the object corresponding to the target area according to the first attribute information and the second attribute information.
10. The image processing method of claim 8, wherein the first attribute information includes at least one of category information, mask information, keypoint information, and pose information of the object.
11. The image processing method of claim 9, wherein the first attribute information includes first keypoint information and initial pose information of the object, the second attribute information includes second keypoint information of the object,
The processing the object corresponding to the target area according to the first attribute information and the second attribute information includes: final pose information of the object is estimated based on the initial pose information, the first keypoint information, and the second keypoint information.
12. The image processing method according to claim 9, wherein the obtaining second attribute information of the object based on the second image includes:
determining a target area corresponding to the object on a second image based on the initial pose information and parameters of a first camera generating the first image and a second camera generating the second image;
and obtaining second key point information of the object based on a target area corresponding to the object on the second image.
13. The image processing method of claim 12, wherein the determining a target area on the second image corresponding to the object based on the initial pose information and parameters of the first camera that generated the first image and the second camera that generated the second image comprises:
determining initial pose information of the object under a coordinate system of the first camera based on the initial pose information and parameters of the first camera;
Determining initial pose information of the object in a coordinate system of a second camera based on the initial pose information of the object in the coordinate system of the first camera and parameters of the second camera;
and determining a target area corresponding to the object on the second image according to the initial posture information of the object under the coordinate system of the second camera.
14. The image processing method according to claim 12, wherein obtaining second keypoint information of the object based on a target region corresponding to the object on the second image, comprises:
and correcting a target area corresponding to the object on the second image, and obtaining second key point information of the object based on the corrected target area.
15. The image processing method according to claim 14, wherein the correcting the target area corresponding to the object on the second image includes:
acquiring a feature map of a second image;
determining a second characteristic region corresponding to the target region on the second image on the characteristic map of the second image as a second target region characteristic map;
performing spatial transformation on the second target region feature map to generate a transformed second target region feature map;
And obtaining second key point information of the object based on the transformed second target area feature map.
16. An image processing method, comprising:
performing a convolution operation on the first image by using a convolution neural network to obtain a feature map of the first image, wherein the convolution neural network performs the convolution operation on each of at least one position on the first image by adopting a convolution kernel corresponding to the each position;
and processing the object in the first image based on the characteristic map.
17. An image processing apparatus comprising:
a detection unit configured to acquire a feature map of a first image, and detect a target region in the first image based on the feature map;
a correction unit configured to correct the detected target area;
and an image processing unit configured to process an object corresponding to the target area based on the corrected target area.
18. An image processing apparatus comprising:
an acquisition unit configured to perform a convolution operation on a first image using a convolutional neural network to acquire a feature map of the first image, wherein the convolutional neural network performs the convolution operation with a convolution kernel corresponding to each of at least one position on the first image;
And an image processing unit configured to process the object in the first image based on the feature map.
19. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform the image processing method of any of claims 1 to 16.
20. A computer readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the image processing method of any of claims 1 to 16.
CN202111421144.0A 2021-11-26 2021-11-26 Image processing method, device, electronic equipment and storage medium Pending CN116188349A (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN202111421144.0A CN116188349A (en) 2021-11-26 2021-11-26 Image processing method, device, electronic equipment and storage medium
KR1020220122436A KR20230078502A (en) 2021-11-26 2022-09-27 Apparatus and method for image processing
EP22209621.6A EP4187483A1 (en) 2021-11-26 2022-11-25 Apparatus and method with image processing
JP2022188155A JP2023079211A (en) 2021-11-26 2022-11-25 Image processing device and method
US17/994,659 US20230169755A1 (en) 2021-11-26 2022-11-28 Apparatus and method with image processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111421144.0A CN116188349A (en) 2021-11-26 2021-11-26 Image processing method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116188349A true CN116188349A (en) 2023-05-30

Family

ID=86440785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111421144.0A Pending CN116188349A (en) 2021-11-26 2021-11-26 Image processing method, device, electronic equipment and storage medium

Country Status (2)

Country Link
KR (1) KR20230078502A (en)
CN (1) CN116188349A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912808A (en) * 2023-09-14 2023-10-20 四川公路桥梁建设集团有限公司 Bridge girder erection machine control method, electronic equipment and computer readable medium
CN116912808B (en) * 2023-09-14 2023-12-01 四川公路桥梁建设集团有限公司 Bridge girder erection machine control method, electronic equipment and computer readable medium

Also Published As

Publication number Publication date
KR20230078502A (en) 2023-06-02

Similar Documents

Publication Publication Date Title
US10540576B1 (en) Panoramic camera systems
EP3213512B1 (en) Method for alignment of low-quality noisy depth map to the high-resolution colour image
Kim et al. 3d scene reconstruction from multiple spherical stereo pairs
JP6902122B2 (en) Double viewing angle Image calibration and image processing methods, equipment, storage media and electronics
JP7386812B2 (en) lighting estimation
CN113689578B (en) Human body data set generation method and device
US11880990B2 (en) Method and apparatus with feature embedding
CN111373748A (en) System and method for externally calibrating a camera and diffractive optical element
KR20210025942A (en) Method for stereo matching usiing end-to-end convolutional neural network
US9147279B1 (en) Systems and methods for merging textures
CN113643414B (en) Three-dimensional image generation method and device, electronic equipment and storage medium
CN111295667A (en) Image stereo matching method and driving assisting device
EP3309750B1 (en) Image processing apparatus and image processing method
CN113643366B (en) Multi-view three-dimensional object attitude estimation method and device
CN116188349A (en) Image processing method, device, electronic equipment and storage medium
US20230169755A1 (en) Apparatus and method with image processing
JP2024521816A (en) Unrestricted image stabilization
Morinaga et al. Underwater active oneshot scan with static wave pattern and bundle adjustment
Kunert et al. Neural network adaption for depth sensor replication
Yao et al. 2D-to-3D conversion using optical flow based depth generation and cross-scale hole filling algorithm
Seegräber et al. Underwater Multiview Stereo Using Axial Camera Models
JP7294702B2 (en) Image processing device, image processing method, and program
Bailey et al. Finite Aperture Stereo
Johnston Single View 3D Reconstruction using Deep Learning
Zhang et al. Stereo Vision

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication