CN111582207A - Image processing method, image processing device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN111582207A
CN111582207A (application number CN202010403620.5A)
Authority
CN
China
Prior art keywords
target object
target
image
depth
reference node
Prior art date
Legal status
Granted
Application number
CN202010403620.5A
Other languages
Chinese (zh)
Other versions
CN111582207B (en)
Inventor
王灿
李杰锋
刘文韬
钱晨
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority: CN202010403620.5A
Publication of CN111582207A
PCT application: PCT/CN2021/084625 (WO2021227694A1)
Taiwan application: TW110115664A (TWI777538B)
Application granted
Publication of CN111582207B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The present disclosure provides an image processing method, an image processing device, an electronic device, and a storage medium, wherein the method includes: identifying a target region of a target object in a first image; based on a target area corresponding to the target object, determining first two-dimensional position information of a plurality of key points representing the posture of the target object in the first image, the relative depth of each key point relative to a reference node of the target object, and the absolute depth of the reference node of the target object in a camera coordinate system; and determining three-dimensional position information of the plurality of key points of the target object in the camera coordinate system based on the first two-dimensional position information, the relative depth and the absolute depth of the target object. In this way, the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system can be obtained more accurately from the first two-dimensional position information of the target object, the relative depth relative to the reference node and the absolute depth of the reference node.

Description

Image processing method, image processing device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Three-dimensional human body posture detection is widely applied in fields such as security, gaming and entertainment. Current three-dimensional human body posture detection methods generally identify first two-dimensional position information of human body key points in an image, and then convert the first two-dimensional position information into three-dimensional position information according to a predetermined position relationship between the human body key points.
The human body posture obtained by current three-dimensional human body posture detection methods therefore has a relatively large error.
Disclosure of Invention
The embodiment of the disclosure at least provides an image processing method and device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including: identifying a target region of a target object in the first image; determining, based on a target area corresponding to the target object, first two-dimensional position information of a plurality of key points representing the posture of the target object in the first image, the relative depth of each key point relative to a reference node of the target object, and the absolute depth of the reference node of the target object in a camera coordinate system; and determining three-dimensional position information of a plurality of key points of the target object in the camera coordinate system based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
In this way, the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system can be obtained more accurately. This three-dimensional position information represents the three-dimensional posture of the target object, so the higher the precision of the three-dimensional position information, the higher the precision of the obtained three-dimensional posture of the target object.
In a possible embodiment, the identifying a target region of a target object in the first image includes: performing feature extraction on the first image to obtain a feature map of the first image; and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance based on the feature map, and determining a target area corresponding to the target object based on the target bounding boxes.
In this way, the target area of the target object is determined in two steps, so that the position of each target object in the first image can be accurately detected, which improves the completeness of the human body information and the detection accuracy in the subsequent key point detection process.
In a possible implementation, the determining, based on the target bounding box, a target area corresponding to the target object includes: determining a feature subgraph corresponding to each target bounding box based on a plurality of target bounding boxes and the feature map; and performing bounding box regression processing on the basis of the feature subgraphs respectively corresponding to the target bounding boxes to obtain target areas corresponding to the target objects.
In this way, the feature subgraphs corresponding to the target bounding boxes are subjected to bounding box regression processing, and the position of each target object in the first image can be accurately detected from the first image.
In a possible embodiment, determining an absolute depth of a reference node of the target object in a camera coordinate system based on a target area corresponding to the target object includes: determining a target feature map corresponding to the target object based on the target area corresponding to the target object and the first image; performing depth recognition processing based on the target feature map corresponding to the target object to obtain a normalized absolute depth of the reference node of the target object; and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
In this way, the following situation can be avoided as far as possible: if the absolute depth of the reference node were predicted directly based on the target feature map, different first images acquired by different cameras from the same view angle and the same position would yield different absolute depths because the internal parameters of the cameras differ.
In a possible implementation manner, performing depth recognition processing based on a target feature map corresponding to the target object to obtain a normalized absolute depth of a reference node of the target object includes: acquiring an initial depth image based on the first image; the pixel value of any first pixel point in the initial depth image represents an initial depth value of a second pixel point corresponding to the first pixel point in the first image in the camera coordinate system; determining second two-dimensional position information of a reference node corresponding to the target object in the first image based on a target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image; determining a normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
In this way, the normalized absolute depth of the reference node obtained by the process can be made more accurate.
In one possible embodiment, the determining a normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object includes: performing at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one stage of second convolution processing on the spliced vector to obtain a corrected value of the initial depth value; and obtaining the normalized absolute depth based on the corrected value of the initial depth value and the initial depth value.
In one possible implementation, the image processing method is applied to a pre-trained neural network, and the neural network comprises three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are respectively used for obtaining the target area of the target object; the first two-dimensional position information and the relative depth of the target object; and the absolute depth.
Therefore, an end-to-end target object posture detection framework is formed by the three branch networks, namely the target detection network, the key point detection network and the depth prediction network; the first image is processed based on this framework to obtain the three-dimensional position information of the plurality of key points of each target object in the first image in the camera coordinate system, so the processing speed is higher and the recognition precision is higher.
In a second aspect, an embodiment of the present disclosure further provides an image processing apparatus, including: an identification module for identifying a target region of a target object in the first image; a first detection module, configured to determine, based on a target area corresponding to the target object, first two-dimensional position information of a plurality of key points respectively representing a posture of the target object in the first image, a relative depth of each key point with respect to a reference node of the target object, and an absolute depth of the reference node of the target object in a camera coordinate system; a second detection module, configured to determine three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
In a possible embodiment, the identification module, when identifying a target region of a target object in the first image, is configured to: performing feature extraction on the first image to obtain a feature map of the first image; and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance based on the feature map, and determining a target area corresponding to the target object based on the target bounding boxes.
In a possible implementation, the identification module, when determining the target area corresponding to the target object based on the target bounding box, is configured to: determining a feature subgraph corresponding to each target bounding box based on a plurality of target bounding boxes and the feature map; and performing bounding box regression processing on the basis of the feature subgraphs respectively corresponding to the target bounding boxes to obtain target areas corresponding to the target objects.
In a possible implementation, the first detection module, when determining an absolute depth of a reference node of the target object in a camera coordinate system based on a target area corresponding to the target object, is configured to: determining a target feature map corresponding to the target object based on the target area corresponding to the target object and the first image; performing depth recognition processing based on the target feature map corresponding to the target object to obtain a normalized absolute depth of the reference node of the target object; and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
In a possible implementation manner, when performing depth recognition processing based on a target feature map corresponding to the target object to obtain a normalized absolute depth of a reference node of the target object, the first detection module is configured to: acquiring an initial depth image based on the first image; the pixel value of any first pixel point in the initial depth image represents an initial depth value of a second pixel point corresponding to the first pixel point in the first image in the camera coordinate system; determining second two-dimensional position information of a reference node corresponding to the target object in the first image based on a target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image; determining a normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
In one possible embodiment, the first detection module, when determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object, is configured to: performing at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one stage of second convolution processing on the spliced vector to obtain a corrected value of the initial depth value; and obtaining the normalized absolute depth based on the corrected value of the initial depth value and the initial depth value.
In one possible implementation, a pre-trained neural network is deployed in the image processing apparatus, and the neural network comprises three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are respectively used for obtaining the target area of the target object; the first two-dimensional position information and the relative depth of the target object; and the absolute depth.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor and a memory coupled to each other, the memory storing machine-readable instructions executable by the processor, the machine-readable instructions being executable by the processor when a computer device is run to implement the steps of the image processing method of the first aspect described above, or any one of the possible implementations of the first aspect.
In a fourth aspect, the disclosed embodiments also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps of the foregoing first aspect, or the image processing method in any one of the possible implementation manners of the first aspect.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for use in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive additional related drawings from them without inventive effort.
Fig. 1 shows a flowchart of an image processing method provided by an embodiment of the present disclosure;
Fig. 2 shows a flowchart of a specific method of identifying a target region of a target object in a first image provided by an embodiment of the present disclosure;
Fig. 3 shows a specific example of determining a target area corresponding to a target object based on a target bounding box provided by an embodiment of the present disclosure;
Fig. 4 shows a flowchart of a specific method of determining an absolute depth of a reference node of a target object in a camera coordinate system provided by an embodiment of the present disclosure;
Fig. 5 shows a flowchart of another specific method for obtaining a normalized absolute depth of a reference node provided by an embodiment of the present disclosure;
Fig. 6 shows a specific example of a target object pose detection framework provided by an embodiment of the present disclosure;
Fig. 7 shows another specific example of a target object pose detection framework provided by an embodiment of the present disclosure;
Fig. 8 shows a schematic diagram of an image processing apparatus provided by an embodiment of the present disclosure;
Fig. 9 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
Current three-dimensional human body posture detection methods generally identify first two-dimensional position information of human body key points in an image to be recognized through a neural network, and then convert the first two-dimensional position information of each human body key point into three-dimensional position information according to predetermined mutual position relationships among the key points (such as connection relationships between different key points, distance ranges between adjacent key points, and the like). However, human body shapes are complex and changeable, and the position relationships of the key points differ from one human body to another, so the three-dimensional human body posture obtained by this method has a relatively large error.
In addition, the accuracy of current three-dimensional human body posture detection methods depends on accurate estimation of the human body key points; however, due to occlusion by clothes, limbs and the like, the human body key points often cannot be accurately identified from the image, which further enlarges the error of the three-dimensional human body posture obtained by such methods.
The above drawbacks are the result of the inventors' practical and careful study; therefore, the discovery of the above problems and the solutions proposed below for them should both be regarded as contributions of the inventors made in the course of the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Based on the above research, the present disclosure provides an image processing method and apparatus, by identifying a target area of a target object in a first image, and determining first two-dimensional position information of a plurality of key points representing a posture of the target object in the first image, a relative depth of each key point with respect to a reference node of the target object, and an absolute depth of the reference node of the target object in a camera coordinate system based on the target area, thereby more accurately obtaining three-dimensional position information of the plurality of key points of the target object in the camera coordinate system based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
To facilitate understanding of the present embodiment, first, an image processing method disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the image processing method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, and the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the image processing method may be implemented by a processor calling computer readable instructions stored in a memory.
The following describes an image processing method provided by an embodiment of the present disclosure, taking an execution subject as a terminal device as an example.
Referring to fig. 1, a flowchart of an image processing method provided by the embodiment of the present disclosure is shown, where the method includes steps S101 to S103, where:
S101: identifying a target region of a target object in the first image;
S102: determining first two-dimensional position information of a plurality of key points representing the target object posture in the first image respectively, the relative depth of each key point relative to a reference node of the target object and the absolute depth of the reference node of the target object in a camera coordinate system based on a target area corresponding to the target object;
S103: determining three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
The following describes each of the above-mentioned steps S101 to S103 in detail.
I: in the above S101, at least one target object is included in the first image. The target object includes, for example, a person, an animal, a robot, a vehicle, or the like, which needs to be posed.
In a possible embodiment, when there is more than one target object included in the first image, the categories of the different target objects may be the same or different; for example, the plurality of target objects are all humans; or the plurality of target objects are all vehicles. As another example, the target object in the first image includes: humans and animals; or the target objects in the first image comprise people and vehicles, and the target object category is determined according to the actual application scene requirements.
The target region of the target object is a region in the first image including the target object.
Illustratively, referring to fig. 2, an embodiment of the present disclosure provides a specific method for identifying a target area of a target object in a first image, including:
S201: And performing feature extraction on the first image to obtain a feature map of the first image.
Here, feature extraction may be performed on the first image using, for example, a neural network to obtain a feature map of the first image.
S202: and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance based on the feature map, and determining a target area corresponding to the target object based on the target bounding boxes.
In a specific implementation, a bounding box prediction algorithm such as RoIAlign or RoI Pooling may be used, for example, to obtain the plurality of target bounding boxes. For example, RoIAlign may traverse the plurality of candidate bounding boxes generated in advance and determine, for each candidate bounding box, a region-of-interest (ROI) value indicating whether the sub-image corresponding to that box belongs to any target object in the first image; the higher the ROI value, the higher the probability that the sub-image corresponding to the candidate bounding box belongs to a certain target object. After the ROI value corresponding to each candidate bounding box is determined, the plurality of target bounding boxes are selected from the candidate bounding boxes in descending order of their ROI values.
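A minimal sketch of this selection step in Python is given below; the function name, the score threshold and the top-k limit are illustrative assumptions and not part of the disclosure. It simply ranks the candidate bounding boxes by their ROI values and keeps the highest-scoring ones.

```python
import numpy as np

def select_target_boxes(candidate_boxes, roi_scores, top_k=100, score_thresh=0.5):
    """Keep the highest-scoring candidate bounding boxes as target bounding boxes.

    candidate_boxes: (M, 4) array of [x, y, w, h] candidates generated in advance.
    roi_scores:      (M,) ROI value per candidate; a higher value means the sub-image
                     inside the candidate is more likely to belong to a target object.
    """
    order = np.argsort(-roi_scores)                  # sort candidates from large to small ROI value
    keep = order[:top_k]                             # take the top_k candidates
    keep = keep[roi_scores[keep] >= score_thresh]    # optionally drop low-confidence candidates
    return candidate_boxes[keep], roi_scores[keep]
```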
The target bounding box is, for example, rectangular; the information of the target bounding box includes, for example: coordinates of any vertex in the target bounding box in the first image, and a height value and a width value of the target bounding box. Alternatively, the information of the target bounding box includes, for example: coordinates of any vertex in the target bounding box in the feature map of the first image, and a height value and a width value of the target bounding box.
After the plurality of target bounding boxes are obtained, target areas corresponding to all target objects in the first image are determined based on the plurality of target bounding boxes.
Referring to fig. 3, an embodiment of the present disclosure provides a specific example of determining a target area corresponding to a target object based on a target bounding box, where the specific example includes:
s301: and determining a characteristic subgraph corresponding to each target bounding box based on the plurality of target bounding boxes and the characteristic graph.
In a specific implementation, when the information of the target bounding box includes the coordinates, in the first image, of any vertex of the target bounding box, together with the height value and the width value of the target bounding box, the feature points in the feature map and the pixel points in the first image have a certain position mapping relationship; the feature subgraph corresponding to each target bounding box is then determined from the feature map of the first image according to the information of the target bounding box and the mapping relationship between the feature map and the first image.
In the case that the information of the target bounding box includes the coordinates of any vertex in the target bounding box in the feature map of the first image, and the height value and the width value of the target bounding box, the feature subgraphs respectively corresponding to the target bounding boxes can be determined from the feature map of the first image directly based on the target bounding box.
S302: and performing border frame regression processing on the basis of the feature subgraphs respectively corresponding to the target border frames to obtain target areas corresponding to the target objects.
Here, for example, a bounding box regression algorithm may be used to perform bounding box regression processing on the target bounding box based on the feature subgraph corresponding to each target bounding box, so as to obtain multiple bounding boxes including the complete target object. Each of the plurality of bounding boxes corresponds to a target object, and the region determined based on the bounding box corresponding to the target object is the target region of the corresponding target object.
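As a hedged illustration of this regression step (the patent does not fix the regression parameterization; the common centre-offset and log-scale form is assumed here), the predicted deltas can be applied to each target bounding box as follows:

```python
import numpy as np

def apply_box_deltas(boxes, deltas):
    """Refine [x, y, w, h] boxes with regressed deltas (dx, dy, dw, dh).

    (dx, dy) shift the box centre in units of the box size; (dw, dh) rescale
    the width and height exponentially, a common bounding box regression form.
    """
    cx = boxes[:, 0] + 0.5 * boxes[:, 2] + deltas[:, 0] * boxes[:, 2]
    cy = boxes[:, 1] + 0.5 * boxes[:, 3] + deltas[:, 1] * boxes[:, 3]
    w = boxes[:, 2] * np.exp(deltas[:, 2])
    h = boxes[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx - 0.5 * w, cy - 0.5 * h, w, h], axis=1)
```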
At this time, the number of the obtained target areas is consistent with the number of the target objects in the first image, and each target object corresponds to one target area; if the different target objects have a mutual occlusion positional relationship, the target areas corresponding to the target objects having the mutual occlusion relationship have a certain overlapping degree.
In another embodiment of the present disclosure, other target detection algorithms may also be employed to detect a target region of a target object in the first image. For example, a semantic segmentation algorithm is adopted to determine a semantic segmentation result of each pixel point in the first image, and then the positions of the pixel points belonging to different target objects in the first image are determined according to the semantic segmentation result; and then, solving a minimum bounding box according to pixel points belonging to the same target object, and determining an area corresponding to the minimum bounding box as a target area of the target object.
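The segmentation-based alternative can be sketched as follows; this is a simple illustration that assumes an instance-id mask is available from the segmentation step.

```python
import numpy as np

def target_region_from_mask(instance_mask, instance_id):
    """Minimum bounding box [x, y, w, h] around the pixels of one target object.

    instance_mask: (H, W) integer map in which each pixel stores the id of the
                   target object it belongs to (0 for background).
    """
    ys, xs = np.where(instance_mask == instance_id)
    if ys.size == 0:
        return None  # this target object does not appear in the first image
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return np.array([x0, y0, x1 - x0 + 1, y1 - y0 + 1])
```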
II: in the above S102, the image coordinate system refers to a two-dimensional coordinate system established in both the length and width directions of the first image; the camera coordinate system is a three-dimensional coordinate system established in the direction of the optical axis of the camera and in two directions parallel to the optical axis and in the plane of the optical center of the camera.
The key points of the target object are position points that are located on the target object and that, when connected in a certain order, can represent the posture of the target object; for example, when the target object is a human body, the key points include the position points where the respective joints of the human body are located. In the image coordinate system, a position point is represented by a two-dimensional coordinate value; in the camera coordinate system, it is represented by a three-dimensional coordinate value.
In a specific implementation, a key point detection network may be used, for example, to perform key point detection processing based on the target feature map of the target object, so as to obtain the first two-dimensional position information of a plurality of key points of the target object in the first image and the relative depth of each key point with respect to the reference node of the target object. Here, the manner of obtaining the target feature map may refer to the description of S401 below, which is not repeated here.
The reference node is, for example, a position point of a predetermined part of the target object, and may be predetermined according to actual needs. For example, when the target object is a human body, the position point where the pelvis of the human body is located may be determined as the reference node, or any key point on the human body may be determined as the reference node, or the position point where the center of the chest and abdomen of the human body is located may be determined as the reference node; it can be set as required for the specific situation.
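As an illustrative sketch only (the patent does not specify how the key point detection network encodes its outputs; a heatmap-style head is assumed here), the first two-dimensional position information and the relative depths could be decoded as follows:

```python
import torch

def decode_keypoints(heatmaps, rel_depth_maps):
    """Decode 2D key point positions and per-key-point relative depths.

    heatmaps:       (J, H, W) one response map per key point (assumed output format).
    rel_depth_maps: (J, H, W) predicted depth of each key point relative to the reference node.
    """
    J, H, W = heatmaps.shape
    flat = heatmaps.view(J, -1)
    idx = flat.argmax(dim=1)                                   # peak location of each heatmap
    ys = torch.div(idx, W, rounding_mode="floor")
    xs = idx % W
    rel_depth = rel_depth_maps.view(J, -1)[torch.arange(J), idx]
    keypoints_2d = torch.stack([xs, ys], dim=1).float()        # first two-dimensional position information
    return keypoints_2d, rel_depth
```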
Referring to fig. 4, an embodiment of the present disclosure provides a specific method for determining an absolute depth of a reference node of a target object in a camera coordinate system based on a target area corresponding to the target object, including:
s401: and determining a target feature map corresponding to the target image based on the target area corresponding to the target object and the first image.
Here, the target feature map may be determined from the feature map based on the feature map of the first image obtained by feature extraction of the first image and the target region, for example.
Here, the feature points in the feature map extracted from the first image and the pixel points in the first image have a certain position mapping relationship; after the target area of each target object is obtained, the position of each target object in the feature map of the first image can be determined according to this position mapping relationship, and the target feature map of each target object can then be cropped from the feature map of the first image.
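A minimal sketch of this cropping step is given below; the down-sampling stride is an assumed parameter, since the patent only states that a position mapping relationship exists between the feature map and the first image.

```python
def crop_target_feature_map(feature_map, target_region, stride=4):
    """Cut the target feature map of one target object out of the feature map of the first image.

    feature_map:   (C, H/stride, W/stride) features extracted from the first image.
    target_region: [x, y, w, h] of the target area in first-image pixel coordinates.
    stride:        assumed down-sampling factor between the first image and its feature map.
    """
    x, y, w, h = [int(round(v / stride)) for v in target_region]
    return feature_map[:, y:y + max(h, 1), x:x + max(w, 1)]
```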
S402: and executing depth recognition processing based on the target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object.
Here, in a possible implementation, for example, a depth prediction network trained in advance may be used to perform a depth detection process on the target feature map, so as to obtain a normalized absolute depth of the reference node of the target object.
In another embodiment of the present disclosure, referring to fig. 5, another specific method for obtaining a normalized absolute depth of a reference node is further provided, including:
S501: acquiring an initial depth image based on the first image; the pixel value of any first pixel point in the initial depth image represents an initial depth value of a second pixel point corresponding to the first pixel point in the first image in the camera coordinate system.
Here, a depth prediction network may be employed to determine an initial depth value for each pixel point (second pixel point) in the first image; these initial depth values form the initial depth image of the first image. The pixel value of any pixel point (first pixel point) in the initial depth image is the initial depth value of the pixel point (second pixel point) at the corresponding position in the first image.
S502: and determining second two-dimensional position information of the reference node corresponding to the target object in the first image based on the target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image.
Here, the target feature map corresponding to the target object may be a target feature map determined for each target object from the feature map of the first image, for example, based on a target region corresponding to each target object.
After the target feature maps corresponding to the target objects are obtained, for example, a pre-trained reference node detection network may be used to determine second two-dimensional position information of the reference nodes of the target objects in the first image based on the target feature maps. Then, a pixel point corresponding to the reference node is determined from the initial depth image by using the second two-dimensional position information, and a pixel value of the pixel point determined from the initial depth image is determined as an initial depth value of the reference node.
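The look-up of the initial depth value can be sketched as follows (nearest-pixel sampling is assumed; bilinear sampling would also work):

```python
import numpy as np

def reference_node_initial_depth(initial_depth_image, ref_node_xy):
    """Initial depth value of a reference node, read from the initial depth image.

    initial_depth_image: (H, W) map whose pixel values are the initial depth values, in the
                         camera coordinate system, of the corresponding pixels of the first image.
    ref_node_xy:         (x, y) second two-dimensional position information of the reference node.
    """
    h, w = initial_depth_image.shape
    x = int(np.clip(round(ref_node_xy[0]), 0, w - 1))
    y = int(np.clip(round(ref_node_xy[1]), 0, h - 1))
    return float(initial_depth_image[y, x])
```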
S503: determining a normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
For example, at least one stage of first convolution processing may be performed on the target feature map corresponding to the target object to obtain a feature vector of the target object; the feature vector and the initial depth value are spliced to form a spliced vector, and at least one stage of second convolution processing is performed on the spliced vector to obtain a corrected value of the initial depth value; and the normalized absolute depth is obtained based on the corrected value of the initial depth value and the initial depth value.
Here, for example, a neural network for adjusting the initial depth value may be adopted, the neural network including a plurality of convolutional layers; some of the convolutional layers are used to carry out the at least one stage of first convolution processing on the target feature map, and the other convolutional layers are used to perform the at least one stage of second convolution processing on the spliced vector so as to obtain the correction value; then, the initial depth value is adjusted according to the correction value to obtain the normalized absolute depth of the reference node of the target object.
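A hedged PyTorch-style sketch of such a correction network is given below; the channel sizes, layer counts and pooling choice are illustrative assumptions, and only the overall structure (first convolution processing, splicing, second convolution processing, addition) follows the description above.

```python
import torch
import torch.nn as nn

class DepthRefiner(nn.Module):
    """Regress a correction value for the initial depth value of the reference node."""

    def __init__(self, in_channels=256, feat_dim=128):
        super().__init__()
        # at least one stage of first convolution processing on the target feature map
        self.first_conv = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # at least one stage of second convolution processing on the spliced vector
        self.second_conv = nn.Sequential(
            nn.Conv1d(feat_dim + 1, feat_dim, 1), nn.ReLU(),
            nn.Conv1d(feat_dim, 1, 1),
        )

    def forward(self, target_feature_map, initial_depth):
        # target_feature_map: (B, C, H, W); initial_depth: (B, 1)
        feat = self.first_conv(target_feature_map).flatten(1)            # feature vector of the target object
        spliced = torch.cat([feat, initial_depth], dim=1).unsqueeze(-1)  # spliced vector
        correction = self.second_conv(spliced).squeeze(-1)               # corrected value of the initial depth value
        return initial_depth + correction                                # normalized absolute depth
```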
With reference to the foregoing S402, the specific method for determining the absolute depth of the reference node of the target object in the camera coordinate system provided by the embodiment of the present disclosure further includes:
s403: and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
In a specific implementation, different first images may be captured by different cameras during image processing; for different cameras, the corresponding camera internal parameters may differ. Here, the camera internal parameters include, for example: the focal length of the camera on the x-axis, the focal length of the camera on the y-axis, and the coordinates of the optical center of the camera on the x-axis and the y-axis in the camera coordinate system.
Because the camera internal parameters differ, even first images acquired from the same view angle and the same position can differ; if the absolute depth of the reference node were predicted directly based on the target feature map, the absolute depths obtained for different first images acquired by different cameras from the same view angle and the same position would be different.
To avoid the above situation, embodiments of the present disclosure directly predict the normalized absolute depth of the reference node, which is obtained without considering the camera internal parameters, and then recover the absolute depth of the reference node according to the camera internal parameters and the normalized absolute depth.
Illustratively, the normalized absolute depth and the absolute depth of the reference node of any target object satisfy the following formula (1) (the published equation appears only as an image; its form is reconstructed here from the variable definitions below):

$Z_{ref}^{norm} = Z_{ref} \cdot \sqrt{A_{Box} / (f_x \, f_y \, A_{RoI})}$    (1)

wherein $Z_{ref}^{norm}$ represents the normalized absolute depth of the reference node; $Z_{ref}$ represents the absolute depth of the reference node; $A_{Box}$ represents the area of the target region; $A_{RoI}$ represents the area of the target bounding box; and $(f_x, f_y)$ represents the focal lengths of the camera.

Illustratively, the camera coordinate system is a three-dimensional coordinate system with three coordinate axes x, y and z; the origin of the camera coordinate system is the optical center of the camera; the optical axis of the camera is the z-axis of the camera coordinate system; the plane that passes through the optical center and is perpendicular to the z-axis is the plane in which the x-axis and y-axis lie; $f_x$ is the focal length of the camera along the x-axis, and $f_y$ is the focal length of the camera along the y-axis.
It should be noted here that, in the above S202, the plurality of target bounding boxes are determined by RoIAlign, and the areas of the target bounding boxes are all equal.
Since the focal length of the camera is already determined when the camera acquires the first image, and the target area and the target bounding box are also already determined when the target area is determined, after the normalized absolute depth of the reference node is obtained, the absolute depth of the reference node of the target object is obtained according to the above formula (1).
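Assuming the area- and focal-length-scaled relation reconstructed above for formula (1) (the published equation is only available as an image, so this exact form is an assumption), the recovery step can be sketched as:

```python
import math

def absolute_depth_from_normalized(z_norm, fx, fy, area_box, area_roi):
    """Recover the absolute depth of the reference node from its normalized absolute depth.

    Assumes formula (1) has the form Z_norm = Z * sqrt(A_Box / (fx * fy * A_RoI)),
    i.e. the normalization removes the dependence on the camera internal parameters.
    """
    return z_norm * math.sqrt(fx * fy * area_roi / area_box)
```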
III: in the above S103, it is assumed that each target object includes J key points, and there are N target objects in the first image; wherein the three-dimensional poses of the N target objects are represented as:
Figure BDA0002490419640000104
wherein the three-dimensional posture of the mth target object
Figure BDA0002490419640000105
Can be expressed as:
Figure BDA0002490419640000106
wherein,
Figure BDA0002490419640000107
a coordinate value of a jth key point representing the mth target object in the x-axis direction in the camera coordinate system;
Figure BDA0002490419640000108
a coordinate value of a jth key point representing the mth target object in the y-axis direction in the camera coordinate system;
Figure BDA0002490419640000109
and a coordinate value of the jth key point representing the mth target object in the z-axis direction in the camera coordinate system.
The target areas of the N target objects are represented as:

$\{B_m\}_{m=1}^{N}$

wherein the target area $B_m$ of the m-th target object is expressed as:

$B_m = (x_{tl}^m, y_{tl}^m, w^m, h^m)$

Here, $x_{tl}^m$ and $y_{tl}^m$ are the coordinate values of the vertex at the upper left corner of the target area, and $w^m$ and $h^m$ respectively represent the width and height values of the target area.
The three-dimensional postures of the N target objects relative to their reference nodes are represented as:

$\{\tilde{P}_m\}_{m=1}^{N}$

wherein the three-dimensional posture $\tilde{P}_m$ of the m-th target object relative to its reference node is expressed as:

$\tilde{P}_m = \{(u_j^m, v_j^m, d_j^m)\}_{j=1}^{J}$

wherein $u_j^m$ represents the coordinate value of the j-th key point of the m-th target object on the x-axis of the image coordinate system, and $v_j^m$ represents its coordinate value on the y-axis of the image coordinate system; that is, $(u_j^m, v_j^m)$ is the two-dimensional coordinate value of the j-th key point of the m-th target object in the image coordinate system. $d_j^m$ represents the relative depth of the j-th key point of the m-th target object with respect to the reference node of the m-th target object.
The three-dimensional posture of the m-th target object is obtained by back projection using the internal reference matrix K of the camera; the three-dimensional coordinate information of the j-th key point of the m-th target object satisfies the following formula (2):

$(X_j^m, Y_j^m, Z_j^m)^T = (d_j^m + Z_{ref}^m) \cdot K^{-1} \, (u_j^m, v_j^m, 1)^T$    (2)

wherein $Z_{ref}^m$ represents the absolute depth value of the reference node of the m-th target object in the camera coordinate system. It should be noted that $Z_{ref}^m$ is obtained based on the example corresponding to formula (1) above.

The internal reference matrix K is determined, for example, by the parameters $(f_x, f_y, c_x, c_y)$, wherein $f_x$ is the focal length of the camera on the x-axis in the camera coordinate system; $f_y$ is the focal length of the camera on the y-axis in the camera coordinate system; $c_x$ is the coordinate value of the optical center of the camera on the x-axis in the camera coordinate system; and $c_y$ is the coordinate value of the optical center of the camera on the y-axis in the camera coordinate system.
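A short sketch of the back projection in formula (2), using the standard pinhole relation between image coordinates and camera coordinates (the variable names are illustrative):

```python
import numpy as np

def back_project_keypoints(keypoints_2d, rel_depth, ref_abs_depth, K):
    """Back-project key points of one target object into the camera coordinate system.

    keypoints_2d:  (J, 2) first two-dimensional positions (u, v) in the image coordinate system.
    rel_depth:     (J,) relative depth of each key point with respect to the reference node.
    ref_abs_depth: absolute depth of the reference node in the camera coordinate system.
    K:             (3, 3) internal reference matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]].
    """
    z = rel_depth + ref_abs_depth                                         # absolute depth of every key point
    uv1 = np.concatenate([keypoints_2d, np.ones((keypoints_2d.shape[0], 1))], axis=1)
    rays = uv1 @ np.linalg.inv(K).T                                       # K^-1 [u, v, 1]^T per key point
    return rays * z[:, None]                                              # (J, 3): (X, Y, Z) per key point
```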
Through the above process, the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system can be obtained; for the m-th target object, its three-dimensional posture is represented by the three-dimensional position information corresponding to its J key points.
The embodiment of the disclosure identifies the target area of the target object in the first image, and determines first two-dimensional position information of a plurality of key points representing the posture of the target object in the first image, the relative depth of each key point relative to the reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system based on the target area, so as to more accurately obtain the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
In another embodiment of the present disclosure, another image processing method is further provided, where the image processing method is applied to a pre-trained neural network.
The neural network comprises three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are respectively used for obtaining the target area of the target object; the first two-dimensional position information and the relative depth of the target object; and the absolute depth.
The specific working processes of the three branch networks can be shown in the above embodiments, and are not described herein again.
According to the method and the device, an end-to-end target object posture detection framework is formed by the three branch networks, namely the target detection network, the key point detection network and the depth prediction network; the first image is processed based on this framework to obtain the three-dimensional position information of the plurality of key points of each target object in the first image in the camera coordinate system, so the processing speed is higher and the recognition precision is higher.
Referring to fig. 6, an embodiment of the present disclosure further provides a specific example of a target object posture detection framework, including:
the method comprises three network branches of a target detection network, a key point detection network and a depth prediction network;
the target detection network extracts the features of the first image to obtain a feature map of the first image; then, according to the first feature map, determining a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance by adopting RoIAlign; and executing the bounding box regression processing on the plurality of target bounding boxes to obtain a target area corresponding to each target object. And transmitting the target characteristic graph corresponding to the target area to a key point detection network and a depth prediction network.
The key point detection network determines, based on each target feature map, the first two-dimensional position information of the plurality of key points representing the posture of the target object in the first image and the relative depth of each key point relative to the reference node of the target object. For each target feature map, the first two-dimensional position information and the relative depth of each key point together form the three-dimensional posture of the target object in that target feature map. The three-dimensional posture at this point is a three-dimensional posture relative to the object itself.
And the depth prediction network determines the absolute depth of the reference node of the target object in the camera coordinate system based on the target feature map.
And finally, determining three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively according to the first two-dimensional position information, the relative depth and the absolute depth of the reference node of the target object. For each target object, the three-dimensional position information of the plurality of key points on the target object in the camera coordinate system respectively constitutes the three-dimensional posture of the target object in the camera coordinate system. The three-dimensional posture at this time is a three-dimensional posture with reference to the camera.
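Putting the three branches together, the flow of fig. 6 can be sketched at a high level as follows; the network interfaces are illustrative assumptions, and back_project_keypoints is the helper sketched above.

```python
def detect_3d_poses(first_image, K, backbone, target_det_net, keypoint_net, depth_net):
    """End-to-end flow: target detection, key point detection and depth prediction branches."""
    feature_map = backbone(first_image)                          # feature extraction on the first image
    target_regions, target_feats = target_det_net(feature_map)   # RoIAlign + bounding box regression
    poses_3d = []
    for feat, region in zip(target_feats, target_regions):
        kps_2d, rel_depth = keypoint_net(feat)                   # posture relative to the object itself
        ref_abs_depth = depth_net(feat, region, K)               # absolute depth of the reference node
        poses_3d.append(back_project_keypoints(kps_2d, rel_depth, ref_abs_depth, K))
    return poses_3d                                              # one (J, 3) pose per target object in camera coordinates
```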
Referring to fig. 7, a specific example of another target object posture detection framework is further provided in the embodiments of the present disclosure, including:
a target detection network, a key point detection network and a depth prediction network;
the target detection network extracts the features of the first image to obtain a feature map of the first image; then, according to the first feature map, determining a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance by adopting RoIAlign; and executing the bounding box regression processing on the plurality of target bounding boxes to obtain a target area corresponding to each target object. And transmitting the target characteristic graph corresponding to the target area to a key point detection network and a depth prediction network.
The key point detection network determines, based on each target feature map, the first two-dimensional position information of the plurality of key points representing the posture of the target object in the first image and the relative depth of each key point relative to the reference node of the target object. For each target feature map, the first two-dimensional position information and the relative depth of each key point together form the three-dimensional posture of the target object in that target feature map. The three-dimensional posture at this point is a three-dimensional posture relative to the object itself.
The depth prediction network acquires an initial depth image based on the first image; determines second two-dimensional position information of the reference node corresponding to the target object in the first image based on the target feature map corresponding to the target object, and determines an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image; performs at least one stage of first convolution processing on the target feature map corresponding to the target object to obtain a feature vector of the target object; splices the feature vector and the initial depth value of the reference node to form a spliced vector, and performs at least one stage of second convolution processing on the spliced vector to obtain a corrected value of the initial depth value; and adds the corrected value and the initial depth value of the reference node to obtain the normalized absolute depth value of the reference node.
Then, the absolute depth value of the reference node is recovered through the formula (1), and finally, the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system is determined according to the first two-dimensional position information, the relative depth of the target object and the absolute depth of the reference node. For each target object, the three-dimensional position information of the plurality of key points on the target object in the camera coordinate system respectively constitutes the three-dimensional posture of the target object in the camera coordinate system. The three-dimensional posture at this time is a three-dimensional posture with reference to the camera.
Through any one of the two target object posture detection frameworks, the three-dimensional position information of the plurality of key points of each target object in the first image in the camera coordinate system can be obtained, the processing speed is higher, and the identification precision is higher.
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict order of execution or any limitation on the implementation; the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same inventive concept, the embodiments of the present disclosure further provide an image processing apparatus corresponding to the image processing method. Since the principle by which the apparatus solves the problem is similar to that of the image processing method described above, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 8, a schematic diagram of an image processing apparatus provided in an embodiment of the present disclosure is shown, where the apparatus includes: an identification module 81, a first detection module 82, a second detection module 83; wherein,
an identification module 81 for identifying a target area of a target object in the first image;
a first detection module 82, configured to determine, based on a target area corresponding to the target object, first two-dimensional position information of a plurality of key points respectively representing a posture of the target object in the first image, a relative depth of each key point with respect to a reference node of the target object, and an absolute depth of the reference node of the target object in a camera coordinate system;
a second detection module 83, configured to determine three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
In one possible embodiment, the identification module 81, when identifying a target region of a target object in the first image, is configured to:
performing feature extraction on the first image to obtain a feature map of the first image;
and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance based on the feature map, and determining a target area corresponding to the target object based on the target bounding boxes.
In a possible implementation manner, the identifying module 81, when determining the target area corresponding to the target object based on the target bounding box, is configured to:
determining a feature sub-map corresponding to each target bounding box based on the plurality of target bounding boxes and the feature map;
and performing bounding box regression processing based on the feature sub-maps respectively corresponding to the target bounding boxes, to obtain the target areas corresponding to the target objects.
In a possible implementation, the first detection module 82, when determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target area corresponding to the target object, is configured to:
determining a target feature map corresponding to the target object based on a target area corresponding to the target object and the first image;
performing depth recognition processing based on a target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object;
and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
In a possible implementation manner, when performing depth recognition processing based on a target feature map corresponding to the target object to obtain a normalized absolute depth of a reference node of the target object, the first detection module 82 is configured to:
acquiring an initial depth image based on the first image; the pixel value of any first pixel point in the initial depth image represents an initial depth value of a second pixel point corresponding to the first pixel point in the first image in the camera coordinate system;
determining second two-dimensional position information of a reference node corresponding to the target object in the first image based on a target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image;
determining a normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
In one possible implementation, the first detection module 82, when determining the normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object, is configured to:
performing at least one stage of first convolution processing on a target feature map corresponding to the target object to obtain a feature vector of the target object;
splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one stage of second convolution processing on the spliced vector to obtain a corrected value of the initial depth value;
and obtaining the normalized absolute depth based on the corrected value of the initial depth value and the initial depth value.
In one possible implementation, a pre-trained neural network is deployed in the image processing apparatus, and the neural network includes three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are respectively used for obtaining the target area of the target object, the first two-dimensional position information and the relative depth of the target object, and the absolute depth.
In the embodiments of the present disclosure, the target area of the target object in the first image is identified, and based on the target area, the first two-dimensional position information of a plurality of key points representing the posture of the target object in the first image, the relative depth of each key point with respect to the reference node of the target object, and the absolute depth of the reference node of the target object in the camera coordinate system are determined, so that the three-dimensional position information of the plurality of key points of the target object in the camera coordinate system can be obtained more accurately based on the first two-dimensional position information, the relative depth, and the absolute depth.
In addition, in the embodiments of the present disclosure, an end-to-end target object posture detection framework is formed by the three branch networks, namely the target detection network, the key point detection network and the depth prediction network; the first image is processed based on this framework to obtain the three-dimensional position information of the plurality of key points of each target object in the first image in the camera coordinate system, with a faster processing speed and higher recognition accuracy.
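Putting the three branches together, an end-to-end pipeline could, in outline, be wired as in the sketch below, which reuses the illustrative classes defined earlier; the decoding of the regressed boxes and the mapping of key point coordinates back to first-image pixels are simplified away and are assumptions of the example.

def detect_poses(first_image, candidate_boxes, initial_depth_image, camera_K,
                 detector, keypoint_branch, depth_branch):
    # Target detection branch: feature map, pooled target feature maps, box offsets.
    feature_map, target_rois, _ = detector(first_image, candidate_boxes)
    # Key point detection branch: 2D positions and relative depths per target object.
    kp_xy, rel_depth = keypoint_branch(target_rois)
    # For brevity, assume kp_xy has already been mapped back to first-image pixel
    # coordinates, and take each target's reference node to be its first key point.
    ref_xy = kp_xy[:, 0, :].round().long()
    # Depth prediction branch: normalized absolute depth of each reference node.
    norm_abs_depth = depth_branch(target_rois, initial_depth_image, ref_xy)
    poses = []
    for i in range(kp_xy.shape[0]):  # one camera-coordinate 3D posture per target object
        poses.append(keypoints_to_camera_coords(kp_xy[i].detach().numpy(),
                                                rel_depth[i].detach().numpy(),
                                                float(norm_abs_depth[i]),
                                                camera_K))
    return poses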
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
The embodiments of the present disclosure further provide a computer device 10. As shown in fig. 9, which is a schematic structural diagram of the computer device 10 provided in an embodiment of the present disclosure, the computer device 10 includes:
a processor 11 and a memory 12; the memory 12 stores machine-readable instructions executable by the processor 11, and when the computer device runs, the machine-readable instructions are executed by the processor 11 to perform the following steps:
identifying a target region of a target object in the first image;
determining first two-dimensional position information of a plurality of key points representing the target object posture in the first image respectively, the relative depth of each key point relative to a reference node of the target object and the absolute depth of the reference node of the target object in a camera coordinate system based on a target area corresponding to the target object;
determining three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
For the specific execution process of the instruction, reference may be made to the steps of the image processing method described in the embodiments of the present disclosure, and details are not described here.
The embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, performs the steps of the image processing method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The computer program product of the image processing method provided in the embodiments of the present disclosure includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the steps of the image processing method described in the above method embodiments, to which reference may be made for details that are not repeated here.
The embodiments of the present disclosure also provide a computer program which, when executed by a processor, implements any one of the methods of the foregoing embodiments. The corresponding computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may still, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments or easily conceive of changes or equivalent substitutions of some technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by its protection scope. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. An image processing method, comprising:
identifying a target region of a target object in the first image;
determining first two-dimensional position information of a plurality of key points representing the target object posture in the first image respectively, the relative depth of each key point relative to a reference node of the target object and the absolute depth of the reference node of the target object in a camera coordinate system based on a target area corresponding to the target object;
determining three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
2. The image processing method of claim 1, wherein the identifying a target region of a target object in the first image comprises:
performing feature extraction on the first image to obtain a feature map of the first image;
and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance based on the feature map, and determining a target area corresponding to the target object based on the target bounding boxes.
3. The image processing method according to claim 2, wherein the determining a target region corresponding to the target object based on the target bounding box comprises:
determining a feature sub-map corresponding to each target bounding box based on the plurality of target bounding boxes and the feature map;
and performing bounding box regression processing based on the feature sub-maps respectively corresponding to the target bounding boxes, to obtain target areas corresponding to the target objects.
4. The image processing method according to any one of claims 1 to 3, wherein determining the absolute depth of the reference node of the target object in the camera coordinate system based on the target region corresponding to the target object comprises:
determining a target feature map corresponding to the target object based on a target area corresponding to the target object and the first image;
performing depth recognition processing based on a target feature map corresponding to the target object to obtain the normalized absolute depth of the reference node of the target object;
and obtaining the absolute depth of the reference node of the target object in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.
5. The image processing method according to claim 4, wherein the performing depth recognition processing based on the target feature map corresponding to the target object to obtain a normalized absolute depth of the reference node of the target object includes:
acquiring an initial depth image based on the first image; the pixel value of any first pixel point in the initial depth image represents an initial depth value of a second pixel point corresponding to the first pixel point in the first image in the camera coordinate system;
determining second two-dimensional position information of a reference node corresponding to the target object in the first image based on a target feature map corresponding to the target object, and determining an initial depth value of the reference node corresponding to the target object based on the second two-dimensional position information and the initial depth image;
determining a normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object.
6. The method according to claim 5, wherein determining a normalized absolute depth of the reference node of the target object based on the initial depth value of the reference node corresponding to the target object and the target feature map corresponding to the target object comprises:
performing at least one stage of first convolution processing on a target feature map corresponding to the target object to obtain a feature vector of the target object;
splicing the feature vector and the initial depth value to form a spliced vector, and performing at least one stage of second convolution processing on the spliced vector to obtain a corrected value of the initial depth value;
and obtaining the normalized absolute depth based on the corrected value of the initial depth value and the initial depth value.
7. The image processing method according to any one of claims 1 to 6, wherein the image processing method is applied to a pre-trained neural network, and the neural network comprises three branch networks, namely a target detection network, a key point detection network and a depth prediction network, which are respectively used for obtaining the target area of the target object, the first two-dimensional position information and the relative depth of the target object, and the absolute depth.
8. An image processing apparatus characterized by comprising:
an identification module for identifying a target region of a target object in the first image;
a first detection module, configured to determine, based on a target area corresponding to the target object, first two-dimensional position information of a plurality of key points respectively representing a posture of the target object in the first image, a relative depth of each key point with respect to a reference node of the target object, and an absolute depth of the reference node of the target object in a camera coordinate system;
a second detection module, configured to determine three-dimensional position information of a plurality of key points of the target object in the camera coordinate system respectively based on the first two-dimensional position information, the relative depth, and the absolute depth of the target object.
9. A computer device, comprising: a processor and a memory connected to each other, the memory storing machine-readable instructions executable by the processor, wherein, when the computer device runs, the machine-readable instructions are executed by the processor to implement the steps of the image processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the image processing method according to any one of claims 1 to 7.
CN202010403620.5A 2020-05-13 2020-05-13 Image processing method, device, electronic equipment and storage medium Active CN111582207B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010403620.5A CN111582207B (en) 2020-05-13 2020-05-13 Image processing method, device, electronic equipment and storage medium
PCT/CN2021/084625 WO2021227694A1 (en) 2020-05-13 2021-03-31 Image processing method and apparatus, electronic device, and storage medium
TW110115664A TWI777538B (en) 2020-05-13 2021-04-29 Image processing method, electronic device and computer-readable storage media

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010403620.5A CN111582207B (en) 2020-05-13 2020-05-13 Image processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111582207A true CN111582207A (en) 2020-08-25
CN111582207B CN111582207B (en) 2023-08-15

Family

ID=72110786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010403620.5A Active CN111582207B (en) 2020-05-13 2020-05-13 Image processing method, device, electronic equipment and storage medium

Country Status (3)

Country Link
CN (1) CN111582207B (en)
TW (1) TWI777538B (en)
WO (1) WO2021227694A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163480A (en) * 2020-09-16 2021-01-01 北京邮电大学 Behavior identification method and device
CN112528831A (en) * 2020-12-07 2021-03-19 深圳市优必选科技股份有限公司 Multi-target attitude estimation method, multi-target attitude estimation device and terminal equipment
CN112907517A (en) * 2021-01-28 2021-06-04 上海商汤智能科技有限公司 Image processing method and device, computer equipment and storage medium
CN113344998A (en) * 2021-06-25 2021-09-03 北京市商汤科技开发有限公司 Depth detection method and device, computer equipment and storage medium
CN113470112A (en) * 2021-06-30 2021-10-01 Oppo广东移动通信有限公司 Image processing method, image processing device, storage medium and terminal
CN113610967A (en) * 2021-08-13 2021-11-05 北京市商汤科技开发有限公司 Three-dimensional point detection method and device, electronic equipment and storage medium
CN113610966A (en) * 2021-08-13 2021-11-05 北京市商汤科技开发有限公司 Three-dimensional attitude adjustment method and device, electronic equipment and storage medium
WO2021227694A1 (en) * 2020-05-13 2021-11-18 北京市商汤科技开发有限公司 Image processing method and apparatus, electronic device, and storage medium
CN113743234A (en) * 2021-08-11 2021-12-03 浙江大华技术股份有限公司 Target action determining method, target action counting method and electronic device
US20220076448A1 (en) * 2020-09-08 2022-03-10 Samsung Electronics Co., Ltd. Method and apparatus for pose identification
CN114764860A (en) * 2021-01-14 2022-07-19 北京图森智途科技有限公司 Feature extraction method and device, computer equipment and storage medium
CN116386016A (en) * 2023-05-22 2023-07-04 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB202009515D0 (en) 2020-06-22 2020-08-05 Ariel Ai Ltd 3D object model reconstruction from 2D images
GB2598452B (en) * 2020-06-22 2024-01-10 Snap Inc 3D object model reconstruction from 2D images
CN114354618A (en) * 2021-12-16 2022-04-15 浙江大华技术股份有限公司 Method and device for detecting welding seam
CN114782547B (en) * 2022-04-13 2024-08-20 北京爱笔科技有限公司 Three-dimensional coordinate determination method and device
CN115063789B (en) * 2022-05-24 2023-08-04 中国科学院自动化研究所 3D target detection method and device based on key point matching
WO2023236008A1 (en) * 2022-06-06 2023-12-14 Intel Corporation Methods and apparatus for small object detection in images and videos
CN114972958B (en) * 2022-07-27 2022-10-04 北京百度网讯科技有限公司 Key point detection method, neural network training method, device and equipment
CN115018918B (en) * 2022-08-04 2022-11-04 南昌虚拟现实研究院股份有限公司 Three-dimensional coordinate determination method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120099782A1 (en) * 2010-10-20 2012-04-26 Samsung Electronics Co., Ltd. Image processing apparatus and method
US20180018503A1 (en) * 2015-12-11 2018-01-18 Tencent Technology (Shenzhen) Company Limited Method, terminal, and storage medium for tracking facial critical area
CN107871134A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108460338A (en) * 2018-02-02 2018-08-28 北京市商汤科技开发有限公司 Estimation method of human posture and device, electronic equipment, storage medium, program
CN110378308A (en) * 2019-07-25 2019-10-25 电子科技大学 The improved harbour SAR image offshore Ship Detection based on Faster R-CNN

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472753B (en) * 2018-10-30 2021-09-07 北京市商汤科技开发有限公司 Image processing method and device, computer equipment and computer storage medium
CN111582207B (en) * 2020-05-13 2023-08-15 北京市商汤科技开发有限公司 Image processing method, device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120099782A1 (en) * 2010-10-20 2012-04-26 Samsung Electronics Co., Ltd. Image processing apparatus and method
US20180018503A1 (en) * 2015-12-11 2018-01-18 Tencent Technology (Shenzhen) Company Limited Method, terminal, and storage medium for tracking facial critical area
CN107871134A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 A kind of method for detecting human face and device
CN108460338A (en) * 2018-02-02 2018-08-28 北京市商汤科技开发有限公司 Estimation method of human posture and device, electronic equipment, storage medium, program
CN110378308A (en) * 2019-07-25 2019-10-25 电子科技大学 The improved harbour SAR image offshore Ship Detection based on Faster R-CNN

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021227694A1 (en) * 2020-05-13 2021-11-18 北京市商汤科技开发有限公司 Image processing method and apparatus, electronic device, and storage medium
US12051221B2 (en) * 2020-09-08 2024-07-30 Samsung Electronics Co., Ltd. Method and apparatus for pose identification
US20220076448A1 (en) * 2020-09-08 2022-03-10 Samsung Electronics Co., Ltd. Method and apparatus for pose identification
CN112163480A (en) * 2020-09-16 2021-01-01 北京邮电大学 Behavior identification method and device
CN112528831A (en) * 2020-12-07 2021-03-19 深圳市优必选科技股份有限公司 Multi-target attitude estimation method, multi-target attitude estimation device and terminal equipment
CN112528831B (en) * 2020-12-07 2023-11-24 深圳市优必选科技股份有限公司 Multi-target attitude estimation method, multi-target attitude estimation device and terminal equipment
CN114764860A (en) * 2021-01-14 2022-07-19 北京图森智途科技有限公司 Feature extraction method and device, computer equipment and storage medium
CN112907517A (en) * 2021-01-28 2021-06-04 上海商汤智能科技有限公司 Image processing method and device, computer equipment and storage medium
CN113344998A (en) * 2021-06-25 2021-09-03 北京市商汤科技开发有限公司 Depth detection method and device, computer equipment and storage medium
WO2022267275A1 (en) * 2021-06-25 2022-12-29 北京市商汤科技开发有限公司 Depth detection method, apparatus and device, storage medium, computer program and product
CN113344998B (en) * 2021-06-25 2022-04-29 北京市商汤科技开发有限公司 Depth detection method and device, computer equipment and storage medium
CN113470112A (en) * 2021-06-30 2021-10-01 Oppo广东移动通信有限公司 Image processing method, image processing device, storage medium and terminal
CN113743234A (en) * 2021-08-11 2021-12-03 浙江大华技术股份有限公司 Target action determining method, target action counting method and electronic device
WO2023015903A1 (en) * 2021-08-13 2023-02-16 上海商汤智能科技有限公司 Three-dimensional pose adjustment method and apparatus, electronic device, and storage medium
CN113610966A (en) * 2021-08-13 2021-11-05 北京市商汤科技开发有限公司 Three-dimensional attitude adjustment method and device, electronic equipment and storage medium
CN113610967B (en) * 2021-08-13 2024-03-26 北京市商汤科技开发有限公司 Three-dimensional point detection method, three-dimensional point detection device, electronic equipment and storage medium
CN113610967A (en) * 2021-08-13 2021-11-05 北京市商汤科技开发有限公司 Three-dimensional point detection method and device, electronic equipment and storage medium
CN116386016A (en) * 2023-05-22 2023-07-04 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium
CN116386016B (en) * 2023-05-22 2023-10-10 杭州睿影科技有限公司 Foreign matter treatment method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021227694A1 (en) 2021-11-18
TWI777538B (en) 2022-09-11
CN111582207B (en) 2023-08-15
TW202143100A (en) 2021-11-16

Similar Documents

Publication Publication Date Title
CN111582207B (en) Image processing method, device, electronic equipment and storage medium
CN110135455B (en) Image matching method, device and computer readable storage medium
CN110648397B (en) Scene map generation method and device, storage medium and electronic equipment
WO2021004416A1 (en) Method and apparatus for establishing beacon map on basis of visual beacons
JP5631086B2 (en) Information processing apparatus, control method therefor, and program
CN111582204A (en) Attitude detection method and apparatus, computer device and storage medium
US8995714B2 (en) Information creation device for estimating object position and information creation method and program for estimating object position
JP5480667B2 (en) Position / orientation measuring apparatus, position / orientation measuring method, program
JP2017059207A (en) Image recognition method
CN108428224B (en) Animal body surface temperature detection method and device based on convolutional neural network
CN110782483A (en) Multi-view multi-target tracking method and system based on distributed camera network
JP5833507B2 (en) Image processing device
Liang et al. Image-based positioning of mobile devices in indoor environments
CN113052907B (en) Positioning method of mobile robot in dynamic environment
Tamjidi et al. 6-DOF pose estimation of a portable navigation aid for the visually impaired
JP2017091377A (en) Attitude estimation device, attitude estimation method, and attitude estimation program
CN111353325A (en) Key point detection model training method and device
CN113610967B (en) Three-dimensional point detection method, three-dimensional point detection device, electronic equipment and storage medium
WO2022107548A1 (en) Three-dimensional skeleton detection method and three-dimensional skeleton detection device
JP2023008030A (en) Image processing system, image processing method, and image processing program
CN113591785A (en) Human body part matching method, device, equipment and storage medium
CN114474035B (en) Robot position determining method, device and system
CN110567728B (en) Method, device and equipment for identifying shooting intention of user
JP6675584B2 (en) Image processing apparatus, image processing method, and program
CN112927291A (en) Pose determination method and device of three-dimensional object, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40026189; Country of ref document: HK)
GR01 Patent grant