CN111582204A

CN111582204A - Attitude detection method and apparatus, computer device and storage medium

Info

Publication number: CN111582204A
Application number: CN202010402654.2A
Authority: CN
Inventors: 王灿; 李杰锋; 刘文韬; 钱晨
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2020-05-13
Filing date: 2020-05-13
Publication date: 2020-08-25

Abstract

The present disclosure provides a method, an apparatus, a computer device and a storage medium for gesture detection; the method comprises the following steps: acquiring a first image; carrying out posture detection processing on the first image by using a pre-trained posture detection model to obtain a posture detection result of at least one first target in the first image; the posture detection model is trained by utilizing the relative position relation information of the key point of at least one second target in the second image under different visual angles. The gesture detection method and the gesture detection device can enable the gesture detection model to learn the mutual position relation between different second targets, and when the gesture detection model is used for detecting the gesture of the first target in the first image, the gesture detection result with higher detection precision can be obtained for the shielded first target.

Description

Attitude detection method and apparatus, computer device and storage medium

Technical Field

The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for detecting an attitude, a computer device, and a storage medium.

Background

The three-dimensional human body posture detector is widely applied to the fields of security, games, entertainment and the like. The current three-dimensional human body posture detection method generally identifies first two-dimensional position information of human body key points in an image, and then converts the first two-dimensional position information into three-dimensional position information according to a predetermined position relationship between the human body key points.

The human body posture obtained by the current three-dimensional human body posture detection method has larger error.

Disclosure of Invention

The embodiment of the disclosure at least provides a posture detection method, a posture detection device, computer equipment and a storage medium.

In a first aspect, an embodiment of the present disclosure provides a method for detecting a gesture, including: acquiring a first image; carrying out posture detection processing on the first image by using a pre-trained posture detection model to obtain a posture detection result of at least one first target in the first image; the posture detection model is trained by utilizing the relative position relation information of the key point of at least one second target in the second image under different visual angles.

Therefore, the posture detection model can learn the mutual position relation between different second targets, and when the posture detection model is used for detecting the posture of the first target in the first image, the posture detection result with higher detection precision can be obtained for the shielded first target.

In a possible embodiment, the relative position relationship information at different viewing angles includes a depth difference of each of at least one key point pair at different camera coordinate systems; the keypoint pairs include any two keypoints of at least one second target in the second image.

Therefore, the depth difference of each key point pair in at least one key point pair under different camera coordinate systems represents the relative position relation information under different visual angles, and the shielding relation between different second targets can be better represented.

In a possible implementation, the different camera coordinate systems include: a real camera coordinate system, and at least one virtual camera coordinate system, or the different camera coordinate systems include: different virtual camera coordinate systems.

In this way, different viewing angles are determined by constructing at least one virtual camera coordinate system.

In a possible implementation, the posture detection model is obtained by training according to the following method: determining first position information of a plurality of key points of each second target in the second image in a real camera coordinate system by using a neural network to be trained; determining the relative positional relationship information based on the first positional information; determining a neural network loss based on the relative position relationship information; and training the neural network to be trained based on the neural network loss to obtain the posture detection model.

Therefore, the posture detection model is obtained by training the relative position relation information of the key point of at least one second target in the second image under different visual angles.

In one possible embodiment, determining a depth difference for each of at least one key point pair in a virtual camera coordinate system based on the first location information comprises: for each key point pair, determining second position information of each key point in the key point pair in a first virtual camera coordinate system respectively based on first position information of two key points in the key point pair in a real camera coordinate system respectively and conversion relation information between the real camera coordinate system and the first virtual camera coordinate system; determining a depth difference of each key point pair in the first virtual camera coordinate system based on second position information of each key point in the key point pair in the first virtual camera coordinate system, wherein the first virtual camera coordinate system is any one of the virtual camera coordinate systems.

Therefore, the virtual camera coordinate system is constructed through the real camera coordinate system, and the training process of the supervised posture detection model based on the multi-view mutual position relation information of the key point pairs is realized.

In a possible implementation, the gesture detection method further includes: determining position information of a center position point of the at least one second target in a real camera coordinate system and determining position information of a first virtual camera position point in the real camera coordinate system based on first position information of a keypoint of the at least one second target in the real camera coordinate system; and generating conversion relation information between the first virtual camera coordinate system corresponding to the first virtual camera position point and the real camera coordinate system based on the position information of the central position point in the real camera coordinate system and the position information of the first virtual camera position point in the real camera coordinate system.

Therefore, the virtual camera coordinate system can be randomly determined, and the conversion relation information between the virtual camera coordinate system and the real camera coordinate system can be obtained.

In one possible embodiment, the determining, in the real camera coordinate system, position information of a first virtual camera position point in the real camera coordinate system includes: determining a bounding sphere radius based on the position information of the center position point in the real camera coordinate system and the first position information of the key point of the at least one second target in the real camera coordinate system respectively; the enclosing ball encloses the key points of the at least one second target; determining position information of the first virtual camera position point in the real camera coordinate system on or outside a spherical surface of the bounding sphere based on the position information of the center position point in the real camera coordinate system and the bounding sphere radius.

In this way, it can be ensured that all key points of the second object in the second image are located under the shooting angle of view of the virtual camera.

In one possible embodiment, the determining the neural network loss based on the relative position relationship information includes: determining the neural network loss based on the depth difference of the at least one keypoint pair in the different camera coordinate systems and actual depth relation information of two keypoints in each keypoint pair in the different camera coordinate systems.

In a possible embodiment, the determining the neural network loss based on the depth difference of the at least one keypoint pair in the different camera coordinate system and the actual depth relation information of two keypoints in each keypoint pair in the different camera coordinate system includes: for each key point pair in the at least one key point pair, determining a loss penalty value of each key point pair according to the depth difference of each key point pair in the different virtual camera coordinate systems and the actual depth relation information; and determining the neural network loss based on the loss penalty values respectively corresponding to the at least one key point pair.

Therefore, the loss of the neural network is determined by utilizing whether the key point represented by the relative position relationship is correct in the front and back positions under different visual angles; if the accuracy of the front and back positions of the key point pairs under different viewing angles is higher, the corresponding loss of the neural network is smaller.

In a second aspect, an embodiment of the present disclosure further provides an attitude detection apparatus, including: the acquisition module is used for acquiring a first image; the processing module is used for carrying out posture detection processing on the first image by utilizing a pre-trained posture detection model to obtain a posture detection result of at least one first target in the first image; the posture detection model is trained by utilizing the relative position relation information of the key point of at least one second target in the second image under different visual angles.

In a possible embodiment, the method further comprises: the training module is used for training to obtain the posture detection model by adopting the following method: determining first position information of a plurality of key points of each second target in the second image in a real camera coordinate system by using a neural network to be trained; determining the relative positional relationship information based on the first positional information; determining a neural network loss based on the relative position relationship information; and training the neural network to be trained based on the neural network loss to obtain the posture detection model.

In one possible embodiment, the training module, when determining the depth difference for each of the at least one key point pair in the virtual camera coordinate system based on the first location information, is configured to: for each key point pair, determining second position information of each key point in the key point pair in a first virtual camera coordinate system respectively based on first position information of two key points in the key point pair in a real camera coordinate system respectively and conversion relation information between the real camera coordinate system and the first virtual camera coordinate system; determining a depth difference of each key point pair in the first virtual camera coordinate system based on second position information of each key point in the key point pair in the first virtual camera coordinate system, wherein the first virtual camera coordinate system is any one of the virtual camera coordinate systems.

In a possible implementation, the training module is further configured to: determining position information of a center position point of the at least one second target in a real camera coordinate system and determining position information of a first virtual camera position point in the real camera coordinate system based on first position information of a keypoint of the at least one second target in the real camera coordinate system; and generating conversion relation information between the first virtual camera coordinate system corresponding to the first virtual camera position point and the real camera coordinate system based on the position information of the central position point in the real camera coordinate system and the position information of the first virtual camera position point in the real camera coordinate system.

In one possible embodiment, the training module, when determining the position information of the first virtual camera position point in the real camera coordinate system, is configured to: determining a bounding sphere radius based on the position information of the center position point in the real camera coordinate system and the first position information of the key point of the at least one second target in the real camera coordinate system respectively; the enclosing ball encloses the key points of the at least one second target; determining position information of the first virtual camera position point in the real camera coordinate system on or outside a spherical surface of the bounding sphere based on the position information of the center position point in the real camera coordinate system and the bounding sphere radius.

In one possible embodiment, the training module, when determining the neural network loss based on the relative position relationship information, is configured to: determining the neural network loss based on the depth difference of the at least one keypoint pair in the different camera coordinate systems and actual depth relation information of two keypoints in each keypoint pair in the different camera coordinate systems.

In a possible embodiment, the training module, when determining that the neural network is lost based on the depth difference of the at least one keypoint pair in the different camera coordinate system and the actual depth relationship information of the two keypoints in each keypoint pair in the different camera coordinate system, is configured to: for each key point pair in the at least one key point pair, determining a loss penalty value of each key point pair according to the depth difference of each key point pair in the different virtual camera coordinate systems and the actual depth relation information; and determining the neural network loss based on the loss penalty values respectively corresponding to the at least one key point pair.

In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor and a memory coupled to each other, the memory storing machine-readable instructions executable by the processor, the machine-readable instructions being executable by the processor to implement steps as in the first aspect, or any one of the possible implementations of the first aspect, when the computer device is run.

In a fourth aspect, this disclosed embodiment also provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps in the first aspect or any one of the possible implementation manners of the first aspect.

In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for use in the embodiments will be briefly described below, and the drawings herein incorporated in and forming a part of the specification illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It is appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope, for those skilled in the art will be able to derive additional related drawings therefrom without the benefit of the inventive faculty.

FIG. 1 illustrates a flow chart of a gesture detection method provided by an embodiment of the present disclosure;

FIG. 2 illustrates a top-down example provided by an embodiment of the present disclosure showing that a key point pair has different mutual positional relationships at different viewing angles;

FIG. 3 is a flow chart illustrating a particular method of training a gesture detection model provided by an embodiment of the present disclosure;

fig. 4 is a flowchart illustrating a specific method for determining a depth difference of each of at least one key point pair in a virtual camera coordinate system based on the first position information according to an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of a gesture detection apparatus provided by an embodiment of the present disclosure;

fig. 6 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.

The three-dimensional human body posture detection method generally comprises the steps of identifying first two-dimensional position information of human body key points in an image to be identified through a neural network, and converting the first two-dimensional position information of each human body key point into three-dimensional position information according to mutual position relations (such as connection relations among different key points, distance ranges among adjacent key points and the like) among the human body key points; but this method relies on the accurate detection of key points in the human body; however, when the current human body three-dimensional posture estimation method is used for detecting the human body three-dimensional posture of an image comprising a plurality of people, due to mutual shielding of the human bodies of the plurality of people, key points of the human body cannot be accurately identified from the image in many cases, and the problem that the accuracy of the human body three-dimensional posture detection based on the current method is poor is caused.

Based on the above research, the present disclosure provides a posture detection method and apparatus, in which a posture detection model for performing three-dimensional posture detection is trained by using relative position relationships of key points of at least one second target in a second image at different viewing angles, so that the posture detection model can learn the mutual position relationships between different second targets, and when the posture detection model is used to perform posture detection on a first target in a first image, a posture detection result with higher detection accuracy can be obtained for a blocked first target.

The above-mentioned drawbacks are the results of the inventor after practical and careful study, and therefore, the discovery process of the above-mentioned problems and the solutions proposed by the present disclosure to the above-mentioned problems should be the contribution of the inventor in the process of the present disclosure.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

To facilitate understanding of the present embodiment, first, a detailed description is given of a gesture detection method disclosed in the embodiments of the present disclosure, and an execution subject of the gesture detection method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability, where the computer device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle mounted device, a wearable device, or a server or other processing device. In some possible implementations, the gesture detection method may be implemented by a processor invoking computer readable instructions stored in a memory.

The image processing method provided by the embodiment of the disclosure can be used for detecting the three-dimensional posture of a human body and can also be used for detecting the three-dimensional posture of other targets.

The following describes a method for detecting an attitude according to an embodiment of the present disclosure.

Referring to fig. 1, a flowchart of a method for detecting an attitude according to an embodiment of the present disclosure is shown, where the method includes steps S101 to S102, where:

s101: acquiring a first image;

s102: carrying out posture detection processing on the first image by using a pre-trained posture detection model to obtain a posture detection result of at least one first target in the first image;

the posture detection model is trained by utilizing the relative position relation information of the key point of at least one second target in the second image under different visual angles.

The following describes each of the above-mentioned steps S101 to S102 in detail.

I: in the above S101, the first image includes at least one first object therein. The first object includes, for example, a person, an animal, a robot, a vehicle, or the like, for which a pose needs to be determined.

In a possible embodiment, in the case that more than one first object is included in the first image, the categories of the different first objects may be the same or different; for example, the plurality of first targets are all humans; alternatively, the plurality of first targets are all vehicles. For another example, the first target in the first image includes: humans and animals; or the first target in the first image comprises a person and a vehicle, and the category of the first target is determined according to the actual application scene needs.

II: in S102, the second image used for training the posture detection model is also different according to the difference of the first target in the first image.

For example, in the case where the first target to be recognized includes only a person, then the second target in the second image employed in training the pose detection model also includes a person; in the case where the first target to be recognized includes a person, a vehicle, and an animal, then the second target in the second image employed in training the pose detection model also includes a person, a vehicle, and an animal.

The key points of the second target are, for example, position points which are located on the second target and can represent the three-dimensional posture of the second target after being connected according to a certain sequence; for example, when the second object is a human body, the key points include, for example, position points where respective joints of the human body are located. The coordinate value of the position point in the image coordinate system is a two-dimensional coordinate value; in the camera coordinate system, the coordinate values are three-dimensional coordinate values.

In the embodiment of the present disclosure, the relative position relationship information of the keypoints of the at least one second target under different viewing angles includes, for example: depth difference of each key point pair in at least one key point pair under different camera coordinate systems; the keypoint pairs include any two keypoints of at least one second target in the second image.

The camera coordinate system is a three-dimensional coordinate system established by taking the optical axis of the camera as a z-axis and taking the plane where the optical center of the camera is located and which is perpendicular to the optical axis of the camera as a plane where an x-axis and a y-axis are located. Wherein the z-axis direction is called the depth direction.

Different camera coordinate systems may be, for example, any of the following: (1) a real camera coordinate system, and at least one virtual camera coordinate system; (2): at least two virtual camera coordinate systems.

Wherein the real camera coordinate system is a coordinate system established based on a real camera taking the second image.

The virtual camera coordinate system is, for example, a coordinate system established based on a virtual camera whose shooting position and shooting angle are different from those of the real camera.

Illustratively, for example, the second image includes 2 second objects, a and B, respectively, wherein the key points of the second object a include: a is₁To a_nN key points in total; the key points of the second object B include B₁To b_nAnd n key points.

At least one key point pair is represented as (p, q), where p ∈ { a }₁,a₂,…,a_n,b₁,b₂,…,b_n}，q∈{a₁,a₂,…,a_n,b₁,b₂,…b_nAnd p ≠ q.

Here, for two different keypoints that make up a keypoint pair, when the positions of the two keypoints vary, they characterize the different keypoint pair. For example (a)₃,b₄) And (b)₄,a₃) Are two different key point pairs.

In different camera coordinate systems, the same key point pair has different depth differences in different camera coordinate systems due to different viewing angles "looking" at the second object.

As shown in fig. 2, a top view example is shown where a key point pair has different mutual positional relationships at different viewing angles; in this example, the different camera coordinate systems include a real camera coordinate system and two virtual camera coordinate systems, for example, wherein the real camera coordinate system is a coordinate system established based on a real camera; the two virtual camera coordinate systems are coordinate systems established based on the virtual camera 1 and the virtual camera 2, respectively.

A key point pair (p, q), wherein the depth of p is equal to the depth of q under the real camera coordinate system (the equal means that the depth difference between p and q is less than a preset depth difference threshold value); in a virtual camera coordinate system established based on the virtual camera 1, the depth of p is less than the depth of q; in the virtual camera coordinate system established based on the virtual camera 2, the depth of p is greater than the depth of q. It can be seen that even for the same key point pair, when the viewing angle changes, the mutual positional relationship between two key points in the key point pair may change accordingly.

Based on the above, the embodiment of the disclosure trains the posture detection model by using the mutual position relationship of the key points of at least one second target in the second image under different viewing angles; under the condition that the gesture detection model cannot accurately identify key points on the shielded second target, the training process of the gesture detection model is supervised through the mutual position relation, so that the gesture detection model can learn the mutual position relation among different second targets, and when the gesture detection model is used for gesture detection of the first target in the first image, a gesture detection result with higher detection precision can be obtained for the shielded first target.

Referring to fig. 3, an embodiment of the present disclosure provides a specific method for training a posture detection model, including:

s301: and determining first position information of a plurality of key points of each second target in the second image in a real camera coordinate system by using a neural network to be trained.

In a specific implementation, the neural network to be trained includes, for example: the system comprises a target detection network, a key point detection network and a depth prediction network; the target detection network is used for obtaining a target area of the second target; the key point detection network is used for determining each key point in the second target, first two-dimensional position information in an image coordinate system corresponding to the second image and relative depth between each key point and a reference node of the second target based on the target area; the depth prediction network is for predicting an absolute depth of a reference node of the second target in the real camera coordinate system based on the target area. And then determining first position information of each key point of each second target in the second image in the real camera coordinate system according to the first two-dimensional position information, the relative depth of each key point and the absolute depth of the reference node. Here, the first position information is three-dimensional position information.

Here, the reference node is, for example, a position point where a certain portion is predetermined on the second target. For example, the reference node may be predetermined according to actual needs; for example, when the second target is a human body, the position point where the pelvis of the human body is located may be determined as a reference node, or any key point on the human body may be determined as a reference node, or the position point where the center of the chest and abdomen of the human body is located may be determined as a reference node; the specific conditions can be set as required.

The method comprises the following steps: target detection network: the target region of the second object in the second image may be identified, for example, in the following manner: performing feature extraction on the second image to obtain a feature map of the second image; and determining a plurality of target bounding boxes from a plurality of candidate bounding boxes generated in advance based on the feature map, and determining a target area corresponding to the second target based on the target bounding boxes.

Specifically, the method comprises the following steps: feature extraction may be performed on the second image, for example, using a neural network, to obtain a feature map of the second image. After obtaining the feature map of the second image, a plurality of target bounding boxes may be obtained, for example, using a bounding box prediction algorithm. The bounding box prediction algorithm includes, for example, roilign, ROI-posing, etc., where, for example, roilign may traverse a plurality of candidate bounding boxes generated in advance, and determine that a sub-image corresponding to each candidate bounding box belongs to a region of interest (ROI) value of any second target in the second image, where the higher the ROI value is, the higher the probability that the sub-image corresponding to the candidate bounding box belongs to a certain second target is; after the ROI value corresponding to each candidate bounding box is determined, a plurality of target bounding boxes are determined from the candidate bounding boxes according to the sequence of the ROI values corresponding to the candidate bounding boxes from large to small. The target bounding box is, for example, rectangular; the information of the target bounding box includes, for example: coordinates of any vertex in the target bounding box in the second image, and a height value and a width value of the target bounding box. Alternatively, the information of the target bounding box includes, for example: coordinates of any vertex in the target bounding box in the feature map of the second image, and a height value and a width value of the target bounding box. And after the plurality of target boundary frames are obtained, determining target areas corresponding to all second targets in the second image based on the plurality of target boundary frames.

When determining the target area corresponding to the second target based on the target bounding box, for example, the following method may be adopted: determining a feature subgraph corresponding to each target bounding box based on a plurality of target bounding boxes and the feature graph; and performing border frame regression processing based on the feature subgraphs respectively corresponding to the target border frames to obtain a target area corresponding to the second target.

Specifically, under the condition that the information of the target bounding box includes the coordinates of any vertex on the target bounding box in the second image, and the height value and the width value of the target bounding box, the feature points in the feature map and the pixel points in the second image have a certain position mapping relationship; and determining the characteristic subgraph corresponding to each target boundary frame from the characteristic graph of the second image according to the relevant information of the target boundary frame and the mapping relation between the characteristic graph and the second image.

In the case that the information of the target bounding box includes the coordinates of any vertex in the feature map of the second image, and the height value and the width value of the target bounding box, the feature subgraphs respectively corresponding to the target bounding boxes can be determined from the feature map of the second image directly based on the target bounding box.

After the feature sub-images corresponding to each target boundary box are obtained, for example, a boundary box regression algorithm may be used to perform boundary box regression processing on the target boundary box based on the feature sub-images corresponding to each target boundary box, so as to obtain a plurality of boundary boxes including the complete second target. Each of the plurality of bounding boxes corresponds to a second target, and the region determined based on the bounding box corresponding to the second target is the target region corresponding to the second target. At this time, the number of the obtained target areas is consistent with the number of the second targets in the second image, and each second target corresponds to one target area; if the different second targets have a mutual shielding position relationship, the target areas corresponding to the second targets having the mutual shielding relationship have a certain overlapping degree.

Secondly, the step of: the key point detection network: for example, the neural network may be directly used to perform the keypoint detection processing based on the target feature map of the second target, so as to obtain two-dimensional position information of a plurality of keypoints of the second target in the second image, and a relative depth of each keypoint with respect to a reference node of the second target.

Here, the target feature map may be determined from the feature map based on, for example, a feature map obtained by extracting features of the second image and the target region.

③: depth prediction network: the absolute depth of the reference node of the second object in the camera coordinate system may be determined, for example, based on the target area of the second object in the following manner: determining a target feature map corresponding to the target image based on the target area corresponding to the second target and the second image; performing depth recognition processing based on the target feature map corresponding to the second target to obtain normalized absolute depth of the reference node of the second target; and obtaining the absolute depth of the reference node of the second target in the camera coordinate system based on the normalized absolute depth and the parameter matrix of the camera.

Here, the normalized absolute depth of the reference node may be obtained, for example, in the following manner: acquiring an initial depth image based on the second image; the pixel value of any first pixel point in the initial depth image represents an initial depth value of a second pixel point corresponding to the first pixel point in the second image in the camera coordinate system; determining second two-dimensional position information of a reference node corresponding to the second target in the second image based on the target feature map corresponding to the second target, and determining an initial depth value of the reference node corresponding to the second target based on the second two-dimensional position information and the initial depth image; determining a normalized absolute depth of the reference node of the second target based on the initial depth value of the reference node corresponding to the second target and the target feature map corresponding to the second target.

Specifically, for example, a depth prediction network may be adopted to determine an initial depth value of each pixel point (second pixel point) in the second image; the initial depth value of each first pixel point forms an initial depth image of the first image; the pixel value of any pixel point (first pixel point) in the initial depth image is the initial depth value of the pixel point (second pixel point) at the corresponding position in the second image.

Second two-dimensional position information of the reference node of the second target in the second image may be determined based on the target feature map, for example, using a pre-trained reference node detection network. Then, a pixel point corresponding to the reference node is determined from the initial depth image by using the second two-dimensional position information, and a pixel value of the pixel point determined from the initial depth image is determined as an initial depth value of the reference node.

After obtaining the initial depth value of the reference node, for example, at least one stage of first convolution processing may be performed on a target feature map corresponding to a second target to obtain a feature vector of the second target; splicing the characteristic vector and the initial depth value to form a spliced vector, and performing at least one stage of second convolution processing on the spliced vector to obtain a corrected value of the initial depth value; and obtaining the normalized absolute depth based on the corrected value of the initial depth value and the initial depth value.

Here, for example, a neural network for adjusting the initial depth value may be adopted, the neural network including a plurality of convolutional layers; wherein, part of the plurality of convolution layers are used for carrying out at least one stage of first convolution processing on the target characteristic diagram; the other convolution layers are used for performing at least one stage of second convolution processing on the splicing vector so as to obtain a corrected value; and then, adjusting the initial depth value according to the correction value to obtain the normalized absolute depth of the reference node of the second target.

After the normalized absolute depth of the reference node is obtained, the absolute depth of the reference node can be recovered according to the normalized absolute depth and the camera internal parameter.

Exemplarily, different second images may be captured by different cameras during image processing of the different second images; for different cameras, the corresponding camera parameters may be different; here, the camera internal reference includes, for example: focal length of the camera on the x-axis, focal length of the camera on the y-axis, coordinates of the optical center of the camera in the camera coordinate system.

The camera internal parameters are different, and even second images acquired at the same view angle and the same position can be distinguished; if the absolute depth of the reference node is predicted directly based on the target feature map, the absolute depths obtained for different second images obtained by different cameras at the same view angle and the same position are different.

To avoid the above situation, embodiments of the present disclosure directly predict the normalized depth of the reference node, which is obtained without considering the camera internal parameters; and then recovering the absolute depth of the reference node according to the camera internal reference and the normalized absolute depth.

Illustratively, the normalized absolute depth, and the absolute depth of the reference node of any second target satisfy the following formula (1):

wherein the content of the first and second substances,

representing a normalized absolute depth of a reference node;

representing the reference node absolute depth; a. the_BoxRepresenting the area of the target region; a. the_RoIRepresenting the area of the target bounding box.

(f_x,f_y) Representing the camera focal length. Illustratively, the camera coordinate system is a three-dimensional coordinate system; the device comprises three coordinate axes of x, y and z; the origin of the camera coordinate system is the optical center of the camera; the optical axis of the camera is the z-axis of the camera coordinate system; the plane where the optical center is located and parallel to the z axis is the plane where the x axis and the y axis are located; f. of_xIs the focal length of the camera in the x-axis; f. of_yIs the focal length of the camera in the y-axis.

It should be noted here that there are a plurality of target bounding boxes determined by RoIAlign; and the areas of the target bounding boxes are all equal.

Since the focal length of the camera is already determined when the camera acquires the second image, and the second target and the target bounding box are also already determined when the target area is determined, after the normalized absolute depth of the reference node is obtained, the absolute depth of the reference node of the second target is obtained according to the above formula (1).

Fourthly, the method comprises the following steps: for example, the following manner may be adopted to determine the first position information of each key point of the second target in the real camera coordinate system in the second image according to the first two-dimensional position information, the relative depths of the key points, and the absolute depths of the reference nodes:

assuming that each second object comprises J keypoints and N second objects in the second image; wherein the three-dimensional poses of the N second targets are represented as:

wherein the three-dimensional posture of the mth second target

Can be expressed as:

wherein the content of the first and second substances,

a coordinate value of the jth key point representing the mth second target in the x-axis direction in the camera coordinate system;

a coordinate value of a j-th key point representing the mth second target in the y-axis direction in the camera coordinate system;

and a coordinate value of the jth key point representing the mth second target in the z-axis direction in the camera coordinate system.

The target areas of the N second targets are represented as:

wherein the target region of the mth second target

Expressed as:

here, the number of the first and second electrodes,

and

a coordinate value indicating a vertex at which an upper left corner of the target region is located;

and

respectively representing the width and height values of the target area.

The three-dimensional poses of the N second targets relative to the reference node are represented as:

wherein the m-th second target is in three-dimensional posture relative to the reference node

Expressed as:

wherein the content of the first and second substances,

a coordinate value of the j-th key point representing the mth second target on the x-axis in the image coordinate system;

a coordinate value of a y axis in an image coordinate system of a jth key point representing the mth second target; that is to say that the first and second electrodes,

and the two-dimensional coordinate value of the jth key point of the mth second target in the image coordinate system is represented.

Representing the relative depth of the jth node of the mth second target with respect to the reference node of the mth second target.

Obtaining the three-dimensional posture of the mth second target by back projection by using the internal reference matrix K of the camera, wherein the three-dimensional coordinate information of the jth node of the mth second target satisfies the following formula (2)

Wherein the content of the first and second substances,

an absolute depth value of the reference node representing the mth second target in the camera coordinate system. Here, it is to be noted that

Obtained on the basis of the corresponding example of the above formula (1).

The internal reference matrix K is, for example: (f)_x,f_y,c_x,c_y)；

Wherein: f. of_xIs the focal length of the camera on the x-axis in the camera coordinate system; f. of_yIs the focal length of the camera on the y-axis in the camera coordinate system; c. C_xIs a coordinate value of the optical center of the camera on the x-axis in a camera coordinate system; c. C_yAnd a coordinate value representing the optical center of the camera on the y-axis in the camera coordinate system.

Through the process, the three-dimensional position information of the J key points of the second target in the camera coordinate system can be obtained

Namely, the first position information of a plurality of key points of the second target in the real camera coordinate system respectively.

In connection with the above S301, the method for training a posture detection model according to the embodiment of the present disclosure further includes:

s302: determining the relative positional relationship information based on the first positional information.

In a specific implementation, after first position information of a plurality of key points of the second object in the real camera coordinate system is obtained, the relative position relationship information of each key point in at least one key point pair in the real camera coordinate system can be determined based on the first position information of each key point.

Here, regarding one key point with a first position in each key point pair as a first key point, regarding a key point with a later position as a second key point, the relative position relationship information of each key point pair in the real camera coordinate system includes: a depth difference of a depth value of a first keypoint of a keypoint pair in the real camera coordinate system and a depth value of a second keypoint of the keypoint pair in the real camera coordinate system.

For example, if the first position information of the first keypoint p in the real camera coordinate system in the keypoint pair (p, q) is represented as: p (x)₁,y₁,z₁) The first position information of the second keypoint q in the real camera coordinate system is expressed as: q (x)₂,y₂,z₂) Then, the relative position relationship information of the key point pair (p, q) in the real camera coordinate system is: z is a radical of₁-z₂。

In addition, a virtual camera coordinate system corresponding to each virtual camera may be constructed based on at least one virtual camera, and a depth difference of each key point pair in the at least one key point pair in the virtual camera coordinate system may be determined according to first position information of a plurality of key points of each second target in the real camera coordinate system, respectively.

Illustratively, referring to fig. 4, an embodiment of the present disclosure provides a specific method for determining a depth difference of each key point pair in at least one key point pair in a virtual camera coordinate system based on the first location information, including:

s401: and for each key point pair, determining second position information of each key point in the key point pair in the first virtual camera coordinate system respectively based on first position information of two key points in the key point pair in a real camera coordinate system respectively and conversion relation information between the real camera coordinate system and the first virtual camera coordinate system.

In a specific implementation, the information of the conversion relationship between the real camera coordinate system and the second virtual camera coordinate system can be obtained by the following method:

determining position information of a center position point of the at least one second target in a real camera coordinate system and determining position information of a first virtual camera position point in the real camera coordinate system based on first position information of a keypoint of the at least one second target in the real camera coordinate system; and generating conversion relation information between the first virtual camera coordinate system corresponding to the first virtual camera position point and the real camera coordinate system based on the position information of the central position point in the real camera coordinate system and the position information of the first virtual camera position point in the real camera coordinate system.

Here, the central location point of the at least one second target in the second image may be determined, for example, in the following manner: after first position information of key points of a second target in a second image in a real camera coordinate system is determined, calculating the mean value of coordinate values of all the key points on the x axis in the real camera coordinate system as a first coordinate value of a central position point on the x axis in the real camera coordinate system; calculating the mean value of coordinate values of all key points on the y axis in the real camera coordinate system as a second coordinate value of the center position point on the y axis in the real camera coordinate system; and calculating the mean value of the coordinate values of the z axis of all the key points in the real camera coordinate system as the third coordinate value of the z axis of the central position point in the real camera coordinate system. The first coordinate value, the second coordinate value and the third coordinate value are position information of the central position point in the real camera coordinate system.

After determining the position information of the center position point of the at least one second target in the second image in the real camera coordinate system, the position information of the first virtual camera position point in the real camera coordinate system may be determined in the following manner:

determining a bounding sphere radius based on the position information of the center position point in the real camera coordinate system and the first position information of the key point of the at least one second target in the real camera coordinate system respectively; the enclosing ball encloses the key points of the at least one second target; determining position information of the first virtual camera position point in the real camera coordinate system on or outside a spherical surface of the bounding sphere based on the position information of the center position point in the real camera coordinate system and the bounding sphere radius.

Illustratively, taking the determination of the first virtual camera position point on a spherical surface surrounding the ball as an example, the center position point of all the second targets in the second image is represented as O_p(ii) a And in O_pNearby set-up of a series of virtual cameras

Wherein N is_cRepresenting the number of virtual cameras; t is^cRepresents the position of the c-th virtual camera in the real camera coordinate system, and T^cSatisfies the following formula (3):

wherein r represents a radius surrounding the sphere; theta to U [0,2 pi ], U to U [0,1 ].

R^cA rotational quaternion representing a transformation from the real camera coordinate system to a virtual camera coordinate system constructed based on the first virtual camera location point, and R^cExpressed by the following formula (4) and formula (5):

wherein the content of the first and second substances,

a rotation axis representing a transformation of the real coordinate system to a virtual camera coordinate system constructed based on the first virtual camera location points; phi is a^cRepresenting the rotation angle of the real coordinate system transformed to a virtual camera coordinate system constructed based on the first virtual camera location points.

O denotes the origin of the real camera coordinate system, i.e. the position of the real camera.

The above formula (4) and formula (5), that is, the conversion relationship information between the real camera coordinate system and the first virtual camera coordinate system.

After the conversion relation information is obtained, first position information of each key point in the key point pair in the real camera coordinate system can be converted into second position information in the first virtual camera coordinate system based on the conversion relation information.

S402: determining a depth difference of each key point pair in the first virtual camera coordinate system based on second position information of each key point in the key point pair in the first virtual camera coordinate system, wherein the first virtual camera coordinate system is any one of the different virtual camera coordinate systems.

In connection with the above S302, the method for training a posture detection model provided in the embodiment of the present disclosure further includes:

s303: and determining the loss of the neural network based on the relative position relation information.

In specific implementation, when the neural network loss is determined based on the relative position relationship information, the neural network loss is determined based on whether the front and back positions of the key point pair represented by the relative position relationship under different visual angles are correct or not; if the accuracy of the front and back positions of the key point pairs under different viewing angles is higher, the corresponding loss of the neural network is smaller.

In determining the neural network loss, the neural network loss is determined, for example, based on the depth difference of the at least one keypoint pair in the different camera coordinate system and the actual depth relationship information of the two keypoints in each keypoint pair in the different camera coordinate system.

Here, the actual depth relation information of the two keypoints in each keypoint pair in the different virtual camera coordinate systems is determined in advance for the second image annotation.

When determining the neural network loss, for example, for each key point pair in the at least one key point pair, determining a loss penalty value of each key point pair according to the depth difference of each key point pair in the different virtual camera coordinate systems and the actual depth relation information;

determining the neural network loss based on the loss penalty values respectively corresponding to the at least one key point pair;

for example, if the predicted depth relationship of the key point pair represented by the depth difference of the key point pair is consistent with the actual depth relationship information of the key point pair, the penalty value of the key point pair is 0;

and if the predicted depth relation of the key point pair represented by the depth difference of the key point pair is inconsistent with the actual depth relation information of the key point pair, the punishment value of the key point pair is greater than 0.

Thus, for a key point pair with the consistent information of the predicted depth relationship and the actual depth relationship, the corresponding penalty value is 0, that is, in the loss of the neural network, the value of the loss item corresponding to the key point pair is 0; for the key point pairs with inconsistent information of the predicted depth relationship and the actual depth relationship, the corresponding penalty value is greater than 0, namely, in the loss of the neural network, the value of the loss item corresponding to the key point is greater than 0, so that the value of the loss of the neural network is increased only if the wrong key point pair appears in the predicted first position information.

Illustratively, for any key point pair (p, q), the actual depth relationship information of the key point pair is defined as: s (p, q; c), wherein if the depth of p is less than the depth of q and the distance between p and q is greater than a preset distance threshold (i.e., p is closer to the camera than q), s (p, q; c) is 1; if the depth of p is greater than the depth of q and the distance between p and q is greater than a preset distance threshold, s (pqc;) 1-; if the distance between p and q is smaller than or equal to a preset distance threshold value, the depth of p is considered to be equal to the depth of q, and s (p, q; c) is 0.

Assuming that the set of key point pairs is denoted as I, the camera c loss satisfies the following equation (6):

wherein D (·) satisfies the following formula (7):

after p and q are converted from first position information in a real camera coordinate system to second position information in a first virtual camera coordinate system, the depth difference of p and q in the first virtual camera coordinate system is calculated based on the second position information of p and q in the first virtual camera coordinate system. Wherein n is^⊥A normal vector representing the real camera; n is^⊥R^cRepresenting a rotation of a normal vector of a real camera to a first virtual camera direction R corresponding to a first virtual camera coordinate system^cI.e. the normal vector of the first virtual camera;

representing a mapping point of p in a first virtual camera coordinate system;

representing the mapping point of q in the first virtual camera coordinate system.

Wherein if the first position information of p in the real camera coordinate system satisfies: p (x)₁,y₁,z₁)，

The second position information in the first virtual camera coordinate system satisfies:

if the first position information of q in the real camera coordinate system satisfies: q (x)₂,y₂,z₂)，

then

In the above equations (6) and (7), in the first virtual camera coordinate system: if the actual depth relation information s (p, q; c) between p and q is 1, i.e. the depth of p is smaller than the depth of q, and the distance between p and q is larger than the preset distance threshold, in this case:

if it is

Less than 0 indicates that the prediction for the key point pair is correct, and this time

Is negative and s (p, q; c) is 1, and further

The value of the negative value is the negative value,

is 0, i.e. the loss penalty for this key point pair is 0. Wherein n is a preset distance threshold.

If it is

If the value is greater than 0, the prediction error of the key point pair is represented, and the prediction error is represented at the moment

Is positive and since s (p, q; c) is 1, and further

In the case of a positive value, the value of,

has a value of

I.e. the loss penalty for this key point pair is greater than 0.

Similarly, if the actual depth relationship information s (p, q; c) between p and q is-1, i.e. the depth of p is greater than the depth of q, and the distance between p and q is greater than the preset distance threshold, in this case:

if it is

If the value is greater than 0, the prediction of the key point pair is correct, and the time is

Is positive and since s (p, q; c) is-1, and further

The value of the negative value is the negative value,

is 0, i.e. the loss penalty for this key point pair is equal to 0.

If it is

Less than 0 indicates a prediction error for the key point pair, at which point

Is negative and s (p, q; c) — 1, and further

Is positiveThe value of the one or more of,

has a value of

I.e. the loss penalty for this key point pair is greater than 0.

Further, the loss value for any virtual camera coordinate system is obtained by the above formula.

Similarly, the loss value for the real camera coordinate system can also be obtained by the above formula.

And then obtaining the loss of the neural network based on the loss values of the key point pairs under the camera coordinate systems respectively corresponding to different visual angles.

For example, the average value of the coordinate values of the key point pairs in the camera coordinate systems respectively corresponding to different viewing angles may be used as the neural network loss.

In connection with the above S303, the method for training a posture detection model according to the embodiment of the present disclosure further includes:

s304: and training the neural network to be trained based on the neural network loss to obtain the posture detection model.

In the embodiment of the disclosure, the posture detection model for three-dimensional posture detection is trained by using the relative position relationship of the key point of at least one second target in the second image under different viewing angles, so that the posture detection model can learn the mutual position relationship between different second targets, and when the posture detection model is used for posture detection of the first target in the first image, a posture detection result with higher detection precision can be obtained for the shielded first target.

It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.

Based on the same inventive concept, the embodiment of the present disclosure further provides a gesture detection device corresponding to the gesture detection method, and as the principle of solving the problem of the device in the embodiment of the present disclosure is similar to the gesture detection method in the embodiment of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are not repeated.

Referring to fig. 5, a schematic diagram of an attitude detection apparatus provided in an embodiment of the present disclosure is shown, where the apparatus includes: an acquisition module 51 and a processing module 52; wherein the content of the first and second substances,

an obtaining module 51, configured to obtain a first image;

a processing module 52, configured to perform posture detection processing on the first image by using a pre-trained posture detection model to obtain a posture detection result of at least one first target in the first image;

In a possible implementation, the different camera coordinate systems include: a real camera coordinate system, and at least one virtual camera coordinate system; or

The different camera coordinate systems include: different virtual camera coordinate systems.

In a possible embodiment, the method further comprises: a training module 53, configured to train to obtain the posture detection model by using the following method:

determining first position information of a plurality of key points of each second target in the second image in a real camera coordinate system by using a neural network to be trained;

determining the relative positional relationship information based on the first positional information;

determining a neural network loss based on the relative position relationship information;

and training the neural network to be trained based on the neural network loss to obtain the posture detection model.

In a possible embodiment, the training module 53, when determining the depth difference of each key point pair of the at least one key point pair in the virtual camera coordinate system based on the first location information, is configured to:

for each key point pair, determining second position information of each key point in the key point pair in a first virtual camera coordinate system respectively based on first position information of two key points in the key point pair in a real camera coordinate system respectively and conversion relation information between the real camera coordinate system and the first virtual camera coordinate system;

determining a depth difference of each key point pair in the first virtual camera coordinate system based on second position information of each key point in the key point pair in the first virtual camera coordinate system, wherein the first virtual camera coordinate system is any one of the virtual camera coordinate systems.

In a possible implementation, the training module 53 is further configured to:

determining position information of a center position point of the at least one second target in a real camera coordinate system and determining position information of a first virtual camera position point in the real camera coordinate system based on first position information of a keypoint of the at least one second target in the real camera coordinate system;

and generating conversion relation information between the first virtual camera coordinate system corresponding to the first virtual camera position point and the real camera coordinate system based on the position information of the central position point in the real camera coordinate system and the position information of the first virtual camera position point in the real camera coordinate system.

In one possible embodiment, the training module 53, when determining the position information of the first virtual camera position point in the real camera coordinate system, is configured to:

determining a bounding sphere radius based on the position information of the center position point in the real camera coordinate system and the first position information of the key point of the at least one second target in the real camera coordinate system respectively; the enclosing ball encloses the key points of the at least one second target;

determining position information of the first virtual camera position point in the real camera coordinate system on or outside a spherical surface of the bounding sphere based on the position information of the center position point in the real camera coordinate system and the bounding sphere radius.

In one possible embodiment, the training module 53, when determining the neural network loss based on the relative position relationship information, is configured to:

determining the neural network loss based on the depth difference of the at least one keypoint pair in the different camera coordinate systems and actual depth relation information of two keypoints in each keypoint pair in the different camera coordinate systems.

In a possible implementation, the training module 53, when determining that the neural network is lost based on the depth difference of the at least one keypoint pair in the different camera coordinate system and the actual depth relation information of the two keypoints in each keypoint pair in the different camera coordinate system, is configured to:

for each key point pair in the at least one key point pair, determining a loss penalty value of each key point pair according to the depth difference of each key point pair in the different virtual camera coordinate systems and the actual depth relation information;

and determining the neural network loss based on the loss penalty values respectively corresponding to the at least one key point pair.

The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.

The embodiment of the present disclosure further provides a computer device 10, as shown in fig. 6, which is a schematic structural diagram of the computer device 10 provided in the embodiment of the present disclosure, and includes:

a processor 11 and a memory 12; the memory 12 stores machine-readable instructions executable by the processor 11, which when executed by a computer device are executed by the processor to perform the steps of: acquiring a first image; carrying out posture detection processing on the first image by using a pre-trained posture detection model to obtain a posture detection result of at least one first target in the first image;

For the specific execution process of the instruction, reference may be made to the steps of the defect detection method described in the embodiments of the present disclosure, and details are not described here.

The embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the gesture detection method described in the above method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.

The computer program product of the gesture detection method provided in the embodiments of the present disclosure includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the steps of the gesture detection method described in the above method embodiments, which may be specifically referred to in the above method embodiments, and are not described herein again.

The embodiments of the present disclosure also provide a computer program, which when executed by a processor implements any one of the methods of the foregoing embodiments. The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that: the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the scope of the present disclosure is not limited thereto, and although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive of the technical solutions described in the foregoing embodiments or equivalent technical features thereof within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure, and should be construed as being included therein. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims

1. An attitude detection method, characterized by comprising:

acquiring a first image;

carrying out posture detection processing on the first image by using a pre-trained posture detection model to obtain a posture detection result of at least one first target in the first image;

2. The pose detection method according to claim 1, wherein the relative positional relationship information at the different view angles includes a depth difference of each of at least one key point pair at different camera coordinate systems; the keypoint pairs include any two keypoints of at least one second target in the second image.

3. The pose detection method of claim 2, wherein the different camera coordinate systems comprise: a real camera coordinate system, and at least one virtual camera coordinate system, or the different camera coordinate systems include: different virtual camera coordinate systems.

4. The pose detection method according to claim 2 or 3, wherein the pose detection model is trained using the following method:

5. The pose detection method of claim 4, wherein determining the depth difference for each of at least one key point pair in a virtual camera coordinate system based on the first position information comprises:

6. The gesture detection method according to claim 4, further comprising:

7. The pose detection method of claim 6, wherein the determining the position information of the first virtual camera position point in the real camera coordinate system comprises:

8. The pose detection method according to any one of claims 4 to 7, wherein the determining a neural network loss based on the relative positional relationship information comprises:

9. The pose detection method of claim 8, wherein the determining the neural network loss based on the depth difference of the at least one keypoint pair in the different camera coordinate system and actual depth relationship information of two keypoints in each keypoint pair in the different camera coordinate system comprises:

10. An attitude detection device characterized by comprising:

the acquisition module is used for acquiring a first image;

the processing module is used for carrying out posture detection processing on the first image by utilizing a pre-trained posture detection model to obtain a posture detection result of at least one first target in the first image;

11. A computer device, comprising: an interconnected processor and memory, the memory storing machine-readable instructions executable by the processor which, when executed by a computer device, are executed by the processor to implement a gesture detection method as claimed in any one of claims 1 to 9.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, performs the gesture detection method according to any one of claims 1 to 9.