CN116386087B

CN116386087B - Target object processing method and device

Info

Publication number: CN116386087B
Application number: CN202310353616.6A
Authority: CN
Inventors: 陈汉苑; 罗斌; 何俊彦; 项王盟
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2023-03-31
Filing date: 2023-03-31
Publication date: 2024-01-09
Anticipated expiration: 2043-03-31
Also published as: CN116386087A

Abstract

The embodiment of the specification provides a target object processing method and a target object processing device, wherein the target object processing method comprises the following steps: determining a two-dimensional coordinate sequence of an articulation point of a target object, wherein the two-dimensional coordinate sequence comprises two-dimensional coordinate information corresponding to at least two articulation points; determining the characteristics of the target joint points corresponding to the target joint points according to the two-dimensional coordinate information corresponding to the target joint points, wherein the target joint points are any joint point in the at least two joint points; determining initial bone edge characteristics associated with at least two target articulation points according to the target articulation point characteristics corresponding to the at least two target articulation points; and processing the target joint point characteristics corresponding to the target joint points and the initial bone edge characteristics associated with the target joint points, and determining a three-dimensional coordinate sequence of the joint points of the target object according to the processing result.

Description

Target object processing method and device

Technical Field

The embodiment of the specification relates to the technical field of computers, in particular to a target object processing method.

Background

Analysis of the pose of the human body by image or video is an important issue in computer vision research. The human body posture estimation is widely applied to various fields such as human-computer interaction, film special effects and the like. Human body posture estimation refers to a process of estimating three-dimensional coordinates of each main joint point of a human body in an image from the image and representing the human body posture in the image.

However, currently, in the process of human body posture estimation, three-dimensional joint point coordinates may be generally determined using two-dimensional joint point coordinates of a human body, and human body posture estimation is implemented according to the three-dimensional joint point coordinates. When the human body motion only has slight change, the accuracy of the human body gesture determined by the joint point coordinates is poor and the accuracy is low. Therefore, an effective solution is needed to solve the above problems.

Disclosure of Invention

In view of this, the present embodiment provides a target object processing method. One or more embodiments of the present specification also relate to a target object processing apparatus, a computing device, an AR/VR device, a computer-readable storage medium, and a computer program, which solve the technical drawbacks of the related art.

According to a first aspect of embodiments of the present specification, there is provided a target object processing method, including:

determining a two-dimensional coordinate sequence of an articulation point of a target object, wherein the two-dimensional coordinate sequence comprises two-dimensional coordinate information corresponding to at least two articulation points;

determining the characteristics of the target joint points corresponding to the target joint points according to the two-dimensional coordinate information corresponding to the target joint points, wherein the target joint points are any joint point in the at least two joint points;

Determining initial bone edge characteristics associated with at least two target articulation points according to the target articulation point characteristics corresponding to the at least two target articulation points;

and processing the target joint point characteristics corresponding to the target joint points and the initial bone edge characteristics associated with the target joint points, and determining a three-dimensional coordinate sequence of the joint points of the target object according to the processing result.

According to a second aspect of embodiments of the present specification, there is provided a target object processing apparatus including:

the first determining module is configured to determine a two-dimensional coordinate sequence of the joint point of the target object, wherein the two-dimensional coordinate sequence comprises two-dimensional coordinate information corresponding to at least two joint points;

the second determining module is configured to determine a target joint point characteristic corresponding to a target joint point according to two-dimensional coordinate information corresponding to the target joint point, wherein the target joint point is any joint point in the at least two joint points;

the third determining module is configured to determine initial bone edge characteristics associated with at least two target articulation points according to the target articulation point characteristics corresponding to the at least two target articulation points;

The processing module is configured to process the target joint point characteristics corresponding to the target joint point and the initial bone edge characteristics associated with the target joint point, and determine a three-dimensional coordinate sequence of the joint point of the target object according to a processing result.

According to a third aspect of embodiments of the present specification, there is provided a computing device comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the method described above.

According to a fourth aspect of embodiments of the present description, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the above-described method.

According to a fifth aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above method.

According to a sixth aspect of embodiments of the present specification, there is provided an AR/VR device comprising:

a memory and a processor;

One embodiment of the present disclosure provides a target object processing method, determining a two-dimensional coordinate sequence of an articulation point of a target object, where the two-dimensional coordinate sequence includes two-dimensional coordinate information corresponding to at least two articulation points; determining the characteristics of the target joint points corresponding to the target joint points according to the two-dimensional coordinate information corresponding to the target joint points, wherein the target joint points are any joint point in the at least two joint points; determining initial bone edge characteristics associated with at least two target articulation points according to the target articulation point characteristics corresponding to the at least two target articulation points; and processing the target joint point characteristics corresponding to the target joint points and the initial bone edge characteristics associated with the target joint points, and determining a three-dimensional coordinate sequence of the joint points of the target object according to the processing result. According to the method, the initial skeleton edge characteristics associated with the joint points of the target object are determined, and the three-dimensional coordinates of the joint points are determined by utilizing the combination of the initial skeleton edge characteristics and the target joint point characteristics, so that the accuracy of the determined three-dimensional coordinates can be improved, and the accuracy and precision of the subsequent human body posture estimation are further ensured.

Drawings

Fig. 1 is a schematic application scenario of a target object processing method according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a method for processing a target object according to an embodiment of the present disclosure;

FIG. 3 is a schematic flow chart of a coordinate prediction model in a target object processing method according to an embodiment of the present disclosure;

FIG. 4 is a process flow diagram of a target object processing method according to one embodiment of the present disclosure;

FIG. 5 is a flowchart of a processing procedure of an end-side device in a target object processing method according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a target object processing apparatus according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of a computing device provided in one embodiment of the present description.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many other forms than described herein and similarly generalized by those skilled in the art to whom this disclosure pertains without departing from the spirit of the disclosure and, therefore, this disclosure is not limited by the specific implementations disclosed below.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "at … …" or "at … …" or "responsive to a determination", depending on the context.

It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) according to the embodiments of the present disclosure are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.

In the present specification, a target object processing method is provided, and the present specification relates to a target object processing apparatus, a computing device, an AR/VR device, and a computer-readable storage medium, one by one, in the following embodiments.

Referring to fig. 1, fig. 1 shows an application scenario of a target object processing method according to an embodiment of the present disclosure.

Fig. 1 includes a cloud-side device 102 and an end-side device 104, where the cloud-side device 102 may be understood as a cloud server, and of course, in another implementation, the cloud-side device 102 may be replaced by a physical server; the end-side devices 104 include, but are not limited to, desktop computers, notebook computers, VR (Virtual Reality) devices, AR (Augmented Reality) devices, and the like; for ease of understanding, in the embodiments of the present disclosure, the cloud-side device 102 is taken as a cloud server, and the end-side device 104 is taken as a notebook computer as an example.

In implementation, the end-side device 104 may acquire video data of the user and send the video data of the user to the cloud-side device 102. The cloud-side device 102 may extract a two-dimensional coordinate sequence of each node of the user from the video data of the user, where the two-dimensional coordinate sequence includes two-dimensional coordinate information of each node of the user, perform feature extraction on the two-dimensional coordinate information of each node, and perform downsampling and upsampling, so as to determine a target node feature of each node. According to the target joint point characteristics of at least two joint points, determining initial bone edge characteristics associated with the at least two joint points, processing the target joint point characteristics and the initial bone edge characteristics of the joint points, and determining three-dimensional coordinates of the joint points according to processing results, so as to construct a three-dimensional coordinate sequence of each joint point of a user. The three-dimensional coordinate sequence may be used for subsequent determination of the user's body posture. The cloud-side device 102 may transmit the three-dimensional coordinate sequence to the end-side device 104, from which the end-side device 104 determines the user's human body pose. In addition, the cloud-side device may also determine the human body posture of the user according to the three-dimensional coordinate sequence, and then send the human body posture of the user to the end-side device 104.

Referring to fig. 2, fig. 2 shows a flowchart of a target object processing method according to an embodiment of the present disclosure, which specifically includes the following steps.

Step 202: and determining a two-dimensional coordinate sequence of the joint point of the target object, wherein the two-dimensional coordinate sequence comprises two-dimensional coordinate information corresponding to at least two joint points.

The target object may be a user who needs to perform three-dimensional posture estimation, or may be a virtual person, an animal, or the like. The joint point is understood to be a joint point on a bone, such as a wrist, elbow, ankle, etc. of a human body. A two-dimensional coordinate sequence of an articulation point can be understood as a set of coordinates consisting of the two-dimensional coordinates of the articulation point. For example, the two-dimensional coordinate sequence of the joint point may include two-dimensional coordinates of a wrist joint, two-dimensional coordinates of an ankle joint, two-dimensional coordinates of an elbow joint, and the like of the human body. Two-dimensional coordinate information may be understood as a representation of two-dimensional coordinates, such as in the form of images, but also in the form of video frames. Then, the two-dimensional coordinate sequence may include a plurality of video frames, and each video frame may include two-dimensional coordinates of the node in the current video frame.

Specifically, the target object processing method can be applied to a motion gesture evaluation scene of a user. The video of the user in motion can be obtained, the video is split into a plurality of video frames, and the two-dimensional coordinates of the joint points of the user are extracted from the video frames by utilizing a two-dimensional coordinate extraction model to form a two-dimensional coordinate sequence. Therefore, the subsequent generation of the three-dimensional coordinate sequence is realized, the motion gesture of the user is determined, and the comparison can be carried out between the determined motion gesture of the user and the reference motion gesture, so that whether the motion gesture of the user is standard or not is judged. The coordinates of each joint point of the human body in the three-dimensional physical space can be predicted from the video, and whether the limb actions meet the requirements or not can be judged through the relationships of the distances among the coordinates of each joint point. Such as whether the motion is in place during exercise. In addition, the target object processing method can also be used for human body posture evaluation scenes of virtual people and the like, and the embodiment of the specification is not limited to the method.

For example, the two-dimensional coordinate sequence of the node of the target object may include: two-dimensional coordinates A1 of the user's node a in video frame 1, two-dimensional coordinates A2 in video frame 2, two-dimensional coordinates A3 in video frame 3, and two-dimensional coordinates B1 of the user's node B in video frame 1, two-dimensional coordinates B2 in video frame 2, and two-dimensional coordinates A3 in video frame 3.

In specific implementation, the two-dimensional coordinate sequence of the node of the target object can be determined according to the video data of the target object. The specific implementation mode is as follows:

the determining the two-dimensional coordinate sequence of the joint point of the target object comprises the following steps:

and determining video data of a target object, and extracting a two-dimensional coordinate sequence of a joint point of the target object according to the video data.

The video data of the target object may be understood to include video data of the target object, for example, the video data may be video data of a user performing a hand-lifting motion, specifically, in the 1 st video frame of the video data, a user's hand may naturally droop, in the 2 nd video frame, the user's hand may be located at a shoulder height of the user, in the 3 rd video frame, the user's hand may have been raised above the top of the user's head, the 3 video frames form video data of the user performing the hand-lifting motion, and then the wrist joint and the finger joint included in the user's hand are different in two-dimensional coordinates in the 3 video frames. Alternatively, the video data may also be video data of a user performing an exercise action, and in particular, the two-dimensional coordinates of the user's joint point in each video frame in the video data may only be slightly different while the user is maintaining a certain exercise action.

Specifically, a two-dimensional coordinate extraction model may be used to extract a two-dimensional coordinate sequence of a joint point of the target object, where the two-dimensional coordinate extraction model may be a coordinate extraction model based on deep learning, for example, may be a gaussian model, a neural network model, or the like, video data of the target object is input into the two-dimensional coordinate extraction model, the video data of the target object is divided into a plurality of video frames by using the two-dimensional coordinate extraction model, and two-dimensional coordinates of the joint point of the target object in each video frame are extracted to obtain a two-dimensional coordinate sequence of the joint point of the target object.

Or, the video data of the target object may be split into a plurality of video frames, the plurality of video frames are input into a two-dimensional coordinate extraction model, and the two-dimensional coordinates of the joint points of the target object in the video frames are extracted by using the two-dimensional coordinate extraction model, so as to obtain a two-dimensional coordinate sequence of the joint points of the target object. The extraction of the two-dimensional coordinate sequence can also be realized based on a simulation algorithm.

In an alternative embodiment of the present disclosure, since the video frame is represented in the form of an image, the two-dimensional coordinates of the node point may also be extracted using the pixels of the image. In specific implementation, the positions of the nodes to be extracted in the image can be determined by using marks such as colors, and the two-dimensional coordinates of the nodes can be determined by traversing the pixel values of the image. The embodiment of the present disclosure does not limit a specific method for extracting the two-dimensional coordinate sequence of the joint point of the target object, and a person skilled in the art may extract the two-dimensional coordinate sequence of the joint point of the target object using any method or model capable of extracting the two-dimensional coordinates.

In summary, a data basis is provided for the subsequent determination of the three-dimensional coordinate sequence of the joint point of the target object by extracting the two-dimensional coordinate sequence of the joint point of the target object, so that the determination of the three-dimensional gesture of the target object is further realized.

Step 204: and determining the characteristics of the target joint point corresponding to the target joint point according to the two-dimensional coordinate information corresponding to the target joint point, wherein the target joint point is any joint point in the at least two joint points.

Specifically, after determining the two-dimensional coordinate sequence of the joint point of the target object, feature extraction can be performed on the two-dimensional coordinate information corresponding to any joint point included in the two-dimensional coordinate sequence, and the feature of the target joint point corresponding to the joint point can be determined.

In a specific implementation, the determining, according to the two-dimensional coordinate information corresponding to the target joint point, the target joint point feature corresponding to the target joint point includes:

extracting features of two-dimensional coordinate information corresponding to the target node to obtain initial node features corresponding to the target node;

and performing downsampling processing and upsampling processing on the initial joint point characteristics to obtain target joint point characteristics corresponding to the target joint point.

The initial joint point characteristics corresponding to the target joint point can include global characteristics and local characteristics of two-dimensional coordinate information of the target joint point.

Specifically, the feature extractor may be used to perform feature extraction on an image or video frame containing two-dimensional coordinates corresponding to the target node, so as to obtain global features and local features corresponding to the target node. And carrying out downsampling processing and upsampling processing on the global features and the local features to obtain target joint point features corresponding to the target joint points. Since the downsampling and upsampling processes can change the dimensions of the initial joint point feature, the dimensions of the target joint point feature may be greater than or equal to the dimensions of the initial joint point feature. For example, for an initial joint feature with dimensions of 64×64, the initial joint feature is subjected to downsampling to obtain an initial joint feature with dimensions of 32×32, and then the initial joint feature is subjected to upsampling to obtain a target joint feature with dimensions of 128×128. Or, for the initial joint point feature with the dimension of 64×64, performing downsampling processing on the initial joint point feature to obtain an initial joint point feature with the dimension of 32×32, and performing upsampling processing on the initial joint point feature to obtain a target joint point feature with the dimension of 64×64. When the downsampling process and the upsampling process are performed on the initial joint point feature, the number of downsampling processes and the number of upsampling processes may be preset, for example, the downsampling process may be performed on the initial joint point feature twice, the upsampling process may be performed on the initial joint point feature once, or the downsampling process may be performed on the initial joint point feature twice, and the upsampling process may be performed twice.

In practical application, the feature extractor may be a multi-head self-attention mechanism and an adjacency matrix, wherein the adjacency matrix may be used for describing a connection relationship between human skeleton structures, and initial joint point features extracted by the adjacency matrix can increase human skeleton semantics as prior information. The convolution layer can be utilized to carry out downsampling operation so as to realize feature aggregation in the time dimension, and concretely, the convolution layer can be utilized to carry out downsampling processing so as to control the size of the output feature dimension. The upsampling process may be performed by interpolation, which may be, for example, nearest neighbor interpolation or bilinear interpolation. In addition, the upsampling process may also be implemented by deconvolution or hole convolution, which is not limited in this embodiment.

In addition, a plurality of initial joint feature extraction modules, and a corresponding plurality of up-sampling modules and down-sampling modules may be provided, and the dimensions of each up-sampling and down-sampling may be set.

Along the above example, feature extraction can be performed on two-dimensional coordinates A1, A2 and A3 of the joint point a by using an adjacency matrix to obtain initial joint point features of the joint point a, and downsampling and upsampling are performed on the initial joint point features of the joint point a to obtain target joint point features of the joint point a; and, the two-dimensional coordinates B1, B2, B3 of the joint point B are subjected to feature extraction to obtain an initial joint point feature of the joint point B, and the initial joint point feature of the joint point B is subjected to downsampling processing and upsampling processing to obtain a target joint point feature of the joint point B.

Step 206: and determining initial bone edge characteristics associated with at least two target nodes according to the target node characteristics corresponding to the at least two target nodes.

Specifically, after the target node characteristics corresponding to the target nodes are determined, initial bone edge characteristics associated with at least two target nodes can be determined according to the target node characteristics corresponding to the at least two target nodes.

The initial bone edge feature associated with at least two target nodes is understood to be the initial bone edge feature between at least two target nodes. Such as for an initial skeletal edge feature between the target node 1 and the target node 2, which is associated with the target node 1, as well as the target node 2.

For example, an initial skeletal edge feature between the wrist and elbow joints may be determined based on the target joint point feature for the wrist joint and the target joint point feature for the elbow joint, the initial skeletal edge feature being associated with the wrist joint and also with the elbow joint. Alternatively, the initial bone edge characteristics between the elbow joint and the knuckle may be determined based on the target joint point characteristics corresponding to the wrist joint, the target joint point characteristics corresponding to the elbow joint, and the target joint point characteristics corresponding to the knuckle, where the initial bone edge characteristics are associated with the wrist joint, the elbow joint, and the knuckle.

Along the above example, the initial bone edge characteristics associated with the joint point a and the joint point B may be determined according to the target joint point characteristics of the joint point a and the target joint point characteristics of the joint point B.

In a specific implementation, the determining, according to the target joint point features corresponding to the at least two target joint points, initial bone edge features associated with the at least two target joint points includes:

and performing splicing processing on target joint characteristics corresponding to at least two target joints to obtain initial skeleton edge characteristics associated with the at least two target joints.

Along the above example, the target joint point feature of the joint point a and the target joint point feature of the joint point B may be spliced to obtain the initial bone edge feature associated with the joint point a and the joint point B.

In addition, the initial bone edge feature may be determined by adding, subtracting, or multiplying the target joint point features of the target joint points on the shortest path of the at least two target joint points. The shortest path of at least two target nodes is understood to be the shortest path in the human body structure connecting the at least two target nodes. For example, the wrist joint is also included in the shortest path between the knuckle and the elbow joint, so that the target joint point characteristic of the knuckle, the target joint point characteristic of the elbow joint and the target joint point characteristic of the wrist joint can be spliced or added, subtracted, multiplied and the like to determine the initial bone edge characteristic between the knuckle and the elbow joint.

In sum, by performing stitching processing on the target joint characteristics of at least two target joints, the initial bone edge characteristics can be determined, a data basis is provided for the determination of the subsequent three-dimensional coordinates, and the determination of the subsequent human body posture is more accurate.

Step 208: and processing the target joint point characteristics corresponding to the target joint points and the initial bone edge characteristics associated with the target joint points, and determining a three-dimensional coordinate sequence of the joint points of the target object according to the processing result.

Specifically, the three-dimensional coordinate sequence of the joint point of the target object can be determined according to the characteristics of the target joint point corresponding to the target joint point and the initial bone edge characteristics associated with the target joint point.

In a specific implementation, the processing the target joint point feature corresponding to the target joint point and the initial bone edge feature associated with the target joint point, and determining the three-dimensional coordinate sequence of the joint point of the target object according to the processing result includes:

performing fusion processing on the target joint point characteristics corresponding to the target joint point and the initial bone edge characteristics associated with the target joint point to obtain fusion characteristics corresponding to the target joint point;

Performing up-sampling processing on the fusion characteristics, and determining three-dimensional coordinates corresponding to the target node according to processing results;

and constructing a three-dimensional coordinate sequence of the joint point of the target object according to the three-dimensional coordinate corresponding to the target joint point.

Specifically, the method can perform fusion processing on the target joint point characteristics corresponding to the target joint point and the initial bone edge characteristics associated with the target joint point to obtain fusion characteristics corresponding to the target joint point, perform upsampling processing on the fusion characteristics to obtain the fusion characteristics after upsampling processing, input the fusion characteristics after upsampling processing into the full-connection layer to obtain the three-dimensional coordinates corresponding to the target joint point output by the full-connection layer, and construct a three-dimensional coordinate sequence of the joint point of the target object according to the three-dimensional coordinates of the target joint point.

Along the above example, the target joint point feature corresponding to the joint point a and the initial bone edge feature associated with the joint point a (i.e., the initial bone edge feature between the joint point a and the joint point B) may be processed, to obtain the fusion feature AA corresponding to the joint point a, and the fusion feature AA may be up-sampled, and the three-dimensional coordinate corresponding to the joint point a may be determined according to the processing result. And similarly processing the joint point B, and determining the corresponding three-dimensional coordinates of the joint point B. And constructing a three-dimensional coordinate sequence of the joint point of the target object according to the three-dimensional coordinates corresponding to the joint point A and the three-dimensional coordinates corresponding to the joint point B.

In sum, by combining the characteristics of the target joint points and the characteristics of the initial bone edges, the bone edges between the target joint points are introduced, so that the connection relation between the joint points can be further reflected, the three-dimensional coordinates of the finally determined joint points are more accurate, and the precision and accuracy of human body posture estimation are improved.

Specifically, the cross attention mechanism processing can be performed on the target joint point feature and the initial bone edge feature, and the fusion processing can also be performed on the target joint point feature and the initial bone edge feature, for example, the addition or multiplication processing is performed on the target joint point feature and the initial bone edge feature, so as to obtain the fusion feature, and the specific implementation manner is as follows:

the fusing processing is performed on the target joint point characteristics corresponding to the target joint point and the initial bone edge characteristics associated with the target joint point to obtain the fusion characteristics corresponding to the target joint point, including:

and according to a cross attention mechanism, carrying out fusion processing on the target node characteristics corresponding to the target node and the initial skeleton edge characteristics associated with the target node to obtain fusion characteristics corresponding to the target node.

Or, according to a preset fusion calculation method, carrying out fusion processing on the target node characteristics corresponding to the target node and the initial skeleton edge characteristics associated with the target node to obtain fusion characteristics corresponding to the target node.

The preset fusion calculation method can comprise calculation modes such as addition or multiplication of the target joint point characteristics and the initial bone edge characteristics.

In practice, the cross-attention mechanism may be implemented based on a transducer structure. A transducer structure can be understood as a transformation model that computes representations of its inputs and outputs based on self-attention mechanisms. The performance of the encoder in the model is optimized using cross-attention mechanism processing, thereby enabling the accuracy of the combination of the target joint point feature and the initial bone edge feature to be improved.

In an alternative embodiment of the present disclosure, other types of models may be used for feature processing, such as neural network models, and the embodiment of the present disclosure is not limited herein.

In addition, in order to further ensure the resolution of the bone edge features, so that the semantics represented by the introduced bone edge features are rich and accurate, the initial bone edge features can be mapped, and after the target bone edge features are obtained, the target joint point features and the target bone edge features are processed, and the specific implementation mode is as follows:

after determining the initial bone edge characteristics associated with the at least two target nodes according to the target node characteristics corresponding to the at least two target nodes, the method further comprises:

Mapping the initial bone edge feature to obtain target bone edge features associated with the at least two target joint points, wherein the dimension of the target bone edge features is larger than that of the initial bone edge features;

and processing the target joint point characteristics corresponding to the target joint points and the target bone edge characteristics associated with the target joint points, and determining a three-dimensional coordinate sequence of the joint points of the target object according to the processing result.

The dimensions may include, among other things, the gray scale dimension, the color dimension, or the direction vector dimension of the feature.

In practical applications, the mapping process for the initial bone edge features may be implemented by a linear layer or a convolution layer. Alternatively, the method may further utilize laplace mapping to map the initial bone edge feature, and map the initial bone edge feature to a multidimensional space, so as to obtain a target bone edge feature corresponding to the initial bone edge feature, where the dimension of the target bone edge feature is greater than that of the initial bone edge feature. For example, for a two-dimensional initial bone edge feature, it may be mapped to a four-dimensional space, thereby obtaining a four-dimensional target bone edge feature, and achieving feature fitting.

The edge features are mapped to obtain target skeleton edge features associated with at least two target articulation points, the target articulation point features corresponding to the target articulation points and the target skeleton edge features associated with the target articulation points are subjected to cross attention mechanism processing, and three-dimensional coordinates corresponding to the target articulation points are determined according to processing results.

Along the above example, after determining the initial bone edge features associated with the node a and the node B, mapping processing may be performed on the initial bone edge features to obtain target bone edge features associated with the node a and the node B, cross attention mechanism processing is performed on the target node features corresponding to the node a and the target bone edge features associated with the node a, three-dimensional coordinates corresponding to the node a are determined according to the processing result, and correspondingly, cross attention mechanism processing is performed on the target node features corresponding to the node B and the target bone edge features associated with the node B, and three-dimensional coordinates corresponding to the node B are determined according to the processing result. And constructing a three-dimensional coordinate sequence of the joint point of the target object according to the three-dimensional coordinates corresponding to the joint point A and the three-dimensional coordinates corresponding to the joint point B.

It can be understood that, when the target joint point feature and the target bone edge feature are processed, the processing of the cross-attention mechanism or the preset fusion calculation method can be also implemented, and the detailed description is not repeated here.

In sum, the target bone edge characteristics are obtained by mapping the initial bone edge characteristics, the acquisition of the high-order bone edge characteristics is realized, the resolution and the richness of the bone edge characteristics are further improved, the subsequent three-dimensional coordinate determination can be performed through richer and more accurate information, and the accuracy of human body posture estimation is further improved.

In practical application, after determining the three-dimensional coordinate sequence of the joint point of the target object according to the processing result, the method further includes:

and determining the gesture of the target object according to the three-dimensional coordinate sequence of the joint point of the target object.

Specifically, the three-dimensional coordinate sequence of the joint point of the target object can be input into the gesture determination model by using the gesture determination model, so as to obtain the three-dimensional gesture of the target object output by the gesture determination model.

In practical applications, the foregoing steps 204 to 208 may be implemented by a coordinate prediction model, where the coordinate prediction model may include a feature extractor, an upsampling module, a downsampling module, and a fusion module, and the specific implementation manner is as follows:

after the two-dimensional coordinate sequence of the node of the target object is determined, the method further comprises the following steps:

and inputting the two-dimensional coordinate sequence into a coordinate prediction model to obtain a three-dimensional coordinate sequence of the joint point of the target object output by the coordinate prediction model.

Referring to fig. 3, fig. 3 is a schematic flow chart of a coordinate prediction model in a target object processing method according to an embodiment of the present disclosure, and specific steps are as follows.

Step 302: inputting the two-dimensional coordinate sequence of each joint point of the target object into a coordinate prediction model, and extracting features of the two-dimensional coordinates of the joint point through a feature extractor to obtain initial joint point features of a first dimension of the joint point.

Step 304: and performing downsampling processing on the initial joint point characteristics of the first dimension by using a downsampling module to obtain initial joint point characteristics of the second dimension.

Step 306: and performing up-sampling processing on the initial joint point characteristics of the second dimension by using an up-sampling module to obtain target joint point characteristics of the joint point.

Step 308: and splicing the target joint characteristics of at least two joints to obtain initial skeleton edge characteristics, and mapping the initial skeleton edge characteristics by using a convolution layer to obtain target skeleton edge characteristics.

Step 310: and performing cross attention mechanism processing on the target joint point characteristics and the target bone edge characteristics of the joint points to obtain fusion characteristics of the joint points.

Step 312: and carrying out up-sampling treatment on the fusion characteristics to obtain the fusion characteristics after the up-sampling treatment.

Step 314: and determining the three-dimensional coordinates corresponding to the joint point according to the fusion characteristics after the up-sampling treatment, constructing a three-dimensional coordinate sequence of the joint point of the target object according to the three-dimensional coordinates corresponding to the joint point, and outputting the three-dimensional coordinate sequence.

In practical application, the training step of the coordinate prediction model includes:

determining a two-dimensional coordinate sample set of an articulation point of a target object and a three-dimensional coordinate sequence label corresponding to the two-dimensional coordinate sample set, wherein the two-dimensional coordinate sample set comprises two-dimensional coordinate information samples corresponding to at least two articulation points;

inputting the two-dimensional coordinate sample set into a coordinate prediction model, and determining the characteristics of a predicted target joint point corresponding to a target joint point according to a two-dimensional coordinate information sample corresponding to the target joint point by using the coordinate prediction model, wherein the target joint point is any joint point in the at least two joint points;

according to the predicted target joint point characteristics corresponding to at least two target joint points, determining predicted initial bone edge characteristics associated with the at least two target joint points;

processing the predicted target joint point characteristics corresponding to the target joint point and the predicted initial bone edge characteristics associated with the target joint point, and determining a predicted three-dimensional coordinate sequence of the joint point of the target object according to a processing result;

And training the coordinate prediction model by utilizing the predicted three-dimensional coordinate sequence and the three-dimensional coordinate sequence label until the coordinate prediction model meeting the training stop condition is obtained.

The training stopping condition can be understood as reaching a preset iteration number or reaching a preset loss value threshold value by a loss value of the coordinate prediction model.

Specifically, the loss value of the coordinate prediction model can be calculated according to the predicted three-dimensional coordinate sequence and the three-dimensional coordinate sequence label, and parameters of the coordinate prediction model are adjusted according to the loss value until the preset iteration times are reached or the loss value of the coordinate prediction model reaches a preset loss value threshold.

In addition, after the coordinate prediction model is trained, parameter fine adjustment can be performed on the coordinate prediction model, so that the accuracy of an output result of the coordinate prediction model is further ensured, and the specific implementation mode is as follows:

the step of inputting the two-dimensional coordinate sequence into a coordinate prediction model, and after obtaining the three-dimensional coordinate sequence of the joint point of the target object output by the coordinate prediction model, further comprises:

displaying the three-dimensional coordinate sequence of the joint point of the target object to a user;

receiving feedback information of a user aiming at the three-dimensional coordinate sequence;

And adjusting parameters of the coordinate prediction model according to the feedback information.

Specifically, the three-dimensional coordinate sequence of the joint point of the target object output by the coordinate prediction model can be displayed to the user through a display interface of the terminal side device, or after the three-dimensional gesture of the target object is constructed according to the three-dimensional coordinate sequence of the joint point of the target object, the three-dimensional gesture of the target object can be displayed to the user, feedback information of the user for the three-dimensional coordinate sequence or the three-dimensional gesture is received, and the parameters of the coordinate prediction model are finely adjusted according to the feedback information.

In summary, after the training of the coordinate prediction model is finished, parameters of the coordinate prediction model can be adjusted according to feedback information of a user, so that performance of the coordinate prediction model is further improved. And under the condition that the output result of the trained coordinate prediction model is still inaccurate, the result can be timely found and adjusted, and the failure of the subsequent task is avoided.

In summary, an embodiment of the present disclosure provides a method for processing a target object, determining a two-dimensional coordinate sequence of an articulation point of the target object, where the two-dimensional coordinate sequence includes two-dimensional coordinate information corresponding to at least two articulation points; determining the characteristics of the target joint points corresponding to the target joint points according to the two-dimensional coordinate information corresponding to the target joint points, wherein the target joint points are any joint point in the at least two joint points; determining initial bone edge characteristics associated with at least two target articulation points according to the target articulation point characteristics corresponding to the at least two target articulation points; and processing the target joint point characteristics corresponding to the target joint points and the initial bone edge characteristics associated with the target joint points, and determining a three-dimensional coordinate sequence of the joint points of the target object according to the processing result. According to the method, the initial skeleton edge characteristics associated with the joint points of the target object are determined, and the three-dimensional coordinates of the joint points are determined by utilizing the combination of the initial skeleton edge characteristics and the target joint point characteristics, so that the accuracy of the determined three-dimensional coordinates can be improved, and the accuracy and precision of the subsequent human body posture estimation are further ensured.

The following describes, with reference to fig. 4, an example of application of the target object processing method provided in the present specification to human body posture estimation, where the target object processing method is further described. Fig. 4 is a flowchart of a processing procedure of a target object processing method according to an embodiment of the present disclosure, which specifically includes the following steps.

Step 402: the terminal side device receives the image data input by the user in the gesture determination request uploading box, receives an uploading instruction of the user for the gesture determination request, and sends the gesture determination request to the cloud side device.

The terminal side device displays a gesture determination request uploading box to a user. The gesture determination request carries image data. The image data may be any video frame in the video data of the user shot by the terminal device, may be any video frame in the video data of the user shot by the other shooting devices and sent to the terminal device, or may be an image of the user shot by the terminal device or the other shooting devices.

Specifically, the user clicks a control "determine" on the display interface of the end-side device, and the end-side device determines an upload instruction of the user for the gesture determination request based on the click instruction of the user.

Step 404: the cloud-side device receives the gesture determination request and determines image data.

Step 406: the cloud side device inputs the image data into a two-dimensional coordinate extraction model to obtain a two-dimensional coordinate sequence of the joint point of the user, and determines a three-dimensional coordinate sequence of the joint point of the user according to the two-dimensional coordinate sequence.

Specifically, determining the three-dimensional coordinate sequence of the node of the user according to the two-dimensional coordinate sequence may be performed according to the target object processing method, and a detailed description is not repeated here.

In addition, the cloud side device can also determine the human body gesture of the user according to the three-dimensional coordinate sequence of the joint point of the user.

Step 408: the cloud side device transmits the human body posture to the end side device.

Step 410: the end side equipment renders the human body gesture and displays the human body gesture to a user in an output result display frame.

In addition, the terminal side device can receive the coordinate prediction model issued by the cloud side device, and the image data or the video data is processed at the terminal side device by using the coordinate prediction model. Specifically, fig. 5 shows a flowchart of a processing procedure of a target object processing method at an end-side device according to an embodiment of the present disclosure, and specific steps are as follows.

Step 502: the terminal side device receives the image data input by the user in the gesture determination request input box, receives an execution instruction of the gesture determination request of the user for the image data, and inputs the image data to the coordinate prediction model.

The terminal side device can display a gesture determination request input box to a user, the user can input image data needing gesture determination in the gesture determination request input box, the image data can be stored in the terminal side device, and the image data needing gesture determination can be determined based on a selection instruction of the user in an image data uploading box. The coordinate prediction model may be deployed at the end-side device. Image data may be understood as an image of a user exhibiting a user gesture.

Specifically, the user may click a control "determine" on the presentation interface of the end-side device, and the end-side device determines an input instruction of the user for the gesture determination request based on the click instruction of the user.

In addition, after receiving the image data input by the user in the gesture determination request input box, the terminal device may further execute the above-mentioned target object processing method, and determine a three-dimensional coordinate sequence of the node of the user in the image data. The description is not repeated here.

Step 504: the terminal side equipment receives a three-dimensional coordinate sequence of the joint point of the user, which is output by the coordinate prediction model.

Step 506: the terminal side equipment determines the human body posture of the user according to the three-dimensional coordinate sequence of the joint point of the user, renders the human body posture of the user, and displays the human body posture to the user in an output result display frame.

In summary, by determining the initial bone edge feature associated with the joint point of the target object and determining the three-dimensional coordinate of the joint point by utilizing the combination of the initial bone edge feature and the target joint point feature, the method can improve the accuracy of the determined three-dimensional coordinate and further ensure the accuracy and precision of the subsequent human body posture estimation.

Corresponding to the above method embodiments, the present disclosure further provides an embodiment of a target object processing apparatus, and fig. 6 shows a schematic structural diagram of the target object processing apparatus according to one embodiment of the present disclosure. As shown in fig. 6, the apparatus includes:

a first determining module 602 configured to determine a two-dimensional coordinate sequence of the joint points of the target object, where the two-dimensional coordinate sequence includes two-dimensional coordinate information corresponding to at least two joint points;

a second determining module 604, configured to determine a target node feature corresponding to a target node according to two-dimensional coordinate information corresponding to the target node, where the target node is any one of the at least two nodes;

A third determining module 606 configured to determine initial bone edge features associated with at least two target nodes according to target node features corresponding to the at least two target nodes;

and the processing module 608 is configured to process the target joint point characteristics corresponding to the target joint point and the initial bone edge characteristics associated with the target joint point, and determine a three-dimensional coordinate sequence of the joint point of the target object according to the processing result.

In an alternative embodiment, the processing module 608 is further configured to:

and according to a preset fusion calculation method, carrying out fusion processing on the target node characteristics corresponding to the target node and the initial skeleton edge characteristics associated with the target node to obtain fusion characteristics corresponding to the target node.

In an alternative embodiment, the third determining module 606 is further configured to:

and mapping the initial bone edge features to obtain target bone edge features associated with the at least two target joint points, wherein the dimension of the target bone edge features is larger than that of the initial bone edge features.

In an alternative embodiment, the second determining module 604 is further configured to:

In an alternative embodiment, the first determining module 602 is further configured to:

In an alternative embodiment, the apparatus further comprises an input module configured to:

In an alternative embodiment, the apparatus further comprises a training module configured to:

In summary, by determining the initial bone edge feature associated with the joint point of the target object and determining the three-dimensional coordinate of the joint point by utilizing the combination of the initial bone edge feature and the target joint point feature, the device can improve the accuracy of the determined three-dimensional coordinate and further ensure the accuracy and precision of the subsequent human body posture estimation.

The above is a schematic solution of a target object processing apparatus of the present embodiment. It should be noted that, the technical solution of the target object processing apparatus and the technical solution of the target object processing method belong to the same concept, and details of the technical solution of the target object processing apparatus, which are not described in detail, can be referred to the description of the technical solution of the target object processing method.

Fig. 7 illustrates a block diagram of a computing device 700 provided in accordance with one embodiment of the present description. The components of computing device 700 include, but are not limited to, memory 710 and processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is used to store data.

Computing device 700 also includes access device 740, access device 740 enabling computing device 700 to communicate via one or more networks 760. Examples of such networks include public switched telephone networks (PSTN, public Switched Telephone Network), local area networks (LAN, local Area Network), wide area networks (WAN, wide Area Network), personal area networks (PAN, personal Area Network), or combinations of communication networks such as the internet. The access device 740 may include one or more of any type of network interface, wired or wireless, such as a network interface card (NIC, network interface controller), such as an IEEE802.11 wireless local area network (WLAN, wireless Local Area Network) wireless interface, a worldwide interoperability for microwave access (Wi-MAX, worldwide Interoperability for Microwave Access) interface, an ethernet interface, a universal serial bus (USB, universal Serial Bus) interface, a cellular network interface, a bluetooth interface, a near field communication (NFC, near Field Communication) interface, and so forth.

In one embodiment of the present application, the above-described components of computing device 700, as well as other components not shown in FIG. 7, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 7 is for exemplary purposes only and is not intended to limit the scope of the present application. Those skilled in the art may add or replace other components as desired.

Computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC, personal Computer). Computing device 700 may also be a mobile or stationary server.

Wherein the processor 720 is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the target object processing method described above.

The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device and the technical solution of the target object processing method belong to the same concept, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solution of the target object processing method.

An embodiment of the present specification also provides an AR/VR device, comprising: a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the target object processing method described above.

Specifically, the user may wear the AR device or the VR device when moving, and the AR device or the VR device may execute the above target object processing method to obtain the three-dimensional coordinate sequence of the joint point when the user moves, and generate the human body gesture of the user according to the three-dimensional coordinate sequence of the joint point, and display the human body gesture to the user through the AR device or the VR device. So that the user can know the gesture of the user during the movement, and the user experience is ensured.

An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the target object processing method described above.

The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the target object processing method belong to the same concept, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the target object processing method.

An embodiment of the present disclosure also provides a computer program, where the computer program, when executed in a computer, causes the computer to perform the steps of the target object processing method described above.

The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program and the technical solution of the target object processing method belong to the same concept, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the technical solution of the target object processing method.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code that may be in source code form, object code form, executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be increased or decreased appropriately according to the requirements of the patent practice, for example, in some areas, according to the patent practice, the computer readable medium does not include an electric carrier signal and a telecommunication signal.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims

1. A target object processing method is applied to a motion gesture evaluation scene of a user, and comprises the following steps:

determining a target joint point characteristic corresponding to a target joint point according to two-dimensional coordinate information corresponding to the target joint point, wherein the target joint point is any joint point in the at least two joint points, the target joint point characteristic is determined according to an initial joint point characteristic corresponding to the target joint point, the initial joint point characteristic is obtained by extracting the two-dimensional coordinate information according to a characteristic extractor, the characteristic extractor comprises a multi-head self-attention mechanism and an adjacent matrix, and the adjacent matrix is used for describing a connection relation between human skeleton structures;

splicing the target node characteristics corresponding to at least two target nodes, or adding, subtracting or multiplying the target node characteristics of the target nodes on the shortest path of the at least two target nodes to obtain initial skeleton edge characteristics associated with the at least two target nodes;

processing the target joint point characteristics corresponding to the target joint points and the target bone edge characteristics associated with the target joint points, and determining a three-dimensional coordinate sequence of the joint points of the target object according to the processing result;

and determining the motion gesture of the user according to the three-dimensional coordinate sequence, comparing the motion gesture of the user with a reference motion gesture, determining whether the motion gesture of the user is standard or not, and determining whether the limb action of the user meets the requirement or not according to the distance between the coordinates of all the joint points in the three-dimensional coordinate sequence.

2. The method according to claim 1, wherein the processing the target node feature corresponding to the target node and the initial bone edge feature associated with the target node, and determining the three-dimensional coordinate sequence of the node of the target object according to the processing result, includes:

3. The method of claim 2, wherein the fusing the target node feature corresponding to the target node and the initial bone edge feature associated with the target node to obtain the fused feature corresponding to the target node, includes:

4. The method of claim 2, wherein the fusing the target node feature corresponding to the target node and the initial bone edge feature associated with the target node to obtain the fused feature corresponding to the target node, includes:

5. The method according to claim 1, wherein the determining, according to the two-dimensional coordinate information corresponding to the target joint point, the target joint point feature corresponding to the target joint point includes:

6. The method of claim 1, the determining a two-dimensional coordinate sequence of an articulation point of a target object, comprising:

7. The method according to claim 1, further comprising, after determining the three-dimensional coordinate sequence of the joint point of the target object according to the processing result:

8. The method of claim 1, further comprising, after determining the two-dimensional coordinate sequence of the node of the target object:

9. The method of claim 8, the training of the coordinate prediction model comprising:

10. An AR/VR device comprising:

a memory and a processor;

the memory is configured to store computer executable instructions, the processor being configured to execute the computer executable instructions, which when executed by the processor, implement the steps of the method of any one of claims 1 to 9.

11. A computing device, comprising:

a memory and a processor;

12. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the method of any one of claims 1 to 9.