CN113780215A - Information processing method and device, computer equipment and storage medium - Google Patents

Information processing method and device, computer equipment and storage medium

Info

Publication number
CN113780215A
Authority
CN
China
Prior art keywords
target
joint point
sub
image
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111088903.6A
Other languages
Chinese (zh)
Inventor
田茂清
万子牛
李正甲
刘建博
伊帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202111088903.6A
Publication of CN113780215A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an information processing method, apparatus, computer device, and storage medium. The method comprises: acquiring a video clip including a target object; determining a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip; determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and determining target pose information corresponding to the target object based on the sub-pose information of each joint point.

Description

Information processing method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an information processing method and apparatus, a computer device, and a storage medium.
Background
With the development of deep learning technology, the estimation of three-dimensional human body shape and pose has gradually become a research hotspot in the field of computer vision. Three-dimensional human body shape and pose estimation has broad application prospects in tasks such as human motion analysis and motion teaching.
At present, three-dimensional human pose estimation is mainly performed based on image features extracted from human body images; the accuracy of poses determined in this way is low under cluttered backgrounds, occlusion, and similar conditions.
Disclosure of Invention
The embodiment of the disclosure at least provides an information processing method, an information processing device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides an information processing method, including:
acquiring a video clip including a target object;
determining a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
determining target pose information corresponding to the target object based on the sub-pose information of each joint point.
In the embodiments of the present disclosure, the target spatio-temporal image feature determined for the video clip can represent the pose features of the target object in both the spatial dimension and the temporal dimension. Combining the two effectively improves robustness to cluttered backgrounds and occlusion, thereby improving the accuracy of pose detection. Meanwhile, the sub-pose information of each joint point is determined based on both the sub-pose information of its ancestor joint points and the target spatio-temporal image feature; taking the dependency between joint points into account avoids the inaccuracy that results from ignoring it, further improving the accuracy of pose detection.
In an optional embodiment, after determining the target spatio-temporal image feature corresponding to the video clip, the method further includes:
determining morphological information of the target object based on the target spatio-temporal image feature.
Because the target spatio-temporal image feature represents the pose features of the target object in both the spatial and temporal dimensions, the morphological information of the human body can be accurately determined even in images or videos with cluttered backgrounds or occlusion.
In an optional embodiment, determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip includes:
performing a first feature processing operation on the initial image feature to obtain a first spatial image feature;
performing a second feature processing operation on the initial image feature to obtain a temporal image feature; and
determining the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature.
Through the first feature processing operation, a spatial attention mechanism can accurately determine the pose feature of the target object in the spatial dimension, i.e., the first spatial image feature; through the second feature processing operation, a temporal attention mechanism can accurately determine the pose feature of the target object in the temporal dimension, i.e., the temporal image feature. The target spatio-temporal image feature can then be accurately determined by fusing the first spatial image feature and the temporal image feature.
In an optional embodiment, determining the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature comprises:
determining first weight information corresponding to the first spatial image feature and second weight information corresponding to the temporal image feature based on the initial image feature; and
determining the target spatio-temporal image feature based on the first weight information, the second weight information, the first spatial image feature, and the temporal image feature.
Based on the initial image feature, which reflects the background environment, the first weight information corresponding to the first spatial image feature and the second weight information corresponding to the temporal image feature can be accurately determined; the first spatial image feature and the temporal image feature can then be fused in a way adapted to the background environment, yielding a more accurate target spatio-temporal image feature.
In an optional embodiment, before determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip, the method further includes:
extracting sub-image features of at least some target images in the video clip; and
stitching the extracted sub-image features to obtain the initial image feature corresponding to the video clip.
In the embodiments of the present disclosure, stitching the sub-image features together yields an initial image feature that represents the image content of the video clip relatively completely.
In an optional implementation, stitching the extracted sub-image features to obtain the initial image feature corresponding to the video clip includes:
acquiring a temporal feature and a positional feature corresponding to each target image in the video clip; and
stitching the extracted sub-image features based on the temporal features and the positional features to obtain the initial image feature corresponding to the video clip.
Because the stitching is based on the temporal feature and positional feature corresponding to each target image, the resulting initial image feature contains temporal and positional information, i.e., it represents the temporal and positional information of each target image more accurately. The temporal image feature in the temporal dimension and the first spatial image feature in the spatial dimension, and hence the target spatio-temporal image feature, can then be accurately determined from this initial image feature.
In an optional embodiment, determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature comprises:
for each joint point, determining the ancestor joint points corresponding to the joint point based on the positional relationship between a root joint point and each joint point, wherein the joint points are connection points of different preset parts of the target object, the root joint point is an ancestor joint point of all joint points, and the sub-pose information of the root joint point is determined based on the target spatio-temporal image feature; and
for each joint point, determining the sub-pose information of the joint point based on the target spatio-temporal image feature and the sub-pose information of its ancestor joint points.
In the embodiments of the present disclosure, the ancestor joint points corresponding to each joint point can be accurately determined from the connection relationship between the root joint point and that joint point, so the sub-pose information of each joint point can be determined more accurately.
In an optional embodiment, determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip includes:
performing a third feature processing operation on the initial image feature to obtain a second spatial image feature; and
performing a fourth feature processing operation on the second spatial image feature to obtain the target spatio-temporal image feature corresponding to the video clip.
Through the third feature processing operation on the initial image feature, a spatial attention mechanism can accurately determine the pose feature of the target object in the spatial dimension, i.e., the second spatial image feature; the fourth feature processing operation on the second spatial image feature then combines a temporal attention mechanism to accurately determine the pose feature of the target object in both the temporal and spatial dimensions, i.e., the target spatio-temporal image feature.
In a second aspect, an embodiment of the present disclosure further provides an information processing apparatus, including:
a first obtaining module, configured to obtain a video clip including a target object;
a first determining module, configured to determine a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
a second determining module, configured to determine sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
a third determining module, configured to determine target pose information corresponding to the target object based on the sub-pose information of each joint point.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor, the processor communicating with the memory via the bus when the computer device is running, the machine-readable instructions, when executed by the processor, performing the steps of the first aspect above or of any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the first aspect above or of any possible implementation of the first aspect.
For a description of the effects of the information processing apparatus, the computer device, and the computer-readable storage medium, reference is made to the description of the information processing method above; it is not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly described below. The drawings here are incorporated in and form part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and therefore should not be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 shows a flowchart of an information processing method provided by an embodiment of the present disclosure;
fig. 2 shows a schematic structural diagram of a spatio-temporal encoder and kinematic topology decoder provided by an embodiment of the present disclosure;
fig. 3 shows a block diagram of a cascaded block (STE Block) of the spatio-temporal encoder provided by an embodiment of the present disclosure;
fig. 4 shows a schematic structural diagram of a multi-head spatial self-attention model and a multi-head temporal self-attention model provided by an embodiment of the present disclosure;
fig. 5 shows a schematic diagram of the connections of human joint points provided by an embodiment of the present disclosure;
fig. 6 shows a schematic diagram of an information processing apparatus provided by an embodiment of the present disclosure;
fig. 7 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
At present, three-dimensional human pose estimation mainly extracts spatial image features from a human body image and then determines the pose information of each human joint separately from those features. In this approach, pose information can be determined only when spatial image features can be acquired, so in an occluded environment the pose at the occluded moment cannot be estimated; moreover, determining the pose information of each joint separately ignores the dependency between joints. The accuracy of the pose information so determined is therefore low.
Based on this, the present disclosure provides an information processing method in which a target spatio-temporal image feature corresponding to a video clip is determined, from which the pose information of the target object in both the spatial and temporal dimensions can be derived. This avoids both the failure to accurately determine human pose against a cluttered background and the failure to accurately determine it under occlusion.
The drawbacks described above were identified by the inventors through practice and careful study; the discovery of these problems and the solutions proposed by the present disclosure should therefore be regarded as the inventors' contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
For the convenience of understanding of the present embodiment, an information processing method disclosed in the embodiments of the present disclosure is first described in detail, and an execution subject of the information processing method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability.
The following describes an information processing method provided by the embodiment of the present disclosure, taking an execution subject as a server as an example.
Referring to fig. 1, a flowchart of an information processing method provided by an embodiment of the present disclosure is shown, where the method includes S101 to S104, where:
s101: a video clip including a target object is acquired.
In the embodiments of the present disclosure, the target object may be a dynamic object, and specifically may be an object that performs any action. The video segment may include a plurality of frame images, and the frame images may not all include the target object, for example, in a case where the target object is blocked by an obstacle, the target object cannot be identified in the corresponding frame image.
S102: and determining the target space-time image characteristics corresponding to the video clips based on the initial image characteristics corresponding to the video clips.
The initial image feature is obtained by stitching sub-image features extracted from at least some target images in the video clip. Before the target spatio-temporal image feature corresponding to the video clip is determined from the initial image feature, the sub-image features of at least some target images in the video clip are extracted, and the extracted sub-image features are then stitched to obtain the initial image feature corresponding to the video clip. In a specific implementation, a Convolutional Neural Network (CNN) may be used to perform feature extraction on the target images. As shown in fig. 2, the target images in the video clip are input into the CNN to obtain the sub-image features corresponding to each target image.
Specifically, after the video clip is acquired, it may be divided into a plurality of images, all of which are taken as target images; alternatively, the images may be filtered, e.g., target images may be selected at a preset frame interval; or each image may be segmented into a plurality of sub-images, and all or some of the sub-images taken as target images.
The target images are then arranged in temporal order, the sub-image feature corresponding to each target image is extracted, and the sub-image features are stitched to obtain the initial image feature.
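As a concrete illustration of the extraction step above, the following is a minimal sketch assuming a PyTorch ResNet-50 backbone with its spatial feature map retained; the disclosure only specifies "a CNN", so the backbone choice, names, and shapes here are illustrative.

```python
import torch
import torchvision.models as models

# Assumed backbone: ResNet-50 truncated before global pooling, so each
# frame yields an h*w grid of d-dimensional sub-image features.
cnn = torch.nn.Sequential(*list(models.resnet50(weights=None).children())[:-2])

def extract_sub_image_features(frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, 3, H, W) target images arranged in temporal order
    fmap = cnn(frames)                      # (T, d, h, w) feature maps
    T, d, h, w = fmap.shape
    return fmap.flatten(2).transpose(1, 2)  # (T, N, d) with N = h * w
```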
In the stitching process, the temporal feature and positional feature corresponding to each target image in the video clip may be acquired, and the extracted sub-image features may then be stitched on the basis of these temporal and positional features to obtain the initial image feature corresponding to the video clip. The temporal feature is a feature vector generated from the time corresponding to each target image, and the positional feature is a feature vector generated from the position corresponding to each target image.
The resulting initial image feature comprises the sub-image features, temporal features, and positional features corresponding to all target images.
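As an illustration of how the temporal and positional features might be attached during stitching, here is a sketch assuming learned embeddings added to the stacked sub-image features; the disclosure does not fix the exact form of these feature vectors.

```python
import torch
import torch.nn as nn

class InitialFeature(nn.Module):
    """Stitches per-frame sub-image features with temporal and positional
    features into a TxNxd initial image feature (assumed learned embeddings)."""
    def __init__(self, T: int, N: int, d: int):
        super().__init__()
        self.temporal = nn.Parameter(torch.zeros(T, 1, d))  # one vector per time step
        self.position = nn.Parameter(torch.zeros(1, N, d))  # one vector per location

    def forward(self, sub_feats: torch.Tensor) -> torch.Tensor:
        # sub_feats: (T, N, d) extracted sub-image features
        return sub_feats + self.temporal + self.position     # (T, N, d)
```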
After the initial image feature is obtained, as shown in fig. 2, a Spatio-Temporal Encoder (STE) may be used to perform the first feature processing operation on the initial image feature to obtain the first spatial image feature while, in parallel, the STE performs the second feature processing operation on the initial image feature to obtain the temporal image feature; the first spatial image feature and the temporal image feature are then fused to obtain the target spatio-temporal image feature. Several cascaded Spatio-Temporal Encoder blocks (STE Blocks) may be arranged in the STE; connecting multiple STE Blocks in series allows the target spatio-temporal image feature to be determined more accurately. As shown in fig. 3, each STE Block may contain a Multi-Head Spatial Self-Attention (MHSSA) model, a Multi-Head Temporal Self-Attention (MHTSA) model, a feature fusion layer, and a Multi-Layer Perceptron (MLP).
Specifically, a first feature processing operation may be performed on the input initial image feature by a trained MHSSA model, and a second feature processing operation by a trained MHTSA model connected in parallel with it. The MHSSA model attends over image features in the spatial dimension and the MHTSA model over image features in the temporal dimension, so the two expect the dimensions of the input initial image feature in different orders.
In a specific implementation, the initial image feature obtained by stitching the extracted sub-image features based on the temporal and positional features can be represented as TxNxd, where T is the size of the temporal dimension (the temporal feature), N the size of the spatial dimension (the positional feature), and d the dimension of the sub-image features. As shown in fig. 4, the TxNxd feature may be input into the MHSSA model for the first feature processing operation. Before input to the MHTSA model, the TxNxd initial image feature may be reshaped into an NxTxd feature, which is then input into the MHTSA model.
The MHSSA model computes feature similarity over the spatial dimension of the TxNxd feature, i.e., the first feature processing operation, to obtain the first spatial image feature; the MHTSA model computes feature similarity over the temporal dimension of the NxTxd feature, i.e., the second feature processing operation, to obtain the temporal image feature.
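The parallel attend-over-space / attend-over-time pattern described above can be sketched as follows, using PyTorch's built-in multi-head attention as a stand-in for the trained MHSSA and MHTSA models (an assumption — the disclosure does not fix the attention implementation):

```python
import torch
import torch.nn as nn

d = 256
mhssa = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
mhtsa = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

def spatial_and_temporal_features(x: torch.Tensor):
    # x: (T, N, d) initial image feature.
    # MHSSA: each of the T frames attends over its N spatial locations.
    s, _ = mhssa(x, x, x)                   # (T, N, d) first spatial image feature
    # MHTSA: reshape so each of the N locations attends over the T time steps.
    xt = x.transpose(0, 1)                  # (N, T, d)
    t, _ = mhtsa(xt, xt, xt)
    return s, t.transpose(0, 1)             # temporal image feature back to (T, N, d)
```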
Next, the target spatio-temporal image feature may be determined based on the first spatial image feature and the temporal image feature. Specifically, first weight information corresponding to the first spatial image feature and second weight information corresponding to the temporal image feature may be determined based on the initial image feature; the target spatio-temporal image feature is then determined based on the first weight information, the second weight information, the first spatial image feature, and the temporal image feature.
Specifically, the feature fusion layer may determine the first weight information corresponding to the first spatial image feature and the second weight information corresponding to the temporal image feature from the sub-image features within the initial image feature; the weights determined from sub-image features differ across environments. For example, under occlusion, the feature fusion layer may determine from the sub-image features of the occluded scene a ratio that weights the temporal image feature more heavily than the first spatial image feature, i.e., the second weight information is larger than the first weight information, because in an occluded environment more attention should be paid to image features in the temporal dimension.
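One plausible form of this fusion layer is sketched below — a softmax gate that predicts the two weights from the initial image feature; the disclosure does not specify the gating function, so this form is an assumption:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Weights the spatial and temporal branches adaptively; under occlusion
    the gate can learn to push more weight onto the temporal branch."""
    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(d, 2)  # predicts the two weights from the input

    def forward(self, x, s, t):
        # x: initial image feature, s: first spatial image feature,
        # t: temporal image feature -- all (T, N, d).
        w = torch.softmax(self.gate(x), dim=-1)   # (T, N, 2)
        w1, w2 = w[..., :1], w[..., 1:]           # first / second weight information
        return w1 * s + w2 * t                    # fused spatio-temporal feature
```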
In an embodiment, a third feature processing operation may be performed on the initial image feature using the MHSSA model to obtain a second spatial image feature (the third feature processing operation may be the same as the first), after which a fourth feature processing operation is performed on the second spatial image feature using an MHTSA model connected in series with the MHSSA model, obtaining the target spatio-temporal image feature corresponding to the video clip. The fourth feature processing operation here refers to feature similarity calculation over the temporal dimension of the second spatial image feature.
In an embodiment, a fifth feature processing operation may instead be performed on the initial image feature using a trained Multi-Head Coupled Self-Attention (MHCSA) model, obtaining the target spatio-temporal image feature corresponding to the video clip directly.
Obtaining the target spatio-temporal image feature with the serial MHSSA and MHTSA models, or directly with the MHCSA model, requires more computation than obtaining it with the parallel MHSSA and MHTSA models; in a specific implementation, the parallel MHSSA and MHTSA models may therefore be used preferentially.
Here, the parallel MHSSA and MHTSA models, the serial MHSSA and MHTSA models, and the MHCSA model are all arranged within the STE Block.
S103: determining sub-pose information for each joint point of the target object based on the target spatiotemporal image features; wherein the sub-pose information of each joint point is determined based on the sub-pose information of ancestor joint points of the joint point and the target spatio-temporal image feature.
For a target object composed of multiple joint points, the position of each joint point is affected by the sub-pose parameters of its own and ancestor joint points, so that the sub-pose information of each joint point can be determined here from both the target spatio-temporal image features and the sub-pose information of the ancestor joint point of each joint point.
In the embodiments of the present disclosure, when the target object is a human body, a connection graph of the joint points of the human body may first be determined using the Skinned Multi-Person Linear model (SMPL); the root joint point of the body is then determined and taken as the ancestor of all other joint points. The root joint point can be used to transform the entire human body as a single rigid body and determines the overall pose of the body. The ancestor joint points corresponding to each joint point are then determined from the positional relationship between the root joint point and that joint point.
The joint points are the connection points of different preset parts of the target object; the root joint point is the point of origin, i.e., the ancestor of all joint points. The sub-pose information of the root joint point is determined based on the target spatio-temporal image feature, and for each joint point, its sub-pose information can be determined based on the target spatio-temporal image feature and the sub-pose information of its ancestor joint points.
Here, the target spatio-temporal image feature may be input into a Kinematic Topology Decoder (KTD) to obtain the sub-pose information of each joint point. The KTD may comprise a plurality of trained linear regressors, one per joint point; each linear regressor derives the sub-pose information of its joint point from the target spatio-temporal image feature and the sub-pose information of that joint point's ancestors.
In the connection diagram of human joint points shown in fig. 5, taking joint points (0, 2, 5) as an example, joint point 0 is the root joint point. In the diagram, joint point 0 points to joint point 2, so joint point 0 is an ancestor joint point (specifically, the parent) of joint point 2; joint point 2 in turn points to joint point 5, so joint point 2 is an ancestor joint point (specifically, the parent) of joint point 5, and joint points 0 and 2 are both ancestors of joint point 5.
The target spatio-temporal image feature can be input into a trained linear regressor to output the sub-pose information of the root joint point, i.e., the overall pose. The target spatio-temporal image feature and the root joint point's sub-pose information are then input into a second trained linear regressor to output the sub-pose information of joint point 2, and the target spatio-temporal image feature together with the sub-pose information of the root joint point and of joint point 2 are input into a third trained linear regressor to output the sub-pose information of joint point 5.
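The decoding order just described can be sketched as follows: one linear regressor per joint point, each fed the target spatio-temporal image feature concatenated with its ancestors' sub-poses. The joint indices follow the fig. 5 excerpt only, and the feature and per-joint pose dimensions are assumptions:

```python
import torch
import torch.nn as nn

parents = {0: [], 2: [0], 5: [0, 2]}   # joint -> ancestor joints (fig. 5 excerpt)
feat_dim, pose_dim = 256, 6            # assumed feature / per-joint pose sizes

# One trained linear regressor per joint point, as in the KTD.
regressors = nn.ModuleDict({
    str(j): nn.Linear(feat_dim + len(anc) * pose_dim, pose_dim)
    for j, anc in parents.items()
})

def decode_sub_poses(f: torch.Tensor) -> dict:
    # f: (feat_dim,) target spatio-temporal image feature
    poses = {}
    for j, anc in parents.items():     # root first, then down the chain
        inp = torch.cat([f] + [poses[a] for a in anc])
        poses[j] = regressors[str(j)](inp)
    return poses                        # sub-pose information per joint point
```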
S104: and determining target posture information corresponding to the target object based on the sub-posture information of each joint point.
The overall posture information of the target object or the posture information of the target part, etc. may be determined from the sub-posture information of each joint point.
In a specific implementation, after the target spatio-temporal image feature corresponding to the video clip is determined, the morphological information of the target object may be determined based on it; the morphological information may include height, weight, and similar attributes. A human motion model of the target object may then be generated from the morphological information and the sub-pose information of each joint point, and tasks such as motion analysis may be performed on the target object based on this model.
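As an illustration of how the morphological information and the decoded sub-poses might be combined into a body model for downstream motion analysis, here is a sketch using the third-party smplx package; the package, the model file path, and the parameter shapes are assumptions — the disclosure only names SMPL:

```python
import torch
import smplx  # assumed third-party SMPL implementation

body_model = smplx.SMPL("SMPL_NEUTRAL.pkl")  # assumed model file path
betas = torch.zeros(1, 10)                   # morphological (shape) information
body_pose = torch.zeros(1, 69)               # concatenated per-joint sub-poses
global_orient = torch.zeros(1, 3)            # root joint point's overall pose

out = body_model(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices, joints = out.vertices, out.joints  # mesh and joints for motion analysis
```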
It will be understood by those skilled in the art that, in the above method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their function and possible internal logic.
Based on the same inventive concept, an information processing apparatus corresponding to the information processing method is also provided in the embodiments of the present disclosure, and because the principle of the apparatus in the embodiments of the present disclosure for solving the problem is similar to the information processing method described above in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 6, which is a schematic diagram of the architecture of an information processing apparatus according to an embodiment of the present disclosure, the apparatus includes: a first obtaining module 601, a first determining module 602, a second determining module 603, and a third determining module 604; wherein:
the first obtaining module 601 is configured to obtain a video clip including a target object;
the first determining module 602 is configured to determine a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
the second determining module 603 is configured to determine sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
the third determining module 604 is configured to determine target pose information corresponding to the target object based on the sub-pose information of each joint point.
In a possible embodiment, the apparatus further comprises a fourth determining module, configured to determine morphological information of the target object based on the target spatio-temporal image feature.
In a possible implementation, the first determining module 602 is specifically configured to:
perform a first feature processing operation on the initial image feature to obtain a first spatial image feature;
perform a second feature processing operation on the initial image feature to obtain a temporal image feature; and
determine the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature.
In a possible implementation, the first determining module 602 is specifically configured to:
determine first weight information corresponding to the first spatial image feature and second weight information corresponding to the temporal image feature based on the initial image feature; and
determine the target spatio-temporal image feature based on the first weight information, the second weight information, the first spatial image feature, and the temporal image feature.
In a possible embodiment, the apparatus further comprises:
an extraction module, configured to extract sub-image features of at least some target images in the video clip; and
a stitching module, configured to stitch the extracted sub-image features to obtain the initial image feature corresponding to the video clip.
In a possible embodiment, the stitching module is configured to:
acquire a temporal feature and a positional feature corresponding to each target image in the video clip; and
stitch the extracted sub-image features based on the temporal features and the positional features to obtain the initial image feature corresponding to the video clip.
In a possible implementation, the second determining module 603 is specifically configured to:
for each joint point, determine the ancestor joint points corresponding to the joint point based on the positional relationship between a root joint point and each joint point, wherein the joint points are connection points of different preset parts of the target object, the root joint point is an ancestor joint point of all joint points, and the sub-pose information of the root joint point is determined based on the target spatio-temporal image feature; and
for each joint point, determine the sub-pose information of the joint point based on the target spatio-temporal image feature and the sub-pose information of its ancestor joint points.
In a possible implementation, the first determining module 602 is specifically configured to:
perform a third feature processing operation on the initial image feature to obtain a second spatial image feature; and
perform a fourth feature processing operation on the second spatial image feature to obtain the target spatio-temporal image feature corresponding to the video clip.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same technical concept, an embodiment of the present disclosure also provides a computer device. Referring to fig. 7, a schematic structural diagram of a computer device 700 provided in an embodiment of the present disclosure, the device includes a processor 701, a memory 702, and a bus 703. The memory 702 is used for storing execution instructions and includes an internal memory 7021 and an external memory 7022; the internal memory 7021 temporarily stores operation data for the processor 701 and data exchanged with the external memory 7022, such as a hard disk. The processor 701 exchanges data with the external memory 7022 through the internal memory 7021, and when the computer device 700 runs, the processor 701 communicates with the memory 702 through the bus 703, causing the processor 701 to execute the following instructions:
acquiring a video clip including a target object;
determining a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
determining target pose information corresponding to the target object based on the sub-pose information of each joint point.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the information processing method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure also provides a computer program product carrying program code; the instructions included in the program code may be used to execute the steps of the information processing method in the above method embodiments, to which reference may be made; details are not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited to them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments, readily conceive of variations, or substitute equivalents for some of their technical features; such modifications, variations, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by it. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. An information processing method, characterized by comprising:
acquiring a video clip including a target object;
determining a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
determining target pose information corresponding to the target object based on the sub-pose information of each joint point.
2. The method of claim 1, further comprising, after determining the target spatio-temporal image feature corresponding to the video clip:
determining morphological information of the target object based on the target spatio-temporal image feature.
3. The method of claim 1, wherein determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip comprises:
performing a first feature processing operation on the initial image feature to obtain a first spatial image feature;
performing a second feature processing operation on the initial image feature to obtain a temporal image feature; and
determining the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature.
4. The method of claim 3, wherein determining the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature comprises:
determining first weight information corresponding to the first spatial image feature and second weight information corresponding to the temporal image feature based on the initial image feature; and
determining the target spatio-temporal image feature based on the first weight information, the second weight information, the first spatial image feature, and the temporal image feature.
5. The method according to claim 1, further comprising, before determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip:
extracting sub-image features of at least some target images in the video clip; and
stitching the extracted sub-image features to obtain the initial image feature corresponding to the video clip.
6. The method according to claim 5, wherein stitching the extracted sub-image features to obtain the initial image feature corresponding to the video clip comprises:
acquiring a temporal feature and a positional feature corresponding to each target image in the video clip; and
stitching the extracted sub-image features based on the temporal features and the positional features to obtain the initial image feature corresponding to the video clip.
7. The method of claim 1, wherein determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature comprises:
for each joint point, determining the ancestor joint points corresponding to the joint point based on the positional relationship between a root joint point and each joint point, wherein the joint points are connection points of different preset parts of the target object, the root joint point is an ancestor joint point of all joint points, and the sub-pose information of the root joint point is determined based on the target spatio-temporal image feature; and
for each joint point, determining the sub-pose information of the joint point based on the target spatio-temporal image feature and the sub-pose information of its ancestor joint points.
8. The method of claim 1, wherein determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip comprises:
performing a third feature processing operation on the initial image feature to obtain a second spatial image feature; and
performing a fourth feature processing operation on the second spatial image feature to obtain the target spatio-temporal image feature corresponding to the video clip.
9. An information processing apparatus, characterized by comprising:
a first obtaining module, configured to obtain a video clip including a target object;
a first determining module, configured to determine a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
a second determining module, configured to determine sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
a third determining module, configured to determine target pose information corresponding to the target object based on the sub-pose information of each joint point.
10. A computer device, comprising: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor, the processor communicating with the memory via the bus when the computer device is running, the machine-readable instructions, when executed by the processor, performing the steps of the information processing method according to any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when executed by a processor, performs the steps of the information processing method according to any one of claims 1 to 8.
CN202111088903.6A 2021-09-16 2021-09-16 Information processing method and device, computer equipment and storage medium Withdrawn CN113780215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111088903.6A CN113780215A (en) 2021-09-16 2021-09-16 Information processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111088903.6A CN113780215A (en) 2021-09-16 2021-09-16 Information processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113780215A true CN113780215A (en) 2021-12-10

Family

ID=78851514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111088903.6A Withdrawn CN113780215A (en) 2021-09-16 2021-09-16 Information processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113780215A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333069A (en) * 2022-03-03 2022-04-12 腾讯科技(深圳)有限公司 Object posture processing method, device, equipment and storage medium
CN114333069B (en) * 2022-03-03 2022-05-17 腾讯科技(深圳)有限公司 Object posture processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211210