CN113780215A - Information processing method and device, computer equipment and storage medium - Google Patents

Information processing method and device, computer equipment and storage medium

Info

Publication number
CN113780215A
Authority
CN
China
Prior art keywords
target
joint point
sub
image
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111088903.6A
Other languages
Chinese (zh)
Inventor
田茂清
万子牛
李正甲
刘建博
伊帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202111088903.6A
Publication of CN113780215A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides an information processing method, apparatus, computer device, and storage medium. The method comprises: acquiring a video clip including a target object; determining a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip; determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and determining target pose information corresponding to the target object based on the sub-pose information of each joint point.

Description

Information processing method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an information processing method and apparatus, a computer device, and a storage medium.
Background
With the development of deep learning technology, the estimation of three-dimensional human body shape and pose has gradually become a research hotspot in the field of computer vision. Three-dimensional human body shape and pose estimation has broad application prospects in tasks such as human motion analysis and motion teaching.
At present, three-dimensional human pose estimation is mainly performed based on image features extracted from human body images; the accuracy of poses determined in this way is low under cluttered backgrounds, occlusion, and similar conditions.
Disclosure of Invention
The embodiment of the disclosure at least provides an information processing method, an information processing device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides an information processing method, including:
acquiring a video clip including a target object;
determining a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
determining target pose information corresponding to the target object based on the sub-pose information of each joint point.
In the embodiments of the present disclosure, the target spatio-temporal image feature determined for the video clip can represent the pose features of the target object in both the spatial dimension and the temporal dimension. Combining the two effectively improves robustness to cluttered backgrounds and occlusion, thereby improving the accuracy of pose detection. Meanwhile, the sub-pose information of each joint point is determined based on both the sub-pose information of its ancestor joint points and the target spatio-temporal image feature; taking the dependency between joint points into account avoids the inaccuracy that results from ignoring it, further improving the accuracy of pose detection.
In an optional embodiment, after determining the target spatio-temporal image feature corresponding to the video clip, the method further includes:
determining morphological information of the target object based on the target spatio-temporal image feature.
Because the target spatio-temporal image feature represents the pose features of the target object in both the spatial and temporal dimensions, the morphological information of the human body can be accurately determined even in images or videos with cluttered backgrounds or occlusion.
In an optional embodiment, determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip includes:
performing a first feature processing operation on the initial image feature to obtain a first spatial image feature;
performing a second feature processing operation on the initial image feature to obtain a temporal image feature; and
determining the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature.
Through the first feature processing operation, a spatial attention mechanism can accurately determine the pose feature of the target object in the spatial dimension, i.e., the first spatial image feature; through the second feature processing operation, a temporal attention mechanism can accurately determine the pose feature of the target object in the temporal dimension, i.e., the temporal image feature. The target spatio-temporal image feature can then be accurately determined by fusing the first spatial image feature and the temporal image feature.
In an optional embodiment, determining the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature comprises:
determining first weight information corresponding to the first spatial image feature and second weight information corresponding to the temporal image feature based on the initial image feature; and
determining the target spatio-temporal image feature based on the first weight information, the second weight information, the first spatial image feature, and the temporal image feature.
Based on the initial image feature, which reflects the background environment, the first weight information corresponding to the first spatial image feature and the second weight information corresponding to the temporal image feature can be accurately determined; the first spatial image feature and the temporal image feature can then be fused in a way adapted to the background environment, yielding a more accurate target spatio-temporal image feature.
In an optional embodiment, before determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip, the method further includes:
extracting sub-image features of at least some target images in the video clip; and
stitching the extracted sub-image features to obtain the initial image feature corresponding to the video clip.
In the embodiments of the present disclosure, stitching the sub-image features together yields an initial image feature that represents the image content of the video clip relatively completely.
In an optional implementation, stitching the extracted sub-image features to obtain the initial image feature corresponding to the video clip includes:
acquiring a temporal feature and a positional feature corresponding to each target image in the video clip; and
stitching the extracted sub-image features based on the temporal features and the positional features to obtain the initial image feature corresponding to the video clip.
Because the stitching is based on the temporal feature and positional feature corresponding to each target image, the resulting initial image feature contains temporal and positional information, i.e., it represents the temporal and positional information of each target image more accurately. The temporal image feature in the temporal dimension and the first spatial image feature in the spatial dimension, and hence the target spatio-temporal image feature, can then be accurately determined from this initial image feature.
In an optional embodiment, determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature comprises:
for each joint point, determining the ancestor joint points corresponding to the joint point based on the positional relationship between a root joint point and each joint point, wherein the joint points are connection points of different preset parts of the target object, the root joint point is an ancestor joint point of all joint points, and the sub-pose information of the root joint point is determined based on the target spatio-temporal image feature; and
for each joint point, determining the sub-pose information of the joint point based on the target spatio-temporal image feature and the sub-pose information of its ancestor joint points.
In the embodiments of the present disclosure, the ancestor joint points corresponding to each joint point can be accurately determined from the connection relationship between the root joint point and that joint point, so the sub-pose information of each joint point can be determined more accurately.
In an optional embodiment, determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip includes:
performing a third feature processing operation on the initial image feature to obtain a second spatial image feature; and
performing a fourth feature processing operation on the second spatial image feature to obtain the target spatio-temporal image feature corresponding to the video clip.
Through the third feature processing operation on the initial image feature, a spatial attention mechanism can accurately determine the pose feature of the target object in the spatial dimension, i.e., the second spatial image feature; the fourth feature processing operation on the second spatial image feature then combines a temporal attention mechanism to accurately determine the pose feature of the target object in both the temporal and spatial dimensions, i.e., the target spatio-temporal image feature.
In a second aspect, an embodiment of the present disclosure further provides an information processing apparatus, including:
a first obtaining module, configured to obtain a video clip including a target object;
a first determining module, configured to determine a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
a second determining module, configured to determine sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
a third determining module, configured to determine target pose information corresponding to the target object based on the sub-pose information of each joint point.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor, the processor communicating with the memory via the bus when the computer device is running, the machine-readable instructions, when executed by the processor, performing the steps of the first aspect above or of any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the first aspect above or of any possible implementation of the first aspect.
For a description of the effects of the information processing apparatus, the computer device, and the computer-readable storage medium, reference is made to the description of the information processing method above; it is not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly described below. The drawings here are incorporated in and form part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and therefore should not be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 shows a flowchart of an information processing method provided by an embodiment of the present disclosure;
fig. 2 shows a schematic structural diagram of a spatio-temporal encoder and kinematic topology decoder provided by an embodiment of the present disclosure;
fig. 3 shows a block diagram of a cascaded block (STE Block) of the spatio-temporal encoder provided by an embodiment of the present disclosure;
fig. 4 shows a schematic structural diagram of a multi-head spatial self-attention model and a multi-head temporal self-attention model provided by an embodiment of the present disclosure;
fig. 5 shows a schematic diagram of the connections of human joint points provided by an embodiment of the present disclosure;
fig. 6 shows a schematic diagram of an information processing apparatus provided by an embodiment of the present disclosure;
fig. 7 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
At present, three-dimensional human pose estimation mainly extracts spatial image features from a human body image and then determines the pose information of each human joint separately from those features. In this approach, pose information can be determined only when spatial image features can be acquired, so in an occluded environment the pose at the occluded moment cannot be estimated; moreover, determining the pose information of each joint separately ignores the dependency between joints. The accuracy of the pose information so determined is therefore low.
Based on this, the present disclosure provides an information processing method in which a target spatio-temporal image feature corresponding to a video clip is determined, from which the pose information of the target object in both the spatial and temporal dimensions can be derived. This avoids both the failure to accurately determine human pose against a cluttered background and the failure to accurately determine it under occlusion.
The drawbacks described above were identified by the inventors through practice and careful study; the discovery of these problems and the solutions proposed by the present disclosure should therefore be regarded as the inventors' contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
For the convenience of understanding of the present embodiment, an information processing method disclosed in the embodiments of the present disclosure is first described in detail, and an execution subject of the information processing method provided in the embodiments of the present disclosure is generally a computer device with certain computing capability.
The following describes an information processing method provided by the embodiment of the present disclosure, taking an execution subject as a server as an example.
Referring to fig. 1, a flowchart of an information processing method provided by an embodiment of the present disclosure is shown, where the method includes S101 to S104, where:
s101: a video clip including a target object is acquired.
In the embodiments of the present disclosure, the target object may be a dynamic object, and specifically may be an object that performs any action. The video segment may include a plurality of frame images, and the frame images may not all include the target object, for example, in a case where the target object is blocked by an obstacle, the target object cannot be identified in the corresponding frame image.
S102: and determining the target space-time image characteristics corresponding to the video clips based on the initial image characteristics corresponding to the video clips.
The initial image feature is obtained by stitching sub-image features extracted from at least some target images in the video clip. Before the target spatio-temporal image feature corresponding to the video clip is determined from the initial image feature, the sub-image features of at least some target images in the video clip are extracted, and the extracted sub-image features are then stitched to obtain the initial image feature corresponding to the video clip. In a specific implementation, a Convolutional Neural Network (CNN) may be used to perform feature extraction on the target images. As shown in fig. 2, the target images in the video clip are input into the CNN to obtain the sub-image features corresponding to each target image.
Specifically, after the video clip is acquired, it may be divided into a plurality of images, all of which are taken as target images; alternatively, the images may be filtered, e.g., target images may be selected at a preset frame interval; or each image may be segmented into a plurality of sub-images, and all or some of the sub-images taken as target images.
The target images are then arranged in temporal order, the sub-image feature corresponding to each target image is extracted, and the sub-image features are stitched to obtain the initial image feature.
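As a concrete illustration of the extraction step above, the following is a minimal sketch assuming a PyTorch ResNet-50 backbone with its spatial feature map retained; the disclosure only specifies "a CNN", so the backbone choice, names, and shapes here are illustrative.

```python
import torch
import torchvision.models as models

# Assumed backbone: ResNet-50 truncated before global pooling, so each
# frame yields an h*w grid of d-dimensional sub-image features.
cnn = torch.nn.Sequential(*list(models.resnet50(weights=None).children())[:-2])

def extract_sub_image_features(frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, 3, H, W) target images arranged in temporal order
    fmap = cnn(frames)                      # (T, d, h, w) feature maps
    T, d, h, w = fmap.shape
    return fmap.flatten(2).transpose(1, 2)  # (T, N, d) with N = h * w
```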
In the stitching process, the temporal feature and positional feature corresponding to each target image in the video clip may be acquired, and the extracted sub-image features may then be stitched on the basis of these temporal and positional features to obtain the initial image feature corresponding to the video clip. The temporal feature is a feature vector generated from the time corresponding to each target image, and the positional feature is a feature vector generated from the position corresponding to each target image.
The resulting initial image feature comprises the sub-image features, temporal features, and positional features corresponding to all target images.
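As an illustration of how the temporal and positional features might be attached during stitching, here is a sketch assuming learned embeddings added to the stacked sub-image features; the disclosure does not fix the exact form of these feature vectors.

```python
import torch
import torch.nn as nn

class InitialFeature(nn.Module):
    """Stitches per-frame sub-image features with temporal and positional
    features into a TxNxd initial image feature (assumed learned embeddings)."""
    def __init__(self, T: int, N: int, d: int):
        super().__init__()
        self.temporal = nn.Parameter(torch.zeros(T, 1, d))  # one vector per time step
        self.position = nn.Parameter(torch.zeros(1, N, d))  # one vector per location

    def forward(self, sub_feats: torch.Tensor) -> torch.Tensor:
        # sub_feats: (T, N, d) extracted sub-image features
        return sub_feats + self.temporal + self.position     # (T, N, d)
```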
After the initial image feature is obtained, as shown in fig. 2, a Spatio-Temporal Encoder (STE) may be used to perform the first feature processing operation on the initial image feature to obtain the first spatial image feature while, in parallel, the STE performs the second feature processing operation on the initial image feature to obtain the temporal image feature; the first spatial image feature and the temporal image feature are then fused to obtain the target spatio-temporal image feature. Several cascaded Spatio-Temporal Encoder blocks (STE Blocks) may be arranged in the STE; connecting multiple STE Blocks in series allows the target spatio-temporal image feature to be determined more accurately. As shown in fig. 3, each STE Block may contain a Multi-Head Spatial Self-Attention (MHSSA) model, a Multi-Head Temporal Self-Attention (MHTSA) model, a feature fusion layer, and a Multi-Layer Perceptron (MLP).
Specifically, a first feature processing operation may be performed on the input initial image feature by a trained MHSSA model, and a second feature processing operation by a trained MHTSA model connected in parallel with it. The MHSSA model attends over image features in the spatial dimension and the MHTSA model over image features in the temporal dimension, so the two expect the dimensions of the input initial image feature in different orders.
In a specific implementation, the initial image feature obtained by stitching the extracted sub-image features based on the temporal and positional features can be represented as TxNxd, where T is the size of the temporal dimension (the temporal feature), N the size of the spatial dimension (the positional feature), and d the dimension of the sub-image features. As shown in fig. 4, the TxNxd feature may be input into the MHSSA model for the first feature processing operation. Before input to the MHTSA model, the TxNxd initial image feature may be reshaped into an NxTxd feature, which is then input into the MHTSA model.
The MHSSA model computes feature similarity over the spatial dimension of the TxNxd feature, i.e., the first feature processing operation, to obtain the first spatial image feature; the MHTSA model computes feature similarity over the temporal dimension of the NxTxd feature, i.e., the second feature processing operation, to obtain the temporal image feature.
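The parallel attend-over-space / attend-over-time pattern described above can be sketched as follows, using PyTorch's built-in multi-head attention as a stand-in for the trained MHSSA and MHTSA models (an assumption — the disclosure does not fix the attention implementation):

```python
import torch
import torch.nn as nn

d = 256
mhssa = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
mhtsa = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

def spatial_and_temporal_features(x: torch.Tensor):
    # x: (T, N, d) initial image feature.
    # MHSSA: each of the T frames attends over its N spatial locations.
    s, _ = mhssa(x, x, x)                   # (T, N, d) first spatial image feature
    # MHTSA: reshape so each of the N locations attends over the T time steps.
    xt = x.transpose(0, 1)                  # (N, T, d)
    t, _ = mhtsa(xt, xt, xt)
    return s, t.transpose(0, 1)             # temporal image feature back to (T, N, d)
```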
Next, the target spatio-temporal image feature may be determined based on the first spatial image feature and the temporal image feature. Specifically, first weight information corresponding to the first spatial image feature and second weight information corresponding to the temporal image feature may be determined based on the initial image feature; the target spatio-temporal image feature is then determined based on the first weight information, the second weight information, the first spatial image feature, and the temporal image feature.
Specifically, the feature fusion layer may determine the first weight information corresponding to the first spatial image feature and the second weight information corresponding to the temporal image feature from the sub-image features within the initial image feature; the weights determined from sub-image features differ across environments. For example, under occlusion, the feature fusion layer may determine from the sub-image features of the occluded scene a ratio that weights the temporal image feature more heavily than the first spatial image feature, i.e., the second weight information is larger than the first weight information, because in an occluded environment more attention should be paid to image features in the temporal dimension.
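One plausible form of this fusion layer is sketched below — a softmax gate that predicts the two weights from the initial image feature; the disclosure does not specify the gating function, so this form is an assumption:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Weights the spatial and temporal branches adaptively; under occlusion
    the gate can learn to push more weight onto the temporal branch."""
    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(d, 2)  # predicts the two weights from the input

    def forward(self, x, s, t):
        # x: initial image feature, s: first spatial image feature,
        # t: temporal image feature -- all (T, N, d).
        w = torch.softmax(self.gate(x), dim=-1)   # (T, N, 2)
        w1, w2 = w[..., :1], w[..., 1:]           # first / second weight information
        return w1 * s + w2 * t                    # fused spatio-temporal feature
```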
In an embodiment, a third feature processing operation may be performed on the initial image feature using the MHSSA model to obtain a second spatial image feature (the third feature processing operation may be the same as the first), after which a fourth feature processing operation is performed on the second spatial image feature using an MHTSA model connected in series with the MHSSA model, obtaining the target spatio-temporal image feature corresponding to the video clip. The fourth feature processing operation here refers to feature similarity calculation over the temporal dimension of the second spatial image feature.
In an embodiment, a fifth feature processing operation may instead be performed on the initial image feature using a trained Multi-Head Coupled Self-Attention (MHCSA) model, obtaining the target spatio-temporal image feature corresponding to the video clip directly.
Obtaining the target spatio-temporal image feature with the serial MHSSA and MHTSA models, or directly with the MHCSA model, requires more computation than obtaining it with the parallel MHSSA and MHTSA models; in a specific implementation, the parallel MHSSA and MHTSA models may therefore be used preferentially.
Here, the parallel MHSSA and MHTSA models, the serial MHSSA and MHTSA models, and the MHCSA model are all arranged within the STE Block.
S103: determining sub-pose information for each joint point of the target object based on the target spatiotemporal image features; wherein the sub-pose information of each joint point is determined based on the sub-pose information of ancestor joint points of the joint point and the target spatio-temporal image feature.
For a target object composed of multiple joint points, the position of each joint point is affected by the sub-pose parameters of its own and ancestor joint points, so that the sub-pose information of each joint point can be determined here from both the target spatio-temporal image features and the sub-pose information of the ancestor joint point of each joint point.
In the embodiments of the present disclosure, when the target object is a human body, a connection graph of the joint points of the human body may first be determined using the Skinned Multi-Person Linear model (SMPL); the root joint point of the body is then determined and taken as the ancestor of all other joint points. The root joint point can be used to transform the entire human body as a single rigid body and determines the overall pose of the body. The ancestor joint points corresponding to each joint point are then determined from the positional relationship between the root joint point and that joint point.
The joint points are the connection points of different preset parts of the target object; the root joint point is the point of origin, i.e., the ancestor of all joint points. The sub-pose information of the root joint point is determined based on the target spatio-temporal image feature, and for each joint point, its sub-pose information can be determined based on the target spatio-temporal image feature and the sub-pose information of its ancestor joint points.
Here, the target spatio-temporal image feature may be input into a Kinematic Topology Decoder (KTD) to obtain the sub-pose information of each joint point. The KTD may comprise a plurality of trained linear regressors, one per joint point; each linear regressor derives the sub-pose information of its joint point from the target spatio-temporal image feature and the sub-pose information of that joint point's ancestors.
In the connection diagram of human joint points shown in fig. 5, taking joint points (0, 2, 5) as an example, joint point 0 is the root joint point. In the diagram, joint point 0 points to joint point 2, so joint point 0 is an ancestor joint point (specifically, the parent) of joint point 2; joint point 2 in turn points to joint point 5, so joint point 2 is an ancestor joint point (specifically, the parent) of joint point 5, and joint points 0 and 2 are both ancestors of joint point 5.
The target spatio-temporal image feature can be input into a trained linear regressor to output the sub-pose information of the root joint point, i.e., the overall pose. The target spatio-temporal image feature and the root joint point's sub-pose information are then input into a second trained linear regressor to output the sub-pose information of joint point 2, and the target spatio-temporal image feature together with the sub-pose information of the root joint point and of joint point 2 are input into a third trained linear regressor to output the sub-pose information of joint point 5.
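The decoding order just described can be sketched as follows: one linear regressor per joint point, each fed the target spatio-temporal image feature concatenated with its ancestors' sub-poses. The joint indices follow the fig. 5 excerpt only, and the feature and per-joint pose dimensions are assumptions:

```python
import torch
import torch.nn as nn

parents = {0: [], 2: [0], 5: [0, 2]}   # joint -> ancestor joints (fig. 5 excerpt)
feat_dim, pose_dim = 256, 6            # assumed feature / per-joint pose sizes

# One trained linear regressor per joint point, as in the KTD.
regressors = nn.ModuleDict({
    str(j): nn.Linear(feat_dim + len(anc) * pose_dim, pose_dim)
    for j, anc in parents.items()
})

def decode_sub_poses(f: torch.Tensor) -> dict:
    # f: (feat_dim,) target spatio-temporal image feature
    poses = {}
    for j, anc in parents.items():     # root first, then down the chain
        inp = torch.cat([f] + [poses[a] for a in anc])
        poses[j] = regressors[str(j)](inp)
    return poses                        # sub-pose information per joint point
```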
S104: and determining target posture information corresponding to the target object based on the sub-posture information of each joint point.
The overall posture information of the target object or the posture information of the target part, etc. may be determined from the sub-posture information of each joint point.
In a specific implementation, after the target spatio-temporal image feature corresponding to the video clip is determined, the morphological information of the target object may be determined based on it; the morphological information may include height, weight, and similar attributes. A human motion model of the target object may then be generated from the morphological information and the sub-pose information of each joint point, and tasks such as motion analysis may be performed on the target object based on this model.
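As an illustration of how the morphological information and the decoded sub-poses might be combined into a body model for downstream motion analysis, here is a sketch using the third-party smplx package; the package, the model file path, and the parameter shapes are assumptions — the disclosure only names SMPL:

```python
import torch
import smplx  # assumed third-party SMPL implementation

body_model = smplx.SMPL("SMPL_NEUTRAL.pkl")  # assumed model file path
betas = torch.zeros(1, 10)                   # morphological (shape) information
body_pose = torch.zeros(1, 69)               # concatenated per-joint sub-poses
global_orient = torch.zeros(1, 3)            # root joint point's overall pose

out = body_model(betas=betas, body_pose=body_pose, global_orient=global_orient)
vertices, joints = out.vertices, out.joints  # mesh and joints for motion analysis
```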
It will be understood by those skilled in the art that, in the above method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their function and possible internal logic.
Based on the same inventive concept, an information processing apparatus corresponding to the information processing method is also provided in the embodiments of the present disclosure, and because the principle of the apparatus in the embodiments of the present disclosure for solving the problem is similar to the information processing method described above in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 6, which is a schematic diagram of the architecture of an information processing apparatus according to an embodiment of the present disclosure, the apparatus includes: a first obtaining module 601, a first determining module 602, a second determining module 603, and a third determining module 604; wherein:
the first obtaining module 601 is configured to obtain a video clip including a target object;
the first determining module 602 is configured to determine a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
the second determining module 603 is configured to determine sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
the third determining module 604 is configured to determine target pose information corresponding to the target object based on the sub-pose information of each joint point.
In a possible embodiment, the apparatus further comprises a fourth determining module, configured to determine morphological information of the target object based on the target spatio-temporal image feature.
In a possible implementation, the first determining module 602 is specifically configured to:
perform a first feature processing operation on the initial image feature to obtain a first spatial image feature;
perform a second feature processing operation on the initial image feature to obtain a temporal image feature; and
determine the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature.
In a possible implementation, the first determining module 602 is specifically configured to:
determine first weight information corresponding to the first spatial image feature and second weight information corresponding to the temporal image feature based on the initial image feature; and
determine the target spatio-temporal image feature based on the first weight information, the second weight information, the first spatial image feature, and the temporal image feature.
In a possible embodiment, the apparatus further comprises:
an extraction module, configured to extract sub-image features of at least some target images in the video clip; and
a stitching module, configured to stitch the extracted sub-image features to obtain the initial image feature corresponding to the video clip.
In a possible embodiment, the stitching module is configured to:
acquire a temporal feature and a positional feature corresponding to each target image in the video clip; and
stitch the extracted sub-image features based on the temporal features and the positional features to obtain the initial image feature corresponding to the video clip.
In a possible implementation, the second determining module 603 is specifically configured to:
for each joint point, determine the ancestor joint points corresponding to the joint point based on the positional relationship between a root joint point and each joint point, wherein the joint points are connection points of different preset parts of the target object, the root joint point is an ancestor joint point of all joint points, and the sub-pose information of the root joint point is determined based on the target spatio-temporal image feature; and
for each joint point, determine the sub-pose information of the joint point based on the target spatio-temporal image feature and the sub-pose information of its ancestor joint points.
In a possible implementation, the first determining module 602 is specifically configured to:
perform a third feature processing operation on the initial image feature to obtain a second spatial image feature; and
perform a fourth feature processing operation on the second spatial image feature to obtain the target spatio-temporal image feature corresponding to the video clip.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same technical concept, an embodiment of the present disclosure also provides a computer device. Referring to fig. 7, a schematic structural diagram of a computer device 700 provided in an embodiment of the present disclosure, the device includes a processor 701, a memory 702, and a bus 703. The memory 702 is used for storing execution instructions and includes an internal memory 7021 and an external memory 7022; the internal memory 7021 temporarily stores operation data for the processor 701 and data exchanged with the external memory 7022, such as a hard disk. The processor 701 exchanges data with the external memory 7022 through the internal memory 7021, and when the computer device 700 runs, the processor 701 communicates with the memory 702 through the bus 703, causing the processor 701 to execute the following instructions:
acquiring a video clip including a target object;
determining a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
determining target pose information corresponding to the target object based on the sub-pose information of each joint point.
An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the information processing method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure also provides a computer program product carrying program code; the instructions included in the program code may be used to execute the steps of the information processing method in the above method embodiments, to which reference may be made; details are not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited to them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may, within the technical scope of the present disclosure, modify the technical solutions described in the foregoing embodiments, readily conceive of variations, or substitute equivalents for some of their technical features; such modifications, variations, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by it. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (11)

1. An information processing method, characterized by comprising:
acquiring a video clip including a target object;
determining a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
determining target pose information corresponding to the target object based on the sub-pose information of each joint point.
2. The method of claim 1, further comprising, after determining the target spatio-temporal image feature corresponding to the video clip:
determining morphological information of the target object based on the target spatio-temporal image feature.
3. The method of claim 1, wherein determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip comprises:
performing a first feature processing operation on the initial image feature to obtain a first spatial image feature;
performing a second feature processing operation on the initial image feature to obtain a temporal image feature; and
determining the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature.
4. The method of claim 3, wherein determining the target spatio-temporal image feature based on the first spatial image feature and the temporal image feature comprises:
determining first weight information corresponding to the first spatial image feature and second weight information corresponding to the temporal image feature based on the initial image feature; and
determining the target spatio-temporal image feature based on the first weight information, the second weight information, the first spatial image feature, and the temporal image feature.
5. The method according to claim 1, further comprising, before determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip:
extracting sub-image features of at least some target images in the video clip; and
stitching the extracted sub-image features to obtain the initial image feature corresponding to the video clip.
6. The method according to claim 5, wherein stitching the extracted sub-image features to obtain the initial image feature corresponding to the video clip comprises:
acquiring a temporal feature and a positional feature corresponding to each target image in the video clip; and
stitching the extracted sub-image features based on the temporal features and the positional features to obtain the initial image feature corresponding to the video clip.
7. The method of claim 1, wherein determining sub-pose information for each joint point of the target object based on the target spatio-temporal image feature comprises:
for each joint point, determining the ancestor joint points corresponding to the joint point based on the positional relationship between a root joint point and each joint point, wherein the joint points are connection points of different preset parts of the target object, the root joint point is an ancestor joint point of all joint points, and the sub-pose information of the root joint point is determined based on the target spatio-temporal image feature; and
for each joint point, determining the sub-pose information of the joint point based on the target spatio-temporal image feature and the sub-pose information of its ancestor joint points.
8. The method of claim 1, wherein determining the target spatio-temporal image feature corresponding to the video clip based on the initial image feature corresponding to the video clip comprises:
performing a third feature processing operation on the initial image feature to obtain a second spatial image feature; and
performing a fourth feature processing operation on the second spatial image feature to obtain the target spatio-temporal image feature corresponding to the video clip.
9. An information processing apparatus, characterized by comprising:
a first obtaining module, configured to obtain a video clip including a target object;
a first determining module, configured to determine a target spatio-temporal image feature corresponding to the video clip based on an initial image feature corresponding to the video clip;
a second determining module, configured to determine sub-pose information for each joint point of the target object based on the target spatio-temporal image feature, wherein the sub-pose information of each joint point is determined based on the sub-pose information of the ancestor joint points of that joint point and the target spatio-temporal image feature; and
a third determining module, configured to determine target pose information corresponding to the target object based on the sub-pose information of each joint point.
10. A computer device, comprising: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor, the processor communicating with the memory via the bus when the computer device is running, the machine-readable instructions, when executed by the processor, performing the steps of the information processing method according to any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when executed by a processor, performs the steps of the information processing method according to any one of claims 1 to 8.
CN202111088903.6A 2021-09-16 2021-09-16 Information processing method and device, computer equipment and storage medium Withdrawn CN113780215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111088903.6A CN113780215A (en) 2021-09-16 2021-09-16 Information processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111088903.6A CN113780215A (en) 2021-09-16 2021-09-16 Information processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113780215A true CN113780215A (en) 2021-12-10

Family

ID=78851514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111088903.6A Withdrawn CN113780215A (en) 2021-09-16 2021-09-16 Information processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113780215A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333069A (en) * 2022-03-03 2022-04-12 腾讯科技(深圳)有限公司 Object posture processing method, device, equipment and storage medium
CN114333069B (en) * 2022-03-03 2022-05-17 腾讯科技(深圳)有限公司 Object posture processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211210