WO2023071806A1

WO2023071806A1 - Apriori space generation method and apparatus, and computer device, storage medium, computer program and computer program product

Info

Publication number: WO2023071806A1
Application number: PCT/CN2022/124931
Authority: WO
Inventors: 许嘉晨; 汪旻; 刘文韬; 钱晨; 马利庄
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-10-29
Filing date: 2022-10-12
Publication date: 2023-05-04
Also published as: CN113920466A

Abstract

Provided in the present disclosure are an apriori space generation method and apparatus, and a computer device, a storage medium, a computer program and a computer program product. The method comprises: acquiring three-dimensional motion data respectively corresponding to at least two motions of a target object, wherein the three-dimensional motion data comprises posture data respectively corresponding to at least two postures of a corresponding motion; performing encoding processing, for removing global orientation, on the three-dimensional motion data corresponding to each motion, so as to obtain target motion data corresponding to each motion; and generating a target apriori space on the basis of the target motion data respectively corresponding to the at least two motions.

Description

Method, device, computer equipment, storage medium, computer program, and computer program product for generating a priori space

Cross References to Related Applications

This disclosure is based on the Chinese patent application with the application number 202111275623.6, the application date is October 29, 2021, and the application title is "Method, device, computer equipment and storage medium for generating prior space", and requires the Chinese patent application Priority, the entire content of the Chinese patent application is hereby incorporated by reference into this disclosure.

technical field

The present disclosure relates to the technical field of computer vision, and in particular to a method, device, computer equipment, storage medium, computer program and computer program product for generating a priori space.

Background technique

Three-dimensional human motion requires not only that each posture corresponding to the motion is reasonable, but also that the transition between continuous postures is also reasonable, so as to ensure the rationality of the overall three-dimensional human motion. In the process of reconstructing 3D human motion through neural network, using prior space to constrain the rationality of motion can make the 3D human motion reconstructed by neural network more reasonable.

However, in the existing prior space, the context information in the human motion data is lost, which affects the rationality and accuracy of the human motion data reconstructed by the neural network.

Contents of the invention

Embodiments of the present disclosure at least provide a method, device, computer equipment, storage medium, computer program, and computer program product for generating a priori space.

In the first aspect, the embodiment of the present disclosure provides a method for generating a priori space, including:

Acquiring three-dimensional motion data corresponding to at least two kinds of motions of the target object; the three-dimensional motion data includes: posture data corresponding to at least two postures corresponding to the motion;

performing encoding processing for removing the global orientation on the three-dimensional motion data corresponding to each type of motion, to obtain target motion data corresponding to each type of motion;

Based on the target motion data respectively corresponding to the at least two types of motion, a target prior space is generated.

In this method, the 3D motion data corresponding to each type of motion is encoded to remove the global orientation to generate target motion data that can represent the posture characteristics of the motion, and the global orientation information can be removed from the data space, reducing the data space. Complexity; the target prior space generated based on the target motion data can improve the rationality and accuracy of the generated target prior space; furthermore, using the target prior space to constrain the rationality and accuracy of the movement can reduce the neural network The difficulty of modeling motion data for networks.

In the second aspect, an embodiment of the present disclosure provides an apparatus for generating a priori space, including:

The acquisition module is configured to acquire three-dimensional motion data corresponding to at least two kinds of motions of the target object; the three-dimensional motion data includes: posture data corresponding to at least two postures corresponding to the motion; the encoding module is configured to correspond to each type of motion The three-dimensional motion data of the three-dimensional motion data is subjected to encoding processing to remove the global orientation to obtain the target motion data corresponding to each type of motion; the determination module is configured to generate a priori space of the target based on the target motion data corresponding to the at least two types of motion.

In a third aspect, an embodiment of the present disclosure provides a computer device, including: a processor and a memory, the memory stores machine-readable instructions executable by the processor, and the processor is configured to execute the instructions stored in the memory Machine-readable instructions, when the machine-readable instructions are executed by the processor, when the machine-readable instructions are executed by the processor, the first aspect above, or any possible implementation manner in the first aspect is executed in the steps.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed, the above-mentioned first aspect, or any possibility in the first aspect may be executed. steps in the implementation of .

In the fifth aspect, the embodiments of the present disclosure provide a computer program, including computer readable codes. When the computer readable codes are read and executed by a computer, the processor in the device executes the above first aspect, or the first aspect. A step in any one of the possible implementations of an aspect.

In a sixth aspect, an embodiment of the present disclosure provides a computer program product configured to store computer-readable instructions. When the computer-readable instructions are executed, the computer executes the above-mentioned first aspect, or any one of the possible options in the first aspect. steps in the implementation.

For the effect description of the generating device, computer equipment, computer-readable storage medium, computer program, and computer program product of the above-mentioned prior space, please refer to the description of the above-mentioned method for generating a priori space.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the following will briefly introduce the accompanying drawings used in the embodiments. The accompanying drawings here are incorporated into the specification and constitute a part of the specification. The drawings show the embodiments consistent with the present disclosure, and are used together with the description to explain the technical solutions of the present disclosure. It should be understood that the following drawings only show some embodiments of the present disclosure, and therefore should not be regarded as limiting the scope. For those skilled in the art, they can also make From these drawings other related drawings are obtained.

FIG. 1 shows a flowchart of a method for generating a priori space provided by an embodiment of the present disclosure;

Fig. 2 shows a schematic structural diagram of a network structure for determining second 3D motion data corresponding to a scale provided by an embodiment of the present disclosure;

FIG. 3 shows a schematic structural diagram of a network structure for performing feature extraction on the first 3D motion data at multiple scales to obtain second 3D motion data corresponding to multiple scales according to an embodiment of the present disclosure;

FIG. 4 shows a flow chart of a method for training an encoding neural network provided by an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of obtaining target motion data by using an encoding neural network and a decoding neural network to process acquired original three-dimensional motion data provided by an embodiment of the present disclosure;

FIG. 6 shows a schematic diagram of an apparatus for generating a priori space provided by an embodiment of the present disclosure;

Fig. 7 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.

Detailed ways

In order to make the purpose, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only It is a part of the embodiments of the present disclosure, but not all of them. The components of the disclosed embodiments generally described and illustrated herein may be arranged and designed in a variety of different configurations. Accordingly, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those skilled in the art without creative effort shall fall within the protection scope of the present disclosure.

In addition, the terms "first", "second" and the like in the description and claims in the embodiments of the present disclosure and the above drawings are used to distinguish similar objects, and not necessarily used to describe a specific order or sequence. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein can be practiced in sequences other than those illustrated or described herein.

After research, it is found that reasonable three-dimensional human motion requires not only that each posture corresponding to the motion is reasonable, but also that the transition between continuous postures is also reasonable, so as to ensure the rationality of the overall three-dimensional human motion. In the process of reconstructing 3D human motion through neural network, using a reasonable prior space to constrain the rationality of motion can make the 3D human motion reconstructed by neural network more reasonable.

The environmental information in the 3D motion data (including 3D human body motion) determines the orientation of the motion. For the same motion, under different environmental information, the orientations of the postures corresponding to the motion are different, and the 3D motion data are also different. In addition to the orientation of each posture corresponding to the motion, each posture in the 3D motion data is the same, and the motion orientation caused by this environmental information leads to a high data space complexity, which increases the difficulty of modeling human motion data. At present, in order to reduce the difficulty of neural network modeling human motion data, the impact of environmental information on human motion data is usually reduced by reducing the number of frames included in the motion, thereby reducing the complexity of the data space corresponding to human motion data, but such The prior space obtained by this method leads to the loss of context information in the human motion data, which affects the rationality and accuracy of the human motion data reconstructed by the neural network.

Based on the above research, the present disclosure provides a method, device, computer equipment, storage medium, computer program, and computer program product for generating a priori space. By performing encoding processing for removing the global orientation on the three-dimensional motion data corresponding to each type of motion, Generating target motion data that can represent the gesture characteristics of motion can remove the global orientation information from the data space, reducing the complexity of the data space, and the target prior space generated based on the target motion data can improve the generated target prior space. The rationality and accuracy of the model, and then, using the target prior space to constrain the rationality and accuracy of the motion can reduce the difficulty of neural network modeling motion data.

The defects in the above solutions are all the results obtained by the inventor after practice and careful research. Therefore, the discovery process of the above problems and the solutions proposed by the present disclosure below for the above problems should be the result of the inventor Contributions made to this disclosure during the course of this disclosure.

It should be noted that like numerals and letters denote similar items in the following figures, therefore, once an item is defined in one figure, it does not require further definition and explanation in subsequent figures.

It should be noted that the specific nouns mentioned in the embodiments of the present disclosure include:

Yaw: Indicates that the direction from the bottom to the top of the posture corresponding to the first frame of posture data in the 3D motion data is the y-axis, the posture corresponding to the first frame of posture data is from the left to the right is the x-axis, and the posture corresponding to the first frame of posture data is from the front to the top. The angle of rotation on the y-axis in the three-dimensional coordinate system established for the z-axis later;

DCT: Discrete Cosine Transform, discrete cosine transform, transforms data from the spatial domain to the frequency domain, enabling data or image compression.

In order to facilitate the understanding of this embodiment, a method for generating a priori space disclosed in the embodiment of the present disclosure is firstly introduced in detail. The execution subject of the method for generating a priori space provided by the embodiment of the present disclosure generally has certain computing power computer equipment, the computer equipment includes, for example: terminal equipment or server or other processing equipment, the terminal equipment can be user equipment (User Equipment, UE), mobile equipment, user terminal, terminal, cellular phone, cordless phone, personal digital assistant ( Personal Digital Assistant, PDA), handheld devices, computing devices, vehicle-mounted devices, wearable devices, etc. In some possible implementation manners, the method for generating the prior space may be implemented by a processor invoking computer-readable instructions stored in a memory.

The method for generating a priori space provided by the embodiments of the present disclosure will be described in detail below.

As shown in FIG. 1, a flow chart of a method for generating a priori space provided by an embodiment of the present disclosure may include the following steps:

S101: Acquire three-dimensional motion data respectively corresponding to at least two kinds of motions of a target object; the three-dimensional motion data includes: gesture data respectively corresponding to at least two postures corresponding to the motions.

Here, the target object may include a moving object such as a target person or a target animal. The three-dimensional motion data can be the posture data corresponding to each posture generated when the target object performs a certain movement. The posture data can represent the posture of the target object, and can include posture and orientation information corresponding to the posture, wherein the orientation information corresponding to the posture is the orientation information affected by the global orientation of the target object.

When obtaining the 3D motion data corresponding to the target object, for example, it is possible to collect multiple frames of images when the target object is moving, and recover the 3D posture of the human body on the multiple frames of images to obtain a frame of posture data corresponding to each frame of image. The corresponding posture data constitute a set of three-dimensional motion data of the target object. When recovering the three-dimensional pose of the human body on multiple frames of images, the poses of each frame will be affected by the orientation of the human body during motion, wherein the orientation of the human body during motion is the global orientation.

The 3D motion data corresponding to each motion in the present disclosure includes a preset number of frames of pose data, for example, 128 frames of pose data. During implementation, the 3D motion data may be 128 frames*72 dimensional motion data. The 72-dimensional data includes, for example, a 3-dimensional pose and 3-dimensional position information corresponding to 23 key points of the human body.

In some embodiments, three-dimensional motion data corresponding to various motions of the target object may be acquired in the original prior space. Wherein, the original prior space may include three-dimensional motion data corresponding to at least one type of motion corresponding to the target object, and the three-dimensional motion data corresponding to each type of motion includes, for example, multiple groups. According to different image processing tasks, the target object is different, and the corresponding target object's actions are also different; for example, the target object can be a "person", and the corresponding actions include: running, jumping, walking, raising legs, turning around, etc.

During implementation, the posture data corresponding to each posture can be determined based on the various postures of the target object collected by the sensor when it is moving, and the determined posture data can be used as the three-dimensional motion data corresponding to the target object. Each posture corresponding to each time is determined, and the posture data corresponding to each movement is determined.

In addition, in the case where the 3D motion data corresponding to the target object is obtained from the original prior space, the original prior space can also be obtained in the following manner:

The three-dimensional motion data is obtained based on the various postures of the target object collected by the sensor when it is moving;

After obtaining the three-dimensional motion data corresponding to each type of motion, the original prior space can be formed based on the three-dimensional motion data corresponding to the various motions of the target object.

Or, the existing 3D motion data set can also be used to select the 3D motion data corresponding to various motions of the target object to obtain the original prior space; or, the original prior space can be pre-generated and stored in the A storage space is provided, and when the target prior space is generated, the original prior space is directly read from the preset storage space.

S102: Perform encoding processing for removing the global orientation on the three-dimensional motion data corresponding to each type of motion, to obtain target motion data corresponding to each type of motion.

In this embodiment, the target motion data may be the motion data corresponding to each type of motion of the target object and does not include the global orientation. For example, the orientation corresponding to the pose of the first frame in each target motion data is a predetermined target direction. The global orientation can represent the orientation of the motion corresponding to the three-dimensional motion data.

When modeling the motion of the target object, it is necessary to make each posture in the motion reasonable and coherent; and for the data space containing 3D motion data, not every 3D motion data corresponding to each spatial point is reasonable , and reasonable 3D motion data are sparsely distributed in this data space; this leads to problems such as irrationality and incoherence in the 3D motion data obtained by motion modeling based on the original prior space. The present disclosure can further compress the 3D motion data into a smaller data space by encoding the 3D motion data, so that the distribution of the 3D motion data that is originally sparsely distributed in the data space in the compressed data space is more compact. Intensive, and thus better supervision of motion modeling, reducing the unreasonable and incoherent 3D motion data obtained by motion modeling.

During implementation, after the original priori space is obtained, for the 3D motion data corresponding to each type of motion in the original priori space, the 3D motion data can be encoded to remove the global orientation, that is, the 3D motion data in the The orientation corresponding to the first frame posture of is adjusted to the target direction, wherein, the target direction may be that the yaw corresponding to the first frame posture is 0 degrees.

Then, adjust the direction of each frame of posture except the first frame posture in the three-dimensional motion data, and then perform encoding processing based on the adjusted postures to obtain the three-dimensional motion The target motion data corresponding to the data; wherein, the relative angle and direction between the postures in the target motion data remain unchanged. Based on the above steps, target motion data corresponding to each type of motion can be determined.

The disclosure generates target motion data that can represent the posture characteristics of the motion by performing encoding processing for removing the global orientation on the three-dimensional motion data corresponding to each type of motion, and can remove the global orientation information from the data space, reducing the complexity of the data space , the target prior space generated based on the target motion data can improve the rationality and accuracy of the generated target prior space, and then use the target prior space to constrain the rationality and accuracy of the motion, which can reduce the neural network construction. The difficulty of modulating motion data.

S103: Based on the target motion data respectively corresponding to the at least two types of motion, generate a target prior space.

In the embodiments of the present disclosure, the target prior space may be formed directly based on the target motion data corresponding to various sports as multiple sets of prior data in the target prior space. Among them, the target motion data in the target prior space is further compressed in the data space compared with the data in the original prior space, which reduces the influence of the global orientation on the data, so the data space is simpler and more compact, which is more conducive to motion modeling.

After the target prior space is obtained, when the neural network models the motion data, the target prior space can be used to constrain the rationality of motion modeling to realize motion modeling.

The disclosure generates a target prior space based on target motion data, which can improve the rationality of the generated target prior space, and further, uses the target prior space to constrain the rationality of motion, which can reduce the difficulty of neural network modeling motion data, thereby Improve the rationality and accuracy of motion data reconstructed by neural networks.

In some embodiments, after the target prior space is generated, the motion type of the target object may also be identified based on the target prior space.

In the embodiment of the present disclosure, for the task of identifying the motion type of the target object, after obtaining the motion video corresponding to the target object, based on the pose data of each pose corresponding to the target object in the motion video, the target prior space includes Among the posture data corresponding to each target motion data, determine the posture data that matches the posture data corresponding to the target object in the motion video, and use the motion type corresponding to the matched posture data as the motion type of the target object in the motion video.

During implementation, the motion type of the target object can be identified as follows:

Step 1: Obtain a motion video of the target object when it is in motion.

Here, a target acquisition device, such as a camera, may be used to capture a motion video of the target object when it is in motion.

Step 2, performing feature extraction on the motion video to obtain motion feature data.

Here, the motion feature data can represent posture data corresponding to each posture of the target object when it is moving, and each posture does not include orientation information.

During implementation, feature extraction is performed on each frame of image corresponding to the motion video, the posture feature of the target object in each frame of image is determined, and the posture feature is used as the motion feature data corresponding to the frame of image, thereby, the corresponding motion feature of each frame of image can be obtained. Motion characteristic data.

Step 3. Based on the motion feature data, determine the target motion data matching the motion feature data from the target prior space.

In this step, the posture features corresponding to each frame posture in the motion feature data and the posture data of each posture corresponding to each target motion data in the target prior space can be consistent matched, and the posture data of each posture included are respectively The target motion data matched with each gesture feature corresponding to the motion feature data is used as the target motion data matched with the motion feature data.

Step 4: Determine the motion type of the target object based on the motion type corresponding to the target motion data matched with the motion feature data.

Here, the motion type corresponding to the target motion data matched with the motion feature data may be used as the motion type of the target object in the motion video.

In the present disclosure, the extracted motion feature data that can represent the corresponding postures of the target object during motion is matched with the target motion data in the target prior space, and the target motion data that matches the motion feature data can be determined. Further, Based on the motion type corresponding to the target motion data matched with the motion feature data, the motion type of the target object can be accurately determined.

In one embodiment, for S102, for the three-dimensional motion data corresponding to each type of motion, the pre-trained target encoding neural network can be used to perform encoding processing for removing the global orientation on the three-dimensional motion data corresponding to this type of motion, to obtain the The target motion data corresponding to this type of motion.

The target encoding neural network is a pre-trained encoding network, which can encode the input 3D motion data to remove the global orientation, so as to obtain more accurate target motion data.

In some embodiments, for S102, for example, the following steps may be used to implement:

S102-1: Determine first frequency domain data corresponding to the three-dimensional motion data in the frequency domain.

Here, each attitude data corresponding to the three-dimensional motion data is data in the spatial domain, and the first frequency domain data is used to represent the fusion coefficients of each attitude data on the first number of frequency domain components, wherein the first number is a preset Certainly, the frequency domain component is used to represent the amount of information corresponding to the attitude data. The corresponding data dimension of the first frequency domain data may be preset, for example, the corresponding data dimension of the first frequency domain data may be a first number m. Taking the 3D motion data including 128 frames of attitude data as an example, and the first number is m, the first frequency domain data may be m-dimensional fusion coefficients of the 128 frames of attitude data on m frequency domain components. Among them, m is a positive integer.

In this step, for the three-dimensional motion data corresponding to each type of motion, DCT may be used to transform the three-dimensional motion data to obtain first frequency-domain data corresponding to the three-dimensional motion data in the frequency domain. During implementation, the overall position change corresponding to each key point of the target object can be determined based on the three-dimensional motion data, and then, DCT is used to convert the three-dimensional motion data that can represent the overall position change corresponding to each key point of the target object into a first frequency domain data.

Alternatively, Fourier transform may also be used to transform the three-dimensional motion data, so as to obtain first frequency domain data corresponding to the three-dimensional motion data in the frequency domain.

S102-2: Divide the three-dimensional motion data into at least two groups of three-dimensional motion sub-data, and determine second frequency-domain data respectively corresponding to the at least two groups of three-dimensional motion sub-data in the frequency domain.

Here, the second frequency domain data is used to characterize the values of each attitude data corresponding to each group of three-dimensional motion sub-data on the second number of frequency domain components, and the second number is also preset. The data dimension corresponding to the second frequency domain data may be preset, for example, the data dimension corresponding to the second frequency domain data may be n.

During implementation, the multi-frame attitude data in the three-dimensional motion data can be divided into a second number of groups according to the order of the poses corresponding to each frame of pose data in the three-dimensional motion data according to the second number, and each group The posture data is used as a set of three-dimensional motion sub-data, thus, multiple sets of three-dimensional motion sub-data are obtained. That is, the motion corresponding to the 3D motion data is divided into multiple segments, and the motion of one segment corresponds to a group of 3D motion sub-data.

Then, for each set of three-dimensional motion sub-data, based on the set of three-dimensional motion sub-data, position changes of each key point of the target object in the segmented motion corresponding to the set of three-dimensional motion sub-data may be determined. Furthermore, DCT or Fourier transform can be used to convert the three-dimensional motion data capable of characterizing the position changes of each key point of the target object in the segmented motion corresponding to the group of three-dimensional motion sub-data into the second frequency domain data, The second frequency domain data corresponding to the group of three-dimensional motion sub-data is obtained.

Exemplarily, when the 3D motion data includes 128 frames of posture data, and the second number is S, the 128 frames of posture data can be divided into S groups first to obtain S groups of 3D motion sub-data, and each group of 3D motion sub-data Including 128/S frame attitude data. Then, DCT may be used to convert each set of three-dimensional motion sub-data to obtain n-dimensional second frequency-domain data corresponding to each set of three-dimensional motion sub-data. Furthermore, based on the conversion of each group of three-dimensional motion sub-data, S pieces of n-dimensional second frequency domain data can be obtained.

S102-3: Based on the first frequency domain data and the second frequency domain data, perform compression processing for removing the global orientation on the three-dimensional motion data to obtain target motion data.

Here, based on the second frequency domain data corresponding to each group of 3D motion sub-data, the 3D motion data corresponding to each group of 3D motion sub-data can be compressed, and then the compressed motion data and the first frequency domain data can be fused , so as to complete the compression process of removing the global orientation on the three-dimensional motion data, and obtain the target motion data.

In addition, for S102-1 to S102-3, it can be performed by using the target coding neural network. After obtaining the 3D motion data, the 3D motion data can be input into the target coding neural network, and then the DCT conversion in the target coding neural network The module can output the first frequency domain data corresponding to the three-dimensional motion data, and the second frequency domain data corresponding to each group of three-dimensional motion sub-data, and then use the first frequency domain data and the second frequency domain data to analyze the three-dimensional motion data Perform compression processing to remove the global orientation, and output target motion data.

In this way, the frequency domain data can represent the change information of each key point of the target object in the three-dimensional motion data in the frequency domain, and each key point of the target object can accurately reflect the posture of the target object. By converting the three-dimensional motion data to the frequency domain, it can Obtain the first frequency domain data that can reflect the overall change information of each key point of the target object during motion and the overall information amount of each key point of the target object during motion; by dividing the three-dimensional motion data into multiple groups of three-dimensional motion The sub-data can realize the segmentation processing of the overall motion of the target object, and realize the detailed analysis of the overall motion of the target object; The change information of key points in each segment of motion, and the second frequency domain data of the amount of information corresponding to each key point in each segment of motion; furthermore, based on the first frequency domain data and the second frequency domain data, it is beneficial to realize the three-dimensional motion Accurately remove the global orientation of the data to obtain accurate and reasonable target motion data.

In some embodiments, for S102-3: it can be implemented according to the following steps:

S102-3-1: Based on the first frequency domain data, obtain frequency domain feature data of the three-dimensional motion data.

During implementation, after the first frequency domain data is obtained, the fully connected layer in the target coding neural network can be used to perform mapping processing on the first frequency domain data, so as to obtain the frequency domain feature data corresponding to the three-dimensional motion data. For example, the m-dimensional first frequency-domain data may be converted into 512-dimensional frequency-domain feature data.

S102-3-2: Based on the second frequency domain data, determine weights corresponding to at least two sets of three-dimensional motion sub-data; based on weights corresponding to at least two sets of three-dimensional motion sub-data, perform weighting processing on the three-dimensional motion data to obtain the first 3D motion data.

In some embodiments, when determining weights corresponding to multiple sets of three-dimensional motion sub-data based on the second frequency-domain data, the second frequency-domain data corresponding to multiple sets of three-dimensional motion sub-data can be fused first to obtain the fusion frequency domain data, wherein the dimension corresponding to the fused frequency domain data is the same as the number of groups of the three-dimensional motion sub-data. Continuing the above example of dividing 128 frames of attitude data into S groups to obtain S groups of three-dimensional motion sub-data, after obtaining S pieces of n-dimensional second frequency domain data, the S pieces of n-dimensional second frequency domain data can be fused into Fused frequency domain data for S*n. In this disclosure, for the second frequency domain data (n-dimensional second frequency domain data) corresponding to each group of three-dimensional motion sub-data in S groups of three-dimensional motion sub-data, the second frequency domain data can be firstly determined based on the second frequency domain data. domain data, and then, the determined weight vectors corresponding to each second frequency domain data may be fused to obtain S*n fused frequency domain data.

Then, normalization processing can be performed on the fused frequency domain data to obtain weights corresponding to multiple sets of three-dimensional motion sub-data. For example, the softmax function may be used to normalize the S*n fused frequency domain data, and output S normalized weight vectors. Wherein, each normalized weight vector corresponds to each pose data in a set of three-dimensional motion sub-data.

Furthermore, the weight vector corresponding to each group of three-dimensional motion sub-data may be used as the weight corresponding to this group of three-dimensional motion sub-data. In the present disclosure, after determining the respective weights corresponding to multiple sets of three-dimensional motion sub-data, weighting can be performed on each frame of attitude data corresponding to the set of three-dimensional motion sub-data based on the weight corresponding to each set of three-dimensional motion sub-data to obtain the set of three-dimensional motion The weighted 3D motion data corresponding to the sub-data. Afterwards, the weighted 3D motion data corresponding to each group of 3D motion sub-data may be used as the first 3D motion data.

S102-3-3: Perform feature extraction in at least two scales on the first three-dimensional motion data to obtain second three-dimensional motion data respectively corresponding to the at least two scales.

Here, a scale can correspond to one convolutional layer and at least one fully connected layer in a target encoding neural network. After obtaining the first 3D motion data, the convolutional layer and the fully connected layer corresponding to the multiple scales deployed in the target encoding neural network can be used to sequentially extract the features of the first 3D motion data at multiple scales, and obtain multiple The second three-dimensional motion data corresponding to the scales respectively.

In some embodiments, for S102-3-3, for each of the at least two scales, the convolution layer corresponding to the scale can be used to perform convolution processing on the input three-dimensional motion data corresponding to the scale, and the convolution The result of the product processing is subjected to full-connection mapping processing to obtain the second three-dimensional motion data corresponding to this scale. Wherein, the input 3D motion data includes: the second 3D motion data corresponding to the former scale of at least two scales, or the first 3D motion data. As shown in FIG. 2 , it is a schematic structural diagram of a network structure for determining second 3D motion data corresponding to a scale provided by an embodiment of the present disclosure, where the scale corresponds to a convolutional layer L101 and two full connections Layers L102 and L103. In FIG. 2 , the convolutional layer L101 performs convolution processing on the input 3D motion data, and the fully connected layers L102 and L103 perform fully connected mapping processing on the result of the convolution processing to obtain the second 3D motion data corresponding to a scale.

As shown in FIG. 3 , it is a schematic structural diagram of a network structure for performing feature extraction on the first 3D motion data at multiple scales to obtain second 3D motion data corresponding to multiple scales according to an embodiment of the present disclosure. , Fig. 3 includes three extraction modules corresponding to three kinds of scales, including the first extraction module L201, the second extraction module L202 and the third extraction module L203; wherein, each extraction module includes a volume as shown in Fig. 2 Laminate layer L101 and two fully connected layers L102 and L103. In Fig. 3, the first extraction module L201, the second extraction module L202, and the third extraction module L203 sequentially perform feature extraction on the first three-dimensional motion data at different scales, and obtain the second output corresponding to the first scale of three-dimensional motion data , the output second three-dimensional motion data corresponding to the second scale and the output second three-dimensional motion data corresponding to the third scale.

During implementation, the extraction module may be a residual block (Residual Block). In addition, the number of extraction modules can be set according to needs, and is not limited here. In the embodiment of the present disclosure, the number of extraction modules is 3 as an example for illustration.

During implementation, as shown in Figure 3, for the first scale, the 3D motion data input to the convolution layer corresponding to the first scale is the first 3D motion data, and the convolution layer is used to convolve the first 3D motion data processing to obtain the convolution processing result; after that, input the convolution processing result to the first fully connected layer, and use the first fully connected layer to perform the first fully connected mapping processing on the convolution processing result to obtain the first mapping Processing results; then input the first mapping processing results to the second fully connected layer, and use the second fully connected layer to perform further fully connected mapping processing on the first mapping processing results to obtain the second mapping processing results; finally, the The convolution processing result is fused with the second mapping processing result to obtain second 3D motion data corresponding to the first scale.

For each scale except the first scale, the second three-dimensional motion data output by the previous scale can be used as the input corresponding to the scale, and the second three-dimensional motion data output by the previous scale can be analyzed by using the convolutional layer corresponding to the scale Perform convolution processing to obtain the convolution processing result corresponding to the scale; after that, input the convolution processing result corresponding to the scale to the first fully connected layer, and use the first fully connected layer to perform the first convolution processing result. The second fully-connected mapping process is performed to obtain the first mapping processing result corresponding to the scale; then the first mapping processing result corresponding to the scale is input to the second fully-connected layer, and the second fully-connected layer is used to The first mapping processing result is further fully connected mapping processing to obtain the second mapping processing result corresponding to the scale; finally, the convolution processing result corresponding to the scale is fused with the second mapping processing result corresponding to the scale to obtain the The scale corresponds to the second 3D motion data.

Furthermore, based on the three extraction modules shown in FIG. 3 , the second three-dimensional motion data corresponding to the three scales can be obtained.

In this way, through the convolution processing and full-connection mapping processing on the three-dimensional motion data, reasonable coding of the three-dimensional motion data can be realized, and the rationality of the second three-dimensional motion data can be improved.

S102-3-4: Fusing the frequency domain feature data with the second three-dimensional motion data respectively corresponding to at least two scales to obtain target motion data.

In this step, the fully connected layer in the target encoding neural network can be used to fuse the frequency domain feature data and the second three-dimensional motion data corresponding to multiple scales to obtain the target motion data.

In this way, since the data compression degrees corresponding to the second frequency domain data reflecting different information amounts are different, the present disclosure can determine each group of three-dimensional motion sub-data based on the information amount reflected by the second frequency domain data corresponding to each group of three-dimensional motion sub-data. The weights corresponding to the sub-data, and then based on the weights corresponding to multiple sets of three-dimensional motion sub-data, weight the three-dimensional motion data, which can realize high-precision compression of each segment of motion corresponding to the three-dimensional motion data, and improve the accuracy of the first three-dimensional motion data. rationality; furthermore, by performing feature extraction on the first three-dimensional motion data at multiple scales, the second three-dimensional motion data corresponding to the first three-dimensional motion data at different depths can be obtained, thus, by extracting the first three-dimensional motion data corresponding to multiple scales respectively The fusion of the 2D and 3D motion data and the frequency domain feature data can improve the accuracy of the target motion data thanks to the richness of the second 3D motion data in the depth dimension.

In some embodiments, for S102-3-4, the frequency domain feature data and the second 3D motion data respectively corresponding to at least two scales may be spliced first to obtain spliced third 3D motion data.

Then, the spliced third three-dimensional motion data is input to the fully connected layer in the target encoding neural network, and the fully connected layer is used to perform fully connected mapping processing on the third three-dimensional motion data to obtain the target motion data.

The target motion data obtained in this disclosure can be feature data of the target dimension, for example, the target motion data can be 1*256-dimensional feature data, that is, compress the 128*72-dimensional three-dimensional motion data into 1*256-dimensional target motion data.

In this way, by splicing the frequency-domain feature data and the second three-dimensional motion data corresponding to multiple scales, the data unification of the frequency-domain feature data and the second three-dimensional motion data can be realized, and the unified third three-dimensional motion data can be obtained; Then, through the fully connected mapping processing of the third three-dimensional motion data, the third three-dimensional motion data in the hidden layer feature space can be restored to the sample initial space corresponding to the three-dimensional motion data, and the target motion data matching the sample initial space can be obtained .

In some embodiments, since obtaining the target motion data corresponding to each type of motion can be performed by using a target coding neural network, the embodiments of the present disclosure also provide a method for training a target coding neural network.

During implementation, sample data may be obtained first; wherein, the sample data includes: sample pose data corresponding to at least two sample poses; the sample pose data can represent the sample pose, for example, may include the sample pose and orientation information corresponding to the sample pose. The multiple sample poses are multiple continuous poses corresponding to one motion, and the acquired sample data is the motion data with the global orientation removed. For example, the sample data may include posture data corresponding to multiple sample movements, and the posture data corresponding to each sample movement is sample posture data corresponding to multiple sample postures of the sample movement.

In some embodiments, the step of obtaining sample data can be implemented according to the following steps:

P1: Obtain original 3D motion data corresponding to at least two sample motions; the original 3D motion data includes: attitude data corresponding to at least two sample poses.

Here, the original 3D motion data is motion data including global orientation, and the corresponding posture data in each original 3D motion data may include posture and posture orientation information, wherein the orientation information corresponding to the posture is affected by the global orientation of the target object Orientation information.

During implementation, the original three-dimensional motion data respectively corresponding to various sample motions may be obtained first.

P2: Based on the orientation information of the attitude data corresponding to the first sample attitude among the attitude data corresponding to the at least two sample attitudes, determine the steering angle corresponding to the original three-dimensional motion data.

Here, for the original three-dimensional motion data corresponding to each sample motion, the orientation information corresponding to the first sample pose can be determined based on the pose data corresponding to the first sample pose among the multiple sample poses included in the original three-dimensional data. , and then, based on the orientation information corresponding to the first sample pose, the yaw corresponding to the first sample pose can be determined; then, with the yaw corresponding to the first sample pose being 0 degrees as the target, determine the steering direction corresponding to the first sample pose on the y-axis Angle and steering direction (clockwise or counterclockwise), that is, based on the steering angle and steering direction, the first sample pose is rotated on the y-axis, and the yaw corresponding to the rotated first sample pose is 0 degrees.

Afterwards, the steering angle and steering direction may be used as the steering angle and steering direction corresponding to each sample pose included in the original three-dimensional motion data. Also, the steering angle and steering direction can be represented in the form of a rotation matrix.

Based on the above steps, the steering angle corresponding to each original three-dimensional motion data can be determined respectively.

P3: Based on the steering angle, perform steering processing on each sample pose in the original 3D motion data to obtain sample data.

In this step, for each original 3D motion data, the steering angle and steering direction corresponding to the original 3D motion data can be used to sequentially perform steering processing on each sample pose in the original 3D motion data to obtain each original 3D motion data corresponding sample data. Among them, the yaw corresponding to the first sample pose in the sample data is 0 degrees, so as to realize the normalization corresponding to the first sample pose. Since the relative angles and directions between the sample poses in the sample data are consistent with the relative angles and directions between the sample poses in the original 3D motion data corresponding to the sample data, the normalization of each sample pose in the original 3D motion data is realized .

In this way, based on the steering angle and steering direction corresponding to each original 3D motion data, the steering processing of the sample pose corresponding to each original 3D motion data can be realized, and then the normalization of each sample pose corresponding to each original 3D motion data can be realized. to obtain the sample data corresponding to each original 3D motion data. Through this scheme, the redundant information in the original 3D motion data can be reduced, thereby improving the reconstruction accuracy of the prior space.

Then multiple rounds of training can be performed on the encoding neural network (wherein the trained encoding neural network is the target encoding neural network) based on the sample data, and in each round of training, the process shown in FIG. 4 is performed. As shown in FIG. 4 , it is a flow chart of a method for training an encoding neural network provided by an embodiment of the present disclosure, which may include the following steps:

S401: Perform random global orientation and steering processing on the sample data to obtain first intermediate sample data; the first intermediate sample data includes: first pose data corresponding to at least two sample poses respectively.

Here, for the obtained sample data, the random rotation module can be used to uniformly sample the sample data to obtain a random rotation angle corresponding to the sample data within a preset range, and use the random rotation angle to perform a random rotation angle for each sample in the sample data The attitude is subjected to random global orientation and steering processing to obtain the first intermediate sample data. For example, for each sample pose, the sample pose can be rotated clockwise or counterclockwise on the y-axis, and the rotation angle is a random rotation angle, and then the rotated first sample pose corresponding to the sample pose can be obtained , and the first pose data corresponding to the first sample pose.

In this way, based on the random global orientation and steering processing performed on each sample pose, the first sample pose and the first pose data corresponding to each sample pose can be obtained respectively.

S402: Use the coding neural network to perform coding processing for removing the global orientation on the first intermediate sample data to obtain coded motion data.

During implementation, the first intermediate sample data can be input into the encoding neural network to be trained, and the encoding neural network is used to perform encoding processing to remove the global orientation on the first intermediate sample data according to the encoding methods mentioned in the above-mentioned embodiments, to obtain encoding motion data; and, during the process of processing the first intermediate sample data, the encoding neural network eliminates the random rotation angle corresponding to the first intermediate sample data, and outputs encoded motion data without information related to the random rotation angle.

S403: Use the decoding neural network to decode the coded motion data to obtain second intermediate sample data; the second intermediate sample data includes: second pose data corresponding to at least two sample poses.

Here, the decoding neural network is a neural network that matches the encoding neural network, and is used to decode and restore data encoded by the encoding neural network. For example, the coded motion data may be input into a decoding neural network to be trained, and the coded motion data may be decoded by using the decoding neural network, thereby outputting the decoded second intermediate sample data.

Wherein, the second intermediate sample data includes the second attitude data corresponding to the plurality of sample attitudes predicted by the decoding neural network; since the encoded motion data is the output data without random rotation angle-related information, the second intermediate sample There is also no information about random rotation angles in the data.

As shown in FIG. 5 , it is a schematic diagram of obtaining the second intermediate sample data by using the encoding neural network and the decoding neural network to process the acquired original 3D motion data provided by the embodiment of the present disclosure. In Fig. 5, after obtaining the original 3D motion data corresponding to the sample motion, the steering angle corresponding to the original 3D motion data can be determined first, and the steering processing is performed on each sample pose in the original 3D motion data to obtain the sample data; and then Use the random rotation module L301 to perform random global orientation steering processing on the sample data to obtain the first intermediate sample data; then use the encoding neural network L302 to perform encoding processing to remove the global orientation on the first intermediate sample data, and output encoded motion data. Afterwards, the coded motion data is decoded by the decoding neural network L303, and the second intermediate sample data is output.

It can be seen from FIG. 5 that the decoding neural network may include a decoding module L3031 and a restoration module L3032. The decoding module L3031 includes multiple fully connected layers, and the restoration module L3032 also includes a first decoding network and a second decoding network.

In Fig. 5, for the operation of outputting encoded motion data, after obtaining the first intermediate sample data, it can be input into the encoding neural network L302, and the first intermediate sample data can be DCT transformed first to obtain the first frequency domain data The second frequency domain data respectively corresponding to the plurality of sets of intermediate sub-sample data, determining weights respectively corresponding to the plurality of sets of intermediate sub-sample data based on the plurality of second frequency domain data, and then using the obtained weights to perform weighting processing on the first intermediate sample data , to obtain the first sample three-dimensional motion data; and based on the DCT transformation performed on the first intermediate sample data, determine the first frequency domain data corresponding to the first intermediate sample data, and then determine the sample frequency domain feature data based on the first frequency domain data ; Afterwards, using the first extraction module, the second extraction module and the third extraction module to perform feature extraction of multiple scales on the first sample three-dimensional motion data, and obtain second sample three-dimensional motion data corresponding to multiple scales; finally, The frequency-domain feature data of the sample is fused with the three-dimensional motion data of the second sample respectively corresponding to multiple scales to obtain coded motion data.

In Fig. 5, for the operation of outputting the second intermediate sample data, after obtaining the coded motion data, it can be input into the decoding module L3031 in the decoding neural network L303, using multiple fully connected layers (including the first fully connected layer , the second fully connected layer, the third fully connected layer, and the fourth fully connected layer) restore the encoded motion data at multiple scales, and finally output the predicted orientation feature data corresponding to each sample pose and the corresponding pose feature data. Then, the restoration module L3032 in the decoding neural network L303 can be used to perform feature splicing on the predicted orientation feature data corresponding to each sample pose and the pose feature data corresponding to each sample pose to obtain the second intermediate sample corresponding to the encoded motion data data.

For the restoration process of the decoding module, the following is a detailed description of the four fully connected layers shown in Figure 5:

First, the obtained coded motion data can be input to the first fully connected layer of the decoding module L3031, and the coded motion data is subjected to a fully connected mapping process using the first fully connected layer to obtain the first output characteristic data corresponding to the first fully connected layer; Then, the first output feature data is input to the second fully connected layer, and the second fully connected layer is used to perform full connection mapping processing on the first output feature data to obtain the second output feature data; the second output feature data and the first The output feature data is input to the third fully connected layer together, and the second output feature data and the first output feature data are used to perform full connection mapping processing on the third fully connected layer to obtain the third output feature data; the third output feature data and The second output feature data is input to the fourth fully connected layer together, and the fourth fully connected layer is used to perform fully connected mapping processing on the second output feature data and the third output feature data, and output the orientation feature data ψ ^g corresponding to each sample pose And the posture feature data ψ ^l of the specific shape of the posture corresponding to each sample posture.

Exemplarily, in the case where the encoded motion data is 1*256-dimensional data and the sample data is 128*72-dimensional data, ψ ^g can be 128*6-dimensional data, and ψ ^l can be 128*32-dimensional data .

Here, since the obtained ψ ^g and ψ ^l are implicit feature data and cannot be directly output, it is necessary to use the first decoding network D _cont in the restoration module to decode ψ ^g to obtain the displayed first target feature data , and use the second decoding network D _vp in the restoration module to decode ψ ^l to obtain the displayed second target feature data, and then perform feature stitching on the first target feature data and the second target feature data to obtain the second target feature data Intermediate sample data.

For example, in the case that the first target feature data can be 128*3-dimensional data, and the second target feature data can be 128*69-dimensional data, the second intermediate sample data of 128*72 dimensions can be obtained after splicing.

In this way, by using the above-mentioned multiple fully connected layers to restore the coded motion data, the gradient of the coded motion data can be prevented from disappearing, and the full fusion of the coded motion data can be realized.

In FIG. 5 , for the description of the loss model, refer to the description of the loss model in step S404 below.

S404: Based on the sample data and the second intermediate sample data, perform a current round of training on the encoding neural network and the decoding neural network.

In this step, based on the sample data and the second intermediate sample data, the model loss corresponding to the encoding neural network and the decoding neural network can be determined, and the encoding neural network and the decoding neural network can be performed using the model loss (shown in FIG. 5 ). In this round of training, the encoding neural network and decoding neural network completed in this round of training are obtained.

Furthermore, multiple rounds of training may be performed on the encoding neural network and the decoding neural network based on the above S401-S404.

S405: Determine the encoding neural network that has undergone at least two rounds of training as the target encoding neural network.

Here, the encoding neural network that has undergone multiple rounds of training can be determined as the target encoding neural network, and the decoding neural network that has undergone multiple rounds of training can be determined as the target decoding neural network.

Wherein, the number of rounds of multi-round training can be a preset value, or, the number of rounds of multi-round training can be determined according to the accuracy of the trained encoding neural network and the decoding neural network, and the number of training completed encoding neural network and decoding neural network When the accuracy meets the preset accuracy, the target encoding neural network and the target decoding neural network are obtained.

In this way, by performing global orientation steering processing on the sample data, the noise-adding processing of the sample data can be realized, and the encoding neural network is used to encode the first intermediate sample data after the noise-adding processing, which can improve the denoising ability of the encoding neural network ; Using the decoding neural network to decode the coded motion data can realize the restoration of the coded motion data. In the case of reliable accuracy of the coded neural network and the decoding neural network, the second intermediate sample data output by the decoding neural network will also be relatively accurate , which is close to the sample data, therefore, based on the sample data and the second intermediate sample data, the corresponding model losses of the encoding neural network and the decoding neural network can be determined, and then multiple rounds of training are performed on the encoding neural network and the decoding neural network based on the model loss , can obtain the target encoding neural network with reliable accuracy and the decoding neural network with reliable accuracy. In some embodiments, the model loss can be determined using the following steps:

S1: Based on the sample data and the second intermediate sample data, at least one of the following is determined: a sample data reconstruction loss, a similarity loss between encoded motion data and a normal distribution.

Here, the sample data reconstruction loss is used to characterize the loss of the decoding neural network when decoding the encoded motion data and the loss of the encoding neural network to determine the encoded motion data; the similarity loss is used to characterize the encoding motion data in the first intermediate sample data as Similarity loss between conditional probability and normal distributions.

In some embodiments, for the sample data reconstruction loss in this step, it can be based on the orientation feature data corresponding to each frame of sample pose data in the sample data, the pose feature data corresponding to each frame of sample pose data, and the second intermediate sample data. The orientation feature data corresponding to the second pose data of each frame and the pose feature data corresponding to the second pose data of each frame determine the sample data reconstruction loss.

Here, each frame of sample pose data in the sample data may include orientation feature data that can characterize the orientation corresponding to each sample pose, and pose feature data that can characterize the specific pose shape corresponding to each sample pose. It can be seen from the above-mentioned embodiment that the second pose data corresponding to the multiple sample poses included in the second intermediate sample data also include the orientation feature data corresponding to the second pose data of each frame, and the pose features corresponding to the second pose data of each frame. data. After that, the sample data reconstruction loss can be determined based on the following formula (1):

Among them, L _rec represents the sample data reconstruction loss, M represents the encoding neural network and decoding neural network, β represents the shape parameter,

represents the orientation feature data corresponding to the sample pose, and θ ^l represents the pose feature data corresponding to the sample pose.

Afterwards, based on the above formula (1), as well as the sample data and the second intermediate sample data, the sample data reconstruction loss can be determined.

For the similarity loss in this step, the following formula (2) can be used to determine:

Among them, L _KL represents the similarity loss, KL represents the divergence,

Represents the first intermediate sample data that has undergone random global orientation and steering processing, z _mot represents encoded motion data,

Indicates the probability of the encoded motion data under the condition of the first intermediate sample data, N(0, I) indicates a normal distribution with a mean value of 0 and a standard deviation of 1.

In this way, based on the above formula (2), the similarity loss can be determined.

S2: After determining the similarity loss, the disclosure determines the model loss based on at least one of the following: sample data reconstruction loss, similarity loss between encoded motion data and normal distribution.

When implementing, the model loss can be determined according to the following formula (3):

L ₌ λ _rec L _rec +λ _KL L _KL (3)

Wherein, λ _rec represents the first loss coefficient corresponding to the sample data reconstruction loss, and λ _KL represents the second loss coefficient corresponding to the similarity loss.

For example, the sum obtained by multiplying the first loss coefficient and the sample data reconstruction loss may be added to the sum of the second loss coefficient and the similarity loss, and the model loss may be determined based on the result obtained after the addition.

In this way, the loss of the decoding neural network when restoring the orientation information corresponding to the pose data can be determined based on the orientation feature data, and the loss of the decoding neural network when restoring the pose of each frame can be determined based on the pose feature data, and the two losses are reconstructed as sample data The loss trains the encoding neural network and decoding neural network, which can improve the accuracy of the output orientation feature data and attitude feature data.

Those skilled in the art can understand that in the above method of implementation, the writing order of each step does not mean a strict execution order and constitutes any limitation on the implementation process. The execution order of each step should be based on its function and possible internal logic. Sure.

Based on the same inventive concept, the embodiment of the present disclosure also provides a priori space generation device corresponding to the generation method of the priori space. Since the problem-solving principle of the device in the embodiment of the present disclosure is the same as that of the priori space in the embodiment of the present disclosure The generation method of is similar, so the implementation of the device can refer to the implementation of the method.

As shown in FIG. 6 , it is a schematic diagram of a device for generating a priori space provided by an embodiment of the present disclosure, including:

The acquisition module 601 is configured to acquire three-dimensional motion data corresponding to at least two kinds of motions of the target object; the three-dimensional motion data includes: posture data corresponding to at least two postures corresponding to the motion; the encoding module 602 is configured to for each The three-dimensional motion data corresponding to the motion is subjected to encoding processing for removing the global orientation to obtain the target motion data corresponding to each type of motion; the determination module 603 is configured to generate a priori target based on the target motion data corresponding to the at least two types of motion space.

In a possible implementation manner, the device further includes: an identification module 604 configured to acquire the target object's movement data after generating the target prior space based on the target motion data respectively corresponding to at least two types of motion. motion video; feature extraction is performed on the motion video to obtain motion feature data; based on the motion feature data, determine from the target prior space the target motion data that matches the motion feature data; The motion type corresponding to the target motion data matched with the motion feature data determines the motion type of the target object.

In a possible implementation manner, the encoding module 602 is configured to determine the first frequency domain data corresponding to the 3D motion data in the frequency domain, and divide the 3D motion data into at least two groups of 3D motion sub-data , and determine the second frequency domain data respectively corresponding to the at least two groups of three-dimensional motion sub-data in the frequency domain; based on the first frequency domain data and the second frequency domain data, remove the overall Orientation compression processing to obtain the target motion data.

In a possible implementation manner, the encoding module 602 is configured to obtain frequency-domain feature data of the three-dimensional motion data based on the first frequency-domain data; and, based on the second frequency-domain data, determine weights corresponding to at least two groups of three-dimensional motion sub-data; based on weights corresponding to at least two groups of three-dimensional motion sub-data, performing weighting processing on the three-dimensional motion data to obtain first three-dimensional motion data; performing feature extraction on at least two scales to obtain second three-dimensional motion data corresponding to the at least two scales; fusing the frequency domain feature data with the second three-dimensional motion data corresponding to at least two scales to obtain The target motion data.

In a possible implementation manner, the encoding module 602 is configured to perform fusion processing on the second frequency domain data respectively corresponding to at least two groups of three-dimensional motion sub-data to obtain fusion frequency domain data; wherein, the fusion frequency domain data The dimensions of the data are the same as the number of groups of the 3D motion sub-data; the fusion frequency domain data is normalized to obtain weights corresponding to at least two groups of 3D motion sub-data.

In a possible implementation manner, the encoding module 602 is configured to perform convolution processing on the input 3D motion data corresponding to the scale for each scale of at least two scales, and perform full convolution processing on the result of the convolution processing. Connecting the mapping process to obtain the second three-dimensional motion data corresponding to the scale; wherein, the input three-dimensional motion data corresponding to the scale includes: the second three-dimensional motion data corresponding to the former scale of the at least two scales, the First 3D motion data.

In a possible implementation manner, the encoding module 602 is configured to concatenate the frequency-domain feature data and second 3D motion data corresponding to at least two scales to obtain third 3D motion data; The third three-dimensional motion data is subjected to full connection mapping processing to obtain the target motion data.

In a possible implementation manner, the coding module 602 is configured to remove the three-dimensional motion data corresponding to each type of motion by using a pre-trained target coding neural network for the three-dimensional motion data corresponding to each type of motion. The encoding process of the global orientation obtains the target motion data corresponding to each type of motion.

In a possible implementation manner, the device further includes: a training module 605 configured to train the target encoding neural network in the following manner: obtain sample data; the sample data includes: at least two sample poses respectively Corresponding sample attitude data; perform at least two rounds of training, and in each round of training, perform the following process: perform random global orientation and steering processing on the sample data to obtain first intermediate sample data; the first intermediate sample The data includes: at least two first posture data respectively corresponding to the sample postures; using a coding neural network to perform coding processing for removing the global orientation on the first intermediate sample data to obtain coded motion data; using a decoding neural network to encode the The encoded motion data is decoded to obtain second intermediate sample data; the second intermediate sample data includes: at least two second pose data corresponding to at least two sample poses; based on the sample data and the second intermediate sample The current round of training is performed on the encoding neural network and the decoding neural network; the encoding neural network that has undergone at least two rounds of training is determined as the target encoding neural network.

In a possible implementation manner, the training module 605 is configured to determine a model loss based on the sample data and the second intermediate sample data; The decoding neural network is trained for the current round.

In a possible implementation manner, the training module 605 is configured to determine at least one of the following based on the sample data and the second intermediate sample data: sample data reconstruction loss, encoded motion data, and normal distribution A similarity loss between them; determining the model loss based on at least one of the sample data reconstruction loss, the similarity loss between the coded motion data and the normal distribution.

In a possible implementation manner, the training module 605 is configured to be based on the orientation feature data corresponding to each frame of sample pose data in the sample data, the pose feature data corresponding to each frame of sample pose data, the second intermediate The orientation feature data corresponding to each frame of the second pose data in the sample data, and the pose feature data corresponding to each frame of the second pose data are used to determine the sample data reconstruction loss.

In a possible implementation manner, the training module 605 is configured to acquire original three-dimensional motion data corresponding to at least two sample motions; the original three-dimensional motion data includes: gesture data corresponding to at least two sample poses; Based on the orientation information of the attitude data corresponding to the first sample attitude in the attitude data corresponding to the at least two sample attitudes, determine the steering angle corresponding to the original three-dimensional motion data; Turning processing is performed on each sample pose in the data to obtain the sample data.

For a description of the processing flow of each module in the device and the interaction flow between the modules, reference may be made to the relevant descriptions in the foregoing method embodiments.

An embodiment of the present disclosure also provides a computer device, as shown in FIG. 7 , which is a schematic structural diagram of a computer device provided by an embodiment of the present disclosure, including:

Processor 71 and memory 72; the memory 72 stores machine-readable instructions executable by the processor 71, the processor 71 is used to execute the machine-readable instructions stored in the memory 72, and the machine-readable instructions are executed by the processor 71 During execution, the processor 71 performs the following steps: obtain the original prior space; the original prior space includes three-dimensional motion data corresponding to various movements of the target object; the three-dimensional motion data includes: Attitude data: The three-dimensional motion data corresponding to each type of motion is encoded to remove the global orientation, and the target motion data corresponding to each type of motion and the target motion data corresponding to multiple types of motion are obtained to generate the target prior space.

Above-mentioned memory 72 comprises memory 721 and external memory 722; Memory 721 here is also called internal memory, is used for temporarily storing computing data in processor 71, and the data exchanged with external memory 722 such as hard disk, processor 71 communicates with memory 721 through memory 721. The external memory 722 performs data exchange.

For the execution process of the above instructions, reference may be made to the steps of the method for generating a priori space described in the embodiments of the present disclosure.

Embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the method for generating a priori space described in the above-mentioned method embodiments is executed. step. Wherein, the storage medium may be a volatile or non-volatile computer-readable storage medium. The computer program product of the method for generating a priori space provided by the embodiments of the present disclosure includes a computer-readable storage medium storing program codes, and the instructions included in the program code can be used to execute the priori described in the above method embodiments. For the steps of the method for generating the space, refer to the foregoing method embodiments.

The computer program product can be realized by hardware, software or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) or the like.

Those skilled in the art can clearly understand that for the convenience and brevity of description, for the working process of the device described above, reference can be made to the corresponding process in the foregoing method embodiments. In the several embodiments provided in the present disclosure, it should be understood that the disclosed devices and methods may be implemented in other ways. The device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components can be combined. Or some features can be ignored, or not implemented. In another point, the mutual coupling or direct coupling or communication connection shown or discussed may be through some communication interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.

If the functions described above are implemented in the form of software function units and sold or used as independent products, they can be stored in a volatile or non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present disclosure is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in various embodiments of the present disclosure. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disc and other media that can store program codes. .

Finally, it should be noted that: the above-described embodiments are only implementations of the present disclosure, and are used to illustrate the technical solutions of the present disclosure, rather than to limit them. The scope of protection of the present disclosure is not limited thereto. This example describes the present disclosure in detail, and those skilled in the art should understand that any person familiar with the technical field can still modify or modify the technical solutions described in the foregoing embodiments within the technical scope disclosed in the present disclosure. Changes can be easily imagined, or equivalent replacements can be made to some of the technical features; and these modifications, changes or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and should be included in the scope of the technical solutions of the embodiments of the present disclosure. within the scope of protection. Therefore, the protection scope of the present disclosure should be defined by the protection scope of the claims.

Claims

A method for generating a priori space, comprising:

Acquiring three-dimensional motion data corresponding to at least two kinds of motions of the target object; the three-dimensional motion data includes: posture data corresponding to at least two postures corresponding to the motion;

performing encoding processing for removing the global orientation on the three-dimensional motion data corresponding to each type of motion, to obtain target motion data corresponding to each type of motion;

Based on the target motion data respectively corresponding to the at least two types of motion, a target prior space is generated.
The method according to claim 1, wherein, after generating the target prior space based on the target motion data respectively corresponding to the at least two types of motion, further comprising:

Obtain a motion video of the target object when it is in motion;

Carrying out feature extraction to described motion video, obtains motion feature data;

determining target motion data matching the motion feature data from the target prior space based on the motion feature data;

Based on the motion type corresponding to the target motion data matched with the motion characteristic data, the motion type of the target object is determined.
The method according to claim 1 or 2, wherein the encoding process of removing the global orientation is performed on the three-dimensional motion data corresponding to each type of motion to obtain the target motion data corresponding to each type of motion, including:

Determining the first frequency domain data corresponding to the three-dimensional motion data in the frequency domain, and dividing the three-dimensional motion data into at least two groups of three-dimensional motion sub-data, and determining that the at least two groups of three-dimensional motion sub-data respectively correspond to The second frequency domain data of ;

Based on the first frequency domain data and the second frequency domain data, the three-dimensional motion data is compressed to remove the global orientation to obtain the target motion data.
The method according to claim 3, wherein, based on the first frequency domain data and the second frequency domain data, the three-dimensional motion data is compressed to remove the global orientation to obtain the target motion data ,include:

Obtaining frequency-domain feature data of the three-dimensional motion data based on the first frequency-domain data;

And, based on the second frequency domain data, determine weights respectively corresponding to at least two groups of three-dimensional motion sub-data;

Based on weights corresponding to at least two groups of three-dimensional motion sub-data, performing weighting processing on the three-dimensional motion data to obtain first three-dimensional motion data;

performing feature extraction on at least two scales of the first three-dimensional motion data to obtain second three-dimensional motion data respectively corresponding to the at least two scales;

The target motion data is obtained by fusing the frequency-domain feature data with second three-dimensional motion data corresponding to at least two scales.
The method according to claim 4, wherein said determining weights corresponding to at least two groups of three-dimensional motion sub-data based on said second frequency domain data includes:

Perform fusion processing on the second frequency domain data respectively corresponding to at least two groups of three-dimensional motion sub-data to obtain fusion frequency domain data; wherein, the dimension of the fusion frequency domain data is the same as the number of groups of the three-dimensional motion sub-data;

Perform normalization processing on the fused frequency domain data to obtain weights respectively corresponding to at least two groups of three-dimensional motion sub-data.
The method according to claim 4 or 5, wherein said performing feature extraction on said first three-dimensional motion data in at least two scales to obtain second three-dimensional motion data respectively corresponding to at least two scales, comprising:

For each of the at least two scales, perform convolution processing on the input 3D motion data corresponding to the scale, and perform fully connected mapping processing on the result of the convolution processing to obtain second 3D motion data corresponding to the scale;

Wherein, the input 3D motion data corresponding to the scale includes: the second 3D motion data corresponding to the former scale of the at least two scales, and the first 3D motion data.
The method according to any one of claims 4 to 6, wherein said fusing said frequency-domain feature data and second three-dimensional motion data respectively corresponding to at least two scales to obtain said target motion data includes:

splicing the frequency-domain feature data and second three-dimensional motion data corresponding to at least two scales to obtain third three-dimensional motion data;

Performing full-connection mapping processing on the third three-dimensional motion data to obtain the target motion data.
The method according to any one of claims 1 to 7, wherein the encoding process of removing the global orientation is performed on the three-dimensional motion data corresponding to each type of motion to obtain the target motion data corresponding to each type of motion, including:

For the three-dimensional motion data corresponding to each type of motion, using the pre-trained target encoding neural network, the three-dimensional motion data corresponding to each type of motion is encoded to remove the global orientation, and the target motion data corresponding to each type of motion is obtained. .
The method according to claim 8, wherein the target coding neural network is obtained by training in the following manner:

Acquiring sample data; the sample data includes: sample pose data corresponding to at least two sample poses;

Perform at least two rounds of training, and in each round, perform the following procedure:

Perform random global orientation and steering processing on the sample data to obtain first intermediate sample data; the first intermediate sample data includes: at least two first posture data respectively corresponding to the sample postures;

performing encoding processing for removing the global orientation on the first intermediate sample data by using an encoding neural network to obtain encoded motion data;

Decoding the coded motion data by using a decoding neural network to obtain second intermediate sample data; the second intermediate sample data includes: at least two second pose data respectively corresponding to the sample poses;

performing a current round of training on the encoding neural network and the decoding neural network based on the sample data and the second intermediate sample data;

An encoding neural network that has undergone at least two rounds of training is determined as the target encoding neural network.
The method according to claim 9, wherein said performing a current round of training on said encoding neural network and said decoding neural network based on said sample data and said second intermediate sample data comprises:

determining a model loss based on the sample data and the second intermediate sample data;

Based on the model loss, a current round of training is performed on the encoding neural network and the decoding neural network.
The method according to claim 10, wherein said determining a model loss based on said sample data and said second intermediate sample data comprises:

Based on the sample data and the second intermediate sample data, at least one of the following is determined: a sample data reconstruction loss, a similarity loss between encoded motion data and a normal distribution;

The model loss is determined based on at least one of the sample data reconstruction loss, a similarity loss between encoded motion data and a normal distribution.
The method according to claim 11, wherein, based on the sample data and the second intermediate sample data, determining a sample data reconstruction loss comprises:

Based on the orientation characteristic data corresponding to each frame of sample posture data in the sample data, the posture characteristic data corresponding to each frame of sample posture data, the orientation characteristic data corresponding to each frame of second posture data in the second intermediate sample data, and each The pose feature data corresponding to the second pose data of the frame is used to determine the reconstruction loss of the sample data.
The method according to any one of claims 9 to 12, wherein said acquiring sample data comprises:

Acquire original three-dimensional motion data corresponding to at least two sample motions; the original three-dimensional motion data includes: attitude data corresponding to at least two sample poses;

Based on the orientation information of the attitude data corresponding to the first sample attitude among the attitude data corresponding to the at least two sample attitudes, determine the steering angle corresponding to the original three-dimensional motion data;

Based on the steering angle, perform steering processing on each sample pose in the original three-dimensional motion data to obtain the sample data.
A device for generating a priori space, comprising: an acquisition module configured to acquire three-dimensional motion data corresponding to at least two types of motion of a target object; the three-dimensional motion data includes: posture data corresponding to at least two postures of the corresponding motion The encoding module is configured to perform encoding processing for removing the global orientation on the three-dimensional motion data corresponding to each type of motion, so as to obtain the target motion data corresponding to each type of motion; the determination module is configured to be based on the target motions corresponding to the at least two types of motion data to generate a target prior space.
A computer device, comprising: a processor and a memory, the memory stores machine-readable instructions executable by the processor, the processor is configured to execute the machine-readable instructions stored in the memory, and the machine can When the read instruction is executed by the processor, the processor executes the steps of the method for generating a priori space according to any one of claims 1 to 13.
A computer-readable storage medium, a computer program is stored on the computer-readable storage medium, and when the computer program is run by a computer device, the computer device executes the prior art described in any one of claims 1 to 13 The steps of the generation method of the space.
A computer program, comprising computer-readable codes, when the computer-readable codes are read and executed by a computer, the processor in the device executes the prior art for realizing any one of claims 1-13 The steps of the generation method of the space.
A computer program product configured to store computer-readable instructions, which, when executed, cause a computer to execute the steps of the method for generating a priori space according to any one of claims 1 to 13.