CN117152843B - Digital person action control method and system - Google Patents

Digital person action control method and system

Info

Publication number
CN117152843B
Authority
CN
China
Prior art keywords: time sequence, action, feature vector, interaction, user
Prior art date
Legal status
Active
Application number
CN202311144896.6A
Other languages
Chinese (zh)
Other versions
CN117152843A (en)
Inventor
张青辉
王英
Current Assignee
4u Beijing Technology Co ltd
Original Assignee
4u Beijing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by 4u Beijing Technology Co ltd filed Critical 4u Beijing Technology Co ltd
Priority to CN202311144896.6A priority Critical patent/CN117152843B/en
Publication of CN117152843A publication Critical patent/CN117152843A/en
Application granted granted Critical
Publication of CN117152843B publication Critical patent/CN117152843B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a digital person action control method and system. The method acquires a user action interaction video captured by a camera, performs feature extraction on the video to obtain a context action time sequence semantic feature vector, and generates an action control instruction for the digital person based on that feature vector. In this way, the accuracy of identifying the user's operation intention can be enhanced, realizing more accurate and convenient digital person action control.

Description

Digital person action control method and system
Technical Field
The invention relates to the technical field of intelligent control, and in particular to a digital person action control method and system.
Background
A digital person is a three-dimensional model capable of simulating a real human in a virtual environment, and has high fidelity and interactivity. Motion control of a digital person is an important component of digital person technology that determines whether the digital person can perform a reasonable motion response according to the user's intention.
Currently, common digital person motion control methods mainly include sensor-based methods and vision-based methods. Sensor-based methods require the user to wear multiple sensors to capture motion data, which is then mapped onto the digital person's motion; they have drawbacks such as high cost, strong invasiveness, and susceptibility to interference. Vision-based methods use a camera to collect the user's action video, recognize the user's action intention through computer vision techniques, and convert it into action control instructions for the digital person.
However, existing vision-based digital human motion control methods also have some problems. For example, it is difficult to accurately extract time series features in a user action video, resulting in an unsatisfactory action recognition effect. Thus, an optimized digital human motion control scheme is desired.
Disclosure of Invention
The embodiment of the invention provides a digital person action control method and system. The method and system acquire a user action interaction video captured by a camera, perform feature extraction on the video to obtain a context action time sequence semantic feature vector, and generate an action control instruction for the digital person based on that feature vector. In this way, the accuracy of identifying the user's operation intention can be enhanced, realizing more accurate and convenient digital person action control.
The embodiment of the invention also provides a method for controlling the actions of the digital person, which comprises the following steps: acquiring user action interaction videos acquired by a camera; extracting features of the user action interaction video to obtain a context action time sequence semantic feature vector; and generating an action control instruction for the digital person based on the contextual action timing semantic feature vector.
The embodiment of the invention also provides a motion control system of the digital person, which comprises: the video acquisition module is used for acquiring user action interaction videos acquired by the camera; the feature extraction module is used for extracting features of the user action interaction video to obtain context action time sequence semantic feature vectors; and the control instruction generation module is used for generating an action control instruction for the digital person based on the context action time sequence semantic feature vector.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
In the drawings: fig. 1 is a flowchart of a method for controlling actions of a digital person according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a system architecture of a digital person motion control method according to an embodiment of the present invention.
Fig. 3 is a flowchart of the sub-steps of step 120 in a method for controlling actions of a digital person according to an embodiment of the present invention.
Fig. 4 is a block diagram of a digital human motion control system provided in an embodiment of the present invention.
Fig. 5 is an application scenario diagram of a digital person motion control method provided in an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present invention and their descriptions herein are for the purpose of explaining the present invention, but are not to be construed as limiting the invention.
Unless defined otherwise, all technical and scientific terms used in the embodiments of the application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application.
In describing embodiments of the present application, unless otherwise indicated, the term "connected" should be construed broadly: it may be an electrical connection, a communication between two elements, a direct connection, or an indirect connection via an intermediate medium. Those skilled in the art will understand the specific meaning of the term according to the circumstances.
It should be noted that the terms "first", "second", and "third" in the embodiments of the present application merely distinguish similar objects and do not denote a specific order. Where permitted, "first", "second", and "third" may be interchanged, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.
Digital persons refer to virtual characters created using computer technology and graphics technology. They may be three-dimensional models or two-dimensional images, with human appearance and behavioral characteristics. Digital people are widely used in the fields of movies, games, virtual reality, augmented reality, etc.
Creation of digital persons typically involves modeling, animation, and rendering processes. Modeling refers to the creation of a three-dimensional model of a digital person, including details of outline, muscles, bones, etc., using computer software. Animation refers to giving a vivid action and expression to a digital person, and can be realized by key frame animation, action capturing and other technologies. Rendering refers to adding effects such as illumination, textures and the like to digital human models and animations so that the digital human models and animations present realistic visual effects on a screen.
Digital people are very widely used. In movies and games, a digital person may play a role, interacting with a real actor or other digital person. In virtual reality and augmented reality, a digital person may appear as a virtual tour guide, virtual assistant, or virtual character, interacting with a user. The digital person can also be used in the fields of education, training, medical treatment and the like, and can provide functions of virtual experiments, simulation training, medical visualization and the like.
With the continuous development of computer technology, the fidelity and interactivity of digital people are continuously improved. In the future, digital people are expected to play a more important role in various fields, and more fun and convenience are brought to people.
Conventional digital person motion control methods include the following: 1. Keyframe animation (Keyframe Animation): a keyframe-based animation technique that defines the digital person's pose and motion at different points in time by setting keyframes on a time axis. Animation software then generates smooth animation transitions by interpolating between the keyframes, as illustrated by the sketch after this overview.
2. Motion Capture (Motion Capture): motion capture techniques use sensors or cameras to record motion data of a real human and apply it to a digital human model. The sensor may be an inertial measurement unit (Inertial Measurement Unit, IMU for short), an optical sensor, an electromagnetic sensor, or the like. By capturing motion data of a real human, a digital person can simulate and reproduce a corresponding motion.
3. Physics engine (Physics Engine): a physics engine is a computer program that simulates physical laws, such as the movement, collision, and gravity of objects. For digital persons, a physics engine can be used to simulate the bones, joints, and muscles of the human body, making the digital person's motion more realistic and lifelike.
4. Motion Planning (Motion Planning): motion planning refers to calculating reasonable motion paths and trajectories of a digital person through an algorithm under given environmental and constraint conditions. The method can be used for realizing navigation, obstacle avoidance and planning of complex actions of the digital person.
5. Control algorithm (Control Algorithm): control algorithms refer to controlling the actions of a digital person by programming or algorithms. Such a method may determine the digital person's action response based on user input or external conditions, such as controlling the digital person's action based on keyboard input or voice commands.
These methods may be applied alone or in combination to achieve motion control of a digital person. Different methods are suitable for different application scenes and requirements, and the fidelity and interactivity of digital people can be improved by selecting a proper action control method.
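As an illustrative aside (not part of the original disclosure), the interpolation step of keyframe animation can be sketched as follows; the joint count, pose values, and 30 fps sampling rate are assumptions chosen for the example.

```python
import numpy as np

def lerp_pose(pose_a: np.ndarray, pose_b: np.ndarray, t: float) -> np.ndarray:
    """Linearly interpolate between two joint-angle vectors (0 <= t <= 1)."""
    return (1.0 - t) * pose_a + t * pose_b

# Two keyframes for a three-joint limb, defined at 0.0 s and 1.0 s.
keyframe_a = np.array([0.0, 0.2, -0.1])   # joint angles in radians
keyframe_b = np.array([1.2, 0.5,  0.3])

# Sample in-between poses at 30 fps to obtain a smooth transition.
frames = [lerp_pose(keyframe_a, keyframe_b, t) for t in np.linspace(0.0, 1.0, 30)]
```

In practice, animation software typically uses richer interpolation (for example spline or quaternion interpolation) rather than plain linear blending.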
In one embodiment of the present invention, fig. 1 is a flowchart of a method for controlling actions of a digital person according to an embodiment of the present invention. Fig. 2 is a schematic diagram of a system architecture of a digital person motion control method according to an embodiment of the present invention. As shown in fig. 1 and 2, a digital person motion control method 100 according to an embodiment of the present invention includes: 110, acquiring user action interaction videos acquired by a camera; 120, extracting features of the user action interaction video to obtain a context action time sequence semantic feature vector; and 130, generating an action control instruction for the digital person based on the context action timing semantic feature vector.
In step 110, the position and angle of the camera are chosen so that the user's action interaction can be accurately captured. Capturing the user's motion interaction on video in real time provides the data basis for subsequent feature extraction and motion control.
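A minimal OpenCV sketch of this acquisition step is given below; the camera index and the 64-frame clip length are illustrative assumptions rather than values specified by the application.

```python
import cv2

def capture_interaction_clip(camera_index: int = 0, num_frames: int = 64):
    """Grab a short user action interaction clip from the camera (step 110)."""
    capture = cv2.VideoCapture(camera_index)
    frames = []
    while len(frames) < num_frames:
        ok, frame = capture.read()
        if not ok:
            break                      # camera unavailable or stream ended
        frames.append(frame)
    capture.release()
    return frames
```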
In step 120, an appropriate feature extraction algorithm, for example a deep-learning-based method, is selected to extract motion features from the video. Through feature extraction, the user's action interaction is converted into machine-understandable semantic feature vectors, providing the input for subsequent action control.
In step 130, a suitable algorithm or model is designed to map the feature vectors into the digital person's motion space and generate the corresponding motion control instructions. By generating these instructions, the digital person responds to the user's action interaction and performs the corresponding actions according to the user's intention.
Through the above steps, the quality and accuracy of the data can be improved, ensuring accurate and reliable motion control. Reasonable selection and optimization of feature extraction improves the understanding and recognition of user actions; optimizing the instruction generation algorithm improves the digital person's motion performance and interaction experience; and a smooth, effective overall flow improves the naturalness of the interaction and communication between the digital person and the user.
The digital human action control method based on the user action interaction video has the advantages of instantaneity, naturality and interactivity, and can provide a more visual and flexible digital human control mode.
Specifically, in step 110, the user action interaction video acquired by the camera is obtained. To address the above technical problems, the technical concept of the application is to collect the user action interaction video with a camera, extract time sequence feature information about the actions from it using an intelligent algorithm, and realize action control of the digital person based on the user's gestures.
It should be appreciated that when a user interacts with a digital person, the user's actions and gestures can be captured by the camera and presented in the form of a video. Moreover, the user's actions are continuous in time: for example, the user may first raise an arm, then lower it, and then turn the head. The order and timing of these actions are important for understanding the user's intent and the meaning of the actions. During the technical conception of the application, the importance of time sequence feature information in the video data was therefore recognized, and this information is used to enhance the accuracy of identifying the user's operation intention, thereby realizing more accurate and convenient digital person action control.
Based on the above, in the technical scheme of the application, the user action interaction video acquired by the camera is first obtained. The specific action types and action sequences performed by the user can be captured from the video. For example, the user may wave, clench a fist, or nod; these actions can be extracted and used to generate the corresponding digital person action control instructions.
The video may provide timing information for the action, i.e., the time and duration that the action occurred. This is important for generating smooth and accurate digital human movements. For example, if the user makes one continuous hand waving motion, the digital person may simulate the corresponding continuous hand waving motion based on the timing information in the video.
The video may capture spatial position and pose information of the user's actions. This information can be used to determine the corresponding position and pose that the digital person should take. For example, if the user lifts his arm, the digital person may simulate a corresponding hand lifting action based on the spatial position and pose information in the video.
The video may also capture the facial expression and emotional state of the user. This information can be used to adjust the digital person's expression and emotion to better interact with the user's emotion. For example, if the user laughs, the digital person may present a corresponding smile based on the expression information in the video.
The interaction context information in the video may provide clues about the user's intent and the interaction environment. For example, a user may make some action in a particular scene, or express a particular intent through gestures. Such contextual information may help the digital person to better understand the user's intent and generate corresponding motion control instructions.
Useful information in a user action interaction video includes action type, action sequence, timing and duration of actions, spatial position and pose of actions, expression and emotion of the user, and interaction context and intent. The information can be extracted and analyzed by an intelligent algorithm for finally generating an action control instruction for the digital person, so that natural interaction and action response with the user are realized.
Acquiring the user action interaction video captured by the camera plays an important role in generating the action control instructions for the digital person. Collecting the video with the camera captures the user's action behavior in real time, provides real-time input data for the digital person's action control, and allows the digital person to respond to the user's actions promptly. It also enables an interaction mode based on natural actions: the user can interact with the digital person through his or her own actions, without relying on other external devices or controllers, which enhances the user experience and communication effect.
The video may provide more detailed and comprehensive user action information, including time series, spatial location and pose of the action, etc., which may be used to more accurately understand the user's action intent and generate corresponding action control instructions. The video may capture contextual information of the user's actions, such as gestures, expressions, gestures, etc., that help better understand the user's intent and emotion, thereby generating motion control instructions that more closely match the user's expectations. The video can capture not only the actions of the user, but also voice, expressions and other visual information, and the multi-modal interaction can provide richer and comprehensive user input, so that the action control of the digital person is more intelligent and diversified.
In short, the user action interaction video acquired by the camera plays a key role in generating the action control instructions for the digital person: it provides real-time, detailed, and context-rich user action information, so that the digital person can understand the user's intention more accurately and generate the corresponding action response, realizing a more natural and intelligent interaction experience.
For step 120, fig. 3 is a flowchart of its sub-steps in the digital person action control method provided by the embodiment of the present invention. As shown in fig. 3, performing feature extraction on the user action interaction video to obtain the context action time sequence semantic feature vector includes: 121, performing multi-region time sequence feature extraction on the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams; and 122, extracting global semantic associations between the plurality of user action interaction time sequence feature diagrams to obtain the context action time sequence semantic feature vector.
First, a video is divided into a plurality of regions, which may be divided according to a body part of a user or a region of interest. Then, for each region, a time series feature is extracted, and a commonly used time series feature extraction method includes an optical flow, a key point track, a gesture sequence, and the like. These features may capture temporal variations of user actions and spatial location information.
Then, after the timing feature maps of the multiple regions are obtained, the global semantic associations between the feature maps can be further analyzed. This can be achieved by computing the similarity between feature maps, building a correlation matrix, or using a graph neural network. Global semantic association helps to understand the action relationships and context information between different regions.
Finally, the time sequence feature diagrams of the multiple areas are combined with global semantic association, so that the context action time sequence semantic feature vector can be obtained. This feature vector contains rich information in the user action interaction video, including the type of action, sequence of actions, timing and duration of actions, spatial position and pose of actions, expression and emotion of the user, and interaction context and intent.
By extracting the time sequence characteristics of multiple areas of the user action interaction video and extracting the global semantic association among the plurality of user action interaction time sequence characteristic diagrams, more comprehensive and accurate context action time sequence semantic characteristic vectors can be obtained. These feature vectors may be used for further motion recognition, intent understanding, and motion control, thereby enabling a more intelligent, natural digital human interaction experience.
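As a simple, hedged illustration of the "similarity between feature maps" option mentioned above, the sketch below computes a cosine-similarity matrix between pooled per-region feature vectors; it is only one possible realization, not the Transformer-based encoder that the application adopts later.

```python
import numpy as np

def cosine_similarity_matrix(region_features: np.ndarray) -> np.ndarray:
    """region_features: (num_regions, feature_dim) pooled timing features.

    Returns the pairwise cosine similarity between regions, exposing the
    global semantic associations before they are fused."""
    norms = np.linalg.norm(region_features, axis=1, keepdims=True) + 1e-8
    normalized = region_features / norms
    return normalized @ normalized.T

# Example: four regions (head, torso, left arm, right arm) with 128-D features.
features = np.random.randn(4, 128).astype(np.float32)
similarity = cosine_similarity_matrix(features)   # shape (4, 4)
```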
In the step 121, performing multi-region time sequence feature extraction on the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams, including: video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments; and respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on a three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
Firstly, through video segmentation, the user action interaction video is segmented into a plurality of segments, and action characteristics with finer granularity can be obtained. Each segment can be processed separately, and the time sequence features in the segment can be extracted from the segments, so that the subtle changes and dynamic features of the user actions can be captured better.
Then, the action time sequence feature extractor based on the three-dimensional convolution neural network can carry out convolution operation in a time sequence dimension, so that time sequence information of user action interaction is effectively captured. These timing diagrams may reflect the dynamics, speed, acceleration, etc. characteristics of the user's actions, providing more useful information for subsequent analysis and applications.
Next, by extracting the user action interaction timing feature diagram, action recognition and context understanding can be performed. These feature maps may be used to train a classifier or model to identify and classify the actions of the user. At the same time, the timing diagram may also be used to understand the contextual relationship of the user actions, such as the order of the sequence of actions, consistency, etc.
Finally, based on the user action interaction time sequence characteristic diagram, an action control instruction aiming at the digital person can be generated, and more natural and intelligent digital person interaction experience is realized. By analyzing the time sequence characteristics of the user actions, the user intention can be more accurately captured and converted into specific action instructions for digital people, and the interaction effect and fidelity are improved.
Through the video segmentation and the action time sequence feature extractor based on the three-dimensional convolutional neural network, a finer and more comprehensive user action interaction time sequence feature graph can be obtained, and beneficial effects are provided for subsequent tasks such as analysis, identification, control and interaction experience.
Then, multi-region time sequence feature extraction is performed on the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams. That is, the user action interaction video is split into a plurality of fragments, and time sequence feature extraction is carried out on each fragment in turn. In this way, finer-grained motion features can be attended to.
In a specific example of the present application, the implementation manner of extracting the time sequence characteristics of multiple areas to obtain multiple user action interaction time sequence characteristic diagrams for the user action interaction video is as follows: firstly, video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments; and then, respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on the three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
Video segmentation is the process of dividing the user action interaction video into a plurality of user action interaction segments. First, the video is preprocessed: video decoding converts the video file into an image sequence, frame extraction selects key frames or frames at a fixed interval as the basis for segmentation, and inter-frame differencing can be used to detect changes in motion. Next, motion detection and segmentation are performed on the preprocessed image sequence using computer vision techniques such as background modeling, motion detection, and object recognition; motion detection helps determine the start and end frames of a user action, and segmentation splits the video into multiple user action interaction segments. The resulting segments may then be screened and merged: screening selects representative or important segments according to criteria such as duration or action type, while merging joins adjacent segments into a more complete user action interaction sequence. Finally, the segmented and processed user action interaction segments are output as a video file, an image sequence, or another data format for subsequent feature extraction, analysis, and application.
The purpose of the video slicing is to divide the user action interaction video into smaller segments in order to better analyze and process the user action behavior in each segment. Thus, finer action characteristics can be provided, and tasks such as action recognition, action control, context understanding and the like are further supported, so that more intelligent and natural digital human interaction experience is realized.
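A minimal sketch of the slicing step, using only inter-frame differencing, is given below; the motion threshold and minimum segment length are illustrative assumptions, and the background modeling, object recognition, and segment screening/merging described above are omitted.

```python
import cv2
import numpy as np

def slice_interaction_video(path: str, motion_threshold: float = 8.0,
                            min_segment_len: int = 16):
    """Split a user action interaction video into motion segments."""
    capture = cv2.VideoCapture(path)
    segments, current = [], []
    prev_gray = None
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Crude motion score: mean absolute inter-frame difference.
            motion = float(np.mean(cv2.absdiff(gray, prev_gray)))
            if motion > motion_threshold:
                current.append(frame)            # action still in progress
            else:
                if len(current) >= min_segment_len:
                    segments.append(current)     # close a long-enough segment
                current = []                     # drop short bursts of motion
        prev_gray = gray
    capture.release()
    if len(current) >= min_segment_len:
        segments.append(current)
    return segments
```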
Further, the three-dimensional convolutional neural network (3D CNN) is a deep learning model for processing video data, and on the basis of the traditional two-dimensional convolutional neural network, a convolutional operation of a time dimension is introduced, so that time sequence features in the video data can be effectively extracted. Unlike a two-dimensional convolutional neural network, 3D CNN performs a convolution operation on each time step of a video sequence by sliding a convolution kernel over the time dimension, taking into account information of the time dimension in the convolution operation. Thus, time sequence changes and dynamic characteristics in video data can be captured.
The basic structure of a 3D CNN is similar to that of a two-dimensional convolutional neural network, including convolutional layers, pooling layers, and fully connected layers. However, the convolution kernels in the convolutional layers of a 3D CNN slide in three dimensions, namely the width, height, and time dimensions, so that features in both the spatial and temporal dimensions can be considered. When processing the user action interaction segments, an action time sequence feature extractor based on a three-dimensional convolutional neural network can be used. The feature extractor takes a user action interaction segment as input and extracts the time sequence features in the segment through several 3D convolution and pooling layers. These time sequence features may include motion patterns, spatial positions, posture changes, and so on.
By using the action time sequence feature extractor based on the three-dimensional convolutional neural network, the time sequence feature graph can be effectively extracted from the user action interaction segment. The feature maps can capture dynamic changes and timing information of user actions, and provide useful inputs for subsequent tasks such as action recognition, context understanding, and action control.
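A minimal PyTorch sketch of such a three-dimensional-convolution-based action timing feature extractor follows; the channel counts, kernel sizes, and pooled output shape are assumptions for illustration and are not taken from the application.

```python
import torch
import torch.nn as nn

class ActionTimingFeatureExtractor(nn.Module):
    """Small 3D CNN that turns a video segment into a timing feature map."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),   # slides over T, H, W
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # pool spatially only
            nn.Conv3d(32, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((8, 4, 4)),              # fixed-size feature map
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, num_frames, height, width)
        return self.features(clip)

# Example: one 16-frame RGB segment at 112x112 resolution.
segment = torch.randn(1, 3, 16, 112, 112)
feature_map = ActionTimingFeatureExtractor()(segment)    # (1, 64, 8, 4, 4)
```

Each segment produced by the slicing step can be passed through such an extractor to obtain its own timing feature map.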
In the step 122, extracting global semantic associations between the plurality of user action interaction time sequence feature diagrams to obtain the context action time sequence semantic feature vector includes: respectively expanding the plurality of user action interaction time sequence feature diagrams into a plurality of user action interaction time sequence feature vectors; and passing the plurality of user action interaction time sequence feature vectors through a Transformer-based action context encoder to obtain the context action time sequence semantic feature vector.
In the application, the time sequence feature diagram is unfolded into the time sequence feature vector, so that the dimension of data can be reduced, and the features are more compact and easier to process. This helps to reduce computational and memory requirements and improves the efficiency of subsequent processing. Meanwhile, the unfolding time sequence characteristic diagram can also extract more representative characteristics, and key information of user actions can be captured better.
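A one-line sketch of this unfolding step, assuming the feature-map shape produced by the extractor sketch above:

```python
import torch

# A timing feature map from the 3D CNN sketch above: (batch, C, T, H, W).
feature_map = torch.randn(1, 64, 8, 4, 4)

# Unfold it into a single user action interaction timing feature vector.
feature_vector = feature_map.flatten(start_dim=1)   # shape (1, 64 * 8 * 4 * 4) = (1, 8192)
```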
By means of the Transformer-based action context encoder, the plurality of user action interaction time sequence feature vectors can be encoded into the context action time sequence semantic feature vector. This encoding process captures the relationships and context information between user actions, thereby better representing the semantic meaning of the actions. Such feature vectors can be used for subsequent action recognition, generation, and control tasks.
By obtaining the context action time sequence semantic feature vector, the context relation and semantic meaning of the user action can be better understood. This helps to improve the accuracy of motion recognition and provides guidance for more context consistency for subsequent motion generation. Based on the feature vectors, a generation model or a control algorithm can be designed to realize more intelligent and natural digital human action generation and interaction experience.
By performing context coding on the multiple user action interaction time sequence feature diagrams, action commonalities and change rules among different users can be captured. This facilitates action reasoning and generalization, i.e. learning general action patterns and rules from existing user actions, so that new user actions and interaction scenarios can be accommodated.
Expanding the plurality of user action interaction time sequence feature diagrams into feature vectors and obtaining the context action time sequence semantic feature vector through a Transformer-based action context encoder improves the expressive capacity and semantic meaning of the features, which benefits tasks such as action understanding, generation, and control.
Then, the global semantic associations between the plurality of user action interaction time sequence feature diagrams are extracted to obtain the context action time sequence semantic feature vector. That is, because feature extraction is performed on each video segment independently, the interaction of time sequence features between segments is ignored; the technical scheme of the application therefore extracts the global semantic association between the plurality of user action interaction time sequence feature diagrams to compensate for this missing interaction.
In a specific example of the present application, the implementation of extracting global semantic associations between the plurality of user action interaction time sequence feature diagrams to obtain the context action time sequence semantic feature vector is as follows: first, the plurality of user action interaction time sequence feature diagrams are respectively expanded into a plurality of user action interaction time sequence feature vectors; the plurality of user action interaction time sequence feature vectors are then passed through a Transformer-based action context encoder to obtain the context action time sequence semantic feature vector.
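A minimal PyTorch sketch of a Transformer-based action context encoder of this kind is shown below; the model dimension, number of heads and layers, and the mean-pooling readout are illustrative assumptions, not parameters given by the application.

```python
import torch
import torch.nn as nn

class ActionContextEncoder(nn.Module):
    """Transformer encoder that fuses per-segment feature vectors into one
    context action time sequence semantic feature vector."""

    def __init__(self, feature_dim: int = 8192, model_dim: int = 256,
                 num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(feature_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, segment_vectors: torch.Tensor) -> torch.Tensor:
        # segment_vectors: (batch, num_segments, feature_dim)
        tokens = self.proj(segment_vectors)
        encoded = self.encoder(tokens)     # global self-attention across segments
        return encoded.mean(dim=1)         # pooled context feature vector

# Example: six flattened segment feature vectors of dimension 8192.
vectors = torch.randn(1, 6, 8192)
context_vector = ActionContextEncoder()(vectors)   # shape (1, 256)
```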
Specifically, the step 130 of generating an action control instruction for the digital person based on the context action time sequence semantic feature vector includes: performing feature distribution optimization on the context action time sequence semantic feature vector to obtain an optimized context action time sequence semantic feature vector; passing the optimized context action time sequence semantic feature vector through a classifier to obtain a classification result, wherein the classification result is used to represent the operation intention label corresponding to the user action interaction video; and generating the action control instruction for the digital person based on the classification result.
In one embodiment of the present application, performing feature distribution optimization on the context action timing semantic feature vector to obtain an optimized context action timing semantic feature vector, including: cascading the plurality of user action interaction time sequence feature vectors to obtain user action interaction time sequence cascading feature vectors; and carrying out Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the context action time sequence semantic feature vector to obtain the optimized context action time sequence semantic feature vector.
In the technical scheme of the application, when the context action time sequence semantic feature vector is obtained by the Transformer-based action context encoder, it expresses the time sequence image-semantic context association features of the user action interaction time sequence feature vectors. However, when these time sequence context association features are extracted, the overall distribution of the context action time sequence semantic feature vector becomes unbalanced relative to the user action interaction time sequence image-semantic features extracted by the action time sequence feature extractor based on the three-dimensional convolutional neural network, which affects the expression of the specific action image-semantic features in the local time domain.
Here, considering that the context action time sequence semantic feature vector is essentially obtained by concatenating a plurality of context-encoded user action interaction time sequence feature vectors produced by the Transformer-based context encoder, it also conforms to the serialized arrangement of the local time sequence associated image-semantic representations corresponding to the plurality of user action interaction time sequence feature vectors. Therefore, the applicant of the present application performs Hilbert space heuristic sequence tracking equalization fusion between the user action interaction time sequence cascade feature vector, denoted V1, obtained by cascading the plurality of user action interaction time sequence feature vectors, and the context action time sequence semantic feature vector, denoted V2, to obtain the optimized context action time sequence semantic feature vector, denoted V2'. Specifically, the fusion is performed according to the optimization formula of the application, in which V1 is the user action interaction time sequence cascade feature vector, V2 is the context action time sequence semantic feature vector, ||(V1; V2)||2 denotes the two-norm of the cascade vector of V1 and V2, μ denotes the mean value of the union set formed by all feature values of V1 and V2, V1 and V2 are both row vectors, ⊙ denotes position-wise multiplication, ⊕ denotes vector addition, V2' is the optimized context action time sequence semantic feature vector, {V1} is the set of feature values of all positions in the user action interaction time sequence cascade feature vector, and {V2} is the set of feature values of all positions in the context action time sequence semantic feature vector.
Here, using the complete inner-product-space property of the Hilbert space equipped with an inner product, the aggregate mean (collective average) of the sequence aggregation of the user action interaction time sequence cascade feature vector V1 and the context action time sequence semantic feature vector V2 is taken, so as to explore a sequence-based spatial distribution heuristic of V1 and V2 within the feature fusion space encoded via their contextual relevance. The local feature distribution of the context action time sequence semantic feature vector is thereby converted into a sequence-tracked instance within the fusion space, realizing tracklet-aware distribution equalization of the sequence's feature-space distribution, and thus improving the expression of the context action time sequence semantic feature vector for the specific action image-semantic features in the local time domain.
Further, the context action time sequence semantic feature vector is passed through a classifier to obtain a classification result, wherein the classification result is used for representing an operation intention label corresponding to the user action interaction video; and generating an action control instruction for the digital person based on the classification result.
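A minimal sketch of this classification and instruction-generation step follows; the intention label set, the command strings, and the 256-dimensional input are illustrative placeholders not specified by the application, and the feature distribution optimization described above is assumed to have already been applied to the input vector.

```python
import torch
import torch.nn as nn

INTENT_LABELS = ["wave", "clench_fist", "nod", "raise_arm"]      # illustrative labels
ACTION_INSTRUCTIONS = {                                          # digital-person commands
    "wave": "play_animation('wave_hand')",
    "clench_fist": "play_animation('clench_fist')",
    "nod": "play_animation('nod_head')",
    "raise_arm": "play_animation('raise_arm')",
}

classifier = nn.Linear(256, len(INTENT_LABELS))   # maps the context vector to logits

def generate_control_instruction(context_vector: torch.Tensor) -> str:
    """Classify the optimized context feature vector and look up the command."""
    logits = classifier(context_vector)                   # (1, num_labels)
    label = INTENT_LABELS[int(logits.argmax(dim=-1))]
    return ACTION_INSTRUCTIONS[label]

instruction = generate_control_instruction(torch.randn(1, 256))
```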
In summary, the digital person action control method 100 according to the embodiment of the present invention has been illustrated. It uses a camera to collect the user action interaction video, extracts time sequence feature information about the actions from it with an intelligent algorithm, and thereby implements action control of the digital person based on the user's gestures.
Fig. 4 is a block diagram of a digital human motion control system provided in an embodiment of the present invention. As shown in fig. 4, the motion control system of the digital person includes: a video acquisition module 210, configured to acquire a user action interactive video acquired by the camera; the feature extraction module 220 is configured to perform feature extraction on the user action interaction video to obtain a context action time sequence semantic feature vector; and a control instruction generation module 230 for generating an action control instruction for the digital person based on the contextual action timing semantic feature vector.
Specifically, in the motion control system of the digital person, the feature extraction module includes: the time sequence feature extraction unit is used for extracting time sequence features of multiple areas of the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams; and the global semantic association unit is used for extracting global semantic association among the plurality of user action interaction time sequence feature graphs to obtain the context action time sequence semantic feature vector.
Specifically, in the motion control system of the digital person, the timing characteristic extraction unit is configured to: video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments; and respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on a three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
It will be appreciated by those skilled in the art that the specific operation of the individual steps in the above-described digital person's motion control system has been described in detail in the above description of the digital person's motion control method with reference to fig. 1 to 3, and thus, repetitive description thereof will be omitted.
As described above, the motion control system 100 of a digital person according to an embodiment of the present invention can be implemented in various terminal devices, such as a server or the like for motion control of a digital person. In one example, the digital person's motion control system 100 according to embodiments of the present invention may be integrated into the terminal device as a software module and/or hardware module. For example, the digital person's motion control system 100 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the digital person's motion control system 100 could equally be one of the many hardware modules of the terminal device.
Alternatively, in another example, the digital person's motion control system 100 and the terminal device may be separate devices, and the digital person's motion control system 100 may be connected to the terminal device through a wired and/or wireless network and transmit interactive information in an agreed data format.
Fig. 5 is an application scenario diagram of a digital person motion control method provided in an embodiment of the present invention. As shown in fig. 5, in the application scenario, first, a user action interactive video acquired by a camera is acquired (e.g., C as illustrated in fig. 5); the acquired user action interactive video is then input into a server (e.g., S as illustrated in fig. 5) deployed with a digital person 'S action control method algorithm, wherein the server is capable of processing the user action interactive video based on the digital person' S action control method algorithm to generate an action control instruction for the digital person.
The foregoing description of the embodiments has been provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (7)

1. A method for controlling motion of a digital person, comprising:
acquiring user action interaction videos acquired by a camera;
Extracting features of the user action interaction video to obtain a context action time sequence semantic feature vector; and
Generating an action control instruction for the digital person based on the context action time sequence semantic feature vector;
wherein generating an action control instruction for a digital person based on the contextual action timing semantic feature vector comprises:
Performing feature distribution optimization on the context action time sequence semantic feature vector to obtain an optimized context action time sequence semantic feature vector;
The semantic feature vector of the optimized context action time sequence passes through a classifier to obtain a classification result, wherein the classification result is used for representing an operation intention label corresponding to the user action interaction video; and
Generating the action control instruction for the digital person based on the classification result;
performing feature distribution optimization on the context action time sequence semantic feature vector to obtain an optimized context action time sequence semantic feature vector, wherein the feature distribution optimization comprises the following steps:
Cascading the plurality of user action interaction time sequence feature vectors to obtain user action interaction time sequence cascading feature vectors;
performing Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascading feature vector and the context action time sequence semantic feature vector to obtain the optimized context action time sequence semantic feature vector;
The method for performing hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the context action time sequence semantic feature vector to obtain the optimized context action time sequence semantic feature vector comprises the following steps: carrying out Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the context action time sequence semantic feature vector by using the following optimization formula to obtain the optimization context action time sequence semantic feature vector;
Wherein, the optimization formula is:
Wherein V1 is the user action interaction time sequence cascading feature vector, V2 is the context action time sequence semantic feature vector, ||(V1; V2)||2 represents the two-norm of the cascade vector of the user action interaction time sequence cascading feature vector V1 and the context action time sequence semantic feature vector V2, μ represents the mean value of the union set formed by all feature values of the user action interaction time sequence cascading feature vector V1 and the context action time sequence semantic feature vector V2, the user action interaction time sequence cascading feature vector V1 and the context action time sequence semantic feature vector V2 are both row vectors, ⊙ represents position-wise multiplication, ⊕ represents vector addition, V2' is the optimized context action time sequence semantic feature vector, {V1} is the set of feature values of all positions in the user action interaction time sequence cascading feature vector, and {V2} is the set of feature values of all positions in the context action time sequence semantic feature vector.
2. The method of claim 1, wherein feature extraction is performed on the user action interactive video to obtain a contextual action timing semantic feature vector, comprising:
extracting time sequence characteristics of multiple areas of the user action interaction video to obtain multiple user action interaction time sequence characteristic diagrams; and
And extracting global semantic association among the plurality of user action interaction time sequence feature graphs to obtain the context action time sequence semantic feature vector.
3. The method for controlling actions of a digital person according to claim 2, wherein performing multi-region time series feature extraction on the user action interactive video to obtain a plurality of user action interactive time series feature diagrams comprises:
Video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments;
and respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on a three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
4. The method of claim 3, wherein extracting global semantic associations between the plurality of user action interaction timing feature graphs to obtain the contextual action timing semantic feature vector comprises:
Respectively expanding the plurality of user action interaction time sequence feature diagrams into a plurality of user action interaction time sequence feature vectors; and
And passing the plurality of user action interaction time sequence feature vectors through a Transformer-based action context encoder to obtain the context action time sequence semantic feature vector.
5. A digital human motion control system, comprising:
the video acquisition module is used for acquiring user action interaction videos acquired by the camera;
The feature extraction module is used for extracting features of the user action interaction video to obtain context action time sequence semantic feature vectors; and
The control instruction generation module is used for generating an action control instruction for the digital person based on the context action time sequence semantic feature vector;
Wherein, the control instruction generation module includes:
Performing feature distribution optimization on the context action time sequence semantic feature vector to obtain an optimized context action time sequence semantic feature vector;
The semantic feature vector of the optimized context action time sequence passes through a classifier to obtain a classification result, wherein the classification result is used for representing an operation intention label corresponding to the user action interaction video; and
Generating the action control instruction for the digital person based on the classification result;
performing feature distribution optimization on the context action time sequence semantic feature vector to obtain an optimized context action time sequence semantic feature vector, wherein the feature distribution optimization comprises the following steps:
Cascading the plurality of user action interaction time sequence feature vectors to obtain user action interaction time sequence cascading feature vectors;
performing Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascading feature vector and the context action time sequence semantic feature vector to obtain the optimized context action time sequence semantic feature vector;
The method for performing hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the context action time sequence semantic feature vector to obtain the optimized context action time sequence semantic feature vector comprises the following steps: carrying out Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the context action time sequence semantic feature vector by using the following optimization formula to obtain the optimization context action time sequence semantic feature vector;
Wherein, the optimization formula is:
Wherein V1 is the user action interaction time sequence cascading feature vector, V2 is the context action time sequence semantic feature vector, ||(V1; V2)||2 represents the two-norm of the cascade vector of the user action interaction time sequence cascading feature vector V1 and the context action time sequence semantic feature vector V2, μ represents the mean value of the union set formed by all feature values of the user action interaction time sequence cascading feature vector V1 and the context action time sequence semantic feature vector V2, the user action interaction time sequence cascading feature vector V1 and the context action time sequence semantic feature vector V2 are both row vectors, ⊙ represents position-wise multiplication, ⊕ represents vector addition, V2' is the optimized context action time sequence semantic feature vector, {V1} is the set of feature values of all positions in the user action interaction time sequence cascading feature vector, and {V2} is the set of feature values of all positions in the context action time sequence semantic feature vector.
6. The digital person motion control system of claim 5, wherein the feature extraction module comprises:
The time sequence feature extraction unit is used for extracting time sequence features of multiple areas of the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams; and
And the global semantic association unit is used for extracting global semantic association among the plurality of user action interaction time sequence feature graphs to obtain the context action time sequence semantic feature vector.
7. The motion control system of a digital person according to claim 6, wherein the timing feature extraction unit is configured to:
Video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments;
and respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on a three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
CN202311144896.6A 2023-09-06 2023-09-06 Digital person action control method and system Active CN117152843B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311144896.6A CN117152843B (en) 2023-09-06 2023-09-06 Digital person action control method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311144896.6A CN117152843B (en) 2023-09-06 2023-09-06 Digital person action control method and system

Publications (2)

Publication Number Publication Date
CN117152843A (en) 2023-12-01
CN117152843B (en) 2024-05-07

Family

ID=88905833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311144896.6A Active CN117152843B (en) 2023-09-06 2023-09-06 Digital person action control method and system

Country Status (1)

Country Link
CN (1) CN117152843B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230082830A1 (en) * 2020-05-18 2023-03-16 Beijing Sogou Technology Development Co., Ltd. Method and apparatus for driving digital human, and electronic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669846A (en) * 2021-03-16 2021-04-16 深圳追一科技有限公司 Interactive system, method, device, electronic equipment and storage medium
CN115455136A (en) * 2022-03-02 2022-12-09 杭州摸象大数据科技有限公司 Intelligent digital human marketing interaction method and device, computer equipment and storage medium
CN116071817A (en) * 2022-10-25 2023-05-05 中国矿业大学 Network architecture and training method of gesture recognition system for automobile cabin
CN115761813A (en) * 2022-12-13 2023-03-07 浙大城市学院 Intelligent control system and method based on big data analysis
CN116485960A (en) * 2023-04-23 2023-07-25 中国建设银行股份有限公司 Digital man driving method and device

Also Published As

Publication number Publication date
CN117152843A (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Deng et al. cGAN based facial expression recognition for human-robot interaction
He et al. Visual recognition of traffic police gestures with convolutional pose machine and handcrafted features
Mao et al. Using Kinect for real-time emotion recognition via facial expressions
Ersotelos et al. Building highly realistic facial modeling and animation: a survey
CN108363973B (en) Unconstrained 3D expression migration method
Ludl et al. Enhancing data-driven algorithms for human pose estimation and action recognition through simulation
WO2023284435A1 (en) Method and apparatus for generating animation
Kowalski et al. Holoface: Augmenting human-to-human interactions on hololens
Neverova Deep learning for human motion analysis
Kwolek et al. Recognition of JSL fingerspelling using deep convolutional neural networks
Mousas et al. Performance-driven hybrid full-body character control for navigation and interaction in virtual environments
Ekmen et al. From 2D to 3D real-time expression transfer for facial animation
CN117152843B (en) Digital person action control method and system
Usman et al. Skeleton-based motion prediction: A survey
Thalmann et al. Direct face-to-face communication between real and virtual humans
Cai et al. An automatic music-driven folk dance movements generation method based on sequence-to-sequence network
Chan et al. A generic framework for editing and synthesizing multimodal data with relative emotion strength
Bevacqua et al. Multimodal sensing, interpretation and copying of movements by a virtual agent
Sun et al. Generation of virtual digital human for customer service industry
Kumar Das et al. Audio driven artificial video face synthesis using gan and machine learning approaches
Tian et al. Augmented Reality Animation Image Information Extraction and Modeling Based on Generative Adversarial Network
Zhuang et al. A modern approach to intelligent animation: theory and practice
WO2024066549A1 (en) Data processing method and related device
Deepthi et al. 3DCNN-GRU Based Video Controller through Hand Gestures
Condell et al. HandPuppet3D: Motion capture and analysis for character animation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant