CN117152843A - Digital person action control method and system - Google Patents
Digital person action control method and system
- Publication number: CN117152843A (application CN202311144896.6A)
- Authority: CN (China)
- Prior art keywords: action, time sequence, user action, interaction, feature vector
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06V20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/70: Scenes; scene-specific elements; labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
The application discloses a method and a system for controlling the actions of a digital person. The method and system acquire a user action interaction video captured by a camera; perform feature extraction on the user action interaction video to obtain a contextual action time sequence semantic feature vector; and generate an action control instruction for the digital person based on the contextual action time sequence semantic feature vector. In this way, the accuracy of identifying the user's operation intention can be enhanced, and more accurate and convenient digital person action control can be realized.
Description
Technical Field
The application relates to the technical field of intelligent control, in particular to a digital human motion control method and a digital human motion control system.
Background
A digital person is a three-dimensional model capable of simulating a real human in a virtual environment, and has high fidelity and interactivity. Motion control of a digital person is an important component of digital person technology that determines whether the digital person can perform a reasonable motion response according to the user's intention.
Currently, the common motion control methods for digital persons fall into two categories: sensor-based methods and vision-based methods. Sensor-based methods require the user to wear multiple sensors that capture motion data, which is then mapped to the motion of the digital person; they suffer from high cost, strong invasiveness and susceptibility to interference. Vision-based methods use a camera to collect a video of the user's actions, recognize the user's action intention through computer vision techniques, and convert it into action control instructions for the digital person.
However, existing vision-based digital human motion control methods also have some problems. For example, it is difficult to accurately extract time series features in a user action video, resulting in an unsatisfactory action recognition effect. Thus, an optimized digital human motion control scheme is desired.
Disclosure of Invention
The embodiment of the application provides a motion control method and system for a digital person. The method and system acquire a user action interaction video captured by a camera; perform feature extraction on the user action interaction video to obtain a contextual action time sequence semantic feature vector; and generate an action control instruction for the digital person based on the contextual action time sequence semantic feature vector. In this way, the accuracy of identifying the user's operation intention can be enhanced, and more accurate and convenient digital person action control can be realized.
The embodiment of the application also provides a method for controlling the actions of the digital person, which comprises the following steps: acquiring user action interaction videos acquired by a camera; extracting features of the user action interaction video to obtain a context action time sequence semantic feature vector; and generating an action control instruction for the digital person based on the contextual action timing semantic feature vector.
The embodiment of the application also provides a motion control system of the digital person, which comprises: the video acquisition module is used for acquiring user action interaction videos acquired by the camera; the feature extraction module is used for extracting features of the user action interaction video to obtain context action time sequence semantic feature vectors; and the control instruction generation module is used for generating an action control instruction for the digital person based on the context action time sequence semantic feature vector.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
In the drawings: fig. 1 is a flowchart of a method for controlling actions of a digital person according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a system architecture of a digital person motion control method according to an embodiment of the present application.
Fig. 3 is a flowchart of the sub-steps of step 120 in a method for controlling actions of a digital person according to an embodiment of the present application.
Fig. 4 is a block diagram of a digital human motion control system provided in an embodiment of the present application.
Fig. 5 is an application scenario diagram of a digital person motion control method provided in an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present application and their descriptions herein are for the purpose of explaining the present application, but are not to be construed as limiting the application.
Unless defined otherwise, all technical and scientific terms used in the embodiments of the application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application.
In describing embodiments of the present application, unless otherwise indicated and limited, the term "connected" should be construed broadly: it may refer to an electrical connection, communication between two elements, a direct connection, or an indirect connection via an intermediate medium. Those skilled in the art will understand the specific meaning of the term according to the context.
It should be noted that the terms "first", "second" and "third" in the embodiments of the present application are merely used to distinguish similar objects and do not imply a specific order of those objects. It should be understood that "first", "second" and "third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be practiced in sequences other than those illustrated or described herein.
Digital persons refer to virtual characters created using computer technology and graphics technology. They may be three-dimensional models or two-dimensional images, with human appearance and behavioral characteristics. Digital people are widely used in the fields of movies, games, virtual reality, augmented reality, etc.
Creation of digital persons typically involves modeling, animation, and rendering processes. Modeling refers to the creation of a three-dimensional model of a digital person, including details of outline, muscles, bones, etc., using computer software. Animation refers to giving a vivid action and expression to a digital person, and can be realized by key frame animation, action capturing and other technologies. Rendering refers to adding effects such as illumination, textures and the like to digital human models and animations so that the digital human models and animations present realistic visual effects on a screen.
Digital people are very widely used. In movies and games, a digital person may play a role, interacting with a real actor or other digital person. In virtual reality and augmented reality, a digital person may appear as a virtual tour guide, virtual assistant, or virtual character, interacting with a user. The digital person can also be used in the fields of education, training, medical treatment and the like, and can provide functions of virtual experiments, simulation training, medical visualization and the like.
With the continuous development of computer technology, the fidelity and interactivity of digital people are continuously improved. In the future, digital people are expected to play a more important role in various fields, and more fun and convenience are brought to people.
The conventional digital person motion control method includes: 1. keyframe animation (Keyframe Animation): this is a key frame based animation technique that defines the pose and motion of a digital person at different points in time by setting key frames on a time axis. Animation software generates a smooth animation transition effect based on interpolation calculations between key frames.
2. Motion Capture (Motion Capture): motion capture techniques use sensors or cameras to record motion data of a real human and apply it to a digital human model. The sensor may be an inertial measurement unit (Inertial Measurement Unit, IMU for short), an optical sensor, an electromagnetic sensor, or the like. By capturing motion data of a real human, a digital person can simulate and reproduce a corresponding motion.
3. Physical Engine (Physics Engine): the physical engine is a computer program for simulating physical laws and can simulate physical effects such as movement, collision, gravity and the like of an object. In digital humans, the physical engine can be used to simulate the bones, joints, muscles, etc. of the human body, making the digital human motion more realistic and lifelike.
4. Motion Planning (Motion Planning): motion planning refers to calculating reasonable motion paths and trajectories of a digital person through an algorithm under given environmental and constraint conditions. The method can be used for realizing navigation, obstacle avoidance and planning of complex actions of the digital person.
5. Control algorithm (Control Algorithm): control algorithms refer to controlling the actions of a digital person by programming or algorithms. Such a method may determine the digital person's action response based on user input or external conditions, such as controlling the digital person's action based on keyboard input or voice commands.
These methods may be applied alone or in combination to achieve motion control of a digital person. Different methods are suitable for different application scenes and requirements, and the fidelity and interactivity of digital people can be improved by selecting a proper action control method.
In one embodiment of the present application, fig. 1 is a flowchart of a method for controlling actions of a digital person according to an embodiment of the present application. Fig. 2 is a schematic diagram of a system architecture of a digital person motion control method according to an embodiment of the present application. As shown in fig. 1 and 2, a digital person motion control method 100 according to an embodiment of the present application includes: 110, acquiring user action interaction videos acquired by a camera; 120, extracting features of the user action interaction video to obtain a context action time sequence semantic feature vector; and 130, generating an action control instruction for the digital person based on the context action timing semantic feature vector.
In step 110, the position and angle of the camera should be appropriate so that the user's action interaction can be captured accurately. Capturing the user's action interaction in real time through video provides the data basis for subsequent feature extraction and motion control.
In step 120, an appropriate feature extraction algorithm, for example a deep-learning-based method, is selected to extract action features from the video. Through feature extraction, the user's action interaction can be converted into machine-understandable semantic feature vectors, providing the input for subsequent action control.
In step 130, a suitable algorithm or model is designed to map the feature vectors to the motion space of the digital person and to generate the corresponding action control instructions. Generating the action control instructions realizes the digital person's response to the user's action interaction, so that the digital person performs the corresponding actions according to the user's intention.
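As a rough illustration of how these three steps could be wired together in software, the following sketch outlines the pipeline in Python with OpenCV and PyTorch, which are assumed tooling choices rather than anything prescribed by the application; the module names (feature_extractor, context_encoder, classifier) and the instruction lookup table are likewise illustrative assumptions.

```python
# Minimal pipeline sketch (assumed structure, not the application's reference implementation).
import cv2
import numpy as np
import torch

def acquire_video(camera_index=0, num_frames=64):
    """Step 110: grab a short user action interaction clip from the camera."""
    cap = cv2.VideoCapture(camera_index)
    frames = []
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (112, 112)))
    cap.release()
    return frames

def control_digital_person(frames, feature_extractor, context_encoder, classifier, instruction_table):
    """Steps 120-130: features -> contextual semantic vector -> action control instruction."""
    clip = torch.tensor(np.stack(frames), dtype=torch.float32)  # (T, H, W, C)
    clip = clip.permute(3, 0, 1, 2).unsqueeze(0) / 255.0        # (1, C, T, H, W)
    timing_features = feature_extractor(clip)                   # time sequence feature maps
    context_vector = context_encoder(timing_features)           # contextual semantic feature vector
    intent_label = classifier(context_vector).argmax(dim=-1).item()
    return instruction_table[intent_label]                      # e.g. {0: "wave_hand", 1: "nod_head"}
```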
Through the steps, the quality and the accuracy of data can be improved, and the accuracy and the reliability of motion control are ensured. Reasonable selection and optimization of feature extraction can improve understanding and recognition of user actions. The optimization of the generation algorithm of the motion control instruction can improve the motion performance and interaction experience of the digital person. The smoothness and effectiveness of the whole flow can improve the natural interaction and communication effect between the digital person and the user.
The digital human action control method based on the user action interaction video has the advantages of instantaneity, naturality and interactivity, and can provide a more visual and flexible digital human control mode.
Specifically, in the step 110, a user action interactive video acquired by the camera is acquired. Aiming at the technical problems, the technical concept of the application is to collect user action interaction videos by using an intelligent algorithm and a camera, extract time sequence characteristic information about actions from the user action interaction videos, and realize action control of digital people based on gestures of a user.
It should be appreciated that when a user interacts with a digital person, their actions and gestures can be captured by the camera and presented in the form of a video. Moreover, the user's actions are continuous in time: for example, the user may first raise an arm, then lower it, then turn the head, and so on. The order and timing information of these actions is important for understanding the user's intent and the meaning of the actions. During the conception of the present application, the importance of time sequence feature information in video data was therefore recognized, and this information is used to enhance the accuracy of identifying the user's operation intention, thereby realizing more accurate and convenient digital person action control.
Based on the above, in the technical scheme of the application, the user action interaction video acquired by the camera is firstly acquired. The specific action type and action sequence performed by the user can be captured through the video. For example, the user may make waving, fisting, nodding, etc., which may be extracted and used to generate corresponding digital human motion control instructions.
The video may provide timing information for the action, i.e., the time and duration that the action occurred. This is important for generating smooth and accurate digital human movements. For example, if the user makes one continuous hand waving motion, the digital person may simulate the corresponding continuous hand waving motion based on the timing information in the video.
The video may capture spatial position and pose information of the user's actions. This information can be used to determine the corresponding position and pose that the digital person should take. For example, if the user lifts his arm, the digital person may simulate a corresponding hand lifting action based on the spatial position and pose information in the video.
The video may also capture the facial expression and emotional state of the user. This information can be used to adjust the digital person's expression and emotion to better interact with the user's emotion. For example, if the user laughs, the digital person may present a corresponding smile based on the expression information in the video.
The interaction context information in the video may provide clues about the user's intent and the interaction environment. For example, a user may make some action in a particular scene, or express a particular intent through gestures. Such contextual information may help the digital person to better understand the user's intent and generate corresponding motion control instructions.
Useful information in a user action interaction video includes action type, action sequence, timing and duration of actions, spatial position and pose of actions, expression and emotion of the user, and interaction context and intent. The information can be extracted and analyzed by an intelligent algorithm for finally generating an action control instruction for the digital person, so that natural interaction and action response with the user are realized.
The acquisition of the user action interaction video acquired by the camera plays an important role in finally generating the action control instruction aiming at the digital person. The camera is used for collecting the action interaction video of the user, so that the action behavior of the user can be captured in real time, real-time input data is provided for the action control of the digital person, and the digital person can respond to the action of the user in time. The interaction mode based on natural actions can be realized by collecting the interaction video of the user actions through the camera. The user can interact with the digital person by his own actions without relying on other external devices or controllers. Such natural interactivity may enhance user experience and communication effects.
The video may provide more detailed and comprehensive user action information, including time series, spatial location and pose of the action, etc., which may be used to more accurately understand the user's action intent and generate corresponding action control instructions. The video may capture contextual information of the user's actions, such as gestures, expressions, gestures, etc., that help better understand the user's intent and emotion, thereby generating motion control instructions that more closely match the user's expectations. The video can capture not only the actions of the user, but also voice, expressions and other visual information, and the multi-modal interaction can provide richer and comprehensive user input, so that the action control of the digital person is more intelligent and diversified.
In short, the user action interaction video acquired by the camera plays a key role in finally generating the action control instructions for the digital person: it provides real-time, detailed and context-rich user action information, so that the digital person can more accurately understand the user's intention and generate corresponding action responses, realizing a more natural and intelligent digital person interaction experience.
For step 120, fig. 3 is a flowchart of the sub-steps of step 120 in the digital person action control method according to the embodiment of the present application. As shown in fig. 3, performing feature extraction on the user action interaction video to obtain the contextual action time sequence semantic feature vector includes: 121, performing multi-region time sequence feature extraction on the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams; and 122, extracting global semantic associations between the plurality of user action interaction time sequence feature diagrams to obtain the contextual action time sequence semantic feature vector.
First, a video is divided into a plurality of regions, which may be divided according to a body part of a user or a region of interest. Then, for each region, a time series feature is extracted, and a commonly used time series feature extraction method includes an optical flow, a key point track, a gesture sequence, and the like. These features may capture temporal variations of user actions and spatial location information.
Then, after the time sequence feature diagrams of the multiple regions are obtained, the global semantic associations between the feature diagrams can be further analyzed. This can be achieved by computing the similarity between feature diagrams, by computing a correlation matrix, or by using a graph neural network. Global semantic association helps to understand the action relationships and context information between different regions.
Finally, the time sequence feature diagrams of the multiple areas are combined with global semantic association, so that the context action time sequence semantic feature vector can be obtained. This feature vector contains rich information in the user action interaction video, including the type of action, sequence of actions, timing and duration of actions, spatial position and pose of actions, expression and emotion of the user, and interaction context and intent.
By extracting the time sequence characteristics of multiple areas of the user action interaction video and extracting the global semantic association among the plurality of user action interaction time sequence characteristic diagrams, more comprehensive and accurate context action time sequence semantic characteristic vectors can be obtained. These feature vectors may be used for further motion recognition, intent understanding, and motion control, thereby enabling a more intelligent, natural digital human interaction experience.
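As one concrete realization of the "similarity between feature diagrams" option mentioned above, the short sketch below computes a pairwise cosine-similarity matrix between pooled region-level time sequence features; this is an illustrative assumption, and the correlation-matrix and graph-neural-network variants are not shown.

```python
import torch
import torch.nn.functional as F

def region_similarity_matrix(region_features: torch.Tensor) -> torch.Tensor:
    """region_features: (R, D) tensor holding one pooled time sequence feature vector per region.
    Returns an (R, R) cosine-similarity matrix describing global semantic association."""
    normalized = F.normalize(region_features, dim=-1)  # unit-length rows
    return normalized @ normalized.t()                 # pairwise cosine similarities
```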
In the step 121, performing multi-region time sequence feature extraction on the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams, including: video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments; and respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on a three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
Firstly, through video segmentation, the user action interaction video is segmented into a plurality of segments, and action characteristics with finer granularity can be obtained. Each segment can be processed separately, and the time sequence features in the segment can be extracted from the segments, so that the subtle changes and dynamic features of the user actions can be captured better.
Then, the action time sequence feature extractor based on the three-dimensional convolution neural network can carry out convolution operation in a time sequence dimension, so that time sequence information of user action interaction is effectively captured. These timing diagrams may reflect the dynamics, speed, acceleration, etc. characteristics of the user's actions, providing more useful information for subsequent analysis and applications.
Next, by extracting the user action interaction timing feature diagram, action recognition and context understanding can be performed. These feature maps may be used to train a classifier or model to identify and classify the actions of the user. At the same time, the timing diagram may also be used to understand the contextual relationship of the user actions, such as the order of the sequence of actions, consistency, etc.
Finally, based on the user action interaction time sequence characteristic diagram, an action control instruction aiming at the digital person can be generated, and more natural and intelligent digital person interaction experience is realized. By analyzing the time sequence characteristics of the user actions, the user intention can be more accurately captured and converted into specific action instructions for digital people, and the interaction effect and fidelity are improved.
Through the video segmentation and the action time sequence feature extractor based on the three-dimensional convolutional neural network, a finer and more comprehensive user action interaction time sequence feature graph can be obtained, and beneficial effects are provided for subsequent tasks such as analysis, identification, control and interaction experience.
Then, multi-region time sequence feature extraction is performed on the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams. That is, the user action interaction video is split into a plurality of fragments, and time sequence feature extraction is then carried out on each fragment in turn. In this way, finer-grained action features can be attended to.
In a specific example of the present application, the implementation manner of extracting the time sequence characteristics of multiple areas to obtain multiple user action interaction time sequence characteristic diagrams for the user action interaction video is as follows: firstly, video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments; and then, respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on the three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
Video segmentation is the process of dividing the user action interaction video into a plurality of user action interaction fragments. First, the user action interaction video is preprocessed. This includes operations such as video decoding, frame extraction and inter-frame differencing: video decoding converts the video file into an image sequence, frame extraction takes key frames or frames at a fixed interval from the image sequence as the basis for segmentation, and inter-frame differencing can be used to detect changes in motion. Motion detection and segmentation are then performed on the preprocessed image sequence, which can be achieved by computer vision techniques such as background modeling, motion detection and object recognition. Motion detection helps determine the start and end frames of a user action, while segmentation splits the video into multiple user action interaction fragments. After the fragments are obtained, fragment screening and merging may be performed: fragment screening selects representative or important fragments according to criteria such as duration or action type, and fragment merging combines adjacent fragments into a more complete user action interaction sequence. Finally, the segmented and processed user action interaction fragments are output as a video file, an image sequence or another data format for subsequent feature extraction, analysis and application.
The purpose of the video slicing is to divide the user action interaction video into smaller segments in order to better analyze and process the user action behavior in each segment. Thus, finer action characteristics can be provided, and tasks such as action recognition, action control, context understanding and the like are further supported, so that more intelligent and natural digital human interaction experience is realized.
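A minimal sketch of the decoding, frame extraction and inter-frame differencing steps described above, assuming OpenCV as the video backend and a simple motion-energy threshold; the concrete detection, screening and merging criteria are left open by the application, so the values used here are placeholders.

```python
import cv2
import numpy as np

def segment_by_frame_difference(video_path, diff_threshold=12.0, min_len=8):
    """Split a user action interaction video into fragments wherever motion energy drops."""
    cap = cv2.VideoCapture(video_path)
    frames, motion = [], []
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            motion.append(float(np.mean(cv2.absdiff(gray, prev_gray))))  # inter-frame difference
        frames.append(frame)
        prev_gray = gray
    cap.release()

    fragments, current = [], frames[:1]
    for frame, energy in zip(frames[1:], motion):
        if energy < diff_threshold and len(current) >= min_len:
            fragments.append(current)   # a motion pause closes the current fragment
            current = []
        current.append(frame)
    if current:
        fragments.append(current)
    return fragments                    # list of frame lists; screening and merging are omitted
```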
Further, the three-dimensional convolutional neural network (3D CNN) is a deep learning model for processing video data, and on the basis of the traditional two-dimensional convolutional neural network, a convolutional operation of a time dimension is introduced, so that time sequence features in the video data can be effectively extracted. Unlike a two-dimensional convolutional neural network, 3D CNN performs a convolution operation on each time step of a video sequence by sliding a convolution kernel over the time dimension, taking into account information of the time dimension in the convolution operation. Thus, time sequence changes and dynamic characteristics in video data can be captured.
The basic structure of a 3D CNN is similar to that of a two-dimensional convolutional neural network, including convolutional layers, pooling layers and fully-connected layers. However, the convolution kernels in the convolutional layers of a 3D CNN slide in three dimensions, namely width, height and time, so that features in both the spatial and temporal dimensions can be considered. When processing the user action interaction fragments, an action time sequence feature extractor based on a three-dimensional convolutional neural network can be used. The feature extractor takes a user action interaction fragment as input and extracts the time sequence features of the fragment through a plurality of 3D convolutional and pooling layers. These time sequence features may describe motion patterns, spatial positions, posture changes and the like.
By using the action time sequence feature extractor based on the three-dimensional convolutional neural network, the time sequence feature graph can be effectively extracted from the user action interaction segment. The feature maps can capture dynamic changes and timing information of user actions, and provide useful inputs for subsequent tasks such as action recognition, context understanding, and action control.
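The PyTorch-style sketch below illustrates what such a three-dimensional-convolution action time sequence feature extractor could look like; the layer sizes, pooling schedule and output length are assumptions made for illustration rather than values taken from the application.

```python
import torch
import torch.nn as nn

class ActionTimingFeatureExtractor3D(nn.Module):
    """Maps a user action interaction fragment of shape (B, C, T, H, W)
    to a time sequence feature map of shape (B, 8, feat_dim)."""
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),  # kernels slide over T, H and W
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),                   # pool space, keep temporal resolution
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),                   # now pool the time dimension as well
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d((8, 1, 1)),                       # keep 8 temporal steps per fragment
        )

    def forward(self, clip):
        feat = self.backbone(clip)              # (B, feat_dim, 8, 1, 1)
        return feat.flatten(2).transpose(1, 2)  # (B, 8, feat_dim)
```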
In step 122, extracting the global semantic associations between the plurality of user action interaction time sequence feature diagrams to obtain the contextual action time sequence semantic feature vector includes: respectively unfolding the plurality of user action interaction time sequence feature diagrams into a plurality of user action interaction time sequence feature vectors; and passing the plurality of user action interaction time sequence feature vectors through a Transformer-based action context encoder to obtain the contextual action time sequence semantic feature vector.
In the application, the time sequence feature diagram is unfolded into the time sequence feature vector, so that the dimension of data can be reduced, and the features are more compact and easier to process. This helps to reduce computational and memory requirements and improves the efficiency of subsequent processing. Meanwhile, the unfolding time sequence characteristic diagram can also extract more representative characteristics, and key information of user actions can be captured better.
By means of the Transformer-based action context encoder, the plurality of user action interaction time sequence feature vectors can be encoded into the contextual action time sequence semantic feature vector. This encoding process captures the relationships and context information between user actions, thereby better representing the semantic meaning of the actions. Such feature vectors can be used for subsequent action recognition, generation and control tasks.
By obtaining the context action time sequence semantic feature vector, the context relation and semantic meaning of the user action can be better understood. This helps to improve the accuracy of motion recognition and provides guidance for more context consistency for subsequent motion generation. Based on the feature vectors, a generation model or a control algorithm can be designed to realize more intelligent and natural digital human action generation and interaction experience.
By performing context coding on the multiple user action interaction time sequence feature diagrams, action commonalities and change rules among different users can be captured. This facilitates action reasoning and generalization, i.e. learning general action patterns and rules from existing user actions, so that new user actions and interaction scenarios can be accommodated.
Unfolding the plurality of user action interaction time sequence feature diagrams into feature vectors and obtaining the contextual action time sequence semantic feature vector through the Transformer-based action context encoder improves the expressive power and semantic meaning of the features, which benefits tasks such as action understanding, generation and control.
Then, the global semantic associations between the plurality of user action interaction time sequence feature diagrams are extracted to obtain the contextual action time sequence semantic feature vector. Since feature extraction is performed independently on each video fragment, the interaction and exchange of time sequence features between fragments is ignored; the technical solution of the application therefore extracts the global semantic associations between the plurality of user action interaction time sequence feature diagrams to compensate for this missing interaction between features.
In a specific example of the present application, extracting the global semantic associations between the plurality of user action interaction time sequence feature diagrams to obtain the contextual action time sequence semantic feature vector is implemented as follows: firstly, the plurality of user action interaction time sequence feature diagrams are respectively unfolded into a plurality of user action interaction time sequence feature vectors; then, the plurality of user action interaction time sequence feature vectors are passed through a Transformer-based action context encoder to obtain the contextual action time sequence semantic feature vector.
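A minimal sketch of the Transformer-based action context encoder, assuming the per-fragment feature maps have already been unfolded into fixed-length vectors and using torch.nn.TransformerEncoder; the hyperparameters and the choice to concatenate the encoded fragment vectors into one contextual vector follow the description above but are otherwise illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActionContextEncoder(nn.Module):
    """Encodes a sequence of unfolded per-fragment time sequence feature vectors into
    the contextual action time sequence semantic feature vector."""
    def __init__(self, feat_dim=128, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           dim_feedforward=4 * feat_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, fragment_vectors):
        # fragment_vectors: (B, N, feat_dim), one row per user action interaction fragment
        encoded = self.encoder(fragment_vectors)  # global semantic association across fragments
        return encoded.flatten(1)                 # (B, N * feat_dim): concatenated contextual vector
```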
Specifically, step 130, generating the action control instruction for the digital person based on the contextual action time sequence semantic feature vector, includes: performing feature distribution optimization on the contextual action time sequence semantic feature vector to obtain an optimized contextual action time sequence semantic feature vector; passing the optimized contextual action time sequence semantic feature vector through a classifier to obtain a classification result, wherein the classification result is used to represent the operation intention label corresponding to the user action interaction video; and generating the action control instruction for the digital person based on the classification result.
In one embodiment of the present application, performing feature distribution optimization on the contextual action time sequence semantic feature vector to obtain the optimized contextual action time sequence semantic feature vector includes: cascading the plurality of user action interaction time sequence feature vectors to obtain a user action interaction time sequence cascade feature vector; and performing Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the contextual action time sequence semantic feature vector to obtain the optimized contextual action time sequence semantic feature vector.
In the technical solution of the application, when the contextual action time sequence semantic feature vector is obtained by the Transformer-based action context encoder, it can express the time sequence image-semantic context-associated features of the user action interaction time sequence feature vectors. However, when these time sequence context-associated features are extracted, the overall distribution of the contextual action time sequence semantic feature vector becomes unbalanced relative to the time sequence image-semantic features of the user action interaction extracted by the action time sequence feature extractor based on the three-dimensional convolutional neural network, which affects the expression of the specific action image-semantic features in the local time domain.
Here, considering that the contextual action time sequence semantic feature vector is substantially obtained by concatenating the plurality of user action interaction time sequence feature vectors output by the Transformer-based action context encoder, the contextual action time sequence semantic feature vector also conforms to the serialized arrangement of the local time-sequence-associated image semantic representations corresponding to the plurality of user action interaction time sequence feature vectors. Therefore, the applicant of the present application performs Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector, denoted V1 and obtained by cascading the plurality of user action interaction time sequence feature vectors, and the contextual action time sequence semantic feature vector, denoted V2, so as to obtain the optimized contextual action time sequence semantic feature vector, denoted V'. Specifically, the fusion is carried out by using an optimization formula defined in terms of the following quantities: cos(V1, V2) denotes the cosine similarity between V1 and V2; μ denotes the mean of the union formed by all the feature values of V1 and V2; V1 and V2 are both row vectors; ⊙ denotes position-wise multiplication; ⊕ denotes vector addition; V' is the optimized contextual action time sequence semantic feature vector; S1 is the set of feature values of all positions in V1, and S2 is the set of feature values of all positions in V2.
Here, by using the complete inner-product-space property of the Hilbert space, the collective average of the sequence aggregation of the user action interaction time sequence cascade feature vector V1 and the contextual action time sequence semantic feature vector V2 is used to explore sequence-based spatial distribution heuristics of V1 and V2 within the feature fusion space obtained through context-dependent encoding. The local feature distribution of the contextual action time sequence semantic feature vector is thereby converted into a sequence tracking instance in the fusion space, achieving tracklet-aware distribution equalization of its feature-space distribution and thus improving the expression of the contextual action time sequence semantic feature vector for the specific action image-semantic features in the local time domain.
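The exact optimization formula appears in the original application as a drawing and is not reproduced here; the sketch below merely combines the quantities defined above (cosine similarity, the mean of the union of feature values, position-wise multiplication and vector addition) in one plausible way. It is an assumption for illustration, not the application's formula.

```python
import torch
import torch.nn.functional as F

def sequence_tracking_equalization_fusion(v1, v2, eps=1e-8):
    """Hedged sketch: fuse the cascade vector v1 and the contextual vector v2
    (row vectors of equal length). The way the terms are combined is assumed."""
    cos = F.cosine_similarity(v1, v2, dim=-1)                    # cosine similarity cos(V1, V2)
    mu = torch.cat([v1, v2], dim=-1).mean(dim=-1, keepdim=True)  # mean over the union of feature values
    weighted = cos.unsqueeze(-1) * (v1 * v2)                     # position-wise multiplication, scaled
    return weighted / (mu + eps) + (v1 + v2)                     # vector addition as equalization term
```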
Further, the optimized contextual action time sequence semantic feature vector is passed through a classifier to obtain a classification result, wherein the classification result is used to represent the operation intention label corresponding to the user action interaction video; and the action control instruction for the digital person is generated based on the classification result.
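To make the final step concrete, here is a small sketch of the classifier and of the mapping from the predicted operation intention label to an action control instruction for the digital person; the label set and instruction names are hypothetical placeholders, since the application does not enumerate them.

```python
import torch
import torch.nn as nn

# Hypothetical label set; the application does not enumerate the operation intention labels.
INTENT_TO_INSTRUCTION = {0: "wave_hand", 1: "nod_head", 2: "raise_arm", 3: "idle"}

class IntentClassifier(nn.Module):
    def __init__(self, in_dim, num_intents=len(INTENT_TO_INSTRUCTION)):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                  nn.Linear(256, num_intents))

    def forward(self, optimized_context_vector):
        return self.head(optimized_context_vector)  # logits over operation intention labels

def generate_control_instruction(classifier, optimized_context_vector):
    """Single-sample helper: optimized_context_vector has shape (in_dim,)."""
    logits = classifier(optimized_context_vector)
    label = int(logits.argmax(dim=-1))
    return INTENT_TO_INSTRUCTION[label]             # the digital person's action control instruction
```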
In summary, the digital person action control method 100 according to the embodiment of the present application has been illustrated. It uses intelligent algorithms together with a camera to collect the user action interaction video, extracts time sequence feature information about the actions from it, and realizes action control of the digital person based on the user's gestures.
Fig. 4 is a block diagram of a digital human motion control system provided in an embodiment of the present application. As shown in fig. 4, the motion control system of the digital person includes: a video acquisition module 210, configured to acquire a user action interactive video acquired by the camera; the feature extraction module 220 is configured to perform feature extraction on the user action interaction video to obtain a context action time sequence semantic feature vector; and a control instruction generation module 230 for generating an action control instruction for the digital person based on the contextual action timing semantic feature vector.
Specifically, in the motion control system of the digital person, the feature extraction module includes: the time sequence feature extraction unit is used for extracting time sequence features of multiple areas of the user action interaction video to obtain a plurality of user action interaction time sequence feature diagrams; and the global semantic association unit is used for extracting global semantic association among the plurality of user action interaction time sequence feature graphs to obtain the context action time sequence semantic feature vector.
Specifically, in the motion control system of the digital person, the timing characteristic extraction unit is configured to: video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments; and respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on a three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
It will be appreciated by those skilled in the art that the specific operation of the individual steps in the above-described digital person's motion control system has been described in detail in the above description of the digital person's motion control method with reference to fig. 1 to 3, and thus, repetitive description thereof will be omitted.
As described above, the motion control system 100 of a digital person according to an embodiment of the present application can be implemented in various terminal devices, such as a server or the like for motion control of a digital person. In one example, the digital person's motion control system 100 according to embodiments of the present application may be integrated into the terminal device as a software module and/or hardware module. For example, the digital person's motion control system 100 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the digital person's motion control system 100 could equally be one of the many hardware modules of the terminal device.
Alternatively, in another example, the digital person's motion control system 100 and the terminal device may be separate devices, and the digital person's motion control system 100 may be connected to the terminal device through a wired and/or wireless network and transmit interaction information in an agreed data format.
Fig. 5 is an application scenario diagram of the digital person action control method provided in the embodiment of the present application. As shown in fig. 5, in this application scenario, first, the user action interaction video acquired by the camera is obtained (e.g., C as illustrated in fig. 5); the acquired user action interaction video is then input into a server (e.g., S as illustrated in fig. 5) on which the digital person action control algorithm is deployed, and the server processes the user action interaction video based on the digital person action control algorithm to generate the action control instruction for the digital person.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the application and is not intended to limit the scope of the application to the particular embodiments disclosed. Any modifications, equivalents, improvements and the like that fall within the spirit and principles of the application are intended to be included within the scope of the application.
Claims (10)
1. A method for controlling motion of a digital person, comprising: acquiring user action interaction videos acquired by a camera; extracting features of the user action interaction video to obtain a context action time sequence semantic feature vector; and generating an action control instruction for the digital person based on the contextual action timing semantic feature vector.
2. The method of claim 1, wherein feature extraction is performed on the user action interactive video to obtain a contextual action timing semantic feature vector, comprising: extracting time sequence characteristics of multiple areas of the user action interaction video to obtain multiple user action interaction time sequence characteristic diagrams; and extracting global semantic associations among the plurality of user action interaction time sequence feature diagrams to obtain the context action time sequence semantic feature vectors.
3. The method for controlling actions of a digital person according to claim 2, wherein performing multi-region time series feature extraction on the user action interactive video to obtain a plurality of user action interactive time series feature diagrams comprises: video segmentation is carried out on the user action interaction video to obtain a plurality of user action interaction fragments; and respectively passing the plurality of user action interaction fragments through an action time sequence feature extractor based on a three-dimensional convolutional neural network to obtain a plurality of user action interaction time sequence feature diagrams.
4. The method of claim 3, wherein extracting global semantic associations between the plurality of user action interaction timing feature graphs to obtain the contextual action timing semantic feature vector comprises: respectively expanding the plurality of user action interaction timing feature diagrams into a plurality of user action interaction timing feature vectors; and passing the plurality of user action interaction timing feature vectors through a Transformer-based action context encoder to obtain the contextual action timing semantic feature vector.
5. The method of motion control for a digital person according to claim 4, wherein generating motion control instructions for a digital person based on the contextual motion timing semantic feature vector comprises: performing feature distribution optimization on the context action time sequence semantic feature vector to obtain an optimized context action time sequence semantic feature vector; the semantic feature vector of the optimized context action time sequence passes through a classifier to obtain a classification result, wherein the classification result is used for representing an operation intention label corresponding to the user action interaction video; and generating the action control instruction for the digital person based on the classification result.
6. The method of claim 5, wherein performing feature distribution optimization on the contextual action timing semantic feature vector to obtain an optimized contextual action timing semantic feature vector, comprises: cascading the plurality of user action interaction time sequence feature vectors to obtain user action interaction time sequence cascading feature vectors; and carrying out Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the context action time sequence semantic feature vector to obtain the optimized context action time sequence semantic feature vector.
7. The method according to claim 6, wherein performing Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the contextual action time sequence semantic feature vector to obtain the optimized contextual action time sequence semantic feature vector comprises: performing Hilbert space heuristic sequence tracking equalization fusion on the user action interaction time sequence cascade feature vector and the contextual action time sequence semantic feature vector by using an optimization formula to obtain the optimized contextual action time sequence semantic feature vector; wherein the optimization formula is defined in terms of the following quantities: V1 is the user action interaction time sequence cascade feature vector; V2 is the contextual action time sequence semantic feature vector; cos(V1, V2) denotes the cosine similarity between V1 and V2; μ denotes the mean of the union formed by all the feature values of V1 and V2; V1 and V2 are both row vectors; ⊙ denotes position-wise multiplication; ⊕ denotes vector addition; V' is the optimized contextual action time sequence semantic feature vector; S1 is the set of feature values of all positions in the user action interaction time sequence cascade feature vector, and S2 is the set of feature values of all positions in the contextual action time sequence semantic feature vector.
8. A digital human motion control system, comprising: the video acquisition module is used for acquiring user action interaction videos acquired by the camera; the feature extraction module is used for extracting features of the user action interaction video to obtain context action time sequence semantic feature vectors; and the control instruction generation module is used for generating an action control instruction for the digital person based on the context action time sequence semantic feature vector.
9. The digital person action control system according to claim 8, wherein the feature extraction module comprises: a timing feature extraction unit configured to perform multi-region timing feature extraction on the user action interaction video to obtain a plurality of user action interaction timing feature maps; and a global semantic association unit configured to extract a global semantic association among the plurality of user action interaction timing feature maps to obtain the contextual action timing semantic feature vector.
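The claim does not name a specific mechanism for the global semantic association; a self-attention encoder over the per-region timing feature maps is one plausible reading. A sketch under that assumption, with arbitrary dimensions and layer counts:

```python
# Assumed realization of the claim-9 global semantic association unit: each regional
# timing feature map becomes one token, and a transformer encoder relates the tokens.
import torch
import torch.nn as nn

class GlobalSemanticAssociation(nn.Module):
    def __init__(self, map_dim: int, model_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(map_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, feature_maps: list[torch.Tensor]) -> torch.Tensor:
        # One flattened token per user action interaction timing feature map.
        tokens = torch.stack([m.flatten() for m in feature_maps]).unsqueeze(0)  # (1, N, map_dim)
        encoded = self.encoder(self.proj(tokens))                                # (1, N, model_dim)
        return encoded.mean(dim=1).squeeze(0)  # contextual action timing semantic feature vector
```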
10. The digital person action control system according to claim 9, wherein the timing feature extraction unit is configured to: perform video segmentation on the user action interaction video to obtain a plurality of user action interaction segments; and pass each of the plurality of user action interaction segments through an action timing feature extractor based on a three-dimensional convolutional neural network to obtain the plurality of user action interaction timing feature maps.
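Claim 10 does name a three-dimensional convolutional neural network; everything else in the sketch below (clip length, kernel sizes, channel counts) is an assumption.

```python
# Sketch of claim 10: segment the interaction video into clips and pass each clip
# through a small 3D-CNN timing feature extractor; hyperparameters are assumptions.
import torch
import torch.nn as nn

def segment_video(video: torch.Tensor, clip_len: int = 16) -> list[torch.Tensor]:
    # video: (C, T, H, W); split along time into fixed-length user action interaction segments
    return [video[:, t:t + clip_len] for t in range(0, video.shape[1] - clip_len + 1, clip_len)]

class ActionTimingExtractor3D(nn.Module):
    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, out_channels, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (C, T, H, W) -> user action interaction timing feature map (out_channels, T', H', W')
        return self.net(clip.unsqueeze(0)).squeeze(0)

video = torch.randn(3, 64, 112, 112)                      # toy stand-in for the captured video
extractor = ActionTimingExtractor3D()
feature_maps = [extractor(clip) for clip in segment_video(video)]
```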
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311144896.6A CN117152843B (en) | 2023-09-06 | 2023-09-06 | Digital person action control method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117152843A true CN117152843A (en) | 2023-12-01 |
CN117152843B CN117152843B (en) | 2024-05-07 |
Family
ID=88905833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311144896.6A Active CN117152843B (en) | 2023-09-06 | 2023-09-06 | Digital person action control method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117152843B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230082830A1 (en) * | 2020-05-18 | 2023-03-16 | Beijing Sogou Technology Development Co., Ltd. | Method and apparatus for driving digital human, and electronic device |
CN112669846A (en) * | 2021-03-16 | 2021-04-16 | 深圳追一科技有限公司 | Interactive system, method, device, electronic equipment and storage medium |
CN115455136A (en) * | 2022-03-02 | 2022-12-09 | 杭州摸象大数据科技有限公司 | Intelligent digital human marketing interaction method and device, computer equipment and storage medium |
CN116071817A (en) * | 2022-10-25 | 2023-05-05 | 中国矿业大学 | Network architecture and training method of gesture recognition system for automobile cabin |
CN115761813A (en) * | 2022-12-13 | 2023-03-07 | 浙大城市学院 | Intelligent control system and method based on big data analysis |
CN116485960A (en) * | 2023-04-23 | 2023-07-25 | 中国建设银行股份有限公司 | Digital man driving method and device |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118585066A (en) * | 2024-06-05 | 2024-09-03 | 浙江大丰数艺科技有限公司 | Portable space positioning remote sensing interaction control system applied to immersion exhibition |
CN118535021A (en) * | 2024-07-19 | 2024-08-23 | 长春职业技术学院 | Immersive simulation training system for wind power generation equipment based on virtual reality |
CN118535021B (en) * | 2024-07-19 | 2024-09-27 | 长春职业技术学院 | Immersive simulation training system for wind power generation equipment based on virtual reality |
Also Published As
Publication number | Publication date |
---|---|
CN117152843B (en) | 2024-05-07 |
Similar Documents
Publication | Title | Publication Date
---|---|---|
US11928592B2 (en) | Visual sign language translation training device and method | |
Deng et al. | cGAN based facial expression recognition for human-robot interaction | |
US12039454B2 (en) | Microexpression-based image recognition method and apparatus, and related device | |
Mao et al. | Using Kinect for real-time emotion recognition via facial expressions | |
CN117152843B (en) | Digital person action control method and system | |
Ersotelos et al. | Building highly realistic facial modeling and animation: a survey | |
CN108363973B (en) | Unconstrained 3D expression migration method | |
Ludl et al. | Enhancing data-driven algorithms for human pose estimation and action recognition through simulation | |
WO2023284435A1 (en) | Method and apparatus for generating animation | |
US11282257B2 (en) | Pose selection and animation of characters using video data and training techniques | |
US20220398796A1 (en) | Enhanced system for generation of facial models and animation | |
Kowalski et al. | Holoface: Augmenting human-to-human interactions on hololens | |
Escobedo et al. | Dynamic sign language recognition based on convolutional neural networks and texture maps | |
Kwolek et al. | Recognition of JSL fingerspelling using deep convolutional neural networks | |
Ekmen et al. | From 2D to 3D real-time expression transfer for facial animation | |
WO2024066549A1 (en) | Data processing method and related device | |
US20240020901A1 (en) | Method and application for animating computer generated images | |
Usman et al. | Skeleton-based motion prediction: A survey | |
Cai et al. | An automatic music-driven folk dance movements generation method based on sequence-to-sequence network | |
Thalmann et al. | Direct face-to-face communication between real and virtual humans | |
Bevacqua et al. | Multimodal sensing, interpretation and copying of movements by a virtual agent | |
Sun et al. | Generation of virtual digital human for customer service industry | |
Chan et al. | A generic framework for editing and synthesizing multimodal data with relative emotion strength | |
Lin et al. | Emotional Semantic Neural Radiance Fields for Audio-Driven Talking Head | |
Kumar Das et al. | Audio driven artificial video face synthesis using gan and machine learning approaches |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: Building 60, 1st Floor, No.7 Jiuxianqiao North Road, Chaoyang District, Beijing 021
Patentee after: Shiyou (Beijing) Technology Co.,Ltd.
Country or region after: China
Address before: 4017, 4th Floor, Building 2, No.17 Ritan North Road, Chaoyang District, Beijing
Patentee before: 4U (BEIJING) TECHNOLOGY CO.,LTD.
Country or region before: China