CN117528197B - High-frame-rate playback type quick virtual film making system

High-frame-rate playback type quick virtual film making system

Info

Publication number
CN117528197B
CN117528197B (application CN202410022337.6A)
Authority
CN
China
Prior art keywords
sequence
user
semantic
recognition result
feature vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410022337.6A
Other languages
Chinese (zh)
Other versions
CN117528197A (en)
Inventor
王晓燕
王璇
刘松
武世杰
朱飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tiangong Color Television Technology Co ltd
Original Assignee
Beijing Tiangong Color Television Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tiangong Color Television Technology Co ltd filed Critical Beijing Tiangong Color Television Technology Co ltd
Priority to CN202410022337.6A priority Critical patent/CN117528197B/en
Publication of CN117528197A publication Critical patent/CN117528197A/en
Application granted granted Critical
Publication of CN117528197B publication Critical patent/CN117528197B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Social Psychology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a high-frame-rate playback type rapid virtual production system and relates to the field of virtual production. The system first acquires a user action video captured by a camera and user recorded voice captured by a recording device. Semantic features of the user recorded voice are extracted to obtain a sequence of user voice text recognition result word granularity semantic feature vectors, and action semantic features of the user action video are analyzed to obtain a sequence of user action semantic coding feature vectors. The two sequences are then fused across modalities to obtain a sequence of action-voice interaction fusion feature vectors, from which an animated character virtual video is generated. In this way, a rapid virtual production process is realized with real-time preview capability, giving film and television creators more creative space and modes of expression.

Description

High-frame-rate playback type quick virtual film making system
Technical Field
The present application relates to the field of virtual production, and more particularly, to a high frame rate playback type fast virtual production system.
Background
With the development of the film industry, virtual production technology is attracting more and more attention. Virtual production is a method of producing film and television using computer-generated virtual environments and characters. It creates virtual scenes, characters and special effects with computer graphics and animation technology and simulates the real shooting and production process to achieve the goal of film and television making. Virtual production technology can provide film and television creators with more creative space and modes of expression while reducing production cost and time.
High frame rate playback refers to playing recorded video content at a high frame rate. In movie production, playback at high frame rates is important for live preview and adjustment. By playing back recorded content at a high frame rate, producers can more clearly observe motion details, camera effects and special effects presentations, and thus make better real-time adjustments and decisions. Conventional production methods may have problems when handling large-scale three-dimensional scenes or complex special effects; for example, conventional production systems may not be able to process such scenes or effects in real time or at high frame rates, resulting in delays and stutters in the production process and reducing production efficiency.
Thus, an optimized high frame rate playback type fast virtual production system is desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiment of the application provides a high-frame-rate playback type quick virtual production system, which can analyze user creation intention from user action videos and user recorded voices by utilizing computer graphics, voice recognition and animation technologies, so that animation role parameters are intelligently generated, and are mapped to animation roles, so that a quick virtual production process is realized, real-time preview capability is provided, and more creative spaces and expression modes are provided for film and television creators.
According to one aspect of the present application, there is provided a high frame rate playback type fast virtual production system comprising:
the data acquisition module is used for acquiring user action videos captured by the camera and user recording voice captured by the recording equipment;
the semantic feature extraction module is used for extracting semantic features of the user recorded voice to obtain a sequence of semantic feature vectors of the user voice text recognition result word granularity;
the action semantic feature analysis module is used for analyzing action semantic features of the user action video to obtain a sequence of user action semantic coding feature vectors;
the cross-modal fusion module is used for carrying out cross-modal fusion on the sequence of the user voice text recognition result word granularity semantic feature vector and the sequence of the user action semantic coding feature vector so as to obtain a sequence of action-voice interaction fusion feature vector;
the cross-modal fusion module comprises a cross-modal fusion unit and is used for processing the sequence of the user action semantic coding feature vector and the sequence of the user voice text recognition result word granularity semantic feature vector by using the cross-modal bidirectional interaction fusion module so as to obtain the sequence of the action-voice interaction fusion feature vector;
and the generation module is used for generating the animated character virtual video based on the sequence of the action-voice interaction fusion feature vectors.
Compared with the prior art, the high-frame-rate playback type quick virtual production system provided by the application is characterized in that firstly, a user action video captured by a camera and a user recording voice captured by recording equipment are obtained, then, semantic features of the user recording voice are extracted to obtain a sequence of user voice text recognition result word granularity semantic feature vectors, then, the action semantic features of the user action video are analyzed to obtain a sequence of user action semantic coding feature vectors, then, cross-modal fusion is carried out on the sequence of the user voice text recognition result word granularity semantic feature vectors and the sequence of the user action semantic coding feature vectors to obtain a sequence of action-voice interaction fusion feature vectors, and finally, an animated character virtual video is generated based on the sequence of the action-voice interaction fusion feature vectors. Thus, a quick virtual film making process can be realized, and real-time preview capability is provided, so that more creative space and expression modes are provided for film and television creators.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for describing the embodiments are briefly introduced below. The drawings are not necessarily drawn to scale; the emphasis is on illustrating the gist of the present application.
Fig. 1 is a block diagram schematic of a high frame rate playback fast virtual production system according to an embodiment of the present application.
Fig. 2 is a flowchart of a high frame rate playback type fast virtual production method according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a system architecture of a high frame rate playback type fast virtual production method according to an embodiment of the present application.
Fig. 4 is an application scenario diagram of a high frame rate playback type fast virtual production system according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a high frame rate playback type fast virtual production system according to another embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present application without making any inventive effort, are also within the scope of the present application.
As used in this application and in the claims, the terms "a," "an," and "the" do not refer only to the singular but may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included, and they do not constitute an exclusive list; a method or apparatus may also include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
Flowcharts are used in this application to describe the operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
To address the above technical problems, the technical concept of the present application is to use computer graphics, speech recognition and animation technology to analyze the user's creative intent from the user action video and the user recorded voice, so as to intelligently generate animated character parameters and map them to animated characters, thereby realizing a rapid virtual production process and providing real-time preview capability, and thus offering film and television creators more creative space and modes of expression.
Based on this, fig. 1 is a block diagram schematic diagram of a high frame rate playback type fast virtual production system according to an embodiment of the present application. As shown in fig. 1, a high frame rate playback type fast virtual production system 100 according to an embodiment of the present application includes: a data acquisition module 110 for acquiring user motion video captured by the camera and user recorded voice captured by the recording device; the semantic feature extraction module 120 is configured to extract semantic features of the user recorded voice to obtain a sequence of semantic feature vectors of the user voice text recognition result word granularity; the action semantic feature analysis module 130 is configured to analyze action semantic features of the user action video to obtain a sequence of user action semantic coding feature vectors; the cross-modal fusion module 140 is configured to perform cross-modal fusion on the sequence of the user voice text recognition result word granularity semantic feature vector and the sequence of the user action semantic coding feature vector to obtain a sequence of action-voice interaction fusion feature vector; and a generation module 150 for generating an animated character virtual video based on the sequence of motion-voice interaction fusion feature vectors.
Specifically, in the technical scheme of the application, a user action video captured by a camera and user recorded voice captured by a recording device are first obtained. The user action video records the actions and gestures that the user desires to realize in the virtual environment and can provide important clues about the user's authoring intent. The user recorded voice records information such as the user's interpretation of the animated characters, descriptions of scenes and opinions expressed during the creation process.
It should be appreciated that by analyzing the user action video, the user's requirements for the action style, gesture selection, expression changes and the like of the animated character can be captured. Such information reflects the user's expectations for scenes, characters and animation effects. For example, a user's actions may express a certain emotion or action style, and by analyzing and understanding these actions, the emotional atmosphere or character image that the user wishes to present in the virtual production can be inferred. In addition, the user recorded voice contains rich semantic information and can provide the user's descriptions, requirements and opinions regarding scenes, characters and special effects. That is, the user recorded voice reflects information such as the user's emotional tendencies, which helps in understanding the user's expectations regarding creative details, atmosphere and modes of expression. In order to capture this information, in the technical scheme of the application, voice recognition is performed on the user recorded voice to obtain a user voice text recognition result, and the user voice text recognition result is passed through a semantic encoder to obtain a sequence of user voice text recognition result word granularity semantic feature vectors. Meanwhile, discrete sampling is performed on the user action video to obtain a sequence of user action key frames, and the sequence of user action key frames is passed through a user action semantic understanding device based on a convolutional neural network model to obtain a sequence of user action semantic coding feature vectors.
Accordingly, the semantic feature extraction module 120 includes: the voice recognition unit is used for carrying out voice recognition on the voice recorded by the user so as to obtain a user voice text recognition result; and the semantic coding unit is used for enabling the user voice text recognition result to pass through a semantic encoder to obtain a sequence of the user voice text recognition result word granularity semantic feature vector.
Wherein, the semantic coding unit includes: a dividing subunit, configured to divide the user voice text recognition result at word granularity to obtain a sequence of user voice text words; a word embedding coding subunit, configured to pass the sequence of user voice text words through a word embedding layer to obtain a sequence of user voice text word embedding vectors; and a context semantic association coding subunit, configured to pass the sequence of user voice text word embedding vectors through a converter-based user voice text context semantic association encoder to obtain the sequence of user voice text recognition result word granularity semantic feature vectors.
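As an illustration of this pipeline, the following sketch shows one plausible implementation of the semantic coding unit: the recognized text is segmented into word-granularity tokens, passed through a word embedding layer, and then through a converter (Transformer) based context semantic encoder. The vocabulary size, embedding width, layer count and the use of PyTorch are assumptions made for the example, not details specified by the application.

# A minimal sketch, under the assumptions stated above, of the semantic coding unit:
# word-granularity tokens -> word embeddings -> Transformer context semantic encoder.
import torch
import torch.nn as nn

class TextSemanticEncoder(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=256, num_layers=2, num_heads=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # word embedding layer
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(encoder_layer, num_layers)

    def forward(self, token_ids):                                      # (batch, num_words)
        word_vectors = self.embedding(token_ids)                       # (batch, num_words, embed_dim)
        return self.context_encoder(word_vectors)                      # word-granularity semantic feature vectors

# Usage: token_ids would come from word-granularity segmentation of the speech recognition output.
token_ids = torch.randint(0, 10000, (1, 12))
text_features = TextSemanticEncoder()(token_ids)   # (1, 12, 256)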
Accordingly, the action semantic feature analysis module 130 includes: the discrete sampling unit is used for performing discrete sampling on the user action video to obtain a sequence of user action key frames; and the user action semantic understanding unit is used for enabling the sequence of the user action key frames to pass through a user action semantic understanding device based on a convolutional neural network model to obtain the sequence of the user action semantic coding feature vectors.
The user action semantic comprehension device based on the convolutional neural network model comprises an input layer, a convolutional layer, an activation layer, a pooling layer and an output layer.
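The sketch below illustrates one plausible form of this action branch: key frames are discretely sampled from the action video and encoded by a convolutional network with input, convolution, activation, pooling and output (projection) stages. The sampling stride, channel widths and output dimension are illustrative assumptions rather than values fixed by the application.

# A minimal sketch, under assumed shapes, of discrete key-frame sampling followed by a
# convolutional user action semantic encoder.
import torch
import torch.nn as nn

def sample_key_frames(video, stride=8):
    """video: (num_frames, 3, H, W); keep every `stride`-th frame as a key frame (assumed policy)."""
    return video[::stride]

class ActionSemanticEncoder(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # convolution + activation + pooling
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.output = nn.Linear(64, feature_dim)                                       # output layer

    def forward(self, key_frames):                       # (num_key_frames, 3, H, W)
        x = self.backbone(key_frames).flatten(1)          # (num_key_frames, 64)
        return self.output(x)                             # sequence of action semantic coding feature vectors

video = torch.randn(64, 3, 128, 128)
action_features = ActionSemanticEncoder()(sample_key_frames(video))   # (8, 256)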
Further, a cross-modal bidirectional interaction fusion module is used to process the sequence of user action semantic coding feature vectors and the sequence of user voice text recognition result word granularity semantic feature vectors to obtain the sequence of action-voice interaction fusion feature vectors. That is, the cross-modal bidirectional interaction fusion module is utilized to comprehensively analyze the action semantic features expressed by the user action video and the text semantic features expressed by the user recorded voice, so as to cross-verify and interactively fuse the two kinds of information. For example, by correlating user actions with speech, the user's intent and requirements at a particular action can be inferred. Meanwhile, the user's authoring intent and requirements can be further understood by matching and analyzing keywords or emotions that co-occur in the user's actions and voice. In this way, the user's authoring intent and requirements are obtained from different dimensions as important cues, and the virtual production process is guided by these cues to generate animated character virtual videos that conform to the user's expectations.
In a specific example of the present application, a process for encoding a sequence of the user action key frames and a sequence of the user voice text recognition result word granularity semantic feature vectors to obtain a sequence of action-voice interaction fusion feature vectors using a cross-modal bi-directional interaction fusion module includes: firstly, calculating the correlation degree between each user action semantic coding feature vector in the sequence of the user action semantic coding feature vectors and each user voice text recognition result word granularity semantic feature vector in the sequence of the user voice text recognition result word granularity semantic feature vectors; then, based on the correlation degree between each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors and all user voice text recognition result word granularity semantic feature vectors in the sequence of user voice text recognition result word granularity semantic feature vectors, carrying out interactive updating on each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors so as to obtain a sequence of updated user action semantic coding feature vectors; meanwhile, based on the correlation degree between each user voice text recognition result word granularity semantic feature vector in the sequence of the user voice text recognition result word granularity semantic feature vectors and all user action semantic coding feature vectors in the sequence of the user action semantic coding feature vectors, interactive updating is carried out on each user voice text recognition result word granularity semantic feature vector in the sequence of the user voice text recognition result word granularity semantic feature vectors so as to obtain a sequence of updated user voice text recognition result word granularity semantic feature vectors; then, fusing the sequence of the user action semantic coding feature vectors and the sequence of the updated user action semantic coding feature vectors to obtain a sequence of interactive fused user action semantic coding feature vectors; meanwhile, fusing the sequence of the semantic feature vectors with the granularity of the words of the user voice text recognition result and the sequence of the semantic feature vectors with the granularity of the words of the updated user voice text recognition result to obtain the sequence of the semantic feature vectors with the granularity of the words of the interactive fusion user voice text recognition result; and multiplying the sequence of the interaction fusion user action semantic coding feature vector and the sequence of the interaction fusion user voice text recognition result word granularity semantic feature vector according to position points to obtain the sequence of the action-voice interaction fusion feature vector.
Here, the internal relation and the association relation of the user action key frame and the corresponding local feature vector in the sequence of the user voice text recognition result word granularity semantic feature vector are highlighted by calculating the correlation degree between the two, so that bidirectional updating fusion and interaction are carried out. That is, the dependency relationship of the two directions is considered simultaneously, so that the risk of information loss caused by unidirectional fusion is avoided, and the fused sequence of the action-voice interaction fusion feature vector has more excellent feature expression capability.
Accordingly, the cross-modality fusion module 140 includes: and the cross-modal fusion unit is used for processing the sequence of the user action semantic coding feature vectors and the sequence of the user voice text recognition result word granularity semantic feature vectors by using a cross-modal bidirectional interaction fusion module so as to obtain the sequence of the action-voice interaction fusion feature vectors.
Specifically, the cross-modal fusion unit includes: a correlation calculation subunit, configured to calculate the correlation between each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors and each user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors; a user action interactive updating subunit, configured to interactively update each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors based on the correlation between each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors and all user voice text recognition result word granularity semantic feature vectors in the sequence of user voice text recognition result word granularity semantic feature vectors, so as to obtain a sequence of updated user action semantic coding feature vectors; a user voice text interactive updating subunit, configured to interactively update each user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors based on the correlation between each user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors and all user action semantic coding feature vectors in the sequence of user action semantic coding feature vectors, so as to obtain a sequence of updated user voice text recognition result word granularity semantic feature vectors; a user action fusion subunit, configured to fuse the sequence of user action semantic coding feature vectors and the sequence of updated user action semantic coding feature vectors to obtain a sequence of interactively fused user action semantic coding feature vectors; a user voice text fusion subunit, configured to fuse the sequence of user voice text recognition result word granularity semantic feature vectors and the sequence of updated user voice text recognition result word granularity semantic feature vectors to obtain a sequence of interactively fused user voice text recognition result word granularity semantic feature vectors; and a point multiplication subunit, configured to perform position-wise point multiplication on the sequence of interactively fused user action semantic coding feature vectors and the sequence of interactively fused user voice text recognition result word granularity semantic feature vectors to obtain the sequence of action-voice interaction fusion feature vectors.
Wherein, the correlation computation subunit is configured to: calculating the correlation between each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors and each user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors according to the following correlation formula; wherein, the correlation formula is:
$r_{i,j} = \left(V^{a}_{i}\right)^{\top} V^{t}_{j}$

wherein $r_{i,j}$ represents the correlation between the $i$-th user action semantic coding feature vector in the sequence of user action semantic coding feature vectors and the $j$-th user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors, $V^{a}_{i}$ represents the $i$-th user action semantic coding feature vector in the sequence of user action semantic coding feature vectors, $V^{t}_{j}$ represents the $j$-th user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors, and $(\cdot)^{\top}$ represents the transpose operation.
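The sketch below illustrates this bidirectional interaction under simplifying assumptions: the correlation is taken as the inner product reconstructed above, the interactive update is a correlation-weighted (softmax) mix of the other modality, the fusion of original and updated vectors is taken as simple addition, and the two sequences are truncated to a common length before the final position-wise product. None of these specific choices is mandated by the application; they are one plausible reading of the described steps.

# A sketch of the cross-modal bidirectional interaction fusion under the assumptions above.
import torch
import torch.nn.functional as F

def cross_modal_fusion(action_seq, text_seq):
    # action_seq: (Na, D) user action semantic coding feature vectors
    # text_seq:   (Nt, D) user voice text word-granularity semantic feature vectors
    corr = action_seq @ text_seq.t()                           # r_ij = (V_i^a)^T V_j^t, shape (Na, Nt)
    # interactive update: each vector becomes a correlation-weighted mix of the other modality
    updated_action = F.softmax(corr, dim=1) @ text_seq         # (Na, D)
    updated_text = F.softmax(corr.t(), dim=1) @ action_seq     # (Nt, D)
    fused_action = action_seq + updated_action                 # interaction-fused action vectors
    fused_text = text_seq + updated_text                       # interaction-fused text vectors
    n = min(fused_action.shape[0], fused_text.shape[0])        # align lengths for the position-wise product
    return fused_action[:n] * fused_text[:n]                   # action-voice interaction fusion feature vectors

fusion_seq = cross_modal_fusion(torch.randn(8, 256), torch.randn(12, 256))   # (8, 256)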
Then, the sequence of the motion-voice interaction fusion feature vector passes through an animation role parameter generator based on a decoder to obtain a sequence of animation role parameters; and mapping the sequence of animated character parameters to the animated character to generate an animated character virtual video. In this way, the user's authoring intent and requirements are translated into parameters for a particular animated character, thereby generating a corresponding animated character virtual video.
More specifically, the decoder-based animated character parameter generator takes the sequence of action-voice interaction fusion feature vectors as input and generates a corresponding sequence of animated character parameters. The animated character parameter generator learns a mapping relationship from each action-voice interaction fusion feature vector to animated character parameters. In particular, during training, labelled animated character parameters are typically used as targets, and the weight parameters of the generator model are optimized by minimizing the difference between the generated results and the targets. The generated sequence of animated character parameters is then applied to the animated character model to generate a corresponding sequence of animated character poses and actions based on the parameters of each time step. This can be accomplished by feeding the parameters into a skeletal controller or animation generator of the animated character model; such models typically use physical simulation or key-frame interpolation techniques to generate realistic animated character action sequences from the parameters.
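A minimal sketch of such a decoder-based generator is given below. The recurrent decoder, the parameter dimensionality (for example, 72 joint-rotation values) and the MSE objective against labelled parameters are assumptions chosen for illustration; the application does not fix a particular decoder architecture.

# A sketch of a decoder-based animated character parameter generator and its training step,
# under the assumptions stated in the lead-in.
import torch
import torch.nn as nn

class CharacterParamDecoder(nn.Module):
    def __init__(self, feature_dim=256, hidden_dim=256, param_dim=72):
        super().__init__()
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)   # recurrent decoder (assumed)
        self.head = nn.Linear(hidden_dim, param_dim)

    def forward(self, fusion_seq):                      # (batch, T, feature_dim)
        hidden, _ = self.rnn(fusion_seq)
        return self.head(hidden)                        # (batch, T, param_dim) character parameters per time step

decoder = CharacterParamDecoder()
fusion_seq = torch.randn(1, 8, 256)
pred_params = decoder(fusion_seq)
target_params = torch.randn(1, 8, 72)                   # labelled parameters used as training targets
loss = nn.functional.mse_loss(pred_params, target_params)   # minimise the difference to the targets
loss.backward()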
Accordingly, the generating module 150 includes: the characteristic distribution correction unit is used for carrying out characteristic distribution correction on the sequence of the action-voice interaction fusion characteristic vector so as to obtain a corrected sequence of the action-voice interaction fusion characteristic vector; a decoding unit, configured to pass the corrected sequence of motion-voice interaction fusion feature vectors through a decoder-based animated character parameter generator to obtain a sequence of animated character parameters; and a mapping generation unit for mapping the sequence of animated character parameters to animated characters to generate the animated character virtual video.
In the above technical solution, the sequence of user action semantic coding feature vectors and the sequence of user voice text recognition result word granularity semantic feature vectors respectively express the image semantic features of the user action key frames and the encoded text semantic features of the user voice text recognition result. When these two sequences are processed by the cross-modal bidirectional interaction fusion module, the cross-modal semantic feature difference between them may cause the bidirectional cross-modal interaction of semantic features to become sparse, which affects the expression effect of the obtained sequence of action-voice interaction fusion feature vectors. It is therefore desirable to perform feature-wise interaction optimization based on the feature expression significance and key features of the two sequences, thereby improving the expression effect of the sequence of action-voice interaction fusion feature vectors. Based on this, the applicant of the present application corrects the sequence of user action semantic coding feature vectors and the sequence of user voice text recognition result word granularity semantic feature vectors.
Accordingly, in one example, the feature distribution correction unit is configured to: performing feature distribution correction on the sequence of the action-voice interaction fusion feature vectors by using the following correction formula to obtain corrected feature vectors; wherein, the correction formula is:
$V_c = \alpha \cdot \frac{1}{\max(V_1)} \odot \sqrt{V_1} \;\ominus\; \beta \cdot \frac{1}{\max(V_2)} \odot \sqrt{V_2}$

wherein $V_1$ is the first cascade feature vector obtained by cascading the sequence of user action semantic coding feature vectors, $V_2$ is the second cascade feature vector obtained by cascading the sequence of user voice text recognition result word granularity semantic feature vectors, $\sqrt{\cdot}$ represents the position-wise square root of a feature vector, $\frac{1}{\max(V_1)}$ and $\frac{1}{\max(V_2)}$ are the reciprocals of the maximum feature values of $V_1$ and $V_2$, $\alpha$ and $\beta$ are weight hyperparameters, $\odot$ represents position-wise multiplication, $\ominus$ represents vector subtraction, and $V_c$ is the correction feature vector; and the correction feature vector is then fused with the sequence of action-voice interaction fusion feature vectors to obtain the corrected sequence of action-voice interaction fusion feature vectors.
Here, a pre-segmented local group of feature value sets is obtained through the square-root value of each feature value of the sequence of user action semantic coding feature vectors and the sequence of user voice text recognition result word granularity semantic feature vectors, and the key maximum-value features of the two sequences are regressed from the pre-segmented local group. In this way, the position-wise significance distribution of the feature values can be improved based on the concept of furthest point sampling, and sparse correspondence control among the feature vectors is performed through the key features with significant distribution, so that the correction feature vector $V_c$ restores the original manifold geometry of the sequence of user action semantic coding feature vectors and the sequence of user voice text recognition result word granularity semantic feature vectors. Fusing the correction feature vector $V_c$ with the sequence of action-voice interaction fusion feature vectors therefore improves the expression effect of the sequence of action-voice interaction fusion feature vectors, thereby improving the numerical accuracy of the animated character parameters obtained by the decoder.
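The following sketch mirrors the correction formula as reconstructed above, under stated assumptions: absolute values are taken before the square root to keep the result real, the two cascades are truncated to a common length, and the correction vector is fused with the fusion sequence by simple addition. These choices are illustrative and not prescribed by the application.

# A sketch of the feature distribution correction under the assumptions stated above.
import torch

def feature_distribution_correction(action_seq, text_seq, fusion_seq, alpha=0.5, beta=0.5):
    v1 = action_seq.flatten()                                  # first cascade feature vector V1
    v2 = text_seq.flatten()                                    # second cascade feature vector V2
    n = min(v1.numel(), v2.numel())                            # align cascade lengths (assumption)
    v1, v2 = v1[:n], v2[:n]
    scaled1 = torch.sqrt(v1.abs() / v1.abs().max())            # position-wise square root, scaled by 1/max(V1)
    scaled2 = torch.sqrt(v2.abs() / v2.abs().max())            # position-wise square root, scaled by 1/max(V2)
    vc = alpha * scaled1 - beta * scaled2                      # correction feature vector Vc
    vc = vc[: fusion_seq.numel()].reshape(fusion_seq.shape)    # align with the fusion sequence (assumption)
    return fusion_seq + vc                                     # corrected fusion sequence (additive fusion assumed)

corrected = feature_distribution_correction(
    torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))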
In summary, a high frame rate playback type fast virtual production system 100 according to embodiments of the present application is illustrated that can implement a fast virtual production process and provide real-time preview capabilities, thereby providing more creative space and expressions for movie and television creators.
As described above, the high frame rate playback type fast virtual production system 100 according to the embodiment of the present application can be implemented in various terminal devices, for example, a server or the like having the high frame rate playback type fast virtual production algorithm according to the embodiment of the present application. In one example, the high frame rate playback type fast virtual production system 100 according to embodiments of the present application may be integrated into the terminal device as a software module and/or hardware module. For example, the high frame rate playback type fast virtual production system 100 according to the embodiments of the present application may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the high frame rate playback type fast virtual production system 100 according to the embodiments of the present application may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the high frame rate playback type fast virtual production system 100 according to the embodiments of the present application and the terminal device may be separate devices, and the high frame rate playback type fast virtual production system 100 may be connected to the terminal device through a wired and/or wireless network and transmit interactive information according to an agreed data format.
Fig. 2 is a flowchart of a high frame rate playback type fast virtual production method according to an embodiment of the present application. Fig. 3 is a schematic diagram of a system architecture of a high frame rate playback type fast virtual production method according to an embodiment of the present application. As shown in fig. 2 and 3, a high frame rate playback type fast virtual production method according to an embodiment of the present application includes: s110, acquiring a user action video captured by a camera and a user recording voice captured by a recording device; s120, extracting semantic features of the user recorded voice to obtain a sequence of semantic feature vectors of the user voice text recognition result word granularity; s130, analyzing action semantic features of the user action video to obtain a sequence of user action semantic coding feature vectors; s140, performing cross-modal fusion on the sequence of the user voice text recognition result word granularity semantic feature vectors and the sequence of the user action semantic coding feature vectors to obtain a sequence of action-voice interaction fusion feature vectors; and S150, generating an animated character virtual video based on the sequence of the action-voice interaction fusion feature vectors.
Here, it will be understood by those skilled in the art that the specific operations of the respective steps in the above-described high frame rate playback type fast virtual production method have been described in detail in the above description with reference to the high frame rate playback type fast virtual production system 100 of fig. 1, and thus, repetitive descriptions thereof will be omitted.
Fig. 4 is an application scenario diagram of a high frame rate playback type fast virtual production system according to an embodiment of the present application. As shown in fig. 4, in this application scenario, first, a user motion video (e.g., D1 illustrated in fig. 4) captured by a camera (e.g., C1 illustrated in fig. 4) and a user recorded voice (e.g., D2 illustrated in fig. 4) captured by a recording device (e.g., C2 illustrated in fig. 4) are acquired, and then the user recorded voice and the user motion video are input to a server (e.g., S illustrated in fig. 4) deployed with a high frame rate playback type fast virtual production algorithm, wherein the server is capable of processing the user recorded voice and the user motion video using the high frame rate playback type fast virtual production algorithm to generate an animated character virtual video.
It should be appreciated that the present application provides a high frame rate playback type fast virtual production system that provides a more efficient, higher quality production scheme for film and television production by integrating multiple key technologies.
In another example of the present application, there is also provided another high frame rate playback type fast virtual production system, as shown in fig. 5, comprising the following key components: a three-dimensional scene editing module, a two-dimensional digital painting module, a virtual photographing projection module, a real-time rendering module, a high-performance cache module, a high-frame-rate playback module and a large-screen splicing control module.
Specifically, the three-dimensional scene editing module is used for creating and editing a three-dimensional scene required by a movie and television, and comprises functions of scene arrangement, modeling, texture processing and the like. The two-dimensional digital painting module supports two-dimensional painting of scene elements, and adds details and visual effects to the scene. The virtual shooting projection module simulates a projection mode of a real camera, and determines the position, focal length and angle of a lens so as to realize visual effect and shooting composition. The real-time rendering module converts the edited three-dimensional scene and two-dimensional painting scene into a high-quality image in real time by utilizing a high-performance rendering engine. The high-performance caching module adopts an efficient data caching technology, so that the data access speed and the system operation efficiency are improved, and the rendering and processing speeds are accelerated. The high frame rate playback module ensures that the system is able to play back the production process at a high frame rate, enabling the user to preview and adjust the effects in real time. The large-screen splicing control module is used for controlling and managing a plurality of display screens, and is spliced into large-size display, so that a wider preview making space is provided.
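Purely as an illustration of how such a composition might be configured, the sketch below gathers the modules' key settings into a single configuration object; every field name and default value here is an assumption introduced for the example, not an interface defined by the application.

# An illustrative configuration sketch for the second embodiment's components.
from dataclasses import dataclass

@dataclass
class VirtualProductionConfig:
    playback_fps: int = 120                   # high-frame-rate playback target (assumed value)
    render_resolution: tuple = (3840, 2160)   # real-time rendering output size
    cache_size_gb: int = 32                   # high-performance cache budget
    screen_grid: tuple = (3, 2)               # large-screen splicing layout (columns, rows)
    enable_3d_editing: bool = True            # three-dimensional scene editing module
    enable_2d_painting: bool = True           # two-dimensional digital painting module

config = VirtualProductionConfig()
print(config)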
Accordingly, the system has the following advantages: 1. Efficient production: the system can play back high-quality production effects in real time, improving production efficiency and workflow smoothness. 2. Real-time preview and adjustment: the user can preview and adjust scenes and effects in real time, reducing later correction costs. 3. Multidimensional drawing: three-dimensional and two-dimensional drawing technologies are fused, increasing the richness and realism of the scene. 4. Large-screen splicing control: large-size displays are supported, providing a wider preview space and visual experience.
This application uses specific words to describe embodiments of the application. Reference to "a first/second embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present application. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present application may be combined as suitable.
Furthermore, those skilled in the art will appreciate that the various aspects of the invention are illustrated and described in the context of a number of patentable categories or circumstances, including any novel and useful procedures, machines, products, or materials, or any novel and useful modifications thereof. Accordingly, aspects of the present application may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.) or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," module, "" engine, "" unit, "" component, "or" system. Furthermore, aspects of the present application may take the form of a computer product, comprising computer-readable program code, embodied in one or more computer-readable media.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present application and is not to be construed as limiting thereof. Although a few exemplary embodiments of this application have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this application. Accordingly, all such modifications are intended to be included within the scope of this application as defined in the claims. It is to be understood that the foregoing is illustrative of the present application and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The application is defined by the claims and their equivalents.

Claims (3)

1. A high frame rate playback type fast virtual production system, comprising:
the data acquisition module is used for acquiring user action videos captured by the camera and user recording voice captured by the recording equipment;
the semantic feature extraction module is used for extracting semantic features of the user recorded voice to obtain a sequence of semantic feature vectors of the user voice text recognition result word granularity;
the action semantic feature analysis module is used for analyzing action semantic features of the user action video to obtain a sequence of user action semantic coding feature vectors;
the cross-modal fusion module is used for carrying out cross-modal fusion on the sequence of the user voice text recognition result word granularity semantic feature vector and the sequence of the user action semantic coding feature vector so as to obtain a sequence of action-voice interaction fusion feature vector;
the cross-modal fusion module comprises a cross-modal fusion unit and is used for processing the sequence of the user action semantic coding feature vector and the sequence of the user voice text recognition result word granularity semantic feature vector by using the cross-modal bidirectional interaction fusion module so as to obtain the sequence of the action-voice interaction fusion feature vector;
a generation module for generating an animated character virtual video based on the sequence of motion-voice interaction fusion feature vectors;
the semantic feature extraction module comprises:
the voice recognition unit is used for carrying out voice recognition on the voice recorded by the user so as to obtain a user voice text recognition result;
the semantic coding unit is used for enabling the user voice text recognition result to pass through a semantic encoder to obtain a sequence of the user voice text recognition result word granularity semantic feature vector;
the semantic coding unit includes:
the dividing subunit is used for dividing the user voice text recognition result based on word granularity to obtain a sequence of user voice text words;
the word embedding coding subunit is used for enabling the sequence of the user voice text words to pass through a word embedding layer to obtain a sequence of user voice text word embedding vectors;
and a context semantic association coding subunit, configured to pass the sequence of the user voice text word embedding vectors through a converter-based user voice text context semantic association encoder to obtain the sequence of the user voice text recognition result word granularity semantic feature vectors;
the action semantic feature analysis module comprises:
the discrete sampling unit is used for performing discrete sampling on the user action video to obtain a sequence of user action key frames;
the user action semantic understanding unit is used for enabling the sequence of the user action key frames to pass through a user action semantic understanding device based on a convolutional neural network model to obtain the sequence of the user action semantic coding feature vectors;
the user action semantic understanding device based on the convolutional neural network model comprises an input layer, a convolutional layer, an activation layer, a pooling layer and an output layer;
the cross-modal fusion unit comprises:
the correlation calculation subunit is used for calculating the correlation between each user action semantic coding feature vector in the sequence of the user action semantic coding feature vectors and each user voice text recognition result word granularity semantic feature vector in the sequence of the user voice text recognition result word granularity semantic feature vectors;
a user action interactive updating subunit, configured to interactively update each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors based on a correlation between each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors and all user speech text recognition result word granularity semantic feature vectors in the sequence of user speech text recognition result word granularity semantic feature vectors, so as to obtain a sequence of updated user action semantic coding feature vectors;
a user voice text interactive updating subunit, configured to interactively update each user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors based on the correlation between each user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors and all user action semantic coding feature vectors in the sequence of user action semantic coding feature vectors, so as to obtain a sequence of updated user voice text recognition result word granularity semantic feature vectors;
a user action fusion subunit, configured to fuse the sequence of the user action semantic coding feature vector and the sequence of the updated user action semantic coding feature vector to obtain a sequence of interactively fused user action semantic coding feature vectors;
the user voice text fusion subunit is used for fusing the sequence of the user voice text recognition result word granularity semantic feature vector and the sequence of the updated user voice text recognition result word granularity semantic feature vector to obtain the sequence of the interactive fusion user voice text recognition result word granularity semantic feature vector;
and the point multiplication subunit is used for carrying out position-based point multiplication on the sequence of the interaction fusion user action semantic coding feature vector and the sequence of the interaction fusion user voice text recognition result word granularity semantic feature vector so as to obtain the sequence of the action-voice interaction fusion feature vector.
2. The high frame rate playback fast virtual production system of claim 1, wherein the correlation calculation subunit is configured to:
calculating the correlation between each user action semantic coding feature vector in the sequence of user action semantic coding feature vectors and each user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors according to the following correlation formula; wherein, the correlation formula is:
$r_{i,j} = \left(V^{a}_{i}\right)^{\top} V^{t}_{j}$

wherein $r_{i,j}$ represents the correlation between the $i$-th user action semantic coding feature vector in the sequence of user action semantic coding feature vectors and the $j$-th user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors, $V^{a}_{i}$ represents the $i$-th user action semantic coding feature vector in the sequence of user action semantic coding feature vectors, $V^{t}_{j}$ represents the $j$-th user voice text recognition result word granularity semantic feature vector in the sequence of user voice text recognition result word granularity semantic feature vectors, and $(\cdot)^{\top}$ represents the transpose operation.
3. The high frame rate playback type fast virtual production system of claim 2, wherein the generating module comprises:
the characteristic distribution correction unit is used for carrying out characteristic distribution correction on the sequence of the action-voice interaction fusion characteristic vector so as to obtain a corrected sequence of the action-voice interaction fusion characteristic vector;
a decoding unit, configured to pass the corrected sequence of motion-voice interaction fusion feature vectors through a decoder-based animated character parameter generator to obtain a sequence of animated character parameters;
and a mapping generation unit for mapping the sequence of animated character parameters to animated characters to generate the animated character virtual video.
CN202410022337.6A 2024-01-08 2024-01-08 High-frame-rate playback type quick virtual film making system Active CN117528197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410022337.6A CN117528197B (en) 2024-01-08 2024-01-08 High-frame-rate playback type quick virtual film making system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410022337.6A CN117528197B (en) 2024-01-08 2024-01-08 High-frame-rate playback type quick virtual film making system

Publications (2)

Publication Number Publication Date
CN117528197A CN117528197A (en) 2024-02-06
CN117528197B true CN117528197B (en) 2024-04-02

Family

ID=89755453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410022337.6A Active CN117528197B (en) 2024-01-08 2024-01-08 High-frame-rate playback type quick virtual film making system

Country Status (1)

Country Link
CN (1) CN117528197B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10732708B1 (en) * 2017-11-21 2020-08-04 Amazon Technologies, Inc. Disambiguation of virtual reality information using multi-modal data including speech
CN113851145A (en) * 2021-09-23 2021-12-28 厦门大学 Virtual human action sequence synthesis method combining voice and semantic key actions
CN115964467A (en) * 2023-01-02 2023-04-14 西北工业大学 Visual situation fused rich semantic dialogue generation method
CN116582726A (en) * 2023-07-12 2023-08-11 北京红棉小冰科技有限公司 Video generation method, device, electronic equipment and storage medium
CN117193524A (en) * 2023-08-24 2023-12-08 南京熊猫电子制造有限公司 Man-machine interaction system and method based on multi-mode feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6925197B2 (en) * 2001-12-27 2005-08-02 Koninklijke Philips Electronics N.V. Method and system for name-face/voice-role association

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10732708B1 (en) * 2017-11-21 2020-08-04 Amazon Technologies, Inc. Disambiguation of virtual reality information using multi-modal data including speech
CN113851145A (en) * 2021-09-23 2021-12-28 厦门大学 Virtual human action sequence synthesis method combining voice and semantic key actions
CN115964467A (en) * 2023-01-02 2023-04-14 西北工业大学 Visual situation fused rich semantic dialogue generation method
CN116582726A (en) * 2023-07-12 2023-08-11 北京红棉小冰科技有限公司 Video generation method, device, electronic equipment and storage medium
CN117193524A (en) * 2023-08-24 2023-12-08 南京熊猫电子制造有限公司 Man-machine interaction system and method based on multi-mode feature fusion

Also Published As

Publication number Publication date
CN117528197A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
Chen et al. Talking-head generation with rhythmic head motion
WO2022001593A1 (en) Video generation method and apparatus, storage medium and computer device
US9626788B2 (en) Systems and methods for creating animations using human faces
Cao et al. Expressive speech-driven facial animation
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
CN111464834B (en) Video frame processing method and device, computing equipment and storage medium
EP3660663B1 (en) Delivering virtualized content
CN110636365B (en) Video character adding method and device, electronic equipment and storage medium
US11581020B1 (en) Facial synchronization utilizing deferred neural rendering
US11582519B1 (en) Person replacement utilizing deferred neural rendering
Zhao et al. Computer-aided graphic design for virtual reality-oriented 3D animation scenes
CN110572717A (en) Video editing method and device
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN117528197B (en) High-frame-rate playback type quick virtual film making system
Wang et al. Talking faces: Audio-to-video face generation
CN115439614B (en) Virtual image generation method and device, electronic equipment and storage medium
AU2018101526A4 (en) Video interpolation based on deep learning
WO2008023819A1 (en) Computer system and operation control method
Lin et al. High resolution animated scenes from stills
CN115917647A (en) Automatic non-linear editing style transfer
JP2020173776A (en) Method and device for generating video
CN116071473B (en) Method and system for acquiring animation motion key frame
CN117478824B (en) Conference video generation method and device, electronic equipment and storage medium
Lin et al. Emotional Semantic Neural Radiance Fields for Audio-Driven Talking Head
Liu et al. Research on the computer case design of 3D human animation visual experience

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant