WO2024060873A1 - Dynamic image generation method and device - Google Patents

Dynamic image generation method and device

Info

Publication number
WO2024060873A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
information
blendshape
characteristic
generation method
Application number
PCT/CN2023/112565
Other languages
French (fr)
Chinese (zh)
Inventor
魏莱
王宇桐
宋雅奇
薛裕颖
沈云
Original Assignee
中国电信股份有限公司
Application filed by 中国电信股份有限公司
Publication of WO2024060873A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof
    • H04N 21/816 - Monomedia components thereof involving special video data, e.g. 3D video
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/80 - 2D [Two Dimensional] animation, e.g. using sprites
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 - Assembly of content; Generation of multimedia applications
    • H04N 21/854 - Content authoring
    • H04N 21/8547 - Content authoring involving timestamps for synchronizing content

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a method for generating dynamic images, a device for generating dynamic images, and a non-volatile computer-readable storage medium.
  • Intelligent-driven digital humans are digital humans restored through three-dimensional modeling, computer vision, speech recognition and other technologies. They can communicate with users through changes in mouth shapes and expressions.
  • In related technologies, several basic BlendShape (blend shape) expression animations and basic mouth shape animations are built into the rendering engine in advance; expression tags and mouth shape tags are generated from the text and are input to the rendering engine for animation retrieval and synthesis.
  • a dynamic image generation method is provided, including: determining characteristic information corresponding to response information according to the user's voice; determining characteristic data corresponding to the response information according to the characteristic information, the characteristic data being determined from the BlendShape data and skeleton data corresponding to the characteristic information; and generating a dynamic image corresponding to the response information according to the characteristic data.
  • the characteristic information includes at least one of emotional information or pronunciation information
  • the characteristic data includes at least one of expression data or mouth shape data.
  • determining the characteristic data corresponding to the response information according to the characteristic information includes at least one of the following: determining the expression data according to the emotional information, the expression data being determined from the first BlendShape data and first skeleton data corresponding to the emotional information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from the second BlendShape data and second skeleton data corresponding to the pronunciation information.
  • the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
  • the generating method further comprises: determining an initial weight of the feature data based on the feature information; and generating a dynamic image corresponding to the response information based on the feature data includes: randomly generating actual weights for the corresponding moments of multiple key frames within a value range determined by the initial weight and a threshold; weighting the feature data with the multiple actual weights to generate the multiple key frames; and generating the dynamic image from the multiple key frames.
  • the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  • generating a dynamic image based on multiple key frames includes: smoothing the weighted feature data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames; and generating the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
  • the generation method also includes: determining the timestamp of the feature data according to the feature information; and randomly generating the actual weights for the corresponding moments of multiple key frames within the value range determined by the initial weight and the threshold includes: generating the actual weight for the moment corresponding to the timestamp.
  • determining the characteristic data corresponding to the response information according to the characteristic information includes: using the state machine in the semantic engine to determine the identification information corresponding to the characteristic data according to the characteristic information; and using the rendering engine to obtain the characteristic data from the cache pool according to the identification information sent by the state machine.
  • the generation method further includes: during the initialization process, using a rendering engine to read multiple feature data from the facial model library; and using the rendering engine to load the multiple feature data into the cache pool in JSON text format.
  • determining the characteristic information corresponding to the response information based on the user's voice includes: when the user initiates voice interaction, performing semantic analysis and emotional analysis on the user's voice; determining the response text in the question and answer library based on the analysis results; and performing at least one of sentiment analysis or phoneme extraction on the response text to determine the feature information.
  • BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
  • a device for generating dynamic images is provided, including: a semantic engine module for determining feature information corresponding to the response information based on the user's voice and determining feature data corresponding to the response information based on the feature information, the feature data being determined from the BlendShape data and skeletal data corresponding to the feature information; and a rendering engine module for generating dynamic images corresponding to the response information based on the feature data.
  • the generating device further includes: a facial model library for storing a plurality of feature data.
  • the feature information includes at least one of emotion information or pronunciation information
  • the feature data includes at least one of expression data or mouth shape data
  • the semantic engine module performs at least one of the following: determining the expression data according to the emotion information, the expression data being determined from the first BlendShape data and first skeleton data corresponding to the emotion information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from the second BlendShape data and second skeleton data corresponding to the pronunciation information.
  • the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
  • the semantic engine module determines the initial weight of the feature data based on the feature information; the rendering engine module randomly generates the actual weights for the corresponding moments of multiple key frames within the value range determined by the initial weight and the threshold, weights the feature data with the multiple actual weights to generate multiple key frames, and generates a dynamic image from the multiple key frames.
  • the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  • the rendering engine module smooths the weighted feature data corresponding to adjacent key frames to generate non-key frames between adjacent key frames, and generates the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
  • the semantic engine module determines the timestamp of the feature data based on the feature information; the rendering engine module generates the actual weight corresponding to the timestamp.
  • the semantic engine module uses the state machine to determine the identification information corresponding to the feature data based on the feature information; the rendering engine module obtains the feature data from the cache pool based on the identification information sent by the state machine.
  • the rendering engine module reads multiple feature data from the facial model library and loads the multiple feature data into the cache pool in JSON text format.
  • when the user initiates voice interaction, the semantic engine module performs semantic analysis and emotional analysis on the user's voice; determines the response text in the question and answer library based on the analysis results; and performs at least one of emotional analysis or phoneme extraction on the response text to determine the characteristic information.
  • BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
  • a device for generating dynamic images is provided, including: a memory; and a processor coupled to the memory.
  • the processor is configured to execute the dynamic image generation method of any of the above embodiments based on instructions stored in the memory.
  • a non-volatile computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, the dynamic image generation method of any of the above embodiments is implemented.
  • Figure 1 shows a flow chart of some embodiments of a dynamic image generation method of the present disclosure
  • Figure 2 shows a flow chart of other embodiments of the dynamic image generation method of the present disclosure
  • Figure 3 shows a schematic diagram of some embodiments of a dynamic image generation method of the present disclosure
  • Figure 4 shows a block diagram of some embodiments of the dynamic image generation device of the present disclosure
  • Figure 5 shows a block diagram of other embodiments of the dynamic image generation device of the present disclosure
  • Figure 6 shows a block diagram of still other embodiments of the dynamic image generation device of the present disclosure
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • the driving methods of intelligent-driven digital humans mainly include the following two types.
  • One approach is to build several basic BlendShape expression animations and basic mouth shape animations into the rendering engine in advance; the semantic engine then performs speech recognition on the user's voice to determine the text content to be answered, generates expression tags and mouth shape tags from the text, and inputs them to the rendering engine for animation retrieval and synthesis.
  • The digital human driving effect of this approach is rather stiff and rigid, and a large amount of facial animation must be produced in advance. If the expression details are demanding, hundreds of animations may even need to be pre-produced and then imported into the rendering engine. If expressions or mouth shapes need to be expanded, animations must be manually created and imported again, resulting in high labor costs and low system scalability.
  • Another method does not require the animation to be imported into the rendering engine in advance. Instead, the semantic engine directly obtains the BlendShape coefficients of expressions and mouth shapes based on training, and sends the coefficients to the rendering engine, which receives and drives them in real time.
  • This method needs to continuously occupy bandwidth to send data, and does not form a standard expression library for the training data. There is a large amount of repeated data, resulting in high bandwidth resource usage, poor data reusability, and reduced real-time performance.
  • BlendShape uses a series of vertex displacements to achieve a smooth deformation effect on the object.
  • the single use of BlendShape for driving does not take into account the impact of bones on the digital human, so the driving accuracy of the digital human is limited and the sense of reality is poor.
  • Using only BlendShape or only bones for driving limits the driving accuracy of the digital human, resulting in stiff facial expressions that lack dynamic change and precise expression, and poor realism.
  • High resource overhead: the training-based method needs to continuously occupy bandwidth to send data that contains a large amount of duplicate data, and the animation-based method needs to store a large number of animation assets, resulting in high bandwidth and hardware resource overhead and reduced performance and real-time capability.
  • the inventor of the present disclosure discovered that the above-mentioned related technologies have the following problem: the generated dynamic images are stiff and rigid, resulting in poor dynamic image effects.
  • the present disclosure proposes a technical solution for generating dynamic images, which can improve the effect of dynamic images.
  • the present disclosure proposes a technical solution for driving dynamic micro-expressions of a digital human based on random weights.
  • the random weight calculation method of the present disclosure randomly calculates the weight of facial key frames based on a state machine and threshold, and uses the least squares method to smooth the key frames to realize the dynamic expression of micro-expressions;
  • the bone correction method of the present disclosure weights BlendShape data and skeleton data to achieve accurate expression driving;
  • the present disclosure builds a facial model library to store basic expressions, sets a model ID to uniquely identify different expressions and mouth shapes, and implements lightweight import into the cache pool for data caching based on JSON text, realizing efficient reuse and expansion of data.
  • FIG. 1 shows a flowchart of some embodiments of the dynamic image generation method of the present disclosure.
  • step 110 characteristic information corresponding to the response information is determined based on the user's voice.
  • semantic analysis and emotional analysis are performed on the user's voice; a response text is determined in the question and answer library according to the analysis results; and at least one of emotional analysis or phoneme extraction is performed on the response text to determine the feature information.
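As a rough sketch of this step, the Python below mimics the flow with a toy question-and-answer library, a keyword-based emotion lookup, and a crude per-word phoneme split; all of these stand-ins are hypothetical placeholders, not the semantic engine described in this disclosure.

```python
# Minimal sketch of step 110: derive feature information (emotion + phonemes)
# for the response text. The Q&A library, emotion keywords and phoneme rule
# below are illustrative placeholders only.

QA_LIBRARY = {                      # hypothetical question -> answer pairs
    "how are you": "I am very happy to talk with you today",
    "goodbye": "Sorry to see you go, take care",
}

EMOTION_KEYWORDS = {"happy": "happy", "sorry": "sad"}   # toy sentiment lookup


def determine_feature_info(user_text: str) -> dict:
    """Pick a response text, then extract emotion info and pronunciation info."""
    # 1) semantic analysis: here just a lookup in the Q&A library
    response = QA_LIBRARY.get(user_text.lower().strip(), "I am not sure about that")

    # 2) emotion analysis of the response text (keyword matching as a stand-in)
    emotion = "neutral"
    for word, label in EMOTION_KEYWORDS.items():
        if word in response.lower():
            emotion = label
            break

    # 3) phoneme extraction: crude per-word split as a stand-in for a real
    #    grapheme-to-phoneme step
    phonemes = response.lower().split()

    return {"response_text": response, "emotion": emotion, "phonemes": phonemes}


if __name__ == "__main__":
    print(determine_feature_info("how are you"))
```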
  • step 120 the characteristic data corresponding to the response information is determined based on the characteristic information, and the characteristic data is determined based on the BlendShape data and bone data corresponding to the characteristic information.
  • a rendering engine is used to read multiple feature data from a facial model library; and the rendering engine is used to load the multiple feature data into a cache pool in a JSON text format.
  • the facial model library can perform bone correction processing on BlendShape, thereby achieving precise expression driving based on BlendShape and bones.
  • the facial model library can also store text data of basic expressions and model IDs (i.e., identification information) used to uniquely identify different expressions and mouth shapes (i.e., feature data), thereby achieving efficient reading and reuse of data.
  • the state machine in the semantic engine is used to determine the identification information corresponding to the characteristic data according to the characteristic information; the rendering engine is used to obtain the characteristic data from the cache pool according to the identification information sent by the state machine.
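A minimal sketch of how such a cache pool could work is shown below: the facial model library is loaded once from JSON text at initialization, and later lookups use only the model ID sent by the state machine. The JSON layout and field names are assumptions for illustration, not the patent's actual format.

```python
import json

# Hypothetical facial model library serialized as JSON text, keyed by model ID.
FACIAL_MODEL_LIBRARY_JSON = """
{
  "EMO_HAPPY": {"blendshape": [0.6, 0.1, 0.3], "skeleton": [0.2, 0.0]},
  "LIP_O":     {"blendshape": [0.0, 0.8, 0.1], "skeleton": [0.4, 0.1]}
}
"""

class CachePool:
    """Loads feature data once at initialization and serves lookups by model ID."""

    def __init__(self, library_json: str):
        self._pool = json.loads(library_json)   # JSON text -> in-memory dict

    def get(self, model_id: str) -> dict:
        # The state machine sends only the identification information (model ID);
        # the full BlendShape/skeleton coefficients come from the cache pool.
        return self._pool[model_id]

pool = CachePool(FACIAL_MODEL_LIBRARY_JSON)
print(pool.get("EMO_HAPPY"))
```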
  • the facial model library includes a mouth shape database
  • the data structure of the mouth shape data LIP in the mouth shape database is [LipID, BlendShape_L, Skeleton_L].
  • LipID represents the mouth shape ID of the mouth shape data.
  • Multiple phonemes can have the same mouth shape, that is, the same mouth shape ID.
  • the phonemes "o" and "ao" have similar mouth shapes and can correspond to the same LipID, thereby reducing the amount of data while maintaining precise driving.
  • BlendShape_L represents a set of BlendShape coefficients corresponding to the mouth shape data (i.e., the second BlendShape data), and Skeleton_L represents the facial bone coefficients corresponding to the mouth shape data (i.e., the second skeletal data).
  • the facial model library includes an expression database
  • the data structure of the expression data Emotion in the expression database is [EmoID, BlendShape_E, Skeleton_E].
  • BlendShape_E represents a set of BlendShape coefficients corresponding to the expression data (i.e., the first BlendShape data), and Skeleton_E represents the facial skeleton coefficients corresponding to the expression data (i.e., the first skeletal data).
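The two record layouts can be pictured as simple data classes, as in the sketch below; the field types and the example phoneme-to-LipID mapping are illustrative assumptions, not the patent's exact schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Lip:
    """Mouth shape record [LipID, BlendShape_L, Skeleton_L]."""
    lip_id: str
    blendshape_l: List[float]   # second BlendShape data (coefficients)
    skeleton_l: List[float]     # second skeleton data (facial bone coefficients)

@dataclass
class Emotion:
    """Expression record [EmoID, BlendShape_E, Skeleton_E]."""
    emo_id: str
    blendshape_e: List[float]   # first BlendShape data
    skeleton_e: List[float]     # first skeleton data

# Several phonemes may share one mouth shape, i.e. map to the same LipID.
PHONEME_TO_LIP_ID = {"o": "LIP_O", "ao": "LIP_O"}   # illustrative mapping
```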
  • BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
  • BlendShape_E data includes a set of overall expression benchmarks (or expression components).
  • the BlendShape_E data of human facial expression e at a certain moment is a linear weighting of this set of expression components:
  • BlendShape_E = B_E · d_bE + b_bE
  • B_E is a set of expression benchmarks
  • d_bE is the corresponding weight coefficient
  • b_bE is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions).
  • the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
  • when the BlendShape_E data changes, the digital human's skeletal data should also change accordingly.
  • when the digital human speaks, the skeletal points related to the jaw and face also shift. Therefore, it is necessary to perform bone correction processing on the BlendShape_E data to make the driving effect more accurate and realistic.
  • by analogy with the BlendShape_E formula, the corrected skeletal data can be written as Skeleton_E = S_E · d_SE + b_SE, where:
  • S_E is a set of bone benchmarks (or bone components)
  • d_SE is the corresponding bone coefficient (i.e., weight coefficient)
  • b_SE is the initial bone (such as the neutral bone that is different from negative bones and positive bones).
  • the bone coefficient represents the linear blending coefficient of a set of bone components that changes the expression from a neutral bone to a target bone.
  • similarly, for the mouth shape data, BlendShape_L = B_L · d_bL + b_bL and Skeleton_L = S_L · d_SL + b_SL, where:
  • B_L is a set of expression benchmarks
  • d_bL is the corresponding weight coefficient
  • b_bL is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions)
  • S_L is a set of skeleton benchmarks (or bone components)
  • d_SL is the corresponding bone coefficient (i.e., weight coefficient)
  • b_SL is the initial bone (such as the neutral bone that is different from negative bones and positive bones).
  • the bone coefficient represents the linear blending coefficient of a set of bone components that changes the mouth shape from a neutral bone to a target bone.
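Taken together, both the BlendShape term and the bone-correction term are linear blends of a component basis plus an initial (neutral) value. The sketch below evaluates such a blend with small made-up vectors; the dimensions and coefficients are purely illustrative.

```python
import numpy as np

def linear_blend(basis: np.ndarray, weights: np.ndarray, neutral: np.ndarray) -> np.ndarray:
    """Generic blend: basis (n_targets x n_components) @ weights + neutral."""
    return basis @ weights + neutral

# Toy expression example: 4 BlendShape targets, 2 expression components.
B_E = np.array([[0.9, 0.0], [0.1, 0.7], [0.0, 0.5], [0.3, 0.2]])  # expression benchmarks
d_bE = np.array([0.8, 0.2])          # weight coefficients for the target emotion
b_bE = np.zeros(4)                   # neutral (initial) expression

blendshape_e = linear_blend(B_E, d_bE, b_bE)

# Bone correction follows the same pattern: neutral bone -> target bone.
S_E = np.array([[0.4, 0.1], [0.0, 0.6]])   # bone benchmarks (2 bone points)
d_SE = np.array([0.8, 0.2])                # bone (weight) coefficients
b_SE = np.zeros(2)                          # neutral (initial) bone

skeleton_e = linear_blend(S_E, d_SE, b_SE)
print(blendshape_e, skeleton_e)
```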
  • the feature information includes at least one of emotion information or pronunciation information
  • the feature data includes at least one of expression data or mouth shape data.
  • expression data is determined based on emotional information.
  • the expression data is determined based on the first BlendShape data and the first skeleton data corresponding to the emotion information.
  • mouth shape data is determined based on the pronunciation information, and the mouth shape data is determined from the second BlendShape data and the second skeleton data corresponding to the pronunciation information.
  • the semantic engine recognizes the user's voice and outputs the digital human's answer audio; the state machine of the semantic engine outputs the EmoID and initial weight Weight of the corresponding expression data according to the emotion of the digital human's answer text. Since the emotions of the digital human are not fixed, the state machine also outputs the TimeStamp of the expression data to ensure the changes of different micro-expressions. The state machine outputs the LipID and initial weight Weight of the corresponding lip shape data according to the phonemes of the digital human's answer text. Since the pronunciation of each word of the digital human is not at the same frequency, the state machine also outputs the TimeStamp of the lip shape data to ensure the synchronization of the lip shape and audio.
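The per-key-frame message from the state machine to the rendering engine can be pictured as a small record carrying an ID, an initial weight, and a timestamp, as sketched below; the concrete IDs and timings are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class DriveEvent:
    """One item sent by the semantic engine's state machine to the rendering engine."""
    model_id: str      # EmoID for expressions or LipID for mouth shapes
    weight: float      # initial weight W of the feature data
    timestamp_ms: int  # TimeStamp keeping micro-expressions and lip sync on schedule

# Illustrative stream for one short answer: one expression event, two lip events.
events = [
    DriveEvent("EMO_HAPPY", 0.7, 0),
    DriveEvent("LIP_HE", 0.9, 120),
    DriveEvent("LIP_LOU", 0.9, 260),
]
```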
  • step 130 a dynamic image corresponding to the response information is generated based on the characteristic data.
  • the weighted feature data corresponding to adjacent key frames is smoothed to generate non-key frames between adjacent key frames; according to the key frames and non-key frames and their timestamps, in chronological order Generate dynamic images.
  • the rendering engine caches the data of the facial model library in the cache pool; according to the random weight calculation method, the actual weights of the facial model are randomly calculated based on the threshold to realize the dynamic expression of micro-expressions; the driving data is smoothed according to the actual weights, and expression and mouth shape data are integrated to drive the digital human's overall face; voice is played synchronously.
  • BlendShape data is corrected using the skeletal data, and the precise driving of expressions can be realized based on the BlendShape data and the skeletal data, thereby improving the effect of dynamic images.
  • the actual weight can be calculated through the embodiment in Figure 2.
  • FIG. 2 shows a flowchart of another embodiment of the dynamic image generation method of the present disclosure.
  • step 210 the initial weight of the feature data is determined based on the feature information.
  • step 220 actual weights of corresponding moments of multiple key frames are randomly generated within the value range determined based on the initial weight and the threshold.
  • the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  • the timestamp of the feature data is determined based on the feature information; and the actual weight corresponding to the time stamp is generated.
  • the expression of a digital human should not be static, stiff, and unchanging, so the random weight calculation module is configured with a threshold T, which reflects the dynamic range of the digital human's expressions as an increment; the random weight calculation module calculates the maximum weight W+T and the minimum weight W-T, and generates a random number R every time interval I as the actual weight: W-T < R < W+T.
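A minimal sketch of this random weight rule, assuming a uniform draw from the interval (W - T, W + T) at every key-frame interval I; the concrete threshold and interval values are illustrative.

```python
import random
from typing import List, Tuple

def keyframe_weights(initial_weight: float, threshold: float,
                     interval_s: float, duration_s: float) -> List[Tuple[float, float]]:
    """Generate (time, actual_weight) pairs with W - T < R < W + T every interval I."""
    lo, hi = initial_weight - threshold, initial_weight + threshold
    weights = []
    t = 0.0
    while t <= duration_s:
        weights.append((t, random.uniform(lo, hi)))  # random actual weight R
        t += interval_s
    return weights

# Example: initial weight W = 0.7, threshold T = 0.1, a key frame every 0.5 s for 2 s.
print(keyframe_weights(0.7, 0.1, 0.5, 2.0))
```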
  • step 230 the feature data is weighted respectively using multiple actual weights to generate multiple key frames.
  • step 240 dynamic images are generated based on multiple key frames.
  • the least squares method is used to smooth the expression data to obtain non-key-frame expression data; the expression and mouth shape data are fused, and dynamic driving of the digital human's overall face is realized according to the timestamp TS.
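One plausible reading of the smoothing step is a least-squares polynomial fit through the key-frame weights, sampled at the non-key-frame times; the polynomial degree and frame rate in the sketch below are assumptions rather than the disclosure's exact procedure.

```python
import numpy as np

def smooth_weights(key_times: np.ndarray, key_weights: np.ndarray,
                   frame_times: np.ndarray, degree: int = 2) -> np.ndarray:
    """Least-squares polynomial fit through key frames, evaluated at every frame time."""
    coeffs = np.polyfit(key_times, key_weights, deg=degree)   # least-squares fit
    return np.polyval(coeffs, frame_times)

# Key frames every 0.5 s with randomly perturbed weights, rendered at 30 fps.
key_times = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
key_weights = np.array([0.72, 0.65, 0.78, 0.69, 0.74])
frame_times = np.arange(0.0, 2.0, 1.0 / 30.0)

non_keyframe_weights = smooth_weights(key_times, key_weights, frame_times)
print(non_keyframe_weights[:5])
```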
  • the random weight calculation method can realize the dynamic generation of micro-expressions through random numbers. There is no need to pre-produce hundreds of expression animations, and there is no need for the semantic engine to send a large amount of driving data in real time. Therefore, on the basis of satisfying basic emotions, the expressions of digital people can dynamically change within a certain range over time, giving users a sense of reality and improving the accuracy and authenticity of dynamic images.
  • FIG. 3 shows a schematic diagram of some embodiments of the dynamic image generation method of the present disclosure.
  • the facial model library is used to implement bone correction and can perform bone correction processing on BlendShape, thereby achieving precise expression driving based on BlendShape and bones.
  • the facial model library can also store text data of basic expressions and model IDs (i.e., identification information) used to uniquely identify different expressions and mouth shapes (i.e., feature data), thereby achieving efficient reading and reuse of data.
  • the design of the mouth shape database of the facial model library is as follows: the data structure of the mouth shape data LIP is [LipID, BlendShape_L, Skeleton_L]; LipID represents the mouth shape ID of the mouth shape data, and multiple phonemes can have the same mouth shape, that is, the same mouth shape ID. For example, the phonemes "o" and "ao" have similar mouth shapes and can correspond to the same LipID, thereby reducing the amount of data while maintaining precise driving; BlendShape_L represents a set of BlendShape coefficients corresponding to the mouth shape data (i.e., the second BlendShape data), and Skeleton_L represents the facial skeleton coefficient corresponding to the mouth shape data (i.e., the second skeleton data).
  • B_E is a set of expression benchmarks
  • d_bE is the corresponding weight coefficient
  • b_bE is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions).
  • when the BlendShape_E data changes, the digital human's skeletal data should also change accordingly.
  • when the digital human speaks, the skeletal points related to the jaw and face also shift. Therefore, it is necessary to perform bone correction processing on the BlendShape_E data to make the driving effect more accurate and realistic.
  • S_E is a set of bone benchmarks (or bone components)
  • d_SE is the corresponding bone coefficient (i.e., weight coefficient)
  • b_SE is the initial bone (such as the neutral bone that is different from negative bones and positive bones).
  • the bone coefficient represents the linear blending coefficient of a set of bone components that changes the expression from a neutral bone to a target bone.
  • B_L is a set of expression benchmarks
  • d_bL is the corresponding weight coefficient
  • b_bL is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions)
  • S_L is a set of skeleton benchmarks (or bone components)
  • d_SL is the corresponding bone coefficient (i.e., weight coefficient)
  • b_SL is the initial bone (such as the neutral bone that is different from negative bones and positive bones).
  • the bone coefficient represents the linear blending coefficient of a set of bone components that changes the mouth shape from a neutral bone to a target bone.
  • the semantic engine recognizes the user's voice and outputs the audio of the digital person's answer; the state machine of the semantic engine outputs the EmoID and initial weight of the corresponding expression data based on the emotion of the digital person's answer text. Since the emotions of digital people are not fixed, the state machine also outputs the TimeStamp of the expression data to ensure changes in different micro-expressions. The state machine outputs the LipID and initial weight of the corresponding mouth shape data based on the phonemes of the text answered by the digital person. Since the pronunciation of each word of the digital person is not at the same frequency, the state machine also outputs the TimeStamp of the mouth shape data to ensure the synchronization of the mouth shape and audio.
  • the rendering engine caches the data of the facial model library in the cache pool; according to the random weight calculation method, the actual weights of the facial model are randomly calculated based on the threshold to realize the dynamic expression of micro-expressions; the driving data is smoothed according to the actual weights, and expression and mouth shape data are fused to drive the digital human's overall face; voice is played synchronously.
  • the state machine of the semantic engine sends the EmoID, initial weight W and timestamp TS of the expression data to the rendering engine; the random weight calculation module of the rendering engine matches the corresponding data in the cache pool based on the EmoID.
  • the expression of the digital human should not be static, stiff, and unchanging, so the random weight calculation module is configured with a threshold T, which reflects the dynamic range of the digital human's expressions as an increment; the random weight calculation module calculates the maximum weight W+T and the minimum weight W-T, and generates a random number R every time interval I as the actual weight: W-T < R < W+T
  • the least squares method is used to smooth the expression data to obtain non-key-frame expression data; the expression and mouth shape data are fused, and dynamic driving of the digital human's overall face is realized according to the timestamp TS.
  • the digital human intelligence drives the interaction process as follows.
  • step 1 set the BlendShape data model of basic expressions and basic mouth shapes.
  • Each base data model is uniquely identified by a model ID.
  • step 2 a bone correction process is performed, and the coefficients of bone points are added to the data model to form a corrected facial model text.
  • step 3 the rendering engine reads the facial model library when it is initialized, and loads the data into the cache pool in JSON text format.
  • step 4 when the user initiates voice interaction with the digital human, the speech recognition module performs user semantic and emotional analysis.
  • step 5 the answer text module stores an intelligent question and answer library and obtains the corresponding answer text based on the user's semantics and emotions.
  • step 6 the natural language processing module performs sentiment analysis and phoneme extraction on the answer text.
  • step 7 the speech synthesis module synthesizes the segmented answer text into audio data.
  • step 8 the state machine sends expression ID or lip shape ID, weight, timestamp, audio and other data to the rendering engine.
  • step 9 the random weight calculation module matches the corresponding basic model in the cache pool based on the ID and generates key frames based on random numbers.
  • step 10 the smoothing module smoothes the key frames based on the least squares method.
  • step 11 the expression fusion module fuses the expression and mouth shape data and implements dynamic driving according to the timestamp.
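Stringing the steps together, the sketch below runs a single answer through placeholder versions of the modules from steps 4 to 11; every function body is a trivial stand-in for the real speech, NLP, and rendering components.

```python
# Illustrative end-to-end pass over steps 4-11 for one user utterance.
import random

CACHE_POOL = {"EMO_HAPPY": 0.7, "LIP_HE": 0.9, "LIP_LOU": 0.9}   # ID -> base weight (toy)

def recognize_speech(audio):            # step 4: speech recognition + user analysis
    return "how are you", "neutral"

def lookup_answer(text, emotion):       # step 5: intelligent question-and-answer library
    return "hello"

def analyze_answer(answer):             # step 6: sentiment analysis and phoneme extraction
    return "EMO_HAPPY", ["LIP_HE", "LIP_LOU"]

def synthesize_speech(answer):          # step 7: text-to-speech
    return b"fake-audio-bytes"

def drive_digital_human(user_audio):
    user_text, user_emotion = recognize_speech(user_audio)
    answer_text = lookup_answer(user_text, user_emotion)
    emo_id, lip_ids = analyze_answer(answer_text)
    audio = synthesize_speech(answer_text)
    # steps 8-9: the state machine sends IDs; random weights around the cached base weight
    key_frames = [(mid, CACHE_POOL[mid] + random.uniform(-0.1, 0.1))
                  for mid in [emo_id, *lip_ids]]
    # steps 10-11: smoothing and fusion are sketched above; here just return the frames
    return key_frames, audio

print(drive_digital_human(b"user-audio"))
```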
  • a random weight calculation method which randomly calculates the weight of facial key frames based on a state machine and a threshold, and uses the least squares method to smooth the key frames to achieve dynamic expression of micro-expressions;
  • a bone correction rule is proposed, based on BlendShape and bone weighting to achieve accurate expression driving, thereby improving the realism and driving accuracy of digital human beings.
  • the cache pool is used to store basic facial data, and the random weight calculation method is used to realize dynamic micro-expressions. There is no need to occupy bandwidth to send a large amount of repeated data, and there is no need to store a large amount of animation assets, which reduces the additional overhead of bandwidth and hardware resources and improves performance and real-time capability.
  • The model ID uniquely identifies different expressions and mouth shapes; data is imported into the cache pool based on JSON text for caching, achieving efficient reuse and expansion of data. It is not necessary to build basic mouth shapes and basic expressions into the rendering engine in advance, which improves system scalability and reduces labor costs.
  • FIG. 4 shows a block diagram of some embodiments of the dynamic image generation device of the present disclosure.
  • the dynamic image generation device 4 includes: a semantic engine module 41, which is used to determine the characteristic information corresponding to the response information based on the user's voice, and to determine the characteristic data corresponding to the response information based on the characteristic information.
  • the characteristic data is determined from the BlendShape data and skeleton data corresponding to the characteristic information;
  • the rendering engine module 42 is used to generate dynamic images corresponding to the response information based on the feature data.
  • the generation device 4 further includes: a facial model library 43 for storing multiple feature data.
  • the feature information includes at least one of emotion information or pronunciation information
  • the feature data includes at least one of expression data or mouth shape data
  • the semantic engine module 41 performs at least one of the following: determining the expression data according to the emotion information, the expression data being determined from the first BlendShape data and first skeleton data corresponding to the emotion information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from the second BlendShape data and second skeleton data corresponding to the pronunciation information.
  • the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
  • the semantic engine module 41 determines the initial weight of the feature data based on the feature information; the rendering engine module 42 randomly generates the actual weights for the corresponding moments of multiple key frames within the value range determined by the initial weight and the threshold, weights the feature data with the multiple actual weights to generate multiple key frames, and generates a dynamic image from the multiple key frames.
  • the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  • the rendering engine module 42 smooths the weighted feature data corresponding to adjacent key frames to generate non-key frames between adjacent key frames, and generates the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
  • the semantic engine module 41 determines the timestamp of the feature data according to the feature information; the rendering engine module generates the actual weight of the time corresponding to the timestamp.
  • the semantic engine module 41 uses the state machine therein to determine the identification information corresponding to the characteristic data based on the characteristic information; the rendering engine module 42 obtains the characteristic data from the cache pool according to the identification information sent by the state machine.
  • the rendering engine module 42 reads multiple feature data from the facial model library and loads the multiple feature data into the cache pool in JSON text format.
  • when the user initiates voice interaction, the semantic engine module 41 performs semantic analysis and emotional analysis on the user's voice; determines the response text in the question and answer library based on the analysis results; and performs at least one of emotional analysis or phoneme extraction on the response text to determine the characteristic information.
  • BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
  • FIG. 5 shows a block diagram of another embodiment of the dynamic image generating device of the present disclosure.
  • the device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51.
  • the processor 52 is configured to execute the dynamic image generation method of any embodiment of the present disclosure based on instructions stored in the memory 51.
  • the memory 51 may include, for example, a system memory, a fixed non-volatile storage medium, etc.
  • the system memory may store, for example, an operating system, an application program, a boot loader, a database, and other programs.
  • FIG. 6 shows a block diagram of some further embodiments of the dynamic image generating device of the present disclosure.
  • the dynamic image generation device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610.
  • the processor 620 is configured to execute the dynamic image generation method of any of the foregoing embodiments based on instructions stored in the memory 610.
  • the memory 610 may include, for example, a system memory, a fixed non-volatile storage medium, etc. The system memory may store, for example, an operating system, applications, a boot loader, and other programs.
  • the moving image generating device 6 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610 and the processor 620 may be connected through a bus 660, for example. Among them, the input and output interface 630 provides connection interfaces for input and output devices such as monitors, mice, keyboards, touch screens, microphones, and speakers. Network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB disks.
  • embodiments of the present disclosure may be provided as methods, systems, or computer program products. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media including, but not limited to, disk memory, CD-ROM, optical storage, and the like having computer-usable program code embodied therein.
  • the methods and systems of the present disclosure may be implemented in many ways.
  • the methods and systems of the present disclosure may be implemented through software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above order for the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated.
  • the present disclosure can also be implemented as programs recorded in recording media, and these programs include machine-readable instructions for implementing methods according to the present disclosure.
  • the present disclosure also covers recording media storing programs for executing methods according to the present disclosure.

Abstract

The present disclosure relates to the technical field of computers, and relates to a dynamic image generation method and device. The generation method comprises: determining, according to user voice, feature information corresponding to response information; determining, according to the feature information, feature data corresponding to the response information, wherein the feature data is determined according to BlendShape data and skeleton data corresponding to the feature information; and generating, according to the feature data, a dynamic image corresponding to the response information.

Description

Dynamic image generation method and device
Cross-Reference to Related Applications
This application is based on, and claims priority to, the CN application No. 202211141405.8 filed on September 20, 2022; the disclosure of that CN application is incorporated into this application in its entirety.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a method for generating dynamic images, a device for generating dynamic images, and a non-volatile computer-readable storage medium.
Background
With the development of the metaverse, virtual reality, digital twins and other fields, virtual digital humans have begun to evolve from digitized appearance toward intelligent thought and behavior. Intelligence-driven digital humans are digital humans reconstructed through three-dimensional modeling, computer vision, speech recognition and other technologies; they can communicate with users through changes in mouth shape and expression.
In the related art, several basic BlendShape (blend shape) expression animations and basic mouth shape animations are built into the rendering engine in advance; expression tags and mouth shape tags are generated from the text and input to the rendering engine for animation retrieval and synthesis.
Summary of the Invention
According to some embodiments of the present disclosure, a dynamic image generation method is provided, including: determining characteristic information corresponding to response information according to a user's voice; determining characteristic data corresponding to the response information according to the characteristic information, the characteristic data being determined from BlendShape data and skeleton data corresponding to the characteristic information; and generating a dynamic image corresponding to the response information according to the characteristic data.
In some embodiments, the characteristic information includes at least one of emotional information or pronunciation information, the characteristic data includes at least one of expression data or mouth shape data, and determining the characteristic data corresponding to the response information according to the characteristic information includes at least one of the following: determining the expression data according to the emotional information, the expression data being determined from first BlendShape data and first skeleton data corresponding to the emotional information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from second BlendShape data and second skeleton data corresponding to the pronunciation information.
In some embodiments, the skeleton data is determined from initial skeleton data and a weighted sum of multiple skeleton data components.
In some embodiments, the generation method further comprises: determining an initial weight of the characteristic data according to the characteristic information; and generating the dynamic image corresponding to the response information according to the characteristic data includes: randomly generating actual weights for the corresponding moments of multiple key frames within a value range determined by the initial weight and a threshold; weighting the characteristic data with the multiple actual weights to generate the multiple key frames; and generating the dynamic image from the multiple key frames.
In some embodiments, the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
In some embodiments, generating the dynamic image from the multiple key frames includes: smoothing the weighted characteristic data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames; and generating the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
In some embodiments, the generation method further includes: determining a timestamp of the characteristic data according to the characteristic information; and randomly generating the actual weights for the corresponding moments of the multiple key frames within the value range determined by the initial weight and the threshold includes: generating the actual weight for the moment corresponding to the timestamp.
In some embodiments, determining the characteristic data corresponding to the response information according to the characteristic information includes: using a state machine in a semantic engine to determine identification information corresponding to the characteristic data according to the characteristic information; and using a rendering engine to obtain the characteristic data from a cache pool according to the identification information sent by the state machine.
In some embodiments, the generation method further includes: during initialization, using the rendering engine to read multiple pieces of characteristic data from a facial model library; and using the rendering engine to load the multiple pieces of characteristic data into the cache pool in JSON text format.
In some embodiments, determining the characteristic information corresponding to the response information according to the user's voice includes: when the user initiates voice interaction, performing semantic analysis and emotional analysis on the user's voice; determining a response text in a question and answer library according to the analysis results; and performing at least one of emotional analysis or phoneme extraction on the response text to determine the characteristic information.
In some embodiments, the BlendShape data is determined from initial BlendShape data and a weighted sum of multiple BlendShape data components.
According to other embodiments of the present disclosure, a device for generating dynamic images is provided, including: a semantic engine module configured to determine characteristic information corresponding to response information according to a user's voice, and to determine characteristic data corresponding to the response information according to the characteristic information, the characteristic data being determined from BlendShape data and skeleton data corresponding to the characteristic information; and a rendering engine module configured to generate a dynamic image corresponding to the response information according to the characteristic data.
In some embodiments, the generation device further includes: a facial model library for storing multiple pieces of characteristic data.
In some embodiments, the characteristic information includes at least one of emotional information or pronunciation information, the characteristic data includes at least one of expression data or mouth shape data, and the semantic engine module performs at least one of the following: determining the expression data according to the emotional information, the expression data being determined from first BlendShape data and first skeleton data corresponding to the emotional information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from second BlendShape data and second skeleton data corresponding to the pronunciation information.
In some embodiments, the skeleton data is determined from initial skeleton data and a weighted sum of multiple skeleton data components.
In some embodiments, the semantic engine module determines an initial weight of the characteristic data according to the characteristic information; the rendering engine module randomly generates actual weights for the corresponding moments of multiple key frames within a value range determined by the initial weight and a threshold, weights the characteristic data with the multiple actual weights to generate the multiple key frames, and generates the dynamic image from the multiple key frames.
In some embodiments, the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
In some embodiments, the rendering engine module smooths the weighted characteristic data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames, and generates the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
In some embodiments, the semantic engine module determines a timestamp of the characteristic data according to the characteristic information; the rendering engine module generates the actual weight for the moment corresponding to the timestamp.
In some embodiments, the semantic engine module uses a state machine therein to determine identification information corresponding to the characteristic data according to the characteristic information; the rendering engine module obtains the characteristic data from a cache pool according to the identification information sent by the state machine.
In some embodiments, during initialization, the rendering engine module reads multiple pieces of characteristic data from a facial model library and loads the multiple pieces of characteristic data into the cache pool in JSON text format.
In some embodiments, when the user initiates voice interaction, the semantic engine module performs semantic analysis and emotional analysis on the user's voice, determines a response text in a question and answer library according to the analysis results, and performs at least one of emotional analysis or phoneme extraction on the response text to determine the characteristic information.
In some embodiments, the BlendShape data is determined from initial BlendShape data and a weighted sum of multiple BlendShape data components.
According to further embodiments of the present disclosure, a device for generating dynamic images is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the dynamic image generation method of any of the above embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, a non-volatile computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the dynamic image generation method of any of the above embodiments is implemented.
附图说明Description of drawings
构成说明书的一部分的附图描述了本公开的实施例,并且连同说明书一起用于解释本公开的原理。The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
参照附图,根据下面的详细描述,可以更加清楚地理解本公开:The present disclosure can be more clearly understood from the following detailed description with reference to the accompanying drawings:
图1示出本公开的动态影像的生成方法的一些实施例的流程图;Figure 1 shows a flow chart of some embodiments of a dynamic image generation method of the present disclosure;
图2示出本公开的动态影像的生成方法的另一些实施例的流程图;Figure 2 shows a flow chart of other embodiments of the dynamic image generation method of the present disclosure;
图3示出本公开的动态影像的生成方法的一些实施例的示意图;Figure 3 shows a schematic diagram of some embodiments of a dynamic image generation method of the present disclosure;
图4示出本公开的动态影像的生成装置的一些实施例的框图;FIG4 is a block diagram showing some embodiments of a device for generating dynamic images of the present disclosure;
图5示出本公开的动态影像的生成装置的另一些实施例的框图;FIG5 is a block diagram showing some other embodiments of the dynamic image generation device of the present disclosure;
图6示出本公开的动态影像的生成装置的又一些实施例的框图。FIG. 6 shows a block diagram of some further embodiments of the dynamic image generating device of the present disclosure.
具体实施方式Detailed Description
现在将参照附图来详细描述本公开的各种示例性实施例。应注意到:除非另外具体说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本公开的范围。Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these examples do not limit the scope of the disclosure unless otherwise specifically stated.
同时,应当明白,为了便于描述,附图中所示出的各个部分的尺寸并不是按照实际的比例关系绘制的。At the same time, it should be understood that, for convenience of description, the dimensions of various parts shown in the drawings are not drawn according to actual proportional relationships.
以下对至少一个示例性实施例的描述实际上仅仅是说明性的,决不作为对本公开及其应用或使用的任何限制。The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application or uses.
对于相关领域普通技术人员已知的技术、方法和设备可能不作详细讨论,但在适当情况下,技术、方法和设备应当被视为说明书的一部分。Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods and devices should be considered a part of the specification.
在这里示出和讨论的所有示例中,任何具体值应被解释为仅仅是示例性的,而不是作为限制。因此,示例性实施例的其它示例可以具有不同的值。In all examples shown and discussed herein, any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步讨论。It should be noted that similar reference numerals and letters refer to similar items in the following figures, so that once an item is defined in one figure, it does not need further discussion in subsequent figures.
如前所述,智能驱动型数字人的驱动方式主要包括下面两种。As mentioned before, the driving methods of intelligent-driven digital humans mainly include the following two types.
一种方式是在渲染引擎中事先内置好若干个BlendShape基本表情动画和基本口型动画；然后，通过语义引擎对用户的语音进行语音识别，判断要回答的文本内容；根据文本生成表情标签和口型标签，输入到渲染引擎进行动画调取和合成。这种方式的数字人驱动效果较为死板、僵硬，而且需要预先制作好大量的面部动画。如果对表情的细节要求高，甚至需要预先制作上百个动画，然后导入到渲染引擎中。如果需要对表情或口型进行扩展，则需要重新手动制作动画和导入动画，人工成本高，系统扩展性低。One approach is to build several basic BlendShape expression animations and basic mouth-shape animations into the rendering engine in advance; the semantic engine then performs speech recognition on the user's voice to determine the text to be answered, and expression tags and mouth-shape tags generated from that text are input to the rendering engine for animation retrieval and synthesis. A digital human driven in this way appears stiff and mechanical, and a large amount of facial animation has to be produced in advance. If fine expression detail is required, hundreds of animations may even need to be pre-produced and imported into the rendering engine. Extending the expressions or mouth shapes requires manually producing and importing animations again, which leads to high labor cost and poor system scalability.
另一种方式不需要渲染引擎中事先导入动画,而是语义引擎直接根据训练得到表情和口型的BlendShape系数,并向渲染引擎发送系数,渲染引擎进行实时接收和驱动。这种方法需要不断占用带宽发送数据,并且对训练得到的数据没有形成标准的表情库,存在大量重复数据,导致带宽资源占用高,数据复用性差,实时性也有所下降。Another method does not require the animation to be imported into the rendering engine in advance. Instead, the semantic engine directly obtains the BlendShape coefficients of expressions and mouth shapes based on training, and sends the coefficients to the rendering engine, which receives and drives them in real time. This method needs to continuously occupy bandwidth to send data, and does not form a standard expression library for the training data. There is a large amount of repeated data, resulting in high bandwidth resource usage, poor data reusability, and reduced real-time performance.
另外,BlendShape是通过使用一系列的顶点位移,使物体得到平顺的变形效果。单一使用BlendShape进行驱动,没有考虑到骨骼对数字人的影响,因此使得数字人驱动精度受到限制,真实感差。In addition, BlendShape uses a series of vertex displacements to achieve a smooth deformation effect on the object. The single use of BlendShape for driving does not take into account the impact of bones on the digital human, so the driving accuracy of the digital human is limited and the sense of reality is poor.
单一使用骨骼进行驱动，需要在面部添加很多骨骼点，并且制作蒙皮。而且，要在细微处频繁调整表情中骨骼的位置，制作过程比较麻烦，并且如果骨骼数量过多，性能也会大量消耗。Driving with bones alone requires adding many bone points to the face and creating skinning. Moreover, the positions of the bones in an expression have to be adjusted frequently at a fine level, which makes the production process cumbersome, and too many bones consume considerable performance.
也就是说,上述的几种方式存在如下的技术问题。In other words, the above methods have the following technical problems.
驱动精度低:单一使用BlendShape或是骨骼进行驱动,使得数字人驱动精度受到限制,导致数字人面部表情僵硬,缺乏动态变化和精准表达,真实感差。Low driving accuracy: The single use of BlendShape or bones for driving limits the driving accuracy of the digital human, resulting in stiff facial expressions, lack of dynamic changes and precise expression, and poor realism.
资源开销大:基于训练的方法需要不断占用带宽发送数据,其中存在大量重复数据;并且基于动画的方法需要存储大量动画资产,导致带宽、硬件资源开销大,性能和实时性都有所下降。High resource overhead: The training-based method needs to continuously occupy bandwidth to send data, which contains a large amount of duplicate data; and the animation-based method needs to store a large number of animation assets, resulting in high bandwidth and hardware resource overhead, and reduced performance and real-time performance.
系统扩展性差：基于动画的方法需要在渲染引擎中事先内置几种基本表情，在后期若要更改，则又要重新手工制作和导入，过程比较繁琐；基于训练的方法也没有对训练数据形成标准的表情库，导致系统扩展性差，人工成本高。Poor system scalability: the animation-based method requires several basic expressions to be built into the rendering engine in advance, and changing them later means manually producing and importing animations again, which is cumbersome; the training-based method also does not organize the training data into a standard expression library, resulting in poor system scalability and high labor cost.
因此,如何精准、动态、高效地驱动数字人面部的动态微表情,为系统提出了更高的技术要求。Therefore, how to accurately, dynamically and efficiently drive dynamic micro-expressions on digital human faces has put forward higher technical requirements for the system.
本公开的发明人发现上述相关技术中存在如下问题：生成的动态影像死板、僵硬，导致动态影像效果差。The inventors of the present disclosure found the following problem in the above related technologies: the generated dynamic images are stiff and mechanical, resulting in a poor dynamic image effect.
鉴于此,本公开提出了一种动态影像的生成技术方案,能够提高动态影像效果。In view of this, the present disclosure proposes a technical solution for generating dynamic images, which can improve the effect of dynamic images.
针对上述技术问题，本公开提出了一种基于随机权重的数字人动态微表情驱动技术方案。本公开的随机权重计算方法，基于状态机和阈值随机计算面部关键帧权重，并采用最小二乘法对关键帧进行平滑处理，实现微表情的动态表达；本公开的骨骼修正方法，基于BlendShape和骨骼加权实现表情精准驱动；本公开构建了面部模型库以存储基本表情，设置了模型ID以唯一标识不同表情和口型，基于JSON文本，实现了轻量化导入缓存池进行数据缓存，实现了数据的高效复用和扩展。In view of the above technical problems, the present disclosure proposes a random-weight-based technical solution for driving dynamic micro-expressions of a digital human. The random weight calculation method of the present disclosure randomly computes facial key-frame weights based on a state machine and a threshold, and smooths the key frames with the least squares method to achieve dynamic expression of micro-expressions; the bone correction method of the present disclosure achieves precise expression driving based on weighted BlendShape and skeleton data; the present disclosure builds a facial model library to store basic expressions, sets model IDs to uniquely identify different expressions and mouth shapes, and performs lightweight import into a cache pool for data caching based on JSON text, achieving efficient reuse and extension of the data.
例如,可以通过如下的实施例实现本公开的技术方案。For example, the technical solution of the present disclosure can be implemented through the following embodiments.
图1示出本公开的动态影像的生成方法的一些实施例的流程图。FIG. 1 shows a flowchart of some embodiments of the dynamic image generation method of the present disclosure.
如图1所示,在步骤110中,根据用户语音,确定回应信息对应的特征信息。As shown in Figure 1, in step 110, characteristic information corresponding to the response information is determined based on the user's voice.
在一些实施例中,在用户发起语音交互的情况下,对用户语音进行语义分析和情感分析;根据分析结果,在问答库中确定回应文本;对回应文本进行情感分析或音素提取中的至少一项处理,确定特征信息。In some embodiments, when the user initiates voice interaction, semantic analysis and emotional analysis are performed on the user's voice; a response text is determined in the question and answer library according to the analysis results; and at least one of emotional analysis or phoneme extraction is performed on the response text. item processing to determine feature information.
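Purely as an illustration of this control flow, and not the disclosure's actual speech or NLP implementation, a minimal sketch with placeholder analysis functions might look like the following; the question-and-answer lookup and the stand-in sentiment and phoneme steps are assumptions.

```python
# Illustrative only: placeholder logic standing in for real speech recognition,
# sentiment analysis, and phoneme extraction.
def build_feature_info(user_text, qa_library):
    response_text = qa_library.get(user_text, "抱歉，我没有听清。")   # look up the answer text
    emotion_info = "smile" if "欢迎" in response_text else "neutral"   # stand-in sentiment analysis
    pronunciation_info = list(response_text)                           # stand-in phoneme extraction
    return {"emotion": emotion_info, "phonemes": pronunciation_info, "text": response_text}

qa_library = {"你好": "你好，欢迎使用数字人服务。"}
feature_info = build_feature_info("你好", qa_library)
```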
在步骤120中,根据特征信息,确定回应信息对应的特征数据,特征数据根据特征信息对应的BlendShape数据和骨骼数据确定。In step 120, the characteristic data corresponding to the response information is determined based on the characteristic information, and the characteristic data is determined based on the BlendShape data and bone data corresponding to the characteristic information.
在一些实施例中,在初始化的过程中,利用渲染引擎,从面部模型库中读取多个特征数据;利用渲染引擎,以JSON文本格式,将多个特征数据加载到缓存池。In some embodiments, during the initialization process, a rendering engine is used to read multiple feature data from a facial model library; and the rendering engine is used to load the multiple feature data into a cache pool in a JSON text format.
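A minimal sketch of this initialization step, assuming a hypothetical JSON file layout for the facial model library; the file name and the keys used for indexing are illustrative, not defined by the disclosure.

```python
import json

def load_cache_pool(path="face_model_library.json"):
    # Read the facial model library as JSON text and index it by model ID,
    # so the rendering engine can later fetch entries sent by the state machine.
    with open(path, "r", encoding="utf-8") as f:
        library = json.load(f)
    return {
        "emotion": {entry["EmoID"]: entry for entry in library.get("emotions", [])},
        "lip": {entry["LipID"]: entry for entry in library.get("lips", [])},
    }
```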
例如,面部模型库能够对BlendShape进行骨骼修正处理,从而基于BlendShape和骨骼共同实现表情精准驱动。面部模型库负责还可以存储基本表情的文本数据、用于唯一标识不同表情和口型(即特征数据)的模型ID(即标识信息),从而实现数据的高效读取和复用。For example, the facial model library can perform bone correction processing on BlendShape, thereby achieving precise expression driving based on BlendShape and bones. The facial model library can also store text data of basic expressions and model IDs (i.e., identification information) used to uniquely identify different expressions and mouth shapes (i.e., feature data), thereby achieving efficient reading and reuse of data.
在一些实施例中,利用语义引擎中的状态机,根据特征信息,确定特征数据对应的标识信息;利用渲染引擎,根据状态机发送的标识信息,从缓存池中获取特征数据。In some embodiments, the state machine in the semantic engine is used to determine the identification information corresponding to the characteristic data according to the characteristic information; the rendering engine is used to obtain the characteristic data from the cache pool according to the identification information sent by the state machine.
例如,面部模型库包括口型数据库,口型数据库中的口型数据LIP的数据结构为[LipID,BlendShapeL,SkeletonL]。For example, the facial model library includes a mouth shape database, and the data structure of the mouth shape data LIP in the mouth shape database is [LipID, BlendShape L , Skeleton L ].
LipID表示口型数据的口型ID,多个音素可以有同样的口型,即同一个口型ID。例如,音素“o”和“ao”口型类似,可以对应同一个LipID,从而在精准驱动的基础上,减小数据量。LipID represents the mouth shape ID of the mouth shape data. Multiple phonemes can have the same mouth shape, that is, the same mouth shape ID. For example, the phonemes "o" and "ao" have similar mouth shapes and can correspond to the same LipID, thereby reducing the amount of data based on precise driving.
BlendShapeL表示该口型数据对应的一组BlendShape系数(即第二BlendShape数据),SkeletonL表示口型数据对应的面部骨骼系数(即第二骨骼数据)。BlendShape L represents a set of BlendShape coefficients corresponding to the mouth shape data (i.e., the second BlendShape data), and Skeleton L represents the facial bone coefficients corresponding to the mouth shape data (i.e., the second skeletal data).
例如,面部模型库包括表情数据库,表情数据库中的表情数据Emotion的数据结构为[EmoID,BlendShapeE,SkeletonE]。 For example, the facial model library includes an expression database, and the data structure of the expression data Emotion in the expression database is [EmoID, BlendShape E , Skeleton E ].
EmoID表示表情数据的表情ID。例如,EmoID=0表示微笑,EmoID=1表示大笑,EmoID=2表示忧伤,EmoID=3表示恐惧,EmoID=4表示愤怒等,支持扩展。EmoID represents the emoticon ID of emoticon data. For example, EmoID=0 means smiling, EmoID=1 means laughing, EmoID=2 means sadness, EmoID=3 means fear, EmoID=4 means anger, etc., and supports expansion.
BlendShapeE表示表情数据对应的一组BlendShape系数(即第一BlendShape数据),SkeletonE表示表情数据对应的面部骨骼系数(即第一骨骼数据)。BlendShape E represents a set of BlendShape coefficients corresponding to the expression data (i.e., the first BlendShape data), and Skeleton E represents the facial skeleton coefficients corresponding to the expression data (i.e., the first skeletal data).
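As a sketch, the two record types described above could be represented as follows; the Python field names simply mirror [LipID, BlendShapeL, SkeletonL] and [EmoID, BlendShapeE, SkeletonE], and the list-of-float encoding of the coefficients is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LipEntry:
    lip_id: int                                               # LipID; "o" and "ao" may share one ID
    blendshape_l: List[float] = field(default_factory=list)   # second BlendShape data (coefficients)
    skeleton_l: List[float] = field(default_factory=list)     # second skeletal data (bone coefficients)

@dataclass
class EmotionEntry:
    emo_id: int                                               # EmoID, e.g. 0 smile, 1 laugh, ... (extensible)
    blendshape_e: List[float] = field(default_factory=list)   # first BlendShape data (coefficients)
    skeleton_e: List[float] = field(default_factory=list)     # first skeletal data (bone coefficients)
```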
在一些实施例中,BlendShape数据根据初始BlendShape数据和多个BlendShape数据分量的加权和确定。In some embodiments, the BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
例如,BlendShapeE数据包括一组组成整体表情基准(或表情分量),某一时刻下的人脸表情e的BlendShapeE数据为这组表情分量的线性加权:
BlendShapeE=BE×dbE+bbE
For example, BlendShape E data includes a set of overall expression benchmarks (or expression components). The BlendShape E data of human facial expression e at a certain moment is a linear weighting of this set of expression components:
BlendShape E =B E ×d bE +b bE
BE是一组表情基准,dbE是对应的权重系数,bbE是初始表情(如区别于负性表情和正性表情的中性表情)。B E is a set of expression benchmarks, d bE is the corresponding weight coefficient, and b bE is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions).
在一些实施例中,骨骼数据根据初始骨骼数据和多个骨骼数据分量的加权和确定。In some embodiments, the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
例如,随着BlendShapeE数据的变化,数字人的骨骼数据也应该相应变化,如在数字人说话时,下巴和脸部相关的骨骼点也发生位移。因此需要对BlendShapeE数据进行骨骼修正处理,从而使驱动效果更加精准和真实。骨骼修正处理后的人脸表情e为:
e=BlendShapeE+SkeletonE=(BE×dbE+bbE)+(SE×dSE+bSE)
For example, as the BlendShape E data changes, the digital human's skeletal data should also change accordingly. For example, when the digital human speaks, the skeletal points related to the jaw and face also shift. Therefore, it is necessary to perform bone correction processing on the BlendShape E data to make the driving effect more accurate and realistic. The facial expression e after skeleton correction is:
e=BlendShape E +Skeleton E =(B E ×d bE +b bE )+(S E ×d SE +b SE )
SE是一组骨骼基准(或骨骼分量),dSE是对应的骨骼系数(即权重系数),bSE是初始骨骼(如区别于负性骨骼和正性骨骼的中性骨骼)。骨骼系数表示表情从中性骨骼变化到目标骨骼的一组骨骼分量的线性混合系数。S E is a set of bone benchmarks (or bone components), d SE is the corresponding bone coefficient (i.e., weight coefficient), and b SE is the initial bone (such as the neutral bone that is different from negative bones and positive bones). The bone coefficient represents the linear blending coefficient of a set of bone components that changes the expression from a neutral bone to a target bone.
例如,骨骼修正处理后的口型l为:
l=BlendShapeL+SkeletonL=(BL×dbL+bbL)+(SL×dSL+bSL)
For example, the mouth shape l after bone correction is:
l=BlendShape L +Skeleton L =(B L ×d bL +b bL )+(S L ×d SL +b SL )
BL是一组表情基准,dbL是对应的权重系数,bbL是初始表情(如区别于负性表情和正性表情的中性表情),SL是一组骨骼基准(或骨骼分量),dSL是对应的骨骼系数(即权重系数),bSL是初始骨骼(如区别于负性骨骼和正性骨骼的中性骨骼)。骨骼系数表示口型从中性骨骼变化到目标骨骼的一组骨骼分量的线性混合系数。B L is a set of expression benchmarks, d bL is the corresponding weight coefficient, b bL is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions), S L is a set of skeleton benchmarks (or bone components), d SL is the corresponding bone coefficient (i.e., weight coefficient), and b SL is the initial bone (such as the neutral bone that is different from negative bones and positive bones). The bone coefficient represents the linear blending coefficient of a set of bone components that changes the mouth shape from a neutral bone to a target bone.
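To make the linear forms above concrete, here is a small numerical sketch of e = (B_E×d_bE + b_bE) + (S_E×d_SE + b_SE) and the analogous mouth-shape term; the dimensions are invented for illustration, and the BlendShape and skeleton terms are kept as separate channel groups of the same pose rather than literally summed.

```python
import numpy as np

def bone_corrected_pose(B, d_b, b_b, S, d_s, b_s):
    """BlendShape term B @ d_b + b_b plus the skeleton correction S @ d_s + b_s,
    returned as the two channel groups that together make up the face pose."""
    return {
        "blendshape": B @ d_b + b_b,   # e.g. BlendShape_E = B_E x d_bE + b_bE
        "skeleton": S @ d_s + b_s,     # e.g. Skeleton_E  = S_E x d_SE + b_SE
    }

# Invented sizes: 52 BlendShape channels from 4 expression components,
# 30 bone parameters from 3 skeleton components.
rng = np.random.default_rng(0)
pose = bone_corrected_pose(
    B=rng.random((52, 4)), d_b=np.array([0.6, 0.2, 0.1, 0.1]), b_b=np.zeros(52),
    S=rng.random((30, 3)), d_s=np.array([0.5, 0.3, 0.2]), b_s=np.zeros(30),
)
```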
在一些实施例中，特征信息包括情绪信息或发音信息中的至少一项，特征数据包括表情数据或口型数据中的至少一项。例如，根据情绪信息，确定表情数据。表情数据根据情绪信息对应的第一BlendShape数据和第一骨骼数据确定。例如，根据发音信息，确定口型数据，口型数据根据发音信息对应的第二BlendShape数据和第二骨骼数据确定。In some embodiments, the characteristic information includes at least one of emotion information or pronunciation information, and the characteristic data includes at least one of expression data or mouth shape data. For example, the expression data is determined based on the emotion information, from the first BlendShape data and the first skeletal data corresponding to the emotion information. For example, the mouth shape data is determined based on the pronunciation information, from the second BlendShape data and the second skeletal data corresponding to the pronunciation information.
例如,语义引擎识别用户语音,输出数字人回答音频;语义引擎的状态机,根据数字人回答文本的情绪,输出对应的表情数据的EmoID和初始权重Weight。由于数字人的情绪不是固定不变的,因此状态机还输出表情数据的TimeStamp(时间戳),以保证不同微表情的变化。状态机根据数字人回答文本的音素,输出对应的口型数据的LipID和初始权重Weight。由于数字人每个字的发音不是相同频率的,因此状态机还输出口型数据的TimeStamp,以保证口型和音频的同步。For example, the semantic engine recognizes the user's voice and outputs the digital human's answer audio; the state machine of the semantic engine outputs the EmoID and initial weight Weight of the corresponding expression data according to the emotion of the digital human's answer text. Since the emotions of the digital human are not fixed, the state machine also outputs the TimeStamp of the expression data to ensure the changes of different micro-expressions. The state machine outputs the LipID and initial weight Weight of the corresponding lip shape data according to the phonemes of the digital human's answer text. Since the pronunciation of each word of the digital human is not at the same frequency, the state machine also outputs the TimeStamp of the lip shape data to ensure the synchronization of the lip shape and audio.
在步骤130中,根据特征数据,生成回应信息对应的动态影像。In step 130, a dynamic image corresponding to the response information is generated based on the characteristic data.
在一些实施例中,对相邻关键帧对应的加权后的特征数据进行平滑处理,以生成相邻关键帧之间的非关键帧;根据关键帧和非关键帧及其时间戳,按照时间顺序生成动态影像。In some embodiments, the weighted feature data corresponding to adjacent key frames is smoothed to generate non-key frames between adjacent key frames; according to the key frames and non-key frames and their timestamps, in chronological order Generate dynamic images.
在一些实施例中，渲染引擎在缓存池缓存面部模型库的数据；根据随机权重计算方法，基于阈值随机计算面部模型的实际权重，实现微表情的动态表达；根据实际权重进行驱动数据的平滑处理，并将表情和口型数据进行融合，实现数字人整体面部的驱动；同步播放语音。In some embodiments, the rendering engine caches the data of the facial model library in a cache pool; according to the random weight calculation method, the actual weight of the facial model is randomly calculated based on a threshold to achieve dynamic expression of micro-expressions; the driving data is smoothed according to the actual weight, and the expression and mouth shape data are fused to drive the digital human's whole face; the voice is played synchronously.
在上述实施例中,利用骨骼数据对BlendShape数据进行修正,能够基于BlendShape数据和骨骼数据共同实现表情的精准驱动,从而提高动态影像的效果。In the above embodiment, the BlendShape data is corrected using the skeletal data, and the precise driving of expressions can be realized based on the BlendShape data and the skeletal data, thereby improving the effect of dynamic images.
例如,可以通过图2中的实施例计算实际权重。For example, the actual weight can be calculated through the embodiment in Figure 2.
图2示出本公开的动态影像的生成方法的另一些实施例的流程图。FIG. 2 shows a flowchart of another embodiment of the dynamic image generation method of the present disclosure.
如图2所示,在步骤210中,根据特征信息,确定特征数据的初始权重。As shown in Figure 2, in step 210, the initial weight of the feature data is determined based on the feature information.
在一些实施例中,以表情数据为例,语义引擎的状态机向渲染引擎发送表情数据的EmoID、初始权重W和时间戳TS;渲染引擎的随机权重计算模块,根据EmoID匹配缓存池中对应的表情数据e,反映数字人的基本情绪:
e=BlendShapeE+SkeletonE=(BE×dbE+bbE)+(SE×dSE+bSE)
In some embodiments, taking expression data as an example, the state machine of the semantic engine sends the EmoID, initial weight W and timestamp TS of the expression data to the rendering engine; the random weight calculation module of the rendering engine matches the corresponding expression data e in the cache pool according to the EmoID to reflect the basic emotions of the digital human:
e=BlendShape E +Skeleton E =( BE × dbE + bbE )+( SE × dSE + bSE )
在步骤220中,在根据初始权重和阈值确定的取值范围内,分别随机生成多个关键帧的对应时刻的实际权重。In step 220, actual weights of corresponding moments of multiple key frames are randomly generated within the value range determined based on the initial weight and the threshold.
例如,取值范围包括大于初始权重与阈值之差且小于初始权重与阈值之和的值。For example, the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
在一些实施例中,根据特征信息,确定特征数据的时间戳;生成时间戳的对应时刻的实际权重。In some embodiments, the timestamp of the feature data is determined based on the feature information; and the actual weight corresponding to the time stamp is generated.
例如，数字人的表情不应是静止僵硬、一成不变的，因此随机权重计算模块配置有阈值T，作为增量反映数字人表情的动态变化范围；随机权重计算模块计算最大权重W+T和最小权重W-T，并每隔时间间隔I生成一个随机数R作为实际权重：For example, the expression of the digital human should not be static and unchanging, so the random weight calculation module is configured with a threshold T, which, as an increment, reflects the dynamic variation range of the digital human's expression; the random weight calculation module computes the maximum weight W+T and the minimum weight W-T, and generates a random number R as the actual weight at every time interval I:
W-T<R<W+T
在步骤230中,利用多个实际权重分别对特征数据进行加权,以生成多个关键帧。In step 230, the feature data is weighted respectively using multiple actual weights to generate multiple key frames.
例如,R的范围在最大权重和最小权重之间,作为新的表情权重对e进行加权,生成关键帧的表情数据:
e(I)=[(BE×dbE+bbE)+(SE×dSE+bSE)]×R
For example, the range of R is between the maximum weight and the minimum weight, and e is weighted as a new expression weight to generate key frame expression data:
e(I)=[(B E ×d bE +b bE )+(S E ×d SE +b SE )]×R
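A small sketch of this keyframe step under the constraint W-T < R < W+T, drawing one actual weight per time interval I and scaling the matched pose; the interval, threshold, and duration values are illustrative, and `pose` is assumed to be a dict of channel arrays such as the one sketched earlier.

```python
import numpy as np

def generate_keyframes(pose, W, T, interval, duration, seed=None):
    """One keyframe per interval: draw R uniformly from (W - T, W + T) and use it
    to weight the pose, as in e(I) = [(B_E x d_bE + b_bE) + (S_E x d_SE + b_SE)] x R."""
    rng = np.random.default_rng(seed)
    keyframes = []
    for t in np.arange(0.0, duration, interval):
        R = rng.uniform(W - T, W + T)                       # actual weight for this keyframe
        keyframes.append((t, {name: channels * R for name, channels in pose.items()}))
    return keyframes
```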
在步骤240中,根据多个关键帧,生成动态影像。In step 240, dynamic images are generated based on multiple key frames.
例如，对于相邻的两个关键帧的表情数据e(I)和e(J)，采用最小二乘法对表情数据进行平滑处理，得到非关键帧的表情数据；将表情和口型数据进行融合，并按照时间戳TS，实现数字人整体面部的动态驱动。For example, for the expression data e(I) and e(J) of two adjacent key frames, the least squares method is used to smooth the expression data to obtain the expression data of the non-key frames; the expression and mouth shape data are then fused, and the dynamic driving of the digital human's whole face is realized according to the timestamp TS.
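One plausible reading of this least-squares smoothing step is a per-channel polynomial fit over the keyframe values, evaluated at the in-between frame times; the polynomial degree and the channel layout are assumptions, not details stated in the disclosure.

```python
import numpy as np

def smooth_inbetweens(key_times, key_values, frame_times, degree=2):
    """key_values has shape (num_keyframes, num_channels); returns values for the
    non-keyframe times by least-squares polynomial fitting per channel."""
    key_times = np.asarray(key_times)
    key_values = np.asarray(key_values)
    deg = min(degree, len(key_times) - 1)
    out = np.empty((len(frame_times), key_values.shape[1]))
    for c in range(key_values.shape[1]):
        coeffs = np.polyfit(key_times, key_values[:, c], deg)   # least-squares fit
        out[:, c] = np.polyval(coeffs, frame_times)
    return out
```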
上述实施例中,随机权重计算方法可以通过随机数实现微表情的动态生成,不需预先制作上百种表情动画,也不需要语义引擎实时发送大量的驱动数据。从而,实现数字人在满足基本情绪的基础上,表情能够随时间在一定范围内发生动态变化,给用户带来真实感,提高动态影像的准确性和真实性。In the above embodiment, the random weight calculation method can realize the dynamic generation of micro-expressions through random numbers. There is no need to pre-produce hundreds of expression animations, and there is no need for the semantic engine to send a large amount of driving data in real time. Therefore, on the basis of satisfying basic emotions, the expressions of digital people can dynamically change within a certain range over time, giving users a sense of reality and improving the accuracy and authenticity of dynamic images.
图3示出本公开的动态影像的生成方法的一些实施例的示意图。FIG. 3 shows a schematic diagram of some embodiments of the dynamic image generation method of the present disclosure.
如图3所示,面部模型库用于实现骨骼修正,能够对BlendShape进行骨骼修正处理,从而基于BlendShape和骨骼共同实现表情精准驱动。面部模型库负责还可以存储基本表情的文本数据、用于唯一标识不同表情和口型(即特征数据)的模型ID(即标识信息),从而实现数据的高效读取和复用。As shown in Figure 3, the facial model library is used to implement bone correction and can perform bone correction processing on BlendShape, thereby achieving precise expression driving based on BlendShape and bones. The facial model library can also store text data of basic expressions and model IDs (i.e., identification information) used to uniquely identify different expressions and mouth shapes (i.e., feature data), thereby achieving efficient reading and reuse of data.
例如,面部模型库的口型数据库设计如下:口型数据LIP的数据结构为[LipID,BlendShapeL,SkeletonL];LipID表示口型数据的口型ID,多个音素可以有同样的口型,即同一个口型ID。例如,音素“o”和“ao”口型类似,可以对应同一个LipID,从而在精准驱动的基础上,减小数据量;BlendShapeL表示该口型数据对应的一组BlendShape系数(即第二BlendShape数据),SkeletonL表示口型数据对应的面部骨骼系数(即第二骨骼数据)。For example, the design of the mouth shape database of the facial model library is as follows: the data structure of the mouth shape data LIP is [LipID, BlendShape L , Skeleton L ]; LipID represents the mouth shape ID of the mouth shape data, and multiple phonemes can have the same mouth shape. That is, the same mouth shape ID. For example, the phonemes "o" and "ao" have similar mouth shapes and can correspond to the same LipID, thereby reducing the amount of data based on precise driving; BlendShape L represents a set of BlendShape coefficients corresponding to the mouth shape data (i.e. the second BlendShape data), Skeleton L represents the facial skeleton coefficient corresponding to the mouth shape data (i.e. the second skeleton data).
例如，面部模型库的表情数据库设计如下：表情数据Emotion的数据结构为[EmoID, BlendShapeE, SkeletonE]；EmoID表示表情数据的表情ID。例如，EmoID=0表示微笑，EmoID=1表示大笑，EmoID=2表示忧伤，EmoID=3表示恐惧，EmoID=4表示愤怒等，支持扩展；BlendShapeE表示表情数据对应的一组BlendShape系数（即第一BlendShape数据），SkeletonE表示表情数据对应的面部骨骼系数（即第一骨骼数据）。For example, the expression database of the facial model library is designed as follows: the data structure of the expression data Emotion is [EmoID, BlendShapeE, SkeletonE]; EmoID represents the expression ID of the expression data. For example, EmoID=0 means smiling, EmoID=1 means laughing, EmoID=2 means sadness, EmoID=3 means fear, EmoID=4 means anger, and so on, and extension is supported; BlendShapeE represents a set of BlendShape coefficients corresponding to the expression data (i.e., the first BlendShape data), and SkeletonE represents the facial skeleton coefficients corresponding to the expression data (i.e., the first skeletal data).
BlendShapeE=BE×dbE+bbE
For example, BlendShape E data includes a set of overall expression benchmarks (or expression components), and the BlendShape E data of the facial expression e at a certain moment is the linear weighting of this set of expression components:
BlendShape E = B E × d bE + b bE
BE是一组表情基准,dbE是对应的权重系数,bbE是初始表情(如区别于负性表情和正性表情的中性表情)。B E is a set of expression benchmarks, d bE is the corresponding weight coefficient, and b bE is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions).
例如,随着BlendShapeE数据的变化,数字人的骨骼数据也应该相应变化,如在数字人说话时,下巴和脸部相关的骨骼点也发生位移。因此需要对BlendShapeE数据进行骨骼修正处理,从而使驱动效果更加精准和真实。骨骼修正处理后的人脸表情e为:
e=BlendShapeE+SkeletonE=(BE×dbE+bbE)+(SE×dSE+bSE)
For example, as the BlendShape E data changes, the digital human's skeletal data should also change accordingly. For example, when the digital human speaks, the skeletal points related to the jaw and face also shift. Therefore, it is necessary to perform bone correction processing on the BlendShape E data to make the driving effect more accurate and realistic. The facial expression e after skeleton correction is:
e=BlendShape E +Skeleton E =(B E ×d bE +b bE )+(S E ×d SE +b SE )
SE是一组骨骼基准(或骨骼分量),dSE是对应的骨骼系数(即权重系数),bSE是初始骨骼(如区别于负性骨骼和正性骨骼的中性骨骼)。骨骼系数表示表情从中性骨骼变化到目标骨骼的一组骨骼分量的线性混合系数。S E is a set of bone benchmarks (or bone components), d SE is the corresponding bone coefficient (i.e., weight coefficient), and b SE is the initial bone (such as the neutral bone that is different from negative bones and positive bones). The bone coefficient represents the linear blending coefficient of a set of bone components that changes the expression from a neutral bone to a target bone.
例如,骨骼修正处理后的口型l为:
l=BlendShapeL+SkeletonL=(BL×dbL+bbL)+(SL×dSL+bSL)
For example, the mouth shape l after bone correction is:
l=BlendShape L +Skeleton L =(B L ×d bL +b bL )+(S L ×d SL +b SL )
BL是一组表情基准,dbL是对应的权重系数,bbL是初始表情(如区别于负性表情和正性表情的中性表情),SL是一组骨骼基准(或骨骼分量),dSL是对应的骨骼系数(即权重系数),bSL是初始骨骼(如区别于负性骨骼和正性骨骼的中性骨骼)。骨骼系数表示口型从中性骨骼变化到目标骨骼的一组骨骼分量的线性混合系数。B L is a set of expression benchmarks, d bL is the corresponding weight coefficient, b bL is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions), S L is a set of skeleton benchmarks (or bone components), d SL is the corresponding bone coefficient (i.e., weight coefficient), and b SL is the initial bone (such as the neutral bone that is different from negative bones and positive bones). The bone coefficient represents the linear blending coefficient of a set of bone components that changes the mouth shape from a neutral bone to a target bone.
例如,语义引擎识别用户语音,输出数字人回答音频;语义引擎的状态机,根据数字人回答文本的情绪,输出对应的表情数据的EmoID和初始权重Weight。由于数字人的情绪不是固定不变的,因此状态机还输出表情数据的TimeStamp(时间戳),以保证不同微表情的变化。状态机根据数字人回答文本的音素,输出对应的口型数据的LipID和初始权重Weight。由于数字人每个字的发音不是相同频率的,因此状态机还输出口型数据的TimeStamp,以保证口型和音频的同步。For example, the semantic engine recognizes the user's voice and outputs the audio of the digital person's answer; the state machine of the semantic engine outputs the EmoID and initial weight of the corresponding expression data based on the emotion of the digital person's answer text. Since the emotions of digital people are not fixed, the state machine also outputs the TimeStamp of the expression data to ensure changes in different micro-expressions. The state machine outputs the LipID and initial weight of the corresponding mouth shape data based on the phonemes of the text answered by the digital person. Since the pronunciation of each word of the digital person is not at the same frequency, the state machine also outputs the TimeStamp of the mouth shape data to ensure the synchronization of the mouth shape and audio.
在一些实施例中，渲染引擎在缓存池缓存面部模型库的数据；根据随机权重计算方法，基于阈值随机计算面部模型的实际权重，实现微表情的动态表达；根据实际权重进行驱动数据的平滑处理，并将表情和口型数据进行融合，实现数字人整体面部的驱动；同步播放语音。In some embodiments, the rendering engine caches the data of the facial model library in a cache pool; according to the random weight calculation method, the actual weight of the facial model is randomly calculated based on a threshold to achieve dynamic expression of micro-expressions; the driving data is smoothed according to the actual weight, and the expression and mouth shape data are fused to drive the digital human's whole face; the voice is played synchronously.
在一些实施例中,以表情数据为例,语义引擎的状态机向渲染引擎发送表情数据的EmoID、初始权重W和时间戳TS;渲染引擎的随机权重计算模块,根据EmoID匹配缓存池中对应的表情数据e,反映数字人的基本情绪:
e=BlendShapeE+SkeletonE=(BE×dbE+bbE)+(SE×dSE+bSE)
In some embodiments, taking expression data as an example, the state machine of the semantic engine sends the EmoID, initial weight W and timestamp TS of the expression data to the rendering engine; the random weight calculation module of the rendering engine matches the corresponding data in the cache pool based on the EmoID. Expression data e reflects the basic emotions of digital people:
e=BlendShape E +Skeleton E =(B E ×d bE +b bE )+( SE ×d SE +b SE )
例如,数字人的表情不应是静止僵硬、一成不变的,因此随机权重计算模块配置有阈值T,作为增量反映数字人表情的动态变化范围;随机权重计算模块计算最大权重W+T和最小权重W-T,并每隔时间间隔I生成一个随机数R作为实际权重:
W-T<R<W+T
For example, the expression of the digital human should not be static, stiff, and unchanging, so the random weight calculation module is configured with a threshold T, which reflects the dynamic range of the digital human expression as an increment; the random weight calculation module calculates the maximum weight W+T and the minimum weight WT, and generate a random number R every time interval I as the actual weight:
W-T<R<W+T
例如,R的范围在最大权重和最小权重之间,作为新的表情权重对e进行加权,生成关键帧的表情数据:
e(I)=[(BE×dbE+bbE)+(SE×dSE+bSE)]×R
For example, the range of R is between the maximum weight and the minimum weight, and e is weighted as a new expression weight to generate key frame expression data:
e(I)=[(B E ×d bE +b bE )+(S E ×d SE +b SE )]×R
例如，对于相邻的两个关键帧的表情数据e(I)和e(J)，采用最小二乘法对表情数据进行平滑处理，得到非关键帧的表情数据；将表情和口型数据进行融合，并按照时间戳TS，实现数字人整体面部的动态驱动。For example, for the expression data e(I) and e(J) of two adjacent key frames, the least squares method is used to smooth the expression data to obtain the expression data of the non-key frames; the expression and mouth shape data are then fused, and the dynamic driving of the digital human's whole face is realized according to the timestamp TS.
在一些实施例中,数字人智能驱动交互流程如下。In some embodiments, the digital human intelligence drives the interaction process as follows.
在步骤1中,设置基本表情、基本口型的BlendShape数据模型。每个基本数据模型由模型ID唯一标识。In step 1, set the BlendShape data model of basic expressions and basic mouth shapes. Each base data model is uniquely identified by a model ID.
在步骤2中,执行骨骼修正处理,向数据模型中增加骨骼点的系数,形成修正后的面部模型文本。In step 2, a bone correction process is performed, and the coefficients of bone points are added to the data model to form a corrected facial model text.
在步骤3中,渲染引擎初始化时读取面部模型库,将数据以JSON文本格式加载至缓存池缓存。In step 3, the rendering engine reads the facial model library when it is initialized, and loads the data into the cache pool in JSON text format.
在步骤4中,当用户向数字人发起语音交互时,语音识别模块进行用户语义及情感分析。In step 4, when the user initiates voice interaction with the digital human, the speech recognition module performs user semantic and emotional analysis.
在步骤5中,回答文本模块存有智能问答库,根据用户语义和情感得到对应的回答文本。In step 5, the answer text module stores an intelligent question and answer library and obtains corresponding answer text based on user semantics and emotions.
在步骤6中,自然语言处理模块对回答文本进行情感分析和音素提取。In step 6, the natural language processing module performs sentiment analysis and phoneme extraction on the answer text.
在步骤7中,语音合成模块将分词后的回答文本合成为音频数据。In step 7, the speech synthesis module synthesizes the segmented answer text into audio data.
在步骤8中,状态机将表情ID或口型ID、权重、时间戳、音频等数据发送到渲染引擎。 In step 8, the state machine sends expression ID or lip shape ID, weight, timestamp, audio and other data to the rendering engine.
在步骤9中,随机权重计算模块根据ID匹配缓存池中对应的基本模型,基于随机数生成关键帧。In step 9, the random weight calculation module matches the corresponding basic model in the cache pool based on the ID and generates key frames based on random numbers.
在步骤10中,平滑处理模块基于最小二乘法对关键帧进行平滑处理。In step 10, the smoothing module smoothes the key frames based on the least squares method.
在步骤11中,表情融合模块将表情和口型数据进行融合,并按照时间戳实现动态驱动。In step 11, the expression fusion module fuses the expression and mouth shape data and implements dynamic driving according to the timestamp.
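A compressed, self-contained sketch of how steps 8 through 11 could fit together; the cache-pool layout, channel counts, interval, threshold, and random-number usage are all illustrative assumptions rather than the exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Step 8 (assumed message format): the state machine sends an ID, weight, and timestamp.
event = {"EmoID": 0, "Weight": 0.8, "TimeStamp": 0.0}
# Step 9: match the cached basic model by ID and build keyframes with random actual weights.
cache_pool = {"emotion": {0: np.array([0.9, 0.3, 0.1, 0.4])}}        # toy BlendShape+skeleton channels
pose = cache_pool["emotion"][event["EmoID"]]
key_times = np.arange(0.0, 1.0, 0.25)
key_values = np.stack([pose * rng.uniform(event["Weight"] - 0.1, event["Weight"] + 0.1)
                       for _ in key_times])
# Step 10: least-squares smoothing to per-frame values between the keyframes.
frame_times = np.linspace(key_times[0], key_times[-1], 16)
frames = np.stack([np.polyval(np.polyfit(key_times, key_values[:, c], 2), frame_times)
                   for c in range(key_values.shape[1])], axis=1)
# Step 11: the expression track would then be fused with the mouth-shape track by timestamp.
```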
上述实施例中,提出了随机权重计算方法,基于状态机和阈值随机计算面部关键帧权重,并采用最小二乘法对关键帧进行平滑处理,实现微表情的动态表达;提出骨骼修正规则,基于BlendShape和骨骼加权实现表情精准驱动,从而提升数字人真实感和驱动精度。In the above embodiment, a random weight calculation method is proposed, which randomly calculates the weight of facial key frames based on a state machine and a threshold, and uses the least squares method to smooth the key frames to achieve dynamic expression of micro-expressions; a bone correction rule is proposed, based on BlendShape and bone weighting to achieve accurate expression driving, thereby improving the realism and driving accuracy of digital human beings.
利用缓存池存储面部基本数据，利用随机权重计算方法实现动态微表情，不需占用带宽发送大量重复数据，也不需大量存储动画资产，减少带宽、硬件资源的额外开销，提高性能和实时性，降低资源开销。A cache pool is used to store basic facial data, and the random weight calculation method is used to realize dynamic micro-expressions; there is no need to occupy bandwidth sending a large amount of repeated data, nor to store a large number of animation assets, which reduces the extra bandwidth and hardware overhead, improves performance and real-time behavior, and lowers resource consumption.
构建面部模型库存储基本表情,模型ID唯一标识不同表情和口型,基于JSON文本轻量化导入缓存池进行数据缓存,实现数据的高效复用和扩展。不需要在渲染引擎中事先内置基本口型和基本表情,提高系统扩展性,降低人工成本。Build a facial model library to store basic expressions. Model ID uniquely identifies different expressions and mouth shapes. Import data into a cache pool based on JSON text to cache data, and achieve efficient reuse and expansion of data. It is not necessary to build basic mouth shapes and basic expressions into the rendering engine in advance, which improves system scalability and reduces labor costs.
图4示出本公开的动态影像的生成装置的一些实施例的框图。FIG. 4 shows a block diagram of some embodiments of the dynamic image generation device of the present disclosure.
如图4所示，动态影像的生成装置4包括：语义引擎模块41，用于根据用户语音，确定回应信息对应的特征信息，根据特征信息，确定回应信息对应的特征数据，特征数据根据特征信息对应的BlendShape数据和骨骼数据确定；渲染引擎模块42，用于根据特征数据，生成回应信息对应的动态影像。As shown in Figure 4, the dynamic image generation device 4 includes: a semantic engine module 41, configured to determine, based on the user's voice, the characteristic information corresponding to the response information, and to determine, based on the characteristic information, the characteristic data corresponding to the response information, the characteristic data being determined from the BlendShape data and skeletal data corresponding to the characteristic information; and a rendering engine module 42, configured to generate, based on the characteristic data, the dynamic image corresponding to the response information.
在一些实施例中,生成装置4还包括:面部模型库43,用于存储多个特征数据。In some embodiments, the generation device 4 further includes: a facial model library 43 for storing multiple feature data.
在一些实施例中，特征信息包括情绪信息或发音信息中的至少一项，特征数据包括表情数据或口型数据中的至少一项，语义引擎模块41执行下面的至少一项：根据情绪信息，确定表情数据，表情数据根据情绪信息对应的第一BlendShape数据和第一骨骼数据确定；或者，根据发音信息，确定口型数据，口型数据根据发音信息对应的第二BlendShape数据和第二骨骼数据确定。In some embodiments, the characteristic information includes at least one of emotion information or pronunciation information, the characteristic data includes at least one of expression data or mouth shape data, and the semantic engine module 41 performs at least one of the following: determining the expression data based on the emotion information, the expression data being determined from the first BlendShape data and the first skeletal data corresponding to the emotion information; or determining the mouth shape data based on the pronunciation information, the mouth shape data being determined from the second BlendShape data and the second skeletal data corresponding to the pronunciation information.
在一些实施例中,骨骼数据根据初始骨骼数据和多个骨骼数据分量的加权和确定。In some embodiments, the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
在一些实施例中，语义引擎模块41根据特征信息，确定特征数据的初始权重；渲染引擎模块42在根据初始权重和阈值确定的取值范围内，分别随机生成多个关键帧的对应时刻的实际权重，利用多个实际权重分别对特征数据进行加权，以生成多个关键帧；根据多个关键帧，生成动态影像。In some embodiments, the semantic engine module 41 determines the initial weight of the feature data based on the feature information; the rendering engine module 42 randomly generates the actual weights for the corresponding moments of multiple key frames within the value range determined by the initial weight and the threshold, weights the feature data with the multiple actual weights to generate the multiple key frames, and generates the dynamic image based on the multiple key frames.
在一些实施例中，渲染引擎模块42对相邻关键帧对应的加权后的特征数据进行平滑处理，以生成相邻关键帧之间的非关键帧；根据关键帧和非关键帧及其时间戳，按照时间顺序生成动态影像。In some embodiments, the rendering engine module 42 smooths the weighted feature data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames, and generates the dynamic image in chronological order according to the key frames, the non-key frames, and their timestamps.
在一些实施例中,语义引擎模块41根据特征信息,确定特征数据的时间戳;渲染引擎模块生成时间戳的对应时刻的实际权重。In some embodiments, the semantic engine module 41 determines the timestamp of the feature data according to the feature information; the rendering engine module generates the actual weight of the time corresponding to the timestamp.
在一些实施例中,语义引擎模块41利用其中的状态机,根据特征信息,确定特征数据对应的标识信息;渲染引擎模块42根据状态机发送的标识信息,从缓存池中获取特征数据。In some embodiments, the semantic engine module 41 uses the state machine therein to determine the identification information corresponding to the characteristic data based on the characteristic information; the rendering engine module 42 obtains the characteristic data from the cache pool according to the identification information sent by the state machine.
在一些实施例中,在初始化的过程中,渲染引擎模块42从面部模型库中读取多个特征数据,以JSON文本格式,将多个特征数据加载到缓存池。In some embodiments, during the initialization process, the rendering engine module 42 reads multiple feature data from the facial model library and loads the multiple feature data into the cache pool in JSON text format.
在一些实施例中，语义引擎模块41在用户发起语音交互的情况下，对用户语音进行语义分析和情感分析；根据分析结果，在问答库中确定回应文本，对回应文本进行情感分析或音素提取中的至少一项处理，确定特征信息。In some embodiments, when the user initiates voice interaction, the semantic engine module 41 performs semantic analysis and sentiment analysis on the user's voice; based on the analysis results, it determines the response text in the question-and-answer library and performs at least one of sentiment analysis or phoneme extraction on the response text to determine the characteristic information.
在一些实施例中,BlendShape数据根据初始BlendShape数据和多个BlendShape数据分量的加权和确定。In some embodiments, the BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
图5示出本公开的动态影像的生成装置的另一些实施例的框图。FIG. 5 shows a block diagram of another embodiment of the dynamic image generating device of the present disclosure.
如图5所示，该实施例的装置5包括：存储器51以及耦接至该存储器51的处理器52，处理器52被配置为基于存储在存储器51中的指令，执行本公开中任意一个实施例中的动态影像的生成方法。As shown in Figure 5, the device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute, based on instructions stored in the memory 51, the dynamic image generation method of any one of the embodiments of the present disclosure.
其中,存储器51例如可以包括系统存储器、固定非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序Boot Loader、数据库以及其他程序等。The memory 51 may include, for example, a system memory, a fixed non-volatile storage medium, etc. The system memory may store, for example, an operating system, an application program, a boot loader, a database, and other programs.
图6示出本公开的动态影像的生成装置的又一些实施例的框图。FIG. 6 shows a block diagram of some further embodiments of the dynamic image generating device of the present disclosure.
如图6所示，该实施例的动态影像的生成装置6包括：存储器610以及耦接至该存储器610的处理器620，处理器620被配置为基于存储在存储器610中的指令，执行前述任意一个实施例中的动态影像的生成方法。As shown in Figure 6, the dynamic image generation device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to execute, based on instructions stored in the memory 610, the dynamic image generation method of any one of the foregoing embodiments.
存储器610例如可以包括系统存储器、固定非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序Boot Loader以及其他程序等。The memory 610 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
动态影像的生成装置6还可以包括输入输出接口630、网络接口640、存储接口650等。这些接口630、640、650以及存储器610和处理器620之间例如可以通过总线660连接。其中,输入输出接口630为显示器、鼠标、键盘、触摸屏、麦克、音箱等输入输出设备提供连接接口。网络接口640为各种联网设备提供连接接口。存储接口650为SD卡、U盘等外置存储设备提供连接接口。The moving image generating device 6 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610 and the processor 620 may be connected through a bus 660, for example. Among them, the input and output interface 630 provides connection interfaces for input and output devices such as monitors, mice, keyboards, touch screens, microphones, and speakers. Network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB disks.
本领域内的技术人员应当明白,本公开的实施例可提供为方法、系统、或计算机程序产品。因此,本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本公开可采用在一个或多个其中包含有计算机可用程序代码的计算机可用非瞬时性存储介质包括但不限于磁盘存储器、CD-ROM、光学存储器等上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present disclosure may be provided as methods, systems, or computer program products. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media including, but not limited to, disk memory, CD-ROM, optical storage, and the like having computer-usable program code embodied therein.
至此,已经详细描述了根据本公开的动态影像的生成方法、动态影像的生成装置和非易失性计算机可读存储介质。为了避免遮蔽本公开的构思,没有描述本领域所公知的一些细节。本领域技术人员根据上面的描述,完全可以明白如何实施这里公开的技术方案。So far, the method for generating a dynamic image, the device for generating a dynamic image, and the non-volatile computer-readable storage medium according to the present disclosure have been described in detail. To avoid obscuring the concepts of the present disclosure, some details that are well known in the art have not been described. Based on the above description, those skilled in the art can completely understand how to implement the technical solution disclosed here.
可能以许多方式来实现本公开的方法和系统。例如,可通过软件、硬件、固件或者软件、硬件、固件的任何组合来实现本公开的方法和系统。用于方法的步骤的上述顺序仅是为了进行说明,本公开的方法的步骤不限于以上具体描述的顺序,除非以其它方式特别说明。此外,在一些实施例中,还可将本公开实施为记录在记录介质中的程序,这些程序包括用于实现根据本公开的方法的机器可读指令。因而,本公开还覆盖存储用于执行根据本公开的方法的程序的记录介质。The methods and systems of the present disclosure may be implemented in many ways. For example, the methods and systems of the present disclosure may be implemented through software, hardware, firmware, or any combination of software, hardware, and firmware. The above order for the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. Furthermore, in some embodiments, the present disclosure can also be implemented as programs recorded in recording media, and these programs include machine-readable instructions for implementing methods according to the present disclosure. Thus, the present disclosure also covers recording media storing programs for executing methods according to the present disclosure.
虽然已经通过示例对本公开的一些特定实施例进行了详细说明,但是本领域的技术人员应该理解,以上示例仅是为了进行说明,而不是为了限制本公开的范围。本领域的技术人员应该理解,可在不脱离本公开的范围和精神的情况下,对以上实施例进行修改。本公开的范围由所附权利要求来限定。 Although some specific embodiments of the present disclosure have been described in detail through examples, those skilled in the art will understand that the above examples are for illustration only and are not intended to limit the scope of the disclosure. It should be understood by those skilled in the art that the above embodiments may be modified without departing from the scope and spirit of the present disclosure. The scope of the disclosure is defined by the appended claims.

Claims (16)

  1. 一种动态影像的生成方法,包括:A method for generating dynamic images, including:
    根据用户语音,确定回应信息对应的特征信息;Determine the characteristic information corresponding to the response information based on the user's voice;
    根据所述特征信息,确定所述回应信息对应的特征数据,所述特征数据根据所述特征信息对应的混合形状BlendShape数据和骨骼数据确定;Determine the characteristic data corresponding to the response information according to the characteristic information, and determine the characteristic data according to the BlendShape data and bone data corresponding to the characteristic information;
    根据所述特征数据,生成所述回应信息对应的动态影像。According to the characteristic data, a dynamic image corresponding to the response information is generated.
  2. 根据权利要求1所述的生成方法,其中,所述特征信息包括情绪信息或发音信息中的至少一项,所述特征数据包括表情数据或口型数据中的至少一项,The generation method according to claim 1, wherein the feature information includes at least one of emotion information or pronunciation information, and the feature data includes at least one of expression data or mouth shape data,
    所述根据所述特征信息,确定所述回应信息对应的特征数据包括下面的至少一项:According to the characteristic information, it is determined that the characteristic data corresponding to the response information includes at least one of the following:
    根据所述情绪信息,确定所述表情数据,所述表情数据根据所述情绪信息对应的第一BlendShape数据和第一骨骼数据确定;或者Determine the expression data according to the emotion information, and determine the expression data according to the first BlendShape data and the first skeleton data corresponding to the emotion information; or
    根据所述发音信息,确定所述口型数据,所述口型数据根据所述发音信息对应的第二BlendShape数据和第二骨骼数据确定。The mouth shape data is determined based on the pronunciation information, and the mouth shape data is determined based on the second BlendShape data and the second skeleton data corresponding to the pronunciation information.
  3. 根据权利要求1或2所述的生成方法,其中,所述骨骼数据根据初始骨骼数据和多个骨骼数据分量的加权和确定。The generating method according to claim 1 or 2, wherein the skeletal data is determined based on the weighted sum of initial skeletal data and a plurality of skeletal data components.
  4. 根据权利要求1-3任一项所述的生成方法,还包括:The generation method according to any one of claims 1-3, further comprising:
    根据所述特征信息,确定所述特征数据的初始权重;Determine the initial weight of the feature data according to the feature information;
    其中,所述根据所述特征数据,生成所述回应信息对应的动态影像包括:Wherein, generating a dynamic image corresponding to the response information according to the characteristic data includes:
    在根据所述初始权重和阈值确定的取值范围内,分别随机生成多个关键帧的对应时刻的实际权重;Within the value range determined according to the initial weight and threshold, the actual weights of the corresponding moments of multiple key frames are randomly generated;
    利用多个实际权重分别对所述特征数据进行加权,以生成所述多个关键帧;Using multiple actual weights to weight the feature data respectively to generate the multiple key frames;
    根据所述多个关键帧,生成所述动态影像。The dynamic image is generated according to the plurality of key frames.
  5. 根据权利要求4所述的生成方法,其中,取值范围包括大于所述初始权重与所述阈值之差且小于所述初始权重与所述阈值之和的值。 The generation method according to claim 4, wherein the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  6. 根据权利要求4或5所述的生成方法,其中,所述根据所述多个关键帧,生成所述动态影像包括:The generation method according to claim 4 or 5, wherein generating the dynamic image according to the plurality of key frames includes:
    对相邻关键帧对应的加权后的特征数据进行平滑处理,以生成所述相邻关键帧之间的非关键帧;Smoothing the weighted feature data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames;
    根据关键帧和非关键帧及其时间戳,按照时间顺序生成所述动态影像。The dynamic image is generated in chronological order according to key frames and non-key frames and their timestamps.
  7. 根据权利要求4-6任一项所述的生成方法,还包括:The generation method according to any one of claims 4-6, further comprising:
    根据所述特征信息,确定所述特征数据的时间戳;Determine the timestamp of the characteristic data according to the characteristic information;
    其中，所述在根据所述初始权重和阈值确定的取值范围内，分别随机生成多个关键帧的对应时刻的实际权重包括：wherein the randomly generating, within the value range determined according to the initial weight and the threshold, the actual weights at the corresponding moments of the multiple key frames comprises:
    生成所述时间戳的对应时刻的实际权重。The actual weight of the corresponding moment of the timestamp is generated.
  8. 根据权利要求1-7任一项所述的生成方法,其中,所述根据所述特征信息,确定所述回应信息对应的特征数据包括:The generation method according to any one of claims 1 to 7, wherein determining, based on the characteristic information, characteristic data corresponding to the response information comprises:
    利用语义引擎中的状态机,根据所述特征信息,确定所述特征数据对应的标识信息;Utilize the state machine in the semantic engine to determine the identification information corresponding to the characteristic data based on the characteristic information;
    利用渲染引擎,根据状态机发送的所述标识信息,从缓存池中获取所述特征数据。The rendering engine is used to obtain the feature data from the cache pool according to the identification information sent by the state machine.
  9. 根据权利要求8所述的生成方法,还包括:The generation method according to claim 8, further comprising:
    在初始化的过程中,利用所述渲染引擎,从面部模型库中读取多个特征数据;During the initialization process, the rendering engine is used to read multiple feature data from the facial model library;
    利用所述渲染引擎,以JSON文本格式,将所述多个特征数据加载到所述缓存池。The rendering engine is used to load the plurality of feature data into the cache pool in JSON text format.
  10. 根据权利要求1-9任一项所述的生成方法,其中,根据用户语音,确定回应信息对应的特征信息包括:The generation method according to any one of claims 1-9, wherein determining the characteristic information corresponding to the response information according to the user's voice includes:
    在用户发起语音交互的情况下,对所述用户语音进行语义分析和情感分析;When the user initiates voice interaction, perform semantic analysis and emotional analysis on the user's voice;
    根据分析结果,在问答库中确定回应文本;Based on the analysis results, determine the response text in the question and answer database;
    对所述回应文本进行情感分析或音素提取中的至少一项处理,确定所述特征信息。Perform at least one of sentiment analysis and phoneme extraction on the response text to determine the feature information.
  11. 根据权利要求1-10任一项所述的生成方法,其中,所述BlendShape数据根据初始BlendShape数据和多个BlendShape数据分量的加权和确定。 The generation method according to any one of claims 1 to 10, wherein the BlendShape data is determined based on a weighted sum of initial BlendShape data and multiple BlendShape data components.
  12. 一种动态影像的生成装置,包括:A device for generating dynamic images, including:
    语义引擎模块，用于根据用户语音，确定回应信息对应的特征信息，根据所述特征信息，确定所述回应信息对应的特征数据，所述特征数据根据所述特征信息对应的混合形状BlendShape数据和骨骼数据确定；a semantic engine module, configured to determine, based on a user's voice, characteristic information corresponding to response information, and to determine, based on the characteristic information, characteristic data corresponding to the response information, the characteristic data being determined from blend shape (BlendShape) data and skeletal data corresponding to the characteristic information;
    渲染引擎模块,用于根据所述特征数据,生成所述回应信息对应的动态影像。A rendering engine module is used to generate dynamic images corresponding to the response information according to the characteristic data.
  13. 根据权利要求12所述的生成装置,还包括:The generating device according to claim 12, further comprising:
    面部模型库,用于存储多个特征数据。Facial model library for storing multiple feature data.
  14. 一种动态影像的生成装置,包括:A dynamic image generating device, comprising:
    存储器;和memory; and
    耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器中的指令,执行权利要求1-11任一项所述的动态影像的生成方法。A processor coupled to the memory, the processor being configured to execute the dynamic image generation method according to any one of claims 1-11 based on instructions stored in the memory.
  15. 一种非易失性计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现权利要求1-11任一项所述的动态影像的生成方法。A non-volatile computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the dynamic image generation method described in any one of claims 1-11 is implemented.
  16. 一种计算机程序，包括：A computer program, comprising:
    指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1-11任一项所述的动态影像的生成方法。 Instructions, which when executed by a processor, cause the processor to execute the dynamic image generation method according to any one of claims 1-11.
PCT/CN2023/112565 2022-09-20 2023-08-11 Dynamic image generation method and device WO2024060873A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211141405.8 2022-09-20
CN202211141405.8A CN115529500A (en) 2022-09-20 2022-09-20 Method and device for generating dynamic image

Publications (1)

Publication Number Publication Date
WO2024060873A1 true WO2024060873A1 (en) 2024-03-28

Family

ID=84697278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/112565 WO2024060873A1 (en) 2022-09-20 2023-08-11 Dynamic image generation method and device

Country Status (2)

Country Link
CN (1) CN115529500A (en)
WO (1) WO2024060873A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529500A (en) * 2022-09-20 2022-12-27 中国电信股份有限公司 Method and device for generating dynamic image

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292427B (en) * 2020-03-06 2021-01-01 腾讯科技(深圳)有限公司 Bone displacement information acquisition method, device, equipment and storage medium
CN111445561B (en) * 2020-03-25 2023-11-17 北京百度网讯科技有限公司 Virtual object processing method, device, equipment and storage medium
CN111443852A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Digital human action control method and device, electronic equipment and storage medium
CN112270734B (en) * 2020-10-19 2024-01-26 北京大米科技有限公司 Animation generation method, readable storage medium and electronic equipment
CN113822967A (en) * 2021-02-09 2021-12-21 北京沃东天骏信息技术有限公司 Man-machine interaction method, device, system, electronic equipment and computer medium
CN113763518A (en) * 2021-09-09 2021-12-07 北京顺天立安科技有限公司 Multi-mode infinite expression synthesis method and device based on virtual digital human
CN113538636B (en) * 2021-09-15 2022-07-01 中国传媒大学 Virtual object control method and device, electronic equipment and medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090231347A1 (en) * 2008-03-11 2009-09-17 Masanori Omote Method and Apparatus for Providing Natural Facial Animation
CN113781610A (en) * 2021-06-28 2021-12-10 武汉大学 Virtual face generation method
CN113643413A (en) * 2021-08-30 2021-11-12 北京沃东天骏信息技术有限公司 Animation processing method, animation processing device, animation processing medium and electronic equipment
CN114219880A (en) * 2021-12-16 2022-03-22 网易(杭州)网络有限公司 Method and device for generating expression animation
CN115529500A (en) * 2022-09-20 2022-12-27 中国电信股份有限公司 Method and device for generating dynamic image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI, YAN: "Research on the Application of 3D Skin Technology in Facial Animation", Journal of Weinan Normal University, vol. 32, no. 12, 20 June 2017, pages 27-31, XP009553301, ISSN: 1009-5128 *
LI, QING ET AL.: "Orthogonal-Blendshape-Based Editing System for Facial Motion Capture Data", IEEE Computer Graphics and Applications, vol. 28, no. 6, 11 November 2008, XP011237925, DOI: 10.1109/MCG.2008.120 *

Also Published As

Publication number Publication date
CN115529500A (en) 2022-12-27
