WO2024060873A1 - Dynamic image generation method and device - Google Patents

Dynamic image generation method and device

Info

Publication number
WO2024060873A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
information
blendshape
characteristic
generation method
Application number
PCT/CN2023/112565
Other languages
French (fr)
Chinese (zh)
Inventor
魏莱
王宇桐
宋雅奇
薛裕颖
沈云
Original Assignee
中国电信股份有限公司
Application filed by 中国电信股份有限公司
Publication of WO2024060873A1

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 - Monomedia components thereof
    • H04N 21/816 - Monomedia components thereof involving special video data, e.g. 3D video
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/80 - 2D [Two Dimensional] animation, e.g. using sprites
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85 - Assembly of content; Generation of multimedia applications
    • H04N 21/854 - Content authoring
    • H04N 21/8547 - Content authoring involving timestamps for synchronizing content

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a method for generating dynamic images, a device for generating dynamic images, and a non-volatile computer-readable storage medium.
  • Intelligent-driven digital humans are digital humans restored through three-dimensional modeling, computer vision, speech recognition and other technologies. They can communicate with users through changes in mouth shapes and expressions.
  • In related technologies, several basic BlendShape (blend shape) expression animations and basic mouth shape animations are built into the rendering engine in advance; expression tags and mouth shape tags are generated from the text and are input to the rendering engine for animation retrieval and synthesis.
  • a dynamic image generation method is provided, including: determining characteristic information corresponding to response information according to the user's voice; determining characteristic data corresponding to the response information according to the characteristic information, the characteristic data being determined from the BlendShape data and skeleton data corresponding to the characteristic information; and generating a dynamic image corresponding to the response information according to the characteristic data.
  • the characteristic information includes at least one of emotional information or pronunciation information
  • the characteristic data includes at least one of expression data or mouth shape data.
  • determining the characteristic data corresponding to the response information according to the characteristic information includes at least one of the following: determining the expression data according to the emotional information, the expression data being determined from the first BlendShape data and first skeleton data corresponding to the emotional information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from the second BlendShape data and second skeleton data corresponding to the pronunciation information.
  • the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
  • the generating method further comprises: determining an initial weight of the feature data based on the feature information; and generating a dynamic image corresponding to the response information based on the feature data includes: randomly generating actual weights for the corresponding moments of multiple key frames within a value range determined by the initial weight and a threshold; weighting the feature data with the multiple actual weights to generate the multiple key frames; and generating the dynamic image from the multiple key frames.
  • the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  • generating a dynamic image based on multiple key frames includes: smoothing the weighted feature data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames; and generating the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
  • the generation method also includes: determining the timestamp of the feature data according to the feature information; and randomly generating the actual weights for the corresponding moments of multiple key frames within the value range determined by the initial weight and the threshold includes: generating the actual weight for the moment corresponding to the timestamp.
  • determining the characteristic data corresponding to the response information according to the characteristic information includes: using the state machine in the semantic engine to determine the identification information corresponding to the characteristic data according to the characteristic information; and using the rendering engine to obtain the characteristic data from the cache pool according to the identification information sent by the state machine.
  • the generation method further includes: during the initialization process, using a rendering engine to read multiple feature data from the facial model library; and using the rendering engine to load the multiple feature data into the cache pool in JSON text format.
  • determining the characteristic information corresponding to the response information based on the user's voice includes: when the user initiates voice interaction, performing semantic analysis and emotional analysis on the user's voice; determining the response text in the question and answer library based on the analysis results; and performing at least one of sentiment analysis or phoneme extraction on the response text to determine the feature information.
  • BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
  • a device for generating dynamic images is provided, including: a semantic engine module for determining feature information corresponding to the response information based on the user's voice and determining feature data corresponding to the response information based on the feature information, the feature data being determined from the BlendShape data and skeletal data corresponding to the feature information; and a rendering engine module for generating dynamic images corresponding to the response information based on the feature data.
  • the generating device further includes: a facial model library for storing a plurality of feature data.
  • the feature information includes at least one of emotion information or pronunciation information
  • the feature data includes at least one of expression data or mouth shape data
  • the semantic engine module performs at least one of the following: determining the expression data according to the emotion information, the expression data being determined from the first BlendShape data and first skeleton data corresponding to the emotion information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from the second BlendShape data and second skeleton data corresponding to the pronunciation information.
  • the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
  • the semantic engine module determines the initial weight of the feature data based on the feature information; the rendering engine module randomly generates the actual weights for the corresponding moments of multiple key frames within the value range determined by the initial weight and the threshold, weights the feature data with the multiple actual weights to generate multiple key frames, and generates a dynamic image from the multiple key frames.
  • the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  • the rendering engine module smooths the weighted feature data corresponding to adjacent key frames to generate non-key frames between adjacent key frames, and generates the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
  • the semantic engine module determines the timestamp of the feature data based on the feature information; the rendering engine module generates the actual weight corresponding to the timestamp.
  • the semantic engine module uses the state machine to determine the identification information corresponding to the feature data based on the feature information; the rendering engine module obtains the feature data from the cache pool based on the identification information sent by the state machine.
  • the rendering engine module reads multiple feature data from the facial model library and loads the multiple feature data into the cache pool in JSON text format.
  • when the user initiates voice interaction, the semantic engine module performs semantic analysis and emotional analysis on the user's voice; determines the response text in the question and answer library based on the analysis results; and performs at least one of emotional analysis or phoneme extraction on the response text to determine the characteristic information.
  • BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
  • a device for generating dynamic images is provided, including: a memory; and a processor coupled to the memory.
  • the processor is configured to execute the dynamic image generation method of any of the above embodiments based on instructions stored in the memory.
  • a non-volatile computer-readable storage medium on which a computer program is stored.
  • when the program is executed by a processor, the dynamic image generation method of any of the above embodiments is implemented.
  • Figure 1 shows a flow chart of some embodiments of a dynamic image generation method of the present disclosure
  • Figure 2 shows a flow chart of other embodiments of the dynamic image generation method of the present disclosure
  • Figure 3 shows a schematic diagram of some embodiments of a dynamic image generation method of the present disclosure
  • Figure 4 shows a block diagram of some embodiments of the dynamic image generation device of the present disclosure
  • Figure 5 shows a block diagram of other embodiments of the dynamic image generation device of the present disclosure
  • Figure 6 shows a block diagram of still other embodiments of the dynamic image generation device of the present disclosure
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • the driving methods of intelligent-driven digital humans mainly include the following two types.
  • One approach is to build several basic BlendShape expression animations and basic mouth shape animations into the rendering engine in advance; the semantic engine then performs speech recognition on the user's voice to determine the text content to be answered, generates expression tags and mouth shape tags from the text, and inputs them to the rendering engine for animation retrieval and synthesis.
  • The digital human driving effect of this approach is rather stiff and rigid, and a large amount of facial animation must be produced in advance. If the expression details are demanding, hundreds of animations may even need to be pre-produced and then imported into the rendering engine. If expressions or mouth shapes need to be expanded, animations must be manually created and imported again, resulting in high labor costs and low system scalability.
  • Another method does not require the animation to be imported into the rendering engine in advance. Instead, the semantic engine directly obtains the BlendShape coefficients of expressions and mouth shapes based on training, and sends the coefficients to the rendering engine, which receives and drives them in real time.
  • This method needs to continuously occupy bandwidth to send data, and does not form a standard expression library for the training data. There is a large amount of repeated data, resulting in high bandwidth resource usage, poor data reusability, and reduced real-time performance.
  • BlendShape uses a series of vertex displacements to achieve a smooth deformation effect on the object.
  • the single use of BlendShape for driving does not take into account the impact of bones on the digital human, so the driving accuracy of the digital human is limited and the sense of reality is poor.
  • Using only BlendShape or only bones for driving limits the driving accuracy of the digital human, resulting in stiff facial expressions that lack dynamic change and precise expression, and poor realism.
  • High resource overhead: the training-based method needs to continuously occupy bandwidth to send data that contains a large amount of duplicate data, and the animation-based method needs to store a large number of animation assets, resulting in high bandwidth and hardware resource overhead and reduced performance and real-time capability.
  • the inventor of the present disclosure discovered that the above-mentioned related technologies have the following problem: the generated dynamic images are stiff and rigid, resulting in poor dynamic image effects.
  • the present disclosure proposes a technical solution for generating dynamic images, which can improve the effect of dynamic images.
  • the present disclosure proposes a technical solution for driving dynamic micro-expressions of a digital human based on random weights.
  • the random weight calculation method of the present disclosure randomly calculates the weight of facial key frames based on a state machine and threshold, and uses the least squares method to smooth the key frames to realize the dynamic expression of micro-expressions;
  • the bone correction method of the present disclosure weights BlendShape data and skeleton data to achieve accurate expression driving;
  • the present disclosure builds a facial model library to store basic expressions, sets a model ID to uniquely identify different expressions and mouth shapes, and implements lightweight import into the cache pool for data caching based on JSON text, realizing efficient reuse and expansion of data.
  • FIG. 1 shows a flowchart of some embodiments of the dynamic image generation method of the present disclosure.
  • step 110 characteristic information corresponding to the response information is determined based on the user's voice.
  • semantic analysis and emotional analysis are performed on the user's voice; a response text is determined in the question and answer library according to the analysis results; and at least one of emotional analysis or phoneme extraction is performed on the response text to determine the feature information.
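As a rough sketch of this step, the Python below mimics the flow with a toy question-and-answer library, a keyword-based emotion lookup, and a crude per-word phoneme split; all of these stand-ins are hypothetical placeholders, not the semantic engine described in this disclosure.

```python
# Minimal sketch of step 110: derive feature information (emotion + phonemes)
# for the response text. The Q&A library, emotion keywords and phoneme rule
# below are illustrative placeholders only.

QA_LIBRARY = {                      # hypothetical question -> answer pairs
    "how are you": "I am very happy to talk with you today",
    "goodbye": "Sorry to see you go, take care",
}

EMOTION_KEYWORDS = {"happy": "happy", "sorry": "sad"}   # toy sentiment lookup


def determine_feature_info(user_text: str) -> dict:
    """Pick a response text, then extract emotion info and pronunciation info."""
    # 1) semantic analysis: here just a lookup in the Q&A library
    response = QA_LIBRARY.get(user_text.lower().strip(), "I am not sure about that")

    # 2) emotion analysis of the response text (keyword matching as a stand-in)
    emotion = "neutral"
    for word, label in EMOTION_KEYWORDS.items():
        if word in response.lower():
            emotion = label
            break

    # 3) phoneme extraction: crude per-word split as a stand-in for a real
    #    grapheme-to-phoneme step
    phonemes = response.lower().split()

    return {"response_text": response, "emotion": emotion, "phonemes": phonemes}


if __name__ == "__main__":
    print(determine_feature_info("how are you"))
```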
  • step 120 the characteristic data corresponding to the response information is determined based on the characteristic information, and the characteristic data is determined based on the BlendShape data and bone data corresponding to the characteristic information.
  • a rendering engine is used to read multiple feature data from a facial model library; and the rendering engine is used to load the multiple feature data into a cache pool in a JSON text format.
  • the facial model library can perform bone correction processing on BlendShape, thereby achieving precise expression driving based on BlendShape and bones.
  • the facial model library can also store text data of basic expressions and model IDs (i.e., identification information) used to uniquely identify different expressions and mouth shapes (i.e., feature data), thereby achieving efficient reading and reuse of data.
  • the state machine in the semantic engine is used to determine the identification information corresponding to the characteristic data according to the characteristic information; the rendering engine is used to obtain the characteristic data from the cache pool according to the identification information sent by the state machine.
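A minimal sketch of how such a cache pool could work is shown below: the facial model library is loaded once from JSON text at initialization, and later lookups use only the model ID sent by the state machine. The JSON layout and field names are assumptions for illustration, not the patent's actual format.

```python
import json

# Hypothetical facial model library serialized as JSON text, keyed by model ID.
FACIAL_MODEL_LIBRARY_JSON = """
{
  "EMO_HAPPY": {"blendshape": [0.6, 0.1, 0.3], "skeleton": [0.2, 0.0]},
  "LIP_O":     {"blendshape": [0.0, 0.8, 0.1], "skeleton": [0.4, 0.1]}
}
"""

class CachePool:
    """Loads feature data once at initialization and serves lookups by model ID."""

    def __init__(self, library_json: str):
        self._pool = json.loads(library_json)   # JSON text -> in-memory dict

    def get(self, model_id: str) -> dict:
        # The state machine sends only the identification information (model ID);
        # the full BlendShape/skeleton coefficients come from the cache pool.
        return self._pool[model_id]

pool = CachePool(FACIAL_MODEL_LIBRARY_JSON)
print(pool.get("EMO_HAPPY"))
```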
  • the facial model library includes a mouth shape database
  • the data structure of the mouth shape data LIP in the mouth shape database is [LipID, BlendShape_L, Skeleton_L].
  • LipID represents the mouth shape ID of the mouth shape data.
  • Multiple phonemes can have the same mouth shape, that is, the same mouth shape ID.
  • the phonemes "o" and "ao" have similar mouth shapes and can correspond to the same LipID, thereby reducing the amount of data while maintaining precise driving.
  • BlendShape_L represents a set of BlendShape coefficients corresponding to the mouth shape data (i.e., the second BlendShape data), and Skeleton_L represents the facial bone coefficients corresponding to the mouth shape data (i.e., the second skeletal data).
  • the facial model library includes an expression database
  • the data structure of the expression data Emotion in the expression database is [EmoID, BlendShape_E, Skeleton_E].
  • BlendShape_E represents a set of BlendShape coefficients corresponding to the expression data (i.e., the first BlendShape data), and Skeleton_E represents the facial skeleton coefficients corresponding to the expression data (i.e., the first skeletal data).
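The two record layouts can be pictured as simple data classes, as in the sketch below; the field types and the example phoneme-to-LipID mapping are illustrative assumptions, not the patent's exact schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Lip:
    """Mouth shape record [LipID, BlendShape_L, Skeleton_L]."""
    lip_id: str
    blendshape_l: List[float]   # second BlendShape data (coefficients)
    skeleton_l: List[float]     # second skeleton data (facial bone coefficients)

@dataclass
class Emotion:
    """Expression record [EmoID, BlendShape_E, Skeleton_E]."""
    emo_id: str
    blendshape_e: List[float]   # first BlendShape data
    skeleton_e: List[float]     # first skeleton data

# Several phonemes may share one mouth shape, i.e. map to the same LipID.
PHONEME_TO_LIP_ID = {"o": "LIP_O", "ao": "LIP_O"}   # illustrative mapping
```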
  • BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
  • BlendShape_E data includes a set of overall expression benchmarks (or expression components).
  • the BlendShape_E data of human facial expression e at a certain moment is a linear weighting of this set of expression components:
  • BlendShape_E = B_E · d_bE + b_bE
  • B_E is a set of expression benchmarks
  • d_bE is the corresponding weight coefficient
  • b_bE is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions).
  • the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
  • when the BlendShape_E data changes, the digital human's skeletal data should also change accordingly.
  • when the digital human speaks, the skeletal points related to the jaw and face also shift. Therefore, it is necessary to perform bone correction processing on the BlendShape_E data to make the driving effect more accurate and realistic.
  • by analogy with the BlendShape_E formula, the corrected skeletal data can be written as Skeleton_E = S_E · d_SE + b_SE, where:
  • S_E is a set of bone benchmarks (or bone components)
  • d_SE is the corresponding bone coefficient (i.e., weight coefficient)
  • b_SE is the initial bone (such as the neutral bone that is different from negative bones and positive bones).
  • the bone coefficient represents the linear blending coefficient of a set of bone components that changes the expression from a neutral bone to a target bone.
  • similarly, for the mouth shape data, BlendShape_L = B_L · d_bL + b_bL and Skeleton_L = S_L · d_SL + b_SL, where:
  • B_L is a set of expression benchmarks
  • d_bL is the corresponding weight coefficient
  • b_bL is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions)
  • S_L is a set of skeleton benchmarks (or bone components)
  • d_SL is the corresponding bone coefficient (i.e., weight coefficient)
  • b_SL is the initial bone (such as the neutral bone that is different from negative bones and positive bones).
  • the bone coefficient represents the linear blending coefficient of a set of bone components that changes the mouth shape from a neutral bone to a target bone.
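Taken together, both the BlendShape term and the bone-correction term are linear blends of a component basis plus an initial (neutral) value. The sketch below evaluates such a blend with small made-up vectors; the dimensions and coefficients are purely illustrative.

```python
import numpy as np

def linear_blend(basis: np.ndarray, weights: np.ndarray, neutral: np.ndarray) -> np.ndarray:
    """Generic blend: basis (n_targets x n_components) @ weights + neutral."""
    return basis @ weights + neutral

# Toy expression example: 4 BlendShape targets, 2 expression components.
B_E = np.array([[0.9, 0.0], [0.1, 0.7], [0.0, 0.5], [0.3, 0.2]])  # expression benchmarks
d_bE = np.array([0.8, 0.2])          # weight coefficients for the target emotion
b_bE = np.zeros(4)                   # neutral (initial) expression

blendshape_e = linear_blend(B_E, d_bE, b_bE)

# Bone correction follows the same pattern: neutral bone -> target bone.
S_E = np.array([[0.4, 0.1], [0.0, 0.6]])   # bone benchmarks (2 bone points)
d_SE = np.array([0.8, 0.2])                # bone (weight) coefficients
b_SE = np.zeros(2)                          # neutral (initial) bone

skeleton_e = linear_blend(S_E, d_SE, b_SE)
print(blendshape_e, skeleton_e)
```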
  • the feature information includes at least one of emotion information or pronunciation information
  • the feature data includes at least one of expression data or mouth shape data.
  • expression data is determined based on emotional information.
  • the expression data is determined based on the first BlendShape data and the first skeleton data corresponding to the emotion information.
  • mouth shape data is determined based on the pronunciation information, and the mouth shape data is determined from the second BlendShape data and the second skeleton data corresponding to the pronunciation information.
  • the semantic engine recognizes the user's voice and outputs the digital human's answer audio; the state machine of the semantic engine outputs the EmoID and initial weight Weight of the corresponding expression data according to the emotion of the digital human's answer text. Since the emotions of the digital human are not fixed, the state machine also outputs the TimeStamp of the expression data to ensure the changes of different micro-expressions. The state machine outputs the LipID and initial weight Weight of the corresponding lip shape data according to the phonemes of the digital human's answer text. Since the pronunciation of each word of the digital human is not at the same frequency, the state machine also outputs the TimeStamp of the lip shape data to ensure the synchronization of the lip shape and audio.
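The per-key-frame message from the state machine to the rendering engine can be pictured as a small record carrying an ID, an initial weight, and a timestamp, as sketched below; the concrete IDs and timings are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class DriveEvent:
    """One item sent by the semantic engine's state machine to the rendering engine."""
    model_id: str      # EmoID for expressions or LipID for mouth shapes
    weight: float      # initial weight W of the feature data
    timestamp_ms: int  # TimeStamp keeping micro-expressions and lip sync on schedule

# Illustrative stream for one short answer: one expression event, two lip events.
events = [
    DriveEvent("EMO_HAPPY", 0.7, 0),
    DriveEvent("LIP_HE", 0.9, 120),
    DriveEvent("LIP_LOU", 0.9, 260),
]
```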
  • step 130 a dynamic image corresponding to the response information is generated based on the characteristic data.
  • the weighted feature data corresponding to adjacent key frames is smoothed to generate non-key frames between adjacent key frames; according to the key frames and non-key frames and their timestamps, in chronological order Generate dynamic images.
  • the rendering engine caches the data of the facial model library in the cache pool; according to the random weight calculation method, the actual weights of the facial model are randomly calculated based on the threshold to realize the dynamic expression of micro-expressions; the driving data is smoothed according to the actual weights, and expression and mouth shape data are integrated to drive the digital human's overall face; voice is played synchronously.
  • BlendShape data is corrected using the skeletal data, and the precise driving of expressions can be realized based on the BlendShape data and the skeletal data, thereby improving the effect of dynamic images.
  • the actual weight can be calculated through the embodiment in Figure 2.
  • FIG. 2 shows a flowchart of another embodiment of the dynamic image generation method of the present disclosure.
  • step 210 the initial weight of the feature data is determined based on the feature information.
  • step 220 actual weights of corresponding moments of multiple key frames are randomly generated within the value range determined based on the initial weight and the threshold.
  • the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  • the timestamp of the feature data is determined based on the feature information; and the actual weight corresponding to the time stamp is generated.
  • the expression of a digital human should not be static, stiff, and unchanging, so the random weight calculation module is configured with a threshold T, which reflects the dynamic range of the digital human's expressions as an increment; the random weight calculation module calculates the maximum weight W+T and the minimum weight W-T, and generates a random number R every time interval I as the actual weight: W-T < R < W+T.
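A minimal sketch of this random weight rule, assuming a uniform draw from the interval (W - T, W + T) at every key-frame interval I; the concrete threshold and interval values are illustrative.

```python
import random
from typing import List, Tuple

def keyframe_weights(initial_weight: float, threshold: float,
                     interval_s: float, duration_s: float) -> List[Tuple[float, float]]:
    """Generate (time, actual_weight) pairs with W - T < R < W + T every interval I."""
    lo, hi = initial_weight - threshold, initial_weight + threshold
    weights = []
    t = 0.0
    while t <= duration_s:
        weights.append((t, random.uniform(lo, hi)))  # random actual weight R
        t += interval_s
    return weights

# Example: initial weight W = 0.7, threshold T = 0.1, a key frame every 0.5 s for 2 s.
print(keyframe_weights(0.7, 0.1, 0.5, 2.0))
```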
  • step 230 the feature data is weighted respectively using multiple actual weights to generate multiple key frames.
  • step 240 dynamic images are generated based on multiple key frames.
  • the least squares method is used to smooth the expression data to obtain non-key-frame expression data; the expression and mouth shape data are fused, and dynamic driving of the digital human's overall face is realized according to the timestamp TS.
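One plausible reading of the smoothing step is a least-squares polynomial fit through the key-frame weights, sampled at the non-key-frame times; the polynomial degree and frame rate in the sketch below are assumptions rather than the disclosure's exact procedure.

```python
import numpy as np

def smooth_weights(key_times: np.ndarray, key_weights: np.ndarray,
                   frame_times: np.ndarray, degree: int = 2) -> np.ndarray:
    """Least-squares polynomial fit through key frames, evaluated at every frame time."""
    coeffs = np.polyfit(key_times, key_weights, deg=degree)   # least-squares fit
    return np.polyval(coeffs, frame_times)

# Key frames every 0.5 s with randomly perturbed weights, rendered at 30 fps.
key_times = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
key_weights = np.array([0.72, 0.65, 0.78, 0.69, 0.74])
frame_times = np.arange(0.0, 2.0, 1.0 / 30.0)

non_keyframe_weights = smooth_weights(key_times, key_weights, frame_times)
print(non_keyframe_weights[:5])
```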
  • the random weight calculation method can realize the dynamic generation of micro-expressions through random numbers. There is no need to pre-produce hundreds of expression animations, and there is no need for the semantic engine to send a large amount of driving data in real time. Therefore, on the basis of satisfying basic emotions, the expressions of digital people can dynamically change within a certain range over time, giving users a sense of reality and improving the accuracy and authenticity of dynamic images.
  • FIG. 3 shows a schematic diagram of some embodiments of the dynamic image generation method of the present disclosure.
  • the facial model library is used to implement bone correction and can perform bone correction processing on BlendShape, thereby achieving precise expression driving based on BlendShape and bones.
  • the facial model library can also store text data of basic expressions and model IDs (i.e., identification information) used to uniquely identify different expressions and mouth shapes (i.e., feature data), thereby achieving efficient reading and reuse of data.
  • the design of the mouth shape database of the facial model library is as follows: the data structure of the mouth shape data LIP is [LipID, BlendShape_L, Skeleton_L]; LipID represents the mouth shape ID of the mouth shape data, and multiple phonemes can have the same mouth shape, that is, the same mouth shape ID. For example, the phonemes "o" and "ao" have similar mouth shapes and can correspond to the same LipID, thereby reducing the amount of data while maintaining precise driving; BlendShape_L represents a set of BlendShape coefficients corresponding to the mouth shape data (i.e., the second BlendShape data), and Skeleton_L represents the facial skeleton coefficient corresponding to the mouth shape data (i.e., the second skeleton data).
  • B_E is a set of expression benchmarks
  • d_bE is the corresponding weight coefficient
  • b_bE is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions).
  • when the BlendShape_E data changes, the digital human's skeletal data should also change accordingly.
  • when the digital human speaks, the skeletal points related to the jaw and face also shift. Therefore, it is necessary to perform bone correction processing on the BlendShape_E data to make the driving effect more accurate and realistic.
  • S_E is a set of bone benchmarks (or bone components)
  • d_SE is the corresponding bone coefficient (i.e., weight coefficient)
  • b_SE is the initial bone (such as the neutral bone that is different from negative bones and positive bones).
  • the bone coefficient represents the linear blending coefficient of a set of bone components that changes the expression from a neutral bone to a target bone.
  • B_L is a set of expression benchmarks
  • d_bL is the corresponding weight coefficient
  • b_bL is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions)
  • S_L is a set of skeleton benchmarks (or bone components)
  • d_SL is the corresponding bone coefficient (i.e., weight coefficient)
  • b_SL is the initial bone (such as the neutral bone that is different from negative bones and positive bones).
  • the bone coefficient represents the linear blending coefficient of a set of bone components that changes the mouth shape from a neutral bone to a target bone.
  • the semantic engine recognizes the user's voice and outputs the audio of the digital person's answer; the state machine of the semantic engine outputs the EmoID and initial weight of the corresponding expression data based on the emotion of the digital person's answer text. Since the emotions of digital people are not fixed, the state machine also outputs the TimeStamp of the expression data to ensure changes in different micro-expressions. The state machine outputs the LipID and initial weight of the corresponding mouth shape data based on the phonemes of the text answered by the digital person. Since the pronunciation of each word of the digital person is not at the same frequency, the state machine also outputs the TimeStamp of the mouth shape data to ensure the synchronization of the mouth shape and audio.
  • the rendering engine caches the data of the facial model library in the cache pool; according to the random weight calculation method, the actual weights of the facial model are randomly calculated based on the threshold to realize the dynamic expression of micro-expressions; the driving data is smoothed according to the actual weights, and expression and mouth shape data are fused to drive the digital human's overall face; voice is played synchronously.
  • the state machine of the semantic engine sends the EmoID, initial weight W and timestamp TS of the expression data to the rendering engine; the random weight calculation module of the rendering engine matches the corresponding data in the cache pool based on the EmoID.
  • the expression of the digital human should not be static, stiff, and unchanging, so the random weight calculation module is configured with a threshold T, which reflects the dynamic range of the digital human's expressions as an increment; the random weight calculation module calculates the maximum weight W+T and the minimum weight W-T, and generates a random number R every time interval I as the actual weight: W-T < R < W+T
  • the least squares method is used to smooth the expression data to obtain non-key-frame expression data; the expression and mouth shape data are fused, and dynamic driving of the digital human's overall face is realized according to the timestamp TS.
  • the digital human intelligence drives the interaction process as follows.
  • step 1 set the BlendShape data model of basic expressions and basic mouth shapes.
  • Each base data model is uniquely identified by a model ID.
  • step 2 a bone correction process is performed, and the coefficients of bone points are added to the data model to form a corrected facial model text.
  • step 3 the rendering engine reads the facial model library when it is initialized, and loads the data into the cache pool in JSON text format.
  • step 4 when the user initiates voice interaction with the digital human, the speech recognition module performs user semantic and emotional analysis.
  • step 5 the answer text module stores an intelligent question and answer library and obtains the corresponding answer text based on the user's semantics and emotions.
  • step 6 the natural language processing module performs sentiment analysis and phoneme extraction on the answer text.
  • step 7 the speech synthesis module synthesizes the segmented answer text into audio data.
  • step 8 the state machine sends expression ID or lip shape ID, weight, timestamp, audio and other data to the rendering engine.
  • step 9 the random weight calculation module matches the corresponding basic model in the cache pool based on the ID and generates key frames based on random numbers.
  • step 10 the smoothing module smoothes the key frames based on the least squares method.
  • step 11 the expression fusion module fuses the expression and mouth shape data and implements dynamic driving according to the timestamp.
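Stringing the steps together, the sketch below runs a single answer through placeholder versions of the modules from steps 4 to 11; every function body is a trivial stand-in for the real speech, NLP, and rendering components.

```python
# Illustrative end-to-end pass over steps 4-11 for one user utterance.
import random

CACHE_POOL = {"EMO_HAPPY": 0.7, "LIP_HE": 0.9, "LIP_LOU": 0.9}   # ID -> base weight (toy)

def recognize_speech(audio):            # step 4: speech recognition + user analysis
    return "how are you", "neutral"

def lookup_answer(text, emotion):       # step 5: intelligent question-and-answer library
    return "hello"

def analyze_answer(answer):             # step 6: sentiment analysis and phoneme extraction
    return "EMO_HAPPY", ["LIP_HE", "LIP_LOU"]

def synthesize_speech(answer):          # step 7: text-to-speech
    return b"fake-audio-bytes"

def drive_digital_human(user_audio):
    user_text, user_emotion = recognize_speech(user_audio)
    answer_text = lookup_answer(user_text, user_emotion)
    emo_id, lip_ids = analyze_answer(answer_text)
    audio = synthesize_speech(answer_text)
    # steps 8-9: the state machine sends IDs; random weights around the cached base weight
    key_frames = [(mid, CACHE_POOL[mid] + random.uniform(-0.1, 0.1))
                  for mid in [emo_id, *lip_ids]]
    # steps 10-11: smoothing and fusion are sketched above; here just return the frames
    return key_frames, audio

print(drive_digital_human(b"user-audio"))
```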
  • a random weight calculation method which randomly calculates the weight of facial key frames based on a state machine and a threshold, and uses the least squares method to smooth the key frames to achieve dynamic expression of micro-expressions;
  • a bone correction rule is proposed, based on BlendShape and bone weighting to achieve accurate expression driving, thereby improving the realism and driving accuracy of digital human beings.
  • the cache pool is used to store basic facial data, and the random weight calculation method is used to realize dynamic micro-expressions. There is no need to occupy bandwidth to send a large amount of repeated data, and there is no need to store a large amount of animation assets, which reduces the additional overhead of bandwidth and hardware resources and improves performance and real-time capability.
  • The model ID uniquely identifies different expressions and mouth shapes; data is imported into the cache pool based on JSON text for caching, achieving efficient reuse and expansion of data. It is not necessary to build basic mouth shapes and basic expressions into the rendering engine in advance, which improves system scalability and reduces labor costs.
  • FIG. 4 shows a block diagram of some embodiments of the dynamic image generation device of the present disclosure.
  • the dynamic image generation device 4 includes: a semantic engine module 41, which is used to determine the characteristic information corresponding to the response information based on the user's voice, and to determine the characteristic data corresponding to the response information based on the characteristic information.
  • the characteristic data is determined from the BlendShape data and skeleton data corresponding to the characteristic information;
  • the rendering engine module 42 is used to generate dynamic images corresponding to the response information based on the feature data.
  • the generation device 4 further includes: a facial model library 43 for storing multiple feature data.
  • the feature information includes at least one of emotion information or pronunciation information
  • the feature data includes at least one of expression data or mouth shape data
  • the semantic engine module 41 performs at least one of the following: determining the expression data according to the emotion information, the expression data being determined from the first BlendShape data and first skeleton data corresponding to the emotion information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from the second BlendShape data and second skeleton data corresponding to the pronunciation information.
  • the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
  • the semantic engine module 41 determines the initial weight of the feature data based on the feature information; the rendering engine module 42 randomly generates the actual weights for the corresponding moments of multiple key frames within the value range determined by the initial weight and the threshold, weights the feature data with the multiple actual weights to generate multiple key frames, and generates a dynamic image from the multiple key frames.
  • the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  • the rendering engine module 42 smooths the weighted feature data corresponding to adjacent key frames to generate non-key frames between adjacent key frames, and generates the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
  • the semantic engine module 41 determines the timestamp of the feature data according to the feature information; the rendering engine module generates the actual weight of the time corresponding to the timestamp.
  • the semantic engine module 41 uses the state machine therein to determine the identification information corresponding to the characteristic data based on the characteristic information; the rendering engine module 42 obtains the characteristic data from the cache pool according to the identification information sent by the state machine.
  • the rendering engine module 42 reads multiple feature data from the facial model library and loads the multiple feature data into the cache pool in JSON text format.
  • when the user initiates voice interaction, the semantic engine module 41 performs semantic analysis and emotional analysis on the user's voice; determines the response text in the question and answer library based on the analysis results; and performs at least one of emotional analysis or phoneme extraction on the response text to determine the characteristic information.
  • BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
  • FIG. 5 shows a block diagram of another embodiment of the dynamic image generating device of the present disclosure.
  • the device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51.
  • the processor 52 is configured to execute the dynamic image generation method of any embodiment of the present disclosure based on instructions stored in the memory 51.
  • the memory 51 may include, for example, a system memory, a fixed non-volatile storage medium, etc.
  • the system memory may store, for example, an operating system, an application program, a boot loader, a database, and other programs.
  • FIG. 6 shows a block diagram of some further embodiments of the dynamic image generating device of the present disclosure.
  • the dynamic image generation device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610.
  • the processor 620 is configured to execute the dynamic image generation method of any of the foregoing embodiments based on instructions stored in the memory 610.
  • the memory 610 may include, for example, a system memory, a fixed non-volatile storage medium, etc. The system memory may store, for example, an operating system, applications, a boot loader, and other programs.
  • the moving image generating device 6 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610 and the processor 620 may be connected through a bus 660, for example. Among them, the input and output interface 630 provides connection interfaces for input and output devices such as monitors, mice, keyboards, touch screens, microphones, and speakers. Network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB disks.
  • embodiments of the present disclosure may be provided as methods, systems, or computer program products. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media including, but not limited to, disk memory, CD-ROM, optical storage, and the like having computer-usable program code embodied therein.
  • the methods and systems of the present disclosure may be implemented in many ways.
  • the methods and systems of the present disclosure may be implemented through software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above order for the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated.
  • the present disclosure can also be implemented as programs recorded in recording media, and these programs include machine-readable instructions for implementing methods according to the present disclosure.
  • the present disclosure also covers recording media storing programs for executing methods according to the present disclosure.

Abstract

The present disclosure relates to the technical field of computers, and relates to a dynamic image generation method and device. The generation method comprises: determining, according to user voice, feature information corresponding to response information; determining, according to the feature information, feature data corresponding to the response information, wherein the feature data is determined according to BlendShape data and skeleton data corresponding to the feature information; and generating, according to the feature data, a dynamic image corresponding to the response information.

Description

Dynamic image generation method and device
Cross-Reference to Related Applications
This application is based on, and claims priority to, the CN application No. 202211141405.8 filed on September 20, 2022; the disclosure of that CN application is incorporated into this application in its entirety.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a method for generating dynamic images, a device for generating dynamic images, and a non-volatile computer-readable storage medium.
Background
With the development of the metaverse, virtual reality, digital twins and other fields, virtual digital humans have begun to evolve from digitized appearance toward intelligent thought and behavior. Intelligence-driven digital humans are digital humans reconstructed through three-dimensional modeling, computer vision, speech recognition and other technologies; they can communicate with users through changes in mouth shape and expression.
In the related art, several basic BlendShape (blend shape) expression animations and basic mouth shape animations are built into the rendering engine in advance; expression tags and mouth shape tags are generated from the text and input to the rendering engine for animation retrieval and synthesis.
Summary of the Invention
According to some embodiments of the present disclosure, a dynamic image generation method is provided, including: determining characteristic information corresponding to response information according to a user's voice; determining characteristic data corresponding to the response information according to the characteristic information, the characteristic data being determined from BlendShape data and skeleton data corresponding to the characteristic information; and generating a dynamic image corresponding to the response information according to the characteristic data.
In some embodiments, the characteristic information includes at least one of emotional information or pronunciation information, the characteristic data includes at least one of expression data or mouth shape data, and determining the characteristic data corresponding to the response information according to the characteristic information includes at least one of the following: determining the expression data according to the emotional information, the expression data being determined from first BlendShape data and first skeleton data corresponding to the emotional information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from second BlendShape data and second skeleton data corresponding to the pronunciation information.
In some embodiments, the skeleton data is determined from initial skeleton data and a weighted sum of multiple skeleton data components.
In some embodiments, the generation method further comprises: determining an initial weight of the characteristic data according to the characteristic information; and generating the dynamic image corresponding to the response information according to the characteristic data includes: randomly generating actual weights for the corresponding moments of multiple key frames within a value range determined by the initial weight and a threshold; weighting the characteristic data with the multiple actual weights to generate the multiple key frames; and generating the dynamic image from the multiple key frames.
In some embodiments, the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
In some embodiments, generating the dynamic image from the multiple key frames includes: smoothing the weighted characteristic data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames; and generating the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
In some embodiments, the generation method further includes: determining a timestamp of the characteristic data according to the characteristic information; and randomly generating the actual weights for the corresponding moments of the multiple key frames within the value range determined by the initial weight and the threshold includes: generating the actual weight for the moment corresponding to the timestamp.
In some embodiments, determining the characteristic data corresponding to the response information according to the characteristic information includes: using a state machine in a semantic engine to determine identification information corresponding to the characteristic data according to the characteristic information; and using a rendering engine to obtain the characteristic data from a cache pool according to the identification information sent by the state machine.
In some embodiments, the generation method further includes: during initialization, using the rendering engine to read multiple pieces of characteristic data from a facial model library; and using the rendering engine to load the multiple pieces of characteristic data into the cache pool in JSON text format.
In some embodiments, determining the characteristic information corresponding to the response information according to the user's voice includes: when the user initiates voice interaction, performing semantic analysis and emotional analysis on the user's voice; determining a response text in a question and answer library according to the analysis results; and performing at least one of emotional analysis or phoneme extraction on the response text to determine the characteristic information.
In some embodiments, the BlendShape data is determined from initial BlendShape data and a weighted sum of multiple BlendShape data components.
According to other embodiments of the present disclosure, a device for generating dynamic images is provided, including: a semantic engine module configured to determine characteristic information corresponding to response information according to a user's voice, and to determine characteristic data corresponding to the response information according to the characteristic information, the characteristic data being determined from BlendShape data and skeleton data corresponding to the characteristic information; and a rendering engine module configured to generate a dynamic image corresponding to the response information according to the characteristic data.
In some embodiments, the generation device further includes: a facial model library for storing multiple pieces of characteristic data.
In some embodiments, the characteristic information includes at least one of emotional information or pronunciation information, the characteristic data includes at least one of expression data or mouth shape data, and the semantic engine module performs at least one of the following: determining the expression data according to the emotional information, the expression data being determined from first BlendShape data and first skeleton data corresponding to the emotional information; or determining the mouth shape data according to the pronunciation information, the mouth shape data being determined from second BlendShape data and second skeleton data corresponding to the pronunciation information.
In some embodiments, the skeleton data is determined from initial skeleton data and a weighted sum of multiple skeleton data components.
In some embodiments, the semantic engine module determines an initial weight of the characteristic data according to the characteristic information; the rendering engine module randomly generates actual weights for the corresponding moments of multiple key frames within a value range determined by the initial weight and a threshold, weights the characteristic data with the multiple actual weights to generate the multiple key frames, and generates the dynamic image from the multiple key frames.
In some embodiments, the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
In some embodiments, the rendering engine module smooths the weighted characteristic data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames, and generates the dynamic image in chronological order from the key frames, the non-key frames, and their timestamps.
In some embodiments, the semantic engine module determines a timestamp of the characteristic data according to the characteristic information; the rendering engine module generates the actual weight for the moment corresponding to the timestamp.
In some embodiments, the semantic engine module uses a state machine therein to determine identification information corresponding to the characteristic data according to the characteristic information; the rendering engine module obtains the characteristic data from a cache pool according to the identification information sent by the state machine.
In some embodiments, during initialization, the rendering engine module reads multiple pieces of characteristic data from a facial model library and loads the multiple pieces of characteristic data into the cache pool in JSON text format.
In some embodiments, when the user initiates voice interaction, the semantic engine module performs semantic analysis and emotional analysis on the user's voice, determines a response text in a question and answer library according to the analysis results, and performs at least one of emotional analysis or phoneme extraction on the response text to determine the characteristic information.
In some embodiments, the BlendShape data is determined from initial BlendShape data and a weighted sum of multiple BlendShape data components.
According to further embodiments of the present disclosure, a device for generating dynamic images is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the dynamic image generation method of any of the above embodiments based on instructions stored in the memory.
According to still further embodiments of the present disclosure, a non-volatile computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the dynamic image generation method of any of the above embodiments is implemented.
附图说明Description of drawings
构成说明书的一部分的附图描述了本公开的实施例,并且连同说明书一起用于解释本公开的原理。The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
参照附图,根据下面的详细描述,可以更加清楚地理解本公开:The present disclosure can be more clearly understood from the following detailed description with reference to the accompanying drawings:
图1示出本公开的动态影像的生成方法的一些实施例的流程图;Figure 1 shows a flow chart of some embodiments of a dynamic image generation method of the present disclosure;
图2示出本公开的动态影像的生成方法的另一些实施例的流程图;Figure 2 shows a flow chart of other embodiments of the dynamic image generation method of the present disclosure;
图3示出本公开的动态影像的生成方法的一些实施例的示意图;Figure 3 shows a schematic diagram of some embodiments of a dynamic image generation method of the present disclosure;
图4示出本公开的动态影像的生成装置的一些实施例的框图;FIG4 is a block diagram showing some embodiments of a device for generating dynamic images of the present disclosure;
图5示出本公开的动态影像的生成装置的另一些实施例的框图;FIG5 is a block diagram showing some other embodiments of the dynamic image generation device of the present disclosure;
图6示出本公开的动态影像的生成装置的又一些实施例的框图。FIG. 6 shows a block diagram of some further embodiments of the dynamic image generating device of the present disclosure.
具体实施方式Detailed Description
现在将参照附图来详细描述本公开的各种示例性实施例。应注意到:除非另外具体说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本公开的范围。Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these examples do not limit the scope of the disclosure unless otherwise specifically stated.
同时,应当明白,为了便于描述,附图中所示出的各个部分的尺寸并不是按照实际的比例关系绘制的。At the same time, it should be understood that, for convenience of description, the dimensions of various parts shown in the drawings are not drawn according to actual proportional relationships.
以下对至少一个示例性实施例的描述实际上仅仅是说明性的,决不作为对本公开及其应用或使用的任何限制。The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application or uses.
对于相关领域普通技术人员已知的技术、方法和设备可能不作详细讨论,但在适当情况下,技术、方法和设备应当被视为说明书的一部分。Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, the techniques, methods and devices should be considered a part of the specification.
在这里示出和讨论的所有示例中,任何具体值应被解释为仅仅是示例性的,而不是作为限制。因此,示例性实施例的其它示例可以具有不同的值。In all examples shown and discussed herein, any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步讨论。It should be noted that similar reference numerals and letters refer to similar items in the following figures, so that once an item is defined in one figure, it does not need further discussion in subsequent figures.
如前所述,智能驱动型数字人的驱动方式主要包括下面两种。As mentioned before, the driving methods of intelligent-driven digital humans mainly include the following two types.
一种方式是在渲染引擎中事先内置好若干个BlendShape基本表情动画和基本口型动画；然后，通过语义引擎对用户的语音进行语音识别，判断要回答的文本内容；根据文本生成表情标签和口型标签，输入到渲染引擎进行动画调取和合成。这种方式的数字人驱动效果较为死板、僵硬，而且需要预先制作好大量的面部动画。如果对表情的细节要求高，甚至需要预先制作上百个动画，然后导入到渲染引擎中。如果需要对表情或口型进行扩展，则需要重新手动制作动画和导入动画，人工成本高，系统扩展性低。One approach is to build several basic BlendShape expression animations and basic mouth-shape animations into the rendering engine in advance; the semantic engine then performs speech recognition on the user's voice to determine the text to be answered, and expression tags and mouth-shape tags generated from that text are input to the rendering engine for animation retrieval and synthesis. A digital human driven in this way appears stiff and mechanical, and a large amount of facial animation has to be produced in advance. If fine expression detail is required, hundreds of animations may even need to be pre-produced and imported into the rendering engine. Extending the expressions or mouth shapes requires manually producing and importing animations again, which leads to high labor cost and poor system scalability.
另一种方式不需要渲染引擎中事先导入动画,而是语义引擎直接根据训练得到表情和口型的BlendShape系数,并向渲染引擎发送系数,渲染引擎进行实时接收和驱动。这种方法需要不断占用带宽发送数据,并且对训练得到的数据没有形成标准的表情库,存在大量重复数据,导致带宽资源占用高,数据复用性差,实时性也有所下降。Another method does not require the animation to be imported into the rendering engine in advance. Instead, the semantic engine directly obtains the BlendShape coefficients of expressions and mouth shapes based on training, and sends the coefficients to the rendering engine, which receives and drives them in real time. This method needs to continuously occupy bandwidth to send data, and does not form a standard expression library for the training data. There is a large amount of repeated data, resulting in high bandwidth resource usage, poor data reusability, and reduced real-time performance.
另外,BlendShape是通过使用一系列的顶点位移,使物体得到平顺的变形效果。单一使用BlendShape进行驱动,没有考虑到骨骼对数字人的影响,因此使得数字人驱动精度受到限制,真实感差。In addition, BlendShape uses a series of vertex displacements to achieve a smooth deformation effect on the object. The single use of BlendShape for driving does not take into account the impact of bones on the digital human, so the driving accuracy of the digital human is limited and the sense of reality is poor.
单一使用骨骼进行驱动，需要在面部添加很多骨骼点，并且制作蒙皮。而且，要在细微处频繁调整表情中骨骼的位置，制作过程比较麻烦，并且如果骨骼数量过多，性能也会大量消耗。Driving with bones alone requires adding many bone points to the face and creating skinning. Moreover, the positions of the bones in an expression have to be adjusted frequently at a fine level, which makes the production process cumbersome, and too many bones consume considerable performance.
也就是说,上述的几种方式存在如下的技术问题。In other words, the above methods have the following technical problems.
驱动精度低:单一使用BlendShape或是骨骼进行驱动,使得数字人驱动精度受到限制,导致数字人面部表情僵硬,缺乏动态变化和精准表达,真实感差。Low driving accuracy: The single use of BlendShape or bones for driving limits the driving accuracy of the digital human, resulting in stiff facial expressions, lack of dynamic changes and precise expression, and poor realism.
资源开销大:基于训练的方法需要不断占用带宽发送数据,其中存在大量重复数据;并且基于动画的方法需要存储大量动画资产,导致带宽、硬件资源开销大,性能和实时性都有所下降。High resource overhead: The training-based method needs to continuously occupy bandwidth to send data, which contains a large amount of duplicate data; and the animation-based method needs to store a large number of animation assets, resulting in high bandwidth and hardware resource overhead, and reduced performance and real-time performance.
系统扩展性差：基于动画的方法需要在渲染引擎中事先内置几种基本表情，在后期若要更改，则又要重新手工制作和导入，过程比较繁琐；基于训练的方法也没有对训练数据形成标准的表情库，导致系统扩展性差，人工成本高。Poor system scalability: the animation-based method requires several basic expressions to be built into the rendering engine in advance, and changing them later means manually producing and importing animations again, which is cumbersome; the training-based method also does not organize the training data into a standard expression library, resulting in poor system scalability and high labor cost.
因此,如何精准、动态、高效地驱动数字人面部的动态微表情,为系统提出了更高的技术要求。Therefore, how to accurately, dynamically and efficiently drive dynamic micro-expressions on digital human faces has put forward higher technical requirements for the system.
本公开的发明人发现上述相关技术中存在如下问题：生成的动态影像死板、僵硬，导致动态影像效果差。The inventors of the present disclosure found the following problem in the above related technologies: the generated dynamic images are stiff and mechanical, resulting in a poor dynamic image effect.
鉴于此,本公开提出了一种动态影像的生成技术方案,能够提高动态影像效果。In view of this, the present disclosure proposes a technical solution for generating dynamic images, which can improve the effect of dynamic images.
针对上述技术问题，本公开提出了一种基于随机权重的数字人动态微表情驱动技术方案。本公开的随机权重计算方法，基于状态机和阈值随机计算面部关键帧权重，并采用最小二乘法对关键帧进行平滑处理，实现微表情的动态表达；本公开的骨骼修正方法，基于BlendShape和骨骼加权实现表情精准驱动；本公开构建了面部模型库以存储基本表情，设置了模型ID以唯一标识不同表情和口型，基于JSON文本，实现了轻量化导入缓存池进行数据缓存，实现了数据的高效复用和扩展。In view of the above technical problems, the present disclosure proposes a random-weight-based technical solution for driving dynamic micro-expressions of a digital human. The random weight calculation method of the present disclosure randomly computes facial key-frame weights based on a state machine and a threshold, and smooths the key frames with the least squares method to achieve dynamic expression of micro-expressions; the bone correction method of the present disclosure achieves precise expression driving based on weighted BlendShape and skeleton data; the present disclosure builds a facial model library to store basic expressions, sets model IDs to uniquely identify different expressions and mouth shapes, and performs lightweight import into a cache pool for data caching based on JSON text, achieving efficient reuse and extension of the data.
例如,可以通过如下的实施例实现本公开的技术方案。For example, the technical solution of the present disclosure can be implemented through the following embodiments.
图1示出本公开的动态影像的生成方法的一些实施例的流程图。FIG. 1 shows a flowchart of some embodiments of the dynamic image generation method of the present disclosure.
如图1所示,在步骤110中,根据用户语音,确定回应信息对应的特征信息。As shown in Figure 1, in step 110, characteristic information corresponding to the response information is determined based on the user's voice.
在一些实施例中,在用户发起语音交互的情况下,对用户语音进行语义分析和情感分析;根据分析结果,在问答库中确定回应文本;对回应文本进行情感分析或音素提取中的至少一项处理,确定特征信息。In some embodiments, when the user initiates voice interaction, semantic analysis and emotional analysis are performed on the user's voice; a response text is determined in the question and answer library according to the analysis results; and at least one of emotional analysis or phoneme extraction is performed on the response text. item processing to determine feature information.
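Purely as an illustration of this control flow, and not the disclosure's actual speech or NLP implementation, a minimal sketch with placeholder analysis functions might look like the following; the question-and-answer lookup and the stand-in sentiment and phoneme steps are assumptions.

```python
# Illustrative only: placeholder logic standing in for real speech recognition,
# sentiment analysis, and phoneme extraction.
def build_feature_info(user_text, qa_library):
    response_text = qa_library.get(user_text, "抱歉，我没有听清。")   # look up the answer text
    emotion_info = "smile" if "欢迎" in response_text else "neutral"   # stand-in sentiment analysis
    pronunciation_info = list(response_text)                           # stand-in phoneme extraction
    return {"emotion": emotion_info, "phonemes": pronunciation_info, "text": response_text}

qa_library = {"你好": "你好，欢迎使用数字人服务。"}
feature_info = build_feature_info("你好", qa_library)
```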
在步骤120中,根据特征信息,确定回应信息对应的特征数据,特征数据根据特征信息对应的BlendShape数据和骨骼数据确定。In step 120, the characteristic data corresponding to the response information is determined based on the characteristic information, and the characteristic data is determined based on the BlendShape data and bone data corresponding to the characteristic information.
在一些实施例中,在初始化的过程中,利用渲染引擎,从面部模型库中读取多个特征数据;利用渲染引擎,以JSON文本格式,将多个特征数据加载到缓存池。In some embodiments, during the initialization process, a rendering engine is used to read multiple feature data from a facial model library; and the rendering engine is used to load the multiple feature data into a cache pool in a JSON text format.
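A minimal sketch of this initialization step, assuming a hypothetical JSON file layout for the facial model library; the file name and the keys used for indexing are illustrative, not defined by the disclosure.

```python
import json

def load_cache_pool(path="face_model_library.json"):
    # Read the facial model library as JSON text and index it by model ID,
    # so the rendering engine can later fetch entries sent by the state machine.
    with open(path, "r", encoding="utf-8") as f:
        library = json.load(f)
    return {
        "emotion": {entry["EmoID"]: entry for entry in library.get("emotions", [])},
        "lip": {entry["LipID"]: entry for entry in library.get("lips", [])},
    }
```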
例如,面部模型库能够对BlendShape进行骨骼修正处理,从而基于BlendShape和骨骼共同实现表情精准驱动。面部模型库负责还可以存储基本表情的文本数据、用于唯一标识不同表情和口型(即特征数据)的模型ID(即标识信息),从而实现数据的高效读取和复用。For example, the facial model library can perform bone correction processing on BlendShape, thereby achieving precise expression driving based on BlendShape and bones. The facial model library can also store text data of basic expressions and model IDs (i.e., identification information) used to uniquely identify different expressions and mouth shapes (i.e., feature data), thereby achieving efficient reading and reuse of data.
在一些实施例中,利用语义引擎中的状态机,根据特征信息,确定特征数据对应的标识信息;利用渲染引擎,根据状态机发送的标识信息,从缓存池中获取特征数据。In some embodiments, the state machine in the semantic engine is used to determine the identification information corresponding to the characteristic data according to the characteristic information; the rendering engine is used to obtain the characteristic data from the cache pool according to the identification information sent by the state machine.
例如,面部模型库包括口型数据库,口型数据库中的口型数据LIP的数据结构为[LipID,BlendShapeL,SkeletonL]。For example, the facial model library includes a mouth shape database, and the data structure of the mouth shape data LIP in the mouth shape database is [LipID, BlendShape L , Skeleton L ].
LipID表示口型数据的口型ID,多个音素可以有同样的口型,即同一个口型ID。例如,音素“o”和“ao”口型类似,可以对应同一个LipID,从而在精准驱动的基础上,减小数据量。LipID represents the mouth shape ID of the mouth shape data. Multiple phonemes can have the same mouth shape, that is, the same mouth shape ID. For example, the phonemes "o" and "ao" have similar mouth shapes and can correspond to the same LipID, thereby reducing the amount of data based on precise driving.
BlendShapeL表示该口型数据对应的一组BlendShape系数(即第二BlendShape数据),SkeletonL表示口型数据对应的面部骨骼系数(即第二骨骼数据)。BlendShape L represents a set of BlendShape coefficients corresponding to the mouth shape data (i.e., the second BlendShape data), and Skeleton L represents the facial bone coefficients corresponding to the mouth shape data (i.e., the second skeletal data).
例如,面部模型库包括表情数据库,表情数据库中的表情数据Emotion的数据结构为[EmoID,BlendShapeE,SkeletonE]。 For example, the facial model library includes an expression database, and the data structure of the expression data Emotion in the expression database is [EmoID, BlendShape E , Skeleton E ].
EmoID表示表情数据的表情ID。例如,EmoID=0表示微笑,EmoID=1表示大笑,EmoID=2表示忧伤,EmoID=3表示恐惧,EmoID=4表示愤怒等,支持扩展。EmoID represents the emoticon ID of emoticon data. For example, EmoID=0 means smiling, EmoID=1 means laughing, EmoID=2 means sadness, EmoID=3 means fear, EmoID=4 means anger, etc., and supports expansion.
BlendShapeE表示表情数据对应的一组BlendShape系数(即第一BlendShape数据),SkeletonE表示表情数据对应的面部骨骼系数(即第一骨骼数据)。BlendShape E represents a set of BlendShape coefficients corresponding to the expression data (i.e., the first BlendShape data), and Skeleton E represents the facial skeleton coefficients corresponding to the expression data (i.e., the first skeletal data).
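As a sketch, the two record types described above could be represented as follows; the Python field names simply mirror [LipID, BlendShapeL, SkeletonL] and [EmoID, BlendShapeE, SkeletonE], and the list-of-float encoding of the coefficients is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LipEntry:
    lip_id: int                                               # LipID; "o" and "ao" may share one ID
    blendshape_l: List[float] = field(default_factory=list)   # second BlendShape data (coefficients)
    skeleton_l: List[float] = field(default_factory=list)     # second skeletal data (bone coefficients)

@dataclass
class EmotionEntry:
    emo_id: int                                               # EmoID, e.g. 0 smile, 1 laugh, ... (extensible)
    blendshape_e: List[float] = field(default_factory=list)   # first BlendShape data (coefficients)
    skeleton_e: List[float] = field(default_factory=list)     # first skeletal data (bone coefficients)
```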
在一些实施例中,BlendShape数据根据初始BlendShape数据和多个BlendShape数据分量的加权和确定。In some embodiments, the BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
例如,BlendShapeE数据包括一组组成整体表情基准(或表情分量),某一时刻下的人脸表情e的BlendShapeE数据为这组表情分量的线性加权:
BlendShapeE=BE×dbE+bbE
For example, BlendShape E data includes a set of overall expression benchmarks (or expression components). The BlendShape E data of human facial expression e at a certain moment is a linear weighting of this set of expression components:
BlendShape E =B E ×d bE +b bE
BE是一组表情基准,dbE是对应的权重系数,bbE是初始表情(如区别于负性表情和正性表情的中性表情)。B E is a set of expression benchmarks, d bE is the corresponding weight coefficient, and b bE is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions).
在一些实施例中,骨骼数据根据初始骨骼数据和多个骨骼数据分量的加权和确定。In some embodiments, the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
例如,随着BlendShapeE数据的变化,数字人的骨骼数据也应该相应变化,如在数字人说话时,下巴和脸部相关的骨骼点也发生位移。因此需要对BlendShapeE数据进行骨骼修正处理,从而使驱动效果更加精准和真实。骨骼修正处理后的人脸表情e为:
e=BlendShapeE+SkeletonE=(BE×dbE+bbE)+(SE×dSE+bSE)
For example, as the BlendShape E data changes, the digital human's skeletal data should also change accordingly. For example, when the digital human speaks, the skeletal points related to the jaw and face also shift. Therefore, it is necessary to perform bone correction processing on the BlendShape E data to make the driving effect more accurate and realistic. The facial expression e after skeleton correction is:
e=BlendShape E +Skeleton E =(B E ×d bE +b bE )+(S E ×d SE +b SE )
SE是一组骨骼基准(或骨骼分量),dSE是对应的骨骼系数(即权重系数),bSE是初始骨骼(如区别于负性骨骼和正性骨骼的中性骨骼)。骨骼系数表示表情从中性骨骼变化到目标骨骼的一组骨骼分量的线性混合系数。S E is a set of bone benchmarks (or bone components), d SE is the corresponding bone coefficient (i.e., weight coefficient), and b SE is the initial bone (such as the neutral bone that is different from negative bones and positive bones). The bone coefficient represents the linear blending coefficient of a set of bone components that changes the expression from a neutral bone to a target bone.
例如,骨骼修正处理后的口型l为:
l=BlendShapeL+SkeletonL=(BL×dbL+bbL)+(SL×dSL+bSL)
For example, the mouth shape l after bone correction is:
l=BlendShape L +Skeleton L =(B L ×d bL +b bL )+(S L ×d SL +b SL )
BL是一组表情基准,dbL是对应的权重系数,bbL是初始表情(如区别于负性表情和正性表情的中性表情),SL是一组骨骼基准(或骨骼分量),dSL是对应的骨骼系数(即权重系数),bSL是初始骨骼(如区别于负性骨骼和正性骨骼的中性骨骼)。骨骼系数表示口型从中性骨骼变化到目标骨骼的一组骨骼分量的线性混合系数。B L is a set of expression benchmarks, d bL is the corresponding weight coefficient, b bL is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions), S L is a set of skeleton benchmarks (or bone components), d SL is the corresponding bone coefficient (i.e., weight coefficient), and b SL is the initial bone (such as the neutral bone that is different from negative bones and positive bones). The bone coefficient represents the linear blending coefficient of a set of bone components that changes the mouth shape from a neutral bone to a target bone.
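To make the linear forms above concrete, here is a small numerical sketch of e = (B_E×d_bE + b_bE) + (S_E×d_SE + b_SE) and the analogous mouth-shape term; the dimensions are invented for illustration, and the BlendShape and skeleton terms are kept as separate channel groups of the same pose rather than literally summed.

```python
import numpy as np

def bone_corrected_pose(B, d_b, b_b, S, d_s, b_s):
    """BlendShape term B @ d_b + b_b plus the skeleton correction S @ d_s + b_s,
    returned as the two channel groups that together make up the face pose."""
    return {
        "blendshape": B @ d_b + b_b,   # e.g. BlendShape_E = B_E x d_bE + b_bE
        "skeleton": S @ d_s + b_s,     # e.g. Skeleton_E  = S_E x d_SE + b_SE
    }

# Invented sizes: 52 BlendShape channels from 4 expression components,
# 30 bone parameters from 3 skeleton components.
rng = np.random.default_rng(0)
pose = bone_corrected_pose(
    B=rng.random((52, 4)), d_b=np.array([0.6, 0.2, 0.1, 0.1]), b_b=np.zeros(52),
    S=rng.random((30, 3)), d_s=np.array([0.5, 0.3, 0.2]), b_s=np.zeros(30),
)
```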
在一些实施例中，特征信息包括情绪信息或发音信息中的至少一项，特征数据包括表情数据或口型数据中的至少一项。例如，根据情绪信息，确定表情数据。表情数据根据情绪信息对应的第一BlendShape数据和第一骨骼数据确定。例如，根据发音信息，确定口型数据，口型数据根据发音信息对应的第二BlendShape数据和第二骨骼数据确定。In some embodiments, the characteristic information includes at least one of emotion information or pronunciation information, and the characteristic data includes at least one of expression data or mouth shape data. For example, the expression data is determined based on the emotion information, from the first BlendShape data and the first skeletal data corresponding to the emotion information. For example, the mouth shape data is determined based on the pronunciation information, from the second BlendShape data and the second skeletal data corresponding to the pronunciation information.
例如,语义引擎识别用户语音,输出数字人回答音频;语义引擎的状态机,根据数字人回答文本的情绪,输出对应的表情数据的EmoID和初始权重Weight。由于数字人的情绪不是固定不变的,因此状态机还输出表情数据的TimeStamp(时间戳),以保证不同微表情的变化。状态机根据数字人回答文本的音素,输出对应的口型数据的LipID和初始权重Weight。由于数字人每个字的发音不是相同频率的,因此状态机还输出口型数据的TimeStamp,以保证口型和音频的同步。For example, the semantic engine recognizes the user's voice and outputs the digital human's answer audio; the state machine of the semantic engine outputs the EmoID and initial weight Weight of the corresponding expression data according to the emotion of the digital human's answer text. Since the emotions of the digital human are not fixed, the state machine also outputs the TimeStamp of the expression data to ensure the changes of different micro-expressions. The state machine outputs the LipID and initial weight Weight of the corresponding lip shape data according to the phonemes of the digital human's answer text. Since the pronunciation of each word of the digital human is not at the same frequency, the state machine also outputs the TimeStamp of the lip shape data to ensure the synchronization of the lip shape and audio.
在步骤130中,根据特征数据,生成回应信息对应的动态影像。In step 130, a dynamic image corresponding to the response information is generated based on the characteristic data.
在一些实施例中,对相邻关键帧对应的加权后的特征数据进行平滑处理,以生成相邻关键帧之间的非关键帧;根据关键帧和非关键帧及其时间戳,按照时间顺序生成动态影像。In some embodiments, the weighted feature data corresponding to adjacent key frames is smoothed to generate non-key frames between adjacent key frames; according to the key frames and non-key frames and their timestamps, in chronological order Generate dynamic images.
在一些实施例中，渲染引擎在缓存池缓存面部模型库的数据；根据随机权重计算方法，基于阈值随机计算面部模型的实际权重，实现微表情的动态表达；根据实际权重进行驱动数据的平滑处理，并将表情和口型数据进行融合，实现数字人整体面部的驱动；同步播放语音。In some embodiments, the rendering engine caches the data of the facial model library in a cache pool; according to the random weight calculation method, the actual weight of the facial model is randomly calculated based on a threshold to achieve dynamic expression of micro-expressions; the driving data is smoothed according to the actual weight, and the expression and mouth shape data are fused to drive the digital human's whole face; the voice is played synchronously.
在上述实施例中,利用骨骼数据对BlendShape数据进行修正,能够基于BlendShape数据和骨骼数据共同实现表情的精准驱动,从而提高动态影像的效果。In the above embodiment, the BlendShape data is corrected using the skeletal data, and the precise driving of expressions can be realized based on the BlendShape data and the skeletal data, thereby improving the effect of dynamic images.
例如,可以通过图2中的实施例计算实际权重。For example, the actual weight can be calculated through the embodiment in Figure 2.
图2示出本公开的动态影像的生成方法的另一些实施例的流程图。FIG. 2 shows a flowchart of another embodiment of the dynamic image generation method of the present disclosure.
如图2所示,在步骤210中,根据特征信息,确定特征数据的初始权重。As shown in Figure 2, in step 210, the initial weight of the feature data is determined based on the feature information.
在一些实施例中,以表情数据为例,语义引擎的状态机向渲染引擎发送表情数据的EmoID、初始权重W和时间戳TS;渲染引擎的随机权重计算模块,根据EmoID匹配缓存池中对应的表情数据e,反映数字人的基本情绪:
e=BlendShapeE+SkeletonE=(BE×dbE+bbE)+(SE×dSE+bSE)
In some embodiments, taking expression data as an example, the state machine of the semantic engine sends the EmoID, initial weight W and timestamp TS of the expression data to the rendering engine; the random weight calculation module of the rendering engine matches the corresponding expression data e in the cache pool according to the EmoID to reflect the basic emotions of the digital human:
e=BlendShape E +Skeleton E =( BE × dbE + bbE )+( SE × dSE + bSE )
在步骤220中,在根据初始权重和阈值确定的取值范围内,分别随机生成多个关键帧的对应时刻的实际权重。In step 220, actual weights of corresponding moments of multiple key frames are randomly generated within the value range determined based on the initial weight and the threshold.
例如,取值范围包括大于初始权重与阈值之差且小于初始权重与阈值之和的值。For example, the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
在一些实施例中,根据特征信息,确定特征数据的时间戳;生成时间戳的对应时刻的实际权重。In some embodiments, the timestamp of the feature data is determined based on the feature information; and the actual weight corresponding to the time stamp is generated.
例如，数字人的表情不应是静止僵硬、一成不变的，因此随机权重计算模块配置有阈值T，作为增量反映数字人表情的动态变化范围；随机权重计算模块计算最大权重W+T和最小权重W-T，并每隔时间间隔I生成一个随机数R作为实际权重：For example, the expression of the digital human should not be static and unchanging, so the random weight calculation module is configured with a threshold T, which, as an increment, reflects the dynamic variation range of the digital human's expression; the random weight calculation module computes the maximum weight W+T and the minimum weight W-T, and generates a random number R as the actual weight at every time interval I:
W-T<R<W+T
在步骤230中,利用多个实际权重分别对特征数据进行加权,以生成多个关键帧。In step 230, the feature data is weighted respectively using multiple actual weights to generate multiple key frames.
例如,R的范围在最大权重和最小权重之间,作为新的表情权重对e进行加权,生成关键帧的表情数据:
e(I)=[(BE×dbE+bbE)+(SE×dSE+bSE)]×R
For example, the range of R is between the maximum weight and the minimum weight, and e is weighted as a new expression weight to generate key frame expression data:
e(I)=[(B E ×d bE +b bE )+(S E ×d SE +b SE )]×R
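A small sketch of this keyframe step under the constraint W-T < R < W+T, drawing one actual weight per time interval I and scaling the matched pose; the interval, threshold, and duration values are illustrative, and `pose` is assumed to be a dict of channel arrays such as the one sketched earlier.

```python
import numpy as np

def generate_keyframes(pose, W, T, interval, duration, seed=None):
    """One keyframe per interval: draw R uniformly from (W - T, W + T) and use it
    to weight the pose, as in e(I) = [(B_E x d_bE + b_bE) + (S_E x d_SE + b_SE)] x R."""
    rng = np.random.default_rng(seed)
    keyframes = []
    for t in np.arange(0.0, duration, interval):
        R = rng.uniform(W - T, W + T)                       # actual weight for this keyframe
        keyframes.append((t, {name: channels * R for name, channels in pose.items()}))
    return keyframes
```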
在步骤240中,根据多个关键帧,生成动态影像。In step 240, dynamic images are generated based on multiple key frames.
例如，对于相邻的两个关键帧的表情数据e(I)和e(J)，采用最小二乘法对表情数据进行平滑处理，得到非关键帧的表情数据；将表情和口型数据进行融合，并按照时间戳TS，实现数字人整体面部的动态驱动。For example, for the expression data e(I) and e(J) of two adjacent key frames, the least squares method is used to smooth the expression data to obtain the expression data of the non-key frames; the expression and mouth shape data are then fused, and the dynamic driving of the digital human's whole face is realized according to the timestamp TS.
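One plausible reading of this least-squares smoothing step is a per-channel polynomial fit over the keyframe values, evaluated at the in-between frame times; the polynomial degree and the channel layout are assumptions, not details stated in the disclosure.

```python
import numpy as np

def smooth_inbetweens(key_times, key_values, frame_times, degree=2):
    """key_values has shape (num_keyframes, num_channels); returns values for the
    non-keyframe times by least-squares polynomial fitting per channel."""
    key_times = np.asarray(key_times)
    key_values = np.asarray(key_values)
    deg = min(degree, len(key_times) - 1)
    out = np.empty((len(frame_times), key_values.shape[1]))
    for c in range(key_values.shape[1]):
        coeffs = np.polyfit(key_times, key_values[:, c], deg)   # least-squares fit
        out[:, c] = np.polyval(coeffs, frame_times)
    return out
```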
上述实施例中,随机权重计算方法可以通过随机数实现微表情的动态生成,不需预先制作上百种表情动画,也不需要语义引擎实时发送大量的驱动数据。从而,实现数字人在满足基本情绪的基础上,表情能够随时间在一定范围内发生动态变化,给用户带来真实感,提高动态影像的准确性和真实性。In the above embodiment, the random weight calculation method can realize the dynamic generation of micro-expressions through random numbers. There is no need to pre-produce hundreds of expression animations, and there is no need for the semantic engine to send a large amount of driving data in real time. Therefore, on the basis of satisfying basic emotions, the expressions of digital people can dynamically change within a certain range over time, giving users a sense of reality and improving the accuracy and authenticity of dynamic images.
图3示出本公开的动态影像的生成方法的一些实施例的示意图。FIG. 3 shows a schematic diagram of some embodiments of the dynamic image generation method of the present disclosure.
如图3所示,面部模型库用于实现骨骼修正,能够对BlendShape进行骨骼修正处理,从而基于BlendShape和骨骼共同实现表情精准驱动。面部模型库负责还可以存储基本表情的文本数据、用于唯一标识不同表情和口型(即特征数据)的模型ID(即标识信息),从而实现数据的高效读取和复用。As shown in Figure 3, the facial model library is used to implement bone correction and can perform bone correction processing on BlendShape, thereby achieving precise expression driving based on BlendShape and bones. The facial model library can also store text data of basic expressions and model IDs (i.e., identification information) used to uniquely identify different expressions and mouth shapes (i.e., feature data), thereby achieving efficient reading and reuse of data.
例如,面部模型库的口型数据库设计如下:口型数据LIP的数据结构为[LipID,BlendShapeL,SkeletonL];LipID表示口型数据的口型ID,多个音素可以有同样的口型,即同一个口型ID。例如,音素“o”和“ao”口型类似,可以对应同一个LipID,从而在精准驱动的基础上,减小数据量;BlendShapeL表示该口型数据对应的一组BlendShape系数(即第二BlendShape数据),SkeletonL表示口型数据对应的面部骨骼系数(即第二骨骼数据)。For example, the design of the mouth shape database of the facial model library is as follows: the data structure of the mouth shape data LIP is [LipID, BlendShape L , Skeleton L ]; LipID represents the mouth shape ID of the mouth shape data, and multiple phonemes can have the same mouth shape. That is, the same mouth shape ID. For example, the phonemes "o" and "ao" have similar mouth shapes and can correspond to the same LipID, thereby reducing the amount of data based on precise driving; BlendShape L represents a set of BlendShape coefficients corresponding to the mouth shape data (i.e. the second BlendShape data), Skeleton L represents the facial skeleton coefficient corresponding to the mouth shape data (i.e. the second skeleton data).
例如，面部模型库的表情数据库设计如下：表情数据Emotion的数据结构为[EmoID, BlendShapeE, SkeletonE]；EmoID表示表情数据的表情ID。例如，EmoID=0表示微笑，EmoID=1表示大笑，EmoID=2表示忧伤，EmoID=3表示恐惧，EmoID=4表示愤怒等，支持扩展；BlendShapeE表示表情数据对应的一组BlendShape系数（即第一BlendShape数据），SkeletonE表示表情数据对应的面部骨骼系数（即第一骨骼数据）。For example, the expression database of the facial model library is designed as follows: the data structure of the expression data Emotion is [EmoID, BlendShapeE, SkeletonE]; EmoID represents the expression ID of the expression data. For example, EmoID=0 means smiling, EmoID=1 means laughing, EmoID=2 means sadness, EmoID=3 means fear, EmoID=4 means anger, and so on, and extension is supported; BlendShapeE represents a set of BlendShape coefficients corresponding to the expression data (i.e., the first BlendShape data), and SkeletonE represents the facial skeleton coefficients corresponding to the expression data (i.e., the first skeletal data).
BlendShapeE=BE×dbE+bbE
For example, BlendShape E data includes a set of overall expression benchmarks (or expression components), and the BlendShape E data of the facial expression e at a certain moment is the linear weighting of this set of expression components:
BlendShape E = B E × d bE + b bE
BE是一组表情基准,dbE是对应的权重系数,bbE是初始表情(如区别于负性表情和正性表情的中性表情)。B E is a set of expression benchmarks, d bE is the corresponding weight coefficient, and b bE is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions).
例如,随着BlendShapeE数据的变化,数字人的骨骼数据也应该相应变化,如在数字人说话时,下巴和脸部相关的骨骼点也发生位移。因此需要对BlendShapeE数据进行骨骼修正处理,从而使驱动效果更加精准和真实。骨骼修正处理后的人脸表情e为:
e=BlendShapeE+SkeletonE=(BE×dbE+bbE)+(SE×dSE+bSE)
For example, as the BlendShape E data changes, the digital human's skeletal data should also change accordingly. For example, when the digital human speaks, the skeletal points related to the jaw and face also shift. Therefore, it is necessary to perform bone correction processing on the BlendShape E data to make the driving effect more accurate and realistic. The facial expression e after skeleton correction is:
e=BlendShape E +Skeleton E =(B E ×d bE +b bE )+(S E ×d SE +b SE )
SE是一组骨骼基准(或骨骼分量),dSE是对应的骨骼系数(即权重系数),bSE是初始骨骼(如区别于负性骨骼和正性骨骼的中性骨骼)。骨骼系数表示表情从中性骨骼变化到目标骨骼的一组骨骼分量的线性混合系数。S E is a set of bone benchmarks (or bone components), d SE is the corresponding bone coefficient (i.e., weight coefficient), and b SE is the initial bone (such as the neutral bone that is different from negative bones and positive bones). The bone coefficient represents the linear blending coefficient of a set of bone components that changes the expression from a neutral bone to a target bone.
例如,骨骼修正处理后的口型l为:
l=BlendShapeL+SkeletonL=(BL×dbL+bbL)+(SL×dSL+bSL)
For example, the mouth shape l after bone correction is:
l=BlendShape L +Skeleton L =(B L ×d bL +b bL )+(S L ×d SL +b SL )
BL是一组表情基准,dbL是对应的权重系数,bbL是初始表情(如区别于负性表情和正性表情的中性表情),SL是一组骨骼基准(或骨骼分量),dSL是对应的骨骼系数(即权重系数),bSL是初始骨骼(如区别于负性骨骼和正性骨骼的中性骨骼)。骨骼系数表示口型从中性骨骼变化到目标骨骼的一组骨骼分量的线性混合系数。B L is a set of expression benchmarks, d bL is the corresponding weight coefficient, b bL is the initial expression (such as a neutral expression that is different from negative expressions and positive expressions), S L is a set of skeleton benchmarks (or bone components), d SL is the corresponding bone coefficient (i.e., weight coefficient), and b SL is the initial bone (such as the neutral bone that is different from negative bones and positive bones). The bone coefficient represents the linear blending coefficient of a set of bone components that changes the mouth shape from a neutral bone to a target bone.
例如,语义引擎识别用户语音,输出数字人回答音频;语义引擎的状态机,根据数字人回答文本的情绪,输出对应的表情数据的EmoID和初始权重Weight。由于数字人的情绪不是固定不变的,因此状态机还输出表情数据的TimeStamp(时间戳),以保证不同微表情的变化。状态机根据数字人回答文本的音素,输出对应的口型数据的LipID和初始权重Weight。由于数字人每个字的发音不是相同频率的,因此状态机还输出口型数据的TimeStamp,以保证口型和音频的同步。For example, the semantic engine recognizes the user's voice and outputs the audio of the digital person's answer; the state machine of the semantic engine outputs the EmoID and initial weight of the corresponding expression data based on the emotion of the digital person's answer text. Since the emotions of digital people are not fixed, the state machine also outputs the TimeStamp of the expression data to ensure changes in different micro-expressions. The state machine outputs the LipID and initial weight of the corresponding mouth shape data based on the phonemes of the text answered by the digital person. Since the pronunciation of each word of the digital person is not at the same frequency, the state machine also outputs the TimeStamp of the mouth shape data to ensure the synchronization of the mouth shape and audio.
在一些实施例中，渲染引擎在缓存池缓存面部模型库的数据；根据随机权重计算方法，基于阈值随机计算面部模型的实际权重，实现微表情的动态表达；根据实际权重进行驱动数据的平滑处理，并将表情和口型数据进行融合，实现数字人整体面部的驱动；同步播放语音。In some embodiments, the rendering engine caches the data of the facial model library in a cache pool; according to the random weight calculation method, the actual weight of the facial model is randomly calculated based on a threshold to achieve dynamic expression of micro-expressions; the driving data is smoothed according to the actual weight, and the expression and mouth shape data are fused to drive the digital human's whole face; the voice is played synchronously.
在一些实施例中,以表情数据为例,语义引擎的状态机向渲染引擎发送表情数据的EmoID、初始权重W和时间戳TS;渲染引擎的随机权重计算模块,根据EmoID匹配缓存池中对应的表情数据e,反映数字人的基本情绪:
e=BlendShapeE+SkeletonE=(BE×dbE+bbE)+(SE×dSE+bSE)
In some embodiments, taking expression data as an example, the state machine of the semantic engine sends the EmoID, initial weight W and timestamp TS of the expression data to the rendering engine; the random weight calculation module of the rendering engine matches the corresponding data in the cache pool based on the EmoID. Expression data e reflects the basic emotions of digital people:
e=BlendShape E +Skeleton E =(B E ×d bE +b bE )+( SE ×d SE +b SE )
例如,数字人的表情不应是静止僵硬、一成不变的,因此随机权重计算模块配置有阈值T,作为增量反映数字人表情的动态变化范围;随机权重计算模块计算最大权重W+T和最小权重W-T,并每隔时间间隔I生成一个随机数R作为实际权重:
W-T<R<W+T
For example, the expression of the digital human should not be static, stiff, and unchanging, so the random weight calculation module is configured with a threshold T, which reflects the dynamic range of the digital human expression as an increment; the random weight calculation module calculates the maximum weight W+T and the minimum weight WT, and generate a random number R every time interval I as the actual weight:
W-T<R<W+T
例如,R的范围在最大权重和最小权重之间,作为新的表情权重对e进行加权,生成关键帧的表情数据:
e(I)=[(BE×dbE+bbE)+(SE×dSE+bSE)]×R
For example, the range of R is between the maximum weight and the minimum weight, and e is weighted as a new expression weight to generate key frame expression data:
e(I)=[(B E ×d bE +b bE )+(S E ×d SE +b SE )]×R
例如，对于相邻的两个关键帧的表情数据e(I)和e(J)，采用最小二乘法对表情数据进行平滑处理，得到非关键帧的表情数据；将表情和口型数据进行融合，并按照时间戳TS，实现数字人整体面部的动态驱动。For example, for the expression data e(I) and e(J) of two adjacent key frames, the least squares method is used to smooth the expression data to obtain the expression data of the non-key frames; the expression and mouth shape data are then fused, and the dynamic driving of the digital human's whole face is realized according to the timestamp TS.
在一些实施例中,数字人智能驱动交互流程如下。In some embodiments, the digital human intelligence drives the interaction process as follows.
在步骤1中,设置基本表情、基本口型的BlendShape数据模型。每个基本数据模型由模型ID唯一标识。In step 1, set the BlendShape data model of basic expressions and basic mouth shapes. Each base data model is uniquely identified by a model ID.
在步骤2中,执行骨骼修正处理,向数据模型中增加骨骼点的系数,形成修正后的面部模型文本。In step 2, a bone correction process is performed, and the coefficients of bone points are added to the data model to form a corrected facial model text.
在步骤3中,渲染引擎初始化时读取面部模型库,将数据以JSON文本格式加载至缓存池缓存。In step 3, the rendering engine reads the facial model library when it is initialized, and loads the data into the cache pool in JSON text format.
在步骤4中,当用户向数字人发起语音交互时,语音识别模块进行用户语义及情感分析。In step 4, when the user initiates voice interaction with the digital human, the speech recognition module performs user semantic and emotional analysis.
在步骤5中,回答文本模块存有智能问答库,根据用户语义和情感得到对应的回答文本。In step 5, the answer text module stores an intelligent question and answer library and obtains corresponding answer text based on user semantics and emotions.
在步骤6中,自然语言处理模块对回答文本进行情感分析和音素提取。In step 6, the natural language processing module performs sentiment analysis and phoneme extraction on the answer text.
在步骤7中,语音合成模块将分词后的回答文本合成为音频数据。In step 7, the speech synthesis module synthesizes the segmented answer text into audio data.
在步骤8中,状态机将表情ID或口型ID、权重、时间戳、音频等数据发送到渲染引擎。 In step 8, the state machine sends expression ID or lip shape ID, weight, timestamp, audio and other data to the rendering engine.
在步骤9中,随机权重计算模块根据ID匹配缓存池中对应的基本模型,基于随机数生成关键帧。In step 9, the random weight calculation module matches the corresponding basic model in the cache pool based on the ID and generates key frames based on random numbers.
在步骤10中,平滑处理模块基于最小二乘法对关键帧进行平滑处理。In step 10, the smoothing module smoothes the key frames based on the least squares method.
在步骤11中,表情融合模块将表情和口型数据进行融合,并按照时间戳实现动态驱动。In step 11, the expression fusion module fuses the expression and mouth shape data and implements dynamic driving according to the timestamp.
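A compressed, self-contained sketch of how steps 8 through 11 could fit together; the cache-pool layout, channel counts, interval, threshold, and random-number usage are all illustrative assumptions rather than the exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Step 8 (assumed message format): the state machine sends an ID, weight, and timestamp.
event = {"EmoID": 0, "Weight": 0.8, "TimeStamp": 0.0}
# Step 9: match the cached basic model by ID and build keyframes with random actual weights.
cache_pool = {"emotion": {0: np.array([0.9, 0.3, 0.1, 0.4])}}        # toy BlendShape+skeleton channels
pose = cache_pool["emotion"][event["EmoID"]]
key_times = np.arange(0.0, 1.0, 0.25)
key_values = np.stack([pose * rng.uniform(event["Weight"] - 0.1, event["Weight"] + 0.1)
                       for _ in key_times])
# Step 10: least-squares smoothing to per-frame values between the keyframes.
frame_times = np.linspace(key_times[0], key_times[-1], 16)
frames = np.stack([np.polyval(np.polyfit(key_times, key_values[:, c], 2), frame_times)
                   for c in range(key_values.shape[1])], axis=1)
# Step 11: the expression track would then be fused with the mouth-shape track by timestamp.
```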
上述实施例中,提出了随机权重计算方法,基于状态机和阈值随机计算面部关键帧权重,并采用最小二乘法对关键帧进行平滑处理,实现微表情的动态表达;提出骨骼修正规则,基于BlendShape和骨骼加权实现表情精准驱动,从而提升数字人真实感和驱动精度。In the above embodiment, a random weight calculation method is proposed, which randomly calculates the weight of facial key frames based on a state machine and a threshold, and uses the least squares method to smooth the key frames to achieve dynamic expression of micro-expressions; a bone correction rule is proposed, based on BlendShape and bone weighting to achieve accurate expression driving, thereby improving the realism and driving accuracy of digital human beings.
利用缓存池存储面部基本数据，利用随机权重计算方法实现动态微表情，不需占用带宽发送大量重复数据，也不需大量存储动画资产，减少带宽、硬件资源的额外开销，提高性能和实时性，降低资源开销。A cache pool is used to store basic facial data, and the random weight calculation method is used to realize dynamic micro-expressions; there is no need to occupy bandwidth sending a large amount of repeated data, nor to store a large number of animation assets, which reduces the extra bandwidth and hardware overhead, improves performance and real-time behavior, and lowers resource consumption.
构建面部模型库存储基本表情,模型ID唯一标识不同表情和口型,基于JSON文本轻量化导入缓存池进行数据缓存,实现数据的高效复用和扩展。不需要在渲染引擎中事先内置基本口型和基本表情,提高系统扩展性,降低人工成本。Build a facial model library to store basic expressions. Model ID uniquely identifies different expressions and mouth shapes. Import data into a cache pool based on JSON text to cache data, and achieve efficient reuse and expansion of data. It is not necessary to build basic mouth shapes and basic expressions into the rendering engine in advance, which improves system scalability and reduces labor costs.
图4示出本公开的动态影像的生成装置的一些实施例的框图。FIG. 4 shows a block diagram of some embodiments of the dynamic image generation device of the present disclosure.
如图4所示，动态影像的生成装置4包括：语义引擎模块41，用于根据用户语音，确定回应信息对应的特征信息，根据特征信息，确定回应信息对应的特征数据，特征数据根据特征信息对应的BlendShape数据和骨骼数据确定；渲染引擎模块42，用于根据特征数据，生成回应信息对应的动态影像。As shown in Figure 4, the dynamic image generation device 4 includes: a semantic engine module 41, configured to determine, based on the user's voice, the characteristic information corresponding to the response information, and to determine, based on the characteristic information, the characteristic data corresponding to the response information, the characteristic data being determined from the BlendShape data and skeletal data corresponding to the characteristic information; and a rendering engine module 42, configured to generate, based on the characteristic data, the dynamic image corresponding to the response information.
在一些实施例中,生成装置4还包括:面部模型库43,用于存储多个特征数据。In some embodiments, the generation device 4 further includes: a facial model library 43 for storing multiple feature data.
在一些实施例中，特征信息包括情绪信息或发音信息中的至少一项，特征数据包括表情数据或口型数据中的至少一项，语义引擎模块41执行下面的至少一项：根据情绪信息，确定表情数据，表情数据根据情绪信息对应的第一BlendShape数据和第一骨骼数据确定；或者，根据发音信息，确定口型数据，口型数据根据发音信息对应的第二BlendShape数据和第二骨骼数据确定。In some embodiments, the characteristic information includes at least one of emotion information or pronunciation information, the characteristic data includes at least one of expression data or mouth shape data, and the semantic engine module 41 performs at least one of the following: determining the expression data based on the emotion information, the expression data being determined from the first BlendShape data and the first skeletal data corresponding to the emotion information; or determining the mouth shape data based on the pronunciation information, the mouth shape data being determined from the second BlendShape data and the second skeletal data corresponding to the pronunciation information.
在一些实施例中,骨骼数据根据初始骨骼数据和多个骨骼数据分量的加权和确定。In some embodiments, the skeletal data is determined based on the initial skeletal data and a weighted sum of multiple skeletal data components.
在一些实施例中，语义引擎模块41根据特征信息，确定特征数据的初始权重；渲染引擎模块42在根据初始权重和阈值确定的取值范围内，分别随机生成多个关键帧的对应时刻的实际权重，利用多个实际权重分别对特征数据进行加权，以生成多个关键帧；根据多个关键帧，生成动态影像。In some embodiments, the semantic engine module 41 determines the initial weight of the feature data based on the feature information; the rendering engine module 42 randomly generates the actual weights for the corresponding moments of multiple key frames within the value range determined by the initial weight and the threshold, weights the feature data with the multiple actual weights to generate the multiple key frames, and generates the dynamic image based on the multiple key frames.
在一些实施例中，渲染引擎模块42对相邻关键帧对应的加权后的特征数据进行平滑处理，以生成相邻关键帧之间的非关键帧；根据关键帧和非关键帧及其时间戳，按照时间顺序生成动态影像。In some embodiments, the rendering engine module 42 smooths the weighted feature data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames, and generates the dynamic image in chronological order according to the key frames, the non-key frames, and their timestamps.
在一些实施例中,语义引擎模块41根据特征信息,确定特征数据的时间戳;渲染引擎模块生成时间戳的对应时刻的实际权重。In some embodiments, the semantic engine module 41 determines the timestamp of the feature data according to the feature information; the rendering engine module generates the actual weight of the time corresponding to the timestamp.
在一些实施例中,语义引擎模块41利用其中的状态机,根据特征信息,确定特征数据对应的标识信息;渲染引擎模块42根据状态机发送的标识信息,从缓存池中获取特征数据。In some embodiments, the semantic engine module 41 uses the state machine therein to determine the identification information corresponding to the characteristic data based on the characteristic information; the rendering engine module 42 obtains the characteristic data from the cache pool according to the identification information sent by the state machine.
在一些实施例中,在初始化的过程中,渲染引擎模块42从面部模型库中读取多个特征数据,以JSON文本格式,将多个特征数据加载到缓存池。In some embodiments, during the initialization process, the rendering engine module 42 reads multiple feature data from the facial model library and loads the multiple feature data into the cache pool in JSON text format.
在一些实施例中，语义引擎模块41在用户发起语音交互的情况下，对用户语音进行语义分析和情感分析；根据分析结果，在问答库中确定回应文本，对回应文本进行情感分析或音素提取中的至少一项处理，确定特征信息。In some embodiments, when the user initiates voice interaction, the semantic engine module 41 performs semantic analysis and sentiment analysis on the user's voice; based on the analysis results, it determines the response text in the question-and-answer library and performs at least one of sentiment analysis or phoneme extraction on the response text to determine the characteristic information.
在一些实施例中,BlendShape数据根据初始BlendShape数据和多个BlendShape数据分量的加权和确定。In some embodiments, the BlendShape data is determined based on the weighted sum of the initial BlendShape data and multiple BlendShape data components.
图5示出本公开的动态影像的生成装置的另一些实施例的框图。FIG. 5 shows a block diagram of another embodiment of the dynamic image generating device of the present disclosure.
如图5所示，该实施例的装置5包括：存储器51以及耦接至该存储器51的处理器52，处理器52被配置为基于存储在存储器51中的指令，执行本公开中任意一个实施例中的动态影像的生成方法。As shown in Figure 5, the device 5 of this embodiment includes: a memory 51 and a processor 52 coupled to the memory 51, the processor 52 being configured to execute, based on instructions stored in the memory 51, the dynamic image generation method of any one of the embodiments of the present disclosure.
其中,存储器51例如可以包括系统存储器、固定非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序Boot Loader、数据库以及其他程序等。The memory 51 may include, for example, a system memory, a fixed non-volatile storage medium, etc. The system memory may store, for example, an operating system, an application program, a boot loader, a database, and other programs.
图6示出本公开的动态影像的生成装置的又一些实施例的框图。FIG. 6 shows a block diagram of some further embodiments of the dynamic image generating device of the present disclosure.
如图6所示，该实施例的动态影像的生成装置6包括：存储器610以及耦接至该存储器610的处理器620，处理器620被配置为基于存储在存储器610中的指令，执行前述任意一个实施例中的动态影像的生成方法。As shown in Figure 6, the dynamic image generation device 6 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 being configured to execute, based on instructions stored in the memory 610, the dynamic image generation method of any one of the foregoing embodiments.
存储器610例如可以包括系统存储器、固定非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序Boot Loader以及其他程序等。The memory 610 may include, for example, a system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
动态影像的生成装置6还可以包括输入输出接口630、网络接口640、存储接口650等。这些接口630、640、650以及存储器610和处理器620之间例如可以通过总线660连接。其中,输入输出接口630为显示器、鼠标、键盘、触摸屏、麦克、音箱等输入输出设备提供连接接口。网络接口640为各种联网设备提供连接接口。存储接口650为SD卡、U盘等外置存储设备提供连接接口。The moving image generating device 6 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610 and the processor 620 may be connected through a bus 660, for example. Among them, the input and output interface 630 provides connection interfaces for input and output devices such as monitors, mice, keyboards, touch screens, microphones, and speakers. Network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB disks.
本领域内的技术人员应当明白,本公开的实施例可提供为方法、系统、或计算机程序产品。因此,本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本公开可采用在一个或多个其中包含有计算机可用程序代码的计算机可用非瞬时性存储介质包括但不限于磁盘存储器、CD-ROM、光学存储器等上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present disclosure may be provided as methods, systems, or computer program products. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media including, but not limited to, disk memory, CD-ROM, optical storage, and the like having computer-usable program code embodied therein.
至此,已经详细描述了根据本公开的动态影像的生成方法、动态影像的生成装置和非易失性计算机可读存储介质。为了避免遮蔽本公开的构思,没有描述本领域所公知的一些细节。本领域技术人员根据上面的描述,完全可以明白如何实施这里公开的技术方案。So far, the method for generating a dynamic image, the device for generating a dynamic image, and the non-volatile computer-readable storage medium according to the present disclosure have been described in detail. To avoid obscuring the concepts of the present disclosure, some details that are well known in the art have not been described. Based on the above description, those skilled in the art can completely understand how to implement the technical solution disclosed here.
可能以许多方式来实现本公开的方法和系统。例如,可通过软件、硬件、固件或者软件、硬件、固件的任何组合来实现本公开的方法和系统。用于方法的步骤的上述顺序仅是为了进行说明,本公开的方法的步骤不限于以上具体描述的顺序,除非以其它方式特别说明。此外,在一些实施例中,还可将本公开实施为记录在记录介质中的程序,这些程序包括用于实现根据本公开的方法的机器可读指令。因而,本公开还覆盖存储用于执行根据本公开的方法的程序的记录介质。The methods and systems of the present disclosure may be implemented in many ways. For example, the methods and systems of the present disclosure may be implemented through software, hardware, firmware, or any combination of software, hardware, and firmware. The above order for the steps of the methods is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. Furthermore, in some embodiments, the present disclosure can also be implemented as programs recorded in recording media, and these programs include machine-readable instructions for implementing methods according to the present disclosure. Thus, the present disclosure also covers recording media storing programs for executing methods according to the present disclosure.
虽然已经通过示例对本公开的一些特定实施例进行了详细说明,但是本领域的技术人员应该理解,以上示例仅是为了进行说明,而不是为了限制本公开的范围。本领域的技术人员应该理解,可在不脱离本公开的范围和精神的情况下,对以上实施例进行修改。本公开的范围由所附权利要求来限定。 Although some specific embodiments of the present disclosure have been described in detail through examples, those skilled in the art will understand that the above examples are for illustration only and are not intended to limit the scope of the disclosure. It should be understood by those skilled in the art that the above embodiments may be modified without departing from the scope and spirit of the present disclosure. The scope of the disclosure is defined by the appended claims.

Claims (16)

  1. 一种动态影像的生成方法,包括:A method for generating dynamic images, including:
    根据用户语音,确定回应信息对应的特征信息;Determine the characteristic information corresponding to the response information based on the user's voice;
    根据所述特征信息,确定所述回应信息对应的特征数据,所述特征数据根据所述特征信息对应的混合形状BlendShape数据和骨骼数据确定;Determine the characteristic data corresponding to the response information according to the characteristic information, and determine the characteristic data according to the BlendShape data and bone data corresponding to the characteristic information;
    根据所述特征数据,生成所述回应信息对应的动态影像。According to the characteristic data, a dynamic image corresponding to the response information is generated.
  2. 根据权利要求1所述的生成方法,其中,所述特征信息包括情绪信息或发音信息中的至少一项,所述特征数据包括表情数据或口型数据中的至少一项,The generation method according to claim 1, wherein the feature information includes at least one of emotion information or pronunciation information, and the feature data includes at least one of expression data or mouth shape data,
    所述根据所述特征信息,确定所述回应信息对应的特征数据包括下面的至少一项:According to the characteristic information, it is determined that the characteristic data corresponding to the response information includes at least one of the following:
    根据所述情绪信息,确定所述表情数据,所述表情数据根据所述情绪信息对应的第一BlendShape数据和第一骨骼数据确定;或者Determine the expression data according to the emotion information, and determine the expression data according to the first BlendShape data and the first skeleton data corresponding to the emotion information; or
    根据所述发音信息,确定所述口型数据,所述口型数据根据所述发音信息对应的第二BlendShape数据和第二骨骼数据确定。The mouth shape data is determined based on the pronunciation information, and the mouth shape data is determined based on the second BlendShape data and the second skeleton data corresponding to the pronunciation information.
  3. 根据权利要求1或2所述的生成方法,其中,所述骨骼数据根据初始骨骼数据和多个骨骼数据分量的加权和确定。The generating method according to claim 1 or 2, wherein the skeletal data is determined based on the weighted sum of initial skeletal data and a plurality of skeletal data components.
  4. 根据权利要求1-3任一项所述的生成方法,还包括:The generation method according to any one of claims 1-3, further comprising:
    根据所述特征信息,确定所述特征数据的初始权重;Determine the initial weight of the feature data according to the feature information;
    其中,所述根据所述特征数据,生成所述回应信息对应的动态影像包括:Wherein, generating a dynamic image corresponding to the response information according to the characteristic data includes:
    在根据所述初始权重和阈值确定的取值范围内,分别随机生成多个关键帧的对应时刻的实际权重;Within the value range determined according to the initial weight and threshold, the actual weights of the corresponding moments of multiple key frames are randomly generated;
    利用多个实际权重分别对所述特征数据进行加权,以生成所述多个关键帧;Using multiple actual weights to weight the feature data respectively to generate the multiple key frames;
    根据所述多个关键帧,生成所述动态影像。The dynamic image is generated according to the plurality of key frames.
  5. 根据权利要求4所述的生成方法,其中,取值范围包括大于所述初始权重与所述阈值之差且小于所述初始权重与所述阈值之和的值。 The generation method according to claim 4, wherein the value range includes values greater than the difference between the initial weight and the threshold and less than the sum of the initial weight and the threshold.
  6. 根据权利要求4或5所述的生成方法,其中,所述根据所述多个关键帧,生成所述动态影像包括:The generation method according to claim 4 or 5, wherein generating the dynamic image according to the plurality of key frames includes:
    对相邻关键帧对应的加权后的特征数据进行平滑处理,以生成所述相邻关键帧之间的非关键帧;Smoothing the weighted feature data corresponding to adjacent key frames to generate non-key frames between the adjacent key frames;
    根据关键帧和非关键帧及其时间戳,按照时间顺序生成所述动态影像。The dynamic image is generated in chronological order according to key frames and non-key frames and their timestamps.
  7. 根据权利要求4-6任一项所述的生成方法,还包括:The generation method according to any one of claims 4-6, further comprising:
    根据所述特征信息,确定所述特征数据的时间戳;Determine the timestamp of the characteristic data according to the characteristic information;
    其中，所述在根据所述初始权重和阈值确定的取值范围内，分别随机生成多个关键帧的对应时刻的实际权重包括：wherein the randomly generating, within the value range determined according to the initial weight and the threshold, the actual weights at the corresponding moments of the multiple key frames comprises:
    生成所述时间戳的对应时刻的实际权重。The actual weight of the corresponding moment of the timestamp is generated.
  8. 根据权利要求1-7任一项所述的生成方法,其中,所述根据所述特征信息,确定所述回应信息对应的特征数据包括:The generation method according to any one of claims 1 to 7, wherein determining, based on the characteristic information, characteristic data corresponding to the response information comprises:
    利用语义引擎中的状态机,根据所述特征信息,确定所述特征数据对应的标识信息;Utilize the state machine in the semantic engine to determine the identification information corresponding to the characteristic data based on the characteristic information;
    利用渲染引擎,根据状态机发送的所述标识信息,从缓存池中获取所述特征数据。The rendering engine is used to obtain the feature data from the cache pool according to the identification information sent by the state machine.
  9. 根据权利要求8所述的生成方法,还包括:The generation method according to claim 8, further comprising:
    在初始化的过程中,利用所述渲染引擎,从面部模型库中读取多个特征数据;During the initialization process, the rendering engine is used to read multiple feature data from the facial model library;
    利用所述渲染引擎,以JSON文本格式,将所述多个特征数据加载到所述缓存池。The rendering engine is used to load the plurality of feature data into the cache pool in JSON text format.
  10. 根据权利要求1-9任一项所述的生成方法,其中,根据用户语音,确定回应信息对应的特征信息包括:The generation method according to any one of claims 1-9, wherein determining the characteristic information corresponding to the response information according to the user's voice includes:
    在用户发起语音交互的情况下,对所述用户语音进行语义分析和情感分析;When the user initiates voice interaction, perform semantic analysis and emotional analysis on the user's voice;
    根据分析结果,在问答库中确定回应文本;Based on the analysis results, determine the response text in the question and answer database;
    对所述回应文本进行情感分析或音素提取中的至少一项处理,确定所述特征信息。Perform at least one of sentiment analysis and phoneme extraction on the response text to determine the feature information.
  11. 根据权利要求1-10任一项所述的生成方法,其中,所述BlendShape数据根据初始BlendShape数据和多个BlendShape数据分量的加权和确定。 The generation method according to any one of claims 1 to 10, wherein the BlendShape data is determined based on a weighted sum of initial BlendShape data and multiple BlendShape data components.
  12. 一种动态影像的生成装置,包括:A device for generating dynamic images, including:
    语义引擎模块，用于根据用户语音，确定回应信息对应的特征信息，根据所述特征信息，确定所述回应信息对应的特征数据，所述特征数据根据所述特征信息对应的混合形状BlendShape数据和骨骼数据确定；a semantic engine module, configured to determine, based on a user's voice, characteristic information corresponding to response information, and to determine, based on the characteristic information, characteristic data corresponding to the response information, the characteristic data being determined from blend shape (BlendShape) data and skeletal data corresponding to the characteristic information;
    渲染引擎模块,用于根据所述特征数据,生成所述回应信息对应的动态影像。A rendering engine module is used to generate dynamic images corresponding to the response information according to the characteristic data.
  13. 根据权利要求12所述的生成装置,还包括:The generating device according to claim 12, further comprising:
    面部模型库,用于存储多个特征数据。Facial model library for storing multiple feature data.
  14. 一种动态影像的生成装置,包括:A dynamic image generating device, comprising:
    存储器;和memory; and
    耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器中的指令,执行权利要求1-11任一项所述的动态影像的生成方法。A processor coupled to the memory, the processor being configured to execute the dynamic image generation method according to any one of claims 1-11 based on instructions stored in the memory.
  15. 一种非易失性计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现权利要求1-11任一项所述的动态影像的生成方法。A non-volatile computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the dynamic image generation method described in any one of claims 1-11 is implemented.
  16. 一种计算机程序，包括：A computer program, comprising:
    指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1-11任一项所述的动态影像的生成方法。 Instructions, which when executed by a processor, cause the processor to execute the dynamic image generation method according to any one of claims 1-11.
PCT/CN2023/112565 2022-09-20 2023-08-11 Dynamic image generation method and device WO2024060873A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211141405.8 2022-09-20
CN202211141405.8A CN115529500A (en) 2022-09-20 2022-09-20 Method and device for generating dynamic image

Publications (1)

Publication Number Publication Date
WO2024060873A1 true WO2024060873A1 (en) 2024-03-28

Family

ID=84697278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/112565 WO2024060873A1 (en) 2022-09-20 2023-08-11 Dynamic image generation method and device

Country Status (2)

Country Link
CN (1) CN115529500A (en)
WO (1) WO2024060873A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115529500A (en) * 2022-09-20 2022-12-27 中国电信股份有限公司 Method and device for generating dynamic image

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111292427B (en) * 2020-03-06 2021-01-01 腾讯科技(深圳)有限公司 Bone displacement information acquisition method, device, equipment and storage medium
CN111445561B (en) * 2020-03-25 2023-11-17 北京百度网讯科技有限公司 Virtual object processing method, device, equipment and storage medium
CN111443852A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Digital human action control method and device, electronic equipment and storage medium
CN112270734B (en) * 2020-10-19 2024-01-26 北京大米科技有限公司 Animation generation method, readable storage medium and electronic equipment
CN113822967A (en) * 2021-02-09 2021-12-21 北京沃东天骏信息技术有限公司 Man-machine interaction method, device, system, electronic equipment and computer medium
CN113763518A (en) * 2021-09-09 2021-12-07 北京顺天立安科技有限公司 Multi-mode infinite expression synthesis method and device based on virtual digital human
CN113538636B (en) * 2021-09-15 2022-07-01 中国传媒大学 Virtual object control method and device, electronic equipment and medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090231347A1 (en) * 2008-03-11 2009-09-17 Masanori Omote Method and Apparatus for Providing Natural Facial Animation
CN113781610A (en) * 2021-06-28 2021-12-10 武汉大学 Virtual face generation method
CN113643413A (en) * 2021-08-30 2021-11-12 北京沃东天骏信息技术有限公司 Animation processing method, animation processing device, animation processing medium and electronic equipment
CN114219880A (en) * 2021-12-16 2022-03-22 网易(杭州)网络有限公司 Method and device for generating expression animation
CN115529500A (en) * 2022-09-20 2022-12-27 中国电信股份有限公司 Method and device for generating dynamic image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI, YAN: "Research on the Application of 3D Skin Technology in Facial Animation", Journal of Weinan Normal University, vol. 32, no. 12, 20 June 2017, pages 27-31, XP009553301, ISSN: 1009-5128 *
LI, QING ET AL.: "Orthogonal-Blendshape-Based Editing System for Facial Motion Capture Data", IEEE Computer Graphics and Applications, vol. 28, no. 6, 11 November 2008, XP011237925, DOI: 10.1109/MCG.2008.120 *

Also Published As

Publication number Publication date
CN115529500A (en) 2022-12-27
