CN116524074A - Method, device, equipment and storage medium for generating digital human gestures - Google Patents
Method, device, equipment and storage medium for generating digital human gestures
- Publication number
- CN116524074A (application number CN202310296375.6A)
- Authority
- CN
- China
- Prior art keywords
- gesture
- generation model
- digital human
- generating
- action
- Prior art date
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The embodiment of the invention provides a method, a device, equipment and a storage medium for generating digital human gestures. The method includes: acquiring a target audio file for which a digital human gesture is to be generated; determining an action occurrence sequence corresponding to the target audio file based on a script generation model; and, based on the action occurrence sequence and a gesture generation model, controlling the synthesis of the generated representative gesture and rhythmic gesture into the digital human gesture corresponding to the target audio file. With the action occurrence sequence determined by the script generation model, digital human gesture synthesis synchronized with speech is effectively controlled. Gestures are decoupled and modeled to obtain a representative gesture generation model and a rhythmic gesture generation model, and the representative gestures and rhythmic gestures produced by these models are combined, so that more natural and richer gestures can be generated and the resulting digital human gestures appear more realistic.
Description
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating digital human gestures.
Background
A digital human understands and analyzes external input through a recognition system, generates a feedback result for the driving signal, and synthesizes the corresponding digital human speech and behavior based on these decisions, thereby interacting with humans. The quality of digital human motion driving is a key factor affecting how lifelike a digital human appears. Gestures in particular have a strong auxiliary expressive effect and, as non-verbal information, can effectively reinforce expression.
Recent progress in deep learning has also advanced gesture generation technology, which now employs large-scale data sets and models the relationships among multiple modalities with deep neural networks. Most existing methods for generating digital human gestures rely on fixed rules that match gestures from a predefined database. Such rules require professional personnel and prior knowledge to design, and for complex speech scenes the generated results are not rich enough and lack realism and naturalness. The threshold for use is high and the results are often unsatisfactory.
Therefore, how to use existing large-scale data sets to generate digital human gestures that are realistic and natural has become a technical problem to be solved in the industry.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides a method, a device, equipment and a storage medium for generating digital human gestures.
In a first aspect, the present invention provides a method for digital human gesture generation, comprising:
acquiring a target audio file of a digital human gesture to be generated;
determining an action occurrence sequence corresponding to the target audio file based on a script generation model; the action occurrence sequence indicates whether a gesture action is present at each moment;
based on the action occurrence sequence and a gesture generation model, controlling the generated representative gesture and rhythmic gesture to be synthesized into a digital human gesture corresponding to the target audio file;
the script generation model is trained based on a training sample determined by a first video file with voice information and motion information; the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
Optionally, controlling, based on the action occurrence sequence and the gesture generation model, the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file includes:
Generating a representative gesture corresponding to the target audio file based on the first gesture generation model;
generating a rhythmic gesture corresponding to the target audio file based on the second gesture generation model;
based on the action occurrence sequence and a preset synthesis rule, fusing the representative gesture and the rhythmic gesture to obtain a digital human gesture corresponding to the target audio file; the preset synthesis rule is used for limiting the digital human gesture at any moment to be determined based on any one or combination of the representative gesture and the rhythmic gesture.
Optionally, the script generating model is trained based on training samples determined by a first video file having voice information and motion information, and the corresponding training method includes:
determining, for each current element in the initial training sample, the N elements preceding the current element and the M elements following the current element as one element of a first training sample; the initial training sample characterizes whether an action occurs in the first video file at different moments; the first video file comprises voice information and action information; M and N are positive integers;
based on the first training sample, completing training of the script generation model when a first loss function converges; the first loss function is determined based on a first mark corresponding to the first training sample and the prediction output after the first training sample is input into the script generation model.
Optionally, the initial training samples are training samples for characterizing whether actions occur in the first video file at different moments, and the corresponding acquisition method includes:
acquiring a first position sequence signal based on a first video file; the first position sequence signal is used for representing positions corresponding to all key points of human hand bones, human body bones and facial bones at different moments;
determining a starting position corresponding to any gesture represented by the first position sequence signal;
marking the target element with a first mark based on the distance between the target element in the first position sequence and the starting position and the distance between the element following the target element in the first position sequence and the starting position, and taking the marked first position sequence as the initial training sample; the target element is any element in the first position sequence; the first mark characterizes whether an action occurs for the target element.
Optionally, the determining the starting position corresponding to any gesture characterized by the first position sequence signal includes:
simplifying the first position sequence signal according to preset simplified gesture key points, and computing a position histogram of the simplified gesture key points;
and determining, according to a preset complete gesture format, the most frequent position in the histogram, and taking that position as the starting position corresponding to any gesture characterized by the first position sequence signal.
Optionally, the training method corresponding to the first gesture generation model includes:
extracting, based on the first mark, the samples in the initial training sample in which an action occurs, and taking them as a second training sample;
uniformly sampling the second training sample to obtain a third training sample with uniform length; each sample in the third training samples comprises L second training samples after uniform sampling; l is a positive integer;
and based on the third training sample, completing training of the first gesture generation model when a second loss function converges, the second loss function being determined based on the marks of the third training sample and the gestures reconstructed by the first gesture generation model.
Optionally, before the uniformly sampling the second training samples to obtain third training samples with uniform lengths, the method further includes:
determining whether key point data is lost for each sample in the second training samples;
and if any sample in the second training samples has missing key point data, repairing it by rotating and translating the adjacent training sample.
In a second aspect, the present invention also provides a device for generating a digital human gesture, including:
the acquisition module is used for acquiring a target audio file of the digital human gesture to be generated;
the determining module is used for determining an action occurrence sequence corresponding to the target audio file based on the script generation model; the action occurrence sequence indicates whether a gesture action is present at each moment;
the generation module is used for controlling, based on the action occurrence sequence and the gesture generation model, the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file;
the script generation model is trained based on a training sample determined by a first video file with voice information and motion information; the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
In a third aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of digital human gesture generation as described in the first aspect above when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of digital human gesture generation as described in the first aspect above.
In a fifth aspect, the invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method of digital human gesture generation as described in the first aspect above.
According to the method, device, equipment and storage medium for generating digital human gestures, the action occurrence sequence corresponding to the target audio file, determined by the script generation model, effectively controls digital human gesture synthesis under synchronized speech. The gestures are decoupled and modeled to obtain a representative gesture generation model and a rhythmic gesture generation model, and the representative gestures and rhythmic gestures obtained from these models are combined, so that more natural and richer gestures can be generated and the digital human gestures appear more realistic.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for digital human gesture generation provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific implementation flow of digital human gesture generation according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a device for generating digital human gestures according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an entity structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In the embodiments of the present invention, the term "and/or" describes an association relation between associated objects and indicates that three relations can exist; for example, "A and/or B" can mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The term "plurality" in embodiments of the present invention means two or more, and other adjectives are similar.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 is a flowchart of a method for generating a digital human gesture according to an embodiment of the present invention, as shown in FIG. 1, the method includes:
step 101, obtaining a target audio file of a digital human gesture to be generated;
specifically, the target audio file used for generating the digital human gesture may be an audio file obtained directly, an audio file extracted based on a video file, or an audio file obtained by converting other types of files. The format supported by the target audio file may be any audio format, such as MP3, MP5, WAV, flac, MIDI, RA, APE, AAC, CDA, MOV, etc. The target audio file is a file storing sound content. The duration of the audio file is not limited, and any duration of the audio file can be used.
Step 102, determining an action occurrence sequence corresponding to the target audio file based on a script generation model; the action occurrence sequence indicates whether a gesture action is present at each moment; the script generation model is trained based on training samples determined from a first video file having speech information and motion information.
The script generation model obtains its training samples by processing the first video file that has voice information and action information. The training samples mainly capture whether corresponding action information exists at different moments, establishing the association between the voice information and the action information. The voice information can be represented by an audio file extracted from the first video file, and the specific audio format is not limited. The script generation model is then trained on these samples. The first video file may be obtained from existing large-scale data sets such as TED, Speech2Gesture or the Trinity Gesture Dataset.
With the script generation model provided by the present application, determining whether a gesture action exists at each moment of the target audio file can be understood as generating an action occurrence sequence, or action script, which indicates whether an action is present at each moment. If an action is present at a moment, a corresponding gesture action is added at that moment; if no action is present, no gesture action is added. Whether an action exists at a given moment is determined by the trained script generation model from the target audio file; the script generation model is obtained by training on representative training samples.
Step 103, based on the action occurrence sequence and the gesture generation model, controlling the generated representative gesture and rhythmic gesture to be synthesized into a digital human gesture corresponding to the target audio file;
the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
The action occurrence sequence obtained by the script generation model indicates whether a gesture action exists at each moment. The first gesture generation model generates the corresponding representative gestures from the target audio file. Representative gestures usually follow a certain rule or conform to a complete format; for example, the action of raising a hand can be decomposed into several consecutive stage actions of the elbow, wrist and fingers, obtained by observing different hand joints. The second gesture generation model mainly generates the corresponding rhythmic gestures from the target audio file. Rhythmic gestures usually carry no substantive meaning and are habitual motions, such as slight swaying of the body or a rhythmic beat when music is playing. The representative gestures generated by the first gesture generation model also have a temporal character, that is, they can be represented as a representative gesture sequence indicating whether a representative gesture is present at each moment and, if so, which gesture it is. Likewise, the rhythmic gestures can be represented as a rhythmic gesture sequence indicating whether a rhythmic gesture is present at each moment and which gesture it is.
The digital human gesture finally synthesized for the target audio file is controlled by inserting the representative gesture, the rhythmic gesture or a combination of the two at moments where the action occurrence sequence indicates a gesture action, and by inserting no gesture (or another preset gesture) at moments where it does not.
According to the method for generating digital human gestures provided by the invention, the action occurrence sequence corresponding to the target audio file, determined by the script generation model, effectively controls digital human gesture synthesis under synchronized speech. The gestures are decoupled and modeled to obtain a representative gesture generation model and a rhythmic gesture generation model, and the representative gestures and rhythmic gestures obtained from these models are combined, so that more natural and richer gestures can be generated and the digital human gestures appear more realistic.
Optionally, controlling, based on the action occurrence sequence and the gesture generation model, the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file includes:
Generating a representative gesture corresponding to the target audio file based on the first gesture generation model;
generating a rhythmic gesture corresponding to the target audio file based on the second gesture generation model;
based on the action occurrence sequence and a preset synthesis rule, fusing the representative gesture and the rhythmic gesture to obtain a digital human gesture corresponding to the target audio file; the preset synthesis rule is used for limiting the digital human gesture at any moment to be determined based on any one or combination of the representative gesture and the rhythmic gesture.
Specifically, the target audio file is input into the trained first gesture generation model M_R to obtain the representative gesture P_R, and into the trained second gesture generation model M_P to obtain the rhythmic gesture P_B.
Through the action occurrence sequence and the preset synthesis rule, at moments where a gesture action exists the digital human gesture is determined by one of, or a combination of, the representative gesture P_R and the rhythmic gesture P_B at the same moment; at moments where no gesture action exists, the digital human gesture is determined by the representative gesture P_R, the rhythmic gesture P_B or another fixed preset gesture at the same moment, where the fixed preset gesture may be any one selected from all gestures.
By defining the action occurrence sequence obtained from the script generation model and the preset synthesis rule differently, the rules governing the generated digital human gestures can be customized, so that the generated digital human gestures can be controlled effectively.
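As an illustration only, the following Python sketch shows one way such a synthesis rule could be applied frame by frame; the function name, the per-frame keypoint representation, the blending weight and the fall-back to a fixed rest pose are assumptions made for this example and are not prescribed by the patent.

```python
import numpy as np

def fuse_gestures(action_script, rep_gesture, rhythm_gesture, rest_pose, w_rep=0.7):
    """Fuse per-frame representative and rhythmic gestures under an action script.

    action_script : (T,) array of 0/1 flags (1 = a gesture action occurs at this frame)
    rep_gesture   : (T, K, D) representative gesture keypoints per frame
    rhythm_gesture: (T, K, D) rhythmic gesture keypoints per frame
    rest_pose     : (K, D) fixed preset pose used where no action occurs
    w_rep         : blending weight for the representative gesture (assumed rule)
    """
    T = len(action_script)
    out = np.empty_like(rep_gesture)
    for t in range(T):
        if action_script[t] == 1:
            # a combination of the representative and rhythmic gestures
            out[t] = w_rep * rep_gesture[t] + (1.0 - w_rep) * rhythm_gesture[t]
        else:
            # no gesture action: fall back to a fixed preset (rest) pose
            out[t] = rest_pose
    return out
```

Other rules are equally possible, for example always preferring the representative gesture when one is available; the point is only that the fusion is driven by the action script.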
Optionally, the script generating model is trained based on training samples determined by a first video file having voice information and motion information, and the corresponding training method includes:
determining, for each current element in the initial training sample, the N elements preceding the current element and the M elements following the current element as one element of a first training sample; the initial training sample characterizes whether an action occurs in the first video file at different moments; the first video file comprises voice information and action information; M and N are positive integers;
based on the first training sample, completing training of the script generation model when a first loss function converges; the first loss function is determined based on a first mark corresponding to the first training sample and the prediction output after the first training sample is input into the script generation model.
Specifically, the script generation model is trained on training samples determined from a first video file having voice information and action information. First, a speaker video data set is collected as needed; text signals with time information are extracted, the audio signals in the speaker videos are extracted, and the position sequence signals of human skeleton key points in the speaker videos are extracted. The extracted text signals are time-aligned with the audio signals and combined with the human skeleton key point position sequence signals to form the model training data set, which may be divided into a training set and a test set. This data set can be understood as the initial training sample.
Time-aligning the extracted text signal with the audio signal includes:
aligning the text with the labels obtained from the speech using a forced-alignment (MFA) tool as needed, extracting a text signal with time information, and converting it into text features T_i through a Chinese text pre-training model; inputting the text features into a text semantic coding model to obtain semantic features; converting the extracted audio signal into spectral features S_i by calculation as needed; inputting the spectral features into an audio coding model to obtain rhythm features; and aligning the semantic features and the rhythm features in time.
Constructing the first training sample of the script generation model specifically includes: taking the N frames before and the M frames after the current frame of the human skeleton key point position sequence signal in the initial training sample, together with the correspondingly aligned audio signal, as one element of the first training sample. Here M and N are positive integers, which may be equal or different, and their specific values may be set according to actual requirements.
A regression model is constructed using a fully convolutional ResNet-101 network structure, and the action occurrence probability is obtained through a sigmoid activation function; this serves as the script generation model. The corresponding first loss function is determined based on the first mark corresponding to the first training sample and the prediction output after the first training sample is input into the script generation model, and takes the mean square error (MSE) form L_Script = (1/m) Σ_{j=1}^{m} (n_j − n'_j)², where L_Script denotes the first loss function, n_j the value indicating whether an action actually occurs for the current frame, n'_j the action occurrence value predicted by the script generation model, m the total number of data per batch during training, and j any one of the batch of training samples.
The first training sample is input into the script generation model M_A, and the parameters of the script generation model are adjusted by gradient back-propagation using the difference between the predicted action script and the real action script in the training sample. Training of the model is completed when the first loss function used for back-propagation converges, finally yielding the gesture action script n. The first loss function used for gradient back-propagation is the absolute distance between the predicted action script and the real action script.
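For concreteness, a simplified PyTorch-style sketch of one training step of such a script generation model is given below; the small convolutional stand-in for the ResNet-101 backbone, the feature dimensions and the optimizer settings are illustrative assumptions, not the exact configuration described above.

```python
import torch
import torch.nn as nn

# A stand-in for the fully convolutional regression network (the patent uses a
# ResNet-101-style backbone); the dimensions here are illustrative assumptions.
class ScriptModel(nn.Module):
    def __init__(self, in_channels=80, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=1),
        )

    def forward(self, x):                         # x: (batch, in_channels, frames)
        logits = self.backbone(x)                 # (batch, 1, frames)
        return torch.sigmoid(logits).squeeze(1)   # action occurrence probability n'

model = ScriptModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()                                 # L_Script: mean squared error

features = torch.randn(8, 80, 50)                  # one batch of windowed features (dummy)
labels = torch.randint(0, 2, (8, 50)).float()      # first mark n_j: 1 = action occurs

optimizer.zero_grad()
pred = model(features)                             # predicted action occurrence n'_j
loss = mse(pred, labels)                           # (1/m) * sum_j (n_j - n'_j)^2
loss.backward()                                    # gradient back-propagation
optimizer.step()
```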
The initial training sample is a training sample for representing whether actions occur in the first video file at different moments, and the corresponding acquisition method comprises the following steps:
acquiring a first position sequence signal based on a first video file; the first position sequence signal is used for representing positions corresponding to all key points of human hand bones, human body bones and facial bones at different moments;
determining a starting position corresponding to any gesture represented by the first position sequence signal;
marking the target element by adopting a first mark based on the distance between the target element in the first position sequence and the initial position and the distance between the next element of the target element in the first position sequence and the initial position, and taking the first mark as the initial training sample; the target element is any element in the first position sequence; the first flag is used to characterize whether an action occurs with respect to the target element.
Specifically, the method for acquiring the model training data set, i.e. the initial training sample, includes: based on the first video files having voice information and action information, extracting the audio signals in the speaker videos. Specifically, the audio file of each first video file is extracted and converted into mel-spectrogram features, which serve as the audio signal.
The method comprises the steps of extracting position sequence signals of human skeleton key points in a speaker video, namely obtaining a first position sequence signal, specifically extracting human skeleton key points in a first video file by using a Mediapipe tool, wherein the human skeleton key points specifically comprise 42 hand key points, 33 body skeleton key points and 468 face key points, and each skeleton key point can be represented by three-dimensional position coordinates.
The first position sequence signal can represent any gesture. The several continuous actions comprised by a gesture follow a corresponding rule, and each gesture has a corresponding starting position and ending position (a starting point and an ending point): the starting position represents the action with which a gesture begins, i.e. the moment at which the gesture starts, and the ending position represents the action with which the gesture ends.
The distance between a target element in the first position sequence and the starting position, and the distance between the element following the target element and the starting position, are determined. According to whether the two distances satisfy the rule for judging whether an action occurs for an element of the first position sequence, each element of the first position sequence is marked with a first mark, which indicates whether an action occurs for that element.
By marking in this way whether an action occurs for each element of the first position sequence, the marked first position sequence, i.e. the initial training sample, is obtained for further processing or screening by subsequent operations.
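A minimal sketch of this labelling step is shown below, assuming a simple distance threshold; the exact decision rule and threshold value are not spelled out in the text, so both are assumptions made for illustration.

```python
import numpy as np

def label_action_frames(simplified_seq, rest_pose, thresh):
    """Mark each frame of a simplified keypoint sequence with a first mark.

    simplified_seq : (T, K, D) simplified gesture keypoints per frame
    rest_pose      : (K, D) starting (rest) position found from the histogram
    thresh         : distance threshold separating "at rest" from "in motion" (assumed)

    Returns a (T,) array of 0/1 marks: 1 = an action occurs at this frame.
    This sketch simply marks a frame when either it or the following frame is
    far from the rest pose.
    """
    dists = np.linalg.norm(simplified_seq - rest_pose, axis=(1, 2))  # distance to rest
    marks = np.zeros(len(simplified_seq), dtype=np.int64)
    for t in range(len(simplified_seq) - 1):
        if dists[t] > thresh or dists[t + 1] > thresh:
            marks[t] = 1
    return marks
```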
Optionally, the determining the starting position corresponding to any gesture characterized by the first position sequence signal includes:
simplifying the first position sequence signal according to preset simplified gesture key points, and computing a position histogram of the simplified gesture key points;
and determining, according to a preset complete gesture format, the most frequent position in the histogram, and taking that position as the starting position corresponding to any gesture represented by the first position sequence signal.
Specifically, the several continuous actions comprised by any gesture represented by the first position sequence signal follow a corresponding rule, which can be obtained by observation and statistical analysis. To reduce the amount of analysis and increase processing speed, the invention simplifies the first position sequence signal through preset simplified gesture key points according to this rule. The preset simplified gesture key points may be the main skeleton key points, among all the skeleton key points, that suffice to represent a gesture simply, such as the skeleton key points of the elbows, wrists and fingers. A preferred configuration of the preset simplified gesture key points uses the positions of 4 skeleton key points: the left elbow, right elbow, left wrist and right wrist.
Simplifying the first position sequence according to a preset simplified gesture key point, and counting a position histogram of the first position sequence at the preset simplified gesture key point.
Any gesture represented by the first position sequence comprises several continuous actions that follow a corresponding rule. This rule describes a complete gesture as several stages, each with its own characteristics, and these stages serve as the preset complete gesture format. The stages of a complete gesture may include five phases: rest (starting) position, preparation, stroke (emphasis), hold, and retraction (withdrawal). Any complete gesture begins moving from the rest or starting position and, when the retraction phase completes, returns to the rest or starting position of the next complete gesture.
According to the position histogram and the preset complete gesture format corresponding to each complete gesture, the most frequently occurring position in the position histogram (the position with the highest count) is found and taken as the rest or starting position.
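The following sketch illustrates, under assumed bin sizes and data layout, how the most frequent simplified pose could be found from such a position histogram.

```python
import numpy as np

def find_rest_pose(keypoints, simplify_idx, bin_size=5.0):
    """Estimate the rest (starting) position from a position histogram.

    keypoints    : (T, K, 2) full keypoint sequence
    simplify_idx : indices of the simplified gesture keypoints
                   (e.g. left/right elbow and left/right wrist)
    bin_size     : histogram bin width in coordinate units (assumed)
    """
    simplified = keypoints[:, simplify_idx, :]              # (T, 4, 2)
    # quantize positions into bins and count how often each binned pose occurs
    bins = np.round(simplified / bin_size).astype(np.int64)
    flat = bins.reshape(len(bins), -1)                       # one row per frame
    keys, counts = np.unique(flat, axis=0, return_counts=True)
    most_frequent = keys[np.argmax(counts)]                  # most frequent binned pose
    # return the bin centre as the rest-position gesture P*
    return most_frequent.reshape(len(simplify_idx), 2) * bin_size
```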
Optionally, the training method corresponding to the first gesture generation model includes:
extracting, based on the first mark, the samples in the initial training sample in which an action occurs, and taking them as a second training sample;
uniformly sampling the second training sample to obtain a third training sample with uniform length; each sample in the third training samples comprises L second training samples after uniform sampling; l is a positive integer;
and based on the third training sample, completing training of the first gesture generation model when a second loss function converges, the second loss function being determined based on the marks of the third training sample and the gestures reconstructed by the first gesture generation model.
Specifically, the first gesture generation model is used for generating a representative gesture corresponding to the audio file, and the corresponding training method comprises the following steps:
The elements of the marked first position sequence signal that are marked as action occurrences are extracted as the second training sample. The extracted second training sample is then uniformly sampled to obtain a third training sample of uniform length L, where L is a positive integer, typically 50 to 100.
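A short sketch of the uniform resampling to a fixed length L (assuming simple index-based resampling) might look like this:

```python
import numpy as np

def resample_uniform(segment, length=50):
    """Uniformly resample a variable-length gesture segment to a fixed length L.

    segment : (T, K, D) keypoints of one action segment (T varies per segment)
    length  : target number of frames L (typically 50 to 100)
    """
    T = segment.shape[0]
    # pick L frame indices spread evenly over the original segment
    idx = np.linspace(0, T - 1, num=length).round().astype(np.int64)
    return segment[idx]
```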
The third training sample is input into the first gesture generation model to obtain a reconstructed representative gesture, and training of the first gesture generation model is completed when the second loss function, determined from the real gesture corresponding to the third training sample and the gesture reconstructed by the first gesture generation model, converges. Any audio file input to the trained first gesture generation model can then yield a reconstructed representative gesture.
Optionally, before the uniformly sampling the second training samples to obtain third training samples with uniform lengths, the method further includes:
determining whether key point data is lost for each sample in the second training samples;
and if any sample in the second training samples has missing key point data, repairing it by rotating and translating the adjacent training sample.
Specifically, samples in the second training sample may have missing key points due to self-occlusion, subtitle occlusion and other causes, which would affect the modeling of representative gestures. Using the 21 skeletal key points of a whole hand, the number of hand key points in each sample of the second training sample is checked, and samples with missing hand skeletal key point data are found, for example a frame F_i. The previous frame F_{i-1} is then used as the filling data: the relative rotation R_{i-1→i} and translation T_{i-1→i} between the adjacent frames F_{i-1} and F_i are computed, and the gesture is filled as F_i = F_{i-1} · R_{i-1→i} + T_{i-1→i}. Here the relative rotation R_{i-1→i} mainly refers to the rotation angle from frame F_{i-1} to frame F_i, and the translation T_{i-1→i} mainly refers to the position offset from frame F_{i-1} to frame F_i. The first gesture generation model is trained on the repaired samples, so that the representative gestures it generates are more accurate. In addition, the generated representative gestures can be cluster-analyzed with a clustering algorithm; for example, the obtained representative gestures are clustered in cascade with the K-means algorithm, with the number of cluster centers set to 15, finally yielding 15 different types of representative gestures. By configuring priorities for selecting different types of representative gestures in the preset synthesis rule, the control dimensions and flexibility of the generated digital human gestures are further improved.
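The sketch below illustrates the repair idea F_i = F_{i-1} · R + T; the use of a 2D Kabsch/Procrustes fit on anchor keypoints visible in both frames is an assumption about how the relative rotation and translation are estimated, since the text does not specify the estimation method.

```python
import numpy as np

def repair_missing_hand(prev_frame, cur_frame, missing_idx, anchor_idx):
    """Fill missing hand keypoints in the current frame from the previous frame.

    prev_frame  : (K, 2) keypoints of frame F_{i-1} (complete)
    cur_frame   : (K, 2) keypoints of frame F_i with missing entries
    missing_idx : indices of the keypoints that are missing in F_i
    anchor_idx  : indices of keypoints visible in both frames (assumed: arm/wrist
                  points), used to estimate the relative rotation and translation

    Applies F_i = F_{i-1} * R_{i-1->i} + T_{i-1->i} to the missing points.
    """
    A = prev_frame[anchor_idx]                   # anchors in F_{i-1}
    B = cur_frame[anchor_idx]                    # same anchors in F_i
    A_c, B_c = A - A.mean(0), B - B.mean(0)
    # 2D Kabsch/Procrustes estimate of the relative rotation R_{i-1->i}
    U, _, Vt = np.linalg.svd(A_c.T @ B_c)
    R = U @ Vt
    if np.linalg.det(R) < 0:                     # keep a proper rotation
        U[:, -1] *= -1
        R = U @ Vt
    T = B.mean(0) - A.mean(0) @ R                # relative translation T_{i-1->i}
    repaired = cur_frame.copy()
    repaired[missing_idx] = prev_frame[missing_idx] @ R + T
    return repaired
```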
Optionally, the training method corresponding to the second gesture generation model includes:
determining the first audio signal and the first position sequence signal as fourth training samples;
and based on the fourth training sample, under the condition that a third loss function meets convergence, training of the second gesture generation model is completed, and the third loss function is determined based on the positions of all key points in the fourth training sample and the positions of all key points predicted by the second gesture generation model.
Specifically, a one-dimensional convolutional model in U-NET form is used for rhythmic action modeling, i.e. to construct the second gesture generation model. The audio signal is converted into spectral features S_i by calculation and input into an audio coding model to obtain prosodic features; the prosodic features are input into the rhythmic gesture generation model M_P to obtain the corresponding predicted rhythmic gestures.
A third loss function is determined based on the positions of the key points in the fourth training sample and the positions of the key points predicted by the second gesture generation model. The parameters of the second gesture generation model are adjusted by gradient back-propagation using the difference between the predicted rhythmic gesture and the real gesture, and training of the second gesture generation model is completed when the third loss function converges. The third loss function used for gradient back-propagation includes three parts: the first part is the absolute distance between the real gesture and the predicted rhythmic gesture; the second part is the absolute distance between their first derivatives; the third part is the absolute distance between their second derivatives. Expressed with higher-order derivatives, the third loss function is L_Beat = |P − Q| + |P′ − Q′| + |P″ − Q″|, where P, P′ and P″ denote the real gesture, its first derivative and its second derivative, respectively, and Q, Q′ and Q″ denote the predicted rhythmic gesture, its first derivative and its second derivative, respectively.
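A possible PyTorch sketch of this third loss function, using finite differences in place of the derivatives, is shown below; the tensor layout is an assumption.

```python
import torch

def beat_loss(real, pred):
    """Third loss function L_Beat for the rhythmic gesture model (a sketch).

    real, pred : (batch, frames, keypoints, 2) real and predicted gesture sequences.
    Sums the absolute distances of the gestures, their first-order (velocity)
    differences and their second-order (acceleration) differences.
    """
    d0 = torch.mean(torch.abs(real - pred))
    # finite differences along the time axis stand in for the derivatives
    real_v, pred_v = real.diff(dim=1), pred.diff(dim=1)
    d1 = torch.mean(torch.abs(real_v - pred_v))
    real_a, pred_a = real_v.diff(dim=1), pred_v.diff(dim=1)
    d2 = torch.mean(torch.abs(real_a - pred_a))
    return d0 + d1 + d2
```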
Optionally, the method further comprises:
after the script generation model and the gesture generation model are respectively trained, determining the target loss function based on the first loss function and the third loss function;
and finishing fine tuning of the script generation model, the first gesture generation model and the second gesture generation model under the condition that the target loss function meets convergence.
Specifically, after the script generation model M_A, the first gesture generation model M_R and the second gesture generation model M_P have each been trained, the three are trained together for fine-tuning. During fine-tuning, a loss function L_Total is used: when the action script indicates that a gesture action is present (n = 1), the network uses the third loss function L_Beat together with the first loss function L_Script as L_Total; when the action script indicates that no gesture action is present (n = 0), the network uses the first loss function L_Script alone as L_Total. This can be expressed as L_Total = L_Script + L_Beat when n = 1, and L_Total = L_Script when n = 0.
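A per-frame sketch of this combined fine-tuning objective might look as follows; treating the losses as per-frame terms weighted by the action script is an assumed formulation made for the example.

```python
import torch

def total_loss(script_loss_per_frame, beat_loss_per_frame, action_script):
    """Fine-tuning objective L_Total for joint training (a sketch).

    Frames where the action script n = 1 contribute L_Script + L_Beat;
    frames where n = 0 contribute L_Script only.
    """
    n = action_script.float()
    per_frame = script_loss_per_frame + n * beat_loss_per_frame
    return per_frame.mean()

# dummy per-frame losses for a 50-frame clip
l_script = torch.rand(50)
l_beat = torch.rand(50)
n = torch.randint(0, 2, (50,))
print(total_loss(l_script, l_beat, n))
```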
respectively inputting target audio files into a trained script generation model M R The motion script n corresponding to the prediction in the model is input into a trained rhythmic gesture generating model M P Resulting rhythmic gesture P B Input to a trained representative gesture generating model M R Is representative gesture P R The method comprises the steps of carrying out a first treatment on the surface of the Finally, the representative gesture P is processed through the action script n and the preset synthesis rules B Sum-and-rhythm gesture P R The digital human gesture G is obtained by fusion, and can be expressed as:
the above preset composition rules are only schematically illustrated and are not intended to be a specific limitation of the preset composition rules in the present invention. The preset synthesis rules can be set and adjusted according to actual requirements.
The digital human gesture G is then used with graphics software to drive the digital human's motion.
According to the method for generating digital human gestures provided by the invention, the action occurrence sequence corresponding to the target audio file, determined by the script generation model, effectively controls digital human gesture synthesis under synchronized speech. The gestures are decoupled and modeled to obtain a representative gesture generation model and a rhythmic gesture generation model, and the representative gestures and rhythmic gestures obtained from these models are combined, so that more natural and richer gestures can be generated and the digital human gestures appear more realistic.
To test the validity of the model, it is applied to the test set. The temporal precision F_Precision, diversity F_Diversity and realism FID obtained when driving digital humans directly with rest gestures are shown in the second row of Table 1. The results under the three evaluation metrics for a digital human driven with always-on gestures are shown in the third row, for random gestures in the fourth row, and for gestures generated by a deep learning model in the fifth row. Finally, the trained model for controllable digital human gesture generation under synchronized speech is used to generate the final digital human gestures for the audio signal and drive the digital human; the temporal precision, gesture diversity and gesture realism are all greatly improved, showing that the generated digital human gestures have a high-quality visual effect.
Table 1 compares the temporal precision F_Precision, diversity F_Diversity and realism FID of digital humans driven with rest gestures, always-on gestures, random gestures, gestures generated by a deep learning model, and the gestures generated by the method of the invention:
TABLE 1
In order to more clearly describe the method for generating the digital human gesture provided by the embodiment of the invention, a specific example is described below.
Fig. 2 is a schematic diagram of a specific implementation flow of digital human gesture generation according to an embodiment of the present invention, where, as shown in fig. 2, the method includes:
step S1, training data are collected and prepared;
collecting professional Chinese news comment videos on the Internet as a first video file, and marking multi-mode data represented by the first video file through corresponding processing to serve as an initial training sample;
specifically, a corresponding audio file is acquired from the first video file, and audio is converted into a mel-frequency spectrogram characteristic as an audio signal. And extracting a text signal with time information from the first video file, and converting the text signal into text characteristics through a Chinese text pre-training model, wherein the Chinese text pre-training model adopts a long-short-term memory (Long Short Term Memory, LSTM) network structure. The text signal with time information is extracted from the first video file, specifically, after text and audio signals are aligned by using a voice forced aligner MFA tool, the text signal with time information, namely, text label, is obtained, text characteristics are obtained after the text label passes through a Chinese pre-training model, and the text characteristics are used as text signals. In addition, a Mediapipe tool is used to extract human skeletal keypoint sequence signals.
The audio signal, the text signal and the human skeleton key point position sequence signal are taken as the initial training samples.
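For illustration, a sketch of this data preparation using librosa and Mediapipe is given below; the file paths, sampling rate and number of mel bands are placeholder assumptions, not values specified by the patent.

```python
import cv2
import librosa
import mediapipe as mp
import numpy as np

# --- audio signal: mel-spectrogram features from the extracted audio file ---
wav, sr = librosa.load("speaker_clip.wav", sr=16000)        # path is a placeholder
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
audio_signal = librosa.power_to_db(mel)                      # (80, frames)

# --- keypoint sequence: human skeleton key points extracted with Mediapipe ---
holistic = mp.solutions.holistic.Holistic(static_image_mode=False)
cap = cv2.VideoCapture("speaker_clip.mp4")                   # path is a placeholder
pose_frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        pose_frames.append([(lm.x, lm.y, lm.z) for lm in results.pose_landmarks.landmark])
cap.release()
pose_sequence = np.array(pose_frames)                         # (frames, 33, 3)
```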
Step S2, sample acquisition of a script generation model (action script extraction);
in a first step, a complete gesture is described as moving from one rest position (starting position), through several phases of rest position (starting position), preparation, emphasis, hold, withdrawal, and back to another new rest position (starting position). Secondly, simplifying the gesture by using the positions of 4 skeleton points of the left elbow, the right elbow, the left wrist and the right wrist of the human body, counting the position histograms appearing at the four skeleton points, and finding the most frequently appearing position as a rest position gesture P * . Third step, determining the current frame simplified gesture and P of the human skeleton key point position sequence signal * The distance between them is denoted as D tr Calculating the distance D between the simplified gesture of the current frame and the next frame of the human skeleton key point position sequence signal tp And screening out frames with gesture actions according to a preset threshold value. The specific process is as follows:
where n is the value of the action script: a value of 1 indicates that an action occurs at that position, and a value of 0 indicates that no action occurs.
The 24 frames of audio signal before and the 25 frames after each data frame in the initial training sample are taken as the audio signal of that frame during training, forming the training samples of the script generation model, i.e. the first training sample.
S3, representative gesture extraction;
extracting gesture fragments when an action script is 1 in an initial training sample, wherein each fragment comprises a complete stage of a gesture, and a representative gesture modeling can be influenced due to the fact that a hand skeleton key point sequence is missing due to self-shielding, subtitle shielding and the like, and finding out a first frame F with hand missing through detecting 42 hand skeleton key point numbers i . Then using F of the previous frame i-1 As data for depuffering, by calculating the relative rotation R of adjacent frames i-1→i And translation T i-1→i Gesture filling, i.e. F i =F i-1 *R i-1→i +T i-1→i . Uniformly sampling the repaired representative gesture sequence to ensure that each representative gesture contains 50 frames; clustering the repaired representative gestures, extracting representative gesture video features by adopting a video key action recognition pre-training model, clustering the extracted features according to a K-means algorithm, setting a clustering center as 15, and finally obtaining 15 types of different types of representative gestures as a third training sample of the first gesture generation model.
S4, extracting rhythmic gesture samples;
converting the extracted audio signal into spectral features S by calculation i The method comprises the steps of carrying out a first treatment on the surface of the And inputting the frequency spectrum characteristics into an audio coding model to obtain rhythm characteristics serving as rhythmic gesture samples.
S5, generating a model by a script;
a process for script generation model training, comprising: and (2) constructing a script generation model by using a full convolution layer ResNet-101 network structure, inputting the first training sample in the step (S2) into the constructed script generation model, outputting a predicted action script, determining a first loss function based on the actual action and the predicted action, and completing training of the script generation model under the condition that the first loss function is converged.
S6, generating a model by a first gesture;
a training process for a first gesture generation model, comprising: constructing a representative gesture model based on a variational self-encoder (including an encoder and a decoder) which can generate data similar to the original distribution by constructing a hidden vector layer, and performing the third training sample in the step S2 according to the following stepsInputting the gesture into the representative gesture generation model; wherein (1)>Representing any segment in the third training sample, 49 in (49,2) represents 42 hand skeletal keypoints and 7 skeletal keypoints at the arm and shoulder, and 2 in (49,2) represents that the corresponding position coordinates are two-dimensional. Learning a gaussian distribution N (μ, δ) by an encoder in the representative gesture generation model to approximate it to a standard normal distribution; reconstructing a corresponding gesture from the decoding network >The loss function used in network training is the reconstruction loss L Cons And a divergence loss function (Kullback-Leibler Divergence, KL) L KL This can be expressed as:
wherein L is Cons Represents a reconstruction loss function, m represents the total sequence length corresponding to the gesture, i represents any element in the gesture sequence,representing a representative gesture corresponding to the training sample i; />An ith representative gesture representing a prediction or reconstruction of the first gesture generation model. L (L) KL The dispersion loss function is represented, delta represents the variance of the probability distribution corresponding to the first gesture generation model, and mu represents the probability distribution mean value corresponding to the first gesture generation model.
Training of the first gesture generation model is completed when the second loss function, determined from the reconstruction loss L_Cons and the KL loss L_KL above, converges.
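A compact sketch of a variational auto-encoder with these two loss terms is given below; the network sizes and latent dimension are assumptions, and the reconstruction term is taken as a mean squared error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureVAE(nn.Module):
    """A minimal VAE over flattened (50, 49, 2) gesture segments (a sketch)."""
    def __init__(self, in_dim=50 * 49 * 2, latent=32, hidden=512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # reconstruction loss L_Cons (mean squared error over the gesture sequence)
    l_cons = F.mse_loss(x_hat, x)
    # KL divergence L_KL between N(mu, var) and the standard normal distribution
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return l_cons + l_kl

model = GestureVAE()
batch = torch.randn(4, 50 * 49 * 2)        # four flattened gesture segments (dummy)
x_hat, mu, logvar = model(batch)
loss = vae_loss(batch, x_hat, mu, logvar)
loss.backward()
```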
S7, generating a model by a second gesture;
a training process for a first gesture generation model, comprising: modeling of the second gesture generation model is performed by adopting a one-dimensional convolution model in a U-NET form. Determining a third loss function L using the difference between the predicted gesture and the real gesture Beat And adjusting parameters of the rhythmic gesture generation model through gradient back propagation so as to learn. The third loss function for gradient back-propagation includes: the first part is a distance absolute function of the real gesture and the predicted gesture; the second part is a first-order distance absolute function of the real gesture and the predicted gesture; the third part is a second order absolute distance function of the true gesture and the predicted gesture. The third loss function L Beat The method adopts a higher derivative expression form, and is specifically expressed as follows:
where g denotes the real gesture information, ĝ denotes the predicted rhythmic gesture, and g′, g″ (respectively ĝ′, ĝ″) denote their first- and second-order differences.
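A possible implementation of the third loss L_Beat is sketched below, reading the three parts as absolute distances between the gestures and their first- and second-order temporal differences; the tensor shapes are assumptions.

```python
# Illustrative implementation of the third loss L_Beat: absolute distances between the
# real and predicted rhythmic gestures and their first- and second-order differences
# along the time axis. Tensor shapes are assumptions.
import torch

def beat_loss(g_real: torch.Tensor, g_pred: torch.Tensor) -> torch.Tensor:
    """g_real, g_pred: (frames, keypoints, 2) rhythmic gesture sequences."""
    d0 = torch.mean(torch.abs(g_real - g_pred))                               # position term
    v_real, v_pred = g_real[1:] - g_real[:-1], g_pred[1:] - g_pred[:-1]       # first-order term
    d1 = torch.mean(torch.abs(v_real - v_pred))
    a_real, a_pred = v_real[1:] - v_real[:-1], v_pred[1:] - v_pred[:-1]       # second-order term
    d2 = torch.mean(torch.abs(a_real - a_pred))
    return d0 + d1 + d2
```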
Step S8, action scripts;
According to the action script obtained in step S2, the script generation model from step S5 is trained, and the trained script generation model outputs the action script corresponding to the audio file.
Step S9, representative gestures;
According to the representative gestures obtained in step S3, the first gesture generation model from step S6 is trained, and the trained first gesture generation model outputs the representative gesture corresponding to the audio file.
Step S10, rhythmic gestures;
According to the rhythmic gestures obtained in step S4, the second gesture generation model is trained, and the trained second gesture generation model outputs the rhythmic gesture corresponding to the audio file.
Step S11, digital human gestures;
The representative gesture obtained in step S9 and the rhythmic gesture obtained in step S10 are fused according to the action script obtained in step S8 and a preset synthesis rule to obtain the final digital human gesture. The preset synthesis rule specifies that, at any moment when a gesture action exists, the digital human gesture is synthesized from the representative gesture, the rhythmic gesture, or a combination of the two; at any moment when no gesture action exists, the digital human gesture is determined by the representative gesture, the rhythmic gesture, or another fixed preset gesture, where the fixed preset gesture may be any one selected from all the gestures.
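The preset synthesis rule can be illustrated with the following sketch. The patent does not state how the representative and rhythmic gestures are combined when both are present, so a simple weighted average with an assumed blend factor is used; the idle_pose argument stands for the fixed preset gesture.

```python
# Hedged sketch of the preset synthesis rule in step S11. How representative and
# rhythmic gestures are blended when a gesture action exists is not specified, so a
# simple weighted average is assumed; idle_pose stands for the fixed preset gesture.
import numpy as np

def fuse_gestures(action_script: np.ndarray,     # (frames,) 1 = gesture action exists
                  representative: np.ndarray,    # (frames, keypoints, 2)
                  rhythmic: np.ndarray,          # (frames, keypoints, 2)
                  idle_pose: np.ndarray,         # (keypoints, 2) fixed preset gesture
                  blend: float = 0.5) -> np.ndarray:
    out = np.empty_like(representative)
    for t, has_action in enumerate(action_script):
        if has_action:
            # A gesture action exists: combine representative and rhythmic gestures.
            out[t] = blend * representative[t] + (1.0 - blend) * rhythmic[t]
        else:
            # No gesture action: fall back to the fixed preset gesture.
            out[t] = idle_pose
    return out
```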
Step S12, digital human action driving display;
The digital human gesture obtained in step S11 is rendered with graphics software to achieve the display effect of digital human action driving.
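Tying steps S8 to S11 together, an end-to-end inference sketch might look as follows. It reuses the illustrative components sketched above (ScriptGenerationModel, the representative-gesture VAE, and the fuse_gestures helper); all interfaces, shapes, and the latent sampling strategy are assumptions, and the graphics rendering of step S12 is outside the scope of the snippet.

```python
# End-to-end inference sketch for steps S8-S11, reusing the illustrative components
# above (ScriptGenerationModel, RepresentativeGestureVAE, fuse_gestures). Shapes,
# interfaces, and the latent sampling strategy are assumptions for clarity.
import torch

def generate_digital_human_gesture(audio_features, rhythm_features,
                                   script_model, repr_vae, rhythm_model, idle_pose):
    with torch.no_grad():
        # Step S8: per-frame action script predicted from the audio features.
        logits = script_model(audio_features)                 # (1, frames)
        action_script = (logits.squeeze(0) > 0).long().cpu().numpy()
        frames = action_script.shape[0]

        # Step S9: representative gestures sampled from the normal latent space.
        z = torch.randn(frames, 32)
        representative = repr_vae.decoder(z).reshape(frames, 49, 2).cpu().numpy()

        # Step S10: rhythmic gestures predicted from the rhythm features.
        rhythmic = rhythm_model(rhythm_features).reshape(frames, 49, 2).cpu().numpy()

    # Step S11: fuse according to the action script and the preset synthesis rule.
    return fuse_gestures(action_script, representative, rhythmic, idle_pose)
```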
The invention provides a method for controlling digital human gesture generation. By training a script generation model, an action script can be predicted from audio and text, and gesture generation can then be controlled based on the predicted action script. The gestures are decoupled into representative gestures and rhythmic gestures, and corresponding models are trained separately for gesture modeling. The representative gesture generation model, trained as a variational autoencoder, samples from a normal distribution space to generate professional representative gestures. The rhythmic gestures produced by the rhythmic gesture model enrich the display of gestures and make them more vivid. The representative and rhythmic gestures are combined into digital human gestures according to the action script obtained from the script generation model, which makes the gesture synthesis process more flexible, so that the digital human gestures are more natural and richer.
Fig. 3 is a schematic structural diagram of an apparatus for generating a digital human gesture according to an embodiment of the present invention, as shown in fig. 3, the apparatus includes:
The acquiring module 301 is configured to acquire a target audio file of a digital human gesture to be generated;
The determining module 302 is configured to determine an action occurrence sequence corresponding to the target audio file based on the script generation model; the action occurrence sequence is used to indicate whether a gesture action exists at any moment;
the generating module 303 is configured to control the generated representative gesture and rhythmic gesture to be synthesized into a digital human gesture corresponding to the target audio file based on the action generating sequence and the gesture generating model;
the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
Optionally, the generating module 303 is specifically configured to, in a process of controlling the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file based on the action generating sequence and the gesture generating model:
generating a representative gesture corresponding to the target audio file based on the first gesture generation model;
generating a rhythmic gesture corresponding to the target audio file based on the second gesture generation model;
based on the action occurrence sequence and a preset synthesis rule, fusing the representative gesture and the rhythmic gesture to obtain a digital human gesture corresponding to the target audio file; the preset synthesis rule is used for limiting the digital human gesture at any moment to be determined based on any one or combination of the representative gesture and the rhythmic gesture.
The device for digital human gesture generation provided by the embodiment of the invention can execute the technical scheme of the method for digital human gesture generation in any of the above embodiments; its implementation principle and beneficial effects are similar to those of the method and are not repeated here.
Fig. 4 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device may include: a processor 410, a communication interface 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a method of digital human gesture generation, the method comprising:
acquiring a target audio file of a digital human gesture to be generated;
determining an action generation sequence corresponding to the target audio file based on a script generation model; the action generating sequence is used for indicating whether gesture actions exist at any moment;
based on the action generation sequence and a gesture generation model, the generated representative gesture and rhythmic gesture are controlled to be synthesized into a digital human gesture corresponding to the target audio file;
The script generation model is trained based on a training sample determined by a first video file with voice information and motion information; the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or as the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the method of digital human gesture generation provided by the methods described above, the method comprising:
acquiring a target audio file of a digital human gesture to be generated;
determining an action generation sequence corresponding to the target audio file based on a script generation model; the action generating sequence is used for indicating whether gesture actions exist at any moment;
based on the action generation sequence and a gesture generation model, the generated representative gesture and rhythmic gesture are controlled to be synthesized into a digital human gesture corresponding to the target audio file;
the script generation model is trained based on a training sample determined by a first video file with voice information and motion information; the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method of digital human gesture generation provided by the methods described above, the method comprising:
acquiring a target audio file of a digital human gesture to be generated;
determining an action generation sequence corresponding to the target audio file based on a script generation model; the action generating sequence is used for indicating whether gesture actions exist at any moment;
based on the action generation sequence and a gesture generation model, the generated representative gesture and rhythmic gesture are controlled to be synthesized into a digital human gesture corresponding to the target audio file;
the script generation model is trained based on a training sample determined by a first video file with voice information and motion information; the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method of digital human gesture generation, comprising:
acquiring a target audio file of a digital human gesture to be generated;
determining an action generation sequence corresponding to the target audio file based on a script generation model; the action generating sequence is used for indicating whether gesture actions exist at any moment;
based on the action generation sequence and a gesture generation model, the generated representative gesture and rhythmic gesture are controlled to be synthesized into a digital human gesture corresponding to the target audio file;
the script generation model is trained based on a training sample determined by a first video file with voice information and motion information; the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
2. The method of digital human gesture generation according to claim 1, wherein controlling, based on the action generation sequence and the gesture generation model, the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file comprises:
Generating a representative gesture corresponding to the target audio file based on the first gesture generation model;
generating a rhythmic gesture corresponding to the target audio file based on the second gesture generation model;
based on the action occurrence sequence and a preset synthesis rule, fusing the representative gesture and the rhythmic gesture to obtain a digital human gesture corresponding to the target audio file; the preset synthesis rule is used for limiting the digital human gesture at any moment to be determined based on any one or combination of the representative gesture and the rhythmic gesture.
3. The method of digital human gesture generation according to claim 2, wherein the script generation model is trained based on training samples determined from a first video file having speech information and motion information, the corresponding training method comprising:
determining the first N elements of the current element in the initial training sample and the last M elements of the current element as one element in the first training sample; the initial training sample is used for representing whether actions occur in the first video file at different moments; the first video file comprises voice information and action information; m and N are positive integers;
Based on the first training sample, under the condition that the first loss function meets convergence, training of the script generation model is completed; the first loss function is determined based on a first mark corresponding to the first training sample and a predicted result output after the first training sample is input into the script generation model.
4. A method of digital human gesture generation according to claim 3, wherein the initial training samples are training samples for characterizing whether there is an action occurrence in the first video file at different moments, and the corresponding acquisition method comprises:
acquiring a first position sequence signal based on a first video file; the first position sequence signal is used for representing positions corresponding to all key points of human hand bones, human body bones and facial bones at different moments;
determining a starting position corresponding to any gesture represented by the first position sequence signal;
marking the target element by adopting a first mark based on the distance between the target element in the first position sequence and the starting position and the distance between the next element of the target element in the first position sequence and the starting position, and taking the first mark as the initial training sample; the target element is any element in the first position sequence; the first mark is used to characterize whether an action occurs with respect to the target element.
5. The method of digital human gesture generation according to claim 4, wherein determining a starting position corresponding to any gesture characterized by the first position sequence signal comprises:
simplifying the first position sequence signal according to preset simplified gesture key points, and counting to obtain a position histogram of the simplified gesture key points;
and determining the most frequent position in the histogram according to a preset complete gesture format, and taking the most frequent position as the starting position corresponding to any gesture characterized by the first position sequence signal.
6. The method for generating digital human gestures according to claim 4, wherein the training method corresponding to the first gesture generating model comprises:
extracting a sample representing that motion occurs in the initial training sample based on the first mark, and taking the sample as a second training sample;
uniformly sampling the second training sample to obtain a third training sample with uniform length; each sample in the third training samples comprises L second training samples after uniform sampling; l is a positive integer;
and based on the third training sample, under the condition that a second loss function meets convergence, training of the first gesture generation model is completed, and the second loss function is determined based on the marks of the third training sample and the gesture reconstructed by the first gesture generation model.
7. The method for generating a digital human gesture according to claim 6, wherein before uniformly sampling the second training samples to obtain third training samples with uniform lengths, further comprises:
determining whether key point data is lost for each sample in the second training samples;
and if any one of the second training samples has the key point data loss, repairing by adopting a mode of rotating and translating the adjacent training samples.
8. An apparatus for digital human gesture generation, comprising:
the acquisition module is used for acquiring a target audio file of the digital human gesture to be generated;
the determining module is used for determining an action occurrence sequence corresponding to the target audio file based on the script generation model; the action generating sequence is used for indicating whether gesture actions exist at any moment;
the generation module is used for controlling the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file based on the action generation sequence and the gesture generation model;
the script generation model is trained based on a training sample determined by a first video file with voice information and motion information; the gesture generation model includes a first gesture generation model for generating a representative gesture and a second gesture generation model for generating a rhythmic gesture.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of digital human gesture generation of any one of claims 1 to 7 when the program is executed by the processor.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the method of digital human gesture generation of any of claims 1 to 7.
Cited By (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN118404588A * | 2024-07-01 | 2024-07-30 | 中邮消费金融有限公司 | Digital person driving method, device, equipment and storage medium