CN113781610B - Virtual face generation method - Google Patents
- Publication number
- CN113781610B CN113781610B CN202110719425.8A CN202110719425A CN113781610B CN 113781610 B CN113781610 B CN 113781610B CN 202110719425 A CN202110719425 A CN 202110719425A CN 113781610 B CN113781610 B CN 113781610B
- Authority
- CN
- China
- Prior art keywords
- face
- mouth
- emotion
- virtual
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Probability & Statistics with Applications (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention provides a virtual face generation method that takes Chinese as its research object and uses a speaking text together with a frontal face image and a side face image of a real person to generate lip-synchronized, personalized face animation with smooth, natural mouth movements and rich, vivid facial expression changes. The method comprises original face geometric modeling, face dynamic mouth-shape modeling, facial expression modeling, voice modeling and virtual face synthesis. The invention provides a complete set of steps for producing a virtual face at low cost and realizes a lip-synchronized virtual face that simultaneously accounts for mouth-shape change, expression change and personalized appearance; it has wide application value and development prospects in fields such as film production, computer games, virtual anchors, virtual customer service and online virtual teacher teaching.
Description
Technical Field
The invention relates to the technical field of artificial intelligence virtual human applications, and in particular to a virtual face generation method.
Background
In recent years, with the continuous development of computer vision, computer graphics and artificial intelligence technology, virtual humans have begun to be applied in various industries, for example as virtual hosts and virtual teachers. The advent of these applications has placed higher demands on virtual face generation. Virtual face generation aims to produce personalized, lip-synchronized face animation with smooth, natural mouth movements and rich, vivid facial expression changes, and has wide application value and development prospects in fields such as film production, computer games, virtual anchors, virtual customer service and online virtual teacher teaching. Taking online virtual teacher teaching as an example, a virtual teacher can intelligently simulate the whole course of a teacher's lecture, and lecture content that was originally delivered in person is stored on an online teaching platform as video featuring the virtual teacher's image. Because it is not limited by time or space, it can reduce the production and updating cost of online course video, greatly reduce the burden on teachers of producing online courses, and provide shared high-quality educational resources for society. In particular, under the influence of the COVID-19 epidemic, with offline education suspended, online virtual teacher teaching has emerged as an advantage of distance education and plays an important role.
However, existing virtual face production methods based on face animation technology suffer from problems such as insufficiently natural mouth shapes, a lack of fine expression changes, and faces without generalization ability. Although some current deep learning methods improve the realism of the avatar to some extent, the training cost of the animation is too high, the training time is long, and the generated faces are still relatively uniform in appearance. Cudeiro et al. (2019) proposed a voice-driven character animation framework (Voice Operated Character Animation, VOCA) that can generate lip-synchronized three-dimensional face animation from a given audio signal and a three-dimensional head mesh template. The framework is simple and general and has good generalization ability across different subjects, different languages and different sound sources. However, because it mainly focuses on the lower half of the face around the mouth while learning facial motion, changes in the upper half of the face are ignored; such audio-driven techniques therefore cannot model the full face animation well.

Rather than relying on traditional computer graphics methods, Kumar et al. (2017) proposed ObamaNet, a framework that uses arbitrary text to generate audio and photo-realistic lip-synchronized video, producing high-quality video that imitates the speeches of former United States President Obama with accurate lip synchronization. The architecture includes three main modules: a Char2Wav-based text-to-speech network, a time-delayed LSTM (Time-Delayed LSTM) that produces mouth feature points synchronized with the audio, and a Pix2Pix-based network that generates video frames from the feature points. The time-delayed LSTM predicts the mouth key points from the audio features; according to the prediction result, the mouth-region image that best matches the key points is retrieved from a face image library, the matched mouth sequence is fused with the target video, and the face animation video is finally generated. Although this method can synthesize realistic video, the three-dimensional face model is difficult to adapt to other people because of the large differences in texture and face shape between individuals, so the method does not generalize.

Chung et al. (2017) proposed Speech2Vid, an encoder-decoder convolutional neural network model consisting of four modules: an audio encoder, an identity image encoder, a face image decoder, and a deblurring module. However, the method does not consider the temporal correlation between video frames, so the generated video suffers from jitter, and the facial expression it provides is uniform, which limits the realism of the video. Building on this work, to improve the realism of the generated video, the same research team (2019) further fed five still images from the time region adjacent to the audio into the identity image encoder to extract facial expression information, and added a context encoder to the Speech2Vid model to extract facial features of the ground-truth image corresponding to the audio. Prajwal et al. (2020) proposed LipGAN, an improvement on the Speech2Vid model based on a generative adversarial network, in which the generator network comprises a face image encoder, an audio encoder and a face image decoder.
Compared with the Speech2Vid model, the LipGAN face image decoder adopts more skip connections, so that the face image encoder can provide richer face image information. Another study (2020) drives a three-dimensional face model with the movement trajectories of the articulatory organs, realizing the conversion from speech or text to the visual domain to generate face animation. The face animation generation task based on two-dimensional images takes speech or text as input and adopts deep learning methods to synthesize highly realistic, identity-arbitrary, lip-synchronized face animation. However, this research also has shortcomings. Driving the face animation by the movement trajectories of the articulatory organs performs better than traditional methods, but modeling articulatory organs such as the tongue and lips also affects other parts of the face, such as the cheeks, which greatly degrades the final animation and makes the facial changes insufficiently natural. On the other hand, in face animation, expression change is just as important as mouth-shape change in making the result vivid and natural, yet this research pays little attention to expression change.
Foreign companies have developed a series of classic face animation products over the past decades. DreamWorks in the United States (2001) designed the realistic monster Shrek as the leading character of the film Shrek, achieving lip synchronization and expressions very close to those of a real person. Classic 3D games such as World of Warcraft, Honor of Kings and the Xian Jian series use three-dimensional scenes and virtual characters to great effect: the appearance, movements and postures of the in-game characters are lifelike and give players an immersive experience. A product released in 2016 by a United States computer vision company can capture human facial features from a single still picture to create a realistic 3D avatar. Domestic research in this area started relatively late, but several virtual face products have been developed in recent years. Beijing Wofurui Culture Communication Company has developed an expressive bionic robot system, consisting of an expressive bionic robot and a cartoon-image robot, which takes the VOFRID free-form stereoscopic curved-surface display technology as its core, changes the uniform mechanical appearance of robots and endows the robot with anthropomorphic facial expressions. A domestic media technology company has proposed an advanced A.I. virtual anchor solution that uses its speech synthesis, image processing, machine translation and other artificial intelligence technologies to realize automatic conversion from text to video, supporting the generation of multilingual anchor video and the customization of both real-person and cartoon 3D images. However, the virtual anchor's image is essentially fixed, and producing a specific image is costly. The core technologies of these products developed by domestic and foreign companies have not been disclosed.
Reviewing the current state of virtual face research at home and abroad, it can be seen that face animation technology has been widely explored. However, controlling the mouth shape and facial expression of a face while preserving local detail, and generating a realistic, lip-synchronized face, still pose great challenges. Most current research adopts learning mechanisms so that the trained face model exhibits good mouth shapes and expressions. However, such training requires a large amount of material, and the various language samples required by current training models are insufficient. Moreover, model training is costly and the training period is long. At present there is no low-cost method for producing an arbitrary personalized virtual face animation that realistically achieves a human-like speaking effect.
Disclosure of Invention
The invention provides a virtual face generation method that can generate lip-synchronized, personalized face animation with smooth, natural mouth movements and rich, vivid facial expression changes using only a speaking text and real-person image information, thereby solving the technical problem in the prior art that the face generation effect can be guaranteed only by relying on a large number of training samples.
The invention provides a virtual face generation method, which comprises the following steps:
S1: constructing an original face geometric model, wherein the original face geometric model comprises face feature points;
S2: determining lip skeleton feature points based on the face feature points in the original face geometric model, constructing mouth-shape-change key frames and intermediate frames for the lip skeleton feature points by establishing a mapping relation between phonemes and visemes and a mapping relation between visemes and mouth-shape key frames, and constructing a face dynamic mouth-shape model driven by Chinese pinyin phonemes, wherein the face dynamic mouth-shape model comprises the numbers of the mouth-shape-change key frames and intermediate frames, the lip skeleton feature point coordinate values and the insertion times; a phoneme is defined as the minimum unit of a syllable pronunciation action according to the natural attributes of speech; a viseme refers to the state of the positions of the upper and lower lips and the upper and lower jaw when a phoneme is pronounced; the mouth-shape key frames are used for recording the key content of the virtual character's mouth animation picture when a phoneme is pronounced; and the mouth-shape intermediate frames are used for representing the complete change process from the start to the end of the mouth shape when a phoneme is pronounced;
S3: designing facial expression changes of different degrees according to the mouth-shape-change key frames, the lip skeleton feature point coordinate values of those key frames and the emotion keywords contained in the input speaking text, generating expression key frames, and constructing a facial expression model of the face, wherein the expression key frames are used for recording the facial expression changes contained in the virtual face animation;
S4: converting the input speaking text into voice audio, and processing the speaking speed and pause of the person to construct a voice model;
S5: inputting a frontal face image and a side face image, processing the original face geometric model obtained in step S1 to obtain a real face geometric model, and integrating the real face geometric model with the face dynamic mouth-shape model obtained in step S2, the facial expression model obtained in step S3 and the voice model obtained in step S4 through synthesis and synchronization processing to generate the virtual face.
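To make the relationship between steps S1-S5 easier to follow, the following minimal Python sketch strings the five steps together. All function names, data shapes and return values here are illustrative placeholders introduced for this sketch only; the actual embodiment builds the model in a 3D tool rather than in code like this.

```python
# Placeholder pipeline for steps S1-S5; every name below is invented for this sketch.

def build_original_face_model():
    """S1: original face geometric model with facial feature points (stubbed)."""
    return {"feature_points": {}}

def build_mouth_frames(text, face_model):
    """S2: mouth-shape key/intermediate frames driven by Chinese pinyin phonemes (stubbed)."""
    return [{"number": 0, "coords": {}, "insert_time": 0}]

def build_expression_frames(text, mouth_frames):
    """S3: expression key frames derived from emotion keywords in the text (stubbed)."""
    return list(mouth_frames)

def synthesize_speech(text):
    """S4: text-to-speech with speech-rate and pause control (stubbed)."""
    return b""

def texture_face_model(face_model, front_image, side_image):
    """S5: map the real-person frontal/side images onto the model via UV unwrapping (stubbed)."""
    return face_model

def generate_virtual_face(text, front_image, side_image):
    model = build_original_face_model()                               # S1
    mouth = build_mouth_frames(text, model)                           # S2
    expressions = build_expression_frames(text, mouth)                # S3
    audio = synthesize_speech(text)                                   # S4
    real_model = texture_face_model(model, front_image, side_image)   # S5
    return real_model, mouth, expressions, audio                      # synthesized and synchronized downstream

print(generate_virtual_face("你好", "front.png", "side.png"))
```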
In one embodiment, step S2 includes:
s2.1: combining face feature points contained in an original face geometric model according to a preset standard, and selecting skeleton feature points as sources for driving the change of the mouth model to obtain lip skeleton feature points;
S2.2: constructing mouth-shape key frames for the lip skeleton feature points by establishing a mapping relation between phonemes and visemes and a mapping relation between visemes and mouth-shape key frames;
S2.3: introducing dynamic visemes to produce the mouth-shape-change intermediate frames;
s2.4: and storing the information of the dynamic mouth shape frame to construct a human face dynamic mouth shape model, wherein the information of the dynamic mouth shape frame comprises a mouth shape key frame and an intermediate frame, and each frame of information comprises a skeleton point number, a skeleton point three-dimensional coordinate and an insertion time.
In one embodiment, step S2.2 comprises:
S2.2.1: classifying the phoneme states of the basic mouth shapes of Chinese pronunciation according to preset rules using three basic parameters, namely the longitudinal change value of the lips, the transverse change value of the lips and the opening-and-closing change value of the jaws and teeth, defining visemes according to the classified phoneme states, and constructing a mapping relation between phonemes and visemes;
S2.2.2: establishing a mapping relation between visemes and mouth-shape key frames according to the Chinese pinyin corresponding to the input speaking text and the lip skeleton feature points obtained in step S2.1, wherein each viseme corresponds to one set of three-dimensional lip skeleton point coordinates.
In one embodiment, step S2.3 comprises:
S2.3.1: taking the static visemes as mouth-shape key frames, adding several intermediate frames between two static visemes to represent the continuous transition between two mouth shapes, and calculating the mouth feature point coordinates of each intermediate frame from the feature point coordinates of the preceding and following key frames, with the time between the two key frames as the variable parameter;
S2.3.2: determining the number of intermediate frames between two mouth-shape key frames according to the time interval between them;
S2.3.3: adding up the numbers of all mouth-shape key frames and intermediate frames so far to obtain the insertion time of the mouth-shape key frame corresponding to the phoneme of the next Chinese character.
In one embodiment, step S3 includes:
s3.1: establishing a basic emotion dictionary for identifying emotion keywords related to emotion in an input text;
s3.2: combining the emotion dictionary, and carrying out emotion word recognition and emotion calculation on the input speaking text;
s3.3: facial expression changes with different degrees are designed according to lip skeleton feature point coordinate values of the mouth-shaped variable key frames and the mouth-shaped variable intermediate frames and emotion keywords contained in the input speaking text, and expression key frames are generated;
s3.4: the blink action is controlled by modifying the three-dimensional coordinates of the skeletal feature points of the upper eyelid, setting key frames when opening and closing.
In one embodiment, step S3.2 comprises:
s3.2.1: initializing emotion parameters, wherein the emotion parameters comprise emotion weight values and influence degree values of emotion degree adverbs;
s3.2.2: word segmentation processing is carried out on the input text;
s3.2.3: detecting whether words in word segmentation results contain keywords related to emotion, and giving different emotion weights to the words according to the intensity of emotion word meanings;
s3.2.4: detecting whether a word in a word segmentation result contains a degree adverb or not, and giving influence degree values of different emotion degree adverbs to the word according to the degree adverb;
S3.2.5: detecting whether words in word segmentation results contain negative words or not, and determining the number of the negative words;
s3.2.6: and calculating the emotion value of the input text word according to the emotion weight value of the word segmentation result, the influence degree value of the emotion degree adverbs and the number of the contained negative words.
In one embodiment, step S3.3 includes:
s3.3.1: superposing different basic action units to express rich facial expressions, defining action units on the expression, and determining expression action units, wherein the facial expressions comprise happiness, anger, surprise and sadness;
s3.3.2: according to the emotion calculation result and lip bone feature point coordinate values of the mouth-shaped keyframes and the intermediate frames, facial expressions with different degrees are designed;
s3.3.3: and calculating coordinates of feature points in the facial expression key frames with different degrees.
The above technical solutions in the embodiments of the present application at least have one or more of the following technical effects:
The application provides a virtual face generation method comprising: constructing an original face geometric model, a face dynamic mouth-shape model, a facial expression model and a voice model; inputting a frontal face image and a side face image; processing the original face geometric model to obtain a real face geometric model; and integrating the real face geometric model with the face dynamic mouth-shape model, the facial expression model and the voice model through synthesis and synchronization processing to generate a virtual face. Compared with the prior art, a complete method and implementation steps for producing a virtual face are provided; no large number of training samples is needed, and only the speaking text and the frontal and side images of a real person are required to generate speech audio, mouth movements, facial expression changes and natural blinking that imitate an ordinary person speaking, which greatly reduces the cost of face generation.
Further, with reference to real-person image data and Chinese pronunciation teaching videos, key frames and intermediate frames of mouth-shape change are constructed for the selected lip skeleton feature points by establishing a phoneme-viseme-key frame mapping and performing key frame interpolation, so that a dynamic mouth-shape model driven by Chinese pinyin phonemes can be generated and seamless lip synchronization can be achieved. According to the playback speed of the animation, the generated Chinese speech can be matched to the mouth shapes in the animation by adjusting the speech-rate parameter of the speech synthesis tool.
Further, with reference to the Facial Action Coding System framework and on the basis of a basic emotion dictionary, expression key frames of different degrees of happiness, anger, surprise and sadness are designed according to the emotion keywords, emotion degree adverbs and emotion negation words contained in the input speaking text, targeting the characteristic facial expression changes corresponding to basic human emotions, so that the virtual face animation contains rich facial expression changes.
Further, the face model is cut using a UV unwrapping method, and the appearance map of the model is obtained by stretching and aligning the real-person images, outputting a face model that carries the real person's facial organ features, texture, skin color and other characteristics, so that the virtual character has a personalized face animation and the method can be generalized to any virtual character.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of generating a virtual face in accordance with an embodiment of the present invention;
FIG. 2 is an original face geometry model constructed in accordance with an embodiment of the present invention;
FIG. 3 is a schematic view of selecting feature points of lips skeleton according to an embodiment of the present invention;
FIG. 4 is a flowchart of emotion word recognition and emotion calculation according to an embodiment of the present invention;
FIG. 5 is a schematic representation of the design of expressions of different degrees of happy emotion according to an embodiment of the present invention;
FIG. 6 is a schematic representation of the design of expressions of various degrees of anger emotion according to an embodiment of the present invention;
FIG. 7 is a schematic representation of the design of an expression with varying degrees of surprise emotion according to an embodiment of the present invention;
FIG. 8 is a schematic representation of the design of different degrees of sadness emotion in an embodiment of the present invention;
FIG. 9 is a schematic diagram of a model cut of an embodiment of the present invention;
FIG. 10 is a graph of UV unfolded tiling effect of a model according to an embodiment of the present invention;
FIG. 11 is an alignment of a frontal image with a side image in accordance with an embodiment of the present invention;
FIG. 12 is a schematic illustration of a mapping of a face geometry model in accordance with an embodiment of the present invention;
fig. 13 is a schematic diagram of virtual face synthesis according to an embodiment of the present invention.
Detailed Description
To address the difficulties in the prior art, the invention takes Chinese as its research object and designs a virtual face generation method driven by Chinese text and emotion vocabulary. Based on a key-frame-interpolation face animation model, a Chinese phoneme-viseme-key frame mapping is established and mouth-shape frames driven by Chinese pinyin phonemes are designed; meanwhile, an emotion keyword detection method based on a basic emotion dictionary is adopted to drive the facial expression changes of the virtual character, thereby realizing a lip-synchronized virtual face that simultaneously accounts for mouth-shape change, expression change and personalized appearance.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a virtual face generation method, which comprises the following steps:
s1: constructing an original face geometric model, wherein the original face geometric model comprises face feature points;
S2: determining lip skeleton feature points based on the face feature points in the original face geometric model, constructing mouth-shape-change key frames and intermediate frames for the lip skeleton feature points by establishing a mapping relation between phonemes and visemes and a mapping relation between visemes and mouth-shape key frames, and constructing a face dynamic mouth-shape model driven by Chinese pinyin phonemes, wherein the face dynamic mouth-shape model comprises the numbers of the mouth-shape-change key frames and intermediate frames, the lip skeleton feature point coordinate values and the insertion times; a phoneme is defined as the minimum unit of a syllable pronunciation action according to the natural attributes of speech; a viseme refers to the state of the positions of the upper and lower lips and the upper and lower jaw when a phoneme is pronounced; the mouth-shape key frames are used for recording the key content of the virtual character's mouth animation picture when a phoneme is pronounced; and the mouth-shape intermediate frames are used for representing the complete change process from the start to the end of the mouth shape when a phoneme is pronounced;
S3: designing facial expression changes of different degrees according to the lip skeleton feature point coordinate values of the mouth-shape-change key frames and intermediate frames and the emotion keywords contained in the input speaking text, generating expression key frames, and constructing a facial expression model of the face, wherein the expression key frames are used for recording the facial expression changes contained in the virtual face animation;
S4: converting the input speaking text into voice audio, and processing the speaking speed and pause of the person to construct a voice model;
S5: inputting a frontal face image and a side face image, processing the original face geometric model obtained in step S1 to obtain a real face geometric model, and integrating the real face geometric model with the face dynamic mouth-shape model obtained in step S2, the facial expression model obtained in step S3 and the voice model obtained in step S4 through synthesis and synchronization processing to generate the virtual face.
Specifically, the face geometric modeling in step S1 is the basis and key of virtual face generation; the completeness and accuracy of the face model directly affect the dynamic mouth-shape modeling, expression modeling and virtual face synthesis of the subsequent steps. In order to describe the complex motion changes of the facial features when a person speaks, a face geometric model is constructed and the basic facial feature points of the model are obtained.
Step S2 performs dynamic mouth-shape modeling of the face with the speaking text as input. Mouth-shape-change key frames and intermediate frames are built for the selected lip skeleton feature points through a phoneme-viseme-key frame mapping and key frame interpolation, realizing a dynamic mouth-shape model driven by Chinese pinyin phonemes. The model comprises dynamic mouth-shape key frames and intermediate frames, and each key frame and intermediate frame contains three pieces of information: its number, the lip skeleton feature point coordinate values, and the insertion time.
Step S3 performs facial expression modeling. Human expressions are produced by controlling the movement of facial muscles to change the shape and position of the facial features. Referring to the Facial Action Coding System (FACS) framework proposed by Ekman and Friesen (1978), and on the basis of the constructed emotion dictionary, this step designs facial expression changes of different degrees for the characteristic facial expressions corresponding to four basic human emotions, according to the emotion keywords contained in the input speaking text, and generates expression key frames. The facial expression model contains key frames and intermediate frames corresponding to those of the dynamic mouth-shape model: when no emotion change is involved, the key frames and intermediate frames of the dynamic mouth-shape model (mouth-shape frames) are taken directly as the video frames of the facial expression model (expression frames); when an emotion change is involved, the lip skeleton feature point coordinate values of the dynamic mouth-shape key frames and intermediate frames are modified to form the expression frames, while the frame numbers and insertion times remain unchanged.
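The rule just described, reusing the mouth-shape frames and offsetting only their skeletal coordinates when an emotion applies, can be pictured with the following small Python sketch. The frame fields follow the description above, but the offset table and its values are hypothetical.

```python
import copy

def to_expression_frames(mouth_frames, emotion_offsets=None):
    """mouth_frames: list of dicts with 'number', 'coords' ({bone: (x, y, z)}) and 'insert_time'."""
    expression_frames = []
    for frame in mouth_frames:
        expr = copy.deepcopy(frame)
        if emotion_offsets:  # an emotion change applies to this stretch of text
            for bone, (dx, dy, dz) in emotion_offsets.items():
                if bone in expr["coords"]:
                    x, y, z = expr["coords"][bone]
                    expr["coords"][bone] = (x + dx, y + dy, z + dz)
        expression_frames.append(expr)  # frame number and insert_time stay unchanged
    return expression_frames

# Hypothetical example: raise one mouth-corner point slightly for a mild "happy" expression.
mouth_frames = [{"number": 1, "coords": {"mouth_03": (1.0, 0.0, 0.0)}, "insert_time": 0}]
print(to_expression_frames(mouth_frames, {"mouth_03": (0.0, 0.2, 0.0)}))
```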
Step S4, performing voice modeling, namely converting the input speaking text into voice audio, and processing the speaking speed and pause of the person to construct a voice model.
And S5, virtual face synthesis is carried out, the original face geometric model is processed to obtain a real face geometric model, and then the real face geometric model is integrated with the dynamic mouth model, the facial expression model and the voice model.
FIG. 1 is a flow chart of generating a virtual face according to an embodiment of the present invention. In FIG. 2, a) is a schematic diagram of the original face geometric model and b) shows the acquisition of the face feature points. FIG. 3 is a schematic view of the selection of the lip skeleton feature points according to an embodiment of the present invention. FIG. 4 is a flowchart of emotion word recognition and emotion calculation according to an embodiment of the present invention. FIG. 5 shows the expression design for different degrees of the happy emotion according to an embodiment of the present invention, with a) the original model, b) somewhat pleased, c) happy and d) very excited. FIG. 6 shows the expression design for different degrees of the anger emotion according to an embodiment of the invention, with a) the original model, b) slightly annoyed, c) very angry and d) flying into a rage. FIG. 7 shows the expression design for different degrees of the surprise emotion according to an embodiment of the present invention, with c) the original model, d) somewhat surprised, e) bewildered and f) greatly surprised. FIG. 8 shows the expression design for different degrees of the sad emotion according to an embodiment of the invention, with a) the original model, b) somewhat disappointed, c) sad and d) heartbroken. FIG. 9 is a schematic diagram of the model cutting of an embodiment of the present invention. FIG. 10 shows the UV-unwrapped tiling effect of the model according to an embodiment of the present invention. FIG. 11 shows the alignment of the frontal image with the side image in accordance with an embodiment of the present invention. FIG. 12 is a schematic illustration of the texture mapping of the face geometric model in accordance with an embodiment of the present invention. FIG. 13 is a schematic diagram of virtual face synthesis in an embodiment of the present invention, with a) the original face, b) the texture-mapped face and c) the synthesized face.
In one embodiment, step S1 includes:
s1.1: adopting a drawing tool to manufacture an original face geometric model in a polygonal editing mode, wherein the original face geometric model covers the face outline, eyebrows, eyes, nose, mouth, ears and neck of a face;
s1.2: and referring to the related face feature points defined in the preset standard, selecting preset feature points as face feature points of the original face geometric model.
In the specific implementation process, step S1.1 uses a drawing tool to produce the original face geometric model by polygon editing. To keep the face geometric model complete, the model covers the facial contour, eyebrows, eyes, nose (nostrils and bridge of the nose), mouth, ears and neck. To achieve a realistic effect, the eyes in the model consist of eye sockets and eyeballs so as to support the natural blinking action when the expression changes, and the mouth includes the lips (upper and lower), tongue and teeth (upper and lower) so as to cooperate with the articulation of the vocal organs. The original model has no expression and its lips are tightly closed, which is the initial natural state of the face. A schematic diagram of an original face model made with 3ds Max is shown in part a) of fig. 2.
Step S1.2: with reference to the facial feature points defined in the MPEG-4 (i.e., ISO/IEC 14496) standard officially released in 1999, 39 feature points are selected, as shown in part b) of fig. 2, including 6 for the left and right eyebrows, 7 for the forehead, 2 each for the left and right eyelids, 4 for the nose, 6 for the left and right cheeks, and 12 for the mouth. The mouth feature points used for mouth-shape modeling are determined in step S2.1. The facial features are located to obtain the coordinates of these feature points.
In one embodiment, step S2 includes:
s2.1: selecting the bone characteristic points contained in the original facial geometry model as sources for driving the change of the mouth shape model by referring to the preset standard to obtain lip bone characteristic points;
S2.2: constructing mouth-shape key frames for the lip skeleton feature points by establishing a mapping relation between phonemes and visemes and a mapping relation between visemes and mouth-shape key frames;
S2.3: introducing dynamic visemes to produce the mouth-shape-change intermediate frames;
s2.4: and storing the information of the dynamic mouth shape frame to construct a human face dynamic mouth shape model, wherein the information of the dynamic mouth shape frame comprises a mouth shape key frame and an intermediate frame, and each frame of information comprises a skeleton point number, a skeleton point three-dimensional coordinate and an insertion time.
In the specific implementation process, step S2.1 selects the lip skeleton feature points. Feature points for the lips are defined in the MPEG-4 standard. Since the lip skeleton points among these feature points are associated with the skin and muscle in the small region around them, controlling the movement of the skeleton feature points is sufficient to simulate and represent the mouth-shape change. The invention therefore combines some of the feature points in the standard and selects skeleton feature points as the source driving the change of the mouth-shape model. As shown in fig. 3, one point is selected at the middle of the upper lip and one at the middle of the lower lip, one at each mouth corner, and two each at the upper-left, lower-left, upper-right and lower-right of the lips, giving 12 skeleton feature points in total.
Step S2.2: generating the mouth-shape-change key frames. This step establishes a phoneme-viseme-key frame mapping to obtain the key frames of mouth-shape change when a person speaks. A phoneme is the minimum unit into which syllable pronunciation actions are divided according to the natural attributes of speech, and each Chinese character can be decomposed into one phoneme or a combination of several phonemes. A viseme refers to the state of the upper and lower lips and the upper and lower jaw when a phoneme is pronounced. Key frames are used to record the key content of the virtual character's mouth animation picture when a phoneme is pronounced.
Step S2.3: generating the mouth-shape-change intermediate frames. The mouth-shape action is a continuous process when a person speaks, and defining only a single static phoneme-viseme-mouth-shape key frame is usually not sufficient. This step introduces dynamic visemes to produce intermediate frames of mouth-shape change, representing the complete change from the start to the end of the mouth shape when a phoneme is pronounced.
Step S2.4: storing the dynamic mouth-shape frame information. The key frames and intermediate frames of the dynamic mouth shape are generated from the visemes corresponding to the obtained initial and final phonemes and the feature point coordinates of key frames F0 to F10, and the mouth-shape frame information is stored; each frame contains a skeleton point number, the skeleton point three-dimensional coordinates and an insertion time. The skeleton points are numbered mouth_01 to mouth_12, each corresponding to one of the 12 skeleton feature points; the three-dimensional coordinates of the skeleton points are obtained by comparing pronunciation frames of the model with those of real-person video and are gradually refined according to the quality of the synthesized animation; and the insertion time is set according to v_speak and v_frame.
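A possible in-memory layout for one such dynamic mouth-shape frame, holding the bone numbers, their three-dimensional coordinates and the insertion time, is sketched below in Python. The field names and the sample coordinate values are illustrative only and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

BONE_IDS = [f"mouth_{i:02d}" for i in range(1, 13)]  # mouth_01 ... mouth_12

@dataclass
class MouthShapeFrame:
    insert_time: int                                   # frame index at which this frame is inserted
    coords: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)
    is_key_frame: bool = True                          # False for interpolated intermediate frames

# Hypothetical key frame for one viseme; all coordinates are placeholders.
frame = MouthShapeFrame(insert_time=0,
                        coords={bone: (0.0, 0.0, 0.0) for bone in BONE_IDS})
print(len(frame.coords))  # 12 lip skeleton points
```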
In one embodiment, step S2.2 comprises:
S2.2.1: classifying the phoneme states of the basic mouth shapes of Chinese pronunciation according to preset rules using three basic parameters, namely the longitudinal change value of the lips, the transverse change value of the lips and the opening-and-closing change value of the jaws and teeth, defining visemes according to the classified phoneme states, and constructing a mapping relation between phonemes and visemes;
S2.2.2: establishing a mapping relation between visemes and mouth-shape key frames according to the Chinese pinyin corresponding to the input speaking text and the lip skeleton feature points obtained in step S2.1, wherein each viseme corresponds to one set of three-dimensional lip skeleton point coordinates.
In the specific implementation process, step S2.2.1: establishing phoneme-to-viseme mapping
The pronunciation of Chinese consists of the initials, finals and medials of Chinese pinyin. Three basic parameters, namely the longitudinal change value of the lips, the transverse change value of the lips and the opening-and-closing change value of the jaws and teeth, are used to classify the states of the basic mouth shapes of Chinese pronunciation according to rules 1-4.
Rule 1: if the upper and lower lips are closed, the longitudinal change value, the transverse change value and the opening and closing change values of the upper and lower jaws and teeth are all 0, the mouth shape is in an initial state.
Rule 2: if the longitudinal change value of the lips changes, the mouth shapes of the 55 Chinese pinyin phonemes are classified into four states, namely: (1) no change (the lips remain closed), (2) minor change (height difference < 10%), (3) moderate change (height difference between 10% and 25%), (4) significant change (height difference between 25% and 50%).
For example, when the phoneme "en" is pronounced, the lips stay closed with no change in height, so it belongs to class (1); when the phoneme "n" is pronounced, the lips open slightly, belonging to class (2); when the phoneme "e" is pronounced, the upper and lower lips open fairly wide, belonging to class (3); and when the phoneme "a" is pronounced, the lips open markedly, belonging to class (4).
Rule 3: if the lip lateral variation value is changed, classifying the phoneme mouth forms into three types of states, namely: (1) the lip length is unchanged, (2) the lip length is reduced, and (3) the lip length is increased.
Rule 4: if the opening and closing of the upper jaw and the lower jaw and the teeth are changed when each phoneme is pronounced, the mouth shape is classified, and phonemes with similar lip changes are subdivided into different categories according to the opening and closing conditions of the upper jaw and the lower jaw and the teeth.
The visemes are then defined according to the classified phoneme states. A phoneme-viseme mapping table, as shown in Table 1, is established between phonemes and visemes and serves as the basis for producing the mouth-shape-change key frames.
Table 1: phoneme-to-viseme mapping table
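Since the full content of Table 1 is not reproduced here, the following Python sketch only illustrates the shape of such a mapping: each viseme is described by the three basic parameters of rules 1-4. The vertical classes of the four sample phonemes follow the examples given above ("en" class 1, "n" class 2, "e" class 3, "a" class 4); all other values are hypothetical.

```python
from typing import NamedTuple

class Viseme(NamedTuple):
    vertical_class: int    # 1 no change ... 4 significant change (rule 2)
    horizontal_class: int  # 1 lip length unchanged, 2 reduced, 3 increased (rule 3)
    jaw_class: int         # opening/closing state of the jaws and teeth (rule 4)

PHONEME_TO_VISEME = {
    "en": Viseme(vertical_class=1, horizontal_class=1, jaw_class=1),
    "n":  Viseme(vertical_class=2, horizontal_class=1, jaw_class=1),
    "e":  Viseme(vertical_class=3, horizontal_class=3, jaw_class=2),
    "a":  Viseme(vertical_class=4, horizontal_class=1, jaw_class=3),
}

print(PHONEME_TO_VISEME["a"].vertical_class)  # 4: lips clearly open
```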
Step 2.2.2: establishing a viseme-to-mouth key frame map
A mapping between visemes and mouth-shape key frames is established according to the Chinese pinyin corresponding to the input speaking text and the 12 lip skeleton feature points defined in step S2.1, with each viseme corresponding to one set of three-dimensional lip skeleton point coordinates. When determining the coordinates of the lip skeleton feature points, the mouth-shape-change features are extracted from real-person video and Chinese pinyin pronunciation teaching video, each set of coordinates is preset according to the corresponding set of features, and after the mouth-shape animation is synthesized the coordinates are refined one by one by comparing the result with the real person.
In Chinese, the pinyin of every Chinese character contains a final but does not necessarily contain an initial, and although the initial and the final are pronounced continuously they are relatively independent. Therefore, for the different combinations of initials and finals in Chinese, the pinyin of each Chinese character is parsed according to rules 5-8 and the initial and final are stored separately (a code sketch of this parsing logic is given after rule 8 below). The Chinese character entity class comprises an initial, a final and a time attribute; for each Chinese character object the initial may be empty while the final is never empty, and the time attribute records the frame position at which the character drives the mouth-shape change in the animation.
Rule 5: if the first letter of the character's pinyin is one of "z", "c" and "s" and the second letter is not "h", the first letter is stored as the initial and the remainder as the final;
Rule 6: if the first letter is one of "z", "c" and "s" and the second letter is "h", the first two letters are stored as the initial and the remainder as the final;
Rule 7: if the first letter is any other initial (not "z", "c" or "s"), that letter is stored as the initial and the remainder as the final;
Rule 8: if the pinyin contains only a final, for example a character whose pinyin is just the final "a", only the final is stored.
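The sketch below is a direct Python transcription of rules 5-8. The set of single-letter initials used here is an assumption added for the sketch; the rules themselves only single out "z", "c", "s" and "h".

```python
INITIALS = set("bpmfdtnlgkhjqxrzcsyw")  # assumed single-letter initials for this sketch

def split_pinyin(pinyin: str):
    """Return (initial, final); the initial may be an empty string (rule 8)."""
    if pinyin[0] in "zcs":
        if len(pinyin) > 1 and pinyin[1] == "h":
            return pinyin[:2], pinyin[2:]   # rule 6: "zh", "ch", "sh"
        return pinyin[:1], pinyin[1:]       # rule 5: plain "z", "c", "s"
    if pinyin[0] in INITIALS:
        return pinyin[:1], pinyin[1:]       # rule 7: any other initial
    return "", pinyin                       # rule 8: final only, e.g. "a"

print(split_pinyin("zhang"))  # ('zh', 'ang')
print(split_pinyin("san"))    # ('s', 'an')
print(split_pinyin("a"))      # ('', 'a')
```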
In one embodiment, step S2.3 comprises:
S2.3.1: taking the static visemes as mouth-shape key frames, adding several intermediate frames between two static visemes to represent the continuous transition between two mouth shapes, and calculating the mouth feature point coordinates of each intermediate frame from the feature point coordinates of the preceding and following key frames, with the time between the two key frames as the variable parameter;
S2.3.2: determining the number of intermediate frames between two mouth-shape key frames according to the time interval between them;
S2.3.3: adding up the numbers of all mouth-shape key frames and intermediate frames so far to obtain the insertion time of the mouth-shape key frame corresponding to the phoneme of the next Chinese character.
In the specific implementation process, step S2.3.1: mouth-based intermediate frame feature point coordinate calculation
When the intermediate frames are designed, each static viseme serves as a mouth-shape key frame, and several intermediate frames are added between two static visemes to represent the continuous transition between the two mouth shapes, so that the animation is smooth and fluent. Smooth motion between two key frames is specified within the standard time interval using an interpolation function: according to the feature point coordinates of the preceding and following key frames, with the time between the two key frames as the variable parameter, the mouth feature point coordinates of each intermediate frame are calculated by the interpolation formula (1).
where P(n,t) is the coordinate of the n-th mouth feature point at time t; t1 and t2 are the moments at which the feature point changes in the previous key frame and the next key frame, respectively, and Δt = t2 − t1; and P(n,t1) and P(n,t2) are the coordinates of the n-th mouth feature point at times t1 and t2, respectively.
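The body of formula (1) is not reproduced in this text, so the sketch below simply assumes a linear interpolation over [t1, t2], which is consistent with the variable definitions above; the actual formula in the patent may differ.

```python
def interpolate_point(p_t1, p_t2, t1, t2, t):
    """Coordinate of one mouth feature point at time t, with t1 <= t <= t2 (assumed linear)."""
    alpha = (t - t1) / (t2 - t1)
    return tuple(a + alpha * (b - a) for a, b in zip(p_t1, p_t2))

# One intermediate frame halfway between two key-frame positions of a single feature point.
print(interpolate_point((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), t1=0.0, t2=1.0, t=0.5))
# (0.0, 0.5, 0.0)
```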
Step S2.3.2: intermediate frame number calculation
The number of intermediate frames between two key frames is determined by the time interval between them, but different phonemes have different pronunciation durations. For example, the pronunciation time of an initial is usually slightly shorter than that of a final, so its mouth-shape change time is also slightly shorter.
When the time interval is determined, a weight is assigned to each phoneme, and the weight represents the length of the mouth-shaped variation time corresponding to the phoneme. The larger the weight, the more the number of intermediate frames needed for the corresponding key frame; conversely, the fewer. And (3) comparing the teaching video of Chinese pinyin pronunciation with the video of true pronunciation, extracting the pronunciation time of each group of phonemes, and setting the weight of each phoneme as shown in table 2.
Table 2: phoneme weight table
The duration required for pronunciation of a certain Chinese character j is calculated by using the formula (2).
where v_speak is the speech rate in characters per second, i.e. the average number of characters spoken per second in a passage, set manually according to the application requirements; N is the total number of characters in the passage; w_j is the weight of the phonemes of the j-th Chinese character; and the remaining quantity in formula (2) is the weight sum of the whole text.
The number of intermediate frames required for the key frames corresponding to the Chinese characters is calculated by a formula (3).
where N_ji is the number of intermediate frames required after the key frame corresponding to the i-th phoneme of the j-th Chinese character, w_i is the weight of the i-th phoneme, w_j is the sum of the weights of all phonemes of the j-th Chinese character, and v_frame is the animation playback speed in frames per second.
It follows from formulas (2) and (3) that the number of intermediate frames is determined by the two parameters v_speak and v_frame. v_speak can be set manually for each speaker: the faster v_speak is, the shorter the pronunciation time of each character and the fewer the intermediate frames following each character's key frames. Too few intermediate frames make the mouth-shape change unnatural, so a sufficient number of intermediate frames should be ensured in practical applications. As for v_frame, the faster the playback speed, i.e. the more frames per second, the larger the number of intermediate frames and the finer the animation effect. According to the principle of persistence of vision, a picture displayed too briefly is not easily perceived by the human eye, and the playback speed generally does not need to exceed 24 frames per second; v_frame is therefore set to 24 frames per second, following this common standard.
Step S2.3.3: determining the time of insertion of the next phoneme in the key frame
After the number of the intermediate frames is obtained, adding all the key frame numbers to the number of the intermediate frames to obtain the insertion time (namely the number of the insertion frames) of the key frame corresponding to the phoneme of the next Chinese character j+1, wherein the calculation formula is shown in a formula (4).
where T_(j+1) is the insertion frame number of the key frame corresponding to the phoneme of the (j+1)-th Chinese character, and N_i is the sum of the numbers of key frames and intermediate frames of the i-th Chinese character.
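Formulas (2)-(4) are described only through their variables in this text, so the Python sketch below reconstructs them under those definitions: a character's duration is its weight share of the total speaking time, the intermediate-frame count after a phoneme's key frame distributes that duration at v_frame frames per second, and the insertion frame of the next character is the cumulative frame count of the preceding characters. Both the reconstructed expressions and the sample weights are assumptions.

```python
def char_duration(weight_j, total_weight, n_chars, v_speak):
    """Formula (2), reconstructed: seconds spent on character j (v_speak in characters/second)."""
    return (n_chars / v_speak) * (weight_j / total_weight)

def intermediate_frames(phoneme_weight, char_weight, duration_j, v_frame):
    """Formula (3), reconstructed: intermediate frames after one phoneme's key frame."""
    return round(duration_j * (phoneme_weight / char_weight) * v_frame)

def next_insert_frame(frames_per_char):
    """Formula (4): insertion frame of the next character = cumulative frames so far."""
    return sum(frames_per_char)

# Hypothetical three-character text; each inner list holds the phoneme weights of one character.
weights = [[1, 2], [2], [1, 3]]
char_w = [sum(w) for w in weights]
total_w = sum(char_w)
v_speak, v_frame = 3.0, 24                      # 3 characters/second, 24 frames/second

d0 = char_duration(char_w[0], total_w, len(weights), v_speak)
frames_char0 = [intermediate_frames(w, char_w[0], d0, v_frame) for w in weights[0]]
print(d0, frames_char0, next_insert_frame([sum(frames_char0) + len(weights[0])]))
```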
In one embodiment, step S3 includes:
s3.1: establishing a basic emotion dictionary for identifying emotion keywords related to emotion in an input text;
s3.2: combining the emotion dictionary, and carrying out emotion word recognition and emotion calculation on the input speaking text;
s3.3: facial expression changes with different degrees are designed according to lip skeleton feature point coordinate values of the mouth-shaped variable key frames and the mouth-shaped variable intermediate frames and emotion keywords contained in the input speaking text, and expression key frames are generated;
s3.4: the blink action is controlled by modifying the three-dimensional coordinates of the skeletal feature points of the upper eyelid, setting key frames when opening and closing.
In the specific implementation process, step 3.1: constructing a basic emotion dictionary
A basic emotion dictionary is established for identifying emotion keywords in the input text. Following FACS, the four most basic human emotion categories are used: happy, angry, surprised and sad. A Chinese emotion dictionary comprises emotion-polarity vocabulary, polarity evaluation words, proposition words and degree adverbs. Referring to the HowNet word set for sentiment analysis and a simplified Chinese emotion polarity dictionary, the positive and negative emotion words that occur most frequently in spoken language are collected to construct the basic emotion dictionary of the invention, as shown in Table 3.
Table 3: basic emotion dictionary
Step S3.2: emotion word recognition and emotion calculation
And combining the basic emotion dictionary, and carrying out emotion word recognition and emotion calculation on the input speaking text. A flowchart of emotion word recognition and emotion calculation is shown in fig. 4.
Step S3.3: facial expression generation
Various rich expressions and actions of the face of a person are represented by facial muscle movements. This step achieves facial expression changes of different degrees.
Step S3.4: blink action control. This step mimics the blinking action of a real person.
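As a rough illustration of step S3.4, the sketch below inserts open/closed key frames for the upper-eyelid skeleton points at regular intervals. The point of the sketch is only the key-frame pattern; the blink interval, closing time and drop distance are all assumed values.

```python
def blink_keyframes(duration_frames, blink_every=96, close_for=4, drop=0.6):
    """Return {frame_index: upper_eyelid_y_offset}; 0.0 means open, -drop means fully closed."""
    keyframes = {}
    for start in range(blink_every, duration_frames, blink_every):
        keyframes[start] = 0.0                      # key frame: eyes still open
        keyframes[start + close_for // 2] = -drop   # key frame: eyelid lowered
        keyframes[start + close_for] = 0.0          # key frame: eyes open again
    return keyframes

print(blink_keyframes(duration_frames=240))
# {96: 0.0, 98: -0.6, 100: 0.0, 192: 0.0, 194: -0.6, 196: 0.0}
```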
In one embodiment, step S3.2 comprises:
s3.2.1: initializing emotion parameters, wherein the emotion parameters comprise emotion weight values and influence degree values of emotion degree adverbs;
s3.2.2: word segmentation processing is carried out on the input text;
s3.2.3: detecting whether words in word segmentation results contain keywords related to emotion, and giving different emotion weights to the words according to the intensity of emotion word meanings;
s3.2.4: detecting whether a word in a word segmentation result contains a degree adverb or not, and giving influence degree values of different emotion degree adverbs to the word according to the degree adverb;
s3.2.5: detecting whether words in word segmentation results contain negative words or not, and determining the number of the negative words;
S3.2.6: and calculating the emotion value of the input text word according to the emotion weight value of the word segmentation result, the influence degree value of the emotion degree adverbs and the number of the contained negative words.
In a specific implementation process, step S3.2.1: emotion parameter initialization
For the emotion value calculation, the emotion parameters are first initialized, including the emotion weight q and the influence degree value ω of the emotion degree adverb, with q = 1 and ω = 1. Meanwhile, to count the occurrences of negation words in the input text, n is defined as the number of negation words and initialized to n = 0.
Step S3.2.2: word segmentation processing of input text
And performing word segmentation processing on the input text. And removing some invalid words and special symbols in the text, and then performing word segmentation operation on the text by adopting a barking word segmentation and other segmentation tools to obtain a text word segmentation result set.
For example, the word segmentation result of the text "feel very good today" is { "today", "feel", "very good" }.
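This segmentation step can be reproduced with an off-the-shelf Chinese tokenizer such as jieba (the tool referred to above). The stop-token set below is a placeholder for the "invalid words and special symbols" that are removed first, and the exact token boundaries depend on the tokenizer's dictionary.

```python
import jieba  # pip install jieba

STOP_TOKENS = {"的", "了", "，", "。", "！"}  # illustrative stop tokens only

def segment(text: str):
    return [tok for tok in jieba.lcut(text) if tok.strip() and tok not in STOP_TOKENS]

print(segment("今天感觉很好"))
# e.g. ['今天', '感觉', '很', '好'] or ['今天', '感觉', '很好'], depending on the dictionary
```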
Step S3.2.3: emotion keyword detection
Different emotion weights q are assigned to the words according to the intensity of the emotion word's meaning, as shown in Table 4: for a low emotion degree, q = 1; for a medium emotion degree, q = 2; and for a high emotion degree, q = 3.
For example, among vocabulary expressing happiness, "satisfied" has a lower weight and "happy" has a higher weight; similarly, among vocabulary expressing anger, milder words carry a lower weight and stronger words a higher weight.
Table 4: emotion vocabulary weight classification
Comparing the text word segmentation result obtained in the step S3.2.2 with emotion words in the basic emotion dictionary established in the step S3.1, detecting whether the word segmentation is a keyword related to emotion, and classifying and setting an emotion weight q according to the emotion word weight.
For example, the text word segmentation result of "feel happy today" has the emotion word "happy".
Step S3.2.4: emotion level adverb detection
The intensity of the emotion influences the expression change. For example, when happy, the amplitude of mouth-corner raising and mouth opening differs with the degree of happiness; when angry, the degree of frowning varies, and at extreme anger the teeth are bared and clenched; when surprised, the amplitude of eyebrow raising and eye opening changes with the degree of surprise; when sad, the change around the eyelids is not obvious, but the mouth corners are pulled down.
The emotion intensity level is reflected by degree adverbs, for which the influence degree value ω is set. Whether the word segmentation result obtained in step S3.2.2 contains a degree adverb is detected, and the ω value is determined according to Table 5.
Table 5: degree of influence of adverbs
For example, the text word segmentation result of "feel very happy today" contains the degree adverb "very" with a degree of high magnitude, ω=3.
Step S3.2.5: emotion negation detection
Emotion tendency has a large influence on facial expression change and is reflected through negation words. Whether the word segmentation result obtained in step S3.2.2 contains negation words, and how many, is detected to determine the value of n.
For example, the word segmentation result of "feel uncomfortable today" contains the negation word "not", so the number of negation words is n = 1.
Step S3.2.6: emotion value calculation
On the basis of detecting emotion keywords, degree adverbs and negation words, the emotion value Q of the input text, i.e. the emotion intensity value, is calculated according to formula (5).
Q = (-1)^n × q × ω    (5)
Wherein n is the number of occurrences of the negative word, q is the emotion weight, and ω is the influence degree value of the degree adverb.
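For illustration, the whole of step S3.2 can be sketched as follows; the dictionaries below are illustrative placeholders, not the actual contents of the basic emotion dictionary (Table 3) or of Tables 4 and 5.

```python
# Sketch of step S3.2: emotion value Q = (-1)^n * q * omega (formula 5).
# The dictionaries below are illustrative placeholders, not the actual
# contents of the basic emotion dictionary (Table 3) or Tables 4 and 5.
EMOTION_WEIGHTS = {"满意": 1, "高兴": 2, "兴奋": 3}        # q by emotion-word intensity
DEGREE_ADVERBS = {"有点": 1, "比较": 2, "很": 3, "非常": 3}  # omega by adverb magnitude
NEGATION_WORDS = {"不", "没", "没有"}

def emotion_value(words):
    q, omega, n = 1, 1, 0              # step S3.2.1: initialization
    for w in words:
        if w in EMOTION_WEIGHTS:
            q = EMOTION_WEIGHTS[w]     # step S3.2.3: emotion keyword weight
        if w in DEGREE_ADVERBS:
            omega = DEGREE_ADVERBS[w]  # step S3.2.4: degree adverb influence
        if w in NEGATION_WORDS:
            n += 1                     # step S3.2.5: count negation words
    return (-1) ** n * q * omega       # step S3.2.6: formula (5)

# "今天感觉很高兴" -> ["今天", "感觉", "很", "高兴"] -> Q = 1 * 2 * 3 = 6
print(emotion_value(["今天", "感觉", "很", "高兴"]))
```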
In one embodiment, step S3.3 includes:
s3.3.1: superposing different basic action units to express rich facial expressions, defining action units on the expression, and determining expression action units, wherein the facial expressions comprise happiness, anger, surprise and sadness;
s3.3.2: according to the emotion calculation result and lip bone feature point coordinate values of the mouth-shaped keyframes and the intermediate frames, facial expressions with different degrees are designed;
s3.3.3: and calculating coordinates of feature points in the facial expression key frames with different degrees.
In a specific implementation process, step S3.3.1: expression AU determination
FACS describes the relaxation or contraction of the facial muscles through Action Units (AUs), as shown in Table 6. For example, AU4 is brow lowering: the eyebrows are drawn down and toward each other, producing wrinkles between them. AU7 is squinting: the eyelids are tightened and the lower eyelid is raised.
Table 6: basic AU definition
AU numbering | Expression change description |
1 | Raising the inner corner of the eyebrow |
2 | Raising the outer corner of the eyebrow |
4 | Lowering and knitting the eyebrows (frowning) |
5 | Raising the upper eyelid |
6 | Raising the cheek |
7 | Tightening the inner ring of the orbicularis oculi (squinting) |
9 | Wrinkling the nose |
12 | Pulling the mouth corner upwards |
15 | Pulling the mouth corner downwards |
25, 26, 27 | Opening the mouth |
43 | Closing the eyes |
45 | Blinking |
The present invention superimposes the different basic AUs shown in table 6 to express rich facial expressions. To express facial expression changes of four emotions of happiness, anger, surprise, sadness, AU definitions were made for these expressions as shown in table 7.
Table 7: AU definition of four basic expressions
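Since the contents of Table 7 are not reproduced above, the following sketch encodes an assumed AU combination for each of the four expressions, pieced together from the case descriptions below and common FACS usage; it is not the patent's actual table.

```python
# Hypothetical AU combinations for the four basic expressions (Table 7 is not
# reproduced above, so this mapping is an assumption pieced together from the
# case descriptions below and common FACS usage).
EMOTION_AUS = {
    "happy":     [6, 7, 12, 25],  # cheek raise, squint, mouth corners up, lips part
    "angry":     [4, 7, 9],       # brow lowering, lid tighten, nose wrinkle
    "surprised": [1, 2, 5, 26],   # inner/outer brow raise, lid raise, jaw drop
    "sad":       [1, 4, 15],      # inner brow raise, brow knit, mouth corners down
}

def aus_for(emotion):
    """Return the basic action units superimposed to express an emotion."""
    return EMOTION_AUS.get(emotion, [])
```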
Step S3.3.2: facial expression design of different degrees
Facial expression is mainly represented by changes in the eyes, cheeks and mouth. With reference to the facial regions that FACS associates with each emotion, expression changes of different magnitudes are designed for different emotion intensities; they are obtained by further transforming the dynamic mouth model designed in step S2 according to the degree of emotion. This step designs facial expressions of different degrees according to cases 1 to 4 below.
Case 1: happy expression design
According to the FACS definition, happy emotion is mainly reflected in changes of the eyes, lips and cheeks: the eyes squint slightly, the mouth corners rise, the lips part to some extent and the cheeks lift. Although the squint amplitude varies with the emotion degree, the difference in eyelid change is not obvious. The happy expression of the face model is simulated at different degrees: according to the Q value calculated by formula (5), the higher the degree of happiness, the larger the rise of the mouth corners.
For example, "feel very happy today", a word set { "today", "feel", "very", "happy" is obtained from barking word segmentation. Wherein, the 'straight' is a degree adverb, belongs to a degree high order, and the happy is a degree medium emotion vocabulary, and finally the emotion intensity degree value is obtained as medium and high. When the word is pronouncing, the face starts to have expression change. Compared with the initial state of the original model, the happy emotion face model is characterized in that eyelids are tightened, cheeks are lifted, cheeks are raised, and large cheeks are raised and the corners of the mouth are broken.
Fig. 5 b), c) and d) show expression designs for three increasing degrees of happy emotion ("pleased", "happy" and "very excited"). Compared with the original model shown in Fig. 5 a), the eyelid region shows no significant difference across degrees, since its change is subtle. The lips differ most obviously: at a low degree the mouth corners rise slightly and the zygomatic muscles lift a little; at a medium degree the lips part and the zygomatic region changes significantly; at a high degree the mouth opens wide and the upper and lower teeth separate clearly.
Case 2: anger expression design
According to the FACS definition, angry emotion is mainly reflected in changes of the eyebrows, eyelids and skin: wrinkles form between the eyebrows, the eyelids tighten slightly and the skin on the bridge of the nose puckers. The angry expression of the face model is simulated at different degrees: according to the Q value calculated by formula (5), the eyebrows and the forehead skin change to different extents; the higher the degree of anger, the lower the inner-eyebrow wrinkles drop and the more obviously the forehead and nose-bridge skin wrinkles. FACS also describes changes in facial wrinkle lines, but this feature is not present in everyone and depends on age, so it is not considered by the present invention.
For example, "I get very angry", get word segmentation set from barking word segmentation { "I", "very", "angry" }. Wherein, the word "very" is a degree adverb, which belongs to the emotion vocabulary with high degree and low degree, and finally the emotion intensity degree is low. When the character is pronouncing, the face starts to have expression change, and compared with the initial state of the original model, the change of the eyebrow and the eye part is similar, the internal crease is shown, the eyelid is strained to a small extent, and the skin of the nose bridge part is wrinkled.
Fig. 6 b), c) and d) show expression designs for three increasing degrees of angry emotion ("slightly angry", "very angry" and "flying into a rage"). Compared with the original model shown in Fig. 6 a), at a low degree the eyebrows are slightly knitted; at a medium degree the skin on the nose bridge wrinkles; at a high degree the changes of the eyebrow and nose-bridge regions increase and the teeth are bared and clenched.
Case 3: surprise expression design
According to the FACS definition, surprised emotion is mainly reflected in changes of the eyebrows and mouth: the eyebrows move upward, both eyes open wide, the mouth opens slightly and the upper and lower jaws separate. The surprised expression of the face model is simulated at different degrees: according to the Q value calculated by formula (5), the higher the degree of surprise, the greater the eyelid change and the higher the eyebrows rise. The mouth-opening motion is affected by the degree of surprise and varies from person to person.
Expression designs for three increasing degrees of surprised emotion ("somewhat surprised", "wide-eyed" and "astonished") are shown in Fig. 7 b), c) and d). Compared with the original model shown in Fig. 7 a), as the degree of surprise increases, the amplitude of eyebrow raising, eye widening, eyelid loosening and mouth opening also increases.
Case 4: sad expression design
According to the FACS definition, sad emotion is mainly reflected in changes of the eyebrows, eyelids and mouth corners: the eyebrows knit slightly, the eyelids droop a little and the mouth corners are pulled down. The sad expression of the face model is simulated at different degrees: according to the Q value calculated by formula (5), the degrees of frowning and mouth-corner depression vary; the higher the degree of sadness, the more obvious the frown and the larger the downward pull of the mouth corners.
Expression designs for three increasing degrees of sad emotion ("somewhat disappointed", "sad at heart" and "deeply grieved") are shown in Fig. 8 b), c) and d). Compared with the original model shown in Fig. 8 a), as the sad expression intensifies with the emotion intensity, the change in the eye and eyebrow region remains subtle, while the drop of the mouth corners and the slackening of the facial muscles change obviously.
Step S3.3.3: expression key frame generation
The coordinates of the feature points in the facial expression key frames of different degrees are calculated with formula (6).
P'(n,t) = [P(n,t) − P(n,0)] × Q × μ + P(n,0)    (6)
where P'(n,t) is the coordinate of the nth mouth feature point at time t when the expression changes; P(n,t) is the coordinate of the nth mouth feature point at time t for the same view position before the emotion transformation, calculated by formula (1); P(n,0) is the coordinate of the nth mouth feature point at the initial time, i.e. in the expressionless initial state; Q is the emotion intensity value calculated by formula (5); and μ is a proportionality coefficient.
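A minimal sketch of formula (6), assuming the feature points are held as NumPy arrays; the default value of the proportionality coefficient μ is illustrative, since its value is not given in the text.

```python
import numpy as np

# Sketch of formula (6): P'(n,t) = [P(n,t) - P(n,0)] * Q * mu + P(n,0).
# P_t and P_0 are (N, 3) arrays of mouth feature-point coordinates; the value
# of the proportionality coefficient mu is not given in the text and the
# default below is illustrative only.
def expression_keyframe(P_t, P_0, Q, mu=0.1):
    """Scale the displacement from the neutral mouth shape by the emotion intensity Q."""
    return (P_t - P_0) * Q * mu + P_0
```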
It should be noted that the result of step S2 (dynamic mouth modeling of the face) is a dynamic mouth model, i.e. the mouth-shape key frames, the numbers of the mouth-shape intermediate frames, the coordinate values of the lip skeleton feature points and the insertion times; the lip skeleton feature point coordinates are then input into step S3. If the input text contains no emotion words, the lip skeleton feature point coordinates remain unchanged; if it does contain emotion words, the coordinates are further transformed according to the four expression cases above, because expression changes also involve the lips.
For example, when the input text is "I am a teacher.", no emotion word is detected in step S3, so the mouth model generated in step S2 for "I am a teacher." is used unchanged.
When the input text is, for example, "I am really happy today.", step S2 generates the corresponding mouth model, but when the degree adverbs are detected in step S3, the mouth model is further transformed.
In a specific implementation, this step S3.4 may be implemented by sub-steps S3.4.1 and S3.4.2.
Step S3.4.1: blink count calculation
Observation of real people and of video data shows that a person generally blinks 15-20 times per minute, i.e. roughly once every 3 to 4 seconds. Blinks are not evenly distributed: sometimes they occur in quick succession, sometimes at intervals of nearly 10 seconds, and they are not under conscious control. When designing the blink action, a random number is therefore used to control the blink time point. The number of blinks is denoted by m. For a segment of animation, m is calculated from the total duration assuming one blink every 3 seconds; a time point is then chosen randomly within each time range to trigger a blink, and the frame number that triggers the blink action is computed from the frame rate v_frame, thereby controlling the blink action.
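A minimal sketch of this blink scheduling, assuming a frame rate v_frame and one blink per roughly 3-second window; the window layout and names are illustrative.

```python
import random

# Sketch of the blink-count calculation: roughly one blink per 3-second
# window, triggered at a random moment inside each window and converted to a
# frame number with the frame rate v_frame (names and defaults illustrative).
def blink_frames(duration_s, v_frame=25, seed=None):
    rng = random.Random(seed)
    m = int(duration_s / 3)                          # number of blinks m
    frames = []
    for i in range(m):
        t = rng.uniform(i * 3, min((i + 1) * 3, duration_s))
        frames.append(int(t * v_frame))              # frame that triggers a blink
    return sorted(frames)

print(blink_frames(12.0, v_frame=25, seed=1))        # e.g. four trigger frames
```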
Step S3.4.2: blink motion design
The blink action mainly has two states, open and closed. Key frames for the open and closed states are set by modifying the three-dimensional coordinates of the upper eyelid skeleton feature points at the time points obtained from the random numbers, thus realizing the blink animation.
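Continuing the sketch above, the trigger frames can be turned into upper-eyelid keyframes; the eyelid offsets and the close/open timing are assumptions for illustration.

```python
# Sketch of turning blink trigger frames into upper-eyelid keyframes
# (open -> closed -> open).  The eyelid offsets and the 2-frame close /
# 2-frame open timing are assumptions for illustration.
EYELID_OPEN, EYELID_CLOSED = 0.0, -1.0   # illustrative vertical offsets

def blink_keyframes(trigger_frames):
    """Return (frame, eyelid_offset) keyframes for each blink."""
    keys = []
    for f in trigger_frames:
        keys += [(f, EYELID_OPEN), (f + 2, EYELID_CLOSED), (f + 4, EYELID_OPEN)]
    return keys

print(blink_keyframes([30, 105]))        # keyframes for two blinks
```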
Step S4: the speech modeling may be achieved by sub-steps S4.1 to S4.2.
Step S4.1: text-to-speech
Any externally input text information is converted into standard natural-language speech audio using Text-To-Speech (TTS) technology.
Step S4.2: processing of speech speed and pauses
While communicating, a person will often adjust the speech rate or insert pauses, intentionally or unintentionally. The required speech-rate control and pause processing can be achieved by sub-steps S4.2.1 to S4.2.2.
Step S4.2.1: speech speed control
To simulate the speaking of a virtual person and make the facial animation look and sound more lifelike, the mouth-shape changes must be matched with the audio output.
The Microsoft SAPI engine is called through the Jacob jar package to synthesize audio from text. SAPI has several parameters, including the output file type, the audio volume and the speech-rate value rate. The rate is an important parameter for lip synchronization, with values ranging from -10 to 10; some extreme rate values can be excluded according to the pronunciation speed of an ordinary person. A suitable rate value is selected according to the speech speed v_speak to synthesize the speech. The rate values corresponding to different speech speeds are shown in Table 8.
Table 8: rate value corresponds to speech rate
Step S4.2.2: speaking pause processing
In the text processing process, the positions of punctuation marks (comma, period, semicolon, question mark and exclamation mark) which can generate pauses are recorded and stored, and short pauses are generated at the positions during voice processing.
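A minimal sketch of recording the pause positions; the punctuation set mirrors the marks listed above.

```python
# Sketch of recording the positions of pause-generating punctuation marks so
# that short pauses can be inserted during speech synthesis.
PAUSE_MARKS = set("，。；？！,.;?!")  # comma, period, semicolon, question, exclamation

def pause_positions(text):
    """Return the character indices at which a short pause should occur."""
    return [i for i, ch in enumerate(text) if ch in PAUSE_MARKS]

print(pause_positions("今天天气很好，我很高兴。"))  # -> [6, 11]
```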
For example, consider an input text of the form "I ... am really running ...", which contains both a degree adverb and pause marks:
(1) Step S2 generates the mouth model for the sentence.
(2) When a person is simulated speaking, the speech rate needs to be controlled.
(3) The degree adverb "really" recognized in step S3 affects the speech rate in the speech modeling of step S4.
(4) Meanwhile, the input text contains period marks that represent pauses, so pause processing is needed.
Therefore, a coordination between speech rate and pause is required.
Step S5: virtual face synthesis can be achieved through substeps S5.1-S5.3.
Step S5.1: UV spreading
To make the face model detailed and realistic, a high-quality real-person texture map must be fitted onto it, i.e. the x, y, z coordinates of the three-dimensional space of the face model are mapped to the coordinates u, v, w of the texture map. Since the w coordinate is rarely used in practice, only the u and v coordinates need to be considered. The invention uses UV unwrapping to reduce the three-dimensional information to two dimensions.
The original face geometric model obtained in step S1.1 is unwrapped. Because the face model is symmetrical, it can be cut down the middle; as shown in Fig. 9, the cut line runs through the entire skull. The model is then tiled onto a grid, and Fig. 10 shows the UV-unwrapping tiling result.
Step S5.2: face map making
To make the texture-mapping effect close to a real person, the input front face image and side face image are stitched to produce a face expansion map, through sub-steps S5.2.1 to S5.2.3.
Step S5.2.1: Alignment of the front and side face images
The front face image and the side face image shown in Fig. 11 are aligned with respect to regions with obvious features, such as the eye corners and mouth corners.
Step S5.2.2: image boundary blurring
In the newly aligned images, the junction between the two images is blurred so that their skin colors match and the images blend naturally.
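A minimal sketch of this seam blending with OpenCV/NumPy, assuming the two images are the same size and already aligned; the file names, seam position and blend width are illustrative.

```python
import cv2
import numpy as np

# Sketch of blending the seam between the aligned front and side face images
# so the skin tones match.  File names, seam position and blend width are
# illustrative assumptions; both images are assumed equal in size and aligned.
def blend_seam(front, side, seam_x, width=40):
    """Concatenate front|side at seam_x and alpha-blend a band around the seam."""
    h = min(front.shape[0], side.shape[0])
    out = np.hstack([front[:h, :seam_x], side[:h, seam_x:]])
    x0, x1 = max(seam_x - width, 0), min(seam_x + width, out.shape[1])
    alpha = np.linspace(0.0, 1.0, x1 - x0)[None, :, None]   # linear ramp across the band
    band = front[:h, x0:x1] * (1 - alpha) + side[:h, x0:x1] * alpha
    out[:, x0:x1] = band.astype(out.dtype)
    return out

# front_img = cv2.imread("front.jpg"); side_img = cv2.imread("side.jpg")
# blended = blend_seam(front_img, side_img, seam_x=front_img.shape[1] // 2)
```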
Step S5.2.3: mapping process
The synthesized face expansion map is spread over the unwrapping grid of the model and stretched so that the eyes, nose, mouth and other features are aligned; the positions of the facial features are fixed and the map is stretched until they completely cover the grid; finally, the mapped face model is read into three-dimensional production software such as 3ds Max.
Fig. 12 shows the face model obtained after mapping the texture onto the original face model of Fig. 2.
Step S5.3: synthetic synchronization processing
The results of steps S1 to S4 are integrated to synthesize the final virtual face. The face expansion map synthesized from the real-person images is used for texture mapping to obtain a personalized real face model; the mouth-shape and expression changes derived from the text are stored in the same 3ds Max MAXScript file and drive the skeleton feature points of the model to generate the key frames and intermediate frames; the audio converted from the text adds the matching sound to the animation. Fig. 13 shows the synthesized virtual face.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (4)
1. The method for generating the virtual face is characterized by comprising the following steps:
S1: constructing an original face geometric model, wherein the original face geometric model comprises face feature points;
s2: determining lip skeleton feature points based on face feature points in an original face geometric model, constructing mouth-shaped changing key frames and middle frames for the lip skeleton feature points by establishing a mapping relation between phonemes and vision positions and a mapping relation between vision positions and mouth-shaped key frames, and constructing a face dynamic mouth-shaped model driven by Chinese pinyin phonemes, wherein the face dynamic mouth-shaped model comprises mouth-shaped changing key frames, mouth-shaped changing middle frame numbers, lip skeleton feature point coordinate values and insertion time, phonemes are defined as minimum units of syllable pronunciation actions according to natural attributes of voices, the vision positions refer to states of positions of upper and lower lips and upper and lower jaws when phonemes are pronounced, the mouth-shaped key frames are used for recording key contents of virtual characters in mouth animation pictures when phonemes are pronounced, and the mouth-shaped middle frames are used for representing complete change processes from mouth shape generation to mouth shape completion when phonemes are pronounced;
s3: according to the mouth-shaped changing key frames, the coordinate values of lip skeleton feature points of the mouth-shaped changing key frames and emotion key words contained in the input speaking text, facial expression changes with different degrees are designed, expression key frames are generated, a facial expression model of a face is constructed, and the expression key frames are used for recording facial expression changes contained in virtual facial animation;
S4: converting the input speaking text into voice audio, and processing the speaking speed and pause of the person to construct a voice model;
s5: inputting a face front image and a face side image, processing the original face geometric model obtained in the step S1 to obtain a real face geometric model, and integrating the real face geometric model with the face dynamic mouth model obtained in the step S2, the face facial expression model obtained in the step S3 and the voice model obtained in the step S4 through synthesis synchronous processing to generate a virtual face;
wherein, step S2 includes:
s2.1: combining face feature points contained in an original face geometric model according to a preset standard, and selecting skeleton feature points as sources for driving the change of the mouth model to obtain lip skeleton feature points;
s2.2: constructing an oral type keyframe for the lip bone feature points by establishing a mapping relation between phonemes and vision bits and a mapping relation between vision bits and oral type keyframes;
s2.3: introducing dynamic view bits to manufacture an intermediate frame of mouth shape variation;
s2.4: storing information of a dynamic mouth shape frame to construct a human face dynamic mouth shape model, wherein the information of the dynamic mouth shape frame comprises a mouth shape key frame and an intermediate frame, and each frame of information comprises a skeleton point number, a skeleton point three-dimensional coordinate and an insertion time;
Step S2.2 comprises:
s2.2.1: classifying factor states of basic mouth shapes of Chinese pronunciation according to preset rules by adopting three basic parameters, wherein the three basic parameters comprise longitudinal change values of lips, transverse change values of lips and opening and closing change values of upper and lower jaws and teeth, defining a view position according to the classified phoneme states, and constructing a mapping relation between phonemes and the view position;
s2.2.2: establishing a mapping relation between vision positions and mouth-shaped key frames according to the Chinese pinyin corresponding to the input speaking text and the lip bone characteristic points in the step S2.1, wherein each vision position corresponds to a set of lip bone point three-dimensional coordinates;
step S2.3 comprises:
s2.3.1: taking the static view bits as mouth shape key frames, adding a plurality of intermediate frames between the two static view bits to represent the continuous process of two mouth shape changes, and calculating mouth feature point coordinates of the intermediate frames according to feature point coordinates of the front and rear key frames and taking time between the two key frames as variable parameters;
s2.3.2: determining the number of intermediate frames of the two mouth-shaped key frames according to the time interval between the two mouth-shaped key frames;
s2.3.3: and adding the number of all the mouth shape key frames and the number of the middle frame to obtain the insertion time of the mouth shape key frame corresponding to the phoneme of the next Chinese character.
2. The method for generating a virtual face according to claim 1, wherein step S3 includes:
s3.1: establishing a basic emotion dictionary for identifying emotion keywords related to emotion in an input text;
s3.2: combining the emotion dictionary, and carrying out emotion word recognition and emotion calculation on the input speaking text;
s3.3: facial expression changes with different degrees are designed according to lip skeleton feature point coordinate values of the mouth-shaped variable key frames and the mouth-shaped variable intermediate frames and emotion keywords contained in the input speaking text, and expression key frames are generated;
s3.4: the blink action is controlled by modifying the three-dimensional coordinates of the skeletal feature points of the upper eyelid, setting key frames when opening and closing.
3. The method for generating a virtual face according to claim 2, wherein step S3.2 includes:
s3.2.1: initializing emotion parameters, wherein the emotion parameters comprise emotion weight values and influence degree values of emotion degree adverbs;
s3.2.2: word segmentation processing is carried out on the input text;
s3.2.3: detecting whether words in word segmentation results contain keywords related to emotion, and giving different emotion weights to the words according to the intensity of emotion word meanings;
S3.2.4: detecting whether a word in a word segmentation result contains a degree adverb or not, and giving influence degree values of different emotion degree adverbs to the word according to the degree adverb;
s3.2.5: detecting whether words in word segmentation results contain negative words or not, and determining the number of the negative words;
s3.2.6: and calculating the emotion value of the input text word according to the emotion weight value of the word segmentation result, the influence degree value of the emotion degree adverbs and the number of the contained negative words.
4. The method for generating a virtual face according to claim 2, wherein step S3.3 includes:
s3.3.1: superposing different basic action units to express rich facial expressions, defining action units on the expression, and determining expression action units, wherein the facial expressions comprise happiness, anger, surprise and sadness;
s3.3.2: according to the emotion calculation result and lip bone feature point coordinate values of the mouth-shaped keyframes and the intermediate frames, facial expressions with different degrees are designed;
s3.3.3: and calculating coordinates of feature points in the facial expression key frames with different degrees.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110719425.8A CN113781610B (en) | 2021-06-28 | 2021-06-28 | Virtual face generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113781610A CN113781610A (en) | 2021-12-10 |
CN113781610B true CN113781610B (en) | 2023-08-22 |
Family
ID=78835806
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110719425.8A Active CN113781610B (en) | 2021-06-28 | 2021-06-28 | Virtual face generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113781610B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114359450A (en) * | 2022-01-17 | 2022-04-15 | 小哆智能科技(北京)有限公司 | Method and device for simulating virtual character speaking |
CN114359443A (en) * | 2022-01-17 | 2022-04-15 | 小哆智能科技(北京)有限公司 | Method and device for simulating virtual character speaking |
CN114422697B (en) * | 2022-01-19 | 2023-07-18 | 浙江博采传媒有限公司 | Virtual shooting method, system and storage medium based on optical capturing |
CN114401431B (en) * | 2022-01-19 | 2024-04-09 | 中国平安人寿保险股份有限公司 | Virtual person explanation video generation method and related device |
CN116612218A (en) * | 2022-02-08 | 2023-08-18 | 北京字跳网络技术有限公司 | Expression animation generation method, device, equipment and storage medium |
CN114937104B (en) * | 2022-06-24 | 2024-08-13 | 北京有竹居网络技术有限公司 | Virtual object face information generation method and device and electronic equipment |
CN115529500A (en) * | 2022-09-20 | 2022-12-27 | 中国电信股份有限公司 | Method and device for generating dynamic image |
CN115690280B (en) * | 2022-12-28 | 2023-03-21 | 山东金东数字创意股份有限公司 | Three-dimensional image pronunciation mouth shape simulation method |
CN118540424A (en) * | 2023-02-14 | 2024-08-23 | 华为云计算技术有限公司 | Training method and device for digital person generation model and computing equipment cluster |
CN116095357B (en) * | 2023-04-07 | 2023-07-04 | 世优(北京)科技有限公司 | Live broadcasting method, device and system of virtual anchor |
CN116778040B (en) * | 2023-08-17 | 2024-04-09 | 北京百度网讯科技有限公司 | Face image generation method based on mouth shape, training method and device of model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106653052A (en) * | 2016-12-29 | 2017-05-10 | Tcl集团股份有限公司 | Virtual human face animation generation method and device |
CN111988658A (en) * | 2020-08-28 | 2020-11-24 | 网易(杭州)网络有限公司 | Video generation method and device |
CN112465935A (en) * | 2020-11-19 | 2021-03-09 | 科大讯飞股份有限公司 | Virtual image synthesis method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113781610B (en) | Virtual face generation method | |
CN106653052B (en) | Virtual human face animation generation method and device | |
CN104361620B (en) | A kind of mouth shape cartoon synthetic method based on aggregative weighted algorithm | |
Bailly et al. | Audiovisual speech synthesis | |
Busso et al. | Rigid head motion in expressive speech animation: Analysis and synthesis | |
KR102035596B1 (en) | System and method for automatically generating virtual character's facial animation based on artificial intelligence | |
Kuratate et al. | Kinematics-based synthesis of realistic talking faces | |
CN103258340B (en) | Is rich in the manner of articulation of the three-dimensional visualization Mandarin Chinese pronunciation dictionary of emotional expression ability | |
KR20120130627A (en) | Apparatus and method for generating animation using avatar | |
Albrecht et al. | Automatic generation of non-verbal facial expressions from speech | |
Ma et al. | Accurate automatic visible speech synthesis of arbitrary 3D models based on concatenation of diviseme motion capture data | |
Wang et al. | Assembling an expressive facial animation system | |
CN109101953A (en) | The facial expressions and acts generation method of subregion element based on human facial expressions | |
King | A facial model and animation techniques for animated speech | |
Tang et al. | Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar | |
Morishima et al. | Real-time facial action image synthesis system driven by speech and text | |
Beskow et al. | Data-driven synthesis of expressive visual speech using an MPEG-4 talking head. | |
Breen et al. | An investigation into the generation of mouth shapes for a talking head | |
Uz et al. | Realistic speech animation of synthetic faces | |
Wang et al. | A real-time text to audio-visual speech synthesis system. | |
GB2346527A (en) | Virtual actor with set of speaker profiles | |
GB2328849A (en) | System for animating virtual actors using linguistic representations of speech for visual realism. | |
Fanelli et al. | Acquisition of a 3d audio-visual corpus of affective speech | |
Kaneko et al. | Automatic synthesis of moving facial images with expression and mouth shape controlled by text | |
Dey | Visual Speech in Technology-Enhanced Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |