CN116597070A - Eye motion simulation method in three-dimensional image pronunciation process - Google Patents
- Publication number: CN116597070A (application number CN202310111493.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/006—Mixed reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/20—Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides an eye motion simulation method for the three-dimensional image pronunciation process, belonging to the technical field of three-dimensional image pronunciation. The method comprises the following steps: collecting two-dimensional eye elements from a planar video library of people speaking as a training data set, and establishing the correspondence between the two-dimensional eye elements; building a neural network model and training it with the training data set; performing three-dimensional recognition of the pronunciation process of a plurality of testers to obtain three-dimensional eye elements; further training and optimizing the neural network model with the three-dimensional eye elements to form a three-dimensional pronunciation neural network model; taking the character string to be pronounced by the three-dimensional image as text data and extracting phonemes from the text data to obtain a pronunciation phoneme set; computing on the pronunciation phoneme set with the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set; and establishing a three-dimensional virtual image comprising eyes, and controlling the eye posture of the three-dimensional virtual image with the obtained three-dimensional pronunciation element set.
Description
Technical Field
The invention belongs to the technical field of three-dimensional image pronunciation, and particularly relates to an eye motion simulation method in a three-dimensional image pronunciation process.
Background
Virtual character modeling and rendering techniques are widely used in the animation, gaming, and film industries. Enabling a virtual character to speak with natural, smooth mouth-shape motions synchronized with the sound is key to improving the user experience. In a real-time system, audio acquired in real time must be played back as a stream while the virtual character is rendered synchronously, and throughout this process the audio and the character's mouth shapes must remain in sync. The eyes are the windows of the soul: when people speak, different spoken texts are accompanied by different eye movements.
Chinese invention patent CN108538308B (application number CN201810018724.7) discloses a method and a device for simulating mouth shapes and/or expressions based on voice, comprising: collecting an audio signal; transforming the audio signal into corresponding spectral data; determining frequency distribution data from the spectral data; determining mouth-shape simulation parameters and/or expression simulation parameters from the frequency distribution data; and simulating the corresponding mouth shape according to the mouth-shape simulation parameters and/or the corresponding expression according to the expression simulation parameters.
That invention, like many current three-dimensional images, considers only the mouth shape and/or expression during pronunciation and does not simulate the motion of the eyes, which makes the three-dimensional image's pronunciation process appear unnatural.
Disclosure of Invention
In view of the above, the invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which solves the technical problem that the pronunciation process of a three-dimensional image appears unnatural because eye movements are not simulated.
The invention is realized in the following way:
the invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which comprises the following steps:
s100: collecting two-dimensional eye elements in a plane video library when a person speaks as a training data set, and establishing a corresponding relation between the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
s200: establishing a neural network model, and training by adopting a training data set, wherein phonemes in the training data set are used as input, and two-dimensional eye gestures are used as output;
s300: three-dimensional recognition is carried out on the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
s400: further training and optimizing the neural network model by utilizing three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
s500: taking a character string needing three-dimensional image pronunciation as text data, extracting phonemes from the text data, and obtaining a pronunciation phoneme set;
s600: calculating the obtained pronunciation phoneme set by using the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
s700: establishing a three-dimensional virtual image comprising eyes, controlling the eye posture of the three-dimensional virtual image by adopting the obtained three-dimensional pronunciation element set, and performing simulation; the eye posture is described by the coordinates of eye posture key points.
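Steps S500 to S600 above can be sketched in a few lines of Python. This is an illustration only, not the patented implementation: the grapheme-to-phoneme table and the lookup standing in for the trained three-dimensional pronunciation neural network are hypothetical placeholders.

```python
# Hypothetical grapheme-to-phoneme table; a real system would use a full
# G2P model or pronunciation dictionary (illustrative assumption).
G2P = {"h": ["HH"], "i": ["IY"]}

def text_to_phonemes(text):
    """S500: extract a pronunciation phoneme set from the text data."""
    phonemes = []
    for ch in text.lower():
        phonemes.extend(G2P.get(ch, []))
    return phonemes

def predict_eye_pose(phoneme):
    """S600 stand-in: the trained three-dimensional pronunciation neural
    network maps a phoneme to an eye pose; here a fixed lookup is used,
    with a pose encoded as (pupil_dx, pupil_dy, eyelid_openness)."""
    stub = {"HH": (0.0, 0.1, 0.8), "IY": (0.1, 0.0, 0.9)}
    return stub.get(phoneme, (0.0, 0.0, 1.0))

def text_to_eye_poses(text):
    """Chain S500 and S600: text -> phonemes -> eye poses."""
    return [predict_eye_pose(p) for p in text_to_phonemes(text)]
```

In S700 these poses would then drive the eye posture key points of the three-dimensional virtual image frame by frame.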
On the basis of the above technical scheme, the eye motion simulation method for the three-dimensional image pronunciation process can be further improved as follows:
the specific steps of collecting, in step S100, the two-dimensional eye elements of people speaking in the planar video library include:
step 1: selecting a video clip of a person speaking from the planar video library, wherein the video clip comprises video audio and a plurality of frames;
step 2: establishing a two-dimensional coordinate system for the image in each frame, and identifying the eye region to obtain a human eye image;
step 3: setting two-dimensional key points on the identified human eye image, and recording the two-dimensional coordinates of the eye posture key points in the plane;
step 4: collecting the phonemes in the corresponding video audio for each frame;
The two-dimensional key points at least comprise the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points, and eye reflection bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
Further, the specific step of establishing the correspondence between the two-dimensional eye elements in step S100 is as follows: establishing the correspondence between the phonemes and the two-dimensional eye postures according to the time axis of the video.
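The phoneme-to-pose correspondence along the video time axis can be sketched as follows. This minimal Python example assumes phoneme timings are given as non-overlapping (start, end, phoneme) intervals in seconds; that interval representation is an assumption made for illustration.

```python
def phoneme_at(time_s, phoneme_intervals):
    """Return the phoneme being uttered at time_s, or None for silence.
    phoneme_intervals: non-overlapping (start_s, end_s, phoneme) tuples."""
    for start, end, ph in phoneme_intervals:
        if start <= time_s < end:
            return ph
    return None

def align_frames_to_phonemes(n_frames, fps, phoneme_intervals):
    """Pair each video frame index with the phoneme sounding at that
    frame's timestamp on the video time axis."""
    return [(i, phoneme_at(i / fps, phoneme_intervals))
            for i in range(n_frames)]
```

Each frame's two-dimensional eye posture key points can then be stored against the phoneme paired with that frame, giving the training pairs used in S200.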
In the step of establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region, an AdaBoost classifier is adopted to identify the eye region in the image.
Further, the step of "setting a two-dimensional key point on the identified human eye image" includes:
acquiring a plurality of groups of human eye images;
for each pixel point in the human eye image, calculating a Y value component of the pixel point in a YUV space based on the RGB value of the pixel point, and taking the Y value component as a gray value of the pixel point;
calculating the average of the gray values of all pixel points in the gray image to determine a binarization threshold, and performing binarization processing on the gray image to obtain a binarized image;
identifying the optic disc region in the binarized image to obtain an identification result;
performing dilation and erosion on the binarized image, searching the processed binarized image for connected regions whose pixel values are not 0, determining the optic disc region based on the search result, and taking the optic disc region as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing the identification training set to obtain an eye key point identification model;
and calculating the identified human eye images by using the eye key point identification model to obtain the two-dimensional key points of each human eye image.
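The grayscale conversion, mean-threshold binarization, and connected-region search described above can be sketched in pure Python (dilation and erosion are omitted for brevity). Treating darker-than-mean pixels as foreground is an assumption made for this illustration.

```python
from collections import deque

def to_gray(rgb_image):
    """Per-pixel Y component of YUV (BT.601 weights) as the gray value."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

def binarize(gray):
    """Binarize at the mean gray value; dark (below-mean) pixels become
    foreground (1) -- an assumption suited to locating a dark region."""
    pixels = [v for row in gray for v in row]
    mean = sum(pixels) / len(pixels)
    return [[1 if v < mean else 0 for v in row] for row in gray]

def connected_regions(binary):
    """Find 4-connected regions of nonzero pixels via breadth-first search."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                queue, region = deque([(y, x)]), []
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    region.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    return regions
```

In practice the morphological dilation and erosion would be applied between `binarize` and `connected_regions` to close pinholes and smooth region boundaries.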
Wherein the step of determining the optic disc region based on the search result includes:
counting, for each connected region found, the number of pixel points in that region;
judging whether the number meets a preset quantity condition;
if so, determining the connected region to be the optic disc region.
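The quantity check above can be sketched as a simple filter; the pixel-count bounds are illustrative assumptions standing in for the "preset quantity condition".

```python
def select_disc_regions(regions, min_pixels=2, max_pixels=10000):
    """Keep connected regions (lists of pixel coordinates) whose pixel
    count satisfies the preset quantity condition; the default bounds
    here are illustrative assumptions, not values from the patent."""
    return [r for r in regions if min_pixels <= len(r) <= max_pixels]
```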
The specific step of three-dimensionally identifying the pronunciation process of the plurality of testers to obtain the three-dimensional eye element in the step S300 includes:
step 1: three cameras are arranged directly in front of the tester's face and directly to its left and right sides;
step 2: the tester performs dialogue pronunciation or reading, and the camera is used for collecting the dialogue pronunciation or reading process of the tester;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
The three-dimensional key points at least comprise the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points, and eye reflection bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
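With three cameras, one way to obtain three-dimensional key-point coordinates is to combine views. The sketch below assumes idealised orthographic cameras with aligned axes, where the frontal view supplies (x, y) and a side view supplies (z, y); this is a simplification for illustration, not the patent's calibration method.

```python
def lift_to_3d(front_xy, side_zy):
    """Combine the frontal view (x, y) and a side view (z, y) of one
    key point into a three-dimensional coordinate; y is averaged
    across the two views."""
    x, y_front = front_xy
    z, y_side = side_zy
    return (x, (y_front + y_side) / 2.0, z)

def keypoint_sequence(frames_front, frames_side):
    """Per-frame three-dimensional coordinates of one key point from
    synchronised frontal and side detections, forming the coordinate
    sequence described in step 4."""
    return [lift_to_3d(f, s) for f, s in zip(frames_front, frames_side)]
```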
Further, the step of collecting the dialogue pronunciation or the reading process of the tester by using the camera specifically includes:
extracting the two-dimensional key points of the eye images of the tester collected by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-manufactured three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount, and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
performing sparsification processing on the three-dimensional pronunciation parameters to obtain sparse pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is formed by reconstructing a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
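Estimating camera parameters from 2D/3D key-point pairs can be illustrated with a least-squares fit. The sketch below recovers only the scaling value and translation amount (the rotation angle is omitted for brevity), given the (x, y) components of the three-dimensional target key points and the detected two-dimensional key points.

```python
def fit_scale_translation(pts3d_xy, pts2d):
    """Least-squares scale s and translation t minimising
    sum(|s * p + t - q|^2) over key-point pairs (p, q). The full
    camera model in the text also estimates a rotation angle,
    which this simplified sketch leaves out."""
    n = len(pts2d)
    # centroids of the two point sets
    px = sum(p[0] for p in pts3d_xy) / n
    py = sum(p[1] for p in pts3d_xy) / n
    qx = sum(q[0] for q in pts2d) / n
    qy = sum(q[1] for q in pts2d) / n
    # closed-form scale: covariance over variance of the centred points
    num = sum((p[0] - px) * (q[0] - qx) + (p[1] - py) * (q[1] - qy)
              for p, q in zip(pts3d_xy, pts2d))
    den = sum((p[0] - px) ** 2 + (p[1] - py) ** 2 for p in pts3d_xy)
    s = num / den
    return s, (qx - s * px, qy - s * py)
```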
Further, the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain sparsified pronunciation parameters specifically includes:
taking the pronunciation subparameter corresponding to each eye part as a target subparameter, executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub-model identifiers corresponding to the pronunciation sub-models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as the preset parameters;
inputting the vertex deformation corresponding to each target subparameter and the three-dimensional pronunciation parameters into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting an optimization result;
and taking the optimized result as a sparse pronunciation parameter.
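The iterative optimisation that drives small pronunciation sub-parameters to zero can be illustrated with an ISTA-style (soft-thresholding) update. The learning rate, L1 weight, and fixed iteration count below are illustrative assumptions; the patent's optimisation model instead iterates until the loss reaches a preset loss value.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def sparsify(dense_params, lr=0.4, l1_weight=0.1, n_iter=200):
    """ISTA-style iteration: a gradient step on the squared error to the
    dense three-dimensional pronunciation parameters, followed by an L1
    shrinkage step that zeroes out small sub-parameters, yielding the
    sparsified pronunciation parameters."""
    sparse = [0.0] * len(dense_params)
    for _ in range(n_iter):
        sparse = [soft_threshold(s - lr * 2.0 * (s - d), lr * l1_weight)
                  for s, d in zip(sparse, dense_params)]
    return sparse
```

Large parameters survive with slight shrinkage, while parameters below the shrinkage threshold are driven exactly to zero, which is the sparsity the text asks for.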
The step in step S700 of controlling the eye posture of the three-dimensional virtual image by adopting the obtained three-dimensional pronunciation element set and performing simulation specifically includes:
step 1: acquiring the text that the three-dimensional image is to read with eye actions;
step 2: establishing a phoneme set according to the phonemes corresponding to the characters in the text;
step 3: searching a preset three-dimensional mouth-shape library for the corresponding three-dimensional image mouth shapes according to the phoneme set, and preloading them as basic mouth shapes;
step 4: replacing entries in the sequence of basic mouth shapes with the three-dimensional pronunciation element set according to the phonemes;
step 5: obtaining and playing the mouth-shape sequence with eye posture actions.
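Steps 3 and 4 above can be sketched as a lookup followed by a replacement pass; the mouth-shape library and the encoding of a pronunciation element are hypothetical placeholders.

```python
# Hypothetical preset three-dimensional mouth-shape library keyed by phoneme.
BASE_MOUTH_SHAPES = {"AA": "mouth_open", "M": "mouth_closed"}

def build_sequence(phonemes, pronunciation_elements):
    """Preload a base mouth shape per phoneme (step 3), then replace each
    entry with the full three-dimensional pronunciation element (mouth
    shape plus eye posture) wherever the model produced one (step 4)."""
    base = [(p, BASE_MOUTH_SHAPES.get(p)) for p in phonemes]
    return [(p, pronunciation_elements.get(p, shape)) for p, shape in base]
```

The resulting sequence is what step 5 plays back, frame-aligned with the audio.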
Compared with the prior art, the eye motion simulation method for the three-dimensional image pronunciation process provided by the invention has the following beneficial effects. The eye posture of a person during pronunciation is described with two-dimensional and three-dimensional key points that include at least the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points (at least three), and eye reflection bright area of the tester's eyes. A neural network is built and first trained with the two-dimensional eye elements; the model is then further trained and optimized with the three-dimensional eye elements to form the three-dimensional pronunciation neural network model. Because the model need not be trained directly on three-dimensional eye elements alone, the computational load of model operation is reduced and training is faster. Finally, the text to be pronounced is computed with the resulting three-dimensional pronunciation neural network model, and the obtained three-dimensional pronunciation element set controls and simulates the eye posture of the three-dimensional virtual image. Since eye actions are added to the simulation, the technical problem of an unnatural three-dimensional image pronunciation process in the prior art is effectively solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the eye motion simulation method for the three-dimensional image pronunciation process provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
As shown in FIG. 1, the invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which comprises the following steps:
s100: collecting two-dimensional eye elements in a plane video library when a person speaks as a training data set, and establishing a corresponding relation between the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
s200: establishing a neural network model, and training by adopting a training data set, wherein phonemes in the training data set are used as input, and two-dimensional eye gestures are used as output;
s300: three-dimensional recognition is carried out on the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
s400: further training and optimizing the neural network model by utilizing the three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
s500: taking the character string needing three-dimensional image pronunciation as text data, extracting phonemes from the text data, and obtaining a pronunciation phoneme set;
s600: calculating the obtained pronunciation phoneme set by using the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
s700: establishing a three-dimensional virtual image comprising eyes, controlling the eye posture of the three-dimensional virtual image by adopting the obtained three-dimensional pronunciation element set, and performing simulation; the eye posture is described by the coordinates of eye posture key points.
In the above technical solution, the specific steps of collecting, in step S100, the two-dimensional eye elements of people speaking in the planar video library include:
step 1: selecting a video clip of a person speaking from the planar video library, wherein the video clip comprises video audio and a plurality of frames;
step 2: establishing a two-dimensional coordinate system for the image in each frame, and identifying the eye region to obtain a human eye image;
step 3: setting two-dimensional key points on the identified human eye image, and recording the two-dimensional coordinates of the eye posture key points in the plane;
step 4: collecting the phonemes in the corresponding video audio for each frame;
The two-dimensional key points at least comprise the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points, and eye reflection bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
Further, in the above technical solution, the specific step of establishing the correspondence between the two-dimensional eye elements in step S100 is: establishing the correspondence between the phonemes and the two-dimensional eye postures according to the time axis of the video.
In the above technical solution, in the step of "establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region", an AdaBoost classifier is used to identify the eye region in the image.
Further, in the above technical solution, the step of "setting a two-dimensional key point on the identified human eye image" includes:
acquiring a plurality of groups of human eye images;
for each pixel point in the human eye image, calculating a Y value component of the pixel point in YUV space based on RGB value of the pixel point, and taking the Y value component as gray value of the pixel point;
calculating the average of the gray values of all pixel points in the gray image to determine a binarization threshold, and performing binarization processing on the gray image to obtain a binarized image;
identifying the optic disc region in the binarized image to obtain an identification result;
performing dilation and erosion on the binarized image, searching the processed binarized image for connected regions whose pixel values are not 0, determining the optic disc region based on the search result, and taking the optic disc region as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing an identification training set to obtain an eye key point identification model;
and calculating the identified human eye images by using the eye key point identification model to obtain the two-dimensional key points of each human eye image.
In the above technical solution, the step of determining the optic disc region based on the search result includes:
counting, for each connected region found, the number of pixel points in that region;
judging whether the number meets a preset quantity condition;
if so, determining the connected region to be the optic disc region.
In the above technical solution, the specific steps of performing three-dimensional recognition on the pronunciation process of the plurality of testers in step S300 to obtain the three-dimensional eye element include:
step 1: three cameras are arranged directly in front of the tester's face and directly to its left and right sides;
step 2: the tester performs dialogue pronunciation or reading, and the camera is used for collecting the dialogue pronunciation or reading process of the tester;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
The three-dimensional key points at least comprise the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points, and eye reflection bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
Further, in the above technical solution, the step of collecting the dialogue pronunciation or the reading process of the tester by using the camera specifically includes:
extracting the two-dimensional key points of the eye images of the tester collected by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-manufactured three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount, and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
carrying out sparsification processing on the three-dimensional pronunciation parameters to obtain sparsified pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character, so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is reconstructed from a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
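The camera-parameter step above (a rotation angle, a translation amount and a scaling value estimated from matched 2-D and 3-D key points) can be sketched as a weak-perspective similarity fit. The orthographic drop of the z coordinate and the 2-D Umeyama solver below are simplifying assumptions, since the method does not fix a particular solver:

```python
import numpy as np

def fit_weak_perspective(model_pts_3d, image_pts_2d):
    """Estimate rotation angle, translation and scale relating projected
    3-D model key points to detected 2-D image key points (2-D Umeyama)."""
    X = np.asarray(model_pts_3d, float)[:, :2]      # orthographic projection: drop z
    Y = np.asarray(image_pts_2d, float)
    n = len(X)
    mx, my = X.mean(0), Y.mean(0)
    Xc, Yc = X - mx, Y - my
    Sigma = Yc.T @ Xc / n                           # cross-covariance matrix
    U, D, Vt = np.linalg.svd(Sigma)
    S = np.eye(2)
    if np.linalg.det(U @ Vt) < 0:                   # force a proper rotation
        S[1, 1] = -1
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / ((Xc ** 2).sum() / n)
    t = my - scale * (R @ mx)
    angle = np.arctan2(R[1, 0], R[0, 0])            # in-plane rotation angle
    return angle, t, scale

# Synthetic check: rotate by 0.3 rad, scale by 2, translate by (5, -1).
model = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 2, 0.3]], float)
theta, s, t_true = 0.3, 2.0, np.array([5.0, -1.0])
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
image = s * (model[:, :2] @ R_true.T) + t_true
angle, t, scale = fit_weak_perspective(model, image)
```

Because the synthetic points are related by an exact similarity transform, the fit recovers the original parameters.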
Further, in the above technical solution, the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain the sparsified pronunciation parameters specifically comprises:
taking the pronunciation subparameter corresponding to each eye part as a target subparameter, executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub model identifiers corresponding to the pronunciation sub models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as preset parameters;
inputting the vertex deformation quantity and the three-dimensional pronunciation parameters corresponding to each target subparameter into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting an optimization result;
and taking the optimized result as a sparse pronunciation parameter.
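One standard way to realize the iterative optimization described above — driving a loss over the vertex deformations down to a preset value while keeping the pronunciation parameters sparse — is L1-regularized least squares solved by iterative soft-thresholding (ISTA). The blendshape basis `B`, deformation vector `d` and regularization weight `lam` below are illustrative assumptions, not the patent's specific optimization model:

```python
import numpy as np

def sparsify_parameters(B, d, lam=0.1, max_iter=500, tol=1e-8):
    """Find sparse sub-parameters w such that the basis B reproduces the
    vertex deformation d, via ISTA; iteration stops once the loss stops
    decreasing by more than tol (the 'preset loss value')."""
    B = np.asarray(B, float)
    d = np.asarray(d, float)
    w = np.zeros(B.shape[1])
    step = 1.0 / np.linalg.norm(B, 2) ** 2          # 1 / Lipschitz constant of the gradient
    prev = np.inf
    for _ in range(max_iter):
        grad = B.T @ (B @ w - d)                    # gradient of the data term
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
        loss = 0.5 * np.sum((B @ w - d) ** 2) + lam * np.abs(w).sum()
        if prev - loss < tol:
            break
        prev = loss
    return w

# Toy example: identity basis, one dominant component -> sparse result.
w = sparsify_parameters(np.eye(3), [1.0, 0.01, 0.0], lam=0.1)
```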
In the above technical solution, the step in S700 of controlling the eye posture of the three-dimensional avatar with the obtained three-dimensional pronunciation element set and performing the simulation specifically comprises:
the first step: acquiring the text that the three-dimensional image is to read aloud with eye actions;
the second step: establishing a phoneme set according to the phonemes corresponding to the characters in the text;
the third step: searching a preset three-dimensional mouth-shape library for the mouth shapes of the corresponding three-dimensional image according to the phoneme set and preloading them as basic mouth shapes;
the fourth step: replacing elements of the sequence formed by the basic mouth shapes with the three-dimensional pronunciation element set according to the phonemes;
the fifth step: obtaining and playing the mouth-shape sequence carrying the eye posture actions.
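The five steps above can be sketched as a lookup-and-replace over the phoneme set. The library contents, the "neutral" fallback mouth shape and the dictionary format are illustrative assumptions:

```python
def build_playback_sequence(text_phonemes, mouth_library, eye_elements):
    """Look up a base mouth shape for every phoneme of the text, then
    attach the matching eye-pose element (the replacement step); phonemes
    with no eye element keep eye_pose = None."""
    sequence = []
    for p in text_phonemes:
        frame = {"phoneme": p,
                 "mouth": mouth_library.get(p, mouth_library["neutral"]),
                 "eye_pose": eye_elements.get(p)}
        sequence.append(frame)
    return sequence

# Hypothetical library and eye elements produced by the neural network.
mouths = {"a": "mouth_open", "i": "mouth_wide", "neutral": "mouth_rest"}
eyes = {"a": {"upper_eyelid_y": 0.8}, "i": {"upper_eyelid_y": 0.6}}
seq = build_playback_sequence(["a", "i", "u"], mouths, eyes)
```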
The foregoing is merely a description of specific embodiments of the present invention, to which the invention is not limited; any variation or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (10)
1. A method for simulating eye movement in a three-dimensional image pronunciation process, characterized by comprising the following steps:
S100: collecting two-dimensional eye elements of persons speaking in a planar video library as a training data set, and establishing correspondences among the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
S200: establishing a neural network model and training it with the training data set, wherein the phonemes in the training data set serve as input and the two-dimensional eye gestures serve as output;
S300: three-dimensionally recognizing the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
S400: further training and optimizing the neural network model with the three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
S500: taking a character string to be pronounced by the three-dimensional image as text data, and extracting phonemes from the text data to obtain a pronunciation phoneme set;
S600: processing the obtained pronunciation phoneme set with the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
S700: establishing a three-dimensional virtual image comprising eyes, controlling the eye posture of the three-dimensional virtual image with the obtained three-dimensional pronunciation element set, and performing the simulation; the eye posture is described by the coordinates of eye posture key points.
2. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 1, wherein the specific step of collecting the two-dimensional eye elements of persons speaking in the planar video library in step S100 comprises the following steps:
the first step: selecting a video clip of a person speaking in the planar video library, wherein the video clip comprises audio and a plurality of frames;
the second step: establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region to obtain a human eye image;
the third step: setting two-dimensional key points on the identified human eye image and recording the two-dimensional coordinates of the eye posture key points in the plane;
the fourth step: collecting the phonemes in the corresponding video audio for each frame;
the two-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
3. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 2, wherein the specific step of establishing the correspondences among the two-dimensional eye elements in step S100 is: establishing the correspondence between the phonemes and the two-dimensional eye gestures according to the time axis of the video.
4. The method for simulating the eye movement in the three-dimensional image pronunciation process according to claim 1, wherein in the step of establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region, an AdaBoost classifier is used to identify the eye region in the image.
5. The method of claim 4, wherein the step of setting two-dimensional key points on the recognized human eye image comprises:
acquiring a plurality of groups of human eye images;
for each pixel in the human eye image, calculating the Y component of the pixel in YUV space from its RGB values, and taking the Y component as the gray value of the pixel;
calculating the mean of the gray values of all pixels in the gray image to determine a binarization threshold, and binarizing the gray image to obtain a binarized image;
identifying the optic disc area in the binarized image to obtain an identification result;
performing dilation and erosion on the binarized image, searching the processed binarized image for connected regions whose pixel values are not 0, determining the optic disc area based on the search result, and taking the optic disc area as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing the identification training set to obtain an eye key point identification model;
and processing the identified human eye images with the eye key point identification model to obtain the two-dimensional key points of each human eye image.
6. The method of claim 5, wherein the step of determining the optic disc area based on the search result comprises:
counting the number of pixels in each connected region found;
judging whether the number meets a preset number condition;
and if so, determining the connected region as the optic disc area.
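For illustration only, the pre-processing of claims 5 and 6 — taking the YUV Y component as the gray value, thresholding at the mean gray value, and applying a pixel-count test to a connected region — can be sketched as follows. A 4-connectivity flood fill stands in for the dilation/erosion and region search, a deliberate simplification:

```python
import numpy as np

def binarize_eye_image(rgb):
    """Gray value = YUV Y component of each pixel; binarization threshold
    = mean gray value; returns (binary image, gray image)."""
    rgb = np.asarray(rgb, float)
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return (gray > gray.mean()).astype(np.uint8), gray

def count_region(binary, seed):
    """Pixel count of the connected region containing `seed`
    (4-connectivity flood fill); 0 if the seed pixel is background."""
    h, w = binary.shape
    seen, stack, n = set(), [seed], 0
    while stack:
        r, c = stack.pop()
        if (r, c) in seen or not (0 <= r < h and 0 <= c < w) or binary[r, c] == 0:
            continue
        seen.add((r, c))
        n += 1
        stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return n

# Toy image: a 2x2 bright block on a dark background.
img = np.zeros((4, 4, 3))
img[:2, :2] = 255
binary, gray = binarize_eye_image(img)
```

The region passes a preset-number condition such as `count_region(binary, seed) >= 4` only when the bright block is large enough.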
7. The method for simulating the eye movements in the three-dimensional visual pronunciation process according to claim 1, wherein the specific step of three-dimensionally recognizing the pronunciation process of the plurality of testers in step S300 to obtain the three-dimensional eye elements comprises the following steps:
step 1: arranging three cameras directly in front of, directly to the left of, and directly to the right of the tester's face;
step 2: the tester speaks in dialogue or reads aloud, and the cameras are used to record the tester's dialogue or reading process;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
the three-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
8. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 7, wherein the step of recording the tester's dialogue or reading process with the cameras specifically comprises the following steps:
identifying two-dimensional key points in the eye images of the tester collected by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-manufactured three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
performing sparsification processing on the three-dimensional pronunciation parameters to obtain sparse pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is reconstructed from a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
9. The method for simulating the eye motion in the three-dimensional image pronunciation process according to claim 7, wherein the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain the sparsified pronunciation parameters comprises the following steps:
taking the pronunciation subparameter corresponding to each eye part as a target subparameter, executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub-model identifiers corresponding to the pronunciation sub-models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as the preset parameters;
inputting the vertex deformation corresponding to each target subparameter and the three-dimensional pronunciation parameters into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting an optimization result;
and taking the optimized result as a sparse pronunciation parameter.
10. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 1, wherein the step in S700 of controlling the eye posture of the three-dimensional avatar with the obtained three-dimensional pronunciation element set and performing the simulation specifically comprises:
the first step: acquiring the text that the three-dimensional image is to read aloud with eye actions;
the second step: establishing a phoneme set according to the phonemes corresponding to the characters in the text;
the third step: searching a preset three-dimensional mouth-shape library for the mouth shapes of the corresponding three-dimensional image according to the phoneme set and preloading them as basic mouth shapes;
the fourth step: replacing elements of the sequence formed by the basic mouth shapes with the three-dimensional pronunciation element set according to the phonemes;
the fifth step: obtaining and playing the mouth-shape sequence carrying the eye posture actions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310111493.5A CN116597070A (en) | 2023-02-13 | 2023-02-13 | Eye motion simulation method in three-dimensional image pronunciation process |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597070A true CN116597070A (en) | 2023-08-15 |
Family
ID=87603242
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310111493.5A Pending CN116597070A (en) | 2023-02-13 | 2023-02-13 | Eye motion simulation method in three-dimensional image pronunciation process |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597070A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||