CN116597070A - Eye motion simulation method in three-dimensional image pronunciation process - Google Patents
- Publication number: CN116597070A (application number CN202310111493.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/006—Mixed reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/20—Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides an eye motion simulation method for the three-dimensional image pronunciation process, belonging to the technical field of three-dimensional image pronunciation. The method comprises the following steps: collecting two-dimensional eye elements from a planar video library of people speaking as a training data set, and establishing the correspondence between the two-dimensional eye elements; building a neural network model and training it with the training data set; performing three-dimensional recognition of the pronunciation process of a plurality of testers to obtain three-dimensional eye elements; further training and optimizing the neural network model with the three-dimensional eye elements to form a three-dimensional pronunciation neural network model; taking the character string to be pronounced by the three-dimensional image as text data and extracting phonemes from the text data to obtain a pronunciation phoneme set; computing on the pronunciation phoneme set with the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set; and establishing a three-dimensional virtual image comprising eyes, and controlling the eye posture of the three-dimensional virtual image with the obtained three-dimensional pronunciation element set.
Description
Technical Field
The invention belongs to the technical field of three-dimensional image pronunciation, and particularly relates to an eye motion simulation method in a three-dimensional image pronunciation process.
Background
Virtual character modeling and rendering techniques are widely used in the animation, gaming, and film industries. Enabling a virtual character to speak with natural, smooth mouth-shape motions synchronized with the sound is key to improving the user experience. In a real-time system, audio acquired in real time must be played back as a stream while the virtual character is rendered synchronously, and throughout this process the audio and the character's mouth shapes must remain in sync. The eyes are the windows of the soul: when people speak, different spoken texts are accompanied by different eye movements.
Chinese invention patent CN108538308B (application number CN201810018724.7) discloses a method and a device for simulating mouth shapes and/or expressions based on voice, comprising: collecting an audio signal; transforming the audio signal into corresponding spectral data; determining frequency distribution data from the spectral data; determining mouth-shape simulation parameters and/or expression simulation parameters from the frequency distribution data; and simulating the corresponding mouth shape according to the mouth-shape simulation parameters and/or the corresponding expression according to the expression simulation parameters.
That invention, like many current three-dimensional images, considers only the mouth shape and/or expression during pronunciation and does not simulate the motion of the eyes, which makes the three-dimensional image's pronunciation process appear unnatural.
Disclosure of Invention
In view of the above, the invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which solves the technical problem that the pronunciation process of a three-dimensional image appears unnatural because eye movements are not simulated.
The invention is realized in the following way:
the invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which comprises the following steps:
s100: collecting two-dimensional eye elements in a plane video library when a person speaks as a training data set, and establishing a corresponding relation between the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
s200: establishing a neural network model, and training by adopting a training data set, wherein phonemes in the training data set are used as input, and two-dimensional eye gestures are used as output;
s300: three-dimensional recognition is carried out on the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
s400: further training and optimizing the neural network model by utilizing three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
s500: taking a character string needing three-dimensional image pronunciation as text data, extracting phonemes from the text data, and obtaining a pronunciation phoneme set;
s600: calculating the obtained pronunciation phoneme set by using the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
s700: establishing a three-dimensional virtual image comprising eyes, controlling the eye posture of the three-dimensional virtual image by adopting the obtained three-dimensional pronunciation element set, and performing simulation; the eye posture is described by the coordinates of eye posture key points.
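Steps S500 to S600 above can be sketched in a few lines of Python. This is an illustration only, not the patented implementation: the grapheme-to-phoneme table and the lookup standing in for the trained three-dimensional pronunciation neural network are hypothetical placeholders.

```python
# Hypothetical grapheme-to-phoneme table; a real system would use a full
# G2P model or pronunciation dictionary (illustrative assumption).
G2P = {"h": ["HH"], "i": ["IY"]}

def text_to_phonemes(text):
    """S500: extract a pronunciation phoneme set from the text data."""
    phonemes = []
    for ch in text.lower():
        phonemes.extend(G2P.get(ch, []))
    return phonemes

def predict_eye_pose(phoneme):
    """S600 stand-in: the trained three-dimensional pronunciation neural
    network maps a phoneme to an eye pose; here a fixed lookup is used,
    with a pose encoded as (pupil_dx, pupil_dy, eyelid_openness)."""
    stub = {"HH": (0.0, 0.1, 0.8), "IY": (0.1, 0.0, 0.9)}
    return stub.get(phoneme, (0.0, 0.0, 1.0))

def text_to_eye_poses(text):
    """Chain S500 and S600: text -> phonemes -> eye poses."""
    return [predict_eye_pose(p) for p in text_to_phonemes(text)]
```

In S700 these poses would then drive the eye posture key points of the three-dimensional virtual image frame by frame.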
On the basis of the above technical scheme, the eye motion simulation method for the three-dimensional image pronunciation process can be further improved as follows:
the specific steps of collecting, in step S100, the two-dimensional eye elements of people speaking in the planar video library include:
step 1: selecting a video clip of a person speaking from the planar video library, wherein the video clip comprises video audio and a plurality of frames;
step 2: establishing a two-dimensional coordinate system for the image in each frame, and identifying the eye region to obtain a human eye image;
step 3: setting two-dimensional key points on the identified human eye image, and recording the two-dimensional coordinates of the eye posture key points in the plane;
step 4: collecting the phonemes in the corresponding video audio for each frame;
The two-dimensional key points at least comprise the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points, and eye reflection bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
Further, the specific step of establishing the correspondence between the two-dimensional eye elements in step S100 is as follows: establishing the correspondence between the phonemes and the two-dimensional eye postures according to the time axis of the video.
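The phoneme-to-pose correspondence along the video time axis can be sketched as follows. This minimal Python example assumes phoneme timings are given as non-overlapping (start, end, phoneme) intervals in seconds; that interval representation is an assumption made for illustration.

```python
def phoneme_at(time_s, phoneme_intervals):
    """Return the phoneme being uttered at time_s, or None for silence.
    phoneme_intervals: non-overlapping (start_s, end_s, phoneme) tuples."""
    for start, end, ph in phoneme_intervals:
        if start <= time_s < end:
            return ph
    return None

def align_frames_to_phonemes(n_frames, fps, phoneme_intervals):
    """Pair each video frame index with the phoneme sounding at that
    frame's timestamp on the video time axis."""
    return [(i, phoneme_at(i / fps, phoneme_intervals))
            for i in range(n_frames)]
```

Each frame's two-dimensional eye posture key points can then be stored against the phoneme paired with that frame, giving the training pairs used in S200.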
In the step of establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region, an AdaBoost classifier is adopted to identify the eye region in the image.
Further, the step of "setting a two-dimensional key point on the identified human eye image" includes:
acquiring a plurality of groups of human eye images;
for each pixel point in the human eye image, calculating a Y value component of the pixel point in a YUV space based on the RGB value of the pixel point, and taking the Y value component as a gray value of the pixel point;
calculating the average of the gray values of all pixel points in the gray image to determine a binarization threshold, and performing binarization processing on the gray image to obtain a binarized image;
identifying the optic disc region in the binarized image to obtain an identification result;
performing dilation and erosion on the binarized image, searching the processed binarized image for connected regions whose pixel values are not 0, determining the optic disc region based on the search result, and taking the optic disc region as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing the identification training set to obtain an eye key point identification model;
and calculating the identified human eye images by using the eye key point identification model to obtain the two-dimensional key points of each human eye image.
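The grayscale conversion, mean-threshold binarization, and connected-region search described above can be sketched in pure Python (dilation and erosion are omitted for brevity). Treating darker-than-mean pixels as foreground is an assumption made for this illustration.

```python
from collections import deque

def to_gray(rgb_image):
    """Per-pixel Y component of YUV (BT.601 weights) as the gray value."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

def binarize(gray):
    """Binarize at the mean gray value; dark (below-mean) pixels become
    foreground (1) -- an assumption suited to locating a dark region."""
    pixels = [v for row in gray for v in row]
    mean = sum(pixels) / len(pixels)
    return [[1 if v < mean else 0 for v in row] for row in gray]

def connected_regions(binary):
    """Find 4-connected regions of nonzero pixels via breadth-first search."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                queue, region = deque([(y, x)]), []
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    region.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    return regions
```

In practice the morphological dilation and erosion would be applied between `binarize` and `connected_regions` to close pinholes and smooth region boundaries.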
Wherein the step of determining the optic disc region based on the search result includes:
counting, for each connected region found, the number of pixel points in that region;
judging whether the number meets a preset quantity condition;
if so, determining the connected region to be the optic disc region.
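The quantity check above can be sketched as a simple filter; the pixel-count bounds are illustrative assumptions standing in for the "preset quantity condition".

```python
def select_disc_regions(regions, min_pixels=2, max_pixels=10000):
    """Keep connected regions (lists of pixel coordinates) whose pixel
    count satisfies the preset quantity condition; the default bounds
    here are illustrative assumptions, not values from the patent."""
    return [r for r in regions if min_pixels <= len(r) <= max_pixels]
```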
The specific step of three-dimensionally identifying the pronunciation process of the plurality of testers to obtain the three-dimensional eye element in the step S300 includes:
step 1: three cameras are arranged directly in front of the tester's face and directly to its left and right sides;
step 2: the tester performs dialogue pronunciation or reading, and the camera is used for collecting the dialogue pronunciation or reading process of the tester;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
The three-dimensional key points at least comprise the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points, and eye reflection bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
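With three cameras, one way to obtain three-dimensional key-point coordinates is to combine views. The sketch below assumes idealised orthographic cameras with aligned axes, where the frontal view supplies (x, y) and a side view supplies (z, y); this is a simplification for illustration, not the patent's calibration method.

```python
def lift_to_3d(front_xy, side_zy):
    """Combine the frontal view (x, y) and a side view (z, y) of one
    key point into a three-dimensional coordinate; y is averaged
    across the two views."""
    x, y_front = front_xy
    z, y_side = side_zy
    return (x, (y_front + y_side) / 2.0, z)

def keypoint_sequence(frames_front, frames_side):
    """Per-frame three-dimensional coordinates of one key point from
    synchronised frontal and side detections, forming the coordinate
    sequence described in step 4."""
    return [lift_to_3d(f, s) for f, s in zip(frames_front, frames_side)]
```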
Further, the step of collecting the dialogue pronunciation or the reading process of the tester by using the camera specifically includes:
extracting the two-dimensional key points of the eye images of the tester collected by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-manufactured three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount, and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
performing sparsification processing on the three-dimensional pronunciation parameters to obtain sparse pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is formed by reconstructing a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
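Estimating camera parameters from 2D/3D key-point pairs can be illustrated with a least-squares fit. The sketch below recovers only the scaling value and translation amount (the rotation angle is omitted for brevity), given the (x, y) components of the three-dimensional target key points and the detected two-dimensional key points.

```python
def fit_scale_translation(pts3d_xy, pts2d):
    """Least-squares scale s and translation t minimising
    sum(|s * p + t - q|^2) over key-point pairs (p, q). The full
    camera model in the text also estimates a rotation angle,
    which this simplified sketch leaves out."""
    n = len(pts2d)
    # centroids of the two point sets
    px = sum(p[0] for p in pts3d_xy) / n
    py = sum(p[1] for p in pts3d_xy) / n
    qx = sum(q[0] for q in pts2d) / n
    qy = sum(q[1] for q in pts2d) / n
    # closed-form scale: covariance over variance of the centred points
    num = sum((p[0] - px) * (q[0] - qx) + (p[1] - py) * (q[1] - qy)
              for p, q in zip(pts3d_xy, pts2d))
    den = sum((p[0] - px) ** 2 + (p[1] - py) ** 2 for p in pts3d_xy)
    s = num / den
    return s, (qx - s * px, qy - s * py)
```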
Further, the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain sparsified pronunciation parameters specifically includes:
taking the pronunciation subparameter corresponding to each eye part as a target subparameter, executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub-model identifiers corresponding to the pronunciation sub-models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as the preset parameters;
inputting the vertex deformation corresponding to each target subparameter and the three-dimensional pronunciation parameters into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting an optimization result;
and taking the optimized result as a sparse pronunciation parameter.
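The iterative optimisation that drives small pronunciation sub-parameters to zero can be illustrated with an ISTA-style (soft-thresholding) update. The learning rate, L1 weight, and fixed iteration count below are illustrative assumptions; the patent's optimisation model instead iterates until the loss reaches a preset loss value.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def sparsify(dense_params, lr=0.4, l1_weight=0.1, n_iter=200):
    """ISTA-style iteration: a gradient step on the squared error to the
    dense three-dimensional pronunciation parameters, followed by an L1
    shrinkage step that zeroes out small sub-parameters, yielding the
    sparsified pronunciation parameters."""
    sparse = [0.0] * len(dense_params)
    for _ in range(n_iter):
        sparse = [soft_threshold(s - lr * 2.0 * (s - d), lr * l1_weight)
                  for s, d in zip(sparse, dense_params)]
    return sparse
```

Large parameters survive with slight shrinkage, while parameters below the shrinkage threshold are driven exactly to zero, which is the sparsity the text asks for.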
The step in step S700 of controlling the eye posture of the three-dimensional virtual image by adopting the obtained three-dimensional pronunciation element set and performing simulation specifically includes:
step 1: acquiring the text that the three-dimensional image is to read with eye actions;
step 2: establishing a phoneme set according to the phonemes corresponding to the characters in the text;
step 3: searching a preset three-dimensional mouth-shape library for the corresponding three-dimensional image mouth shapes according to the phoneme set, and preloading them as basic mouth shapes;
step 4: replacing entries in the sequence of basic mouth shapes with the three-dimensional pronunciation element set according to the phonemes;
step 5: obtaining and playing the mouth-shape sequence with eye posture actions.
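Steps 3 and 4 above can be sketched as a lookup followed by a replacement pass; the mouth-shape library and the encoding of a pronunciation element are hypothetical placeholders.

```python
# Hypothetical preset three-dimensional mouth-shape library keyed by phoneme.
BASE_MOUTH_SHAPES = {"AA": "mouth_open", "M": "mouth_closed"}

def build_sequence(phonemes, pronunciation_elements):
    """Preload a base mouth shape per phoneme (step 3), then replace each
    entry with the full three-dimensional pronunciation element (mouth
    shape plus eye posture) wherever the model produced one (step 4)."""
    base = [(p, BASE_MOUTH_SHAPES.get(p)) for p in phonemes]
    return [(p, pronunciation_elements.get(p, shape)) for p, shape in base]
```

The resulting sequence is what step 5 plays back, frame-aligned with the audio.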
Compared with the prior art, the eye motion simulation method for the three-dimensional image pronunciation process provided by the invention has the following beneficial effects. The eye posture of a person during pronunciation is described with two-dimensional and three-dimensional key points that include at least the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points (at least three), and eye reflection bright area of the tester's eyes. A neural network is built and first trained with the two-dimensional eye elements; the model is then further trained and optimized with the three-dimensional eye elements to form the three-dimensional pronunciation neural network model. Because the model need not be trained directly on three-dimensional eye elements alone, the computational load of model operation is reduced and training is faster. Finally, the text to be pronounced is computed with the resulting three-dimensional pronunciation neural network model, and the obtained three-dimensional pronunciation element set controls and simulates the eye posture of the three-dimensional virtual image. Since eye actions are added to the simulation, the technical problem of an unnatural three-dimensional image pronunciation process in the prior art is effectively solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the eye motion simulation method for the three-dimensional image pronunciation process provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, based on the embodiments of the invention, which are apparent to those of ordinary skill in the art without inventive faculty, are intended to be within the scope of the invention.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
As shown in FIG. 1, the invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which comprises the following steps:
s100: collecting two-dimensional eye elements in a plane video library when a person speaks as a training data set, and establishing a corresponding relation between the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
s200: establishing a neural network model, and training by adopting a training data set, wherein phonemes in the training data set are used as input, and two-dimensional eye gestures are used as output;
s300: three-dimensional recognition is carried out on the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
s400: further training and optimizing the neural network model by utilizing the three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
s500: taking the character string needing three-dimensional image pronunciation as text data, extracting phonemes from the text data, and obtaining a pronunciation phoneme set;
s600: calculating the obtained pronunciation phoneme set by using the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
s700: establishing a three-dimensional virtual image comprising eyes, controlling the eye posture of the three-dimensional virtual image by adopting the obtained three-dimensional pronunciation element set, and performing simulation; the eye posture is described by the coordinates of eye posture key points.
In the above technical solution, the specific steps of collecting, in step S100, the two-dimensional eye elements of people speaking in the planar video library include:
step 1: selecting a video clip of a person speaking from the planar video library, wherein the video clip comprises video audio and a plurality of frames;
step 2: establishing a two-dimensional coordinate system for the image in each frame, and identifying the eye region to obtain a human eye image;
step 3: setting two-dimensional key points on the identified human eye image, and recording the two-dimensional coordinates of the eye posture key points in the plane;
step 4: collecting the phonemes in the corresponding video audio for each frame;
The two-dimensional key points at least comprise the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points, and eye reflection bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
Further, in the above technical solution, the specific step of establishing the correspondence between the two-dimensional eye elements in step S100 is: establishing the correspondence between the phonemes and the two-dimensional eye postures according to the time axis of the video.
In the above technical solution, in the step of "establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region", an AdaBoost classifier is used to identify the eye region in the image.
Further, in the above technical solution, the step of "setting a two-dimensional key point on the identified human eye image" includes:
acquiring a plurality of groups of human eye images;
for each pixel point in the human eye image, calculating a Y value component of the pixel point in YUV space based on RGB value of the pixel point, and taking the Y value component as gray value of the pixel point;
calculating the average of the gray values of all pixel points in the gray image to determine a binarization threshold, and performing binarization processing on the gray image to obtain a binarized image;
identifying the optic disc region in the binarized image to obtain an identification result;
performing dilation and erosion on the binarized image, searching the processed binarized image for connected regions whose pixel values are not 0, determining the optic disc region based on the search result, and taking the optic disc region as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing an identification training set to obtain an eye key point identification model;
and calculating the identified human eye images by using the eye key point identification model to obtain the two-dimensional key points of each human eye image.
In the above technical solution, the step of determining the optic disc region based on the search result includes:
counting, for each connected region found, the number of pixel points in that region;
judging whether the number meets a preset quantity condition;
if so, determining the connected region to be the optic disc region.
In the above technical solution, the specific steps of performing three-dimensional recognition on the pronunciation process of the plurality of testers in step S300 to obtain the three-dimensional eye element include:
step 1: three cameras are arranged directly in front of the tester's face and directly to its left and right sides;
step 2: the tester performs dialogue pronunciation or reading, and the camera is used for collecting the dialogue pronunciation or reading process of the tester;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
The three-dimensional key points at least comprise the left eye corner, right eye corner, lower-eyelid center vertex, upper-eyelid center vertex, pupil center, pupil edge points, and eye reflection bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
Further, in the above technical solution, the step of collecting the dialogue pronunciation or the reading process of the tester by using the camera specifically includes:
extracting the two-dimensional key points of the eye images of the tester collected by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-manufactured three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount, and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
carrying out sparsification processing on the three-dimensional pronunciation parameters to obtain sparsified pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character, so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is reconstructed from a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
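The camera-parameter step above (a rotation angle, a translation amount and a scaling value estimated from matched 2-D and 3-D key points) can be sketched as a weak-perspective similarity fit. The orthographic drop of the z coordinate and the 2-D Umeyama solver below are simplifying assumptions, since the method does not fix a particular solver:

```python
import numpy as np

def fit_weak_perspective(model_pts_3d, image_pts_2d):
    """Estimate rotation angle, translation and scale relating projected
    3-D model key points to detected 2-D image key points (2-D Umeyama)."""
    X = np.asarray(model_pts_3d, float)[:, :2]      # orthographic projection: drop z
    Y = np.asarray(image_pts_2d, float)
    n = len(X)
    mx, my = X.mean(0), Y.mean(0)
    Xc, Yc = X - mx, Y - my
    Sigma = Yc.T @ Xc / n                           # cross-covariance matrix
    U, D, Vt = np.linalg.svd(Sigma)
    S = np.eye(2)
    if np.linalg.det(U @ Vt) < 0:                   # force a proper rotation
        S[1, 1] = -1
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / ((Xc ** 2).sum() / n)
    t = my - scale * (R @ mx)
    angle = np.arctan2(R[1, 0], R[0, 0])            # in-plane rotation angle
    return angle, t, scale

# Synthetic check: rotate by 0.3 rad, scale by 2, translate by (5, -1).
model = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 2, 0.3]], float)
theta, s, t_true = 0.3, 2.0, np.array([5.0, -1.0])
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
image = s * (model[:, :2] @ R_true.T) + t_true
angle, t, scale = fit_weak_perspective(model, image)
```

Because the synthetic points are related by an exact similarity transform, the fit recovers the original parameters.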
Further, in the above technical solution, the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain the sparsified pronunciation parameters specifically comprises:
taking the pronunciation subparameter corresponding to each eye part as a target subparameter, executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub model identifiers corresponding to the pronunciation sub models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as preset parameters;
inputting the vertex deformation quantity and the three-dimensional pronunciation parameters corresponding to each target subparameter into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting an optimization result;
and taking the optimized result as a sparse pronunciation parameter.
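One standard way to realize the iterative optimization described above — driving a loss over the vertex deformations down to a preset value while keeping the pronunciation parameters sparse — is L1-regularized least squares solved by iterative soft-thresholding (ISTA). The blendshape basis `B`, deformation vector `d` and regularization weight `lam` below are illustrative assumptions, not the patent's specific optimization model:

```python
import numpy as np

def sparsify_parameters(B, d, lam=0.1, max_iter=500, tol=1e-8):
    """Find sparse sub-parameters w such that the basis B reproduces the
    vertex deformation d, via ISTA; iteration stops once the loss stops
    decreasing by more than tol (the 'preset loss value')."""
    B = np.asarray(B, float)
    d = np.asarray(d, float)
    w = np.zeros(B.shape[1])
    step = 1.0 / np.linalg.norm(B, 2) ** 2          # 1 / Lipschitz constant of the gradient
    prev = np.inf
    for _ in range(max_iter):
        grad = B.T @ (B @ w - d)                    # gradient of the data term
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
        loss = 0.5 * np.sum((B @ w - d) ** 2) + lam * np.abs(w).sum()
        if prev - loss < tol:
            break
        prev = loss
    return w

# Toy example: identity basis, one dominant component -> sparse result.
w = sparsify_parameters(np.eye(3), [1.0, 0.01, 0.0], lam=0.1)
```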
In the above technical solution, the step in S700 of controlling the eye posture of the three-dimensional avatar with the obtained three-dimensional pronunciation element set and performing the simulation specifically comprises:
the first step: acquiring the text that the three-dimensional image is to read aloud with eye actions;
the second step: establishing a phoneme set according to the phonemes corresponding to the characters in the text;
the third step: searching a preset three-dimensional mouth-shape library for the mouth shapes of the corresponding three-dimensional image according to the phoneme set and preloading them as basic mouth shapes;
the fourth step: replacing elements of the sequence formed by the basic mouth shapes with the three-dimensional pronunciation element set according to the phonemes;
the fifth step: obtaining and playing the mouth-shape sequence carrying the eye posture actions.
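The five steps above can be sketched as a lookup-and-replace over the phoneme set. The library contents, the "neutral" fallback mouth shape and the dictionary format are illustrative assumptions:

```python
def build_playback_sequence(text_phonemes, mouth_library, eye_elements):
    """Look up a base mouth shape for every phoneme of the text, then
    attach the matching eye-pose element (the replacement step); phonemes
    with no eye element keep eye_pose = None."""
    sequence = []
    for p in text_phonemes:
        frame = {"phoneme": p,
                 "mouth": mouth_library.get(p, mouth_library["neutral"]),
                 "eye_pose": eye_elements.get(p)}
        sequence.append(frame)
    return sequence

# Hypothetical library and eye elements produced by the neural network.
mouths = {"a": "mouth_open", "i": "mouth_wide", "neutral": "mouth_rest"}
eyes = {"a": {"upper_eyelid_y": 0.8}, "i": {"upper_eyelid_y": 0.6}}
seq = build_playback_sequence(["a", "i", "u"], mouths, eyes)
```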
The foregoing is merely a description of specific embodiments of the present invention, to which the invention is not limited; any variation or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (10)
1. A method for simulating eye movement in a three-dimensional image pronunciation process, characterized by comprising the following steps:
S100: collecting two-dimensional eye elements of persons speaking in a planar video library as a training data set, and establishing correspondences among the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
S200: establishing a neural network model and training it with the training data set, wherein the phonemes in the training data set serve as input and the two-dimensional eye gestures serve as output;
S300: three-dimensionally recognizing the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
S400: further training and optimizing the neural network model with the three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
S500: taking a character string to be pronounced by the three-dimensional image as text data, and extracting phonemes from the text data to obtain a pronunciation phoneme set;
S600: processing the obtained pronunciation phoneme set with the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
S700: establishing a three-dimensional virtual image comprising eyes, controlling the eye posture of the three-dimensional virtual image with the obtained three-dimensional pronunciation element set, and performing the simulation; the eye posture is described by the coordinates of eye posture key points.
2. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 1, wherein the specific step of collecting the two-dimensional eye elements of persons speaking in the planar video library in step S100 comprises the following steps:
the first step: selecting a video clip of a person speaking in the planar video library, wherein the video clip comprises audio and a plurality of frames;
the second step: establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region to obtain a human eye image;
the third step: setting two-dimensional key points on the identified human eye image and recording the two-dimensional coordinates of the eye posture key points in the plane;
the fourth step: collecting the phonemes in the corresponding video audio for each frame;
the two-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
3. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 2, wherein the specific step of establishing the correspondences among the two-dimensional eye elements in step S100 is: establishing the correspondence between the phonemes and the two-dimensional eye gestures according to the time axis of the video.
4. The method for simulating the eye movement in the three-dimensional image pronunciation process according to claim 1, wherein in the step of establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region, an AdaBoost classifier is used to identify the eye region in the image.
5. The method of claim 4, wherein the step of setting two-dimensional key points on the recognized human eye image comprises:
acquiring a plurality of groups of human eye images;
for each pixel in the human eye image, calculating the Y component of the pixel in YUV space from its RGB values, and taking the Y component as the gray value of the pixel;
calculating the mean of the gray values of all pixels in the gray image to determine a binarization threshold, and binarizing the gray image to obtain a binarized image;
identifying the optic disc area in the binarized image to obtain an identification result;
performing dilation and erosion on the binarized image, searching the processed binarized image for connected regions whose pixel values are not 0, determining the optic disc area based on the search result, and taking the optic disc area as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing the identification training set to obtain an eye key point identification model;
and processing the identified human eye images with the eye key point identification model to obtain the two-dimensional key points of each human eye image.
6. The method of claim 5, wherein the step of determining the optic disc area based on the search result comprises:
counting the number of pixels in each connected region found;
judging whether the number meets a preset number condition;
and if so, determining the connected region as the optic disc area.
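For illustration only, the pre-processing of claims 5 and 6 — taking the YUV Y component as the gray value, thresholding at the mean gray value, and applying a pixel-count test to a connected region — can be sketched as follows. A 4-connectivity flood fill stands in for the dilation/erosion and region search, a deliberate simplification:

```python
import numpy as np

def binarize_eye_image(rgb):
    """Gray value = YUV Y component of each pixel; binarization threshold
    = mean gray value; returns (binary image, gray image)."""
    rgb = np.asarray(rgb, float)
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return (gray > gray.mean()).astype(np.uint8), gray

def count_region(binary, seed):
    """Pixel count of the connected region containing `seed`
    (4-connectivity flood fill); 0 if the seed pixel is background."""
    h, w = binary.shape
    seen, stack, n = set(), [seed], 0
    while stack:
        r, c = stack.pop()
        if (r, c) in seen or not (0 <= r < h and 0 <= c < w) or binary[r, c] == 0:
            continue
        seen.add((r, c))
        n += 1
        stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
    return n

# Toy image: a 2x2 bright block on a dark background.
img = np.zeros((4, 4, 3))
img[:2, :2] = 255
binary, gray = binarize_eye_image(img)
```

The region passes a preset-number condition such as `count_region(binary, seed) >= 4` only when the bright block is large enough.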
7. The method for simulating the eye movements in the three-dimensional visual pronunciation process according to claim 1, wherein the specific step of three-dimensionally recognizing the pronunciation process of the plurality of testers in step S300 to obtain the three-dimensional eye elements comprises the following steps:
step 1: arranging three cameras directly in front of, directly to the left of, and directly to the right of the tester's face;
step 2: the tester speaks in dialogue or reads aloud, and the cameras are used to record the tester's dialogue or reading process;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
the three-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eyes, wherein the pupil edge points comprise at least three points.
8. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 7, wherein the step of recording the tester's dialogue or reading process with the cameras specifically comprises the following steps:
identifying two-dimensional key points in the eye images of the tester collected by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-manufactured three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
performing sparsification processing on the three-dimensional pronunciation parameters to obtain sparse pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is reconstructed from a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
9. The method for simulating the eye motion in the three-dimensional image pronunciation process according to claim 7, wherein the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain the sparsified pronunciation parameters comprises the following steps:
taking the pronunciation subparameter corresponding to each eye part as a target subparameter, executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub-model identifiers corresponding to the pronunciation sub-models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as the preset parameters;
inputting the vertex deformation corresponding to each target subparameter and the three-dimensional pronunciation parameters into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting an optimization result;
and taking the optimized result as a sparse pronunciation parameter.
10. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 1, wherein the step in S700 of controlling the eye posture of the three-dimensional avatar with the obtained three-dimensional pronunciation element set and performing the simulation specifically comprises:
the first step: acquiring the text that the three-dimensional image is to read aloud with eye actions;
the second step: establishing a phoneme set according to the phonemes corresponding to the characters in the text;
the third step: searching a preset three-dimensional mouth-shape library for the mouth shapes of the corresponding three-dimensional image according to the phoneme set and preloading them as basic mouth shapes;
the fourth step: replacing elements of the sequence formed by the basic mouth shapes with the three-dimensional pronunciation element set according to the phonemes;
the fifth step: obtaining and playing the mouth-shape sequence carrying the eye posture actions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310111493.5A CN116597070A (en) | 2023-02-13 | 2023-02-13 | Eye motion simulation method in three-dimensional image pronunciation process |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597070A true CN116597070A (en) | 2023-08-15 |
Family
ID=87603242
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310111493.5A Pending CN116597070A (en) | 2023-02-13 | 2023-02-13 | Eye motion simulation method in three-dimensional image pronunciation process |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597070A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||