CN116597070A - Eye motion simulation method in three-dimensional image pronunciation process

Eye motion simulation method in three-dimensional image pronunciation process

Info

Publication number
CN116597070A
Authority
CN
China
Prior art keywords
dimensional, pronunciation, eye, image, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310111493.5A
Other languages
Chinese (zh)
Inventor
周安斌
晏武志
彭辰
李鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jindong Digital Creative Co ltd
Original Assignee
Shandong Jindong Digital Creative Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jindong Digital Creative Co ltd filed Critical Shandong Jindong Digital Creative Co ltd
Priority to CN202310111493.5A
Publication of CN116597070A
Legal status: Pending

Classifications

    • G  PHYSICS
        • G10  MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L  SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 21/00  Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L 21/06  Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
                        • G10L 21/10  Transforming into visible information
        • G06  COMPUTING; CALCULATING OR COUNTING
            • G06T  IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 13/00  Animation
                    • G06T 13/20  3D [Three Dimensional] animation
                        • G06T 13/205  3D animation driven by audio data
                        • G06T 13/40  3D animation of characters, e.g. humans, animals or virtual beings
                • G06T 17/00  Three dimensional [3D] modelling, e.g. data description of 3D objects
                • G06T 19/00  Manipulating 3D models or images for computer graphics
                    • G06T 19/006  Mixed reality
                    • G06T 19/20  Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • Y  GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02  TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02T  CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
                • Y02T 10/00  Road transport of goods or passengers
                    • Y02T 10/10  Internal combustion engine [ICE] based vehicles
                        • Y02T 10/40  Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Graphics (AREA)
  • Multimedia (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Geometry (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Architecture (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides an eye motion simulation method for the three-dimensional image pronunciation process, belonging to the technical field of three-dimensional image pronunciation. The method comprises the following steps: collecting two-dimensional eye elements of speaking persons from a planar video library as a training data set, and establishing the correspondence between the two-dimensional eye elements; building a neural network model and training it with the training data set; performing three-dimensional recognition of the pronunciation process of a plurality of testers to obtain three-dimensional eye elements; updating and optimizing the neural network model with the three-dimensional eye elements to form a three-dimensional pronunciation neural network model; taking the character string to be pronounced by the three-dimensional image as text data and extracting phonemes from it to obtain a pronunciation phoneme set; computing the pronunciation phoneme set with the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set; and establishing a three-dimensional virtual image comprising eyes, whose eye posture is controlled with the obtained three-dimensional pronunciation element set.

Description

Eye motion simulation method in three-dimensional image pronunciation process
Technical Field
The invention belongs to the technical field of three-dimensional image pronunciation, and particularly relates to an eye motion simulation method in a three-dimensional image pronunciation process.
Background
Virtual character modeling and rendering techniques are widely used in the animation, gaming and film industries. Enabling a virtual character to speak with natural, smooth mouth-shape motions that are synchronized with the sound is key to improving the user experience. In a real-time system, audio acquired in real time must be played back as a stream while the virtual character is rendered synchronously, and throughout this process the audio and the character's mouth shapes must remain synchronized. The eyes are the window of the mind: when a person speaks or reads aloud, different spoken texts correspond to different eye movements.
Chinese invention patent CN108538308B (application number CN201810018724.7) discloses a voice-based method and device for simulating mouth shapes and/or expressions, comprising: collecting an audio signal; transforming the audio signal into spectral data corresponding to the audio signal; determining frequency distribution data from the spectral data; determining mouth-shape simulation parameters and/or expression simulation parameters from the frequency distribution data; and simulating the corresponding mouth shape according to the mouth-shape simulation parameters and/or simulating the corresponding expression according to the expression simulation parameters.
That invention, like many current three-dimensional images, considers only the mouth shape and/or expression during pronunciation and does not simulate eye motion, so the pronunciation process of the three-dimensional image appears unnatural.
Disclosure of Invention
In view of the above, the invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which solves the technical problem that the pronunciation process of a three-dimensional image appears unnatural because eye motion is not taken into account.
The invention is realized in the following way:
The invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which comprises the following steps:
s100: collecting two-dimensional eye elements in a plane video library when a person speaks as a training data set, and establishing a corresponding relation between the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
s200: establishing a neural network model, and training by adopting a training data set, wherein phonemes in the training data set are used as input, and two-dimensional eye gestures are used as output;
s300: three-dimensional recognition is carried out on the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
s400: further training and optimizing the neural network model by utilizing three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
s500: taking a character string needing three-dimensional image pronunciation as text data, extracting phonemes from the text data, and obtaining a pronunciation phoneme set;
s600: calculating the obtained pronunciation phoneme set by using the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
s700: establishing a three-dimensional virtual image comprising eyes, and controlling the eye posture of the three-dimensional virtual image with the obtained three-dimensional pronunciation element set to perform the simulation; the eye posture is described by the coordinates of eye posture key points.
On the basis of the above technical scheme, the eye motion simulation method for the three-dimensional image pronunciation process provided by the invention can be further improved as follows:
the specific steps of collecting the two-dimensional eye elements of speaking persons from the planar video library in step S100 include:
the first step: selecting a character speaking video clip in a plane video library, wherein the video clip comprises video audio and a plurality of frames;
and a second step of: establishing a two-dimensional coordinate system for the image in each frame, and identifying the eye area to obtain a human eye image;
and a third step of: two-dimensional key points are set on the recognized human eye images and two-dimensional coordinates of the eye posture key points in the plane are recorded.
Fourth step: collecting phonemes in corresponding video audio for each frame;
the two-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eye, wherein there are at least three pupil edge points.
Further, the specific step of establishing the correspondence between the two-dimensional eye elements in step S100 is: establishing the correspondence between the phonemes and the two-dimensional eye gestures according to the time axis of the video.
In the step of establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region, an AdaBoost classifier is adopted to identify the eye region in the image.
Further, the step of "setting a two-dimensional key point on the identified human eye image" includes:
acquiring a plurality of groups of human eye images;
for each pixel point in the human eye image, calculating a Y value component of the pixel point in a YUV space based on the RGB value of the pixel point, and taking the Y value component as a gray value of the pixel point;
calculating the average value of the gray values of all pixel points in the gray image, determining a binarization threshold value, and performing binarization processing on the gray image to obtain a binarized image;
identifying the optic disc area in the binarized image to obtain an identification result;
performing dilation and erosion processing on the binarized image, searching for connected regions whose pixel values are not 0 in the processed binarized image, determining the optic disc area based on the search result, and taking the optic disc area as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing the identification training set to obtain an eye key point identification model;
and computing the identified human eye images with the eye key point identification model to obtain the two-dimensional key points of each human eye image.
Wherein the step of determining the optic disc area based on the search result includes:
for each connected region found by the search, counting the number of pixel points in the region;
judging whether the number meets a preset number condition;
and if so, determining the connected region as the optic disc area.
The specific steps of performing three-dimensional recognition of the pronunciation process of a plurality of testers in step S300 to obtain the three-dimensional eye elements include:
step 1: arranging three cameras directly in front of, directly to the left of and directly to the right of the tester's face;
step 2: the tester performs dialogue pronunciation or reading aloud, and the cameras are used to capture the tester's dialogue pronunciation or reading process;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
the three-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eye, wherein there are at least three pupil edge points.
Further, the step of capturing the tester's dialogue pronunciation or reading process with the cameras specifically includes:
identifying two-dimensional key points in the eye images of the tester captured by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-built three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
performing sparsification processing on the three-dimensional pronunciation parameters to obtain sparse pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is formed by reconstructing a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
Further, the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain the sparsified pronunciation parameters specifically includes:
for the pronunciation sub-parameter corresponding to each eye part, taking it as the target sub-parameter and executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub-model identifiers corresponding to the pronunciation sub-models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as the preset parameters;
inputting the vertex deformation corresponding to each target sub-parameter and the three-dimensional pronunciation parameters into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting the optimization result;
and taking the optimization result as the sparsified pronunciation parameters.
The step of controlling the eye posture of the three-dimensional avatar by using the obtained three-dimensional pronunciation element set in step S700 and performing the simulation specifically includes:
the first step: acquiring the text that the three-dimensional image is required to read aloud with eye actions;
and a second step of: establishing a phoneme set according to phonemes corresponding to characters in the text;
and a third step of: searching for and preloading the corresponding three-dimensional mouth shapes in a preset three-dimensional mouth-shape library according to the phoneme set, to serve as the basic mouth shapes;
fourth step: replacing the sequence composed of the basic mouth shapes by utilizing the three-dimensional pronunciation element set according to the phonemes;
fifth step: and obtaining and playing the mouth shape sequence with the eye gesture actions.
Compared with the prior art, the eye motion simulation method for the three-dimensional image pronunciation process provided by the invention has the following beneficial effects. The eye posture of a person during pronunciation is described by two-dimensional and three-dimensional key points, which at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eye, with at least three pupil edge points. A neural network is built and first trained with the two-dimensional eye elements; the three-dimensional eye elements are then used to further train and optimize the neural network model, forming the three-dimensional pronunciation neural network model. Because the model does not have to be trained directly and solely on three-dimensional eye elements, the amount of computation is reduced and the training speed is increased. Finally, the text to be pronounced is processed by the resulting three-dimensional pronunciation neural network model to obtain the three-dimensional pronunciation element set, which is used to control and simulate the eye posture of the three-dimensional virtual image. Because eye motion is added to the simulation process, the technical problem that the three-dimensional image pronunciation process appears unnatural in the prior art is effectively solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that other drawings may be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the eye motion simulation method for the three-dimensional image pronunciation process provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, and not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present invention, it should be understood that terms such as "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise" and "counterclockwise" indicate orientations or positional relationships based on those shown in the drawings. They are used merely for convenience and simplicity of description and do not indicate or imply that the apparatus or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore they should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
As shown in FIG. 1, the present invention provides an eye motion simulation method for the three-dimensional image pronunciation process, which comprises the following steps:
s100: collecting two-dimensional eye elements in a plane video library when a person speaks as a training data set, and establishing a corresponding relation between the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
s200: establishing a neural network model, and training by adopting a training data set, wherein phonemes in the training data set are used as input, and two-dimensional eye gestures are used as output;
s300: three-dimensional recognition is carried out on the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
s400: further training and optimizing the neural network model by utilizing the three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
s500: taking the character string needing three-dimensional image pronunciation as text data, extracting phonemes from the text data, and obtaining a pronunciation phoneme set;
s600: calculating the obtained pronunciation phoneme set by using the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
s700: establishing a three-dimensional virtual image comprising eyes, and controlling the eye posture of the three-dimensional virtual image with the obtained three-dimensional pronunciation element set to perform the simulation; the eye posture is described by the coordinates of eye posture key points.
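As a purely illustrative, non-limiting sketch of steps S200/S400, the mapping from phonemes to eye-posture key-point coordinates can be realized as a small feed-forward network. The snippet below assumes a PyTorch environment; the phoneme inventory size, the key-point count and the tensors `phoneme_ids` / `keypoints` are hypothetical placeholders rather than data defined by the invention.

```python
import torch
import torch.nn as nn

NUM_PHONEMES = 64   # assumed size of the phoneme inventory
NUM_KEYPOINTS = 9   # e.g. two eye corners, two eyelid vertices, pupil center, three pupil edge points, glint
COORD_DIM = 2       # 2 for the 2-D stage (S200); 3 when refining with three-dimensional eye elements (S400)

class EyePoseNet(nn.Module):
    """Maps a phoneme id to eye-posture key-point coordinates."""
    def __init__(self, coord_dim: int = COORD_DIM):
        super().__init__()
        self.embed = nn.Embedding(NUM_PHONEMES, 32)
        self.mlp = nn.Sequential(
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, NUM_KEYPOINTS * coord_dim),
        )

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.embed(phoneme_ids))

def train(model, phoneme_ids, keypoints, epochs=100, lr=1e-3):
    """phoneme_ids: (N,) long tensor; keypoints: (N, NUM_KEYPOINTS * coord_dim) target coordinates."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(phoneme_ids), keypoints)
        loss.backward()
        opt.step()
    return model

# S200: train with 2-D eye poses; S400: repeat with COORD_DIM = 3 using the three-dimensional
# eye elements; S600: run the trained model on the pronunciation phoneme set of the input text.
```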
In the above technical solution, the specific steps of collecting the two-dimensional eye elements of speaking persons from the planar video library in step S100 include:
the first step: selecting a character speaking video clip in a plane video library, wherein the video clip comprises video audio and a plurality of frames;
and a second step of: establishing a two-dimensional coordinate system for the image in each frame, and identifying the eye area to obtain a human eye image;
and a third step of: two-dimensional key points are set on the recognized human eye images and two-dimensional coordinates of the eye posture key points in the plane are recorded.
Fourth step: collecting phonemes in corresponding video audio for each frame;
the two-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eye, wherein there are at least three pupil edge points.
Further, in the above technical solution, the specific step of establishing the correspondence between the two-dimensional eye elements in step S100 is: establishing the correspondence between the phonemes and the two-dimensional eye gestures according to the time axis of the video.
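As a minimal illustration (not part of the claimed method), the correspondence along the time axis can be built by assigning to each video frame the phoneme whose time interval contains the frame's timestamp; the structures `phoneme_intervals` and `frame_times` below are assumed placeholders for data produced elsewhere.

```python
from bisect import bisect_right

def align_phonemes_to_frames(phoneme_intervals, frame_times):
    """phoneme_intervals: list of (start_s, end_s, phoneme) sorted by start time.
    frame_times: frame timestamps in seconds.
    Returns one phoneme label per frame (None where no phoneme is being uttered)."""
    starts = [start for start, _, _ in phoneme_intervals]
    labels = []
    for t in frame_times:
        i = bisect_right(starts, t) - 1            # last interval starting at or before t
        if i >= 0 and t < phoneme_intervals[i][1]:
            labels.append(phoneme_intervals[i][2])
        else:
            labels.append(None)                    # silence or a gap between phonemes
    return labels

# Example: 25 fps video, phoneme "a" spoken from 0.00-0.12 s and "n" from 0.12-0.20 s
frames = [k / 25.0 for k in range(6)]
print(align_phonemes_to_frames([(0.0, 0.12, "a"), (0.12, 0.20, "n")], frames))
```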
In the above technical solution, in the step of "establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region", an AdaBoost classifier is used to identify the eye region in the image.
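OpenCV's Haar-cascade eye detector is one readily available AdaBoost-based classifier that could play this role; the sketch below is only an illustration of that choice, and the cascade file is the one normally shipped with the opencv-python package.

```python
import cv2

def detect_eye_regions(frame_bgr):
    """Return bounding boxes (x, y, w, h) of candidate eye regions in one video frame."""
    cascade_path = cv2.data.haarcascades + "haarcascade_eye.xml"
    eye_cascade = cv2.CascadeClassifier(cascade_path)
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # detectMultiScale slides the boosted cascade over an image pyramid
    return eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Usage: crop each returned box from the frame to obtain the human eye images
# that the subsequent key-point steps operate on.
```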
Further, in the above technical solution, the step of "setting a two-dimensional key point on the identified human eye image" includes:
acquiring a plurality of groups of human eye images;
for each pixel point in the human eye image, calculating a Y value component of the pixel point in YUV space based on RGB value of the pixel point, and taking the Y value component as gray value of the pixel point;
calculating the average value of the gray values of all pixel points in the gray image, determining a binarization threshold value, and performing binarization processing on the gray image to obtain a binarized image;
identifying the optic disc area in the binarized image to obtain an identification result;
performing dilation and erosion processing on the binarized image, searching for connected regions whose pixel values are not 0 in the processed binarized image, determining the optic disc area based on the search result, and taking the optic disc area as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing an identification training set to obtain an eye key point identification model;
and computing the identified human eye images with the eye key point identification model to obtain the two-dimensional key points of each human eye image.
In the above technical solution, the step of determining the optic disc area based on the search result includes:
for each connected region found by the search, counting the number of pixel points in the region;
judging whether the number meets a preset number condition;
and if so, determining the connected region as the optic disc area.
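A minimal sketch of the gray-scale conversion, binarization, dilation/erosion and connected-region filtering described above is given below; it assumes OpenCV and NumPy, and the area limits `min_area` / `max_area` standing in for the preset number condition are illustrative values, not values specified by the invention.

```python
import cv2
import numpy as np

def find_disc_region(eye_bgr, min_area=50, max_area=5000):
    """Binarize an eye image and return the mask of the connected region whose
    pixel count satisfies the preset number condition (here min_area..max_area)."""
    # Y component of YUV used as the gray value (Y = 0.299 R + 0.587 G + 0.114 B)
    gray = cv2.cvtColor(eye_bgr, cv2.COLOR_BGR2GRAY)
    # binarization threshold derived from the mean gray value of the image
    _, binary = cv2.threshold(gray, float(gray.mean()), 255, cv2.THRESH_BINARY)
    # dilation followed by erosion to clean up the binarized image
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.erode(cv2.dilate(binary, kernel, iterations=1), kernel, iterations=1)
    # connected regions whose pixel values are not 0
    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    for label in range(1, num_labels):               # label 0 is the background
        if min_area <= stats[label, cv2.CC_STAT_AREA] <= max_area:
            return (labels == label).astype(np.uint8) * 255
    return None  # no region met the number condition; invert the threshold polarity if needed
```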
In the above technical solution, the specific steps of performing three-dimensional recognition of the pronunciation process of a plurality of testers in step S300 to obtain the three-dimensional eye elements include:
step 1: arranging three cameras directly in front of, directly to the left of and directly to the right of the tester's face;
step 2: the tester performs dialogue pronunciation or reading aloud, and the cameras are used to capture the tester's dialogue pronunciation or reading process;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
the three-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eye, wherein there are at least three pupil edge points.
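With calibrated cameras, the three-dimensional coordinates of each eye key point can be recovered from any two of the three views by triangulation. The sketch below assumes OpenCV and that the 3x4 projection matrices `P_front` and `P_left` plus matched 2-D key points are already available; these names are illustrative and do not come from the invention itself.

```python
import cv2
import numpy as np

def triangulate_keypoints(P_front, P_left, pts_front, pts_left):
    """P_front, P_left: 3x4 camera projection matrices.
    pts_front, pts_left: (N, 2) arrays of matched 2-D key points.
    Returns an (N, 3) array of 3-D key point coordinates."""
    pts_f = np.asarray(pts_front, dtype=np.float64).T   # shape (2, N)
    pts_l = np.asarray(pts_left, dtype=np.float64).T
    points_4d = cv2.triangulatePoints(P_front, P_left, pts_f, pts_l)
    return (points_4d[:3] / points_4d[3]).T             # homogeneous -> Euclidean

# Repeating this per frame yields the three-dimensional key point coordinate sequence
# that step 4 pairs with the pronunciation phoneme sequence by timestamp.
```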
Further, in the above technical solution, the step of capturing the tester's dialogue pronunciation or reading process with the cameras specifically includes:
identifying two-dimensional key points in the eye images of the tester captured by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-built three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
carrying out sparsification processing on the three-dimensional pronunciation parameters to obtain sparsified pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character, so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is formed by reconstructing a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
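The rotation and translation relating the pre-built three-dimensional eye model to the observed two-dimensional key points can be estimated from the 2-D/3-D correspondences with a PnP solver; the sketch below assumes OpenCV, takes a known camera intrinsic matrix `K`, and leaves the scaling value out for simplicity, so it should be read as an interpretation rather than the patent's prescribed procedure.

```python
import cv2
import numpy as np

def estimate_camera_parameters(model_pts_3d, image_pts_2d, K):
    """model_pts_3d: (N, 3) three-dimensional target key points of the eye model.
    image_pts_2d: (N, 2) matching two-dimensional eye key points.
    K: 3x3 camera intrinsic matrix.
    Returns (Euler rotation angles in degrees, translation vector)."""
    obj = np.asarray(model_pts_3d, dtype=np.float64)
    img = np.asarray(image_pts_2d, dtype=np.float64)
    ok, rvec, tvec = cv2.solvePnP(obj, img, K, None, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP estimation failed")
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> rotation matrix
    angles = cv2.RQDecomp3x3(R)[0]      # rotation angle parameter as Euler angles
    return angles, tvec.ravel()
```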
Further, in the above technical solution, the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain the sparsified pronunciation parameters specifically includes:
for the pronunciation sub-parameter corresponding to each eye part, taking it as the target sub-parameter and executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub model identifiers corresponding to the pronunciation sub models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as preset parameters;
inputting the vertex deformation quantity corresponding to each target sub-parameter and the three-dimensional pronunciation parameters into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting the optimization result;
and taking the optimization result as the sparsified pronunciation parameters.
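One way to read this sparsification step is as a per-part decomposition of the dense parameters followed by an iterative fit until a loss target is met. The NumPy sketch below works under that reading and replaces the patent's unspecified optimization model with plain (sub)gradient descent on an L2 reconstruction term plus an L1 sparsity term, so it is an assumption-laden illustration, not the claimed procedure.

```python
import numpy as np

def sparsify_parameters(dense_deform, basis, lam=0.05, lr=0.01, steps=500, target_loss=1e-4):
    """dense_deform: (V,) stacked vertex deformation produced by the dense pronunciation parameters.
    basis: (V, P) matrix whose p-th column is the vertex deformation obtained with only the
    p-th pronunciation sub-model active (all other sub-parameters set to the preset value).
    Returns a sparse parameter vector of length P."""
    p = np.zeros(basis.shape[1])
    for _ in range(steps):
        residual = basis @ p - dense_deform
        loss = 0.5 * residual @ residual + lam * np.abs(p).sum()
        if loss <= target_loss:                       # preset loss value reached
            break
        grad = basis.T @ residual + lam * np.sign(p)  # subgradient of the L1 term
        p -= lr * grad
    return p
```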
In the above technical solution, the step of controlling the eye posture of the three-dimensional avatar by using the obtained three-dimensional pronunciation element set in step S700 and performing the simulation specifically includes:
the first step: acquiring the text that the three-dimensional image is required to read aloud with eye actions;
and a second step of: establishing a phoneme set according to phonemes corresponding to characters in the text;
and a third step of: searching for and preloading the corresponding three-dimensional mouth shapes in a preset three-dimensional mouth-shape library according to the phoneme set, to serve as the basic mouth shapes;
fourth step: replacing the sequence composed of the basic mouth shapes by utilizing the three-dimensional pronunciation element set according to the phonemes;
fifth step: and obtaining and playing the mouth shape sequence with the eye gesture actions.
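To illustrate how the final sequence might be assembled, the sketch below looks up a base mouth shape per phoneme from a preset library and attaches the eye posture predicted for the same phoneme; the dictionaries `mouth_shape_library` and `eye_pose_set` are assumed stand-ins for the preset three-dimensional mouth-shape library and the three-dimensional pronunciation element set.

```python
def build_animation_sequence(phonemes, mouth_shape_library, eye_pose_set, neutral_mouth=None):
    """phonemes: ordered phoneme symbols extracted from the text (S500).
    mouth_shape_library: phoneme -> preloaded base mouth shape.
    eye_pose_set: phoneme -> eye-posture key-point coordinates (from S600).
    Returns a list of per-phoneme frames combining mouth shape and eye posture."""
    sequence = []
    for ph in phonemes:
        sequence.append({
            "phoneme": ph,
            "mouth": mouth_shape_library.get(ph, neutral_mouth),  # basic mouth shape
            "eye_pose": eye_pose_set.get(ph),                     # eye posture for the same phoneme
        })
    return sequence

# The resulting sequence is played back frame by frame on the three-dimensional virtual
# image so that mouth shapes and eye postures stay aligned with the phonemes.
```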
The foregoing is merely illustrative of the present invention and does not limit it; any variation or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for simulating the eye movement in the three-dimensional image pronunciation process, characterized by comprising the following steps:
s100: collecting two-dimensional eye elements in a plane video library when a person speaks as a training data set, and establishing a corresponding relation between the two-dimensional eye elements, wherein the two-dimensional eye elements comprise phonemes and two-dimensional eye gestures;
s200: establishing a neural network model, and training by adopting a training data set, wherein phonemes in the training data set are used as input, and two-dimensional eye gestures are used as output;
s300: three-dimensional recognition is carried out on the pronunciation process of a plurality of testers to obtain three-dimensional eye elements, wherein the three-dimensional eye elements comprise phonemes and three-dimensional eye gestures;
s400: further training and optimizing the neural network model by utilizing three-dimensional eye elements to form a three-dimensional pronunciation neural network model;
s500: taking a character string needing three-dimensional image pronunciation as text data, extracting phonemes from the text data, and obtaining a pronunciation phoneme set;
s600: calculating the obtained pronunciation phoneme set by using the three-dimensional pronunciation neural network model to obtain a three-dimensional pronunciation element set;
s700: establishing a three-dimensional virtual image comprising eyes, and controlling the eye posture of the three-dimensional virtual image with the obtained three-dimensional pronunciation element set to perform the simulation; the eye posture is described by the coordinates of eye posture key points.
2. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 1, wherein the specific step of collecting the two-dimensional eye elements when the person in the planar video library speaks in step S100 comprises the following steps:
the first step: selecting a character speaking video clip in a plane video library, wherein the video clip comprises video audio and a plurality of frames;
and a second step of: establishing a two-dimensional coordinate system for the image in each frame, and identifying the eye area to obtain a human eye image;
and a third step of: two-dimensional key points are set on the recognized human eye images and two-dimensional coordinates of the eye posture key points in the plane are recorded.
Fourth step: collecting phonemes in corresponding video audio for each frame;
the two-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eye, wherein there are at least three pupil edge points.
3. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 2, wherein the specific steps of establishing the correspondence between the two-dimensional eye elements in the step S100 are as follows: and establishing a corresponding relation of the phonemes and the two-dimensional eye gestures according to a time axis of video.
4. The method for simulating the eye movement in the three-dimensional image pronunciation process according to claim 1, wherein in the step of establishing a two-dimensional coordinate system for the image in each frame and identifying the eye region, an AdaBoost classifier is used to identify the eye region in the image.
5. The method of claim 4, wherein the step of setting two-dimensional key points on the recognized human eye image comprises:
acquiring a plurality of groups of human eye images;
for each pixel point in the human eye image, calculating a Y value component of the pixel point in a YUV space based on the RGB value of the pixel point, and taking the Y value component as a gray value of the pixel point;
calculating the average value of the gray values of all pixel points in the gray image, determining a binarization threshold value, and performing binarization processing on the gray image to obtain a binarized image;
identifying the optic disc area in the binarized image to obtain an identification result;
performing dilation and erosion processing on the binarized image, searching for connected regions whose pixel values are not 0 in the processed binarized image, determining the optic disc area based on the search result, and taking the optic disc area as a sample image in an identification training set;
marking two-dimensional key points in the sample image;
establishing an identification training neural network model, and training by utilizing the identification training set to obtain an eye key point identification model;
and computing the identified human eye images with the eye key point identification model to obtain the two-dimensional key points of each human eye image.
6. The method of claim 1, wherein the step of determining the optic disc area based on the search result comprises:
for each connected region found by the search, counting the number of pixel points in the region;
judging whether the number meets a preset number condition;
and if so, determining the connected region as the optic disc area.
7. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 1, wherein the specific steps of performing three-dimensional recognition of the pronunciation process of a plurality of testers in step S300 to obtain the three-dimensional eye elements comprise:
step 1: arranging three cameras directly in front of, directly to the left of and directly to the right of the tester's face;
step 2: the tester performs dialogue pronunciation or reading aloud, and the cameras are used to capture the tester's dialogue pronunciation or reading process;
step 3: establishing a three-dimensional coordinate system, and carrying out three-dimensional coordinate marking on each three-dimensional key point to form three-dimensional coordinates of the eye gesture key points;
step 4: recording coordinates of each three-dimensional key point at different moments to form a three-dimensional key point coordinate sequence; identifying pronunciation phonemes of a tester at different moments to form a pronunciation phoneme sequence; establishing a relation between the three-dimensional key point coordinate sequence and the pronunciation phoneme sequence according to the moment;
the three-dimensional key points at least comprise the left eye corner, the right eye corner, the center vertex of the lower eyelid, the center vertex of the upper eyelid, the pupil center, pupil edge points and the reflective bright area of the tester's eye, wherein there are at least three pupil edge points.
8. The method for simulating the eye movements in the three-dimensional image pronunciation process according to claim 7, wherein the step of capturing the tester's dialogue pronunciation or reading process with the cameras specifically comprises:
identifying two-dimensional key points in the eye images of the tester captured by the cameras;
determining three-dimensional target key points from three-dimensional eye key points in a pre-built three-dimensional eye model according to the two-dimensional key points;
calculating camera parameters according to the two-dimensional eye key points and the three-dimensional target key points, wherein the camera parameters comprise: a rotation angle parameter, a translation amount and a scaling value;
according to the camera parameters and the three-dimensional target key points, performing pronunciation transformation processing on the three-dimensional eye model to obtain three-dimensional pronunciation parameters corresponding to the two-dimensional eye key points;
performing sparsification processing on the three-dimensional pronunciation parameters to obtain sparse pronunciation parameters;
migrating the sparse pronunciation parameters and the rotation angle parameters to an animation model of the virtual character so that the pronunciation of the virtual character is consistent with the pronunciation of the eye image;
the three-dimensional eye model is formed by reconstructing a pronunciation fusion model through a parameterized eye model (3DMM); the three-dimensional pronunciation parameters comprise pronunciation sub-parameters corresponding to different eye parts.
9. The method for simulating the eye motion in the three-dimensional image pronunciation process according to claim 7, wherein the step of performing sparsification processing on the three-dimensional pronunciation parameters to obtain the sparsified pronunciation parameters specifically comprises:
for the pronunciation sub-parameter corresponding to each eye part, taking it as the target sub-parameter and executing the following operations:
inquiring a target pronunciation sub-model corresponding to the target sub-parameter in a pre-stored pronunciation model mapping table; wherein, the pronunciation model mapping table stores the sub-model identifiers corresponding to the pronunciation sub-models of different eye parts included in the pronunciation fusion model;
assigning the target sub-parameters to the target pronunciation sub-model;
setting pronunciation sub-parameters corresponding to other pronunciation sub-models except the target pronunciation sub-model in the pronunciation fusion model as preset parameters;
fusing the target pronunciation sub-model and other pronunciation sub-models to obtain a target three-dimensional eye model;
calculating vertex deformation corresponding to the target sub-parameters based on the target three-dimensional eye model and a preset three-dimensional eye model; the preset three-dimensional eye model is obtained by fusing a plurality of pronunciation sub-models with pronunciation sub-parameters as the preset parameters;
inputting the vertex deformation corresponding to each target sub-parameter and the three-dimensional pronunciation parameters into an optimization model for iterative calculation until the loss value of the optimization model reaches a preset loss value, and outputting the optimization result;
and taking the optimization result as the sparsified pronunciation parameters.
10. The method for simulating the eye movements in the three-dimensional avatar pronunciation process according to claim 1, wherein the step of controlling the eye posture of the three-dimensional avatar by using the obtained three-dimensional pronunciation element set in step S700 and performing the simulation specifically comprises:
the first step: acquiring the text that the three-dimensional image is required to read aloud with eye actions;
and a second step of: establishing a phoneme set according to phonemes corresponding to characters in the text;
and a third step of: searching for and preloading the corresponding three-dimensional mouth shapes in a preset three-dimensional mouth-shape library according to the phoneme set, to serve as the basic mouth shapes;
fourth step: replacing the sequence composed of the basic mouth shapes by utilizing the three-dimensional pronunciation element set according to the phonemes;
fifth step: and obtaining and playing the mouth shape sequence with the eye gesture actions.
CN202310111493.5A, filed 2023-02-13 (priority date 2023-02-13): Eye motion simulation method in three-dimensional image pronunciation process. Status: Pending. Publication: CN116597070A (en).

Priority Applications (1)

Application Number: CN202310111493.5A (published as CN116597070A); Priority Date: 2023-02-13; Filing Date: 2023-02-13; Title: Eye motion simulation method in three-dimensional image pronunciation process

Applications Claiming Priority (1)

Application Number: CN202310111493.5A (published as CN116597070A); Priority Date: 2023-02-13; Filing Date: 2023-02-13; Title: Eye motion simulation method in three-dimensional image pronunciation process

Publications (1)

Publication Number: CN116597070A; Publication Date: 2023-08-15

Family

ID=87603242

Family Applications (1)

Application Number: CN202310111493.5A (CN116597070A, Pending); Priority Date: 2023-02-13; Filing Date: 2023-02-13; Title: Eye motion simulation method in three-dimensional image pronunciation process

Country Status (1)

Country: CN; Link: CN116597070A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination