CN115471886A - Digital person generation method and system - Google Patents

Digital person generation method and system

Info

Publication number
CN115471886A
Authority
CN
China
Prior art keywords
image, face, features, preset, human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211030862.XA
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Bairui Network Technology Co ltd
Original Assignee
Guangzhou Bairui Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Bairui Network Technology Co ltd
Priority to CN202211030862.XA
Publication of CN115471886A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/34 Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G06V10/36 Applying a local operator, i.e. means to operate on image points situated in the vicinity of a given point; Non-linear local filtering operations, e.g. median filtering
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Nonlinear Science (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a digital person generation method and system. Based on deep auto-encoding, the method calls encoders to extract face-image features and voice features respectively, fuses the features in a hidden (latent) space to establish an association between the voice and the facial features, and inputs the fused features into a pre-trained GAN network to generate a dynamically changing digital human image sequence, so that the voice-face association drives the digital human's movements. The digital human provided by the invention can be driven by different voices, allows the person's appearance to be replaced easily, avoids deformities, and effectively improves the digital human's appearance and display effect. Furthermore, each image is repaired and optimized during processing, which further improves the fidelity and imaging quality of the digital human and meets the application requirements of virtual hosts and digital stand-ins in different fields.

Description

Digital person generation method and system
Technical Field
The invention relates to the technical field of digital human editing, and in particular to a digital person generation method and system.
Background
With the gradual maturation of artificial intelligence technology, more and more fields have begun to use digital humans in audio and video to complete tasks or actions that humans cannot accomplish in reality, making up for human limitations, for example: shooting dangerous stunts in the film industry, keeping after-sales staff unfailingly even-tempered in the retail industry, zero-error broadcasting in the media industry, and intelligent, efficient, real-time humanized services in the financial industry. This reduces the cost and risk of using human labor, reduces the risk of human error, and improves the audience's experience.
Two types of digital human technology are in common use. The first uses several different sensors to assist human-computer interaction and drive the digital human to move or speak. The other reconstructs a 2D face with a 3DMM face reconstruction model and generates the digital human by simulation.
However, both commonly used methods have technical problems. The multi-sensor simulation approach requires expensive sensors, the digital human can only move with manual assistance, and it cannot work at all without an assistant's motions. The 3D modeling approach requires a great deal of modeling time, so editing is slow and inefficient; moreover, once a large posture change occurs, or the viewing angle changes, or the illumination is weak, the posture parameters of the 3D model become inaccurate, causing deformities in the digital human's posture, facial expressions and facial-feature movements, and even mismatches between motion and sound, which greatly affect the user's experience.
Disclosure of Invention
The invention provides a digital person generation method and system. The method can edit a talking face image using only a digital human image and voice, and converts the edited face image into a panoramic digital human image with limbs and a background via a background fusion technique and a posture-change algorithm. This eliminates the dependence on assistants, improves editing efficiency, reduces the probability of deformities, and improves the fidelity and imaging quality of the digital human.
A first aspect of an embodiment of the present invention provides a method for generating a digital person, where the method includes:
after acquiring voice data and an original image containing a background and a face, fusing the voice data and the original image to obtain a digital human image containing dynamic facial changes;
replacing the facial-feature images in the digital human image with a preset face sample image to obtain a replacement image;
repairing the replaced facial features in the replacement image, and fusing the repaired replacement image with a preset background image to obtain a fused image;
and extracting character features from the fused image, and training a preset GAN network with the character features to obtain a digital human image containing posture changes.
In a possible implementation manner of the first aspect, the replacing the facial-feature images in the digital human image with a preset face sample image to obtain a replacement image includes:
performing edge detection on the facial-feature regions of the digital human image through an edge detection algorithm to obtain facial-feature region information, wherein the facial-feature regions include: the eye region, mouth region, nose region, ear region, eyebrow region, and face contour region;
extracting a corresponding face sample image from a preset sample space based on the facial-feature region information, wherein the preset sample space consists of image samples and video samples preset by the user;
and replacing the facial-feature image corresponding to the facial-feature region information with the face sample image to obtain a replacement image.
In a possible implementation manner of the first aspect, the repairing the replaced facial features in the replacement image includes:
performing smoothing filtering on the replacement image to obtain a filtered image;
acquiring boundary information of the facial-feature regions in the filtered image, wherein the boundary information is the image traces between the face sample image and the digital human image;
and fusing the boundary information into the filtered image.
In a possible implementation manner of the first aspect, the image fusion of the repaired replacement image and the preset background image to obtain a fused image includes:
determining face region information of the preset background image;
and stitching the repaired replacement image into the preset background image according to the face region information to obtain a fused image.
In a possible implementation manner of the first aspect, the extracting the character features of the fused image and training a preset GAN network with the character features to obtain a digital human image containing posture changes includes:
extracting character features from the fused image and acquiring preset non-character features, wherein the preset non-character features are the posture-change features of each frame of the user's preset video sample;
performing feature fusion on the character features and the preset non-character features to obtain fusion features;
inputting the fusion features into the preset GAN network to obtain a digital human image sequence containing posture changes;
and constructing a digital human image from the digital human image sequence containing posture changes.
In a possible implementation manner of the first aspect, after the step of training the preset GAN network with the character features to obtain a digital human image containing posture changes, the method further includes:
synthesizing the digital human image and the voice data into audio/video data;
and sending the audio/video data to a preset user terminal for the user to view.
In a possible implementation manner of the first aspect, the fusing the voice data and the original image to obtain a digital human image containing dynamic facial changes includes:
determining the face region of the original image, extracting face key points from the face region, and performing face alignment based on the face key points to obtain a frontal face image;
calling a preset face encoder to extract face features from the frontal face image, and calling a preset voice encoder to extract voice features from the voice data;
and performing feature fusion on the face features and the voice features to obtain fusion features, and inputting the fusion features into a preset decoder for mixing to obtain a digital human image containing dynamic facial changes.
A second aspect of an embodiment of the present invention provides a digital person generation system, including:
the fusion module is used for, after acquiring voice data and an original image containing a background and a face, fusing the voice data and the original image to obtain a digital human image containing dynamic facial changes;
the replacement module is used for replacing the facial-feature images in the digital human image with a preset face sample image to obtain a replacement image;
the repair module is used for repairing the replaced facial features in the replacement image and fusing the repaired replacement image with a preset background image to obtain a fused image;
and the editing module is used for extracting the character features of the fused image and training a preset GAN network with the character features to obtain a digital human image containing posture changes.
Compared with the prior art, the digital person generation method and system provided by the embodiments of the invention have the following beneficial effects: a talking face image can be edited using only a digital human image and voice, and the edited face image is converted into a panoramic digital human image with limbs and a background via a background fusion technique and a posture-change algorithm, which not only eliminates the dependence on assistants but also improves editing efficiency, reduces the probability of deformities, and improves the fidelity and imaging quality of the digital human.
Drawings
Fig. 1 is a schematic flowchart of a digital person generation method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the operation of generating a digital human image with dynamically changing faces according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating the operation of generating a fused image according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating the operation of generating a digital human image of a pose change according to an embodiment of the present invention;
fig. 5 is a flowchart illustrating an operation of generating audio/video data according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a digital person generating system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the gradual maturation of artificial intelligence technology, more and more fields have begun to use digital humans in audio and video to complete tasks or actions that humans cannot accomplish in reality, making up for human limitations: shooting dangerous stunts in the film industry, keeping after-sales staff even-tempered in the retail industry, zero-error broadcasting in the media industry, intelligent, efficient, real-time humanized services in the financial industry, and so on. This reduces the cost and risk of human labor, reduces human error, and improves the user experience.
Two types of digital human technology are in common use. The first uses several different sensors to assist human-computer interaction and drive the digital human to move or speak. The other reconstructs a 2D face with a 3DMM face reconstruction model and generates the digital human by simulation.
However, both commonly used methods have technical problems: the multi-sensor simulation approach requires expensive sensors, the digital human can only move with manual assistance, and it cannot work without an assistant's motions; and with the 3D modeling approach, once a large posture change occurs, the viewing angle changes, or the illumination is weak, the posture parameters of the 3D model become inaccurate, so the digital human's posture, facial expressions and facial-feature movements become deformed, and motion may even fail to match the sound, greatly affecting the user's experience.
In order to solve the above problem, a digital person generation method provided by the embodiments of the present application will be described and explained in detail by the following specific embodiments.
Referring to fig. 1, a flow chart of a digital person generation method according to an embodiment of the present invention is shown.
By way of example, the digital human generation method may include:
s11, after voice data and an original image containing a background and a face are obtained, the voice data and the original image are fused to obtain a digital human image containing dynamic changes of the face.
In an embodiment, the voice data may be voice data currently being entered by the user. The original image of the face may be an image of the user.
In one embodiment, features of the speech data may be fused into the original image, so that the original image may contain dynamic changes of the human face, resulting in a digital human image.
Referring to fig. 2, there is shown a flowchart illustrating an operation of generating a digital human image with dynamically changing human faces according to an embodiment of the present invention.
Wherein, as an example, step S11 may comprise the following sub-steps:
and the substep S111 is to determine a face region of the original image, extract face key points of the face region, and perform face alignment based on the face key points to obtain a face front image.
In one embodiment, the face of the original image that is acquired may be distorted, which in turn causes the face of the subsequently generated image to be distorted.
In order to avoid the above situation, the face image may be subjected to face detection, and a face region may be extracted, and then the face region may be subjected to key point detection, and face alignment may be performed according to the face key points to obtain a front face image, so as to obtain an original image.
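For illustration only (not part of the patent disclosure), the following is a minimal sketch of key-point-based face alignment, assuming OpenCV and two detected eye landmarks; the canonical eye positions and the helper name align_face are assumptions:

```python
# Illustrative sketch: align a face so the eyes sit at canonical positions.
# The landmark source and the reference eye positions are assumptions.
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, out_size=256):
    """Estimate a similarity transform (rotation + scale + translation)
    from the two eye key points and warp the face to a frontal layout."""
    src = np.float32([left_eye, right_eye])
    dst = np.float32([[0.35 * out_size, 0.40 * out_size],
                      [0.65 * out_size, 0.40 * out_size]])
    M, _ = cv2.estimateAffinePartial2D(src, dst)
    return cv2.warpAffine(image, M, (out_size, out_size))
```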
Sub-step S112, calling a preset face encoder to extract face features from the frontal face image, and calling a preset voice encoder to extract voice features from the voice data.
Sub-step S113, performing feature fusion on the face features and the voice features to obtain fusion features, and inputting the fusion features into a preset decoder for mixing to obtain a digital human image containing dynamic facial changes.
In one embodiment, the face features can be extracted from the frontal face image by the face encoder, and the voice features can be extracted from the voice data by the voice encoder; the association between the voice and the face is then established through a hidden (latent) space; finally, the joint features in the hidden space are used by the decoder to generate the talking face.
Specifically, using the fusion features as input, the decoder may produce a series of digital human images whose mouth shapes change in correspondence with the input speech.
During fusion, the two features may be concatenated along a particular dimension. For example: if the first feature has size [1, 256] and the second feature has size [1, 256], concatenating them along the feature dimension (dim = 1) yields a fused feature of size [1, 512]; the fused feature is finally input into the decoder to generate a digital human image containing dynamic facial changes.
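To make the concatenation concrete, here is a minimal sketch assuming PyTorch; the encoders themselves are not shown and the tensor values are placeholders:

```python
# Illustrative sketch of latent-space feature fusion by concatenation.
import torch

face_feat = torch.randn(1, 256)   # placeholder face feature from the face encoder
voice_feat = torch.randn(1, 256)  # placeholder voice feature from the voice encoder

# [1, 256] and [1, 256] concatenated along the feature dimension -> [1, 512]
fused = torch.cat([face_feat, voice_feat], dim=1)
print(fused.shape)  # torch.Size([1, 512])
```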
S12, replacing the facial-feature images in the digital human image with a preset face sample image to obtain a replacement image.
To make the digital human's appearance closer to the user, the facial features of the digital human image can be replaced to generate a corresponding replacement image.
In an alternative embodiment, step S12 may comprise the following sub-steps:
Sub-step S121, performing edge detection on the facial-feature regions of the digital human image through an edge detection algorithm to obtain facial-feature region information, wherein the facial-feature regions include: the eye region, mouth region, nose region, ear region, eyebrow region, and face contour region.
Sub-step S122, extracting a corresponding face sample image from a preset sample space based on the facial-feature region information, wherein the preset sample space is composed of image samples and video samples preset by the user.
Sub-step S123, replacing the facial-feature image corresponding to the facial-feature region information with the face sample image to obtain a replacement image.
Specifically, the sample space is composed of the image samples and video samples used by the post-processing and posture-change algorithms, including face images with backgrounds, eye samples, posture-change videos and the like; in particular, it can be configured by the user.
In actual operation, edge detection is first performed on the eye region through an edge detection algorithm to obtain more accurate eye position information; the eye information of the same region in the corresponding sample-space image is extracted, and the eye region in the generated image is replaced with that sample eye region to perform eye synthesis. After eye synthesis, face key-point detection is performed on the sample-space image through a face key-point detection algorithm to obtain the face contour and accurately locate the face region; the corresponding face region in the generated image is extracted, and the face region at the corresponding position in the sample space is replaced with it to perform face synthesis. Edge detection is then performed on the nose region to obtain more accurate nose position information, the nose information of the same region in the corresponding sample-space image is extracted, and the corresponding nose region in the generated image is replaced with it to perform nose synthesis, and so on for the remaining features. Finally, the digital human image is modified into the replacement image.
It should be noted that the face synthesis algorithm first performs face key-point detection on the sample-space image through a face key-point detection algorithm to obtain the position of the face contour and accurately locate the face region, then extracts the corresponding face region from the generated image, and finally replaces the face region in the sample space with the face region from the generated image.
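A minimal sketch of the edge-detection-based region replacement described above, assuming OpenCV; the helper name replace_region and the rough bounding box are illustrative assumptions:

```python
# Illustrative sketch: refine a feature region with Canny edges, then copy
# the corresponding pixels from the sample-space image into the generated one.
import cv2
import numpy as np

def replace_region(generated, sample, x, y, w, h):
    roi = cv2.cvtColor(generated[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(roi, 100, 200)  # locate the feature more precisely
    ys, xs = np.nonzero(edges)
    if xs.size:
        x0, x1 = x + xs.min(), x + xs.max() + 1
        y0, y1 = y + ys.min(), y + ys.max() + 1
        generated[y0:y1, x0:x1] = sample[y0:y1, x0:x1]
    return generated
```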
S13, repairing the replaced facial features in the replacement image, and fusing the repaired replacement image with a preset background image to obtain a fused image.
After replacement, the image needs to be repaired to make it smoother and improve its appearance; it can then be fused with a background image to further improve the digital human's display effect.
In one embodiment, step S13 may include the following sub-steps:
and a substep S131 of performing smooth filtering processing on the replacement image to obtain a filtered image.
And a substep S132 of obtaining boundary information of a five-sense organ region in the filtering image, wherein the boundary information is image traces of the face sample image and the digital human image.
And a substep S133 of fusing the boundary information into the filtered image.
Specifically, boundary information during eye synthesis can be acquired, smooth filtering is performed on a human eye region of the synthesized image, and information at a corresponding boundary of an eye in the filtered image is fused into the synthesized image to perform eye repair; then, key point information during face synthesis is obtained to obtain a face contour, smooth filtering is carried out on a face region, and information of a corresponding face boundary in a filtered image is fused into a synthesized image to achieve the purpose of face restoration; and finally, fusing the repaired face image into a background image through background synthesis.
In actual operation, the eye repairing may first obtain boundary information during eye synthesis, then perform smooth filtering on a human eye region of the synthesized image, and finally fuse information at a corresponding boundary of an eye in the filtered image into the synthesized image.
The face restoration method includes the steps of firstly obtaining key point information during face synthesis, obtaining face outline, secondly carrying out smooth filtering on a face area, and finally fusing information of a corresponding face boundary in a filtered image into a synthesized image.
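A minimal sketch of this repair step (smoothing filter plus boundary re-fusion), assuming OpenCV and a binary mask of the replaced region; the helper repair_region and the equal-weight boundary blend are assumptions:

```python
# Illustrative sketch: smooth the replaced region, then fuse boundary detail
# back in to hide the synthesis trace along the region's edge.
import cv2
import numpy as np

def repair_region(image, region_mask, ksize=5):
    filtered = cv2.GaussianBlur(image, (ksize, ksize), 0)
    # The region boundary is where the synthesis traces appear
    kernel = np.ones((3, 3), np.uint8)
    boundary = cv2.morphologyEx(region_mask, cv2.MORPH_GRADIENT, kernel)
    out = image.copy()
    out[region_mask > 0] = filtered[region_mask > 0]       # smooth inside
    blend = cv2.addWeighted(image, 0.5, filtered, 0.5, 0)  # soften the seam
    out[boundary > 0] = blend[boundary > 0]
    return out
```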
In an alternative embodiment, the eye region may be repaired immediately after its replacement is completed, before the next region is replaced and repaired.
During replacement and synthesis, differences between the sample image and the generated image leave obvious synthesis traces at the edges of the replacement image; image repair therefore repairs the eyes and the face separately to eliminate the traces produced during synthesis.
Through eye synthesis and eye repair, the eye effect is more vivid than that generated by the best current prior art; through face synthesis and face repair, the face matches the background region more seamlessly; and through background synthesis, the digital human is fused into the background image, making it more realistic and practical.
In one embodiment, step S13 may further include the following sub-steps:
and a substep S134 of determining the face region information of the preset background image.
The preset background image may be a panoramic image containing a background preset by a user. The face region information may be position information of a region occupied by the face in the replacement image in the background image.
In one embodiment, the size or coordinates of the face in the replacement image may be determined, and the position in the background image may be determined based on the size or coordinates, resulting in face region information.
And a substep S135, splicing the repaired replacement image in a preset background image according to the face region information to obtain a fusion image.
The background synthesis may be performed by splicing the restored face image reductively at a corresponding position in a background image containing the face in the sample space according to position information of a face region in the sample space used in a face synthesis algorithm, so as to splice the replacement image in a preset background image to obtain a fused image.
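A minimal sketch of the background stitching, assuming OpenCV; the face-box coordinates are assumed to come from the face region information recorded during face synthesis:

```python
# Illustrative sketch: paste the repaired face back into the panoramic
# background at the recorded face-region position.
import cv2

def fuse_background(background, repaired_face, face_box):
    x, y, w, h = face_box  # face region inside the background image (assumed)
    out = background.copy()
    out[y:y + h, x:x + w] = cv2.resize(repaired_face, (w, h))
    return out
```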
Referring to fig. 3, a flowchart illustrating an operation of generating a fused image according to an embodiment of the present invention is shown.
After the digital human image is obtained, its facial features can be replaced to obtain a replacement image; the facial features in the replacement image are then repaired, and finally the repaired image is fused with the background image to obtain the fused image.
S14, extracting the character features of the fused image, and training a preset GAN network with the character features to obtain a digital human image containing posture changes.
The fused image already contains the re-edited face image, and the character features in it (such as the eyes, ears, mouth, nose and their contours) can be used for model training to generate a dynamic digital human image.
In one embodiment, step S14 may further include the following sub-steps:
and a substep S141 of extracting character features from the fused image and acquiring preset non-character features, wherein the preset non-character features are the posture change features of each frame of image in a preset video sample of the user.
And a substep S142, performing feature fusion on the character features and preset non-character features to obtain fusion features.
And a substep S143, inputting the fusion characteristics into a preset GAN network to obtain a digital human image sequence containing posture change.
And a substep S144 of constructing a digital human image by using the digital human image sequence containing the posture change.
The posture change algorithm firstly extracts the posture change characteristics of each frame of image in a reference video in a sample space to form non-character characteristics, secondly extracts the characteristics of a digital human image to form character characteristics, then fuses the non-character characteristics and the character characteristics, and finally generates a posture change image by a generating model in the GAN through training the GAN (confrontation generation network).
Referring to fig. 4, a flowchart of the operation of generating a digital human image with posture changes according to an embodiment of the present invention is shown.
In practical operation, a posture feature extractor can sequentially extract the posture-change features of each frame of the reference video in the sample space to form the non-character features; a character feature extractor then extracts the features of the digital human image to form the character features; the non-character features and the character features are fused to obtain fusion features containing both; and finally a GAN (generative adversarial network) is trained so that its generator produces the digital human image with posture changes, as sketched below.
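A minimal sketch of this per-frame fusion loop, assuming PyTorch; the two feature extractors and the pre-trained generator are hypothetical stand-ins for the models described above:

```python
# Illustrative sketch: fuse per-frame posture features with character features
# and let a (pre-trained) GAN generator emit posture-change frames.
import torch

def generate_pose_frames(ref_frames, digital_frames,
                         pose_extractor, person_extractor, generator):
    outputs = []
    for ref, gen in zip(ref_frames, digital_frames):
        pose_feat = pose_extractor(ref)      # non-character (posture) features
        person_feat = person_extractor(gen)  # character features
        fused = torch.cat([pose_feat, person_feat], dim=1)
        outputs.append(generator(fused))     # frame wearing the reference pose
    return outputs
```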
It should be noted that the reference video stored in the sample space may be adjusted according to the user's actual needs: if only one posture change is needed, one reference video may be used; if multiple posture changes are required, multiple reference videos, or one reference video containing multiple posture changes, may be needed; and if no posture change is needed, no reference video is required.
Correspondingly, the generated digital human image with posture changes is a digital human sequence frame. The digital human image is no longer only a frontal face image but also contains a posture; specifically, an image containing a posture drives the generated frontal image to make the corresponding posture change, such as a 30-degree head turn to the left or right.
For the speaker image sequence generated from the voice, because the reference video contains multiple ordered images, the voice-generated images in corresponding order are driven, one by one, by the reference-video images to perform posture changes. Finally, an image sequence of multiple images containing posture changes is obtained, and the posture change is visually continuous (each image contains a posture and is not merely a frontal image). The digital human video with voice-driven posture changes can then be synthesized from this image sequence.
By way of example: assume the reference video has 10 ordered images, denoted R1, R2, ..., R10, and that the voice also generates 10 ordered images, denoted G1, G2, ..., G10. In the posture-change algorithm, the posture feature of R1 is first extracted by the posture feature extractor and the character feature of G1 by the character feature extractor; the two features are fused, and the GAN generates an image P1 that gives G1 the posture of R1 (P1 contains a posture change relative to G1); and so on, until the image P10, which gives G10 the posture of R10, is obtained. The video synthesized from P1, ..., P10 is then a voice-driven digital human containing the posture changes of R1, R2, ..., R10. Specifically, if the number of frames r in the reference video is smaller than the number of frames g generated from the voice, the reference video is replayed in reverse order once its end is reached: for example, with r = 10 and g = 15, frame g = 11 is paired with r = 9, g = 12 with r = 8, and so on down to g = 15 with r = 5. The digital human image with posture changes here refers to a single image containing a human posture change, i.e., the "digital human with posture change" shown in fig. 4. It is called a posture-change image because, relative to the digital human image before the posture-change algorithm is applied (the "digital human sequence frame" in fig. 4, which is merely a frontal image whose face, limbs and body have no posture other than the frontal one), it has the same posture as the input posture image (the "sequence frame of the reference video" in fig. 4).
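A minimal sketch of the reverse-order (ping-pong) frame indexing implied by the example above; the helper ref_index is an assumption:

```python
# Illustrative sketch: map generated-frame index g (0-based) to a reference
# frame index, replaying the r reference frames backwards past the end.
def ref_index(g, r):
    if r == 1:
        return 0
    period = 2 * (r - 1)
    k = g % period
    return k if k < r else period - k

# With r = 10 and 15 generated frames, the mapping is
# 0 1 2 3 4 5 6 7 8 9 8 7 6 5 4 (i.e. G11->R9 ... G15->R5 in 1-based terms)
print([ref_index(g, 10) for g in range(15)])
```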
In addition, it should be noted that when generating the digital images, one dimension can be dropped so that posture changes are disregarded; less data is therefore needed to achieve the same effect as the best current prior art, and when the generated images are imperfect because of limited training data, they can be further optimized by the post-processing algorithm.
The invention learns the digital human's posture independently through the posture-change algorithm, making the posture changes more natural and smooth; the changes include not only head postures but also body movements.
To enable the user to view the digital human image more intuitively, in an embodiment, after the digital human image is edited and generated, the method may further include:
and S15, synthesizing the digital human image and the voice data into audio and video data.
And S16, sending the audio and video data to a preset user terminal for a user to check.
Referring to fig. 5, an operation flowchart of generating audio/video data according to an embodiment of the present invention is shown.
After the digital human sequence images are generated, the audio/video synthesis module converts the digital human sequence images and the voice synthesized by the TTS module into an audio/video containing both video and voice; finally, the audio/video obtained from the audio/video synthesis module is transmitted to the user terminal through the audio/video transmission module.
The user terminal is an intelligent terminal of a user, such as a mobile phone.
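A minimal sketch of the audio/video synthesis step, assuming the ffmpeg command-line tool is installed; all file names and the frame rate are illustrative:

```python
# Illustrative sketch: encode the digital-human frame sequence and mux it
# with the TTS audio into a single video file.
import subprocess

def mux_audio_video(frames_pattern, audio_path, out_path, fps=25):
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frames_pattern,  # e.g. "frame_%04d.png"
        "-i", audio_path,
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",
        out_path,
    ], check=True)

mux_audio_video("frame_%04d.png", "tts.wav", "digital_human.mp4")
```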
In summary, the embodiment of the present invention provides a digital person generation method with the following beneficial effects: a talking face image can be edited using only a digital human image and voice, and the edited face image is converted into a panoramic digital human image with limbs and a background via the background fusion technique and the posture-change algorithm, which not only eliminates the dependence on assistants but also improves editing efficiency, reduces the probability of deformities, and improves the fidelity and imaging quality of the digital human.
An embodiment of the present invention further provides a digital person generating system, and referring to fig. 6, a schematic structural diagram of the digital person generating system provided in the embodiment of the present invention is shown.
Wherein, as an example, the digital human generation system may include:
the fusion module 601, configured to, after acquiring voice data and an original image containing a background and a face, fuse the voice data and the original image to obtain a digital human image containing dynamic facial changes;
the replacement module 602, configured to replace the facial-feature images in the digital human image with a preset face sample image to obtain a replacement image;
the repair module 603, configured to repair the replaced facial features in the replacement image and to fuse the repaired replacement image with a preset background image to obtain a fused image;
and the editing module 604, configured to extract the character features of the fused image and train a preset GAN network with the character features to obtain a digital human image containing posture changes.
Optionally, the replacement module is further configured to:
perform edge detection on the facial-feature regions of the digital human image through an edge detection algorithm to obtain facial-feature region information, wherein the facial-feature regions include: the eye region, mouth region, nose region, ear region, eyebrow region, and face contour region;
extract a corresponding face sample image from a preset sample space based on the facial-feature region information, wherein the preset sample space consists of image samples and video samples preset by the user;
and replace the facial-feature image corresponding to the facial-feature region information with the face sample image to obtain a replacement image.
Optionally, the repair module is further configured to:
perform smoothing filtering on the replacement image to obtain a filtered image;
acquire boundary information of the facial-feature regions in the filtered image, wherein the boundary information is the image traces between the face sample image and the digital human image;
and fuse the boundary information into the filtered image.
Optionally, the repair module is further configured to:
determine face region information of a preset background image;
and stitch the repaired replacement image into the preset background image according to the face region information to obtain a fused image.
Optionally, the editing module is further configured to:
extract character features from the fused image and acquire preset non-character features, wherein the preset non-character features are the posture-change features of each frame of the user's preset video sample;
perform feature fusion on the character features and the preset non-character features to obtain fusion features;
input the fusion features into the preset GAN network to obtain a digital human image sequence containing posture changes;
and construct a digital human image from the digital human image sequence containing posture changes.
Optionally, the system further comprises:
a synthesis module for synthesizing the digital human image and the voice data into audio/video data;
and a sending module for sending the audio/video data to a preset user terminal for the user to view.
Optionally, the fusion module is further configured to:
determine the face region of the original image, extract face key points from the face region, and perform face alignment based on the face key points to obtain a frontal face image;
call a preset face encoder to extract face features from the frontal face image, and call a preset voice encoder to extract voice features from the voice data;
and perform feature fusion on the face features and the voice features to obtain fusion features, and input the fusion features into a preset decoder for mixing to obtain a digital human image containing dynamic facial changes.
It can be clearly understood by those skilled in the art that, for convenience and brevity, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Further, an embodiment of the present application further provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the digital human generation method as described in the above embodiments when executing the program.
Further, the present application provides a computer-readable storage medium, which stores computer-executable instructions for causing a computer to execute the digital human generation method according to the foregoing embodiment.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method of digital person generation, the method comprising:
after acquiring voice data and an original image containing a background and a face, fusing the voice data and the original image to obtain a digital human image containing dynamic facial changes;
replacing the facial-feature images in the digital human image with a preset face sample image to obtain a replacement image;
repairing the replaced facial features in the replacement image, and fusing the repaired replacement image with a preset background image to obtain a fused image;
and extracting character features from the fused image, and training a preset GAN network with the character features to obtain a digital human image containing posture changes.
2. The digital person generation method of claim 1, wherein the replacing the facial-feature images in the digital human image with a preset face sample image to obtain a replacement image comprises:
performing edge detection on the facial-feature regions of the digital human image through an edge detection algorithm to obtain facial-feature region information, wherein the facial-feature regions comprise: the eye region, mouth region, nose region, ear region, eyebrow region, and face contour region;
extracting a corresponding face sample image from a preset sample space based on the facial-feature region information, wherein the preset sample space consists of image samples and video samples preset by a user;
and replacing the facial-feature image corresponding to the facial-feature region information with the face sample image to obtain a replacement image.
3. The digital person generation method of claim 2, wherein the repairing the replaced facial features in the replacement image comprises:
performing smoothing filtering on the replacement image to obtain a filtered image;
acquiring boundary information of the facial-feature regions in the filtered image, wherein the boundary information is the image traces between the face sample image and the digital human image;
and fusing the boundary information into the filtered image.
4. The digital person generation method of claim 1, wherein the image fusion of the repaired replacement image and the preset background image to obtain a fused image comprises:
determining face region information of the preset background image;
and stitching the repaired replacement image into the preset background image according to the face region information to obtain a fused image.
5. The digital person generation method of claim 2, wherein the extracting the character features of the fused image and training a preset GAN network with the character features to obtain a digital human image containing posture changes comprises:
extracting character features from the fused image and acquiring preset non-character features, wherein the preset non-character features are the posture-change features of each frame of the user's preset video sample;
performing feature fusion on the character features and the preset non-character features to obtain fusion features;
inputting the fusion features into the preset GAN network to obtain a digital human image sequence containing posture changes;
and constructing a digital human image from the digital human image sequence containing posture changes.
6. The digital person generation method of any one of claims 1-5, wherein after the step of training the preset GAN network with the character features to obtain a digital human image containing posture changes, the method further comprises:
synthesizing the digital human image and the voice data into audio/video data;
and sending the audio/video data to a preset user terminal for the user to view.
7. The digital person generation method of any one of claims 1-5, wherein the fusing the voice data and the original image to obtain a digital human image containing dynamic facial changes comprises:
determining the face region of the original image, extracting face key points from the face region, and performing face alignment based on the face key points to obtain a frontal face image;
calling a preset face encoder to extract face features from the frontal face image, and calling a preset voice encoder to extract voice features from the voice data;
and performing feature fusion on the face features and the voice features to obtain fusion features, and inputting the fusion features into a preset decoder for mixing to obtain a digital human image containing dynamic facial changes.
8. A digital person generation system, the system comprising:
a fusion module, configured to, after acquiring voice data and an original image containing a background and a face, fuse the voice data and the original image to obtain a digital human image containing dynamic facial changes;
a replacement module, configured to replace the facial-feature images in the digital human image with a preset face sample image to obtain a replacement image;
a repair module, configured to repair the replaced facial features in the replacement image and to fuse the repaired replacement image with a preset background image to obtain a fused image;
and an editing module, configured to extract character features of the fused image and train a preset GAN network with the character features to obtain a digital human image containing posture changes.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the digital person generation method of any one of claims 1 to 7.
10. A computer-readable storage medium, storing a computer-executable program for causing a computer to perform the digital person generation method of any one of claims 1 to 7.
CN202211030862.XA 2022-08-26 2022-08-26 Digital person generation method and system Pending CN115471886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211030862.XA CN115471886A (en) 2022-08-26 2022-08-26 Digital person generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211030862.XA CN115471886A (en) 2022-08-26 2022-08-26 Digital person generation method and system

Publications (1)

Publication Number Publication Date
CN115471886A 2022-12-13

Family

ID=84368854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211030862.XA Pending CN115471886A (en) 2022-08-26 2022-08-26 Digital person generation method and system

Country Status (1)

Country Link
CN (1) CN115471886A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661005A (en) * 2022-12-26 2023-01-31 成都索贝数码科技股份有限公司 Generation method and device for customized digital person
CN117011435A (en) * 2023-09-28 2023-11-07 世优(北京)科技有限公司 Digital human image AI generation method and device
CN117011435B (en) * 2023-09-28 2024-01-09 世优(北京)科技有限公司 Digital human image AI generation method and device

Similar Documents

Publication Publication Date Title
CN115471886A (en) Digital person generation method and system
CN110751708B (en) Method and system for driving face animation in real time through voice
CN110659573B (en) Face recognition method and device, electronic equipment and storage medium
CN110266973A (en) Method for processing video frequency, device, computer readable storage medium and computer equipment
US10970909B2 (en) Method and apparatus for eye movement synthesis
US7257538B2 (en) Generating animation from visual and audio input
CN114187624B (en) Image generation method, device, electronic equipment and storage medium
Gibert et al. Analysis and synthesis of the three-dimensional movements of the head, face, and hand of a speaker using cued speech
CN115761075A (en) Face image generation method, device, equipment, medium and product
CN110096987A (en) A kind of sign language action identification method based on two-way 3DCNN model
Fan et al. Joint audio-text model for expressive speech-driven 3d facial animation
CN115049016A (en) Model driving method and device based on emotion recognition
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN117557695A (en) Method and device for generating video by driving single photo through audio
CN115409923A (en) Method, device and system for generating three-dimensional virtual image facial animation
CN111461959B (en) Face emotion synthesis method and device
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
CN114630190A (en) Joint posture parameter determining method, model training method and device
CN115984452A (en) Head three-dimensional reconstruction method and equipment
CN114818609A (en) Interaction method for virtual object, electronic device and computer storage medium
Chen et al. VAST: Vivify your talking avatar via zero-shot expressive facial style transfer
Zeng et al. Highly fluent sign language synthesis based on variable motion frame interpolation
US20240054811A1 (en) Mouth Shape Correction Model, And Model Training And Application Method
CN114693565B (en) GAN image restoration method based on jump connection multi-scale fusion
CN115810203B (en) Obstacle avoidance recognition method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination