CN110347867A - Method and apparatus for generating lip motion video - Google Patents


Info

Publication number
CN110347867A
CN110347867A (application CN201910640823.3A)
Authority
CN
China
Prior art keywords
key point
lip
pronunciation unit
point sequence
lip key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910640823.3A
Other languages
Chinese (zh)
Other versions
CN110347867B (en)
Inventor
龙翔
李鑫
刘霄
赵翔
王平
李甫
张赫男
孙昊
文石磊
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910640823.3A priority Critical patent/CN110347867B/en
Publication of CN110347867A publication Critical patent/CN110347867A/en
Application granted granted Critical
Publication of CN110347867B publication Critical patent/CN110347867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present application disclose a method and apparatus for generating a lip motion video. One specific embodiment of the method includes: acquiring a target text; determining a lip key point sequence corresponding to each pronunciation unit of the target text; generating a lip key point sequence corresponding to the target text based on the lip key point sequences corresponding to the pronunciation units; inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip motion image sequence corresponding to the target text; and splicing the lip motion image sequence corresponding to the target text to generate a lip motion video corresponding to the target text. This embodiment improves the efficiency of generating lip motion videos.

Description

Method and apparatus for generating lip motion video
Technical field
Embodiments of the present application relate to the field of computer technology, and in particular to a method and apparatus for generating a lip motion video.
Background art
Lip motion video generation technology uses computer technology to synthesize a lip motion video that corresponds to specified content, is correctly aligned in time, and flows completely naturally and smoothly.
At present, a common way to generate a lip motion video is to record a lip motion video for every possible pronunciation unit, split the sentence to be synthesized into a sequence of pronunciation units, scale the lip motion video corresponding to each pronunciation unit to the specified time, and splice the results into the synthesized lip motion video.
Summary of the invention
Embodiments of the present application propose a method and apparatus for generating a lip motion video.
In a first aspect, an embodiment of the present application provides a method for generating a lip motion video, comprising: acquiring a target text; determining a lip key point sequence corresponding to each pronunciation unit of the target text; generating a lip key point sequence corresponding to the target text based on the lip key point sequences corresponding to the pronunciation units; inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip motion image sequence corresponding to the target text; and splicing the lip motion image sequence corresponding to the target text to generate a lip motion video corresponding to the target text.
In some embodiments, the method further includes: synthesizing speech corresponding to the target text using a speech synthesis technique; and integrating the speech corresponding to the target text into the lip motion video corresponding to the target text.
In some embodiments, determining the lip key point sequence corresponding to each pronunciation unit of the target text includes: acquiring a lip motion video of continuous sentences pre-recorded by a target person and an original lip motion video of each pronunciation unit; for each pronunciation unit, determining lip key point sequences corresponding to lip motion video clips in the lip motion video of the continuous sentences that are similar to the original lip motion video of the pronunciation unit, and generating a candidate lip key point sequence set corresponding to the pronunciation unit; and determining the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit.
In some embodiments, determining the lip key point sequences corresponding to the lip motion video clips in the lip motion video of the continuous sentences that are similar to the original lip motion video of the pronunciation unit and generating the candidate lip key point sequence set corresponding to the pronunciation unit includes: performing lip key point extraction on the lip motion video of the continuous sentences to obtain a lip key point sequence of the continuous sentences; performing lip key point extraction on the original lip motion video of the pronunciation unit to obtain an original lip key point sequence of the pronunciation unit; and determining, from the lip key point sequence of the continuous sentences, lip key point sequences similar to the original lip key point sequence of the pronunciation unit to generate the candidate lip key point sequence set corresponding to the pronunciation unit.
In some embodiments, determining, from the lip key point sequence of the continuous sentences, lip key point sequences similar to the original lip key point sequence of the pronunciation unit includes: determining the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentences; and performing path backtracking based on the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit to determine the lip key point sequences similar to the original lip key point sequence of the pronunciation unit.
In some embodiments, determining the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit includes: calculating the similarity between each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to a pronunciation unit adjacent to the pronunciation unit; determining the end position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarities; and performing path backtracking based on the end position of the lip key point sequence corresponding to the pronunciation unit to determine the lip key point sequence corresponding to the pronunciation unit.
In some embodiments, generating the lip key point sequence corresponding to the target text based on the lip key point sequences corresponding to the pronunciation units includes: determining the start and end times of each pronunciation unit based on the speech corresponding to the target text; and matching the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit to generate the lip key point sequence corresponding to the target text.
In some embodiments, matching the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit includes: performing linear interpolation in the time dimension on the lip key point sequence corresponding to each pronunciation unit to match the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit.
In some embodiments, after matching the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit, the method further includes: smoothing the lip key point sequences corresponding to adjacent pronunciation units.
In some embodiments, smoothing the lip key point sequences corresponding to adjacent pronunciation units includes: selecting a lip key point sequence fragment of a rear preset duration corresponding to the former pronunciation unit of the adjacent pronunciation units and a lip key point sequence fragment of a front preset duration corresponding to the latter pronunciation unit; and smoothing the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
In some embodiments, the image synthesis network is trained as follows: acquiring training samples, where a training sample includes a sample lip key point and a sample lip motion image; and training the image synthesis network with the sample lip key point as input and the sample lip motion image as output.
In some embodiments, the sample lip motion image is an image extracted from the lip motion video of continuous sentences pre-recorded by the target person, and the sample lip key point is a lip key point obtained by performing lip key point extraction on the extracted image.
In a second aspect, an embodiment of the present application provides an apparatus for generating a lip motion video, comprising: a text acquisition unit configured to acquire a target text; a sequence determination unit configured to determine a lip key point sequence corresponding to each pronunciation unit of the target text; a sequence generation unit configured to generate a lip key point sequence corresponding to the target text based on the lip key point sequences corresponding to the pronunciation units; an image synthesis unit configured to input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip motion image sequence corresponding to the target text; and a video generation unit configured to splice the lip motion image sequence corresponding to the target text to generate a lip motion video corresponding to the target text.
In some embodiments, the apparatus further includes: a speech synthesis unit configured to synthesize speech corresponding to the target text using a speech synthesis technique; and a speech integration unit configured to integrate the speech corresponding to the target text into the lip motion video corresponding to the target text.
In some embodiments, the sequence determination unit includes: a video acquisition subunit configured to acquire a lip motion video of continuous sentences pre-recorded by a target person and an original lip motion video of each pronunciation unit; a set generation subunit configured to determine, for each pronunciation unit, lip key point sequences corresponding to lip motion video clips in the lip motion video of the continuous sentences that are similar to the original lip motion video of the pronunciation unit, and to generate a candidate lip key point sequence set corresponding to the pronunciation unit; and a sequence determination subunit configured to determine the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit.
In some embodiments, the set generation subunit includes: a first extraction module configured to perform lip key point extraction on the lip motion video of the continuous sentences to obtain a lip key point sequence of the continuous sentences; a second extraction module configured to perform lip key point extraction on the original lip motion video of the pronunciation unit to obtain an original lip key point sequence of the pronunciation unit; and a set generation module configured to determine, from the lip key point sequence of the continuous sentences, lip key point sequences similar to the original lip key point sequence of the pronunciation unit, and to generate the candidate lip key point sequence set corresponding to the pronunciation unit.
In some embodiments, the set generation module is further configured to: determine the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentences; and perform path backtracking based on the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit to determine the lip key point sequences similar to the original lip key point sequence of the pronunciation unit.
In some embodiments, the sequence determination subunit is further configured to: calculate the similarity between each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to a pronunciation unit adjacent to the pronunciation unit; determine the end position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarities; and perform path backtracking based on the end position of the lip key point sequence corresponding to the pronunciation unit to determine the lip key point sequence corresponding to the pronunciation unit.
In some embodiments, the sequence generation unit includes: a time determination subunit configured to determine the start and end times of each pronunciation unit based on the speech corresponding to the target text; and a sequence generation subunit configured to match the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit to generate the lip key point sequence corresponding to the target text.
In some embodiments, the sequence generation subunit includes: a linear interpolation module configured to perform linear interpolation in the time dimension on the lip key point sequence corresponding to each pronunciation unit to match the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit.
In some embodiments, the sequence generation subunit further includes: a smoothing module configured to smooth the lip key point sequences corresponding to adjacent pronunciation units.
In some embodiments, the smoothing module is further configured to: select a lip key point sequence fragment of a rear preset duration corresponding to the former pronunciation unit of the adjacent pronunciation units and a lip key point sequence fragment of a front preset duration corresponding to the latter pronunciation unit; and smooth the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
In some embodiments, the image synthesis network is trained as follows: acquiring training samples, where a training sample includes a sample lip key point and a sample lip motion image; and training the image synthesis network with the sample lip key point as input and the sample lip motion image as output.
In some embodiments, the sample lip motion image is an image extracted from the lip motion video of continuous sentences pre-recorded by the target person, and the sample lip key point is a lip key point obtained by performing lip key point extraction on the extracted image.
In a third aspect, an embodiment of the present application provides an electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method described in any implementation of the first aspect.
According to the method and apparatus for generating a lip motion video provided by the embodiments of the present application, the lip key point sequence corresponding to each pronunciation unit of the acquired target text is first determined; the lip key point sequence corresponding to the target text is then generated based on the lip key point sequences corresponding to the pronunciation units; the lip key point sequence corresponding to the target text is then input into a pre-trained image synthesis network to obtain the lip motion image sequence corresponding to the target text; finally, the lip motion image sequence corresponding to the target text is spliced to generate the lip motion video corresponding to the target text. This improves the efficiency of generating lip motion videos.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent by reading the following detailed description of non-restrictive embodiments with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture to which the present application can be applied;
Fig. 2 is a flowchart of one embodiment of the method for generating a lip motion video according to the present application;
Fig. 3 is a flowchart of another embodiment of the method for generating a lip motion video according to the present application;
Fig. 4 is a flowchart of yet another embodiment of the method for generating a lip motion video according to the present application;
Fig. 5 is a structural schematic diagram of one embodiment of the apparatus for generating a lip motion video according to the present application;
Fig. 6 is a structural schematic diagram of a computer system suitable for implementing the electronic device of the embodiments of the present application.
Detailed description of the embodiments
The present application will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the relevant invention, rather than to limit the invention. It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features of the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for generating a lip motion video or the apparatus for generating a lip motion video of the present application can be applied.
As shown in Fig. 1, the system architecture 100 may include a terminal device 101, a network 102 and a server 103. The network 102 is a medium for providing a communication link between the terminal device 101 and the server 103. The network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user may use the terminal device 101 to interact with the server 103 via the network 102 to receive or send messages. Various client software, such as video generation applications, may be installed on the terminal device 101.
The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices having a display screen and supporting video playback, including but not limited to smart phones, tablet computers, laptop computers and desktop computers. When the terminal device 101 is software, it may be installed in the above electronic devices; it may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
The server 103 may be a server providing various services, such as a video generation server. The video generation server may analyze and otherwise process data such as the target text, generate a processing result (for example, the lip motion video corresponding to the target text), and push the processing result to the terminal device 101.
It should be noted that the server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (for example, for providing distributed services), or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the method for generating a lip motion video provided by the embodiments of the present application is generally performed by the server 103, and accordingly the apparatus for generating a lip motion video is generally disposed in the server 103.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
Continuing to refer to Fig. 2, it shows a flow 200 of one embodiment of the method for generating a lip motion video according to the present application. The method for generating a lip motion video includes the following steps:
Step 201: acquiring a target text.
In this embodiment, the executing body of the method for generating a lip motion video (for example, the server 103 shown in Fig. 1) may acquire the target text from a terminal device in communication connection with it (for example, the terminal device 101 shown in Fig. 1). In practice, a user may open the video generation application installed on the terminal device, input the target text, and submit it to the executing body.
Step 202: determining a lip key point sequence corresponding to each pronunciation unit of the target text.
In this embodiment, the executing body may determine the lip key point sequence corresponding to each pronunciation unit of the target text.
Here, a pronunciation unit may be a unit of human speech at a certain granularity. For Chinese, for example, a pronunciation unit may be a pinyin combination, or a single initial or final. In general, the smaller the splitting granularity of pronunciation units, the fewer units there are and the easier it is to annotate pronunciation units from text; the larger the splitting granularity, the more units there are and the smoother the lip motion video synthesized from the pronunciation units. Here, the executing body may choose an appropriate splitting granularity according to the requirements and split the target text into a pronunciation unit sequence.
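As an illustration only (not part of the original disclosure), splitting a Chinese target text into pinyin-level pronunciation units could look like the following sketch, which assumes the open-source pypinyin package; the concrete splitting scheme used in practice may differ.

```python
# A minimal sketch: split a Chinese target text into pinyin-level pronunciation
# units. Assumes the pypinyin package is available; an initial/final-level split
# would need a further lookup table on top of this.
from pypinyin import lazy_pinyin

def split_into_pronunciation_units(target_text: str) -> list[str]:
    # lazy_pinyin returns one tone-less pinyin string per character,
    # e.g. "你好" -> ["ni", "hao"]; each string serves as one pronunciation unit.
    return lazy_pinyin(target_text)

if __name__ == "__main__":
    print(split_into_pronunciation_units("你好世界"))  # ['ni', 'hao', 'shi', 'jie']
```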
In general, the lips go through different shape change processes while the target person utters different pronunciation units. The lip key point sequence corresponding to a pronunciation unit may be the sequence composed of the lip key points at each moment of the lip shape change process in which the target person utters that pronunciation unit. The target person may be the person expected to appear in the synthesized lip motion video, and may be any designated person. Lip key points may include the lip center, lip corners, and so on. A frame of lip key points can be represented as a vector, that is, after centering and angle normalization, the coordinates of all key points of a single mouth are concatenated in order into a one-dimensional vector.
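As a purely illustrative sketch (the exact key point layout and normalization beyond centering and angle normalization are not specified in the text), one frame of lip key points could be flattened into a one-dimensional vector like this:

```python
import numpy as np

def lip_keypoints_to_vector(points: np.ndarray) -> np.ndarray:
    """Sketch: turn one frame of lip key points, shape (K, 2), into a 1-D vector.

    Assumed steps: translate so the lip center is at the origin, rotate so the
    line between the two mouth corners is horizontal, then flatten the x/y
    coordinates in a fixed key point order.
    """
    centered = points - points.mean(axis=0)
    # Assume indices 0 and 1 are the left and right mouth corners (hypothetical layout).
    corner_vec = centered[1] - centered[0]
    angle = np.arctan2(corner_vec[1], corner_vec[0])
    rot = np.array([[np.cos(-angle), -np.sin(-angle)],
                    [np.sin(-angle),  np.cos(-angle)]])
    normalized = centered @ rot.T
    return normalized.reshape(-1)  # concatenate coordinates into one vector
```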
Step 203: generating a lip key point sequence corresponding to the target text based on the lip key point sequences corresponding to the pronunciation units.
In this embodiment, the executing body may generate the lip key point sequence corresponding to the target text based on the lip key point sequences corresponding to the pronunciation units. For example, the executing body may splice the lip key point sequences corresponding to the pronunciation units in order, according to the position of each pronunciation unit in the target text, to obtain the lip key point sequence corresponding to the target text.
Step 204: inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip motion image sequence corresponding to the target text.
In this embodiment, the executing body may input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain the lip motion image sequence corresponding to the target text. The image synthesis network is used for synthesizing lip motion images and characterizes the correspondence between lip key points and lip motion images.
In some optional implementations of this embodiment, the image synthesis network may be obtained by supervised training of an existing machine learning model with training samples using a machine learning method. In general, the image synthesis network may use the Pix2pixHD neural network structure in order to generate high-resolution lip motion images.
Here, the image synthesis network may be trained as follows:
First, training samples are acquired.
A training sample may include a sample lip key point and a sample lip motion image. The sample lip motion image may be an image extracted from the lip motion video of continuous sentences pre-recorded by the target person. The sample lip key point may be a lip key point obtained by performing lip key point extraction on the extracted image.
Then, the image synthesis network is trained with the sample lip key point as input and the sample lip motion image as output.
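The text gives no training code; the following is only a schematic sketch of the supervised key-point-in/image-out pairing it describes, with a small placeholder generator standing in for a Pix2pixHD-style network and a plain reconstruction loss standing in for its GAN losses.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class KeypointToImage(nn.Module):
    """Placeholder generator: maps a flattened lip key point vector to a small RGB image."""
    def __init__(self, keypoint_dim: int, image_size: int = 64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Sequential(
            nn.Linear(keypoint_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * image_size * image_size), nn.Tanh(),
        )

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        out = self.net(keypoints)
        return out.view(-1, 3, self.image_size, self.image_size)

def train(model, keypoints, images, epochs: int = 10):
    loader = DataLoader(TensorDataset(keypoints, images), batch_size=16, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=2e-4)
    loss_fn = nn.L1Loss()  # stand-in reconstruction loss, not Pix2pixHD's adversarial losses
    for _ in range(epochs):
        for kp, img in loader:
            opt.zero_grad()
            loss = loss_fn(model(kp), img)
            loss.backward()
            opt.step()
```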
Step 205: splicing the lip motion image sequence corresponding to the target text to generate a lip motion video corresponding to the target text.
In this embodiment, the executing body may splice the lip motion image sequence corresponding to the target text to generate the lip motion video corresponding to the target text.
In some optional implementations of this embodiment, the executing body may further synthesize speech corresponding to the target text using a speech synthesis technique, and then integrate the speech corresponding to the target text into the lip motion video corresponding to the target text.
According to the method for generating a lip motion video provided by the embodiments of the present application, the lip key point sequence corresponding to each pronunciation unit of the acquired target text is first determined; the lip key point sequence corresponding to the target text is then generated based on the lip key point sequences corresponding to the pronunciation units; the lip key point sequence corresponding to the target text is then input into a pre-trained image synthesis network to obtain the lip motion image sequence corresponding to the target text; finally, the lip motion image sequence corresponding to the target text is spliced to generate the lip motion video corresponding to the target text. This improves the efficiency of generating lip motion videos.
With further reference to Fig. 3, it shows a flow 300 of another embodiment of the method for generating a lip motion video according to the present application. The method for generating a lip motion video includes the following steps:
Step 301: acquiring a target text.
In this embodiment, the specific operation of step 301 has been described in detail in step 201 of the embodiment shown in Fig. 2, and is not repeated here.
Step 302: acquiring a lip motion video of continuous sentences pre-recorded by a target person and an original lip motion video of each pronunciation unit.
In this embodiment, the executing body of the method for generating a lip motion video (for example, the server 103 shown in Fig. 1) may acquire the lip motion video of continuous sentences pre-recorded by the target person and the original lip motion video of each pronunciation unit.
In general, this embodiment requires two kinds of data to be pre-recorded. First, a lip motion video of continuous sentences of the target person, that is, a video obtained by recording the target person speaking continuous sentences in the target language and normalizing the lip center and angle. The target language may be the language expected to be spoken by the person in the synthesized lip motion video; it may be any single language, or a set of multiple languages. Multiple languages mean more pronunciation units and place higher requirements on data annotation and computing power. Second, an original lip motion video of each pronunciation unit of the target person, that is, a video obtained by recording the target person uttering each pronunciation unit, annotating the start time and end time of each pronunciation unit, cutting out the part from the start time to the end time, and normalizing the lip center and angle.
Step 303: for each pronunciation unit, determining lip key point sequences corresponding to lip motion video clips in the lip motion video of the continuous sentences that are similar to the original lip motion video of the pronunciation unit, and generating a candidate lip key point sequence set corresponding to the pronunciation unit.
In this embodiment, for each pronunciation unit, the executing body may determine the lip key point sequences corresponding to the lip motion video clips in the lip motion video of the continuous sentences that are similar to the original lip motion video of the pronunciation unit, so as to generate the candidate lip key point sequence set corresponding to the pronunciation unit.
In some optional implementations of this embodiment, for each pronunciation unit, the executing body may find, from the lip motion video of the continuous sentences, a large number of lip motion video clips similar to the original lip motion video of the pronunciation unit, and perform lip key point extraction on each of the found lip motion video clips, so as to generate the candidate lip key point sequence set corresponding to the pronunciation unit.
In some optional implementations of this embodiment, the executing body may first perform lip key point extraction on the lip motion video of the continuous sentences to obtain the lip key point sequence of the continuous sentences; then, for each pronunciation unit, perform lip key point extraction on the original lip motion video of the pronunciation unit to obtain the original lip key point sequence of the pronunciation unit; and finally determine, from the lip key point sequence of the continuous sentences, lip key point sequences similar to the original lip key point sequence of the pronunciation unit, and generate the candidate lip key point sequence set corresponding to the pronunciation unit.
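The text does not name a specific landmark detector; as one possible realization, per-frame lip key points could be extracted with dlib's 68-point face landmark model (mouth points are indices 48-67), assuming the standard pre-trained predictor file is available.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Assumes the standard pre-trained 68-landmark model has been downloaded.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_lip_keypoint_sequence(video_path: str) -> list[np.ndarray]:
    """Sketch: return one (20, 2) array of mouth landmarks per video frame."""
    sequence = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            continue  # skip frames where no face is detected
        shape = predictor(gray, faces[0])
        mouth = np.array([[shape.part(i).x, shape.part(i).y] for i in range(48, 68)])
        sequence.append(mouth)
    capture.release()
    return sequence
```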
Step 304: determining the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit.
In this embodiment, the executing body may determine the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit. In general, the executing body may select, from the candidate lip key point sequence set corresponding to the pronunciation unit, a candidate lip key point sequence that can transition naturally and smoothly to the lip key point sequences corresponding to the adjacent pronunciation units, as the lip key point sequence corresponding to the pronunciation unit.
Step 305: synthesizing speech corresponding to the target text using a speech synthesis technique.
In this embodiment, the executing body may synthesize the speech corresponding to the target text using a speech synthesis technique.
Step 306: determining the start and end times of each pronunciation unit based on the speech corresponding to the target text.
In this embodiment, the executing body may determine the start and end times of each pronunciation unit based on the speech corresponding to the target text. In general, the start and end times of a pronunciation unit can be determined by the speech synthesis system; existing speech synthesis systems can all provide the start and end times of pronunciation units.
Step 307: matching the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit.
In this embodiment, the executing body may match the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit. In general, the executing body may expand or compress the lip key point sequence corresponding to each pronunciation unit to fit the corresponding start and end times. For example, the executing body may perform linear interpolation in the time dimension on the lip key point sequence corresponding to each pronunciation unit, thereby matching the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit.
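A minimal sketch of this temporal resampling, assuming lip key point frames are stored as a (T, D) array and the target frame count is derived from the unit's start/end times and the video frame rate:

```python
import numpy as np

def resample_keypoint_sequence(sequence: np.ndarray, target_length: int) -> np.ndarray:
    """Linearly interpolate a (T, D) lip key point sequence to target_length frames."""
    src_times = np.linspace(0.0, 1.0, sequence.shape[0])
    dst_times = np.linspace(0.0, 1.0, target_length)
    # Interpolate each key point coordinate independently along the time axis.
    return np.stack(
        [np.interp(dst_times, src_times, sequence[:, d]) for d in range(sequence.shape[1])],
        axis=1,
    )

# Example: stretch a 12-frame unit to fill a 0.4 s slot at 25 fps (10 frames).
# resampled = resample_keypoint_sequence(unit_sequence, target_length=10)
```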
Step 308: smoothing the lip key point sequences corresponding to adjacent pronunciation units, and generating the lip key point sequence corresponding to the target text.
In this embodiment, the executing body may smooth the lip key point sequences corresponding to adjacent pronunciation units, so as to generate the lip key point sequence corresponding to the target text. For example, the executing body may first select the lip key point sequence fragment of the rear preset duration corresponding to the former pronunciation unit of the adjacent pronunciation units and the lip key point sequence fragment of the front preset duration corresponding to the latter pronunciation unit, and then smooth the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
For example, to make a transition between two adjacent pronunciation units, take the lip key point sequence fragment (x_0, ..., x_L) of the last β milliseconds of the former pronunciation unit (β is chosen appropriately according to the pronunciation unit length; for Chinese pinyin, 30 milliseconds may be used) and the lip key point sequence fragment (y_0, ..., y_L) of the first β milliseconds of the latter pronunciation unit, and smooth them with the following formulas:
x_i ← x_i + i·(y_0 − x_L)/(2L), where i = 0, 1, ..., L;
y_j ← y_j − (L − j)·(y_0 − x_L)/(2L), where j = 0, 1, ..., L.
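A small sketch of these two formulas, assuming both fragments have been resampled to the same length L + 1 and each frame is a NumPy vector:

```python
import numpy as np

def smooth_transition(x: np.ndarray, y: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Apply the cross-fade above to the tail fragment x and head fragment y.

    x, y: arrays of shape (L + 1, D) holding the last/first beta milliseconds of
    the former/latter pronunciation unit's lip key point vectors.
    """
    L = x.shape[0] - 1
    gap = (y[0] - x[L]) / (2 * L)       # half the jump between the two fragments, per step
    x, y = x.copy(), y.copy()
    for i in range(L + 1):
        x[i] = x[i] + i * gap           # pull the tail of the former unit toward the midpoint
    for j in range(L + 1):
        y[j] = y[j] - (L - j) * gap     # pull the head of the latter unit toward the midpoint
    return x, y
```

With these updates the last frame of the former fragment and the first frame of the latter fragment meet at the midpoint between the original x_L and y_0, which removes the jump at the boundary.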
Step 309: inputting the lip key point sequence corresponding to the target text into the pre-trained image synthesis network to obtain the lip motion image sequence corresponding to the target text.
Step 310: splicing the lip motion image sequence corresponding to the target text to generate the lip motion video corresponding to the target text.
In this embodiment, the specific operations of steps 309-310 have been described in detail in steps 204-205 of the embodiment shown in Fig. 2, and are not repeated here.
Step 311: integrating the speech corresponding to the target text into the lip motion video corresponding to the target text.
In this embodiment, the executing body may integrate the speech corresponding to the target text into the lip motion video corresponding to the target text.
As can be seen from Fig. 3, compared with the embodiment corresponding to Fig. 2, the flow 300 of the method for generating a lip motion video in this embodiment highlights the step of matching the lip key point sequence corresponding to a pronunciation unit to the start and end times corresponding to the pronunciation unit. The solution described in this embodiment therefore improves the matching degree between the lip motion video corresponding to the target text and the speech corresponding to the target text, making the generated lip motion video more natural and fluent.
With further reference to Fig. 4, it shows a flow 400 of yet another embodiment of the method for generating a lip motion video according to the present application. The method for generating a lip motion video includes the following steps:
Step 401: acquiring a target text.
Step 402: acquiring a lip motion video of continuous sentences pre-recorded by a target person and an original lip motion video of each pronunciation unit.
In this embodiment, the specific operations of steps 401-402 have been described in detail in steps 301-302 of the embodiment shown in Fig. 3, and are not repeated here.
Step 403: performing lip key point extraction on the lip motion video of the continuous sentences to obtain the lip key point sequence of the continuous sentences.
In this embodiment, the executing body of the method for generating a lip motion video (for example, the server 103 shown in Fig. 1) may perform lip key point extraction on the lip motion video of the continuous sentences to obtain the lip key point sequence of the continuous sentences. For example, the lip key point sequence of the continuous sentences may be (c_1, ..., c_M).
Step 404: for each pronunciation unit, performing lip key point extraction on the original lip motion video of the pronunciation unit to obtain the original lip key point sequence of the pronunciation unit.
In this embodiment, for each pronunciation unit, the executing body may perform lip key point extraction on the original lip motion video of the pronunciation unit to obtain the original lip key point sequence of the pronunciation unit. For example, the original lip key point sequence of a pronunciation unit may be (a_1, ..., a_N).
Step 405: determining the end positions of lip key point sequences similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentences.
In this embodiment, the executing body may determine the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentences.
Step 406: performing path backtracking based on the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit, determining the lip key point sequences similar to the original lip key point sequence of the pronunciation unit, and generating the candidate lip key point sequence set corresponding to the pronunciation unit.
In this embodiment, the executing body may perform path backtracking based on the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit, determine the lip key point sequences similar to the original lip key point sequence of the pronunciation unit, and thereby generate the candidate lip key point sequence set corresponding to the pronunciation unit.
For example, the α lip key point sequences most similar to the original lip key point sequence of a pronunciation unit can be found with a sequence-similarity dynamic time warping algorithm. The specific algorithm is as follows:
First, initialize s(0, 0) = 0.
Then, compute all s(i, j) with an iterative recurrence, where i = 1, ..., N and j = 1, ..., M,
and, according to which case is selected in the recurrence, record the corresponding optimal path p(i, j).
Here, ρ_1, ρ_2 and ρ_3 are penalty parameters, which are real numbers greater than 0 and are chosen according to factors such as the size of the video and the speaking rate.
Finally, find the α largest values of s(N, j); each such j is the end position of a corresponding lip key point sequence, and path backtracking according to p(i, j) recovers the corresponding complete lip key point sequence. This yields the α lip key point sequences most similar to the original lip key point sequence of the pronunciation unit.
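The recurrence for s(i, j) is not reproduced in this text, so the following is only an illustrative sketch of a subsequence dynamic-time-warping search of this general shape, with a simple match/skip penalty scheme standing in for ρ_1, ρ_2, ρ_3 and a cost (rather than similarity) formulation:

```python
import numpy as np

def top_alpha_similar_subsequences(a, c, alpha=5, rho1=1.0, rho2=1.0, rho3=1.0):
    """Sketch of a subsequence-DTW style search.

    a: (N, D) original lip key point sequence of one pronunciation unit.
    c: (M, D) lip key point sequence of the continuous sentences.
    Returns up to alpha (start, end) index pairs into c, best matches first.
    """
    N, M = len(a), len(c)
    dist = np.linalg.norm(a[:, None, :] - c[None, :, :], axis=2)
    s = np.full((N + 1, M + 1), np.inf)
    s[0, :] = 0.0                                 # a match may start anywhere in c
    back = np.zeros((N + 1, M + 1), dtype=int)
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            choices = (s[i - 1, j - 1] + rho1,    # advance both sequences
                       s[i - 1, j] + rho2,        # skip a frame of a
                       s[i, j - 1] + rho3)        # skip a frame of c
            back[i, j] = int(np.argmin(choices))
            s[i, j] = dist[i - 1, j - 1] + min(choices)
    ends = np.argsort(s[N, 1:])[:alpha] + 1       # alpha best end positions
    results = []
    for j_end in ends:
        i, j = N, int(j_end)
        while i > 0:                               # path backtracking via `back`
            step = back[i, j]
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        results.append((j, int(j_end) - 1))        # 0-indexed start/end in c
    return results
```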
Step 407: calculating the similarity between each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to the pronunciation unit adjacent to it.
In this embodiment, the executing body may calculate the similarity between each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to the pronunciation unit adjacent to it.
Step 408: determining the end position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarities.
In this embodiment, the executing body may determine the end position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarities.
Step 409: performing path backtracking based on the end position of the lip key point sequence corresponding to the pronunciation unit, and determining the lip key point sequence corresponding to the pronunciation unit.
In this embodiment, the executing body may perform path backtracking based on the end position of the lip key point sequence corresponding to the pronunciation unit, and determine the lip key point sequence corresponding to the pronunciation unit.
For example, the target text can be converted directly into a pronunciation unit sequence, and according to the pronunciation unit sequence, α candidates are obtained for each position:
({X_{1,1}, ..., X_{1,α}}, ..., {X_{T,1}, ..., X_{T,α}}).
A most suitable candidate needs to be selected at each position so that adjacent pronunciation units join best; that is, the last group of lip key points of the former segment's lip key point sequence should be most similar to the first group of lip key points of the latter segment's lip key point sequence. Here, similarity is measured by the distance between lip key point vectors. The globally optimal lip key point sequence can be found by a dynamic programming algorithm over the adjacency similarities. The specific algorithm is as follows:
First, initialize d(1, k) = 0, where k = 1, ..., α.
Then, compute all d(i, k) with an iterative recurrence, where i = 1, ..., T and k = 1, ..., α,
and record the corresponding optimal path q(i, j).
Finally, find the smallest d(T, k_T); then for the T-th pronunciation unit, the k_T-th candidate is taken as its lip key point sequence, and path backtracking according to q(i, j) yields all k_i, i = 1, ..., T. This gives the candidate lip key point sequences that are best connected to one another.
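Again, the recurrence itself is not reproduced here; the following sketch only illustrates a Viterbi-style selection of one candidate per position that minimizes the total junction distance between the last frame of one candidate and the first frame of the next:

```python
import numpy as np

def select_best_candidates(candidates):
    """Sketch: candidates[i][k] is a (T_ik, D) lip key point sequence, the k-th
    candidate for the i-th pronunciation unit. Returns one candidate index per
    position so that adjacent selections join with minimal total distance.
    """
    T = len(candidates)
    alpha = len(candidates[0])
    d = np.zeros((T, alpha))                  # best accumulated junction cost
    q = np.zeros((T, alpha), dtype=int)       # back-pointers for path backtracking
    for i in range(1, T):
        for k in range(alpha):
            costs = [
                d[i - 1, kp]
                + np.linalg.norm(candidates[i - 1][kp][-1] - candidates[i][k][0])
                for kp in range(alpha)
            ]
            q[i, k] = int(np.argmin(costs))
            d[i, k] = min(costs)
    ks = [int(np.argmin(d[T - 1]))]           # best candidate k_T for the last unit
    for i in range(T - 1, 0, -1):             # path backtracking
        ks.append(int(q[i, ks[-1]]))
    ks.reverse()
    return ks
```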
Step 410: generating the lip key point sequence corresponding to the target text based on the lip key point sequences corresponding to the pronunciation units.
Step 411: inputting the lip key point sequence corresponding to the target text into the pre-trained image synthesis network to obtain the lip motion image sequence corresponding to the target text.
Step 412: splicing the lip motion image sequence corresponding to the target text to generate the lip motion video corresponding to the target text.
In this embodiment, the specific operations of steps 410-412 have been described in detail in steps 203-205 of the embodiment shown in Fig. 2, and are not repeated here.
As can be seen from Fig. 4, compared with the embodiment corresponding to Fig. 2, the flow 400 of the method for generating a lip motion video in this embodiment highlights the step of determining the lip key point sequence corresponding to a pronunciation unit. The solution described in this embodiment therefore makes the transitions between adjacent pronunciation units natural and smooth, so that the generated lip motion video is more natural and fluent.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for generating a lip motion video. This apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus may specifically be applied to various electronic devices.
As shown in Fig. 5, the apparatus 500 for generating a lip motion video of this embodiment may include: a text acquisition unit 501, a sequence determination unit 502, a sequence generation unit 503, an image synthesis unit 504 and a video generation unit 505. The text acquisition unit 501 is configured to acquire a target text; the sequence determination unit 502 is configured to determine a lip key point sequence corresponding to each pronunciation unit of the target text; the sequence generation unit 503 is configured to generate a lip key point sequence corresponding to the target text based on the lip key point sequences corresponding to the pronunciation units; the image synthesis unit 504 is configured to input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip motion image sequence corresponding to the target text; and the video generation unit 505 is configured to splice the lip motion image sequence corresponding to the target text to generate a lip motion video corresponding to the target text.
In this embodiment, for the specific processing of the text acquisition unit 501, the sequence determination unit 502, the sequence generation unit 503, the image synthesis unit 504 and the video generation unit 505 in the apparatus 500 for generating a lip motion video and the technical effects brought thereby, reference may be made to the related descriptions of step 201, step 202, step 203, step 204 and step 205 in the embodiment corresponding to Fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the apparatus 500 for generating a lip motion video further includes: a speech synthesis unit (not shown) configured to synthesize speech corresponding to the target text using a speech synthesis technique; and a speech integration unit (not shown) configured to integrate the speech corresponding to the target text into the lip motion video corresponding to the target text.
In some optional implementations of this embodiment, the sequence determination unit 502 includes: a video acquisition subunit (not shown) configured to acquire a lip motion video of continuous sentences pre-recorded by a target person and an original lip motion video of each pronunciation unit; a set generation subunit (not shown) configured to determine, for each pronunciation unit, lip key point sequences corresponding to lip motion video clips in the lip motion video of the continuous sentences that are similar to the original lip motion video of the pronunciation unit, and to generate a candidate lip key point sequence set corresponding to the pronunciation unit; and a sequence determination subunit (not shown) configured to determine the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit.
In some optional implementations of this embodiment, the set generation subunit includes: a first extraction module (not shown) configured to perform lip key point extraction on the lip motion video of the continuous sentences to obtain a lip key point sequence of the continuous sentences; a second extraction module (not shown) configured to perform lip key point extraction on the original lip motion video of the pronunciation unit to obtain an original lip key point sequence of the pronunciation unit; and a set generation module (not shown) configured to determine, from the lip key point sequence of the continuous sentences, lip key point sequences similar to the original lip key point sequence of the pronunciation unit, and to generate the candidate lip key point sequence set corresponding to the pronunciation unit.
In some optional implementations of this embodiment, the set generation module is further configured to: determine the end positions of lip key point sequences similar to the original lip key point sequence of the pronunciation unit based on the original lip key points in the original lip key point sequence of the pronunciation unit and the lip key points in the lip key point sequence of the continuous sentences; and perform path backtracking based on the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit to determine the lip key point sequences similar to the original lip key point sequence of the pronunciation unit.
In some optional implementations of this embodiment, the sequence determination subunit is further configured to: calculate the similarity between each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to a pronunciation unit adjacent to the pronunciation unit; determine the end position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarities; and perform path backtracking based on the end position of the lip key point sequence corresponding to the pronunciation unit to determine the lip key point sequence corresponding to the pronunciation unit.
In some optional implementations of this embodiment, the sequence generation unit 503 includes: a time determination subunit (not shown) configured to determine the start and end times of each pronunciation unit based on the speech corresponding to the target text; and a sequence generation subunit (not shown) configured to match the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit to generate the lip key point sequence corresponding to the target text.
In some optional implementations of this embodiment, the sequence generation subunit includes: a linear interpolation module (not shown) configured to perform linear interpolation in the time dimension on the lip key point sequence corresponding to each pronunciation unit to match the lip key point sequence corresponding to each pronunciation unit to the start and end times corresponding to the pronunciation unit.
In some optional implementations of this embodiment, the sequence generation subunit further includes: a smoothing module (not shown) configured to smooth the lip key point sequences corresponding to adjacent pronunciation units.
In some optional implementations of this embodiment, the smoothing module is further configured to: select a lip key point sequence fragment of a rear preset duration corresponding to the former pronunciation unit of the adjacent pronunciation units and a lip key point sequence fragment of a front preset duration corresponding to the latter pronunciation unit; and smooth the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
In some optional implementations of this embodiment, the image synthesis network is trained as follows: acquiring training samples, where a training sample includes a sample lip key point and a sample lip motion image; and training the image synthesis network with the sample lip key point as input and the sample lip motion image as output.
In some optional implementations of this embodiment, the sample lip motion image is an image extracted from the lip motion video of continuous sentences pre-recorded by the target person, and the sample lip key point is a lip key point obtained by performing lip key point extraction on the extracted image.
Referring now to FIG. 6, a schematic structural diagram of a computer system 600 of an electronic device (for example, the server 103 shown in FIG. 1) suitable for implementing the embodiments of the present application is shown. The electronic device shown in FIG. 6 is merely an example, and should not impose any limitation on the functions or the scope of use of the embodiments of the present application.

As shown in FIG. 6, the computer system 600 includes a central processing unit (CPU) 601, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded into a random access memory (RAM) 603 from a storage portion 608. The RAM 603 also stores various programs and data required by the operations of the system 600. The CPU 601, the ROM 602 and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, etc.; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage portion 608 including a hard disk, etc.; and a communication portion 609 including a network interface card such as a LAN card, a modem, etc. The communication portion 609 performs communication processes via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flow chart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which comprises a computer program carried on a computer-readable medium, the computer program comprising program code for executing the method as illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 609, and/or installed from the removable medium 611. The computer program, when executed by the central processing unit (CPU) 601, implements the above-mentioned functions defined in the method of the present application.

It should be noted that the computer-readable medium described herein may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, the computer-readable storage medium may be any tangible medium containing or storing a program, and the program may be used by, or used in combination with, an instruction execution system, apparatus or device. In the present application, a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and may send, propagate or transmit a program for use by, or use in combination with, an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted with any suitable medium, including but not limited to: wireless, wired, optical cable, RF, etc., or any suitable combination of the above.

Computer program code for executing the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk and C++, and also include conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on a user computer, partially on a user computer, as a stand-alone software package, partially on a user computer and partially on a remote computer, or entirely on a remote computer or electronic device. In cases involving a remote computer, the remote computer may be connected to the user computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flow charts and block diagrams in the accompanying drawings illustrate the architectures, functions and operations of possible implementations of the systems, methods and computer program products according to the various embodiments of the present application. In this regard, each box in a flow chart or block diagram may represent a module, a program segment, or a part of code, and the module, the program segment, or the part of code comprises one or more executable instructions for implementing specified logic functions. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the accompanying drawings. For example, two boxes shown in succession may in practice be executed substantially in parallel, and they may sometimes be executed in a reverse order, depending on the functions involved. It should further be noted that each box in the block diagrams and/or flow charts, and a combination of boxes in the block diagrams and/or flow charts, may be implemented by a dedicated hardware-based system executing specified functions or operations, or may be implemented by a combination of dedicated hardware and computer instructions.

The units involved in the embodiments of the present application may be implemented by means of software or hardware. The described units may also be provided in a processor, for example, described as: a processor comprising a text acquiring unit, a sequence determining unit, a sequence generating unit, an image synthesizing unit and a video generating unit. The names of these units do not in some cases constitute a limitation on the units themselves. For example, the text acquiring unit may also be described as "a unit for acquiring a target text".
As another aspect, the present application further provides a computer-readable medium. The computer-readable medium may be included in the electronic device described in the above embodiments, or may exist alone without being assembled into the electronic device. The computer-readable medium carries one or more programs. The one or more programs, when executed by the electronic device, cause the electronic device to: acquire a target text; determine a lip key point sequence corresponding to each pronunciation unit of the target text; generate a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit; input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip motion image sequence corresponding to the target text; and splice the lip motion image sequence corresponding to the target text to generate a lip motion video corresponding to the target text.
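As a purely illustrative, non-limiting sketch of how the steps listed above can be chained together (all helper callables are hypothetical stand-ins for the corresponding components and are not part of the present application), the overall flow may be expressed as:

def generate_lip_motion_video(target_text,
                              split_units,          # text -> list of pronunciation units
                              unit_to_keypoints,    # pronunciation unit -> lip key point sequence
                              assemble_sequence,    # per-unit sequences -> full key point sequence
                              synthesize_image,     # pre-trained image synthesis network
                              splice_video):        # list of images -> video
    units = split_units(target_text)
    unit_sequences = [unit_to_keypoints(u) for u in units]
    text_sequence = assemble_sequence(unit_sequences)   # timing fit, interpolation, smoothing
    frames = [synthesize_image(k) for k in text_sequence]
    return splice_video(frames)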
The above description is merely a preferred embodiment of the present application and an explanation of the applied technical principles. It should be appreciated by those skilled in the art that the scope of the invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (15)

1. A method for generating a lip motion video, comprising:
acquiring a target text;
determining a lip key point sequence corresponding to each pronunciation unit of the target text;
generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit;
inputting the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip motion image sequence corresponding to the target text; and
splicing the lip motion image sequence corresponding to the target text to generate the lip motion video corresponding to the target text.
2. The method according to claim 1, wherein the method further comprises:
synthesizing a speech corresponding to the target text using a speech synthesis technique; and
integrating the speech corresponding to the target text into the lip motion video corresponding to the target text.
3. The method according to claim 1, wherein the determining a lip key point sequence corresponding to each pronunciation unit of the target text comprises:
acquiring a prerecorded lip motion video of a continuous statement spoken by a target person and an original lip motion video of each pronunciation unit;
for each pronunciation unit, determining, in the lip motion video of the continuous statement, lip key point sequences corresponding to lip motion video clips similar to the original lip motion video of the pronunciation unit, to generate a candidate lip key point sequence set corresponding to the pronunciation unit; and
determining the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit.
4. The method according to claim 3, wherein the determining, in the lip motion video of the continuous statement, lip key point sequences corresponding to lip motion video clips similar to the original lip motion video of the pronunciation unit, to generate a candidate lip key point sequence set corresponding to the pronunciation unit, comprises:
performing lip key point extraction on the lip motion video of the continuous statement to obtain a lip key point sequence of the continuous statement;
performing lip key point extraction on the original lip motion video of the pronunciation unit to obtain an original lip key point sequence of the pronunciation unit; and
determining, from the lip key point sequence of the continuous statement, lip key point sequences similar to the original lip key point sequence of the pronunciation unit, to generate the candidate lip key point sequence set corresponding to the pronunciation unit.
5. The method according to claim 4, wherein the determining, from the lip key point sequence of the continuous statement, lip key point sequences similar to the original lip key point sequence of the pronunciation unit comprises:
determining end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit, based on original lip key points in the original lip key point sequence of the pronunciation unit and lip key points in the lip key point sequence of the continuous statement; and
performing path backtracking based on the end positions of the lip key point sequences similar to the original lip key point sequence of the pronunciation unit, to determine the lip key point sequences similar to the original lip key point sequence of the pronunciation unit.
6. The method according to claim 3, wherein the determining the lip key point sequence corresponding to the pronunciation unit from the candidate lip key point sequence set corresponding to the pronunciation unit comprises:
calculating a similarity between each candidate lip key point sequence corresponding to the pronunciation unit and each candidate lip key point sequence corresponding to a pronunciation unit adjacent to the pronunciation unit;
determining an end position of the lip key point sequence corresponding to the pronunciation unit based on the calculated similarities; and
performing path backtracking based on the end position of the lip key point sequence corresponding to the pronunciation unit, to determine the lip key point sequence corresponding to the pronunciation unit.
7. The method according to claim 2, wherein the generating a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit comprises:
determining start and end times of each pronunciation unit based on the speech corresponding to the target text; and
fitting the lip key point sequence corresponding to each pronunciation unit into the start and end times corresponding to the pronunciation unit, to generate the lip key point sequence corresponding to the target text.
8. The method according to claim 7, wherein the fitting the lip key point sequence corresponding to each pronunciation unit into the start and end times corresponding to the pronunciation unit comprises:
performing linear interpolation in the time dimension on the lip key point sequence corresponding to each pronunciation unit, to fit the lip key point sequence corresponding to each pronunciation unit into the start and end times corresponding to the pronunciation unit.
9. The method according to claim 7, wherein after the fitting the lip key point sequence corresponding to each pronunciation unit into the start and end times corresponding to the pronunciation unit, the method further comprises:
smoothing the lip key point sequences corresponding to adjacent pronunciation units.
10. The method according to claim 9, wherein the smoothing the lip key point sequences corresponding to adjacent pronunciation units comprises:
selecting a lip key point sequence segment of a preset duration at the end of a previous pronunciation unit in the adjacent pronunciation units and a lip key point sequence segment of a preset duration at the beginning of a following pronunciation unit in the adjacent pronunciation units; and
smoothing the lip key point sequences corresponding to the adjacent pronunciation units based on the selected lip key points.
11. The method according to any one of claims 1-10, wherein the image synthesis network is trained as follows:
acquiring training samples, wherein the training samples comprise sample lip key points and sample lip motion images; and
training the image synthesis network by taking the sample lip key points as input and the sample lip motion images as output.
12. The method according to claim 11, wherein the sample lip motion images are images extracted from a prerecorded lip motion video of a continuous statement spoken by a target person, and the sample lip key points are lip key points obtained by performing lip key point extraction on the extracted images.
13. An apparatus for generating a lip motion video, comprising:
a text acquiring unit configured to acquire a target text;
a sequence determining unit configured to determine a lip key point sequence corresponding to each pronunciation unit of the target text;
a sequence generating unit configured to generate a lip key point sequence corresponding to the target text based on the lip key point sequence corresponding to each pronunciation unit;
an image synthesizing unit configured to input the lip key point sequence corresponding to the target text into a pre-trained image synthesis network to obtain a lip motion image sequence corresponding to the target text; and
a video generating unit configured to splice the lip motion image sequence corresponding to the target text to generate the lip motion video corresponding to the target text.
14. An electronic device, comprising:
one or more processors; and
a storage device storing one or more programs thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-12.
15. A computer-readable medium storing a computer program thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-12.
CN201910640823.3A 2019-07-16 2019-07-16 Method and device for generating lip motion video Active CN110347867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910640823.3A CN110347867B (en) 2019-07-16 2019-07-16 Method and device for generating lip motion video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910640823.3A CN110347867B (en) 2019-07-16 2019-07-16 Method and device for generating lip motion video

Publications (2)

Publication Number Publication Date
CN110347867A true CN110347867A (en) 2019-10-18
CN110347867B CN110347867B (en) 2022-04-19

Family

ID=68175446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910640823.3A Active CN110347867B (en) 2019-07-16 2019-07-16 Method and device for generating lip motion video

Country Status (1)

Country Link
CN (1) CN110347867B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
CN103971393A (en) * 2013-01-29 2014-08-06 株式会社东芝 Computer generated head
US20150279364A1 (en) * 2014-03-29 2015-10-01 Ajay Krishnan Mouth-Phoneme Model for Computerized Lip Reading
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
CN109409195A (en) * 2018-08-30 2019-03-01 华侨大学 A kind of lip reading recognition methods neural network based and system
CN109637518A (en) * 2018-11-07 2019-04-16 北京搜狗科技发展有限公司 Virtual newscaster's implementation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHRISTOPH BREGLER: "Video Rewrite: Driving Visual Speech with Audio", ACM SIGGRAPH *
LAI Wei: "A Synthesized 'Talking Head' Based on a 3D Model and Photographs", Journal of Image and Graphics *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111147894A (en) * 2019-12-09 2020-05-12 苏宁智能终端有限公司 Sign language video generation method, device and system
CN111261187B (en) * 2020-02-04 2023-02-14 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN111261187A (en) * 2020-02-04 2020-06-09 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN112131988A (en) * 2020-09-14 2020-12-25 北京百度网讯科技有限公司 Method, device, equipment and computer storage medium for determining virtual character lip shape
CN112131988B (en) * 2020-09-14 2024-03-26 北京百度网讯科技有限公司 Method, apparatus, device and computer storage medium for determining virtual character lip shape
CN112381926A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112752118A (en) * 2020-12-29 2021-05-04 北京字节跳动网络技术有限公司 Video generation method, device, equipment and storage medium
CN113077819A (en) * 2021-03-19 2021-07-06 北京有竹居网络技术有限公司 Pronunciation evaluation method and device, storage medium and electronic equipment
CN113111812A (en) * 2021-04-20 2021-07-13 深圳追一科技有限公司 Mouth action driving model training method and assembly
CN113223123A (en) * 2021-05-21 2021-08-06 北京大米科技有限公司 Image processing method and image processing apparatus
CN113642394A (en) * 2021-07-07 2021-11-12 北京搜狗科技发展有限公司 Action processing method, device and medium for virtual object
WO2023279960A1 (en) * 2021-07-07 2023-01-12 北京搜狗科技发展有限公司 Action processing method and apparatus for virtual object, and storage medium
CN113642394B (en) * 2021-07-07 2024-06-11 北京搜狗科技发展有限公司 Method, device and medium for processing actions of virtual object
CN113744368A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Animation synthesis method and device, electronic equipment and storage medium
CN114173188A (en) * 2021-10-18 2022-03-11 深圳追一科技有限公司 Video generation method, electronic device, storage medium, and digital human server
CN113873297A (en) * 2021-10-18 2021-12-31 深圳追一科技有限公司 Method and related device for generating digital character video
CN113873297B (en) * 2021-10-18 2024-04-30 深圳追一科技有限公司 Digital character video generation method and related device
CN116579298A (en) * 2022-01-30 2023-08-11 腾讯科技(深圳)有限公司 Video generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN110347867B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN110347867A (en) Method and apparatus for generating lip motion video
CN109377539B (en) Method and apparatus for generating animation
US11158102B2 (en) Method and apparatus for processing information
CN107945786B (en) Speech synthesis method and device
CN108763190B (en) Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
CN107464554B (en) Method and device for generating speech synthesis model
CN108877782A (en) Audio recognition method and device
CN108989882A (en) Method and apparatus for exporting the snatch of music in video
CN107481715B (en) Method and apparatus for generating information
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
US20200410731A1 (en) Method and apparatus for controlling mouth shape changes of three-dimensional virtual portrait
CN109545192A (en) Method and apparatus for generating model
CN107707745A (en) Method and apparatus for extracting information
CN110446066B (en) Method and apparatus for generating video
US20240070397A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN110880198A (en) Animation generation method and device
US11968433B2 (en) Systems and methods for generating synthetic videos based on audio contents
CN109545193A (en) Method and apparatus for generating model
CN109920431A (en) Method and apparatus for output information
CN113299312A (en) Image generation method, device, equipment and storage medium
CN109697978A (en) Method and apparatus for generating model
CN110534085A (en) Method and apparatus for generating information
CN109582825A (en) Method and apparatus for generating information
CN112735371A (en) Method and device for generating speaker video based on text information
CN109784128A (en) Mixed reality intelligent glasses with text and language process function

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant