CN110459200A - Speech synthesis method, apparatus, computer device and storage medium - Google Patents
Speech synthesis method, apparatus, computer device and storage medium
- Publication number
- CN110459200A CN110459200A CN201910602385.1A CN201910602385A CN110459200A CN 110459200 A CN110459200 A CN 110459200A CN 201910602385 A CN201910602385 A CN 201910602385A CN 110459200 A CN110459200 A CN 110459200A
- Authority
- CN
- China
- Prior art keywords
- face
- label
- voice
- acoustic model
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- General Physics & Mathematics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a speech synthesis method, apparatus, computer device and storage medium. The method comprises: obtaining a face picture from a video to be dubbed; extracting facial features of the face picture; determining, according to the facial features, a face label corresponding to the face picture in the video to be dubbed; selecting, from an acoustic model repository, the acoustic model corresponding to the face label, the acoustic model comprising a plurality of voice labels; determining a speech characteristic parameter corresponding to each of the plurality of voice labels; and using the speech characteristic parameter corresponding to each voice label to synthesize a voice for the character corresponding to the face picture in the video to be dubbed, thereby improving the accuracy of dubbing.
Description
Technical field
The present invention relates to the field of computers, and more particularly to a speech synthesis method, apparatus, computer device and storage medium.
Background art
At present, with the continuous development of new media, self-media outlets suited to the Internet have gradually emerged. These self-media outlets usually produce simple dubbed videos to entertain the public. In such videos, because of production cost, the dubbing of the characters is generally obtained by speech synthesis technology. Since current speech synthesis technology offers only one or two simple timbres, the characters easily lack distinctiveness, and a character's face and voice do not match, or match poorly, so that the accuracy of the dubbing is low.
Summary of the invention
Embodiments of the present invention provide a speech synthesis method, apparatus, computer device and storage medium, so as to improve the accuracy of dubbing.
A speech synthesis method, comprising:
obtaining a face picture from a video to be dubbed;
extracting facial features of the face picture;
determining, according to the facial features, a face label corresponding to the face picture in the video to be dubbed;
selecting, from an acoustic model repository, an acoustic model corresponding to the face label, the acoustic model comprising a plurality of voice labels;
determining a speech characteristic parameter corresponding to each of the plurality of voice labels; and
using the speech characteristic parameter corresponding to each voice label to synthesize a voice for the character corresponding to the face picture in the video to be dubbed.
A speech synthesis apparatus, comprising:
a first obtaining module, configured to obtain a face picture from a video to be dubbed;
a first extraction module, configured to extract facial features of the face picture;
a first determining module, configured to determine, according to the facial features, a face label corresponding to the face picture in the video to be dubbed;
a selection module, configured to select, from an acoustic model repository, an acoustic model corresponding to the face label, the acoustic model comprising a plurality of voice labels;
a second determining module, configured to determine a speech characteristic parameter corresponding to each of the plurality of voice labels; and
a synthesis module, configured to use the speech characteristic parameter corresponding to each voice label to synthesize a voice for the character corresponding to the face picture in the video to be dubbed.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above speech synthesis method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above speech synthesis method.
In the above speech synthesis method, apparatus, computer device and storage medium, a face picture is obtained from the video to be dubbed; the facial features of the face picture are analyzed to obtain the corresponding face label; an acoustic model is then selected from the acoustic model repository according to the face label; and a voice is synthesized using the speech characteristic parameters of that acoustic model. Because different speech characteristic parameters are obtained for different faces, the facial features of a character in the video can be recognized and an acoustic model that fits those facial features can be matched to the character. The association between the dubbing and the character is thereby strengthened, the match between face and voice is improved, mismatched dubbing is avoided, and the accuracy of the dubbing is improved.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative labor.
Fig. 1 is a schematic diagram of an application environment of the speech synthesis method in an embodiment of the present invention;
Fig. 2 is a diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 3 is another diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 4 is another diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 5 is another diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 6 is another diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 7 is a diagram of the speech synthesis apparatus in an embodiment of the present invention;
Fig. 8 is another diagram of the speech synthesis apparatus in an embodiment of the present invention;
Fig. 9 is another diagram of the speech synthesis apparatus in an embodiment of the present invention;
Fig. 10 is a diagram of the computer device in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
The speech synthesis method provided by an embodiment of the present invention can be applied in the application environment shown in Fig. 1, in which a terminal device communicates with a server over a network. After obtaining the video to be dubbed, the terminal device transmits the face pictures in the video to the server. Upon receiving a face picture, the server extracts its facial features, determines the face label corresponding to the picture, selects an acoustic model from the acoustic model repository according to that label, and finally synthesizes the voice of the character corresponding to the face picture in the video to be dubbed. The terminal device may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer or portable wearable device. The server may be implemented as an independent server or as a cluster of multiple servers.
In one embodiment, as shown in Fig. 2, a speech synthesis method is provided. Taking its application to the server in Fig. 1 as an example, the method includes the following steps:
S10: obtain a face picture from the video to be dubbed.
In this embodiment, the face picture is the picture of a face that appears in the video to be dubbed. To ensure the accuracy of the subsequent facial feature extraction, the facial organs and the outer contour of the face should be clearly visible in the face picture.
S20: extract the facial features of the face picture.
Here, facial features are key features reflecting face information, such as the geometric features of the face image (e.g. feature points of the facial organs and of the facial contour) and its gray-level features (e.g. skin color), and are used to recognize the face image.
Preferably, in this embodiment the geometric features include the feature points obtained by locating key points of the facial organs and key points of the facial contour. Specifically, the facial features can be obtained with the ASM (Active Shape Model) landmark localization algorithm, which builds a universal model of global face appearance and is robust to local image damage, but is computationally expensive and requires many iterative steps. Alternatively, the AAM (Active Appearance Model) landmark localization algorithm can be used, which treats landmark localization directly as a regression task and computes the feature-point coordinates with a global regressor. Facial landmark localization remains a challenging task, because facial expression, pose and illumination vary widely, and feature points at different facial locations differ in localization difficulty, so it is hard to guarantee accuracy with a single model. Therefore, to overcome these problems, a coarse-to-fine CNN landmark localization algorithm can be used to obtain the facial features.
S30: determine, according to the facial features, the face label corresponding to the face picture in the video to be dubbed.
Here, a face label is a label that classifies the face picture according to its facial features. For example, face labels may include a girl baby face label, a young woman face label, a grandmother face label, a boy baby face label, a young man face label, an uncle face label, a grandfather face label, and so on. It should be noted that these labels are listed only to facilitate understanding of this embodiment and do not limit the face labels of this embodiment.
S40: select the acoustic model corresponding to the face label from the acoustic model repository, the acoustic model comprising a plurality of voice labels.
Here, the voice labels include labels for pitch, timbre, speaking rate, intensity and so on. The acoustic model repository contains multiple acoustic models; each acoustic model contains multiple voice labels, and each voice label corresponds to a speech characteristic parameter that characterizes the speech feature of that voice label. Each acoustic model in the repository is set according to a face label, and different face labels correspond to different acoustic models. For example, compared with other face labels, in the acoustic model corresponding to the young woman face label the speech characteristic parameter for the pitch label is higher, indicating that the pitch corresponding to the young woman face label is higher; in the acoustic model corresponding to the grandmother face label the parameter for the speaking-rate label is lower, indicating that the corresponding speaking rate is slower.
Specifically, the acoustic model corresponding to the face label is selected from the preset acoustic model repository according to the face label.
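As a minimal sketch of this selection step, the repository can be a plain mapping from face labels to acoustic models; the label names, voice labels and parameter intervals below are illustrative assumptions, since the patent specifies no concrete values:

```python
# Hypothetical acoustic model repository: each face label maps to an acoustic
# model, here a dict of voice labels -> (low, high) parameter intervals.
# All labels and numbers are invented for illustration.
ACOUSTIC_MODEL_REPOSITORY = {
    "young_woman_face": {"pitch_hz": (200.0, 300.0), "rate_wps": (4.0, 5.5)},
    "grandmother_face": {"pitch_hz": (120.0, 200.0), "rate_wps": (2.5, 3.5)},
}

def select_acoustic_model(face_label):
    """Step S40: select the acoustic model corresponding to a face label."""
    return ACOUSTIC_MODEL_REPOSITORY[face_label]
```

For example, `select_acoustic_model("grandmother_face")` returns the model whose speaking-rate interval is lower than that of the young woman model, mirroring the pitch and rate comparison described above.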
S50: determine the speech characteristic parameter corresponding to each of the plurality of voice labels.
Here, the speech characteristic parameters are parameters such as the pitch, timbre, speaking rate and intensity of the voice.
S60: use the speech characteristic parameter corresponding to each voice label to synthesize a voice for the character corresponding to the face picture in the video to be dubbed.
Specifically, a speech characteristic parameter in an acoustic model lies within a numerical interval, whereas synthesizing a voice requires a definite value. Therefore, a value can be picked at random from the numerical interval of the acoustic model's speech characteristic parameter and used as the value required for synthesizing the voice.
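The random choice of a definite value from a parameter interval could be sketched as follows; the interval bounds used in the example are assumptions:

```python
import random

def pick_parameter(interval):
    """Pick a definite value from a speech-characteristic-parameter
    interval, as described for step S60. Any value within the interval
    is an acceptable synthesis parameter."""
    low, high = interval
    return random.uniform(low, high)
```

For instance, `pick_parameter((80.0, 300.0))` yields one concrete pitch value between 80 Hz and 300 Hz for the synthesizer.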
The server performs speech synthesis according to these speech characteristic parameters, then outputs a voice file and returns it to the terminal device. The format of the voice file depends on the requirements; for example, it can be .mp3 or .wvm, without limitation here.
In this embodiment, synthesizing the voice can mean synthesizing a segment of speech corresponding to a speech text using the speech characteristic parameters, or applying voice conversion to an existing speech segment according to the speech characteristic parameters. Specifically, the speech text corresponding to the video to be dubbed can be obtained, and the speech characteristic parameters are then used to synthesize the voice corresponding to that speech text.
In this embodiment, facial features can, to some extent, embody a person's character, and a voice that does not correspond to the character gives the audience an out-of-character impression. Therefore, the face picture in the video to be dubbed is obtained, its facial features are analyzed to obtain the corresponding face label, an acoustic model is selected from the acoustic model repository according to that label, and a voice is synthesized with the speech characteristic parameters of the acoustic model. Different speech characteristic parameters are obtained for different faces, and the facial features of a character in the video can be recognized, so that each character is matched with an acoustic model that fits its facial features. The association between dubbing and character is thereby strengthened, the match between face and voice is improved, mismatches are avoided, and the accuracy of the dubbing is improved.
In one embodiment, as shown in Fig. 3, obtaining the acoustic model repository specifically comprises the following steps:
S70: obtain multiple face samples and the multiple speech samples corresponding to them.
In this embodiment, a face sample is an image containing all the facial organs of a face, and a speech sample is a segment of speech. Each face sample is associated with a speech sample, i.e. the speech sample is uttered (spoken) by the person the face sample depicts.
For example, to obtain an acoustic model repository that fits facial features well, face samples and their corresponding speech samples can be extracted from a large number of videos containing both speech and faces. In certain scenes, the sound in the video (the speech sample) is spoken by the person corresponding to the face (the face sample), so these face samples and speech samples have a definite association.
S80: extract the facial features of the face samples.
Here, to guarantee the accuracy of the extracted facial features, a pre-trained coarse-to-fine CNN can be used to extract the facial features of the face samples.
S90: determine the face label corresponding to the facial features.
Specifically, it can be judged whether the feature value of the facial features falls within the preset feature-value interval of any label; if it falls within the interval of some preset label, that label is determined to be the face label corresponding to the facial features.
S100: extract the speech features of the speech samples, the speech features comprising multiple speech characteristic parameters.
In this embodiment, the speech features are features of the speech samples such as pitch, timbre quality, speaking rate and intensity. Specifically, a Python audio library (such as Audiolab) can be used to extract the speech features of a speech sample; in operation, the speech sample only needs to be passed to the library as a parameter. Alternatively, Matlab can be used to draw the spectrogram of the speech sample and then analyze it to obtain the speech features. However, considering the amount of speech-sample data and the simplicity of operation, this scheme preferably uses a Python audio library for speech feature extraction.
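The text names only "a Python audio library" and does not fix an extraction algorithm. As one hedged illustration of extracting a single speech feature, a rough pitch estimate can be computed by autocorrelation over raw samples; a real system would use a dedicated library instead:

```python
def estimate_pitch_hz(samples, sample_rate):
    """Rough pitch estimate via autocorrelation, searching only lags that
    correspond to 80-500 Hz. A stand-in for step S100's feature
    extraction; the method and range are illustrative assumptions."""
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]  # remove DC offset
    lo, hi = sample_rate // 500, sample_rate // 80

    def corr(lag):
        # Autocorrelation of the signal with itself shifted by `lag`.
        return sum(a * b for a, b in zip(x, x[lag:]))

    best_lag = max(range(lo, hi), key=corr)
    return sample_rate / best_lag
```

Applied to a pure 200 Hz tone, the estimate recovers a pitch near 200 Hz, which would then be mapped to a voice label in step S110.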
S110: determine the multiple voice labels corresponding to the multiple speech characteristic parameters.
In this embodiment, since the speech features of each speech sample include multiple speech characteristic parameters, a voice label is determined for each speech characteristic parameter, i.e. each speech characteristic parameter corresponds to one voice label. A specific method is to first determine the interval in which each speech characteristic parameter lies, and then determine the voice label from that interval. Taking the pitch parameter as an example: based on step S100, the pitch of a speech sample is extracted as 100 Hz; given the preset pitch intervals (high pitch [300 Hz, 500 Hz], medium pitch [80 Hz, 300 Hz), low pitch [0 Hz, 80 Hz)), the pitch falls in the medium-pitch interval, so the pitch label of the speech sample is determined to be medium pitch.
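The interval-to-label mapping in the pitch example above can be sketched directly from the intervals the text gives:

```python
# Preset pitch intervals from the example above:
# low [0, 80) Hz, medium [80, 300) Hz, high [300, 500] Hz.
PITCH_BANDS = [("low pitch", 0.0, 80.0),
               ("medium pitch", 80.0, 300.0),
               ("high pitch", 300.0, 500.0)]

def pitch_label(pitch_hz):
    """Step S110 for the pitch parameter: map a pitch value to its label."""
    for label, low, high in PITCH_BANDS:
        if low <= pitch_hz < high:
            return label
    if pitch_hz == 500.0:  # the high band is closed at its upper end
        return "high pitch"
    raise ValueError("pitch outside all preset intervals")
```

A 100 Hz sample gets "medium pitch", matching the worked example in the text.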
S120: generate the acoustic model repository according to the multiple face labels and multiple voice labels.
In one embodiment, as shown in Fig. 4, step S120, generating the acoustic model repository according to the multiple face labels and multiple voice labels, specifically comprises the following steps:
S121: count the face labels and voice labels corresponding to the multiple face samples to obtain the association between face labels and voice labels, the association being used to match each class of face label with the voice label having the highest probability of co-occurrence.
Specifically, the number of occurrences of each voice label corresponding to each class of face label can be counted, and the voice label with the highest probability (i.e. the most occurrences) for each class is taken as the voice label associated with that face label.
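The counting just described amounts to keeping, for each face label, its most frequent co-occurring voice label. A minimal sketch with made-up sample pairs:

```python
from collections import Counter

def associate_labels(sample_pairs):
    """Steps S121-S122: for each face label, pick the voice label that
    co-occurs with it most often across (face_label, voice_label) pairs
    drawn from the training samples."""
    counts = {}
    for face_label, voice_label in sample_pairs:
        counts.setdefault(face_label, Counter())[voice_label] += 1
    return {face: c.most_common(1)[0][0] for face, c in counts.items()}
```

With pairs in which "grandmother_face" co-occurs twice with "low pitch" and once with "medium pitch", the function associates "grandmother_face" with "low pitch".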
S122: determine the voice label corresponding to each face label according to the association.
S123: generate the acoustic model corresponding to each face label according to the voice labels corresponding to it.
Here, an acoustic model is a model that includes multiple speech characteristic parameters. For example, the girl baby face label corresponds to a girl-baby-face acoustic model.
Generating the acoustic model corresponding to each face label provides support for later selecting the acoustic model corresponding to a face label.
S124: generate the acoustic model repository from the acoustic models corresponding to all the face labels.
In this embodiment, the face labels and voice labels corresponding to the multiple face samples are counted to obtain the association between face labels and voice labels; the voice label corresponding to each face label is determined according to that association; the acoustic model corresponding to each face label is generated from its voice labels; and the acoustic model repository is generated from the acoustic models corresponding to all the face labels. This increases the association between face and voice, so that face shape and voice fit together better. When speech is later synthesized for a face image in a video to be dubbed, the synthesized voice better matches the character's image and is less likely to give the audience an out-of-character impression.
In one embodiment, as shown in Fig. 5, step S80, extracting the facial features of a face sample, comprises the following steps:
S81: extract multiple outer-contour feature points and multiple organ feature points from each face sample.
Here, outer-contour feature points are the feature points of the outer contour of the face; the facial organs comprise the eyes, ears, mouth, nose and eyebrows, and organ feature points are the feature points on the left and right eyes, the eyebrows, the nose and the mouth.
As a preferred option of this embodiment, multiple models can be used to locate feature points at different facial locations: the face is divided into organ feature points and outer-contour feature points, which are located separately. Organ feature points are the feature points on the facial organs, and outer-contour feature points are those on the outer contour of the face. In this embodiment, a coarse-to-fine CNN landmark localization algorithm can be used to obtain the organ feature points and outer-contour feature points of each face image, and thus the facial features.
Specifically, the DCNN model is divided into two parallel cascaded CNN groups. One group is a four-stage cascaded CNN for obtaining the organ feature points of the face (for example, 51 organ feature points). Stage 1 locates the minimum bounding box of the organ feature points, i.e. the smallest image region enclosing all five organ parts (left and right eyes, eyebrows, nose, mouth) on the face image. Stage 2 feeds the minimum bounding box into a CNN to roughly estimate the positions of the feature points. Stage 3 crops the image of each organ out of the minimum bounding box and feeds it into this stage's CNN to estimate the positions of the organ feature points more accurately. Stage 4 applies rotation correction to each organ image and precisely locates each organ feature point after correction, yielding the multiple organ feature points. The other group is a two-stage cascaded CNN: stage 1 locates the minimum bounding box of the outer-contour feature points of the face (for example, 17 contour feature points), i.e. the smallest image region enclosing the outer contour of the face image, and stage 2 estimates the accurate positions of the outer-contour feature points from that bounding box, yielding the multiple contour feature points. The reason stage 1 of both parallel groups locates a minimum bounding box is that, when prior knowledge is insufficient, a traditional DCNN wastes most of the convolutional network's capacity on finding the face, which reduces the efficiency of landmark localization and in turn the efficiency of facial feature acquisition.
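The 17 + 51 split of contour versus organ feature points described above can be illustrated with a small helper; the point ordering is an assumption, since the patent does not fix an indexing convention (common 68-point schemes put the jawline contour first):

```python
def split_landmarks(points):
    """Split 68 detected landmarks into 17 outer-contour points and
    51 organ points, matching the 17 + 51 counts given in the text.
    Assumes contour points come first in the list (an assumption)."""
    if len(points) != 68:
        raise ValueError("expected 68 landmark points")
    return points[:17], points[17:]
```

Downstream steps (S82 onward) would then connect each group of points into contours.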
S82: connect all the outer-contour feature points of each face sample to obtain the outer contour of the face image.
S83: connect the organ feature points of the same organ type in each face sample to obtain the contour of each organ of the face image.
Here, connecting the organ feature points of the same organ type means connecting the feature points of the left eye, the right eye, the mouth, the nose and the eyebrows separately. The organ contours are the left-eye contour, right-eye contour, mouth contour, nose contour and eyebrow contour.
S84: take the outer contour of each face sample and the contour of each organ as the facial features corresponding to the face sample.
In one embodiment, as shown in FIG. 6, step S90 (determining the face label corresponding to the face characteristic) includes the following steps:
S91: Calculate the arc of the outer contour of the face sample.
S92: Calculate, according to each facial-feature contour, the spacing between the facial features and the length and width of each facial feature.
Here, the spacing between the facial features refers to the pairwise spacing among the left and right eyes, the mouth, the nose and the eyebrows, for example the spacing between the eyes and the eyebrows, between the left and right eyes, between the two eyebrows, and between the nose and the mouth. The width of a facial feature refers to its maximum width, such as the maximum width of the left eye, the right eye, the mouth and the nose. The length of a facial feature refers to its maximum length, such as the maximum length of the left eye, the right eye, the mouth, the nose and the eyebrows.
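These spacing, width and length measurements can be derived directly from the feature contours. A minimal sketch follows; centroid-to-centroid distance is an assumed definition of "spacing", and the horizontal/vertical extents are assumed definitions of width and length, since the patent does not fix the exact formulas:

```python
from itertools import combinations
from math import dist  # Python 3.8+

def feature_metrics(contours):
    """Pairwise centroid spacing plus the maximum width (horizontal extent)
    and length (vertical extent) of each feature contour."""
    centroids, widths, lengths = {}, {}, {}
    for name, pts in contours.items():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        centroids[name] = (sum(xs) / len(xs), sum(ys) / len(ys))
        widths[name] = max(xs) - min(xs)
        lengths[name] = max(ys) - min(ys)
    spacing = {(a, b): dist(centroids[a], centroids[b])
               for a, b in combinations(sorted(centroids), 2)}
    return spacing, widths, lengths
```

The resulting dictionaries supply the quantities consumed by step S93 below.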
S93: Determine the face label corresponding to the face characteristic according to the arc of the outer contour, the spacing between the facial features, and the length and width of each facial feature.
The principle for determining the face label corresponding to the face characteristic is to classify according to the arc of the face's outer contour, the width and length of the facial features, and the spacing between them. For example, when the arc of the outer contour, the width and length of each facial feature, and the spacing between the facial features all reach the preset thresholds for a female baby face (the outer-contour arc threshold, the feature-width threshold, the feature-length threshold and the feature-spacing threshold), the face picture corresponding to that face characteristic is classified as a female baby face and assigned the label "female baby face".
It should be noted that, since the face label is determined by the face characteristic, and the face characteristic is usually embodied in the arc of the face's outer contour, the spacing between the facial features, and the length and width of each facial feature, this embodiment calculates these quantities for the face sample and uses them to confirm the corresponding face shape, the size of the facial features and the spacing between them, so that the face label corresponding to the face sample can be determined.
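The threshold-based classification in S93 can be sketched as below. All names and threshold values are invented for illustration, and the comparison directions (at least / at most) are assumptions, since the patent only says the measurements must "reach" the preset thresholds:

```python
def face_label(arc, widths, lengths, spacings, thresholds):
    """Assign the first label whose preset thresholds are all satisfied
    by the measured face quantities (illustrative rule)."""
    for label, t in thresholds.items():
        if (arc >= t["arc"]
                and all(w <= t["max_width"] for w in widths)
                and all(l <= t["max_length"] for l in lengths)
                and all(s <= t["max_spacing"] for s in spacings)):
            return label
    return "default"

# Hypothetical thresholds for one label class
thresholds = {"female_baby_face":
              {"arc": 1.2, "max_width": 30, "max_length": 40, "max_spacing": 25}}
```

A face sample whose measurements fall inside these bounds would then be assigned the "female_baby_face" label; anything else falls through to a default.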
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation process of the embodiments of the present invention.
In one embodiment, a speech synthesis device is provided, which corresponds one-to-one to the speech synthesis method in the above embodiments. As shown in FIG. 7, the speech synthesis device includes a first acquisition module 10, a first extraction module 20, a first determining module 30, a selection module 40, a second determining module 50 and a synthesis module 60. Each functional module is described in detail as follows:
First acquisition module 10, configured to acquire a face picture in the audio-video to be dubbed;
First extraction module 20, configured to extract the face characteristic of the face picture;
First determining module 30, configured to determine, according to the face characteristic, the face label corresponding to the face picture in the audio-video to be dubbed;
Selection module 40, configured to select the acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels;
Second determining module 50, configured to determine the speech characteristic parameter corresponding to each of the multiple voice labels;
Synthesis module 60, configured to synthesize speech for the role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
Preferably, in one embodiment, as shown in FIG. 8, the acoustic model repository is obtained by the following modules:
Second acquisition module 70, configured to acquire multiple face samples and the multiple speech samples corresponding to the multiple face samples;
Second extraction module 80, configured to extract the face characteristic of each face sample;
Third determining module 90, configured to determine the face label corresponding to the face characteristic;
Third extraction module 100, configured to extract the speech feature of each speech sample, the speech feature including multiple speech characteristic parameters;
Fourth determining module 110, configured to determine the multiple voice labels corresponding to the multiple speech characteristic parameters;
Generation module 120, configured to generate the acoustic model repository according to the multiple face labels and the multiple voice labels.
Preferably, in one embodiment, as shown in FIG. 9, the generation module 120 includes:
Statistics unit 121, configured to count the numbers of the face labels and voice labels corresponding to the multiple face samples, so as to obtain the correlation between the face labels and the voice labels, the correlation being used to match each class of face label with the voice label having the highest probability of co-occurrence;
First determining unit 122, configured to determine the voice label corresponding to each face label according to the correlation;
First generation unit 123, configured to generate the acoustic model corresponding to each face label according to the voice label corresponding to that face label;
Second generation unit 124, configured to generate the acoustic model repository according to the acoustic models corresponding to all the face labels.
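The counting performed by the statistics unit 121 amounts to picking, for each face label, the voice label that co-occurs with it most often in the training pairs. A minimal sketch (function and label names are invented for illustration):

```python
from collections import Counter, defaultdict

def match_voice_labels(pairs):
    """For each face label, pick the voice label that co-occurs with it
    most often among the (face_label, voice_label) sample pairs."""
    counts = defaultdict(Counter)
    for face, voice in pairs:
        counts[face][voice] += 1
    return {face: c.most_common(1)[0][0] for face, c in counts.items()}

pairs = [("female_baby_face", "sweet"), ("female_baby_face", "sweet"),
         ("female_baby_face", "deep"), ("square_face", "deep")]
mapping = match_voice_labels(pairs)
# {'female_baby_face': 'sweet', 'square_face': 'deep'}
```

The resulting mapping is what the first generation unit 123 would use to build one acoustic model per face label.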
Optionally, in one embodiment, the first extraction module 20 includes:
An extraction unit, configured to extract multiple contour feature points and multiple facial-feature points from each face sample;
A first connection unit, configured to connect all the contour feature points of each face sample to obtain the outer contour of the face image;
A second connection unit, configured to connect the facial-feature points of the same feature type of each face sample to obtain the contour of each facial feature of the face image;
A second determining unit, configured to take the outer contour of each face sample and the contour of each facial feature as the face characteristic corresponding to the face sample.
Optionally, in one embodiment, the third determining module 90 includes:
A first calculation unit, configured to calculate the arc of the outer contour of the face sample;
A second calculation unit, configured to calculate, according to each facial-feature contour, the spacing between the facial features and the length and width of each facial feature;
A third determining unit, configured to determine the face label corresponding to the face characteristic according to the arc of the outer contour, the spacing between the facial features, and the length and width of each facial feature.
For specific limitations on the speech synthesis device, reference may be made to the limitations on the speech synthesis method above, which are not repeated here. Each module in the above speech synthesis device may be implemented in whole or in part by software, hardware or a combination thereof. The modules may be embedded in, or independent of, the processor in a computer device in hardware form, or stored in software form in the memory of the computer device, so that the processor can invoke them to execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface and a database connected via a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the data required by the speech synthesis method. The network interface of the computer device communicates with external terminals via a network connection. When executed by the processor, the computer program implements a speech synthesis method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when executing the computer program, the processor implements the following steps:
acquiring a face picture in the audio-video to be dubbed;
extracting the face characteristic of the face picture;
determining, according to the face characteristic, the face label corresponding to the face picture in the audio-video to be dubbed;
selecting the acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels;
determining the speech characteristic parameter corresponding to each of the multiple voice labels;
synthesizing speech for the role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the following steps:
acquiring a face picture in the audio-video to be dubbed;
extracting the face characteristic of the face picture;
determining, according to the face characteristic, the face label corresponding to the face picture in the audio-video to be dubbed;
selecting the acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels;
determining the speech characteristic parameter corresponding to each of the multiple voice labels;
synthesizing speech for the role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
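The six steps above could be orchestrated as in the following sketch. This is an illustration, not the patent's implementation: the repository layout, function names and parameter values are all assumptions, and the feature-extraction, labeling and synthesis stages are passed in as callables.

```python
def synthesize_for_face(face_picture, extract_features, determine_label,
                        model_repository, synthesize):
    """Features -> face label -> acoustic model -> per-voice-label
    speech characteristic parameters -> synthesized speech."""
    features = extract_features(face_picture)
    label = determine_label(features)
    model = model_repository[label]          # acoustic model holds voice labels
    params = [model["params"][v] for v in model["voice_labels"]]
    return synthesize(params)

# Hypothetical repository with one acoustic model
repo = {"female_baby_face": {"voice_labels": ["pitch", "timbre"],
                             "params": {"pitch": 220.0, "timbre": "bright"}}}
out = synthesize_for_face("frame.png", lambda p: "features",
                          lambda f: "female_baby_face", repo, lambda ps: ps)
```

Here the stub synthesizer simply returns the parameter list; a real one would feed the parameters to a vocoder.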
In the above speech synthesis method, device, computer equipment and storage medium, a face picture in the audio-video to be dubbed is acquired; the face characteristic of the face picture is analyzed to obtain the corresponding face label; an acoustic model is then selected from the acoustic model repository according to the face label; and speech is synthesized using the speech characteristic parameters of the acoustic model. Different speech characteristic parameters can thus be obtained according to the face, and speech synthesized according to those parameters, so that the roles in an entertainment video can be distinguished. This supports the multi-role case, gives each character's dubbing a relevance to that character, and makes the dubbing of multiple roles more varied, thereby improving the dubbing effect.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM), etc.
It is apparent to those skilled in the art that, for convenience and brevity of description, only the division into the above functional units and modules is illustrated. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are merely illustrative of the technical solutions of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.
Claims (10)
1. A speech synthesis method, characterized by comprising:
acquiring a face picture in an audio-video to be dubbed;
extracting a face characteristic of the face picture;
determining, according to the face characteristic, a face label corresponding to the face picture in the audio-video to be dubbed;
selecting an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model comprising multiple voice labels;
determining a speech characteristic parameter corresponding to each of the multiple voice labels;
synthesizing speech for a role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
2. The speech synthesis method according to claim 1, wherein the acoustic model repository is obtained as follows:
acquiring multiple face samples and multiple speech samples corresponding to the multiple face samples;
extracting a face characteristic of each face sample;
determining a face label corresponding to the face characteristic;
extracting a speech feature of each speech sample, the speech feature comprising multiple speech characteristic parameters;
determining multiple voice labels corresponding to the multiple speech characteristic parameters;
generating the acoustic model repository according to the multiple face labels and the multiple voice labels.
3. The speech synthesis method according to claim 2, wherein generating the acoustic model repository according to the multiple face labels and the multiple voice labels comprises:
counting the numbers of the face labels and voice labels corresponding to the multiple face samples to obtain a correlation between the face labels and the voice labels, the correlation being used to match each class of face label with the voice label having the highest probability of co-occurrence;
determining the voice label corresponding to each face label according to the correlation;
generating the acoustic model corresponding to each face label according to the voice label corresponding to that face label;
generating the acoustic model repository according to the acoustic models corresponding to all the face labels.
4. The speech synthesis method according to claim 2, wherein extracting the face characteristic of the face sample comprises:
extracting multiple contour feature points and multiple facial-feature points from each face sample;
connecting all the contour feature points of each face sample to obtain an outer contour of the face image;
connecting the facial-feature points of the same feature type of each face sample to obtain a contour of each facial feature of the face image;
taking the outer contour of each face sample and the contour of each facial feature as the face characteristic corresponding to the face sample.
5. The speech synthesis method according to claim 4, wherein determining the face label corresponding to the face characteristic comprises:
calculating an arc of the outer contour of the face sample;
calculating, according to each facial-feature contour, the spacing between the facial features and the length and width of each facial feature;
determining the face label corresponding to the face characteristic according to the arc of the outer contour, the spacing between the facial features, and the length and width of each facial feature.
6. A speech synthesis device, characterized by comprising:
a first acquisition module, configured to acquire a face picture in an audio-video to be dubbed;
a first extraction module, configured to extract a face characteristic of the face picture;
a first determining module, configured to determine, according to the face characteristic, a face label corresponding to the face picture in the audio-video to be dubbed;
a selection module, configured to select an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model comprising multiple voice labels;
a second determining module, configured to determine a speech characteristic parameter corresponding to each of the multiple voice labels;
a synthesis module, configured to synthesize speech for a role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
7. The speech synthesis device according to claim 6, wherein the acoustic model repository is obtained by the following modules:
a second acquisition module, configured to acquire multiple face samples and multiple speech samples corresponding to the multiple face samples;
a second extraction module, configured to extract a face characteristic of each face sample;
a third determining module, configured to determine a face label corresponding to the face characteristic;
a third extraction module, configured to extract a speech feature of each speech sample, the speech feature comprising multiple speech characteristic parameters;
a fourth determining module, configured to determine multiple voice labels corresponding to the multiple speech characteristic parameters;
a generation module, configured to generate the acoustic model repository according to the multiple face labels and the multiple voice labels.
8. The speech synthesis device according to claim 7, wherein the generation module comprises:
a statistics unit, configured to count the numbers of the face labels and voice labels corresponding to the multiple face samples to obtain a correlation between the face labels and the voice labels, the correlation being used to match each class of face label with the voice label having the highest probability of co-occurrence;
a first determining unit, configured to determine the voice label corresponding to each face label according to the correlation;
a first generation unit, configured to generate the acoustic model corresponding to each face label according to the voice label corresponding to that face label;
a second generation unit, configured to generate the acoustic model repository according to the acoustic models corresponding to all the face labels.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the speech synthesis method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910602385.1A CN110459200A (en) | 2019-07-05 | 2019-07-05 | Phoneme synthesizing method, device, computer equipment and storage medium |
PCT/CN2020/085572 WO2021004113A1 (en) | 2019-07-05 | 2020-04-20 | Speech synthesis method and apparatus, computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910602385.1A CN110459200A (en) | 2019-07-05 | 2019-07-05 | Phoneme synthesizing method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110459200A true CN110459200A (en) | 2019-11-15 |
Family
ID=68482140
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910602385.1A Pending CN110459200A (en) | 2019-07-05 | 2019-07-05 | Phoneme synthesizing method, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110459200A (en) |
WO (1) | WO2021004113A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113010138B (en) * | 2021-03-04 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Article voice playing method, device and equipment and computer readable storage medium |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0887289A (en) * | 1994-09-19 | 1996-04-02 | Fujitsu Ltd | Voice rule synthesis device |
WO1999066495A1 (en) * | 1998-06-14 | 1999-12-23 | Nissim Cohen | Voice character imitator system |
CN104485100A (en) * | 2014-12-18 | 2015-04-01 | 天津讯飞信息科技有限公司 | Text-to-speech pronunciation person self-adaptive method and system |
CN104809923A (en) * | 2015-05-13 | 2015-07-29 | 苏州清睿信息技术有限公司 | Self-complied and self-guided method and system for generating intelligent voice communication |
CN105096932A (en) * | 2015-07-14 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and apparatus of talking book |
CN106531148A (en) * | 2016-10-24 | 2017-03-22 | 咪咕数字传媒有限公司 | Cartoon dubbing method and apparatus based on voice synthesis |
CN106548772A (en) * | 2017-01-16 | 2017-03-29 | 上海智臻智能网络科技股份有限公司 | Speech recognition test system and method |
CN106648082A (en) * | 2016-12-09 | 2017-05-10 | 厦门快商通科技股份有限公司 | Intelligent service device capable of simulating human interactions and method |
CN107172449A (en) * | 2017-06-19 | 2017-09-15 | 微鲸科技有限公司 | Multi-medium play method, device and multimedia storage method |
CN107358949A (en) * | 2017-05-27 | 2017-11-17 | 芜湖星途机器人科技有限公司 | Robot sounding automatic adjustment system |
JP2018097185A (en) * | 2016-12-14 | 2018-06-21 | パナソニックIpマネジメント株式会社 | Voice dialogue device, voice dialogue method, voice dialogue program and robot |
CN108735211A (en) * | 2018-05-16 | 2018-11-02 | 智车优行科技(北京)有限公司 | Method of speech processing, device, vehicle, electronic equipment, program and medium |
CN108744521A (en) * | 2018-06-28 | 2018-11-06 | 网易(杭州)网络有限公司 | The method and device of game speech production, electronic equipment, storage medium |
CN109391842A (en) * | 2018-11-16 | 2019-02-26 | 维沃移动通信有限公司 | A kind of dubbing method, mobile terminal |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6839672B1 (en) * | 1998-01-30 | 2005-01-04 | At&T Corp. | Integration of talking heads and text-to-speech synthesizers for visual TTS |
US7133535B2 (en) * | 2002-12-21 | 2006-11-07 | Microsoft Corp. | System and method for real time lip synchronization |
US9607609B2 (en) * | 2014-09-25 | 2017-03-28 | Intel Corporation | Method and apparatus to synthesize voice based on facial structures |
CN105931631A (en) * | 2016-04-15 | 2016-09-07 | 北京地平线机器人技术研发有限公司 | Voice synthesis system and method |
CN107507620A (en) * | 2017-09-25 | 2017-12-22 | 广东小天才科技有限公司 | A kind of voice broadcast sound method to set up, device, mobile terminal and storage medium |
CN110459200A (en) * | 2019-07-05 | 2019-11-15 | 深圳壹账通智能科技有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
- 2019-07-05: CN application CN201910602385.1A filed (CN, status: pending)
- 2020-04-20: PCT application PCT/CN2020/085572 filed (WO, application filing)
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021004113A1 (en) * | 2019-07-05 | 2021-01-14 | 深圳壹账通智能科技有限公司 | Speech synthesis method and apparatus, computer device and storage medium |
CN111583903A (en) * | 2020-04-28 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Speech synthesis method, vocoder training method, device, medium, and electronic device |
CN111583903B (en) * | 2020-04-28 | 2021-11-05 | 北京字节跳动网络技术有限公司 | Speech synthesis method, vocoder training method, device, medium, and electronic device |
CN113345422A (en) * | 2021-04-23 | 2021-09-03 | 北京巅峰科技有限公司 | Voice data processing method, device, equipment and storage medium |
CN113345422B (en) * | 2021-04-23 | 2024-02-20 | 北京巅峰科技有限公司 | Voice data processing method, device, equipment and storage medium |
CN117641019A (en) * | 2023-12-01 | 2024-03-01 | 广州一千零一动漫有限公司 | Audio matching verification method and system based on animation video |
CN117641019B (en) * | 2023-12-01 | 2024-05-24 | 广州一千零一动漫有限公司 | Audio matching verification method and system based on animation video |
Also Published As
Publication number | Publication date |
---|---|
WO2021004113A1 (en) | 2021-01-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110459200A (en) | Phoneme synthesizing method, device, computer equipment and storage medium | |
US11741940B2 (en) | Text and audio-based real-time face reenactment | |
US8125485B2 (en) | Animating speech of an avatar representing a participant in a mobile communication | |
CN110390704B (en) | Image processing method, image processing device, terminal equipment and storage medium | |
CN110163054B (en) | Method and device for generating human face three-dimensional image | |
US9082400B2 (en) | Video generation based on text | |
WO2018049979A1 (en) | Animation synthesis method and device | |
CN108958610A (en) | Special efficacy generation method, device and electronic equipment based on face | |
CN108346427A (en) | A kind of audio recognition method, device, equipment and storage medium | |
KR102509666B1 (en) | Real-time face replay based on text and audio | |
TW201937344A (en) | Smart robot and man-machine interaction method | |
CN110555896B (en) | Image generation method and device and storage medium | |
CN102568023A (en) | Real-time animation for an expressive avatar | |
JP2014519082A5 (en) | ||
CN107911643B (en) | Method and device for showing scene special effect in video communication | |
CN109801349B (en) | Sound-driven three-dimensional animation character real-time expression generation method and system | |
CN112669417A (en) | Virtual image generation method and device, storage medium and electronic equipment | |
CN112668407A (en) | Face key point generation method and device, storage medium and electronic equipment | |
CN114359517A (en) | Avatar generation method, avatar generation system, and computing device | |
KR20200059993A (en) | Apparatus and method for generating conti for webtoon | |
CN110794964A (en) | Interaction method and device for virtual robot, electronic equipment and storage medium | |
CN113299312A (en) | Image generation method, device, equipment and storage medium | |
CN110148406A (en) | A kind of data processing method and device, a kind of device for data processing | |
US20120013620A1 (en) | Animating Speech Of An Avatar Representing A Participant In A Mobile Communications With Background Media | |
CN112652041A (en) | Virtual image generation method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |