CN110459200A - Speech synthesis method, device, computer equipment and storage medium - Google Patents

Speech synthesis method, device, computer equipment and storage medium

Info

Publication number
CN110459200A
CN110459200A
Authority
CN
China
Prior art keywords
face
label
voice
acoustic model
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910602385.1A
Other languages
Chinese (zh)
Inventor
向纯玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
Original Assignee
OneConnect Smart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Smart Technology Co Ltd filed Critical OneConnect Smart Technology Co Ltd
Priority to CN201910602385.1A priority Critical patent/CN110459200A/en
Publication of CN110459200A publication Critical patent/CN110459200A/en
Priority to PCT/CN2020/085572 priority patent/WO2021004113A1/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a speech synthesis method and device, a computer device, and a storage medium. The method obtains a face picture from a video to be dubbed; extracts facial features from the face picture; determines, according to the facial features, a face label corresponding to the face picture in the video to be dubbed; selects an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels; determines a speech feature parameter corresponding to each of the voice labels; and uses the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed, thereby improving dubbing accuracy.

Description

Speech synthesis method, device, computer equipment and storage medium
Technical field
The present invention relates to the field of computer technology, and in particular to a speech synthesis method and device, a computer device, and a storage medium.
Background technique
At present, with the continuous development of new media, self-media outlets adapted to the Internet have gradually emerged. These outlets usually produce simple dubbed videos to entertain the public. In such videos, to keep production costs down, the voices of the characters are generally produced with speech synthesis technology. However, current speech synthesis technology offers only one or two simple timbres, so the dubbing lacks any association with the individual characters: a character's face and voice do not match, or match poorly, and dubbing accuracy is therefore low.
Summary of the invention
Embodiments of the present invention provide a speech synthesis method and device, a computer device, and a storage medium, with the aim of improving dubbing accuracy.
A speech synthesis method, comprising:
obtaining a face picture from a video to be dubbed;
extracting facial features of the face picture;
determining, according to the facial features, a face label corresponding to the face picture in the video to be dubbed;
selecting an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels;
determining a speech feature parameter corresponding to each of the multiple voice labels; and
using the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed.
A speech synthesis device, comprising:
a first obtaining module, configured to obtain a face picture from a video to be dubbed;
a first extraction module, configured to extract facial features of the face picture;
a first determining module, configured to determine, according to the facial features, a face label corresponding to the face picture in the video to be dubbed;
a selection module, configured to select an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels;
a second determining module, configured to determine a speech feature parameter corresponding to each of the multiple voice labels; and
a synthesis module, configured to use the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above speech synthesis method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above speech synthesis method.
In the above speech synthesis method and device, computer device, and storage medium, a face picture is obtained from the video to be dubbed, the facial features of the face picture are analyzed to obtain the corresponding face label, an acoustic model is then selected from the acoustic model repository according to the face label, and speech is synthesized using the speech feature parameters of that acoustic model. Because different speech feature parameters are obtained for different faces, the facial features of a character in an entertainment video can be recognized and an acoustic model that fits those features can be matched to that character. This strengthens the association between the dubbing and the character, improves the match between face and voice, avoids mismatched voices, and thereby improves dubbing accuracy.
Detailed description of the invention
To illustrate the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Apparently, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application environment of the speech synthesis method in an embodiment of the invention;
Fig. 2 is an exemplary diagram of the speech synthesis method in an embodiment of the invention;
Fig. 3 is another exemplary diagram of the speech synthesis method in an embodiment of the invention;
Fig. 4 is another exemplary diagram of the speech synthesis method in an embodiment of the invention;
Fig. 5 is another exemplary diagram of the speech synthesis method in an embodiment of the invention;
Fig. 6 is another exemplary diagram of the speech synthesis method in an embodiment of the invention;
Fig. 7 is an exemplary diagram of the speech synthesis device in an embodiment of the invention;
Fig. 8 is another exemplary diagram of the speech synthesis device in an embodiment of the invention;
Fig. 9 is another exemplary diagram of the speech synthesis device in an embodiment of the invention;
Fig. 10 is an exemplary diagram of the computer device in an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The speech synthesis method provided by the embodiments of the present invention can be applied in the application environment of Fig. 1, in which a terminal device communicates with a server over a network. After acquiring the video to be dubbed, the terminal device transmits the face picture in the video to the server. On receiving the face picture, the server extracts its facial features, determines the face label corresponding to the face picture, selects an acoustic model from the acoustic model repository according to the face label, and finally synthesizes the voice of the character corresponding to the face picture in the video to be dubbed. The terminal device may be, but is not limited to, a personal computer, a laptop, a smartphone, a tablet, or a portable wearable device. The server may be implemented as an independent server or as a cluster of multiple servers.
In one embodiment, as shown in Fig. 2, a speech synthesis method is provided. Taking its application to the server in Fig. 1 as an example, the method includes the following steps:
S10: Obtain a face picture from the video to be dubbed.
In this embodiment, the face picture is the picture of a face that appears in the video to be dubbed. To ensure the accuracy of the subsequent facial feature extraction, the facial organs and the outer contour of the face in the picture should be clearly visible.
S20: Extract the facial features of the face picture.
Here, facial features are the key features that carry face information, such as geometric features of the face image (for example, the feature points of the facial organs and of the facial contour) and grayscale features of the face image (for example, skin tone); they are used to identify the face image.
Preferably, in this embodiment the geometric features include the feature points obtained by localizing the key points of the facial organs and of the facial contour. Specifically, the facial features can be obtained with the ASM (Active Shape Model) facial landmark localization algorithm, which builds a universal model of global face appearance and is robust to local image corruption, but is computationally very expensive and requires many iterations. Alternatively, the AAM (Active Appearance Model) facial landmark localization algorithm can be used, which treats landmark localization directly as a regression task and computes the feature-point coordinates with a global regressor. Facial landmark localization remains a challenging task because of the many variations in facial expression, pose, illumination, and so on; moreover, feature points at different facial locations differ in localization difficulty, so a single model can hardly guarantee localization accuracy. To overcome these problems, the facial landmark localization algorithm of a coarse-to-fine CNN network can be used to obtain the facial features.
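As an illustration of what this landmark-extraction step can look like in code, the sketch below uses dlib's off-the-shelf 68-point landmark predictor as a stand-in for the coarse-to-fine CNN described above; the predictor model file and its path are assumptions (the file must be downloaded separately).

```python
# A minimal sketch of facial landmark extraction, using dlib's 68-point
# predictor as a stand-in for the patent's coarse-to-fine CNN.
import dlib
import cv2

detector = dlib.get_frontal_face_detector()
# Assumed model path; not part of the patent.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_face_features(image_path: str):
    """Return (x, y) landmark coordinates for the first detected face."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Points 0-16 are the outer contour; 17-67 cover brows, eyes, nose, mouth.
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```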
S30: Determine, according to the facial features, the face label corresponding to the face picture in the video to be dubbed.
A face label is a label used to classify a face picture according to its facial features. For example, face labels may include a girl baby-face label, a young-woman face label, a grandmother face label, a boy baby-face label, a young-man face label, an uncle face label, a grandfather face label, and so on. It should be noted that in this embodiment the above face labels are listed only to aid understanding, and do not limit the face labels of this embodiment.
S40: Select the acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels.
Voice labels include labels for pitch, timbre, speaking rate, intensity of sound, and the like. The acoustic model repository contains multiple acoustic models, each containing multiple voice labels; each voice label corresponds to a speech feature parameter, which characterizes the voice feature of that label. Each acoustic model in the repository is set up according to a face label, and different face labels correspond to different acoustic models. For example, compared with other face labels, in the acoustic model corresponding to the young-woman face label the speech feature parameter of the pitch label is higher, indicating a higher pitch; in the acoustic model corresponding to the grandmother face label, the speaking-rate parameter of the speaking-rate label is lower, indicating slower speech.
Specifically, the acoustic model corresponding to the face label is selected from a preset acoustic model repository according to the face label.
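The patent does not fix a data structure for the repository; one plausible in-memory layout, with all labels and numeric intervals invented purely for illustration, maps each face label to an acoustic model whose voice labels carry parameter intervals:

```python
# An illustrative (assumed) layout for the acoustic model repository:
# each face label maps to an acoustic model, i.e. a dict of voice
# labels, each holding an interval of admissible parameter values.
ACOUSTIC_MODEL_REPOSITORY = {
    "young_woman": {
        "pitch_hz": (180.0, 260.0),   # higher pitch interval
        "rate_wpm": (140.0, 180.0),
        "volume":   (0.6, 0.9),
    },
    "grandmother": {
        "pitch_hz": (140.0, 200.0),
        "rate_wpm": (90.0, 120.0),    # slower speaking-rate interval
        "volume":   (0.5, 0.8),
    },
}

def select_acoustic_model(face_label: str) -> dict:
    """S40: pick the acoustic model matching the face label."""
    return ACOUSTIC_MODEL_REPOSITORY[face_label]
```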
S50: Determine the speech feature parameter corresponding to each of the multiple voice labels.
The speech feature parameters are parameters of the voice such as pitch, timbre, speaking rate, and intensity.
S60: Use the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed.
Specifically, since a speech feature parameter in an acoustic model lies within a numerical interval, whereas synthesizing speech requires a definite value, a value can be picked at random from the interval of the acoustic model's speech feature parameter and used as the value required for synthesis.
The server performs speech synthesis according to these speech feature parameters, then outputs a voice file and returns it to the terminal device. The format of the voice file depends on the requirements; for example, it may be .mp3 or .wvm, without limitation here.
In this embodiment, synthesizing speech may mean using the speech feature parameters to synthesize a segment of speech corresponding to a speech text, or applying voice conversion to an existing speech segment according to the speech feature parameters. Specifically, the speech text corresponding to the video to be dubbed can be obtained, and the speech feature parameters then used to synthesize, for the video to be dubbed, the speech corresponding to that speech text.
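A minimal sketch of S50 and S60 under the assumed repository layout above: a definite value is drawn uniformly from each parameter interval, and pyttsx3 serves as a stand-in TTS backend (the patent names no engine; pyttsx3 exposes only rate, volume, and voice selection, so pitch and timbre control are omitted here).

```python
# A sketch of S50/S60: draw one concrete value from each interval,
# then synthesize with pyttsx3 as a stand-in TTS backend.
import random
import pyttsx3

def sample_parameters(acoustic_model: dict) -> dict:
    """S50: pick one concrete value from each parameter interval."""
    return {label: random.uniform(lo, hi)
            for label, (lo, hi) in acoustic_model.items()}

def synthesize(text: str, params: dict, out_path: str = "dub.wav") -> None:
    """S60: synthesize speech for the character and save a voice file."""
    engine = pyttsx3.init()
    engine.setProperty("rate", int(params["rate_wpm"]))
    engine.setProperty("volume", params["volume"])
    # Pitch/timbre control is backend-dependent and omitted in this sketch.
    engine.save_to_file(text, out_path)
    engine.runAndWait()

# Assumed intervals matching the repository sketch above.
model = {"rate_wpm": (90.0, 120.0), "volume": (0.5, 0.8)}
synthesize("Hello, welcome to the show.", sample_parameters(model))
```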
In this embodiment, because facial features reflect a person's character to some degree, a voice that does not correspond to the character gives the audience a jarring, out-of-character impression. Therefore, by obtaining the face picture from the video to be dubbed, analyzing its facial features to obtain the corresponding face label, selecting an acoustic model from the acoustic model repository according to the face label, and synthesizing speech with the speech feature parameters of that acoustic model, different speech feature parameters are obtained for different faces. The facial features of a character in an entertainment video can thus be recognized and an acoustic model that fits those features matched to that character, which strengthens the association between the dubbing and the character, improves the match between face and voice, avoids mismatched voices, and thereby improves dubbing accuracy.
In one embodiment, as shown in Fig. 3, the acoustic model repository is obtained through the following steps:
S70: Obtain multiple face samples and multiple speech samples corresponding to the face samples.
In this embodiment, a face sample is an image containing all the organs of a face, and a speech sample is a segment of speech. Each face sample is associated with a speech sample: the speech sample is uttered (spoken) by the person the face sample belongs to.
For example, to obtain an acoustic model repository that fits facial features more closely, the face samples and their corresponding speech samples can be extracted from a large number of videos containing both voices and faces. In such material the sound (the speech sample) is spoken by the person whose face appears (the face sample), so the face samples and speech samples have a definite association.
S80: Extract the facial features of the face samples.
To guarantee the accuracy of the extracted facial features, a pre-trained coarse-to-fine CNN network can be used to extract the facial features of each face sample.
S90: Determine the face label corresponding to the facial features.
Specifically, it can be judged whether the feature values of the facial features fall within the feature-value interval of any preset label; if they fall within the feature-value interval of a preset label, that preset label is determined to be the face label corresponding to the facial features.
S100: Extract the voice features of the speech samples, the voice features including multiple speech feature parameters.
In this embodiment, the voice features are features of the speech sample such as pitch (high or low), timbre, speaking rate (fast or slow), and intensity (strong or weak). Specifically, a Python audio library (such as Audiolab) can be used to extract the voice features of a speech sample; in operation, the speech sample only needs to be passed into the library as a parameter. Alternatively, Matlab can be used to draw the spectrogram of the speech sample, and the spectrogram then analyzed to obtain the voice features. However, considering the volume of speech samples and the simplicity of operation, this scheme preferably uses the Python audio library to extract the voice features of the speech samples.
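The patent names Audiolab without showing any calls; as a stand-in, this sketch estimates pitch and intensity with librosa, a widely used Python audio library. The choice of features and their mapping to the patent's parameters are assumptions.

```python
# A sketch of S100 using librosa as a stand-in for the Python audio
# library the patent mentions; the chosen features are assumptions.
import librosa
import numpy as np

def extract_voice_features(wav_path: str) -> dict:
    y, sr = librosa.load(wav_path, sr=None)
    # Fundamental frequency (pitch) via probabilistic YIN.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"))
    pitch_hz = float(np.nanmean(f0))                       # average pitch
    intensity = float(np.mean(librosa.feature.rms(y=y)))   # loudness proxy
    duration = len(y) / sr
    return {"pitch_hz": pitch_hz, "intensity": intensity,
            "duration_s": duration}
```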
S110: Determine the voice labels corresponding to the multiple speech feature parameters.
In this embodiment, since the voice features of each speech sample include multiple speech feature parameters, one voice label is determined from each speech feature parameter, that is, each speech feature parameter corresponds to one voice label. A specific way to determine the label is to first determine the interval in which each speech feature parameter lies, and then determine the voice label from that interval. Taking the pitch parameter as an example: based on step S100 (extracting the voice features), the pitch of a speech sample is found to be 100 Hz; given the preset pitch intervals (high pitch [300 Hz, 500 Hz], medium pitch [80 Hz, 300 Hz), low pitch [0 Hz, 80 Hz)), the pitch falls within the medium-pitch interval, so the pitch label of the speech sample is determined to be medium pitch.
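The interval-to-label mapping is a plain lookup; this sketch uses the pitch intervals quoted above, and the same kind of check can serve the face-label feature intervals of step S90.

```python
# S110 as interval lookup: map a pitch value to its label using the
# intervals quoted in the text; other parameters work the same way.
PITCH_INTERVALS = [           # [low, high) buckets, top bucket closed
    ("low_pitch",    0.0,  80.0),
    ("medium_pitch", 80.0, 300.0),
    ("high_pitch",  300.0, 500.0),
]

def pitch_label(pitch_hz: float) -> str:
    for label, lo, hi in PITCH_INTERVALS:
        if lo <= pitch_hz < hi or (label == "high_pitch" and pitch_hz == hi):
            return label
    raise ValueError(f"pitch {pitch_hz} Hz outside all intervals")

assert pitch_label(100.0) == "medium_pitch"   # the example from the text
```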
S120: Generate the acoustic model repository according to the multiple face labels and multiple voice labels.
In one embodiment, as shown in Fig. 4, step S120 (generating the acoustic model repository according to the multiple face labels and multiple voice labels) specifically includes the following steps:
S121: Count the numbers of face labels and voice labels corresponding to the multiple face samples, to obtain the association between the face labels and the voice labels; the association is used to match each class of face label with the voice label of maximum co-occurrence probability.
Specifically, the voice labels corresponding to each class of face label can be counted, and the voice label with the maximum occurrence probability (that is, the highest occurrence count) for each class taken as the voice label associated with that face label.
S122: Determine the voice label corresponding to each face label according to the association.
S123: Generate the acoustic model corresponding to each face label according to the voice labels corresponding to that face label.
An acoustic model here is a model that includes multiple speech feature parameters. For example, the girl baby-face label corresponds to a girl baby-face acoustic model.
Generating the acoustic model corresponding to each face label provides the basis for later selecting the acoustic model corresponding to a face label.
S124: Generate the acoustic model repository from the acoustic models corresponding to all the face labels.
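For a concrete sense of S121–S124, here is a minimal sketch assuming the training pairs arrive as (face label, voice label) tuples; in a full system each parameter type (pitch, rate, and so on) would be tallied separately.

```python
# A sketch of S121-S124: count (face_label, voice_label) co-occurrences
# over the samples and keep, per face label, the most frequent voice
# label. The sample data shapes are assumptions.
from collections import Counter, defaultdict

def build_repository(samples):
    """samples: iterable of (face_label, voice_label) pairs."""
    counts = defaultdict(Counter)
    for face_label, voice_label in samples:      # S121: tally co-occurrences
        counts[face_label][voice_label] += 1
    repository = {}
    for face_label, counter in counts.items():   # S122-S124
        best_voice_label, _ = counter.most_common(1)[0]
        repository[face_label] = {"voice_label": best_voice_label}
    return repository

demo = [("young_woman", "high_pitch"), ("young_woman", "high_pitch"),
        ("young_woman", "medium_pitch"), ("grandmother", "slow_rate")]
print(build_repository(demo))
# {'young_woman': {'voice_label': 'high_pitch'},
#  'grandmother': {'voice_label': 'slow_rate'}}
```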
In this embodiment, the numbers of face labels and voice labels corresponding to the multiple face samples are counted to obtain the association between face labels and voice labels; the voice label corresponding to each face label is determined according to the association; the acoustic model corresponding to each face label is generated from its voice labels; and the acoustic model repository is generated from the acoustic models of all the face labels. This strengthens the association between faces and voices, so that face shape and voice fit together better, and so that the speech later synthesized for the face image in the video to be dubbed matches the character's image, sparing the audience any sense of the character "coming out of the play".
In one embodiment, as shown in figure 5, step S80: extracting the face characteristic of face sample, comprising the following steps:
S81: multiple contour characteristic points and multiple five features point are extracted from each face sample.
Wherein, contour characteristic points refer to the characteristic point of the outer profile of face;Face include eye, ear, mouth, nose and eyebrow, and five Official's characteristic point refers to the characteristic point on this 5 parts of left and right eye, eyebrow, nose, mouth.
As the preferred of the present embodiment, the characteristic point that multi-model carrys out locating human face's different location can be used, face is divided into Five features point positions respectively with contour characteristic points, and five features point refers to the characteristic point where the face of face, outside Contour feature point refers to the characteristic point where the outer profile of face.In our embodiment, Coarse-to-fine can be used The facial modeling algorithm of CNN network obtains five features point and contour characteristic points on each facial image, thus Obtain face characteristic.
Specifically, DCNN model is divided into two groups of parallel CNN cascade networks.Wherein one group is one 4 grades cascade CNN, for obtaining the five features point of face (as chosen 51 face five features points).Wherein, it is used for human face five-sense-organ for the 1st grade The positioning of the minimum bounding box (bounding box) of characteristic point, the minimum bounding box of the five features point are to enclose face The minimum picture of all face (5 left and right eye, eyebrow, nose, mouth parts) on image;2nd grade for will most parcel It encloses in box and is input in CNN, thus the position of multiple characteristic points according to a preliminary estimate;3rd level is used for will be each in minimum bounding box The picture of face, which is cut out, to be come, and is input to the position that multiple five features points are further accurately estimated in the CNN of this grade; 4th grade be for above-mentioned each face picture carry out rotation correction, and to five features point each after rotation correction carry out essence It determines position, obtains multiple five features points.Another group is 2 grades of cascade CNN, and the 1st group of contour characteristic points for face are (such as Choose 17 contour characteristic points) minimum bounding box positioning, the minimum bounding boxs of the contour characteristic points is to enclose The minimum picture of the outer profile of facial image.2nd grade from the minimum bounding box of contour characteristic points for estimating multiple foreign steamers The accurate location of wide characteristic point obtains multiple contour characteristic points.Why the 1st grade of two groups of parallel CNN cascade networks all It needs to position minimum bounding box, is because traditional DCNN is in priori knowledge deficiency, the most strength of convolutional network is all unrestrained Take on finding face, reduces the efficiency of facial modeling, and then influence the acquisition efficiency of face characteristic.
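The patent describes the cascade only at the level of stages; as an illustrative skeleton (not the patented network), the PyTorch sketch below mirrors the 4-level organ cascade and the 2-level contour cascade, with the layer sizes and the SimpleCNN backbone chosen arbitrarily.

```python
# A structural sketch (not a trained model) of the two parallel CNN
# cascades described above, in PyTorch. All layer sizes are assumptions.
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Assumed tiny backbone: image patch in, coordinate vector out."""
    def __init__(self, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, out_dim))

    def forward(self, x):
        return self.net(x)

class OrganCascade(nn.Module):
    """4-level cascade: bounding box -> coarse -> refined -> corrected."""
    def __init__(self, n_points: int = 51):
        super().__init__()
        self.level1_bbox = SimpleCNN(4)                # box (x, y, w, h)
        self.level2_coarse = SimpleCNN(2 * n_points)   # preliminary points
        self.level3_refine = SimpleCNN(2 * n_points)   # per-organ refinement
        self.level4_correct = SimpleCNN(2 * n_points)  # after rotation fix

class ContourCascade(nn.Module):
    """2-level cascade: bounding box -> precise contour points."""
    def __init__(self, n_points: int = 17):
        super().__init__()
        self.level1_bbox = SimpleCNN(4)
        self.level2_points = SimpleCNN(2 * n_points)
```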
S82: Connect all the outer-contour feature points of each face sample to obtain the outer contour of the face image.
S83: Connect the facial-organ feature points of the same organ type in each face sample to obtain the contour of each facial organ of the face image.
Here, connecting the facial-organ feature points of the same organ type means connecting the feature points of the left eye, right eye, mouth, nose, and eyebrows respectively. The facial-organ contours are the left-eye contour, right-eye contour, mouth contour, nose contour, and eyebrow contour.
S84: Take the outer contour and the facial-organ contours of each face sample as the facial features corresponding to that face sample.
In one embodiment, as shown in Fig. 6, step S90 (determining the face label corresponding to the facial features) includes the following steps:
S91: Calculate the curvature of the outer contour of the face sample.
S92: Calculate, from the facial-organ contours, the spacing between the facial organs and the length and width of each organ.
The spacing between the facial organs refers to the pairwise spacing between the left and right eyes, mouth, nose, and eyebrows, for example the spacing between the eyes and the eyebrows, between the left and right eyes, between the two eyebrows, and between the nose and the mouth. The width of an organ is its maximum width, such as the maximum width of the left eye, right eye, mouth, and nose; the length of an organ is its maximum length, such as the maximum length of the left eye, right eye, mouth, nose, and eyebrows.
S93: Determine the face label corresponding to the facial features from the curvature of the outer contour, the spacing between the organs, and the length and width of each organ.
The principle for determining the face label is to classify according to the outer-contour curvature of the face and the widths, lengths, and spacings of the facial organs. For example, when the outer-contour curvature, the organ widths, the organ lengths, and the organ spacings reach the preset girl baby-face thresholds for outer-contour curvature, organ width, organ length, and organ spacing respectively, the face picture corresponding to the facial features is classified as a girl baby face and assigned the girl baby-face label.
It should be noted that since the face label is determined by the facial features, and the facial features are usually embodied in the curvature of the outer contour of the face, the spacing between the facial organs, and the length and width of each organ, this embodiment calculates those quantities for the face sample and uses them to confirm the face shape, the organ sizes, and the organ spacings of the sample, so that the face label corresponding to the face sample can be determined.
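To make S91–S93 concrete, the sketch below derives a few geometric measurements from named landmark points and applies threshold rules; the landmark names, the roundness proxy for outer-contour curvature, and all threshold values are assumptions, not values from the patent.

```python
# A sketch of S91-S93: derive simple geometry from landmark points and
# classify by thresholds. All names and thresholds here are assumptions.
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def face_label_from_geometry(landmarks: dict) -> str:
    """landmarks: assumed dict of named (x, y) points."""
    eye_spacing = dist(landmarks["left_eye"], landmarks["right_eye"])
    face_width = dist(landmarks["jaw_left"], landmarks["jaw_right"])
    # A crude proxy for outer-contour curvature: face roundness ratio.
    roundness = face_width / dist(landmarks["brow_mid"], landmarks["chin"])
    # Assumed thresholds for one label; a real system would hold one
    # threshold set per face label.
    if roundness > 0.95 and eye_spacing / face_width > 0.42:
        return "girl_baby_face"
    return "young_woman"
```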
It should be understood that the step numbers in the above embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present invention.
In one embodiment, a speech synthesis device is provided, corresponding one-to-one to the speech synthesis method in the above embodiments. As shown in Fig. 7, the speech synthesis device includes a first obtaining module 10, a first extraction module 20, a first determining module 30, a selection module 40, a second determining module 50, and a synthesis module 60. The functional modules are described in detail as follows:
a first obtaining module 10, configured to obtain a face picture from the video to be dubbed;
a first extraction module 20, configured to extract the facial features of the face picture;
a first determining module 30, configured to determine, according to the facial features, the face label corresponding to the face picture in the video to be dubbed;
a selection module 40, configured to select the acoustic model corresponding to the face label from the acoustic model repository, the acoustic model including multiple voice labels;
a second determining module 50, configured to determine the speech feature parameter corresponding to each of the multiple voice labels;
a synthesis module 60, configured to use the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed.
Preferably, in one embodiment, as shown in Fig. 8, the acoustic model repository is obtained through the following modules:
a second obtaining module 70, configured to obtain multiple face samples and multiple speech samples corresponding to the face samples;
a second extraction module 80, configured to extract the facial features of the face samples;
a third determining module 90, configured to determine the face label corresponding to the facial features;
a third extraction module 100, configured to extract the voice features of the speech samples, the voice features including multiple speech feature parameters;
a fourth determining module 110, configured to determine the voice labels corresponding to the multiple speech feature parameters;
a generation module 120, configured to generate the acoustic model repository according to the multiple face labels and multiple voice labels.
Preferably, in one embodiment, as shown in Fig. 9, the generation module 120 includes:
a statistics unit 121, configured to count the numbers of face labels and voice labels corresponding to the multiple face samples, to obtain the association between face labels and voice labels, the association being used to match each class of face label with the voice label of maximum co-occurrence probability;
a first determination unit 122, configured to determine the voice label corresponding to each face label according to the association;
a first generation unit 123, configured to generate the acoustic model corresponding to each face label according to the voice labels corresponding to that face label;
a second generation unit 124, configured to generate the acoustic model repository from the acoustic models corresponding to all the face labels.
Optionally, in one embodiment, the first extraction module 20 includes:
an extraction unit, configured to extract multiple outer-contour feature points and multiple facial-organ feature points from each face sample;
a first connection unit, configured to connect all the outer-contour feature points of each face sample to obtain the outer contour of the face image;
a second connection unit, configured to connect the facial-organ feature points of the same organ type in each face sample to obtain the contour of each facial organ of the face image;
a second determination unit, configured to take the outer contour and the facial-organ contours of each face sample as the facial features corresponding to that face sample.
Optionally, in one embodiment, the third determining module 90 includes:
a first computing unit, configured to calculate the curvature of the outer contour of the facial features;
a second computing unit, configured to calculate, from the facial-organ contours, the spacing between the facial organs and the length and width of each organ;
a third determination unit, configured to determine the face label corresponding to the facial features from the curvature of the outer contour, the spacing between the organs, and the length and width of each organ.
For specific limitations on the speech synthesis device, reference may be made to the limitations on the speech synthesis method above, which are not repeated here. Each module in the above speech synthesis device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capability. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database; the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the data required by the speech synthesis method. The network interface communicates with an external terminal over a network. The computer program, when executed by the processor, implements a speech synthesis method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
obtaining a face picture from the video to be dubbed;
extracting the facial features of the face picture;
determining, according to the facial features, the face label corresponding to the face picture in the video to be dubbed;
selecting the acoustic model corresponding to the face label from the acoustic model repository, the acoustic model including multiple voice labels;
determining the speech feature parameter corresponding to each of the multiple voice labels;
using the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed.
In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, implements the following steps:
obtaining a face picture from the video to be dubbed;
extracting the facial features of the face picture;
determining, according to the facial features, the face label corresponding to the face picture in the video to be dubbed;
selecting the acoustic model corresponding to the face label from the acoustic model repository, the acoustic model including multiple voice labels;
determining the speech feature parameter corresponding to each of the multiple voice labels;
using the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed.
With the above speech synthesis method and device, computer device, and storage medium, a face picture is obtained from the video to be dubbed, its facial features are analyzed to obtain the corresponding face label, an acoustic model is then selected from the acoustic model repository according to the face label, and speech is synthesized using the speech feature parameters of that acoustic model. Different speech feature parameters are obtained for different faces, so the characters in an entertainment video can be distinguished and multi-character scenarios handled. The dubbing is associated with each character, and the dubbing of multiple characters is more varied, which improves the dubbing effect.
Those of ordinary skill in the art will understand that all or part of the processes in the above method embodiments can be completed by instructing relevant hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, the division into the above functional units and modules is only an example; in practical applications, the functions can be assigned to different functional units or modules as required, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or replace some of the technical features with equivalents; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.

Claims (10)

1. A speech synthesis method, characterized by comprising:
obtaining a face picture from a video to be dubbed;
extracting facial features of the face picture;
determining, according to the facial features, a face label corresponding to the face picture in the video to be dubbed;
selecting an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model comprising multiple voice labels;
determining a speech feature parameter corresponding to each of the multiple voice labels; and
using the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed.
2. The speech synthesis method according to claim 1, characterized in that the acoustic model repository is obtained as follows:
obtaining multiple face samples and multiple speech samples corresponding to the multiple face samples;
extracting facial features of the face samples;
determining the face label corresponding to the facial features;
extracting voice features of the speech samples, the voice features comprising multiple speech feature parameters;
determining multiple voice labels corresponding to the multiple speech feature parameters; and
generating the acoustic model repository according to the multiple face labels and the multiple voice labels.
3. The speech synthesis method according to claim 2, characterized in that the generating the acoustic model repository according to the multiple face labels and the multiple voice labels comprises:
counting the numbers of face labels and voice labels corresponding to the multiple face samples, to obtain an association between the face labels and the voice labels, the association being used to match each class of face label with the voice label of maximum co-occurrence probability;
determining the voice label corresponding to each face label according to the association;
generating the acoustic model corresponding to each face label according to the voice labels corresponding to that face label; and
generating the acoustic model repository from the acoustic models corresponding to all the face labels.
4. The speech synthesis method according to claim 2, characterized in that the extracting facial features of the face samples comprises:
extracting multiple outer-contour feature points and multiple facial-organ feature points from each of the face samples;
connecting all the outer-contour feature points of each face sample to obtain an outer contour of the face image;
connecting the facial-organ feature points of the same organ type in each face sample to obtain a contour of each facial organ of the face image; and
taking the outer contour and the facial-organ contours of each face sample as the facial features corresponding to that face sample.
5. The speech synthesis method according to claim 4, characterized in that the determining the face label corresponding to the facial features comprises:
calculating a curvature of the outer contour of the face sample;
calculating, from each facial-organ contour, the spacing between the facial organs and the length and width of each organ; and
determining the face label corresponding to the facial features from the curvature of the outer contour, the spacing between the organs, and the length and width of each organ.
6. A speech synthesis device, characterized by comprising:
a first obtaining module, configured to obtain a face picture from a video to be dubbed;
a first extraction module, configured to extract facial features of the face picture;
a first determining module, configured to determine, according to the facial features, a face label corresponding to the face picture in the video to be dubbed;
a selection module, configured to select an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model comprising multiple voice labels;
a second determining module, configured to determine a speech feature parameter corresponding to each of the multiple voice labels; and
a synthesis module, configured to use the speech feature parameter corresponding to each voice label to synthesize speech for the character corresponding to the face picture in the video to be dubbed.
7. The speech synthesis device according to claim 6, characterized in that the acoustic model repository is obtained through the following modules:
a second obtaining module, configured to obtain multiple face samples and multiple speech samples corresponding to the multiple face samples;
a second extraction module, configured to extract facial features of the face samples;
a third determining module, configured to determine the face label corresponding to the facial features;
a third extraction module, configured to extract voice features of the speech samples, the voice features comprising multiple speech feature parameters;
a fourth determining module, configured to determine multiple voice labels corresponding to the multiple speech feature parameters; and
a generation module, configured to generate the acoustic model repository according to the multiple face labels and the multiple voice labels.
8. The speech synthesis device according to claim 7, characterized in that the generation module comprises:
a statistics unit, configured to count the numbers of face labels and voice labels corresponding to the multiple face samples, to obtain an association between the face labels and the voice labels, the association being used to match each class of face label with the voice label of maximum co-occurrence probability;
a first determination unit, configured to determine the voice label corresponding to each face label according to the association;
a first generation unit, configured to generate the acoustic model corresponding to each face label according to the voice labels corresponding to that face label; and
a second generation unit, configured to generate the acoustic model repository from the acoustic models corresponding to all the face labels.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech synthesis method according to any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 5.
CN201910602385.1A 2019-07-05 2019-07-05 Speech synthesis method, device, computer equipment and storage medium Pending CN110459200A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910602385.1A CN110459200A (en) 2019-07-05 2019-07-05 Speech synthesis method, device, computer equipment and storage medium
PCT/CN2020/085572 WO2021004113A1 (en) 2019-07-05 2020-04-20 Speech synthesis method and apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910602385.1A CN110459200A (en) 2019-07-05 2019-07-05 Speech synthesis method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110459200A true CN110459200A (en) 2019-11-15

Family

ID=68482140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910602385.1A Pending CN110459200A (en) Speech synthesis method, device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110459200A (en)
WO (1) WO2021004113A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583903A (en) * 2020-04-28 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, vocoder training method, device, medium, and electronic device
WO2021004113A1 (en) * 2019-07-05 2021-01-14 深圳壹账通智能科技有限公司 Speech synthesis method and apparatus, computer device and storage medium
CN113345422A (en) * 2021-04-23 2021-09-03 北京巅峰科技有限公司 Voice data processing method, device, equipment and storage medium
CN117641019A (en) * 2023-12-01 2024-03-01 广州一千零一动漫有限公司 Audio matching verification method and system based on animation video

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010138B (en) * 2021-03-04 2023-04-07 腾讯科技(深圳)有限公司 Article voice playing method, device and equipment and computer readable storage medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0887289A (en) * 1994-09-19 1996-04-02 Fujitsu Ltd Voice rule synthesis device
WO1999066495A1 (en) * 1998-06-14 1999-12-23 Nissim Cohen Voice character imitator system
CN104485100A (en) * 2014-12-18 2015-04-01 天津讯飞信息科技有限公司 Text-to-speech pronunciation person self-adaptive method and system
CN104809923A (en) * 2015-05-13 2015-07-29 苏州清睿信息技术有限公司 Self-complied and self-guided method and system for generating intelligent voice communication
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN106531148A (en) * 2016-10-24 2017-03-22 咪咕数字传媒有限公司 Cartoon dubbing method and apparatus based on voice synthesis
CN106548772A (en) * 2017-01-16 2017-03-29 上海智臻智能网络科技股份有限公司 Speech recognition test system and method
CN106648082A (en) * 2016-12-09 2017-05-10 厦门快商通科技股份有限公司 Intelligent service device capable of simulating human interactions and method
CN107172449A (en) * 2017-06-19 2017-09-15 微鲸科技有限公司 Multi-medium play method, device and multimedia storage method
CN107358949A (en) * 2017-05-27 2017-11-17 芜湖星途机器人科技有限公司 Robot sounding automatic adjustment system
JP2018097185A (en) * 2016-12-14 2018-06-21 パナソニックIpマネジメント株式会社 Voice dialogue device, voice dialogue method, voice dialogue program and robot
CN108735211A (en) * 2018-05-16 2018-11-02 智车优行科技(北京)有限公司 Method of speech processing, device, vehicle, electronic equipment, program and medium
CN108744521A (en) * 2018-06-28 2018-11-06 网易(杭州)网络有限公司 The method and device of game speech production, electronic equipment, storage medium
CN109391842A (en) * 2018-11-16 2019-02-26 维沃移动通信有限公司 A kind of dubbing method, mobile terminal

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6839672B1 (en) * 1998-01-30 2005-01-04 At&T Corp. Integration of talking heads and text-to-speech synthesizers for visual TTS
US7133535B2 (en) * 2002-12-21 2006-11-07 Microsoft Corp. System and method for real time lip synchronization
US9607609B2 (en) * 2014-09-25 2017-03-28 Intel Corporation Method and apparatus to synthesize voice based on facial structures
CN105931631A (en) * 2016-04-15 2016-09-07 北京地平线机器人技术研发有限公司 Voice synthesis system and method
CN107507620A (en) * 2017-09-25 2017-12-22 广东小天才科技有限公司 A kind of voice broadcast sound method to set up, device, mobile terminal and storage medium
CN110459200A (en) * 2019-07-05 2019-11-15 深圳壹账通智能科技有限公司 Speech synthesis method, device, computer equipment and storage medium

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0887289A (en) * 1994-09-19 1996-04-02 Fujitsu Ltd Voice rule synthesis device
WO1999066495A1 (en) * 1998-06-14 1999-12-23 Nissim Cohen Voice character imitator system
CN104485100A (en) * 2014-12-18 2015-04-01 天津讯飞信息科技有限公司 Text-to-speech pronunciation person self-adaptive method and system
CN104809923A (en) * 2015-05-13 2015-07-29 苏州清睿信息技术有限公司 Self-complied and self-guided method and system for generating intelligent voice communication
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN106531148A (en) * 2016-10-24 2017-03-22 咪咕数字传媒有限公司 Cartoon dubbing method and apparatus based on voice synthesis
CN106648082A (en) * 2016-12-09 2017-05-10 厦门快商通科技股份有限公司 Intelligent service device capable of simulating human interactions and method
JP2018097185A (en) * 2016-12-14 2018-06-21 パナソニックIpマネジメント株式会社 Voice dialogue device, voice dialogue method, voice dialogue program and robot
CN106548772A (en) * 2017-01-16 2017-03-29 上海智臻智能网络科技股份有限公司 Speech recognition test system and method
CN107358949A (en) * 2017-05-27 2017-11-17 芜湖星途机器人科技有限公司 Robot sounding automatic adjustment system
CN107172449A (en) * 2017-06-19 2017-09-15 微鲸科技有限公司 Multi-medium play method, device and multimedia storage method
CN108735211A (en) * 2018-05-16 2018-11-02 智车优行科技(北京)有限公司 Method of speech processing, device, vehicle, electronic equipment, program and medium
CN108744521A (en) * 2018-06-28 2018-11-06 网易(杭州)网络有限公司 The method and device of game speech production, electronic equipment, storage medium
CN109391842A (en) * 2018-11-16 2019-02-26 维沃移动通信有限公司 A kind of dubbing method, mobile terminal

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021004113A1 (en) * 2019-07-05 2021-01-14 深圳壹账通智能科技有限公司 Speech synthesis method and apparatus, computer device and storage medium
CN111583903A (en) * 2020-04-28 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, vocoder training method, device, medium, and electronic device
CN111583903B (en) * 2020-04-28 2021-11-05 北京字节跳动网络技术有限公司 Speech synthesis method, vocoder training method, device, medium, and electronic device
CN113345422A (en) * 2021-04-23 2021-09-03 北京巅峰科技有限公司 Voice data processing method, device, equipment and storage medium
CN113345422B (en) * 2021-04-23 2024-02-20 北京巅峰科技有限公司 Voice data processing method, device, equipment and storage medium
CN117641019A (en) * 2023-12-01 2024-03-01 广州一千零一动漫有限公司 Audio matching verification method and system based on animation video
CN117641019B (en) * 2023-12-01 2024-05-24 广州一千零一动漫有限公司 Audio matching verification method and system based on animation video

Also Published As

Publication number Publication date
WO2021004113A1 (en) 2021-01-14

Similar Documents

Publication Publication Date Title
CN110459200A (en) Speech synthesis method, device, computer equipment and storage medium
US11741940B2 (en) Text and audio-based real-time face reenactment
US8125485B2 (en) Animating speech of an avatar representing a participant in a mobile communication
CN110390704B (en) Image processing method, image processing device, terminal equipment and storage medium
CN110163054B (en) Method and device for generating human face three-dimensional image
US9082400B2 (en) Video generation based on text
WO2018049979A1 (en) Animation synthesis method and device
CN108958610A (en) Special efficacy generation method, device and electronic equipment based on face
CN108346427A (en) A kind of audio recognition method, device, equipment and storage medium
KR102509666B1 (en) Real-time face replay based on text and audio
TW201937344A (en) Smart robot and man-machine interaction method
CN110555896B (en) Image generation method and device and storage medium
CN102568023A (en) Real-time animation for an expressive avatar
JP2014519082A5 (en)
CN107911643B (en) Method and device for showing scene special effect in video communication
CN109801349B (en) Sound-driven three-dimensional animation character real-time expression generation method and system
CN112669417A (en) Virtual image generation method and device, storage medium and electronic equipment
CN112668407A (en) Face key point generation method and device, storage medium and electronic equipment
CN114359517A (en) Avatar generation method, avatar generation system, and computing device
KR20200059993A (en) Apparatus and method for generating conti for webtoon
CN110794964A (en) Interaction method and device for virtual robot, electronic equipment and storage medium
CN113299312A (en) Image generation method, device, equipment and storage medium
CN110148406A (en) A kind of data processing method and device, a kind of device for data processing
US20120013620A1 (en) Animating Speech Of An Avatar Representing A Participant In A Mobile Communications With Background Media
CN112652041A (en) Virtual image generation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination