CN110459200A - Speech synthesis method, apparatus, computer device and storage medium - Google Patents
Speech synthesis method, apparatus, computer device and storage medium
- Publication number
- CN110459200A CN110459200A CN201910602385.1A CN201910602385A CN110459200A CN 110459200 A CN110459200 A CN 110459200A CN 201910602385 A CN201910602385 A CN 201910602385A CN 110459200 A CN110459200 A CN 110459200A
- Authority
- CN
- China
- Prior art keywords
- face
- label
- voice
- acoustic model
- characteristic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- General Physics & Mathematics (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Processing Or Creating Images (AREA)
Abstract
The invention discloses a speech synthesis method, apparatus, computer device and storage medium. The method comprises: obtaining a face picture from a video to be dubbed; extracting facial features of the face picture; determining, according to the facial features, a face label corresponding to the face picture in the video to be dubbed; selecting, from an acoustic model repository, the acoustic model corresponding to the face label, the acoustic model comprising a plurality of voice labels; determining a speech characteristic parameter corresponding to each of the plurality of voice labels; and using the speech characteristic parameter corresponding to each voice label to synthesize a voice for the character corresponding to the face picture in the video to be dubbed, thereby improving the accuracy of dubbing.
Description
Technical field
The present invention relates to the field of computers, and more particularly to a speech synthesis method, apparatus, computer device and storage medium.
Background art
At present, with the continuous development of new media, self-media outlets suited to the Internet have gradually emerged. These self-media outlets usually produce simple dubbed videos to entertain the public. In such videos, because of production cost, the dubbing of the characters is generally obtained by speech synthesis technology. Since current speech synthesis technology offers only one or two simple timbres, the characters easily lack distinctiveness, and a character's face and voice do not match, or match poorly, so that the accuracy of the dubbing is low.
Summary of the invention
Embodiments of the present invention provide a speech synthesis method, apparatus, computer device and storage medium, so as to improve the accuracy of dubbing.
A speech synthesis method, comprising:
obtaining a face picture from a video to be dubbed;
extracting facial features of the face picture;
determining, according to the facial features, a face label corresponding to the face picture in the video to be dubbed;
selecting, from an acoustic model repository, an acoustic model corresponding to the face label, the acoustic model comprising a plurality of voice labels;
determining a speech characteristic parameter corresponding to each of the plurality of voice labels; and
using the speech characteristic parameter corresponding to each voice label to synthesize a voice for the character corresponding to the face picture in the video to be dubbed.
A speech synthesis apparatus, comprising:
a first obtaining module, configured to obtain a face picture from a video to be dubbed;
a first extraction module, configured to extract facial features of the face picture;
a first determining module, configured to determine, according to the facial features, a face label corresponding to the face picture in the video to be dubbed;
a selection module, configured to select, from an acoustic model repository, an acoustic model corresponding to the face label, the acoustic model comprising a plurality of voice labels;
a second determining module, configured to determine a speech characteristic parameter corresponding to each of the plurality of voice labels; and
a synthesis module, configured to use the speech characteristic parameter corresponding to each voice label to synthesize a voice for the character corresponding to the face picture in the video to be dubbed.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above speech synthesis method when executing the computer program.
A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above speech synthesis method.
In the above speech synthesis method, apparatus, computer device and storage medium, a face picture is obtained from the video to be dubbed; the facial features of the face picture are analyzed to obtain the corresponding face label; an acoustic model is then selected from the acoustic model repository according to the face label; and a voice is synthesized using the speech characteristic parameters of that acoustic model. Because different speech characteristic parameters are obtained for different faces, the facial features of a character in the video can be recognized and an acoustic model that fits those facial features can be matched to the character. The association between the dubbing and the character is thereby strengthened, the match between face and voice is improved, mismatched dubbing is avoided, and the accuracy of the dubbing is improved.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative labor.
Fig. 1 is a schematic diagram of an application environment of the speech synthesis method in an embodiment of the present invention;
Fig. 2 is a diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 3 is another diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 4 is another diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 5 is another diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 6 is another diagram of the speech synthesis method in an embodiment of the present invention;
Fig. 7 is a diagram of the speech synthesis apparatus in an embodiment of the present invention;
Fig. 8 is another diagram of the speech synthesis apparatus in an embodiment of the present invention;
Fig. 9 is another diagram of the speech synthesis apparatus in an embodiment of the present invention;
Fig. 10 is a diagram of the computer device in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, shall fall within the protection scope of the present invention.
The speech synthesis method provided by an embodiment of the present invention can be applied in the application environment shown in Fig. 1, in which a terminal device communicates with a server over a network. After obtaining the video to be dubbed, the terminal device transmits the face pictures in the video to the server. Upon receiving a face picture, the server extracts its facial features, determines the face label corresponding to the picture, selects an acoustic model from the acoustic model repository according to that label, and finally synthesizes the voice of the character corresponding to the face picture in the video to be dubbed. The terminal device may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer or portable wearable device. The server may be implemented as an independent server or as a cluster of multiple servers.
In one embodiment, as shown in Fig. 2, a speech synthesis method is provided. Taking its application to the server in Fig. 1 as an example, the method includes the following steps:
S10: obtain a face picture from the video to be dubbed.
In this embodiment, the face picture is the picture of a face that appears in the video to be dubbed. To ensure the accuracy of the subsequent facial feature extraction, the facial organs and the outer contour of the face should be clearly visible in the face picture.
S20: extract the facial features of the face picture.
Here, facial features are key features reflecting face information, such as the geometric features of the face image (e.g. feature points of the facial organs and of the facial contour) and its gray-level features (e.g. skin color), and are used to recognize the face image.
Preferably, in this embodiment the geometric features include the feature points obtained by locating key points of the facial organs and key points of the facial contour. Specifically, the facial features can be obtained with the ASM (Active Shape Model) landmark localization algorithm, which builds a universal model of global face appearance and is robust to local image damage, but is computationally expensive and requires many iterative steps. Alternatively, the AAM (Active Appearance Model) landmark localization algorithm can be used, which treats landmark localization directly as a regression task and computes the feature-point coordinates with a global regressor. Facial landmark localization remains a challenging task, because facial expression, pose and illumination vary widely, and feature points at different facial locations differ in localization difficulty, so it is hard to guarantee accuracy with a single model. Therefore, to overcome these problems, a coarse-to-fine CNN landmark localization algorithm can be used to obtain the facial features.
S30: determine, according to the facial features, the face label corresponding to the face picture in the video to be dubbed.
Here, a face label is a label that classifies the face picture according to its facial features. For example, face labels may include a girl baby face label, a young woman face label, a grandmother face label, a boy baby face label, a young man face label, an uncle face label, a grandfather face label, and so on. It should be noted that these labels are listed only to facilitate understanding of this embodiment and do not limit the face labels of this embodiment.
S40: select the acoustic model corresponding to the face label from the acoustic model repository, the acoustic model comprising a plurality of voice labels.
Here, the voice labels include labels for pitch, timbre, speaking rate, intensity and so on. The acoustic model repository contains multiple acoustic models; each acoustic model contains multiple voice labels, and each voice label corresponds to a speech characteristic parameter that characterizes the speech feature of that voice label. Each acoustic model in the repository is set according to a face label, and different face labels correspond to different acoustic models. For example, compared with other face labels, in the acoustic model corresponding to the young woman face label the speech characteristic parameter for the pitch label is higher, indicating that the pitch corresponding to the young woman face label is higher; in the acoustic model corresponding to the grandmother face label the parameter for the speaking-rate label is lower, indicating that the corresponding speaking rate is slower.
Specifically, the acoustic model corresponding to the face label is selected from the preset acoustic model repository according to the face label.
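As a minimal sketch of this selection step, the repository can be a plain mapping from face labels to acoustic models; the label names, voice labels and parameter intervals below are illustrative assumptions, since the patent specifies no concrete values:

```python
# Hypothetical acoustic model repository: each face label maps to an acoustic
# model, here a dict of voice labels -> (low, high) parameter intervals.
# All labels and numbers are invented for illustration.
ACOUSTIC_MODEL_REPOSITORY = {
    "young_woman_face": {"pitch_hz": (200.0, 300.0), "rate_wps": (4.0, 5.5)},
    "grandmother_face": {"pitch_hz": (120.0, 200.0), "rate_wps": (2.5, 3.5)},
}

def select_acoustic_model(face_label):
    """Step S40: select the acoustic model corresponding to a face label."""
    return ACOUSTIC_MODEL_REPOSITORY[face_label]
```

For example, `select_acoustic_model("grandmother_face")` returns the model whose speaking-rate interval is lower than that of the young woman model, mirroring the pitch and rate comparison described above.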
S50: determine the speech characteristic parameter corresponding to each of the plurality of voice labels.
Here, the speech characteristic parameters are parameters such as the pitch, timbre, speaking rate and intensity of the voice.
S60: use the speech characteristic parameter corresponding to each voice label to synthesize a voice for the character corresponding to the face picture in the video to be dubbed.
Specifically, a speech characteristic parameter in an acoustic model lies within a numerical interval, whereas synthesizing a voice requires a definite value. Therefore, a value can be picked at random from the numerical interval of the acoustic model's speech characteristic parameter and used as the value required for synthesizing the voice.
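The random choice of a definite value from a parameter interval could be sketched as follows; the interval bounds used in the example are assumptions:

```python
import random

def pick_parameter(interval):
    """Pick a definite value from a speech-characteristic-parameter
    interval, as described for step S60. Any value within the interval
    is an acceptable synthesis parameter."""
    low, high = interval
    return random.uniform(low, high)
```

For instance, `pick_parameter((80.0, 300.0))` yields one concrete pitch value between 80 Hz and 300 Hz for the synthesizer.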
The server performs speech synthesis according to these speech characteristic parameters, then outputs a voice file and returns it to the terminal device. The format of the voice file depends on the requirements; for example, it can be .mp3 or .wvm, without limitation here.
In this embodiment, synthesizing the voice can mean synthesizing a segment of speech corresponding to a speech text using the speech characteristic parameters, or applying voice conversion to an existing speech segment according to the speech characteristic parameters. Specifically, the speech text corresponding to the video to be dubbed can be obtained, and the speech characteristic parameters are then used to synthesize the voice corresponding to that speech text.
In this embodiment, facial features can, to some extent, embody a person's character, and a voice that does not correspond to the character gives the audience an out-of-character impression. Therefore, the face picture in the video to be dubbed is obtained, its facial features are analyzed to obtain the corresponding face label, an acoustic model is selected from the acoustic model repository according to that label, and a voice is synthesized with the speech characteristic parameters of the acoustic model. Different speech characteristic parameters are obtained for different faces, and the facial features of a character in the video can be recognized, so that each character is matched with an acoustic model that fits its facial features. The association between dubbing and character is thereby strengthened, the match between face and voice is improved, mismatches are avoided, and the accuracy of the dubbing is improved.
In one embodiment, as shown in Fig. 3, obtaining the acoustic model repository specifically comprises the following steps:
S70: obtain multiple face samples and the multiple speech samples corresponding to them.
In this embodiment, a face sample is an image containing all the facial organs of a face, and a speech sample is a segment of speech. Each face sample is associated with a speech sample, i.e. the speech sample is uttered (spoken) by the person the face sample depicts.
For example, to obtain an acoustic model repository that fits facial features well, face samples and their corresponding speech samples can be extracted from a large number of videos containing both speech and faces. In certain scenes, the sound in the video (the speech sample) is spoken by the person corresponding to the face (the face sample), so these face samples and speech samples have a definite association.
S80: extract the facial features of the face samples.
Here, to guarantee the accuracy of the extracted facial features, a pre-trained coarse-to-fine CNN can be used to extract the facial features of the face samples.
S90: determine the face label corresponding to the facial features.
Specifically, it can be judged whether the feature value of the facial features falls within the preset feature-value interval of any label; if it falls within the interval of some preset label, that label is determined to be the face label corresponding to the facial features.
S100: extract the speech features of the speech samples, the speech features comprising multiple speech characteristic parameters.
In this embodiment, the speech features are features of the speech samples such as pitch, timbre quality, speaking rate and intensity. Specifically, a Python audio library (such as Audiolab) can be used to extract the speech features of a speech sample; in operation, the speech sample only needs to be passed to the library as a parameter. Alternatively, Matlab can be used to draw the spectrogram of the speech sample and then analyze it to obtain the speech features. However, considering the amount of speech-sample data and the simplicity of operation, this scheme preferably uses a Python audio library for speech feature extraction.
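The text names only "a Python audio library" and does not fix an extraction algorithm. As one hedged illustration of extracting a single speech feature, a rough pitch estimate can be computed by autocorrelation over raw samples; a real system would use a dedicated library instead:

```python
def estimate_pitch_hz(samples, sample_rate):
    """Rough pitch estimate via autocorrelation, searching only lags that
    correspond to 80-500 Hz. A stand-in for step S100's feature
    extraction; the method and range are illustrative assumptions."""
    mean = sum(samples) / len(samples)
    x = [s - mean for s in samples]  # remove DC offset
    lo, hi = sample_rate // 500, sample_rate // 80

    def corr(lag):
        # Autocorrelation of the signal with itself shifted by `lag`.
        return sum(a * b for a, b in zip(x, x[lag:]))

    best_lag = max(range(lo, hi), key=corr)
    return sample_rate / best_lag
```

Applied to a pure 200 Hz tone, the estimate recovers a pitch near 200 Hz, which would then be mapped to a voice label in step S110.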
S110: determine the multiple voice labels corresponding to the multiple speech characteristic parameters.
In this embodiment, since the speech features of each speech sample include multiple speech characteristic parameters, a voice label is determined for each speech characteristic parameter, i.e. each speech characteristic parameter corresponds to one voice label. A specific method is to first determine the interval in which each speech characteristic parameter lies, and then determine the voice label from that interval. Taking the pitch parameter as an example: based on step S100, the pitch of a speech sample is extracted as 100 Hz; given the preset pitch intervals (high pitch [300 Hz, 500 Hz], medium pitch [80 Hz, 300 Hz), low pitch [0 Hz, 80 Hz)), the pitch falls in the medium-pitch interval, so the pitch label of the speech sample is determined to be medium pitch.
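The interval-to-label mapping in the pitch example above can be sketched directly from the intervals the text gives:

```python
# Preset pitch intervals from the example above:
# low [0, 80) Hz, medium [80, 300) Hz, high [300, 500] Hz.
PITCH_BANDS = [("low pitch", 0.0, 80.0),
               ("medium pitch", 80.0, 300.0),
               ("high pitch", 300.0, 500.0)]

def pitch_label(pitch_hz):
    """Step S110 for the pitch parameter: map a pitch value to its label."""
    for label, low, high in PITCH_BANDS:
        if low <= pitch_hz < high:
            return label
    if pitch_hz == 500.0:  # the high band is closed at its upper end
        return "high pitch"
    raise ValueError("pitch outside all preset intervals")
```

A 100 Hz sample gets "medium pitch", matching the worked example in the text.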
S120: generate the acoustic model repository according to the multiple face labels and multiple voice labels.
In one embodiment, as shown in Fig. 4, step S120, generating the acoustic model repository according to the multiple face labels and multiple voice labels, specifically comprises the following steps:
S121: count the face labels and voice labels corresponding to the multiple face samples to obtain the association between face labels and voice labels, the association being used to match each class of face label with the voice label having the highest probability of co-occurrence.
Specifically, the number of occurrences of each voice label corresponding to each class of face label can be counted, and the voice label with the highest probability (i.e. the most occurrences) for each class is taken as the voice label associated with that face label.
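The counting just described amounts to keeping, for each face label, its most frequent co-occurring voice label. A minimal sketch with made-up sample pairs:

```python
from collections import Counter

def associate_labels(sample_pairs):
    """Steps S121-S122: for each face label, pick the voice label that
    co-occurs with it most often across (face_label, voice_label) pairs
    drawn from the training samples."""
    counts = {}
    for face_label, voice_label in sample_pairs:
        counts.setdefault(face_label, Counter())[voice_label] += 1
    return {face: c.most_common(1)[0][0] for face, c in counts.items()}
```

With pairs in which "grandmother_face" co-occurs twice with "low pitch" and once with "medium pitch", the function associates "grandmother_face" with "low pitch".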
S122: determine the voice label corresponding to each face label according to the association.
S123: generate the acoustic model corresponding to each face label according to the voice labels corresponding to it.
Here, an acoustic model is a model that includes multiple speech characteristic parameters. For example, the girl baby face label corresponds to a girl-baby-face acoustic model.
Generating the acoustic model corresponding to each face label provides support for later selecting the acoustic model corresponding to a face label.
S124: generate the acoustic model repository from the acoustic models corresponding to all the face labels.
In this embodiment, the face labels and voice labels corresponding to the multiple face samples are counted to obtain the association between face labels and voice labels; the voice label corresponding to each face label is determined according to that association; the acoustic model corresponding to each face label is generated from its voice labels; and the acoustic model repository is generated from the acoustic models corresponding to all the face labels. This increases the association between face and voice, so that face shape and voice fit together better. When speech is later synthesized for a face image in a video to be dubbed, the synthesized voice better matches the character's image and is less likely to give the audience an out-of-character impression.
In one embodiment, as shown in Fig. 5, step S80, extracting the facial features of a face sample, comprises the following steps:
S81: extract multiple outer-contour feature points and multiple organ feature points from each face sample.
Here, outer-contour feature points are the feature points of the outer contour of the face; the facial organs comprise the eyes, ears, mouth, nose and eyebrows, and organ feature points are the feature points on the left and right eyes, the eyebrows, the nose and the mouth.
As a preferred option of this embodiment, multiple models can be used to locate feature points at different facial locations: the face is divided into organ feature points and outer-contour feature points, which are located separately. Organ feature points are the feature points on the facial organs, and outer-contour feature points are those on the outer contour of the face. In this embodiment, a coarse-to-fine CNN landmark localization algorithm can be used to obtain the organ feature points and outer-contour feature points of each face image, and thus the facial features.
Specifically, the DCNN model is divided into two parallel cascaded CNN groups. One group is a four-stage cascaded CNN for obtaining the organ feature points of the face (for example, 51 organ feature points). Stage 1 locates the minimum bounding box of the organ feature points, i.e. the smallest image region enclosing all five organ parts (left and right eyes, eyebrows, nose, mouth) on the face image. Stage 2 feeds the minimum bounding box into a CNN to roughly estimate the positions of the feature points. Stage 3 crops the image of each organ out of the minimum bounding box and feeds it into this stage's CNN to estimate the positions of the organ feature points more accurately. Stage 4 applies rotation correction to each organ image and precisely locates each organ feature point after correction, yielding the multiple organ feature points. The other group is a two-stage cascaded CNN: stage 1 locates the minimum bounding box of the outer-contour feature points of the face (for example, 17 contour feature points), i.e. the smallest image region enclosing the outer contour of the face image, and stage 2 estimates the accurate positions of the outer-contour feature points from that bounding box, yielding the multiple contour feature points. The reason stage 1 of both parallel groups locates a minimum bounding box is that, when prior knowledge is insufficient, a traditional DCNN wastes most of the convolutional network's capacity on finding the face, which reduces the efficiency of landmark localization and in turn the efficiency of facial feature acquisition.
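The 17 + 51 split of contour versus organ feature points described above can be illustrated with a small helper; the point ordering is an assumption, since the patent does not fix an indexing convention (common 68-point schemes put the jawline contour first):

```python
def split_landmarks(points):
    """Split 68 detected landmarks into 17 outer-contour points and
    51 organ points, matching the 17 + 51 counts given in the text.
    Assumes contour points come first in the list (an assumption)."""
    if len(points) != 68:
        raise ValueError("expected 68 landmark points")
    return points[:17], points[17:]
```

Downstream steps (S82 onward) would then connect each group of points into contours.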
S82: connect all the outer-contour feature points of each face sample to obtain the outer contour of the face image.
S83: connect the organ feature points of the same organ type in each face sample to obtain the contour of each organ of the face image.
Here, connecting the organ feature points of the same organ type means connecting the feature points of the left eye, the right eye, the mouth, the nose and the eyebrows separately. The organ contours are the left-eye contour, right-eye contour, mouth contour, nose contour and eyebrow contour.
S84: take the outer contour of each face sample and the contour of each organ as the facial features corresponding to the face sample.
In one embodiment, as shown in FIG. 6, step S90 (determining the face label corresponding to the face characteristic) includes the following steps:
S91: Calculate the arc of the outer contour of the face sample.
S92: Calculate, according to each facial-feature contour, the spacing between the facial features and the length and width of each facial feature.
Here, the spacing between the facial features refers to the pairwise spacing among the left and right eyes, the mouth, the nose and the eyebrows, for example the spacing between the eyes and the eyebrows, between the left and right eyes, between the two eyebrows, and between the nose and the mouth. The width of a facial feature refers to its maximum width, such as the maximum width of the left eye, the right eye, the mouth and the nose. The length of a facial feature refers to its maximum length, such as the maximum length of the left eye, the right eye, the mouth, the nose and the eyebrows.
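These spacing, width and length measurements can be derived directly from the feature contours. A minimal sketch follows; centroid-to-centroid distance is an assumed definition of "spacing", and the horizontal/vertical extents are assumed definitions of width and length, since the patent does not fix the exact formulas:

```python
from itertools import combinations
from math import dist  # Python 3.8+

def feature_metrics(contours):
    """Pairwise centroid spacing plus the maximum width (horizontal extent)
    and length (vertical extent) of each feature contour."""
    centroids, widths, lengths = {}, {}, {}
    for name, pts in contours.items():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        centroids[name] = (sum(xs) / len(xs), sum(ys) / len(ys))
        widths[name] = max(xs) - min(xs)
        lengths[name] = max(ys) - min(ys)
    spacing = {(a, b): dist(centroids[a], centroids[b])
               for a, b in combinations(sorted(centroids), 2)}
    return spacing, widths, lengths
```

The resulting dictionaries supply the quantities consumed by step S93 below.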
S93: Determine the face label corresponding to the face characteristic according to the arc of the outer contour, the spacing between the facial features, and the length and width of each facial feature.
The principle for determining the face label corresponding to the face characteristic is to classify according to the arc of the face's outer contour, the width and length of the facial features, and the spacing between them. For example, when the arc of the outer contour, the width and length of each facial feature, and the spacing between the facial features all reach the preset thresholds for a female baby face (the outer-contour arc threshold, the feature-width threshold, the feature-length threshold and the feature-spacing threshold), the face picture corresponding to that face characteristic is classified as a female baby face and assigned the label "female baby face".
It should be noted that, since the face label is determined by the face characteristic, and the face characteristic is usually embodied in the arc of the face's outer contour, the spacing between the facial features, and the length and width of each facial feature, this embodiment calculates these quantities for the face sample and uses them to confirm the corresponding face shape, the size of the facial features and the spacing between them, so that the face label corresponding to the face sample can be determined.
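The threshold-based classification in S93 can be sketched as below. All names and threshold values are invented for illustration, and the comparison directions (at least / at most) are assumptions, since the patent only says the measurements must "reach" the preset thresholds:

```python
def face_label(arc, widths, lengths, spacings, thresholds):
    """Assign the first label whose preset thresholds are all satisfied
    by the measured face quantities (illustrative rule)."""
    for label, t in thresholds.items():
        if (arc >= t["arc"]
                and all(w <= t["max_width"] for w in widths)
                and all(l <= t["max_length"] for l in lengths)
                and all(s <= t["max_spacing"] for s in spacings)):
            return label
    return "default"

# Hypothetical thresholds for one label class
thresholds = {"female_baby_face":
              {"arc": 1.2, "max_width": 30, "max_length": 40, "max_spacing": 25}}
```

A face sample whose measurements fall inside these bounds would then be assigned the "female_baby_face" label; anything else falls through to a default.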
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not constitute any limitation on the implementation process of the embodiments of the present invention.
In one embodiment, a speech synthesis device is provided, which corresponds one-to-one to the speech synthesis method in the above embodiments. As shown in FIG. 7, the speech synthesis device includes a first acquisition module 10, a first extraction module 20, a first determining module 30, a selection module 40, a second determining module 50 and a synthesis module 60. Each functional module is described in detail as follows:
First acquisition module 10, configured to acquire a face picture in the audio-video to be dubbed;
First extraction module 20, configured to extract the face characteristic of the face picture;
First determining module 30, configured to determine, according to the face characteristic, the face label corresponding to the face picture in the audio-video to be dubbed;
Selection module 40, configured to select the acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels;
Second determining module 50, configured to determine the speech characteristic parameter corresponding to each of the multiple voice labels;
Synthesis module 60, configured to synthesize speech for the role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
Preferably, in one embodiment, as shown in FIG. 8, the acoustic model repository is obtained by the following modules:
Second acquisition module 70, configured to acquire multiple face samples and the multiple speech samples corresponding to the multiple face samples;
Second extraction module 80, configured to extract the face characteristic of each face sample;
Third determining module 90, configured to determine the face label corresponding to the face characteristic;
Third extraction module 100, configured to extract the speech feature of each speech sample, the speech feature including multiple speech characteristic parameters;
Fourth determining module 110, configured to determine the multiple voice labels corresponding to the multiple speech characteristic parameters;
Generation module 120, configured to generate the acoustic model repository according to the multiple face labels and the multiple voice labels.
Preferably, in one embodiment, as shown in FIG. 9, the generation module 120 includes:
Statistics unit 121, configured to count the numbers of the face labels and voice labels corresponding to the multiple face samples, so as to obtain the correlation between the face labels and the voice labels, the correlation being used to match each class of face label with the voice label having the highest probability of co-occurrence;
First determining unit 122, configured to determine the voice label corresponding to each face label according to the correlation;
First generation unit 123, configured to generate the acoustic model corresponding to each face label according to the voice label corresponding to that face label;
Second generation unit 124, configured to generate the acoustic model repository according to the acoustic models corresponding to all the face labels.
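The counting performed by the statistics unit 121 amounts to picking, for each face label, the voice label that co-occurs with it most often in the training pairs. A minimal sketch (function and label names are invented for illustration):

```python
from collections import Counter, defaultdict

def match_voice_labels(pairs):
    """For each face label, pick the voice label that co-occurs with it
    most often among the (face_label, voice_label) sample pairs."""
    counts = defaultdict(Counter)
    for face, voice in pairs:
        counts[face][voice] += 1
    return {face: c.most_common(1)[0][0] for face, c in counts.items()}

pairs = [("female_baby_face", "sweet"), ("female_baby_face", "sweet"),
         ("female_baby_face", "deep"), ("square_face", "deep")]
mapping = match_voice_labels(pairs)
# {'female_baby_face': 'sweet', 'square_face': 'deep'}
```

The resulting mapping is what the first generation unit 123 would use to build one acoustic model per face label.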
Optionally, in one embodiment, the first extraction module 20 includes:
An extraction unit, configured to extract multiple contour feature points and multiple facial-feature points from each face sample;
A first connection unit, configured to connect all the contour feature points of each face sample to obtain the outer contour of the face image;
A second connection unit, configured to connect the facial-feature points of the same feature type of each face sample to obtain the contour of each facial feature of the face image;
A second determining unit, configured to take the outer contour of each face sample and the contour of each facial feature as the face characteristic corresponding to the face sample.
Optionally, in one embodiment, the third determining module 90 includes:
A first calculation unit, configured to calculate the arc of the outer contour of the face sample;
A second calculation unit, configured to calculate, according to each facial-feature contour, the spacing between the facial features and the length and width of each facial feature;
A third determining unit, configured to determine the face label corresponding to the face characteristic according to the arc of the outer contour, the spacing between the facial features, and the length and width of each facial feature.
For specific limitations on the speech synthesis device, reference may be made to the limitations on the speech synthesis method above, which are not repeated here. Each module in the above speech synthesis device may be implemented in whole or in part by software, hardware or a combination thereof. The modules may be embedded in, or independent of, the processor in a computer device in hardware form, or stored in software form in the memory of the computer device, so that the processor can invoke them to execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided; the computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface and a database connected via a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores the data required by the speech synthesis method. The network interface of the computer device communicates with external terminals via a network connection. When executed by the processor, the computer program implements a speech synthesis method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when executing the computer program, the processor implements the following steps:
acquiring a face picture in the audio-video to be dubbed;
extracting the face characteristic of the face picture;
determining, according to the face characteristic, the face label corresponding to the face picture in the audio-video to be dubbed;
selecting the acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels;
determining the speech characteristic parameter corresponding to each of the multiple voice labels;
synthesizing speech for the role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the following steps:
acquiring a face picture in the audio-video to be dubbed;
extracting the face characteristic of the face picture;
determining, according to the face characteristic, the face label corresponding to the face picture in the audio-video to be dubbed;
selecting the acoustic model corresponding to the face label from an acoustic model repository, the acoustic model including multiple voice labels;
determining the speech characteristic parameter corresponding to each of the multiple voice labels;
synthesizing speech for the role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
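The six steps above could be orchestrated as in the following sketch. This is an illustration, not the patent's implementation: the repository layout, function names and parameter values are all assumptions, and the feature-extraction, labeling and synthesis stages are passed in as callables.

```python
def synthesize_for_face(face_picture, extract_features, determine_label,
                        model_repository, synthesize):
    """Features -> face label -> acoustic model -> per-voice-label
    speech characteristic parameters -> synthesized speech."""
    features = extract_features(face_picture)
    label = determine_label(features)
    model = model_repository[label]          # acoustic model holds voice labels
    params = [model["params"][v] for v in model["voice_labels"]]
    return synthesize(params)

# Hypothetical repository with one acoustic model
repo = {"female_baby_face": {"voice_labels": ["pitch", "timbre"],
                             "params": {"pitch": 220.0, "timbre": "bright"}}}
out = synthesize_for_face("frame.png", lambda p: "features",
                          lambda f: "female_baby_face", repo, lambda ps: ps)
```

Here the stub synthesizer simply returns the parameter list; a real one would feed the parameters to a vocoder.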
In the above speech synthesis method, device, computer equipment and storage medium, a face picture in the audio-video to be dubbed is acquired; the face characteristic of the face picture is analyzed to obtain the corresponding face label; an acoustic model is then selected from the acoustic model repository according to the face label; and speech is synthesized using the speech characteristic parameters of the acoustic model. Different speech characteristic parameters can thus be obtained according to the face, and speech synthesized according to those parameters, so that the roles in an entertainment video can be distinguished. This supports the multi-role case, gives each character's dubbing a relevance to that character, and makes the dubbing of multiple roles more varied, thereby improving the dubbing effect.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by instructing relevant hardware through a computer program; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM), etc.
It is apparent to those skilled in the art that, for convenience and brevity of description, only the division into the above functional units and modules is illustrated. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are merely illustrative of the technical solutions of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all be included within the protection scope of the present invention.
Claims (10)
1. A speech synthesis method, characterized by comprising:
acquiring a face picture in an audio-video to be dubbed;
extracting a face characteristic of the face picture;
determining, according to the face characteristic, a face label corresponding to the face picture in the audio-video to be dubbed;
selecting an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model comprising multiple voice labels;
determining a speech characteristic parameter corresponding to each of the multiple voice labels;
synthesizing speech for a role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
2. The speech synthesis method according to claim 1, wherein the acoustic model repository is obtained as follows:
acquiring multiple face samples and multiple speech samples corresponding to the multiple face samples;
extracting a face characteristic of each face sample;
determining a face label corresponding to the face characteristic;
extracting a speech feature of each speech sample, the speech feature comprising multiple speech characteristic parameters;
determining multiple voice labels corresponding to the multiple speech characteristic parameters;
generating the acoustic model repository according to the multiple face labels and the multiple voice labels.
3. The speech synthesis method according to claim 2, wherein generating the acoustic model repository according to the multiple face labels and the multiple voice labels comprises:
counting the numbers of the face labels and voice labels corresponding to the multiple face samples to obtain a correlation between the face labels and the voice labels, the correlation being used to match each class of face label with the voice label having the highest probability of co-occurrence;
determining the voice label corresponding to each face label according to the correlation;
generating the acoustic model corresponding to each face label according to the voice label corresponding to that face label;
generating the acoustic model repository according to the acoustic models corresponding to all the face labels.
4. The speech synthesis method according to claim 2, wherein extracting the face characteristic of the face sample comprises:
extracting multiple contour feature points and multiple facial-feature points from each face sample;
connecting all the contour feature points of each face sample to obtain an outer contour of the face image;
connecting the facial-feature points of the same feature type of each face sample to obtain a contour of each facial feature of the face image;
taking the outer contour of each face sample and the contour of each facial feature as the face characteristic corresponding to the face sample.
5. The speech synthesis method according to claim 4, wherein determining the face label corresponding to the face characteristic comprises:
calculating an arc of the outer contour of the face sample;
calculating, according to each facial-feature contour, the spacing between the facial features and the length and width of each facial feature;
determining the face label corresponding to the face characteristic according to the arc of the outer contour, the spacing between the facial features, and the length and width of each facial feature.
6. A speech synthesis device, characterized by comprising:
a first acquisition module, configured to acquire a face picture in an audio-video to be dubbed;
a first extraction module, configured to extract a face characteristic of the face picture;
a first determining module, configured to determine, according to the face characteristic, a face label corresponding to the face picture in the audio-video to be dubbed;
a selection module, configured to select an acoustic model corresponding to the face label from an acoustic model repository, the acoustic model comprising multiple voice labels;
a second determining module, configured to determine a speech characteristic parameter corresponding to each of the multiple voice labels;
a synthesis module, configured to synthesize speech for a role corresponding to the face picture in the audio-video to be dubbed, using the speech characteristic parameter corresponding to each voice label.
7. The speech synthesis device according to claim 6, wherein the acoustic model repository is obtained by the following modules:
a second acquisition module, configured to acquire multiple face samples and multiple speech samples corresponding to the multiple face samples;
a second extraction module, configured to extract a face characteristic of each face sample;
a third determining module, configured to determine a face label corresponding to the face characteristic;
a third extraction module, configured to extract a speech feature of each speech sample, the speech feature comprising multiple speech characteristic parameters;
a fourth determining module, configured to determine multiple voice labels corresponding to the multiple speech characteristic parameters;
a generation module, configured to generate the acoustic model repository according to the multiple face labels and the multiple voice labels.
8. The speech synthesis device according to claim 7, wherein the generation module comprises:
a statistics unit, configured to count the numbers of the face labels and voice labels corresponding to the multiple face samples to obtain a correlation between the face labels and the voice labels, the correlation being used to match each class of face label with the voice label having the highest probability of co-occurrence;
a first determining unit, configured to determine the voice label corresponding to each face label according to the correlation;
a first generation unit, configured to generate the acoustic model corresponding to each face label according to the voice label corresponding to that face label;
a second generation unit, configured to generate the acoustic model repository according to the acoustic models corresponding to all the face labels.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the speech synthesis method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 5.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910602385.1A CN110459200A (en) | 2019-07-05 | 2019-07-05 | Phoneme synthesizing method, device, computer equipment and storage medium |
PCT/CN2020/085572 WO2021004113A1 (en) | 2019-07-05 | 2020-04-20 | Speech synthesis method and apparatus, computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910602385.1A CN110459200A (en) | 2019-07-05 | 2019-07-05 | Phoneme synthesizing method, device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110459200A true CN110459200A (en) | 2019-11-15 |
Family
ID=68482140
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910602385.1A Pending CN110459200A (en) | 2019-07-05 | 2019-07-05 | Phoneme synthesizing method, device, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110459200A (en) |
WO (1) | WO2021004113A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113010138B (en) * | 2021-03-04 | 2023-04-07 | 腾讯科技(深圳)有限公司 | Article voice playing method, device and equipment and computer readable storage medium |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0887289A (en) * | 1994-09-19 | 1996-04-02 | Fujitsu Ltd | Voice rule synthesis device |
WO1999066495A1 (en) * | 1998-06-14 | 1999-12-23 | Nissim Cohen | Voice character imitator system |
CN104485100A (en) * | 2014-12-18 | 2015-04-01 | 天津讯飞信息科技有限公司 | Text-to-speech pronunciation person self-adaptive method and system |
CN104809923A (en) * | 2015-05-13 | 2015-07-29 | 苏州清睿信息技术有限公司 | Self-complied and self-guided method and system for generating intelligent voice communication |
CN105096932A (en) * | 2015-07-14 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and apparatus of talking book |
CN106531148A (en) * | 2016-10-24 | 2017-03-22 | 咪咕数字传媒有限公司 | Cartoon dubbing method and apparatus based on voice synthesis |
CN106548772A (en) * | 2017-01-16 | 2017-03-29 | 上海智臻智能网络科技股份有限公司 | Speech recognition test system and method |
CN106648082A (en) * | 2016-12-09 | 2017-05-10 | 厦门快商通科技股份有限公司 | Intelligent service device capable of simulating human interactions and method |
CN107172449A (en) * | 2017-06-19 | 2017-09-15 | 微鲸科技有限公司 | Multi-medium play method, device and multimedia storage method |
CN107358949A (en) * | 2017-05-27 | 2017-11-17 | 芜湖星途机器人科技有限公司 | Robot sounding automatic adjustment system |
JP2018097185A (en) * | 2016-12-14 | 2018-06-21 | パナソニックIpマネジメント株式会社 | Voice dialogue device, voice dialogue method, voice dialogue program and robot |
CN108735211A (en) * | 2018-05-16 | 2018-11-02 | 智车优行科技(北京)有限公司 | Method of speech processing, device, vehicle, electronic equipment, program and medium |
CN108744521A (en) * | 2018-06-28 | 2018-11-06 | 网易(杭州)网络有限公司 | The method and device of game speech production, electronic equipment, storage medium |
CN109391842A (en) * | 2018-11-16 | 2019-02-26 | 维沃移动通信有限公司 | A kind of dubbing method, mobile terminal |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6839672B1 (en) * | 1998-01-30 | 2005-01-04 | At&T Corp. | Integration of talking heads and text-to-speech synthesizers for visual TTS |
US7133535B2 (en) * | 2002-12-21 | 2006-11-07 | Microsoft Corp. | System and method for real time lip synchronization |
US9607609B2 (en) * | 2014-09-25 | 2017-03-28 | Intel Corporation | Method and apparatus to synthesize voice based on facial structures |
CN105931631A (en) * | 2016-04-15 | 2016-09-07 | 北京地平线机器人技术研发有限公司 | Voice synthesis system and method |
CN107507620A (en) * | 2017-09-25 | 2017-12-22 | 广东小天才科技有限公司 | A kind of voice broadcast sound method to set up, device, mobile terminal and storage medium |
CN110459200A (en) * | 2019-07-05 | 2019-11-15 | 深圳壹账通智能科技有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
- 2019-07-05: CN application CN201910602385.1A filed (CN, status: pending)
- 2020-04-20: PCT application PCT/CN2020/085572 filed (WO, application filing)
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021004113A1 (en) * | 2019-07-05 | 2021-01-14 | 深圳壹账通智能科技有限公司 | Speech synthesis method and apparatus, computer device and storage medium |
CN111583903A (en) * | 2020-04-28 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Speech synthesis method, vocoder training method, device, medium, and electronic device |
CN111583903B (en) * | 2020-04-28 | 2021-11-05 | 北京字节跳动网络技术有限公司 | Speech synthesis method, vocoder training method, device, medium, and electronic device |
CN113345422A (en) * | 2021-04-23 | 2021-09-03 | 北京巅峰科技有限公司 | Voice data processing method, device, equipment and storage medium |
CN113345422B (en) * | 2021-04-23 | 2024-02-20 | 北京巅峰科技有限公司 | Voice data processing method, device, equipment and storage medium |
CN117641019A (en) * | 2023-12-01 | 2024-03-01 | 广州一千零一动漫有限公司 | Audio matching verification method and system based on animation video |
CN117641019B (en) * | 2023-12-01 | 2024-05-24 | 广州一千零一动漫有限公司 | Audio matching verification method and system based on animation video |
Also Published As
Publication number | Publication date |
---|---|
WO2021004113A1 (en) | 2021-01-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110459200A (en) | Phoneme synthesizing method, device, computer equipment and storage medium | |
US11741940B2 (en) | Text and audio-based real-time face reenactment | |
US8125485B2 (en) | Animating speech of an avatar representing a participant in a mobile communication | |
CN110390704B (en) | Image processing method, image processing device, terminal equipment and storage medium | |
CN110163054B (en) | Method and device for generating human face three-dimensional image | |
US9082400B2 (en) | Video generation based on text | |
WO2018049979A1 (en) | Animation synthesis method and device | |
CN108958610A (en) | Special efficacy generation method, device and electronic equipment based on face | |
CN108346427A (en) | A kind of audio recognition method, device, equipment and storage medium | |
KR102509666B1 (en) | Real-time face replay based on text and audio | |
TW201937344A (en) | Smart robot and man-machine interaction method | |
CN110555896B (en) | Image generation method and device and storage medium | |
CN102568023A (en) | Real-time animation for an expressive avatar | |
JP2014519082A5 (en) | ||
CN107911643B (en) | Method and device for showing scene special effect in video communication | |
CN109801349B (en) | Sound-driven three-dimensional animation character real-time expression generation method and system | |
CN112669417A (en) | Virtual image generation method and device, storage medium and electronic equipment | |
CN112668407A (en) | Face key point generation method and device, storage medium and electronic equipment | |
CN114359517A (en) | Avatar generation method, avatar generation system, and computing device | |
KR20200059993A (en) | Apparatus and method for generating conti for webtoon | |
CN110794964A (en) | Interaction method and device for virtual robot, electronic equipment and storage medium | |
CN113299312A (en) | Image generation method, device, equipment and storage medium | |
CN110148406A (en) | A kind of data processing method and device, a kind of device for data processing | |
US20120013620A1 (en) | Animating Speech Of An Avatar Representing A Participant In A Mobile Communications With Background Media | |
CN112652041A (en) | Virtual image generation method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |