CN110288077A - Artificial-intelligence-based method and related apparatus for synthesizing a speaking expression - Google Patents
Artificial-intelligence-based method and related apparatus for synthesizing a speaking expression Download PDF Info
- Publication number
- CN110288077A (application No. CN201910745062.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- duration
- expression
- text feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
Embodiments of the present application disclose an artificial-intelligence-based method and related apparatus for synthesizing a speaking expression, involving several artificial-intelligence technologies. For text content sent by a terminal, the method determines the text feature corresponding to the text content and the durations of the pronunciation elements identified by that text feature; obtains, through an expression model, the target expressive feature corresponding to the text feature and the durations of the identified pronunciation elements; and returns the target expressive feature to the terminal. Because the expression model can determine different sub-expressive features for the same pronunciation element in the text feature when that element has different durations, the variation patterns of the speaking expression are enriched, and a speaking expression generated from the target expressive feature matches the expressions of a real speaker. Since the same pronunciation element can yield speaking expressions with different variation patterns, unnatural transitions in the speaking expression are reduced and the user's sense of immersion is improved.
Description
This application is a divisional application of Chinese patent application No. 201811354206.9, filed on November 14, 2018 and entitled "Model training method for synthesizing a speaking expression, method for synthesizing a speaking expression, and related apparatus".
Technical field
This application relates to the field of data processing, and in particular to an artificial-intelligence-based method and related apparatus for synthesizing a speaking expression.
Background technique
With the development of computer technology, human-computer interaction has become common, but it is mostly simple voice interaction: for example, an interactive device determines reply content from text or speech input by the user and plays a virtual voice synthesized from that reply content.
The sense of immersion offered by such interaction can hardly satisfy current users' interaction demands. To improve immersion, virtual objects with changing mouth shapes and expressions have emerged as interactive partners. Such a virtual object may be a cartoon, a digital human, or another virtual figure; when interacting with a user, it not only plays the interactive virtual voice but also shows expressions matching that voice, giving the user the impression that the virtual object itself is uttering the virtual speech.
At present, the expression such a virtual object makes is mainly determined by the pronunciation element currently being played. As a result, when a virtual voice is played, the virtual object's expression variation patterns are limited and its expression changes are abrupt and unnatural; the user experience is actually poor, and the sense of immersion is hardly improved.
Summary of the invention
To solve the above technical problem, the present application provides a model training method for synthesizing a speaking expression, a method for synthesizing a speaking expression, and related apparatus, which enrich the variation patterns of the speaking expression. A speaking expression is generated from a target expressive feature determined by an expression model; since the same pronunciation element can yield speaking expressions with different variation patterns, unnatural transitions in the speaking expression are reduced to some extent.
The embodiments of the present application disclose the following technical solutions:
In a first aspect, an embodiment of the present application provides a model training method for synthesizing a speaking expression, including:
obtaining a video containing a speaker's facial-action expressions and the corresponding speech;
obtaining, from the video, the speaker's expressive feature, the acoustic feature of the speech, and the text feature of the speech, the acoustic feature including multiple sub-acoustic features;
determining, from the text feature and the acoustic feature, the time interval and duration of each pronunciation element identified by the text feature; any pronunciation element identified by the text feature is a target pronunciation element, the time interval of the target pronunciation element is the time interval in the video of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature, and the duration of the target pronunciation element is the duration of that sub-acoustic feature;
determining a first correspondence from the time intervals and durations of the pronunciation elements identified by the text feature and from the expressive feature, the first correspondence embodying, for each pronunciation element, the relationship between the pronunciation element's duration and the sub-expressive feature of the expressive feature that corresponds to the pronunciation element's time interval; and
training an expression model with the first correspondence, the expression model being configured to determine a corresponding target expressive feature from a to-be-processed text feature and the durations of the pronunciation elements it identifies.
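Taken together, the first-aspect steps amount to aligning pronunciation elements against the shared timeline and slicing out sub-expressive features. A minimal sketch in Python (the frame rate, data layout, and function names are illustrative assumptions, not part of the claims):

```python
import numpy as np

FPS = 100  # assumed frame rate of the expressive-feature sequence

def build_first_correspondence(phone_intervals, expr_frames):
    """For each pronunciation element, pair (element, duration) with the
    sub-expressive features lying inside the element's time interval.

    phone_intervals: list of (element, start_s, end_s) obtained by aligning
                     the text feature against the acoustic feature
    expr_frames:     (T, D) array of per-frame expressive features
    """
    samples = []
    for element, start, end in phone_intervals:
        duration = round(end - start, 3)
        lo, hi = round(start * FPS), round(end * FPS)
        sub_expr = expr_frames[lo:hi]  # sub-expressive feature slice
        samples.append(((element, duration), sub_expr))
    return samples

# Toy 2 s clip with 1-D expressive features and two aligned elements.
frames = np.arange(200, dtype=float).reshape(200, 1)
pairs = build_first_correspondence(
    [("ni", 0.0, 0.1), ("chi", 0.1, 0.35)], frames)
key, sub = pairs[0]
print(key, sub.shape)  # ('ni', 0.1) (10, 1)
```

The resulting (element, duration) → sub-expressive-feature pairs are what the expression model would then be trained on.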
In a second aspect, an embodiment of the present application provides a model training apparatus for synthesizing a speaking expression, the apparatus including an obtaining unit, a first determining unit, a second determining unit, and a first training unit:
the obtaining unit is configured to obtain a video containing a speaker's facial-action expressions and the corresponding speech;
the obtaining unit is further configured to obtain, from the video, the speaker's expressive feature, the acoustic feature of the speech, and the text feature of the speech, the acoustic feature including multiple sub-acoustic features;
the first determining unit is configured to determine, from the text feature and the acoustic feature, the time interval and duration of each pronunciation element identified by the text feature; any pronunciation element identified by the text feature is a target pronunciation element, the time interval of the target pronunciation element is the time interval in the video of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature, and the duration of the target pronunciation element is the duration of that sub-acoustic feature;
the second determining unit is configured to determine a first correspondence from the time intervals and durations of the pronunciation elements identified by the text feature and from the expressive feature, the first correspondence embodying, for each pronunciation element, the relationship between the pronunciation element's duration and the sub-expressive feature of the expressive feature that corresponds to the pronunciation element's time interval; and
the first training unit is configured to train an expression model with the first correspondence, the expression model being configured to determine a corresponding target expressive feature from a to-be-processed text feature and the durations of the pronunciation elements it identifies.
In a third aspect, an embodiment of the present application provides a model training device for synthesizing a speaking expression, the device including a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor; and
the processor is configured to execute, according to instructions in the program code, the model training method for synthesizing a speaking expression of any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a method for synthesizing a speaking expression, the method including:
determining a text feature corresponding to text content and the durations of the pronunciation elements identified by the text feature, the text feature including multiple sub-text features; and
obtaining, through the text feature, the durations of the identified pronunciation elements, and an expression model, a target expressive feature corresponding to the text content; the target expressive feature includes multiple sub-expressive features, any pronunciation element identified by the text feature is a target pronunciation element, and in the target expressive feature, the sub-expressive feature corresponding to the target pronunciation element is determined from the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element.
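The fourth-aspect pipeline, a text feature plus per-element durations fed to an expression model that emits duration-dependent sub-expressive features, can be sketched as follows. Both the duration predictor and the toy model here are stand-ins; in the patent the expression model is trained as described in the first aspect:

```python
import numpy as np

def predict_durations(elements):
    """Stand-in duration predictor; in the patent the durations come
    from a trained duration/acoustic model."""
    return [0.1 + 0.05 * (len(e) - 1) for e in elements]

def expression_model(element, duration, dim=3, fps=100):
    """Toy expression model: the same pronunciation element with a
    different duration yields a different sub-expressive feature
    sequence -- the property emphasized in the abstract."""
    n_frames = max(1, round(duration * fps))
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    offset = (sum(map(ord, element)) % 7) / 7.0  # per-element offset (toy)
    return offset + np.sin(np.pi * t) * np.ones((1, dim))

elements = ["ni", "chi", "fan", "le", "ma"]  # identified pronunciation elements
target = [expression_model(e, d)
          for e, d in zip(elements, predict_durations(elements))]
print([seg.shape for seg in target])  # [(15, 3), (20, 3), (20, 3), (15, 3), (15, 3)]
```

Concatenating the per-element segments gives the target expressive feature for the whole text content.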
In a fifth aspect, an embodiment of the present application provides an apparatus for synthesizing a speaking expression, the apparatus including a determining unit and a first obtaining unit:
the determining unit is configured to determine a text feature corresponding to text content and the durations of the pronunciation elements identified by the text feature, the text feature including multiple sub-text features; and
the first obtaining unit is configured to obtain, through the text feature, the durations of the identified pronunciation elements, and an expression model, a target expressive feature corresponding to the text content; the target expressive feature includes multiple sub-expressive features, any pronunciation element identified by the text feature is a target pronunciation element, and in the target expressive feature, the sub-expressive feature corresponding to the target pronunciation element is determined from the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element.
In a sixth aspect, an embodiment of the present application provides a device for synthesizing a speaking expression, the device including a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor; and
the processor is configured to execute, according to instructions in the program code, the method for synthesizing a speaking expression of any implementation of the fourth aspect.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium configured to store program code, the program code being used to execute the model training method for synthesizing a speaking expression of any implementation of the first aspect, or the method for synthesizing a speaking expression of any implementation of the fourth aspect.
As can be seen from the above technical solutions, to determine varied and natural speaking expressions for a virtual object, the embodiments of the present application provide a completely new way of training an expression model. From a video containing a speaker's facial-action expressions and the corresponding speech, the speaker's expressive feature, the acoustic feature of the speech, and the text feature of the speech are obtained. Since the acoustic feature and the text feature are obtained from the same video, the time interval and duration of each pronunciation element identified by the text feature can be determined from the acoustic feature. A first correspondence is then determined from the time intervals and durations of the identified pronunciation elements together with the expressive feature; the first correspondence embodies, for each pronunciation element, the relationship between the pronunciation element's duration and the sub-expressive feature of the expressive feature that corresponds to the pronunciation element's time interval.
For any identified pronunciation element, taken as a target pronunciation element, the sub-expressive feature within its time interval can be located in the expressive feature through that time interval, and the duration of the target pronunciation element reflects the various lengths the element takes in the various sentences expressed in the video's speech. The sub-expressive features so determined can therefore reflect the possible expressions with which the speaker utters the target pronunciation element in different sentences. Consequently, for a text feature whose expressive feature is to be determined, the expression model trained on the first correspondence can determine different sub-expressive features for the same pronunciation element in that text feature when the element has different durations. This enriches the variation patterns of the speaking expression; since a speaking expression generated from the target expressive feature determined by the expression model can vary for the same pronunciation element, unnatural transitions in the speaking expression are reduced to some extent.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may still derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of an expression model training method provided by an embodiment of the present application;
Fig. 2 is a flowchart of a model training method for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 3 is a flowchart of an acoustic model training method provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of an application scenario of a method for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 5 is a flowchart of a method for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 6 is an architecture diagram of generating visualized synthesized speech in human-computer interaction provided by an embodiment of the present application;
Fig. 7a is a structural diagram of a model training apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 7b is a structural diagram of a model training apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 7c is a structural diagram of a model training apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 8a is a structural diagram of an apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 8b is a structural diagram of an apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 9 is a structural diagram of a server provided by an embodiment of the present application;
Fig. 10 is a structural diagram of a terminal device provided by an embodiment of the present application.
Specific embodiment
To help a person skilled in the art better understand the solutions of the present application, the following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
At present, when a virtual object interacts with a user, the speaking expression it makes is mainly determined by the pronunciation element currently being played. For example, a correspondence between pronunciation elements and expressions is established, normally with one speaking expression per pronunciation element, and when a pronunciation element is played the virtual object makes the speaking expression corresponding to it. As a result, when a virtual voice is played, the virtual object's speaking expression can only be the one corresponding to the currently played pronunciation element, so its expression variation patterns are limited; and because each pronunciation element corresponds to only one speaking expression, the expression transitions are abrupt and unnatural. The user experience is actually poor, and the sense of immersion is hardly improved.
To solve the above technical problem, an embodiment of the present application provides a completely new way of training an expression model: during training, the text feature, the durations of the pronunciation elements identified by the text feature, and the sub-expressive features corresponding to the pronunciation elements' time intervals in the expressive feature are used as training samples, so that the expression model is trained on the correspondence between each pronunciation element's duration and the sub-expressive feature corresponding to that element's time interval in the expressive feature.
Both the method for synthesizing a speaking expression and the corresponding model training method for synthesizing a speaking expression provided by the embodiments of the present application can be implemented based on artificial intelligence. Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, mechatronics, and similar technologies. Artificial intelligence software technologies mainly include several major directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiments of the present application, the artificial intelligence software technologies mainly involved include the above computer vision, speech processing, natural language processing, and deep learning directions.
For example, image processing (Image Processing), image semantic understanding (Image Semantic Understanding, ISU), video processing (Video Processing), video semantic understanding (Video Semantic Understanding, VSU), 3D object reconstruction (3D Object Reconstruction), and face recognition (Face Recognition) in computer vision (Computer Vision) may be involved.
For example, speech recognition in speech technology (Speech Technology) may be involved, including speech signal preprocessing (Speech Signal Preprocessing), speech signal frequency-domain analysis (Speech Signal Frequency Analyzing), speech signal feature extraction (Speech Signal Feature Extraction), speech signal feature matching/recognition (Speech Signal Feature Matching/Recognition), and speech training (Speech Training).
For example, text preprocessing (Text Preprocessing) and semantic understanding (Semantic Understanding) in natural language processing (Natural Language Processing, NLP) may be involved, including word/sentence segmentation (Word/Sentence Segmentation), part-of-speech tagging (Word Tagging), and sentence classification (Word/Sentence Classification).
For example, deep learning (Deep Learning) in machine learning (Machine Learning, ML) may be involved, including various artificial neural networks (Artificial Neural Networks).
For ease of understanding the technical solutions of the present application, the expression model training method provided by the embodiments of the present application is introduced below with reference to practical application scenarios.
The model training method provided by the present application can be applied to a data processing device capable of processing a video of a speaker uttering speech, such as a terminal device or a server. The terminal device may specifically be a smartphone, a computer, a personal digital assistant (PDA), a tablet computer, or the like; the server may specifically be a standalone server or a cluster of servers.
The data processing device may be capable of implementing computer vision technology. Computer vision is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to establish artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and further include common biometric recognition technologies such as face recognition and fingerprint recognition.
In the embodiments of the present application, the data processing device can obtain information such as the speaker's expressive feature and the corresponding durations from the video through computer vision technology.
The data processing device may be capable of implementing automatic speech recognition (ASR), voiceprint recognition, and other speech technologies. Enabling a computer to listen, see, speak, and feel is the development direction of future human-computer interaction, in which speech is expected to become one of the most promising modes of interaction.
In the embodiments of the present application, by implementing the above speech technologies, the data processing device can perform speech recognition on the obtained video, thereby obtaining information such as the speaker's acoustic feature, the corresponding pronunciation elements, and the corresponding durations from the video.
The data processing device may also be capable of natural language processing, an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies generally include text processing, semantic understanding, and other technologies.
In the embodiments of the present application, by implementing the above NLP technologies, the data processing device can determine the text feature of the speech from the video.
The data processing device may have machine learning (Machine Learning, ML) capability. ML is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; its applications pervade all fields of artificial intelligence. Machine learning and deep learning generally include technologies such as artificial neural networks.
The embodiments of the present application mainly involve applying various artificial neural networks to the model training method for synthesizing a speaking expression, for example training the expression model with the first correspondence.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the expression model training method provided by an embodiment of the present application. The application scenario includes a server 101, which can obtain one or more videos containing a speaker's facial-action expressions and the corresponding speech. The language of the characters included in the speech in a video may be any of various languages such as Chinese, English, or Korean.
From the obtained video, the server 101 can obtain the speaker's expressive feature, the acoustic feature of the speech, and the text feature of the speech. The speaker's expressive feature can represent the facial-action expressions made when the speaker utters the speech in the video, and may include, for example, mouth-shape features and eye movements; through the speaker's expressive feature, a viewer of the video can perceive that the speech in the video is being uttered by that speaker. The acoustic feature of the speech may include the sound wave of the speech. The text feature of the speech identifies the pronunciation elements corresponding to the text content; it should be noted that a pronunciation element in the embodiments of the present application may be the pronunciation corresponding to a character included in the speech uttered by the speaker.
It should be noted that, in this embodiment, the expressive feature, the acoustic feature, and the text feature may all be represented in the form of feature vectors.
Since the acoustic feature and the text feature are obtained from the same video, the server 101 can determine, from the text feature and the acoustic feature, the time interval and duration of each pronunciation element identified by the text feature. The time interval of a pronunciation element is the interval, in the video, between the start time and the end time of the sub-acoustic feature corresponding to that pronunciation element; the duration is the length of that sub-acoustic feature, for example the difference between the end time and the start time. A sub-acoustic feature is the part of the acoustic feature corresponding to one pronunciation element, and the acoustic feature may include multiple sub-acoustic features.
Then, the server 101 determines a first correspondence from the time intervals and durations of the pronunciation elements identified by the text feature together with the expressive feature; the first correspondence embodies, for each pronunciation element, the relationship between the pronunciation element's duration and the sub-expressive feature of the expressive feature that corresponds to the pronunciation element's time interval. A sub-expressive feature is the part of the expressive feature corresponding to one pronunciation element, and the expressive feature may include multiple sub-expressive features.
For any identified pronunciation element, such as a target pronunciation element, the time interval of the target pronunciation element is the time interval in the video of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature. The acoustic feature, the text feature, and the expressive feature are all obtained from the same video and share the same timeline, so the sub-expressive feature within the target pronunciation element's time interval can be located in the expressive feature through that time interval. Moreover, the duration of the target pronunciation element is the duration of its corresponding sub-acoustic feature, which reflects the various lengths the target pronunciation element takes in the various sentences expressed in the video's speech; the sub-expressive features so determined can therefore reflect the possible speaking expressions with which the speaker utters the target pronunciation element in different sentences.
Take as an example a speech in which the speaker says "Have you eaten?", in a video of duration 2 s containing the speech. The text features identify the pronunciation elements of these characters, namely "ni ' chi ' fan ' le ' ma"; the expressive features represent the speaker's speaking expression while uttering this speech; and the acoustic features are the sound waves emitted while uttering it. The target pronunciation element is any pronunciation element in "ni ' chi ' fan ' le ' ma". If the target pronunciation element is "ni", the time interval of "ni" is the interval between 0 s and 0.1 s and the duration of "ni" is 0.1 s; the sub-expressive feature corresponding to "ni" is the part of the expressive features corresponding to the speech uttered between 0 s and 0.1 s in the video, for example sub-expressive feature A. When determining the first correspondence, the server 101 can locate the corresponding sub-expressive feature A through the time interval from 0 s to 0.1 s of the pronunciation element "ni" identified by the text features. In this way, the correspondence between the duration 0.1 s of the pronunciation element "ni" and the sub-expressive feature A corresponding to its time interval from 0 s to 0.1 s can be determined, and the first correspondence includes this correspondence.
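The slicing described above can be sketched as follows. This is an illustrative sketch, not part of the patent: the frame rate, the feature layout and the function names are assumptions, and a real implementation would obtain the alignment from the acoustic features (for example by forced alignment) rather than take it as given.

```python
# Illustrative sketch: building the first correspondence from an alignment of
# pronunciation elements and frame-level expressive features that share the
# same timeline. FPS and the feature layout are assumptions.
FPS = 10  # assumed expression frame rate: one feature vector per 0.1 s

def build_first_correspondence(alignment, expr_frames, fps=FPS):
    """alignment: (element, start_s, end_s) triples from the same video.
    expr_frames: per-frame expressive feature vectors on the same timeline.
    Returns (element, duration, sub_expressive_feature) entries."""
    correspondence = []
    for element, start, end in alignment:
        duration = round(end - start, 3)               # duration = end - start
        lo, hi = round(start * fps), round(end * fps)  # frames in the interval
        correspondence.append((element, duration, expr_frames[lo:hi]))
    return correspondence

# "ni ' chi ' fan ' le ' ma" in a 2 s video; "ni" occupies 0 s to 0.1 s
alignment = [("ni", 0.0, 0.1), ("chi", 0.1, 0.3), ("fan", 0.3, 0.5),
             ("le", 0.5, 0.6), ("ma", 0.6, 0.8)]
expr_frames = [[0.1 * i] for i in range(20)]  # dummy 2 s of feature vectors
pairs = build_first_correspondence(alignment, expr_frames)
# pairs[0] is the entry for "ni": its duration 0.1 s plus its feature slice
```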
The server 101 trains an expression model according to the first correspondence. The expression model is used to determine the corresponding target expressive features according to a to-be-determined text feature and the durations of the pronunciation elements identified by that text feature.
It can be understood that the present embodiment is concerned with the pronunciation elements rather than with which specific characters the pronunciation elements correspond to. In the speech uttered by the speaker, one sentence may include different characters, and different characters may correspond to the same pronunciation element. In that case, the same pronunciation element occurs in different time intervals, may have different durations, and hence corresponds to different sub-expressive features.
For example, the characters included in the speech uttered by the speaker are "tell you a secret", and the pronunciation element identified by the text features for both characters of the word "秘密" ("secret") is "mi". The time interval of the pronunciation element "mi" identified for the character "秘" is the interval between 0.4 s and 0.6 s, with a duration of 0.2 s; the time interval of the pronunciation element "mi" identified for the character "密" is the interval between 0.6 s and 0.7 s, with a duration of 0.1 s. It can be seen that the text features corresponding to the different characters "秘" and "密" identify the same pronunciation element "mi" but with different durations; therefore, the pronunciation element "mi" corresponds to different sub-expressive features.
In addition, depending on the speaker's manner of expression when speaking, different sentences in the uttered speech may include identical characters, and the pronunciation elements identified by the text features for those identical characters may have different durations; in this way, the same pronunciation element corresponds to different sub-expressive features.
For example, in a speech in which the speaker says "hello" ("你好"), the duration of the pronunciation element "ni" identified by the text features for the character "你" is 0.1 s; however, in another speech in which the speaker says "you, me, him" ("你我他"), the duration of the pronunciation element "ni" identified for the character "你" may be 0.3 s. In this case, the pronunciation elements identified for identical characters have different durations, so the same pronunciation element corresponds to different sub-expressive features.
Since one pronunciation element corresponds to different durations, and the sub-expressive features corresponding to the pronunciation element differ across durations, the first correspondence can embody the correspondence between the various durations a pronunciation element takes and the sub-expressive features. In this way, when sub-expressive features are determined using the expression model trained according to the first correspondence, for the text features whose expressive features are to be determined, the model can determine different sub-expressive features for the same pronunciation element with different durations, which increases the variety of the speaking expression. In addition, since the speaking expression generated from the target expressive features determined by the expression model has different variation patterns for the same pronunciation element, unnatural transitions in the speaking expression are improved to a certain extent.
It can be understood that, to solve the technical problems in the traditional approach, increase the variety of the speaking expression and improve unnatural transitions in the speaking expression, the embodiments of the present application provide a new expression model training method, and generate the speaking expression corresponding to text content using the expression model. Next, the model training method for synthesizing a speaking expression and the method for synthesizing a speaking expression provided by the embodiments of the present application are introduced with reference to the accompanying drawings.
First, the model training method for synthesizing a speaking expression is introduced. Referring to Fig. 2, Fig. 2 shows a flowchart of a model training method for synthesizing a speaking expression; the method comprises:
S201: obtaining a video including a speaker's facial action expressions and the corresponding speech.
The video containing the facial action expressions and the corresponding speech may be obtained by recording, in an environment equipped with a camera, the speech uttered by the speaker while simultaneously recording the speaker's facial action expressions through the camera.
S202: acquiring, from the video, the expressive features of the speaker, the acoustic features of the speech, and the text features of the speech.
The expressive features may be obtained by performing feature extraction on the facial action expressions in the video; the acoustic features may be obtained by performing feature extraction on the speech uttered by the speaker in the video; and the text features may be obtained by performing feature extraction on the text corresponding to the speech uttered by the speaker in the video. The expressive features, the acoustic features and the text features are all obtained from the same video and share the same timeline.
The expressive features can identify the characteristics of the facial action expressions while the speaker utters the speech; through the expressive features, a video viewer can see which pronunciation element the speaker is uttering. In one implementation, the expressive features include at least mouth-shape features, which directly embody the characteristics of the facial action expressions during speech, thereby ensuring that a video viewer can perceive, through the speaker's mouth shape, that the speech in the video is indeed uttered by that speaker.
S203: determining, according to the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features.
In the present embodiment, a pronunciation element may, for example, be the syllable corresponding to a character included in the speech uttered by the speaker. A character may be the basic semantic unit in different languages: in Chinese, a character may be a Chinese character and the corresponding pronunciation element may be a pinyin syllable; in English, a character may be a word and the corresponding pronunciation element may be the corresponding phonetic symbol or phonetic symbol combination. For example, when the character is the Chinese character "你", the corresponding pronunciation element may be the pinyin syllable "ni"; when the character is the English word "ball", the corresponding pronunciation element may be the corresponding English syllable. Of course, a pronunciation element may also be the smallest pronunciation unit included in the pinyin syllable corresponding to a character; for example, if the character included in the speech uttered by the speaker is "你", the pronunciation elements may include the two elements "n" and "i".
In some cases, pronunciation elements may also need to be distinguished by tone; therefore, a pronunciation element may further include the tone. For example, in Chinese, the characters included in the speech uttered by the speaker are "you are Nini" ("你是妮妮"), where the pinyin syllable of both "你" and "妮" is "ni", but the tone of "你" is the third tone and the tone of "妮" is the first tone. Therefore, the pronunciation element corresponding to "你" includes "ni" and the third tone, the pronunciation element corresponding to "妮" includes "ni" and the first tone, and the pronunciation elements corresponding to "你" and "妮" are distinguished according to the difference in tone. In use, suitable pronunciation elements can be determined according to demand.
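The granularity choices above can be illustrated with a small sketch. This is illustrative only; the concrete representation of a pronunciation element is an assumption, not something fixed by the present embodiment.

```python
# Illustrative only: three assumed granularities for a pronunciation element
# (whole syllable, smallest pronunciation units, syllable plus tone).
def to_elements(pinyin, tone=None, split_units=False):
    if split_units:              # smallest units, e.g. "ni" -> ["n", "i"]
        return list(pinyin)
    if tone is not None:         # tone-aware element, e.g. ("ni", 3)
        return [(pinyin, tone)]
    return [pinyin]              # plain syllable element

# "你" (ni, third tone) and "妮" (ni, first tone) collide as plain syllables
# but are distinct once the tone is part of the element:
assert to_elements("ni") == to_elements("ni")
assert to_elements("ni", tone=3) != to_elements("ni", tone=1)
assert to_elements("ni", split_units=True) == ["n", "i"]
```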
It should be noted that, besides the above possibilities, the language of the characters and of the corresponding pronunciation elements may also be any other language; no restriction is placed here on the language category of the characters. For ease of description, the embodiments of the present application will mainly be explained with the characters being Chinese characters and the corresponding pronunciation elements being pinyin syllables.
S204: determining a first correspondence according to the time intervals and durations of the pronunciation elements identified by the text features, and the expressive features.
Since the acoustic features and the text features are obtained from the same video, and the acoustic features themselves include time information, the time intervals and durations of the pronunciation elements identified by the text features can be determined according to the acoustic features. The time interval of a target pronunciation element is the time interval, in the video, of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic features; the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element; and the target pronunciation element is any pronunciation element identified by the text features.
The first correspondence is used to embody the correspondence between the duration of a pronunciation element and the sub-expressive feature that corresponds, in the expressive features, to the time interval of that pronunciation element.
When determining the first correspondence, the sub-expressive feature corresponding to a pronunciation element within its time interval can be determined from the expressive features through the time interval of the pronunciation element, thereby determining the correspondence between the duration of the pronunciation element and the sub-expressive feature corresponding, in the expressive features, to its time interval.
It can be understood that the durations corresponding to the same pronunciation element may differ, but may also be identical. Not only can the sub-expressive features corresponding to the same pronunciation element differ when its durations differ; even when the durations corresponding to the same pronunciation element are identical, for reasons such as differences in the speaking tone or communication habits, the same pronunciation element with the same duration may also have different sub-expressive features.
For example, the speaker says the speech "Nini" ("妮妮") in an excited tone, and says "Nini" in an angry tone. Even if the durations of the pronunciation elements identified by the text features are identical, the speaker's speaking tone may cause the sub-expressive features corresponding to the same pronunciation element to differ.
S205: training an expression model according to the first correspondence.
The trained expression model can determine, from a to-be-determined text feature, the target expressive features corresponding to the pronunciation elements it identifies, where the text content corresponding to the to-be-determined text feature is text content for which a speaking expression needs to be synthesized, or for which a virtual voice further needs to be generated. The to-be-determined text feature and the durations of the pronunciation elements it identifies serve as the input of the expression model, and the target expressive features serve as the output of the expression model.
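The input/output contract just described can be sketched as below. The present embodiment does not fix a model architecture, so this stand-in replaces training with a nearest-duration lookup over the first-correspondence samples; it only illustrates that the same pronunciation element with different durations yields different target expressive features.

```python
# Stand-in expression model (architecture not fixed by the embodiment):
# input is (pronunciation element, duration); output is a sub-expressive
# feature, chosen by nearest training duration for that element.
class ExpressionModel:
    def __init__(self):
        self.samples = []   # (element, duration, sub_expressive_feature)

    def train(self, first_correspondence):
        self.samples = list(first_correspondence)

    def predict(self, element, duration):
        candidates = [s for s in self.samples if s[0] == element]
        best = min(candidates, key=lambda s: abs(s[1] - duration))
        return best[2]

model = ExpressionModel()
model.train([("ni", 0.1, "sub-expressive feature A"),
             ("ni", 0.2, "sub-expressive feature B")])
# the same element "ni" yields different features for different durations
assert model.predict("ni", 0.1) == "sub-expressive feature A"
assert model.predict("ni", 0.2) == "sub-expressive feature B"
```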
The training data for training the expression model is the first correspondence. In the first correspondence, the same pronunciation element with the same duration or with different durations may correspond to different sub-expressive features. In this way, when the trained expression model is subsequently used to determine target expressive features, after the to-be-determined text feature and the durations of the pronunciation elements it identifies are input into the expression model, a situation similar to the training data is obtained: the target expressive features obtained after inputting the same pronunciation element with different durations into the expression model may differ, and even inputting the same pronunciation element with the same duration into the expression model may also yield different target expressive features.
It should be noted that, in the present embodiment, the same pronunciation element may have different durations; the same pronunciation element with different durations has different expressive features, and even with the same duration it may have different expressive features. Through the contextual information corresponding to a pronunciation element, it is possible to accurately determine which duration corresponds to the pronunciation element, and which expression corresponds to that duration of the pronunciation element.
During normal speech, the characteristics with which the same pronunciation element is expressed may differ under different contexts; for example, the durations of the pronunciation element differ, and then the same pronunciation element may have different sub-expressive features. That is to say, which duration a pronunciation element corresponds to, and which expressive features the pronunciation element with that duration corresponds to, are related to the context of the pronunciation element. Therefore, in one implementation, the text features used in training the expression model can also be used to identify the pronunciation elements in the speech and the contextual information corresponding to the pronunciation elements. In this way, when the trained expression model determines expressive features, the contextual information makes it possible to accurately determine the duration of a pronunciation element and which sub-expressive feature it corresponds to.
For example, the speech uttered by the speaker is "you are Nini" ("你是妮妮"), and the pronunciation elements identified by the corresponding text features include "ni ' shi ' ni ' ni", where the pronunciation element "ni" occurs three times. The duration of the first pronunciation element "ni" is 0.1 s, corresponding to sub-expressive feature A; the duration of the second pronunciation element "ni" is 0.2 s, corresponding to sub-expressive feature B; the duration of the third pronunciation element "ni" is 0.1 s, corresponding to sub-expressive feature C. Suppose the text features can also identify the pronunciation elements in the speech and their corresponding contextual information, where the contextual information of the first pronunciation element "ni" is contextual information 1, that of the second is contextual information 2, and that of the third is contextual information 3. Then, when expressive features are determined using the trained expression model, contextual information 1 can accurately determine that the duration of the first pronunciation element "ni" is 0.1 s and that its corresponding sub-expressive feature is sub-expressive feature A, and so on.
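The disambiguating role of the contextual information in this example can be sketched as a lookup key. All names here, such as "context 1", are illustrative stand-ins for whatever context encoding the text features carry.

```python
# Illustrative sketch: contextual information as a disambiguating key for the
# repeated occurrences of the same pronunciation element in "ni shi ni ni".
first_correspondence = {
    ("ni", "context 1"): (0.1, "sub-expressive feature A"),  # first "ni"
    ("ni", "context 2"): (0.2, "sub-expressive feature B"),  # second "ni"
    ("ni", "context 3"): (0.1, "sub-expressive feature C"),  # third "ni"
}

def lookup(element, context):
    return first_correspondence[(element, context)]

# the same element, even with the same duration, resolves unambiguously:
assert lookup("ni", "context 1") == (0.1, "sub-expressive feature A")
assert lookup("ni", "context 3") == (0.1, "sub-expressive feature C")
```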
Since the contextual information can embody a person's manner of expression during normal speech, accurately determining the duration of a pronunciation element and its corresponding sub-expressive feature through the contextual information enables, when the expression model trained according to the first correspondence is used to determine the target expressive features with which a virtual object utters pronunciation elements, the manner of expression of the virtual object to better fit human expression. In addition, given the preceding pronunciation elements, the contextual information can inform what kind of sub-expressive feature corresponds to the pronunciation element the speaker is currently uttering, so that the sub-expressive feature corresponding to the current pronunciation element is linked with the sub-expressive features corresponding to the preceding pronunciation elements, improving the smoothness of the transitions in the speaking expression generated later.
As can be seen from the above technical solution, in order to determine for a virtual object a speaking expression that varies richly and changes naturally, the embodiments of the present application provide a completely new expression model training approach: from a video containing a speaker's facial action expressions and the corresponding speech, the expressive features of the speaker, the acoustic features of the speech and the text features of the speech are obtained. Since the acoustic features and the text features are obtained from the same video, the time intervals and durations of the pronunciation elements identified by the text features can be determined according to the acoustic features.
For a target pronunciation element among the identified pronunciation elements, the sub-expressive feature within its time interval can be determined from the expressive features through the time interval of the target pronunciation element, and the duration of the target pronunciation element can reflect the various durations the target pronunciation element takes in the various utterances of the video speech; therefore, the determined sub-expressive features can reflect the possible expressions with which the speaker utters the target pronunciation element in different utterances. Accordingly, with the expression model trained according to the first correspondence, for the text features whose expressive features are to be determined, the model can determine different sub-expressive features for the same pronunciation element with different durations, increasing the variety of the speaking expression. Since the speaking expression generated from the target expressive features determined by the expression model has different variation patterns for the same pronunciation element, unnatural transitions in the speaking expression are improved to a certain extent.
It can be understood that, when a speaking expression is generated by the expression model trained by the method provided in the embodiment corresponding to Fig. 2, the variety of the speaking expression can be increased and unnatural transitions in the speaking expression improved. Moreover, during human-computer interaction, not only is the speaking expression of the virtual object shown to the user; a virtual voice for the interaction may also be played. If the virtual voice is generated in an existing manner, it may fail to match the speaking expression generated by the solution provided by the embodiments of the present application. For this case, the embodiments of the present application provide a new acoustic model training method; an acoustic model trained in this way can generate a virtual voice matching the speaking expression. Referring to Fig. 3, the method comprises:
S301: determining a second correspondence between the pronunciation elements identified by the text features and the acoustic features.
S302: training an acoustic model according to the second correspondence.
The second correspondence is used to embody the correspondence between the duration of a pronunciation element and the sub-acoustic feature corresponding to the pronunciation element in the acoustic features.
The trained acoustic model can determine, from a to-be-determined text feature, the target acoustic features corresponding to the pronunciation elements it identifies. The to-be-determined text feature and the durations of the pronunciation elements it identifies serve as the input of the acoustic model, and the target acoustic features serve as the output of the acoustic model.
When determining the second correspondence, since the pronunciation elements are exactly what the speaker utters while speaking, the acoustic features and the pronunciation elements uttered by the speaker have a correspondence. In this way, by identifying the pronunciation elements according to the text features, the sub-acoustic features corresponding to the pronunciation elements in the acoustic features can be determined, so that for any acoustic features identified by the text features, the second correspondence between the pronunciation elements identified by the text features and the acoustic features can be determined.
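Determining the second correspondence mirrors the first correspondence, but slices the acoustic features instead of the expressive features; the shared alignment is what keeps voice and expression matched. A hedged sketch (frame rate and data layout are assumptions):

```python
# Illustrative sketch: the second correspondence is built like the first but
# slices the acoustic features; using the same alignment keeps the voice and
# the speaking expression matched on the same timeline.
SR = 100  # assumed acoustic frame rate (frames per second)

def build_second_correspondence(alignment, acoustic_frames, rate=SR):
    out = []
    for element, start, end in alignment:
        lo, hi = round(start * rate), round(end * rate)
        out.append((element, round(end - start, 3), acoustic_frames[lo:hi]))
    return out

alignment = [("ni", 0.0, 0.1), ("hao", 0.1, 0.3)]
acoustic_frames = list(range(30))        # dummy 0.3 s of acoustic frames
second = build_second_correspondence(alignment, acoustic_frames)
assert second[0] == ("ni", 0.1, list(range(10)))
assert second[1][1] == 0.2
```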
It can be understood that the durations corresponding to the same pronunciation element may differ, but may also be identical. Not only can the sub-acoustic features corresponding to the same pronunciation element differ when its durations differ; even when the durations corresponding to the same pronunciation element are identical, for reasons such as differences in the speaking tone or manner of expression, the same pronunciation element with the same duration may also have different sub-acoustic features.
Since a pronunciation element itself carries time information, the sub-expressive feature corresponding to the pronunciation element within its time interval can be determined from the expressive features through the time interval of the pronunciation element, thereby determining the correspondence between the duration of the pronunciation element and the sub-expressive feature corresponding, in the expressive features, to its time interval.
The training data used in training the acoustic model and the training data used in training the expression model come from the same video and correspond to the same timeline; when the speaker utters a pronunciation element, the speaker's voice and the speaker's facial action expression match. In this way, the virtual voice generated from the target acoustic features determined by the acoustic model matches the speaking expression generated from the target expressive features determined by the expression model, providing the user with a better impression and improving the user's sense of immersion.
In addition, since the training data for training the acoustic model is the second correspondence, in which the same pronunciation element with the same duration or with different durations may correspond to different sub-acoustic features, when the trained acoustic model is subsequently used to determine target acoustic features, after the to-be-determined text feature and the durations of the pronunciation elements it identifies are input into the acoustic model, a situation similar to the training data is obtained: the target acoustic features obtained after inputting the same pronunciation element with different durations into the acoustic model may differ, and even inputting the same pronunciation element with the same duration into the acoustic model may also yield different target acoustic features.
It can be seen that, with the acoustic model trained according to the second correspondence, for the text features whose acoustic features are to be determined, the model can determine different sub-acoustic features for the same pronunciation element with different durations, increasing the variation of the virtual voice. Since the virtual voice generated from the target acoustic features determined by the acoustic model has different variation patterns for the same pronunciation element, unnatural transitions in the virtual voice are improved to a certain extent.
It can be understood that, when the expression model is used to determine the target expressive features corresponding to the durations of the pronunciation elements identified by a to-be-determined text feature, the input of the expression model is the to-be-determined text feature and the durations of the pronunciation elements it identifies, where the durations of the pronunciation elements directly determine what target expressive features are determined. That is to say, in order to determine the target expressive features corresponding to the durations of the pronunciation elements, the durations of the pronunciation elements must first be determined; they can be determined in several ways.
One way of determining the durations of the pronunciation elements is to determine them according to a duration model. To this end, the present embodiment provides a training method for a duration model: the duration model is trained according to the text features and the durations of the pronunciation elements identified by the text features.
The trained duration model can determine, for a to-be-determined text feature, the durations of the pronunciation elements it identifies. The to-be-determined text feature serves as the input of the duration model, and the durations of the pronunciation elements it identifies serve as the output of the duration model.
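The duration model's contract can be sketched as follows; since the present embodiment leaves its internals open, this stand-in simply learns the mean observed duration per pronunciation element from the training video.

```python
# Stand-in duration model (internals left open by the embodiment): learns the
# mean observed duration per pronunciation element, then predicts per element.
from collections import defaultdict

class DurationModel:
    def __init__(self):
        self.mean = {}

    def train(self, samples):          # samples: (element, duration) pairs
        grouped = defaultdict(list)
        for element, duration in samples:
            grouped[element].append(duration)
        self.mean = {e: sum(d) / len(d) for e, d in grouped.items()}

    def predict(self, elements):       # elements identified by a text feature
        return [self.mean[e] for e in elements]

dm = DurationModel()
dm.train([("ni", 0.1), ("ni", 0.3), ("hao", 0.2)])
durations = dm.predict(["ni", "hao"])
assert [round(d, 3) for d in durations] == [0.2, 0.2]
```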
The training data used in training the duration model comes from the same video as the training data used in training the expression model and the acoustic model: the text features included in the training data of the duration model, and the durations of the pronunciation elements identified by those text features, are the text features and durations used in training the expression model and the acoustic model. In this way, the durations of the pronunciation elements determined using the duration model are suitable for the expression model and the acoustic model trained in the previous embodiments; the target expressive features determined by the expression model according to the durations obtained from the duration model, and the target acoustic features determined by the acoustic model according to those durations, conform to a person's manner of expression during normal speech.
Next, the method for synthesizing a speaking expression is introduced. The method for synthesizing a speaking expression provided by the embodiments of the present application can be applied to devices providing functions related to synthesizing speaking expressions, such as terminal devices and servers. A terminal device may specifically be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), a tablet computer and the like; a server may specifically be an application server or a Web server and, in actual deployment, may be an independent server or a cluster server.
The method for determining a speaking expression in voice interaction provided by the embodiments of the present application can be applied in many application scenarios; the embodiments of the present application take two application scenarios as examples.
The first application scenario may be a game scenario, in which different users communicate through virtual objects and one user can interact with the virtual object corresponding to another user. For example, user A and user B communicate through virtual objects: user A inputs text content, user B sees the speaking expression of the virtual object corresponding to user A, and user B interacts with the virtual object corresponding to user A.
The second application scenario may be an intelligent voice assistant, such as the intelligent voice assistant Siri. When the user uses Siri, while feeding interaction information back to the user, Siri can also show the user the speaking expression of a virtual object, and the user interacts with that virtual object.
To facilitate understanding of the technical solution of the present application, the method for synthesizing a speaking expression provided by the embodiments of the present application is introduced below, taking a server as the executing subject and in conjunction with a practical application scenario.
Referring to Fig. 4, Fig. 4 is a schematic diagram of an application scenario of the method for synthesizing a speaking expression provided by the embodiments of the present application. The application scenario includes a terminal device 401 and a server 402, where the terminal device 401 is used to send the text content it obtains to the server 402, and the server 402 is used to execute the method for synthesizing a speaking expression provided by the embodiments of the present application, so as to determine the target expressive features corresponding to the text content sent by the terminal device 401.
When the server 402 needs to determine the target expressive features corresponding to the text content, the server 402 first determines the text feature corresponding to the text content and the durations of the pronunciation elements identified by the text feature; then, the server 402 inputs the text feature and the durations of the identified pronunciation elements into the expression model trained in the embodiment corresponding to Fig. 2, so as to obtain the target expressive features corresponding to the text content.
Since the expression model is trained on the first correspondence relationship, which embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to the time interval of that pronunciation element in the expression features, the same pronunciation element with the same or different durations may correspond to different sub-expression features in the first correspondence relationship. Thus, when the expression model is used to determine target expression features for a text feature whose expression features are to be determined, the model can determine different sub-expression features for the same pronunciation element with the same or different durations in that text feature, which increases the variety of speaking expressions. The speaking expression generated from the target expression features determined by the expression model therefore varies for the same pronunciation element, which alleviates, to some extent, the unnatural abruptness of speaking-expression transitions.
A method for synthesizing a speaking expression provided by the embodiments of the present application is introduced below in conjunction with the accompanying drawings. Referring to Fig. 5, Fig. 5 shows a flowchart of a method for determining a speaking expression in voice interaction. The method includes:
S501: determine the text features corresponding to text content and the durations of the pronunciation elements identified by the text features.
In this embodiment, the text content refers to the text that needs to be fed back to the user interacting with the virtual object; it may differ depending on the application scenario.
In the first application scenario mentioned above, the text content may be the text corresponding to the user's input. For example, user B sees the speaking expression of the virtual object corresponding to user A and interacts with that virtual object; in that case, the text corresponding to user A's input can serve as the text content.
In the second application scenario mentioned above, the text content may be the text corresponding to the interactive information fed back in response to the user's input. For example, after the user asks "How is the weather today?", Siri may feed back to the user interactive information that includes today's weather conditions; the text corresponding to that interactive information can then serve as the text content.
It should be noted that the user's input may be text or voice. When the input is text, the text content is either obtained by the terminal device 101 directly from the user's input or is feedback generated in response to that input; when the input is voice, the text content is obtained by the terminal device 101 recognizing the user's voice or is feedback generated in response to the recognized voice.
Feature extraction can be performed on the text content to obtain the corresponding text features, which may include multiple sub-text features; the durations of the pronunciation elements identified by the text features can then be determined from the text features.
It should be noted that, when determining the durations of the pronunciation elements identified by the text features, the durations can be obtained by passing the text features through a duration model. The duration model is trained on historical text features and the durations of the pronunciation elements identified by those historical text features. Its training method follows the introduction in the foregoing embodiments and is not repeated here.
S502: obtain the target expression features corresponding to the text content through the text features, the durations of the identified pronunciation elements, and the expression model.

That is, the text features and the durations of the identified pronunciation elements serve as the input of the expression model, so that the target expression features corresponding to the text content are obtained through the expression model.
In the target expression features, the sub-expression features corresponding to a target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element. The target pronunciation element is any pronunciation element among those identified by the text features, and the expression model is trained by the method provided in the embodiment corresponding to Fig. 2.
It should be noted that, since in some cases the text features used to train the expression model can identify the pronunciation elements in the voice and the context information corresponding to those elements, the text features used when determining target expression features with the expression model can likewise identify the pronunciation elements in the text content and their corresponding context information.
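The role the duration plays at this step can be sketched as follows. Assuming, purely for illustration, that the model consumes frame-level feature vectors built by repeating each sub-text feature for its predicted number of frames, the same pronunciation element with a different duration produces different frame-level inputs and hence different sub-expression features, which is exactly the behavior this embodiment describes. The toy "model" below is a stand-in for the trained network, and all names are assumptions:

```python
# Illustrative sketch: expand each pronunciation element to frame-level
# features using its predicted duration, then map each frame to a
# sub-expression feature.

def frame_level_features(phonemes, durations):
    """One (phoneme, frame_index, duration) tuple per output frame."""
    feats = []
    for p, d in zip(phonemes, durations):
        feats.extend((p, i, d) for i in range(d))
    return feats

def expression_model(frame_feats):
    """Toy stand-in for the trained model: mouth-openness value per frame."""
    return [round(i / d, 2) for (_p, i, d) in frame_feats]

# Same phoneme "a" with durations 2 and 4 yields two different sub-expression
# feature sequences, because the frame-level input encodes the duration.
feats = frame_level_features(["a", "a"], [2, 4])
print(expression_model(feats))  # [0.0, 0.5, 0.0, 0.25, 0.5, 0.75]
```

Note how the two occurrences of the same element trace different curves: this is the duration-dependent variation the first correspondence relationship is meant to capture.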
It can be seen from the above technical solution that, since the expression model is trained on the first correspondence relationship, which embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to the time interval of that element in the expression features, the same pronunciation element with the same or different durations may correspond to different sub-expression features in the first correspondence relationship. Thus, when the expression model is used to determine target expression features for a text feature whose expression features are to be determined, the model can determine different sub-expression features for the same pronunciation element with the same or different durations in that text feature, which increases the variety of speaking expressions. The speaking expression generated from the target expression features determined by the expression model varies for the same pronunciation element, which alleviates, to some extent, the unnatural abruptness of speaking-expression transitions.
It can be understood that synthesizing a speaking expression by the method provided in the embodiment corresponding to Fig. 5 increases the variety of the speaking expression and alleviates unnatural expression transitions. In human-computer interaction, however, the speaking expression of the virtual object is not only displayed to the user; an interactive virtual sound is also played. If the virtual sound is generated in an existing way, it may fail to match the speaking expression. For this case, the embodiments of the present application provide a method for synthesizing a virtual sound that matches the speaking expression. The method may include obtaining the target acoustic features corresponding to the text content through the text features, the durations of the identified pronunciation elements, and an acoustic model.
In the target acoustic features, the sub-acoustic features corresponding to the target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element; the acoustic model is trained by the method provided in the embodiment corresponding to Fig. 3.
Because the training data of the acoustic model used to determine the target acoustic features and the training data of the expression model used to determine the target expression features come from the same video and correspond to the same time axis, the speaker's sound and the speaker's expression match whenever the speaker utters a pronunciation element. In this way, the virtual sound generated from the target acoustic features determined by the acoustic model matches the speaking expression generated from the target expression features determined by the expression model, providing the user with a better experience and improving the user's sense of immersion.
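Why the virtual sound and the speaking expression stay matched can be illustrated with a small sketch: if both models consume the same frame-level input built from one set of predicted durations, their outputs are produced frame by frame on a shared time axis. The functions below are illustrative stand-ins under that assumption, not the patent's trained models:

```python
def expand(phonemes, durations):
    """Shared frame-level input: one (phoneme, frame_index, duration) per frame."""
    return [(p, i, d) for p, d in zip(phonemes, durations) for i in range(d)]

def expr_model(frames):
    """Toy sub-expression feature per frame (e.g. mouth openness)."""
    return [i / d for (_, i, d) in frames]

def acoust_model(frames):
    """Toy sub-acoustic feature per frame (e.g. an energy envelope)."""
    return [1.0 - i / d for (_, i, d) in frames]

frames = expand(["m", "a"], [2, 3])
expr = expr_model(frames)
sound = acoust_model(frames)
print(len(expr), len(sound))  # 5 5 -- one value per frame on the shared time axis
```

Since frame i of the expression output and frame i of the acoustic output always come from the same pronunciation element at the same moment, the rendered face and the synthesized sound cannot drift apart.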
Next, based on the model training method and the methods for synthesizing a speaking expression and a virtual sound provided by the embodiments of the present application, a method for generating visualized speech synthesis in human-computer interaction is introduced in conjunction with a specific application scenario.
The application scenario may be a game scenario in which user A and user B communicate through virtual objects: user A inputs text content, and user B sees the speaking expression of the virtual object corresponding to user A, hears the virtual sound, and interacts with that virtual object. Referring to Fig. 6, Fig. 6 shows a schematic architecture diagram of a method for generating visualized speech synthesis in human-computer interaction.
As shown in Fig. 6, the architecture diagram includes a model training part and a synthesis part. In the model training part, a video containing the speaker's facial expressions and the corresponding voice can be collected. Text analysis and prosodic analysis are performed on the text corresponding to the voice spoken by the speaker to extract the text features; acoustic feature extraction is performed on the voice to extract the acoustic features; and facial feature extraction is performed on the facial expressions shown while the voice is spoken to extract the expression features. The voice is processed by a forced alignment module, which determines, from the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features.
Then, the expression model is trained from the durations of the pronunciation elements identified by the text features, the corresponding expression features, and the text features; the acoustic model is trained from those durations, the corresponding acoustic features, and the text features; and the duration model is trained from the text features and the durations of the pronunciation elements they identify. At this point, the model training part has completed the training of the required models.
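The training-side bookkeeping just described can be sketched as follows, with illustrative data: forced alignment yields a time interval per pronunciation element, and slicing the frame sequence of expression features on those intervals yields the (pronunciation element, duration, sub-expression features) triples that constitute the first correspondence relationship used to train the expression model. All values here are toy assumptions:

```python
# Sketch: build the "first correspondence relationship" from forced-alignment
# output.  Each alignment entry maps a pronunciation element to its (start,
# end) frame interval; the expression features are one value per video frame.

def first_correspondence(alignment, expr_frames):
    """Return (element, duration, sub-expression-features) per aligned element."""
    pairs = []
    for phoneme, (start, end) in alignment:
        duration = end - start
        pairs.append((phoneme, duration, expr_frames[start:end]))
    return pairs

alignment = [("n", (0, 3)), ("i", (3, 8))]   # frame intervals from forced alignment
expr_frames = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
pairs = first_correspondence(alignment, expr_frames)
print(pairs[1])  # ('i', 5, [0.4, 0.5, 0.6, 0.7, 0.8])
```

Collecting such triples over many sentences gives the same element with varying durations mapped to varying sub-expression features, which is precisely what the expression model learns from.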
The synthesis part is entered next, in which the trained expression model, acoustic model, and duration model are used to complete visualized speech synthesis. Specifically, text analysis and prosodic analysis are performed on the text content of the visualized speech to be synthesized to obtain the corresponding text features, and the text features are input into the duration model for duration prediction to obtain the durations of the pronunciation elements they identify. The frame-level feature vectors generated from the text features together with those durations are input into the expression model for expression prediction to obtain the target expression features corresponding to the text content, and into the acoustic model for acoustic prediction to obtain the target acoustic features corresponding to the text content. Finally, the obtained target expression features and target acoustic features are rendered into an animation, thereby obtaining the visualized speech.
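Putting the synthesis part of Fig. 6 together, a minimal end-to-end sketch might look like the following, with toy stand-ins for the trained duration, expression, and acoustic models; every component and name here is an illustrative assumption, not the patent's implementation:

```python
# End-to-end sketch of the synthesis pipeline: duration prediction ->
# frame-level features -> frame-aligned (expression, acoustic) outputs.

def synthesize(phonemes, duration_model):
    durations = [duration_model.get(p, 10) for p in phonemes]   # duration model
    frames = [(p, i, d) for p, d in zip(phonemes, durations)
              for i in range(d)]                                # frame-level features
    expr = [i / d for (_, i, d) in frames]                      # expression model
    acoustic = [1.0 - i / d for (_, i, d) in frames]            # acoustic model
    return list(zip(expr, acoustic))                            # pairs to "render"

out = synthesize(["n", "i"], {"n": 2, "i": 3})
print(len(out))  # 5 frame-aligned (expression, acoustic) pairs
```

A real system would render each pair into a video frame of the animated face plus a slice of waveform; the point of the sketch is only that one duration prediction drives both outputs, keeping them on the same time axis.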
On the one hand, the visualized speech obtained through the above scheme increases the variety of the speaking expression and the virtual speech and alleviates, to some extent, the unnatural abruptness of speaking-expression transitions. On the other hand, since the training data used to train the acoustic model and the training data used to train the expression model come from the same video and correspond to the same time axis, the speaker's sound and the speaker's expression match whenever the speaker utters a pronunciation element. In this way, the virtual sound generated from the target acoustic features determined by the acoustic model matches the speaking expression generated from the target expression features determined by the expression model, so the synthesized visualized speech provides the user with a better experience and improves the user's sense of immersion.
Based on the model training method for synthesizing a speaking expression and the method for synthesizing a speaking expression provided by the foregoing embodiments, the relevant apparatuses provided by the embodiments of the present application are introduced below. This embodiment provides a model training apparatus 700 for synthesizing a speaking expression. Referring to Fig. 7a, the apparatus 700 includes an acquiring unit 701, a first determination unit 702, a second determination unit 703, and a first training unit 704:
The acquiring unit 701 is configured to obtain a video that includes the speaker's facial expressions and the corresponding voice, and to obtain, from the video, the expression features of the speaker, the acoustic features of the voice, and the text features of the voice; the acoustic features include multiple sub-acoustic features.
The first determination unit 702 is configured to determine, from the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features. Any pronunciation element identified by the text features is a target pronunciation element; the time interval of the target pronunciation element is the time interval, in the video, of the sub-acoustic features corresponding to that element in the acoustic features, and the duration of the target pronunciation element is the duration of those corresponding sub-acoustic features.
The second determination unit 703 is configured to determine the first correspondence relationship from the time intervals and durations of the pronunciation elements identified by the text features and from the expression features; the first correspondence relationship embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to that element's time interval in the expression features.
The first training unit 704 is configured to train the expression model according to the first correspondence relationship; the expression model determines the corresponding target expression features from to-be-determined text features and the durations of the pronunciation elements those text features identify.
In one implementation, referring to Fig. 7b, the apparatus 700 further includes a third determination unit 705 and a second training unit 706:
The third determination unit 705 is configured to determine a second correspondence relationship between the pronunciation elements identified by the text features and the acoustic features; the second correspondence relationship embodies the correspondence between the duration of a pronunciation element and the sub-acoustic features corresponding to that element in the acoustic features.
The second training unit 706 is configured to train the acoustic model according to the second correspondence relationship; the acoustic model determines the corresponding target acoustic features from to-be-determined text features and the durations of the pronunciation elements those text features identify.
In one implementation, referring to Fig. 7c, the apparatus 700 further includes a third training unit 707: the third training unit 707 is configured to train the duration model from the text features and the durations of the pronunciation elements identified by the text features; the duration model determines, from to-be-determined text features, the durations of the pronunciation elements those text features identify.
In one implementation, the text features identify the pronunciation elements in the voice and the context information corresponding to the pronunciation elements.
The embodiments of the present application also provide an apparatus 800 for synthesizing a speaking expression. Referring to Fig. 8a, the apparatus 800 includes a determination unit 801 and a first acquisition unit 802:
The determination unit 801 is configured to determine the text features corresponding to text content and the durations of the pronunciation elements identified by the text features; the text features include multiple sub-text features.
The first acquisition unit 802 is configured to obtain, through the text features, the durations of the identified pronunciation elements, and the expression model, the target expression features corresponding to the text content. The target expression features include multiple sub-expression features; any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features the sub-expression features corresponding to the target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element.
In one implementation, the expression model is trained on the first correspondence relationship, which embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to that element's time interval in the expression features.
In one implementation, referring to Fig. 8b, the apparatus 800 further includes a second acquisition unit 803: the second acquisition unit 803 is configured to obtain, through the text features, the durations of the identified pronunciation elements, and the acoustic model, the target acoustic features corresponding to the text content. In the target acoustic features, the sub-acoustic features corresponding to the target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element.
The acoustic model is trained on the second correspondence relationship, which embodies the correspondence between the duration of a pronunciation element and the sub-acoustic features corresponding to that element in the acoustic features.
In one implementation, the determination unit 801 is specifically configured to obtain, through the text features and the duration model, the durations of the pronunciation elements identified by the text features; the duration model is trained on historical text features and the durations of the pronunciation elements identified by those historical text features.
In one implementation, the text features identify the pronunciation elements in the text content and the context information corresponding to the pronunciation elements.
It can be seen from the above technical solution that, in order to determine varied and naturally changing speaking expressions for a virtual object, the embodiments of the present application provide a completely new expression model training apparatus. From a video containing the speaker's facial expressions and the corresponding voice, the apparatus obtains the expression features of the speaker, the acoustic features of the voice, and the text features of the voice. Since the acoustic features and the text features are obtained from the same video, the time intervals and durations of the pronunciation elements identified by the text features can be determined from the acoustic features. The first correspondence relationship is determined from those time intervals and durations and from the expression features; it embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to that element's time interval in the expression features. For a target pronunciation element among the identified pronunciation elements, the sub-expression features within its time interval can be determined from the expression features, and its duration can embody the various durations of that element across the various expressive sentences of the video's speech; the sub-expression features so determined can therefore embody the possible expressions with which the speaker utters the target pronunciation element in different expressive sentences. Consequently, for a text feature whose expression features are to be determined in voice interaction, the apparatus for determining a speaking expression can, through the expression model trained on the first correspondence relationship, determine different sub-expression features for the same pronunciation element with different durations, increasing the variety of speaking expressions. The speaking expression generated from the target expression features determined by the expression model varies for the same pronunciation element, which alleviates, to some extent, unnatural expression transitions.
The embodiments of the present application also provide a server, which can serve as the model training device for synthesizing a speaking expression or as the device for synthesizing a speaking expression. The server is introduced below in conjunction with the accompanying drawings. Referring to Fig. 9, the server 900 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 922 (for example, one or more processors), memory 932, and one or more storage media 930 (such as one or more mass storage devices) storing application programs 942 or data 944. The memory 932 and the storage medium 930 may provide transient or persistent storage. The program stored in the storage medium 930 may include one or more modules (not marked in the figure), each of which may include a series of instruction operations on the server. Further, the central processing unit 922 may be configured to communicate with the storage medium 930 and to execute, on the device 900, the series of instruction operations in the storage medium 930.
The device 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments can be based on the server structure shown in Fig. 9, where the CPU 922 is configured to execute the following steps:
Obtain a video that includes the speaker's facial expressions and the corresponding voice;

Obtain, from the video, the expression features of the speaker, the acoustic features of the voice, and the text features of the voice, the acoustic features including multiple sub-acoustic features; determine, from the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features, where any pronunciation element identified by the text features is a target pronunciation element, the time interval of the target pronunciation element is the time interval, in the video, of the sub-acoustic features corresponding to that element in the acoustic features, and the duration of the target pronunciation element is the duration of those corresponding sub-acoustic features;
Determine the first correspondence relationship from the time intervals and durations of the pronunciation elements identified by the text features and from the expression features, the first correspondence relationship embodying the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to that element's time interval in the expression features;

Train the expression model according to the first correspondence relationship, the expression model determining the corresponding target expression features from to-be-determined text features and the durations of the pronunciation elements those text features identify.
Alternatively, the CPU 922 is configured to execute the following steps:
Determine the text features corresponding to text content and the durations of the pronunciation elements identified by the text features, the text features including multiple sub-text features;

Obtain, through the text features, the durations of the identified pronunciation elements, and the expression model, the target expression features corresponding to the text content; the target expression features include multiple sub-expression features, any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features the sub-expression features corresponding to the target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element.
Referring to Fig. 10, the embodiments of the present application provide a terminal device, which can serve as the device for synthesizing a speaking expression. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and the like. Taking a mobile phone as the terminal device as an example:
Fig. 10 shows a block diagram of part of the structure of a mobile phone related to the terminal device provided by the embodiments of the present application. Referring to Fig. 10, the mobile phone includes components such as a radio frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a processor 1080, and a power supply 1090. Those skilled in the art will understand that the mobile phone structure shown in Fig. 10 does not constitute a limitation on the mobile phone, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Each component of the mobile phone is introduced below with reference to Fig. 10:
The RF circuit 1010 can be used to receive and send signals during messaging or a call; in particular, after receiving downlink information from a base station, it delivers the information to the processor 1080 for processing, and it sends uplink data to the base station. In general, the RF circuit 1010 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1010 can also communicate with networks and other devices through wireless communication, which can use any communication standard or protocol, including but not limited to the Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
The memory 1020 can be used to store software programs and modules; the processor 1080 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required for at least one function (such as a sound playing function and an image playing function), and the data storage area can store data created according to the use of the mobile phone (such as audio data and a phone book). In addition, the memory 1020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The input unit 1030 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also referred to as a touch screen, collects the user's touch operations on or near it (such as operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 1031) and drives the corresponding connected apparatus according to a preset program. Optionally, the touch panel 1031 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch orientation and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 1080, and can receive and execute commands sent by the processor 1080. In addition, the touch panel 1031 can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave types. Besides the touch panel 1031, the input unit 1030 may also include other input devices 1032, which may specifically include but are not limited to one or more of a physical keyboard, function keys (such as a volume control button and a power switch button), a trackball, a mouse, a joystick, and the like.
The display unit 1040 can be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. The display unit 1040 may include a display panel 1041, which may optionally be configured in a form such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display. Further, the touch panel 1031 can cover the display panel 1041; after detecting a touch operation on or near it, the touch panel 1031 transmits the operation to the processor 1080 to determine the type of the touch event, and the processor 1080 then provides a corresponding visual output on the display panel 1041 according to the type of the touch event. Although in Fig. 10 the touch panel 1031 and the display panel 1041 serve as two independent components to implement the input and output functions of the mobile phone, in some embodiments the touch panel 1031 and the display panel 1041 can be integrated to implement the input and output functions of the mobile phone.
The mobile phone may further include at least one sensor 1050, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally along three axes), can detect the magnitude and direction of gravity when the phone is static, and can be used for applications that identify the phone's posture (such as landscape/portrait switching, related games, and magnetometer pose calibration) and for vibration-recognition functions (such as a pedometer or tap detection). Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured on the mobile phone; details are not described herein.
The audio circuit 1060, a loudspeaker 1061, and a microphone 1062 may provide an audio interface between the user and the mobile phone. The audio circuit 1060 may convert received audio data into an electrical signal and transmit it to the loudspeaker 1061, which converts it into a sound signal for output. Conversely, the microphone 1062 converts a collected sound signal into an electrical signal, which the audio circuit 1060 receives and converts into audio data; after the audio data is output to the processor 1080 for processing, it is sent through the RF circuit 1010 to, for example, another mobile phone, or output to the memory 1020 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1070, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and so on; it provides wireless broadband Internet access for the user. Although Figure 10 shows the WiFi module 1070, it can be understood that the module is not an essential component of the mobile phone and may be omitted as needed within the scope that does not change the essence of the invention.
The processor 1080 is the control center of the mobile phone. It connects all parts of the entire mobile phone through various interfaces and lines, and executes the various functions of the mobile phone and processes data by running or executing software programs and/or modules stored in the memory 1020 and invoking data stored in the memory 1020, thereby monitoring the mobile phone as a whole. Optionally, the processor 1080 may include one or more processing units. Preferably, the processor 1080 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1080.
The mobile phone further includes a power supply 1090 (such as a battery) that supplies power to all components. Preferably, the power supply may be logically connected to the processor 1080 through a power management system, so that functions such as charging management, discharging management, and power-consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like; details are not described herein.
In this embodiment, the processor 1080 included in the terminal device also has the following functions:

determining the text feature corresponding to the text content and the durations of the pronunciation elements identified by the text feature, where the text feature includes multiple sub-text features;

obtaining, through the text feature, the durations of the identified pronunciation elements, and an expression model, the target expression feature corresponding to the text content, where the target expression feature includes multiple sub-expression features; any one pronunciation element identified by the text feature is a target pronunciation element, and in the target expression feature, the sub-expression feature corresponding to the target pronunciation element is determined according to the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element.
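The processor's flow above (text content → text feature with per-element durations → expression model → per-element sub-expression features) can be paraphrased as a minimal sketch. All function names and the stub models here are hypothetical; the patent does not specify the front end or the internals of the expression model. The key property illustrated is the claim's determination rule: each sub-expression feature depends on both the element's sub-text feature and its duration, so the same pronunciation element with different durations yields different sub-expression features.

```python
# Hypothetical sketch of the inference pipeline described above.

def extract_text_feature(text: str) -> list[tuple[str, str]]:
    # Placeholder front end: one (pronunciation element, sub-text feature) per character.
    return [(ch, f"ctx:{i}") for i, ch in enumerate(text)]

def predict_durations(feature: list[tuple[str, str]]) -> list[float]:
    # Placeholder duration model: a fixed 0.2 s per pronunciation element.
    return [0.2 for _ in feature]

def expression_model(sub_text_feature: str, duration: float) -> dict:
    # Stub: the sub-expression feature depends on BOTH the sub-text feature
    # and the duration, matching the determination rule in the claims.
    return {"mouth_open": min(1.0, duration * 2), "context": sub_text_feature}

def synthesize_expression(text: str) -> list[dict]:
    feature = extract_text_feature(text)
    durations = predict_durations(feature)
    return [expression_model(sub, d) for (_, sub), d in zip(feature, durations)]

target = synthesize_expression("hi")
print(len(target))  # one sub-expression feature per pronunciation element
```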
The embodiments of the present application also provide a computer-readable storage medium for storing program code, where the program code is used to execute the model training method for synthesizing speaking expressions described in the embodiments corresponding to Figures 2 to 3, or the method for synthesizing speaking expressions described in the embodiment corresponding to Figure 5.
The terms "first", "second", "third", "fourth", and so on (if any) in the specification and the above drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application described herein can, for example, be implemented in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
It should be understood that in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of single items or plural items. For example, "at least one of a, b, or c" may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and in actual implementation there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (12)
- 1. A method for synthesizing a speaking expression based on artificial intelligence, characterized in that the method includes: obtaining text content sent by a terminal; determining a text feature corresponding to the text content and durations of pronunciation elements identified by the text feature, where the text feature includes multiple sub-text features; obtaining, through an expression model, a target expression feature corresponding to the text feature and the durations of the identified pronunciation elements, where the target expression feature includes multiple sub-expression features, any one pronunciation element identified by the text feature is a target pronunciation element, and in the target expression feature the sub-expression feature corresponding to the target pronunciation element is determined according to the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element; and returning the target expression feature to the terminal.
- 2. The method according to claim 1, characterized in that the method further includes: obtaining, through the text feature, the durations of the identified pronunciation elements, and an acoustic model, a target acoustic feature corresponding to the text content, where in the target acoustic feature the sub-acoustic feature corresponding to the target pronunciation element is determined according to the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element.
- 3. The method according to claim 1, characterized in that determining the text feature corresponding to the text content and the durations of the pronunciation elements identified by the text feature includes: obtaining, through the text feature and a duration model, the durations of the pronunciation elements identified by the text feature.
- 4. The method according to any one of claims 1 to 3, characterized in that the text content is text fed back to a user interacting with a virtual object, and the text content includes text corresponding to the user's input content, or text corresponding to interactive information fed back according to the user's input content.
- 5. The method according to any one of claims 1 to 3, characterized in that the text feature is used to identify the pronunciation elements in the text content and the context information corresponding to the pronunciation elements.
- 6. The method according to any one of claims 1 to 3, characterized in that the expression features include at least mouth-shape features.
- 7. An apparatus for synthesizing a speaking expression based on artificial intelligence, characterized in that the apparatus includes an acquiring unit, a determining unit, a first acquiring unit, and a returning unit: the acquiring unit is configured to obtain text content sent by a terminal; the determining unit is configured to determine a text feature corresponding to the text content and durations of pronunciation elements identified by the text feature, where the text feature includes multiple sub-text features; the first acquiring unit is configured to obtain, through an expression model, a target expression feature corresponding to the text feature and the durations of the identified pronunciation elements, where the target expression feature includes multiple sub-expression features, any one pronunciation element identified by the text feature is a target pronunciation element, and in the target expression feature the sub-expression feature corresponding to the target pronunciation element is determined according to the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element; and the returning unit is configured to return the target expression feature to the terminal.
- 8. A method for synthesizing a speaking expression based on artificial intelligence, characterized in that the method includes: obtaining a video containing a speaker's facial-action expressions and the corresponding speech; obtaining, from the video, the expression features of the speaker, the acoustic features of the speech, and the text features of the speech, where the acoustic features include multiple sub-acoustic features; determining, according to the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features, where any one pronunciation element identified by the text features is a target pronunciation element, the time interval of the target pronunciation element is the time interval, in the video, of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic features, and the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element; training an expression model and an acoustic model according to the text features, the time intervals and durations of the pronunciation elements identified by the text features, the expression features, and the acoustic features, where the expression model is used to determine a corresponding target expression feature according to a to-be-determined text feature and the durations of the pronunciation elements identified by the to-be-determined text feature, and the acoustic model is used to determine a corresponding target acoustic feature according to a to-be-determined text feature and the durations of the pronunciation elements identified by the to-be-determined text feature; obtaining text content sent by a terminal; determining the text feature corresponding to the text content and the durations of the pronunciation elements identified by the text feature, where the text feature includes multiple sub-text features; obtaining, through the text feature, the durations of the identified pronunciation elements, and the expression model and the acoustic model, the target expression feature and target acoustic feature corresponding to the text content; and rendering the target expression feature and the target acoustic feature to generate an animation.
- 9. The method according to claim 8, characterized in that training the expression model and the acoustic model according to the text features, the time intervals and durations of the pronunciation elements identified by the text features, the expression features, and the acoustic features includes: determining a first correspondence according to the text features, the time intervals and durations of the identified pronunciation elements, and the expression features, where the first correspondence is used to reflect the correspondence between the duration and time interval of a pronunciation element and the corresponding sub-expression feature in the expression features; training the expression model according to the first correspondence; determining a second correspondence between the pronunciation elements identified by the text features and the acoustic features, where the second correspondence is used to reflect the correspondence between the duration of a pronunciation element and the corresponding sub-acoustic feature in the acoustic features; and training the acoustic model according to the second correspondence.
- 10. The method according to claim 8 or 9, characterized in that the method further includes: training a duration model according to the text features and the durations of the pronunciation elements identified by the text features, where the duration model is used to determine, according to a to-be-determined text feature, the durations of the pronunciation elements identified by the to-be-determined text feature; and the durations of the pronunciation elements identified by the text feature are determined as follows: obtaining, through the text feature and the duration model, the durations of the pronunciation elements identified by the text feature.
- 11. A device for synthesizing a speaking expression based on artificial intelligence, characterized in that the device includes a processor and a memory: the memory is configured to store program code and transfer the program code to the processor; and the processor is configured to execute, according to instructions in the program code, the method for synthesizing a speaking expression based on artificial intelligence according to any one of claims 1-6 or 7-10.
- 12. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store program code, and the program code is used to execute the method for synthesizing a speaking expression based on artificial intelligence according to any one of claims 1-6 or 7-10.
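The training flow of claims 8-10 can be paraphrased as a minimal sketch under stated assumptions: alignment of pronunciation elements to the audio yields per-element time intervals and durations, from which the first correspondence (duration and time interval → sub-expression feature) and second correspondence (duration → sub-acoustic feature) are built. The claims do not fix an alignment method or model family; the equal-length "forced alignment" and all names below are hypothetical placeholders.

```python
# Hypothetical sketch of the data preparation in claims 8-10.

def align(elements: list[str], element_len: float) -> list[tuple[float, float, float]]:
    # Toy forced alignment: equal-length elements -> (start, end, duration).
    return [(i * element_len, (i + 1) * element_len, element_len)
            for i in range(len(elements))]

def build_correspondences(elements, alignment, expr_frames, frame_rate=100):
    first, second = [], []
    for elem, (start, end, dur) in zip(elements, alignment):
        # First correspondence: (element, duration, time interval) -> the
        # sub-expression feature, i.e. the expression frames inside the interval.
        frames = expr_frames[int(start * frame_rate):int(end * frame_rate)]
        first.append(((elem, dur, (start, end)), frames))
        # Second correspondence: (element, duration) -> its sub-acoustic feature.
        second.append(((elem, dur), f"sub_acoustic[{start:.2f}-{end:.2f}]"))
    return first, second

elements = ["n", "i", "h", "ao"]                     # pronunciation elements
alignment = align(elements, 0.1)                     # 0.1 s per element
expr_frames = list(range(40))                        # 0.4 s of expression frames at 100 fps
first, second = build_correspondences(elements, alignment, expr_frames)
print(len(first), len(second))  # one pair per pronunciation element
```

The expression model would then be fit on `first` and the acoustic model on `second`, as claim 9 describes.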
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910745062.8A CN110288077B (en) | 2018-11-14 | 2018-11-14 | Method and related device for synthesizing speaking expression based on artificial intelligence |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811354206.9A CN109447234B (en) | 2018-11-14 | 2018-11-14 | Model training method, method for synthesizing speaking expression and related device |
CN201910745062.8A CN110288077B (en) | 2018-11-14 | 2018-11-14 | Method and related device for synthesizing speaking expression based on artificial intelligence |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811354206.9A Division CN109447234B (en) | 2018-11-14 | 2018-11-14 | Model training method, method for synthesizing speaking expression and related device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110288077A true CN110288077A (en) | 2019-09-27 |
CN110288077B CN110288077B (en) | 2022-12-16 |
Family
ID=65552918
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910745062.8A Active CN110288077B (en) | 2018-11-14 | 2018-11-14 | Method and related device for synthesizing speaking expression based on artificial intelligence |
CN201811354206.9A Active CN109447234B (en) | 2018-11-14 | 2018-11-14 | Model training method, method for synthesizing speaking expression and related device |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811354206.9A Active CN109447234B (en) | 2018-11-14 | 2018-11-14 | Model training method, method for synthesizing speaking expression and related device |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110288077B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111369687A (en) * | 2020-03-04 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Method and device for synthesizing action sequence of virtual object |
CN111369967A (en) * | 2020-03-11 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Virtual character-based voice synthesis method, device, medium and equipment |
Families Citing this family (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US8676904B2 (en) | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US20120309363A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Triggering notifications associated with tasks items that represent tasks to perform |
US10417037B2 (en) | 2012-05-15 | 2019-09-17 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
EP3809407A1 (en) | 2013-02-07 | 2021-04-21 | Apple Inc. | Voice trigger for a digital assistant |
US10652394B2 (en) | 2013-03-14 | 2020-05-12 | Apple Inc. | System and method for processing voicemail |
US10748529B1 (en) | 2013-03-15 | 2020-08-18 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9966065B2 (en) | 2014-05-30 | 2018-05-08 | Apple Inc. | Multi-command single utterance input method |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US10200824B2 (en) | 2015-05-27 | 2019-02-05 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device |
US20160378747A1 (en) | 2015-06-29 | 2016-12-29 | Apple Inc. | Virtual assistant for media playback |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10740384B2 (en) | 2015-09-08 | 2020-08-11 | Apple Inc. | Intelligent automated assistant for media search and playback |
US10331312B2 (en) | 2015-09-08 | 2019-06-25 | Apple Inc. | Intelligent automated assistant in a media environment |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
DK180048B1 (en) | 2017-05-11 | 2020-02-04 | Apple Inc. | MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK201770429A1 (en) | 2017-05-12 | 2018-12-14 | Apple Inc. | Low-latency intelligent automated assistant |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
US20180336892A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Detecting a trigger of a digital assistant |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
DK179822B1 (en) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc | DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
DK201970509A1 (en) | 2019-05-06 | 2021-01-15 | Apple Inc | Spoken notifications |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
DK180129B1 (en) | 2019-05-31 | 2020-06-02 | Apple Inc. | User activity shortcut suggestions |
DK201970511A1 (en) | 2019-05-31 | 2021-02-15 | Apple Inc | Voice identification in digital assistant systems |
US11468890B2 (en) | 2019-06-01 | 2022-10-11 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
CN110531860B (en) | 2019-09-02 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Animation image driving method and device based on artificial intelligence |
CN111260761B (en) * | 2020-01-15 | 2023-05-09 | 北京猿力未来科技有限公司 | Method and device for generating mouth shape of animation character |
US11593984B2 (en) | 2020-02-07 | 2023-02-28 | Apple Inc. | Using text for avatar animation |
CN111225237B (en) * | 2020-04-23 | 2020-08-21 | 腾讯科技(深圳)有限公司 | Sound and picture matching method of video, related device and storage medium |
US11038934B1 (en) | 2020-05-11 | 2021-06-15 | Apple Inc. | Digital assistant hardware abstraction |
US11061543B1 (en) | 2020-05-11 | 2021-07-13 | Apple Inc. | Providing relevant data items based on context |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11490204B2 (en) | 2020-07-20 | 2022-11-01 | Apple Inc. | Multi-device audio adjustment coordination |
US11438683B2 (en) | 2020-07-21 | 2022-09-06 | Apple Inc. | User identification using headphones |
CN112396182B (en) * | 2021-01-19 | 2021-04-16 | 腾讯科技(深圳)有限公司 | Method for training face driving model and generating face mouth shape animation |
CN113079328B (en) * | 2021-03-19 | 2023-03-28 | 北京有竹居网络技术有限公司 | Video generation method and device, storage medium and electronic equipment |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1971621A (en) * | 2006-11-10 | 2007-05-30 | 中国科学院计算技术研究所 | Generating method of cartoon face driven by voice and text together |
CN101176146A (en) * | 2005-05-18 | 2008-05-07 | 松下电器产业株式会社 | Speech synthesizer |
CN101474481A (en) * | 2009-01-12 | 2009-07-08 | 北京科技大学 | Emotional robot system |
CN104063427A (en) * | 2014-06-06 | 2014-09-24 | 北京搜狗科技发展有限公司 | Expression input method and device based on semantic understanding |
US20150042662A1 (en) * | 2013-08-08 | 2015-02-12 | Kabushiki Kaisha Toshiba | Synthetic audiovisual storyteller |
JP2015125613A (en) * | 2013-12-26 | 2015-07-06 | Kddi株式会社 | Animation generation device, data format, animation generation method and program |
CN104850335A (en) * | 2015-05-28 | 2015-08-19 | 瞬联软件科技(北京)有限公司 | Expression curve generating method based on voice input |
CN105760852A (en) * | 2016-03-14 | 2016-07-13 | 江苏大学 | Driver emotion real time identification method fusing facial expressions and voices |
CN105931631A (en) * | 2016-04-15 | 2016-09-07 | 北京地平线机器人技术研发有限公司 | Voice synthesis system and method |
US20160321243A1 (en) * | 2014-01-10 | 2016-11-03 | Cluep Inc. | Systems, devices, and methods for automatic detection of feelings in text |
CN106293074A (en) * | 2016-07-29 | 2017-01-04 | 维沃移动通信有限公司 | A kind of Emotion identification method and mobile terminal |
CN107301168A (en) * | 2017-06-01 | 2017-10-27 | 深圳市朗空亿科科技有限公司 | Intelligent robot and its mood exchange method, system |
CN107634901A (en) * | 2017-09-19 | 2018-01-26 | 广东小天才科技有限公司 | Method for pushing, pusher and the terminal device of session expression |
WO2018084305A1 (en) * | 2016-11-07 | 2018-05-11 | ヤマハ株式会社 | Voice synthesis method |
CN108320021A (en) * | 2018-01-23 | 2018-07-24 | 深圳狗尾草智能科技有限公司 | Robot motion determines method, displaying synthetic method, device with expression |
CN108597541A (en) * | 2018-04-28 | 2018-09-28 | 南京师范大学 | A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2510200B (en) * | 2013-01-29 | 2017-05-10 | Toshiba Res Europe Ltd | A computer generated head |
CN105206258B (en) * | 2015-10-19 | 2018-05-04 | 百度在线网络技术(北京)有限公司 | The generation method and device and phoneme synthesizing method and device of acoustic model |
CN108763190B (en) * | 2018-04-12 | 2019-04-02 | 平安科技(深圳)有限公司 | Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing |
2018
- 2018-11-14 CN CN201910745062.8A patent/CN110288077B/en active Active
- 2018-11-14 CN CN201811354206.9A patent/CN109447234B/en active Active
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101176146A (en) * | 2005-05-18 | 2008-05-07 | 松下电器产业株式会社 | Speech synthesizer |
US20090234652A1 (en) * | 2005-05-18 | 2009-09-17 | Yumiko Kato | Voice synthesis device |
CN1971621A (en) * | 2006-11-10 | 2007-05-30 | 中国科学院计算技术研究所 | Generating method of cartoon face driven by voice and text together |
CN101474481A (en) * | 2009-01-12 | 2009-07-08 | 北京科技大学 | Emotional robot system |
US20150042662A1 (en) * | 2013-08-08 | 2015-02-12 | Kabushiki Kaisha Toshiba | Synthetic audiovisual storyteller |
JP2015125613A (en) * | 2013-12-26 | 2015-07-06 | Kddi株式会社 | Animation generation device, data format, animation generation method and program |
US20160321243A1 (en) * | 2014-01-10 | 2016-11-03 | Cluep Inc. | Systems, devices, and methods for automatic detection of feelings in text |
CN104933113A (en) * | 2014-06-06 | 2015-09-23 | 北京搜狗科技发展有限公司 | Expression input method and device based on semantic understanding |
CN104063427A (en) * | 2014-06-06 | 2014-09-24 | 北京搜狗科技发展有限公司 | Expression input method and device based on semantic understanding |
CN104850335A (en) * | 2015-05-28 | 2015-08-19 | 瞬联软件科技(北京)有限公司 | Expression curve generating method based on voice input |
CN105760852A (en) * | 2016-03-14 | 2016-07-13 | 江苏大学 | Driver emotion real time identification method fusing facial expressions and voices |
CN105931631A (en) * | 2016-04-15 | 2016-09-07 | 北京地平线机器人技术研发有限公司 | Voice synthesis system and method |
CN106293074A (en) * | 2016-07-29 | 2017-01-04 | 维沃移动通信有限公司 | Emotion recognition method and mobile terminal |
WO2018084305A1 (en) * | 2016-11-07 | 2018-05-11 | ヤマハ株式会社 | Voice synthesis method |
CN107301168A (en) * | 2017-06-01 | 2017-10-27 | 深圳市朗空亿科科技有限公司 | Intelligent robot and emotion interaction method and system therefor |
CN107634901A (en) * | 2017-09-19 | 2018-01-26 | 广东小天才科技有限公司 | Push method and push apparatus for conversation expressions, and terminal device |
CN108320021A (en) * | 2018-01-23 | 2018-07-24 | 深圳狗尾草智能科技有限公司 | Method for determining robot motions and expressions, display synthesis method, and device |
CN108597541A (en) * | 2018-04-28 | 2018-09-28 | 南京师范大学 | Speech emotion recognition method and system with enhanced recognition of anger and happiness |
Non-Patent Citations (2)
Title |
---|
He Wenjing et al.: "Implementation of a Realistic Virtual Sign Language Presenter", Microcomputer Information * |
Chen Pengzhan et al.: "Bimodal Emotion Recognition Based on Speech Signals and Text Information", Journal of East China Jiaotong University * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111369687A (en) * | 2020-03-04 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Method and device for synthesizing action sequence of virtual object |
CN111369687B (en) * | 2020-03-04 | 2021-03-30 | 腾讯科技(深圳)有限公司 | Method and device for synthesizing action sequence of virtual object |
CN111369967A (en) * | 2020-03-11 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Virtual character-based voice synthesis method, device, medium and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109447234A (en) | 2019-03-08 |
CN110288077B (en) | 2022-12-16 |
CN109447234B (en) | 2022-10-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110288077A (en) | Method and related apparatus for synthesizing a talking expression based on artificial intelligence | |
CN110531860B (en) | Animation image driving method and device based on artificial intelligence | |
CN110853618B (en) | Language identification method, model training method, device and equipment | |
JP7312853B2 (en) | AI-BASED VOICE-DRIVEN ANIMATION METHOD AND APPARATUS, DEVICE AND COMPUTER PROGRAM | |
CN110838286B (en) | Model training method, language identification method, device and equipment | |
CN110490213B (en) | Image recognition method, device and storage medium | |
CN110853617B (en) | Model training method, language identification method, device and equipment | |
CN110381389A (en) | Caption generation method and device based on artificial intelligence | |
CN107294837A (en) | Method and system for dialogue interaction using a virtual robot | |
WO2016004266A2 (en) | Generating computer responses to social conversational inputs | |
CN110808034A (en) | Voice conversion method, device, storage medium and electronic equipment | |
CN112040263A (en) | Video processing method, video playing method, video processing device, video playing device, storage medium and equipment | |
CN111414506B (en) | Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium | |
CN113421547B (en) | Voice processing method and related equipment | |
CN110322760A (en) | Voice data generation method, device, terminal and storage medium | |
CN109801618A (en) | Method and device for generating audio information | |
CN110162600A (en) | Information processing method, and conversational response method and device | |
CN114882862A (en) | Voice processing method and related equipment | |
CN114360510A (en) | Voice recognition method and related device | |
CN115866327A (en) | Background music adding method and related device | |
JPWO2019044534A1 (en) | Information processing device and information processing method | |
CN117219043A (en) | Model training method, model application method and related device | |
CN116959407A (en) | Pronunciation prediction method and device and related products | |
CN116978359A (en) | Phoneme recognition method, device, electronic equipment and storage medium | |
CN117991908A (en) | Method, device, equipment and storage medium for interacting with virtual image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |