CN110288077A - Artificial-intelligence-based method and related apparatus for synthesizing a speaking expression - Google Patents
Artificial-intelligence-based method and related apparatus for synthesizing a speaking expression Download PDF Info
- Publication number
- CN110288077A (application No. CN201910745062.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- duration
- expression
- text feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
Embodiments of the present application disclose an artificial-intelligence-based method and related apparatus for synthesizing a speaking expression, involving several artificial-intelligence technologies. For text content sent by a terminal, the method determines the text feature corresponding to the text content and the durations of the pronunciation elements identified by that text feature; obtains, through an expression model, the target expressive feature corresponding to the text feature and the durations of the identified pronunciation elements; and returns the target expressive feature to the terminal. Because the expression model can determine different sub-expressive features for the same pronunciation element in the text feature when that element has different durations, the variation patterns of the speaking expression are enriched, and a speaking expression generated from the target expressive feature matches the expressions of a real speaker. Since the same pronunciation element can yield speaking expressions with different variation patterns, unnatural transitions in the speaking expression are reduced and the user's sense of immersion is improved.
Description
This application is a divisional application of Chinese patent application No. 201811354206.9, filed on November 14, 2018 and entitled "Model training method for synthesizing a speaking expression, method for synthesizing a speaking expression, and related apparatus".
Technical field
This application relates to the field of data processing, and in particular to an artificial-intelligence-based method and related apparatus for synthesizing a speaking expression.
Background technique
With the development of computer technology, human-computer interaction has become common, but it is mostly simple voice interaction: for example, an interactive device determines reply content from text or speech input by the user and plays a virtual voice synthesized from that reply content.
The sense of immersion offered by such interaction can hardly satisfy current users' interaction demands. To improve immersion, virtual objects with changing mouth shapes and expressions have emerged as interactive partners. Such a virtual object may be a cartoon, a digital human, or another virtual figure; when interacting with a user, it not only plays the interactive virtual voice but also shows expressions matching that voice, giving the user the impression that the virtual object itself is uttering the virtual speech.
At present, the expression such a virtual object makes is mainly determined by the pronunciation element currently being played. As a result, when a virtual voice is played, the virtual object's expression variation patterns are limited and its expression changes are abrupt and unnatural; the user experience is actually poor, and the sense of immersion is hardly improved.
Summary of the invention
To solve the above technical problem, the present application provides a model training method for synthesizing a speaking expression, a method for synthesizing a speaking expression, and related apparatus, which enrich the variation patterns of the speaking expression. A speaking expression is generated from a target expressive feature determined by an expression model; since the same pronunciation element can yield speaking expressions with different variation patterns, unnatural transitions in the speaking expression are reduced to some extent.
The embodiments of the present application disclose the following technical solutions:
In a first aspect, an embodiment of the present application provides a model training method for synthesizing a speaking expression, including:
obtaining a video containing a speaker's facial-action expressions and the corresponding speech;
obtaining, from the video, the speaker's expressive feature, the acoustic feature of the speech, and the text feature of the speech, the acoustic feature including multiple sub-acoustic features;
determining, from the text feature and the acoustic feature, the time interval and duration of each pronunciation element identified by the text feature; any pronunciation element identified by the text feature is a target pronunciation element, the time interval of the target pronunciation element is the time interval in the video of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature, and the duration of the target pronunciation element is the duration of that sub-acoustic feature;
determining a first correspondence from the time intervals and durations of the pronunciation elements identified by the text feature and from the expressive feature, the first correspondence embodying, for each pronunciation element, the relationship between the pronunciation element's duration and the sub-expressive feature of the expressive feature that corresponds to the pronunciation element's time interval; and
training an expression model with the first correspondence, the expression model being configured to determine a corresponding target expressive feature from a to-be-processed text feature and the durations of the pronunciation elements it identifies.
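Taken together, the first-aspect steps amount to aligning pronunciation elements against the shared timeline and slicing out sub-expressive features. A minimal sketch in Python (the frame rate, data layout, and function names are illustrative assumptions, not part of the claims):

```python
import numpy as np

FPS = 100  # assumed frame rate of the expressive-feature sequence

def build_first_correspondence(phone_intervals, expr_frames):
    """For each pronunciation element, pair (element, duration) with the
    sub-expressive features lying inside the element's time interval.

    phone_intervals: list of (element, start_s, end_s) obtained by aligning
                     the text feature against the acoustic feature
    expr_frames:     (T, D) array of per-frame expressive features
    """
    samples = []
    for element, start, end in phone_intervals:
        duration = round(end - start, 3)
        lo, hi = round(start * FPS), round(end * FPS)
        sub_expr = expr_frames[lo:hi]  # sub-expressive feature slice
        samples.append(((element, duration), sub_expr))
    return samples

# Toy 2 s clip with 1-D expressive features and two aligned elements.
frames = np.arange(200, dtype=float).reshape(200, 1)
pairs = build_first_correspondence(
    [("ni", 0.0, 0.1), ("chi", 0.1, 0.35)], frames)
key, sub = pairs[0]
print(key, sub.shape)  # ('ni', 0.1) (10, 1)
```

The resulting (element, duration) → sub-expressive-feature pairs are what the expression model would then be trained on.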
In a second aspect, an embodiment of the present application provides a model training apparatus for synthesizing a speaking expression, the apparatus including an obtaining unit, a first determining unit, a second determining unit, and a first training unit:
the obtaining unit is configured to obtain a video containing a speaker's facial-action expressions and the corresponding speech;
the obtaining unit is further configured to obtain, from the video, the speaker's expressive feature, the acoustic feature of the speech, and the text feature of the speech, the acoustic feature including multiple sub-acoustic features;
the first determining unit is configured to determine, from the text feature and the acoustic feature, the time interval and duration of each pronunciation element identified by the text feature; any pronunciation element identified by the text feature is a target pronunciation element, the time interval of the target pronunciation element is the time interval in the video of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature, and the duration of the target pronunciation element is the duration of that sub-acoustic feature;
the second determining unit is configured to determine a first correspondence from the time intervals and durations of the pronunciation elements identified by the text feature and from the expressive feature, the first correspondence embodying, for each pronunciation element, the relationship between the pronunciation element's duration and the sub-expressive feature of the expressive feature that corresponds to the pronunciation element's time interval; and
the first training unit is configured to train an expression model with the first correspondence, the expression model being configured to determine a corresponding target expressive feature from a to-be-processed text feature and the durations of the pronunciation elements it identifies.
In a third aspect, an embodiment of the present application provides a model training device for synthesizing a speaking expression, the device including a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor; and
the processor is configured to execute, according to instructions in the program code, the model training method for synthesizing a speaking expression of any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a method for synthesizing a speaking expression, the method including:
determining a text feature corresponding to text content and the durations of the pronunciation elements identified by the text feature, the text feature including multiple sub-text features; and
obtaining, through the text feature, the durations of the identified pronunciation elements, and an expression model, a target expressive feature corresponding to the text content; the target expressive feature includes multiple sub-expressive features, any pronunciation element identified by the text feature is a target pronunciation element, and in the target expressive feature, the sub-expressive feature corresponding to the target pronunciation element is determined from the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element.
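The fourth-aspect pipeline, a text feature plus per-element durations fed to an expression model that emits duration-dependent sub-expressive features, can be sketched as follows. Both the duration predictor and the toy model here are stand-ins; in the patent the expression model is trained as described in the first aspect:

```python
import numpy as np

def predict_durations(elements):
    """Stand-in duration predictor; in the patent the durations come
    from a trained duration/acoustic model."""
    return [0.1 + 0.05 * (len(e) - 1) for e in elements]

def expression_model(element, duration, dim=3, fps=100):
    """Toy expression model: the same pronunciation element with a
    different duration yields a different sub-expressive feature
    sequence -- the property emphasized in the abstract."""
    n_frames = max(1, round(duration * fps))
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    offset = (sum(map(ord, element)) % 7) / 7.0  # per-element offset (toy)
    return offset + np.sin(np.pi * t) * np.ones((1, dim))

elements = ["ni", "chi", "fan", "le", "ma"]  # identified pronunciation elements
target = [expression_model(e, d)
          for e, d in zip(elements, predict_durations(elements))]
print([seg.shape for seg in target])  # [(15, 3), (20, 3), (20, 3), (15, 3), (15, 3)]
```

Concatenating the per-element segments gives the target expressive feature for the whole text content.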
In a fifth aspect, an embodiment of the present application provides an apparatus for synthesizing a speaking expression, the apparatus including a determining unit and a first obtaining unit:
the determining unit is configured to determine a text feature corresponding to text content and the durations of the pronunciation elements identified by the text feature, the text feature including multiple sub-text features; and
the first obtaining unit is configured to obtain, through the text feature, the durations of the identified pronunciation elements, and an expression model, a target expressive feature corresponding to the text content; the target expressive feature includes multiple sub-expressive features, any pronunciation element identified by the text feature is a target pronunciation element, and in the target expressive feature, the sub-expressive feature corresponding to the target pronunciation element is determined from the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element.
In a sixth aspect, an embodiment of the present application provides a device for synthesizing a speaking expression, the device including a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor; and
the processor is configured to execute, according to instructions in the program code, the method for synthesizing a speaking expression of any implementation of the fourth aspect.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium configured to store program code, the program code being used to execute the model training method for synthesizing a speaking expression of any implementation of the first aspect, or the method for synthesizing a speaking expression of any implementation of the fourth aspect.
As can be seen from the above technical solutions, to determine varied and natural speaking expressions for a virtual object, the embodiments of the present application provide a completely new way of training an expression model. From a video containing a speaker's facial-action expressions and the corresponding speech, the speaker's expressive feature, the acoustic feature of the speech, and the text feature of the speech are obtained. Since the acoustic feature and the text feature are obtained from the same video, the time interval and duration of each pronunciation element identified by the text feature can be determined from the acoustic feature. A first correspondence is then determined from the time intervals and durations of the identified pronunciation elements together with the expressive feature; the first correspondence embodies, for each pronunciation element, the relationship between the pronunciation element's duration and the sub-expressive feature of the expressive feature that corresponds to the pronunciation element's time interval.
For any identified pronunciation element, taken as a target pronunciation element, the sub-expressive feature within its time interval can be located in the expressive feature through that time interval, and the duration of the target pronunciation element reflects the various lengths the element takes in the various sentences expressed in the video's speech. The sub-expressive features so determined can therefore reflect the possible expressions with which the speaker utters the target pronunciation element in different sentences. Consequently, for a text feature whose expressive feature is to be determined, the expression model trained on the first correspondence can determine different sub-expressive features for the same pronunciation element in that text feature when the element has different durations. This enriches the variation patterns of the speaking expression; since a speaking expression generated from the target expressive feature determined by the expression model can vary for the same pronunciation element, unnatural transitions in the speaking expression are reduced to some extent.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art may still derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario of an expression model training method provided by an embodiment of the present application;
Fig. 2 is a flowchart of a model training method for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 3 is a flowchart of an acoustic model training method provided by an embodiment of the present application;
Fig. 4 is a schematic diagram of an application scenario of a method for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 5 is a flowchart of a method for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 6 is an architecture diagram of generating visualized synthesized speech in human-computer interaction provided by an embodiment of the present application;
Fig. 7a is a structural diagram of a model training apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 7b is a structural diagram of a model training apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 7c is a structural diagram of a model training apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 8a is a structural diagram of an apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 8b is a structural diagram of an apparatus for synthesizing a speaking expression provided by an embodiment of the present application;
Fig. 9 is a structural diagram of a server provided by an embodiment of the present application;
Fig. 10 is a structural diagram of a terminal device provided by an embodiment of the present application.
Specific embodiment
To help a person skilled in the art better understand the solutions of the present application, the following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
At present, when a virtual object interacts with a user, the speaking expression it makes is mainly determined by the pronunciation element currently being played. For example, a correspondence between pronunciation elements and expressions is established, normally with one speaking expression per pronunciation element, and when a pronunciation element is played the virtual object makes the speaking expression corresponding to it. As a result, when a virtual voice is played, the virtual object's speaking expression can only be the one corresponding to the currently played pronunciation element, so its expression variation patterns are limited; and because each pronunciation element corresponds to only one speaking expression, the expression transitions are abrupt and unnatural. The user experience is actually poor, and the sense of immersion is hardly improved.
To solve the above technical problem, an embodiment of the present application provides a completely new way of training an expression model: during training, the text feature, the durations of the pronunciation elements identified by the text feature, and the sub-expressive features corresponding to the pronunciation elements' time intervals in the expressive feature are used as training samples, so that the expression model is trained on the correspondence between each pronunciation element's duration and the sub-expressive feature corresponding to that element's time interval in the expressive feature.
Both the method for synthesizing a speaking expression and the corresponding model training method for synthesizing a speaking expression provided by the embodiments of the present application can be implemented based on artificial intelligence. Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, mechatronics, and similar technologies. Artificial intelligence software technologies mainly include several major directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
In the embodiments of the present application, the artificial intelligence software technologies mainly involved include the above computer vision, speech processing, natural language processing, and deep learning directions.
For example, image processing (Image Processing), image semantic understanding (Image Semantic Understanding, ISU), video processing (Video Processing), video semantic understanding (Video Semantic Understanding, VSU), 3D object reconstruction (3D Object Reconstruction), and face recognition (Face Recognition) in computer vision (Computer Vision) may be involved.
For example, speech recognition in speech technology (Speech Technology) may be involved, including speech signal preprocessing (Speech Signal Preprocessing), speech signal frequency-domain analysis (Speech Signal Frequency Analyzing), speech signal feature extraction (Speech Signal Feature Extraction), speech signal feature matching/recognition (Speech Signal Feature Matching/Recognition), and speech training (Speech Training).
For example, text preprocessing (Text Preprocessing) and semantic understanding (Semantic Understanding) in natural language processing (Natural Language Processing, NLP) may be involved, including word/sentence segmentation (Word/Sentence Segmentation), part-of-speech tagging (Word Tagging), and sentence classification (Word/Sentence Classification).
For example, deep learning (Deep Learning) in machine learning (Machine Learning, ML) may be involved, including various artificial neural networks (Artificial Neural Networks).
For ease of understanding the technical solutions of the present application, the expression model training method provided by the embodiments of the present application is introduced below with reference to practical application scenarios.
The model training method provided by the present application can be applied to a data processing device capable of processing a video of a speaker uttering speech, such as a terminal device or a server. The terminal device may specifically be a smartphone, a computer, a personal digital assistant (PDA), a tablet computer, or the like; the server may specifically be a standalone server or a cluster of servers.
The data processing device may be capable of implementing computer vision technology. Computer vision is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to establish artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and further include common biometric recognition technologies such as face recognition and fingerprint recognition.
In the embodiments of the present application, the data processing device can obtain information such as the speaker's expressive feature and the corresponding durations from the video through computer vision technology.
The data processing device may be capable of implementing automatic speech recognition (ASR), voiceprint recognition, and other speech technologies. Enabling a computer to listen, see, speak, and feel is the development direction of future human-computer interaction, in which speech is expected to become one of the most promising modes of interaction.
In the embodiments of the present application, by implementing the above speech technologies, the data processing device can perform speech recognition on the obtained video, thereby obtaining information such as the speaker's acoustic feature, the corresponding pronunciation elements, and the corresponding durations from the video.
The data processing device may also be capable of natural language processing, an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies generally include text processing, semantic understanding, and other technologies.
In the embodiments of the present application, by implementing the above NLP technologies, the data processing device can determine the text feature of the speech from the video.
The data processing device may have machine learning (Machine Learning, ML) capability. ML is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; its applications pervade all fields of artificial intelligence. Machine learning and deep learning generally include technologies such as artificial neural networks.
The embodiments of the present application mainly involve applying various artificial neural networks to the model training method for synthesizing a speaking expression, for example training the expression model with the first correspondence.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the expression model training method provided by an embodiment of the present application. The application scenario includes a server 101, which can obtain one or more videos containing a speaker's facial-action expressions and the corresponding speech. The language of the characters included in the speech in a video may be any of various languages such as Chinese, English, or Korean.
From the obtained video, the server 101 can obtain the speaker's expressive feature, the acoustic feature of the speech, and the text feature of the speech. The speaker's expressive feature can represent the facial-action expressions made when the speaker utters the speech in the video, and may include, for example, mouth-shape features and eye movements; through the speaker's expressive feature, a viewer of the video can perceive that the speech in the video is being uttered by that speaker. The acoustic feature of the speech may include the sound wave of the speech. The text feature of the speech identifies the pronunciation elements corresponding to the text content; it should be noted that a pronunciation element in the embodiments of the present application may be the pronunciation corresponding to a character included in the speech uttered by the speaker.
It should be noted that, in this embodiment, the expressive feature, the acoustic feature, and the text feature may all be represented in the form of feature vectors.
Since the acoustic feature and the text feature are obtained from the same video, the server 101 can determine, from the text feature and the acoustic feature, the time interval and duration of each pronunciation element identified by the text feature. The time interval of a pronunciation element is the interval, in the video, between the start time and the end time of the sub-acoustic feature corresponding to that pronunciation element; the duration is the length of that sub-acoustic feature, for example the difference between the end time and the start time. A sub-acoustic feature is the part of the acoustic feature corresponding to one pronunciation element, and the acoustic feature may include multiple sub-acoustic features.
Then, the server 101 determines a first correspondence from the time intervals and durations of the pronunciation elements identified by the text feature together with the expressive feature; the first correspondence embodies, for each pronunciation element, the relationship between the pronunciation element's duration and the sub-expressive feature of the expressive feature that corresponds to the pronunciation element's time interval. A sub-expressive feature is the part of the expressive feature corresponding to one pronunciation element, and the expressive feature may include multiple sub-expressive features.
For any identified pronunciation element, such as a target pronunciation element, the time interval of the target pronunciation element is the time interval in the video of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic feature. The acoustic feature, the text feature, and the expressive feature are all obtained from the same video and share the same timeline, so the sub-expressive feature within the target pronunciation element's time interval can be located in the expressive feature through that time interval. Moreover, the duration of the target pronunciation element is the duration of its corresponding sub-acoustic feature, which reflects the various lengths the target pronunciation element takes in the various sentences expressed in the video's speech; the sub-expressive features so determined can therefore reflect the possible speaking expressions with which the speaker utters the target pronunciation element in different sentences.
Take as an example a speech in which the speaker says "Have you eaten?", in a video of duration 2 s containing the speech. The text features identify the pronunciation elements of these characters, namely "ni ' chi ' fan ' le ' ma"; the expressive features represent the speaker's speaking expression while uttering this speech; and the acoustic features are the sound waves emitted while uttering it. The target pronunciation element is any pronunciation element in "ni ' chi ' fan ' le ' ma". If the target pronunciation element is "ni", the time interval of "ni" is the interval between 0 s and 0.1 s and the duration of "ni" is 0.1 s; the sub-expressive feature corresponding to "ni" is the part of the expressive features corresponding to the speech uttered between 0 s and 0.1 s in the video, for example sub-expressive feature A. When determining the first correspondence, the server 101 can locate the corresponding sub-expressive feature A through the time interval from 0 s to 0.1 s of the pronunciation element "ni" identified by the text features. In this way, the correspondence between the duration 0.1 s of the pronunciation element "ni" and the sub-expressive feature A corresponding to its time interval from 0 s to 0.1 s can be determined, and the first correspondence includes this correspondence.
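The slicing described above can be sketched as follows. This is an illustrative sketch, not part of the patent: the frame rate, the feature layout and the function names are assumptions, and a real implementation would obtain the alignment from the acoustic features (for example by forced alignment) rather than take it as given.

```python
# Illustrative sketch: building the first correspondence from an alignment of
# pronunciation elements and frame-level expressive features that share the
# same timeline. FPS and the feature layout are assumptions.
FPS = 10  # assumed expression frame rate: one feature vector per 0.1 s

def build_first_correspondence(alignment, expr_frames, fps=FPS):
    """alignment: (element, start_s, end_s) triples from the same video.
    expr_frames: per-frame expressive feature vectors on the same timeline.
    Returns (element, duration, sub_expressive_feature) entries."""
    correspondence = []
    for element, start, end in alignment:
        duration = round(end - start, 3)               # duration = end - start
        lo, hi = round(start * fps), round(end * fps)  # frames in the interval
        correspondence.append((element, duration, expr_frames[lo:hi]))
    return correspondence

# "ni ' chi ' fan ' le ' ma" in a 2 s video; "ni" occupies 0 s to 0.1 s
alignment = [("ni", 0.0, 0.1), ("chi", 0.1, 0.3), ("fan", 0.3, 0.5),
             ("le", 0.5, 0.6), ("ma", 0.6, 0.8)]
expr_frames = [[0.1 * i] for i in range(20)]  # dummy 2 s of feature vectors
pairs = build_first_correspondence(alignment, expr_frames)
# pairs[0] is the entry for "ni": its duration 0.1 s plus its feature slice
```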
The server 101 trains an expression model according to the first correspondence. The expression model is used to determine the corresponding target expressive features according to a to-be-determined text feature and the durations of the pronunciation elements identified by that text feature.
It can be understood that the present embodiment is concerned with the pronunciation elements rather than with which specific characters the pronunciation elements correspond to. In the speech uttered by the speaker, one sentence may include different characters, and different characters may correspond to the same pronunciation element. In that case, the same pronunciation element occurs in different time intervals, may have different durations, and hence corresponds to different sub-expressive features.
For example, the characters included in the speech uttered by the speaker are "tell you a secret", and the pronunciation element identified by the text features for both characters of the word "秘密" ("secret") is "mi". The time interval of the pronunciation element "mi" identified for the character "秘" is the interval between 0.4 s and 0.6 s, with a duration of 0.2 s; the time interval of the pronunciation element "mi" identified for the character "密" is the interval between 0.6 s and 0.7 s, with a duration of 0.1 s. It can be seen that the text features corresponding to the different characters "秘" and "密" identify the same pronunciation element "mi" but with different durations; therefore, the pronunciation element "mi" corresponds to different sub-expressive features.
In addition, depending on the speaker's manner of expression when speaking, different sentences in the uttered speech may include identical characters, and the pronunciation elements identified by the text features for those identical characters may have different durations; in this way, the same pronunciation element corresponds to different sub-expressive features.
For example, in a speech in which the speaker says "hello" ("你好"), the duration of the pronunciation element "ni" identified by the text features for the character "你" is 0.1 s; however, in another speech in which the speaker says "you, me, him" ("你我他"), the duration of the pronunciation element "ni" identified for the character "你" may be 0.3 s. In this case, the pronunciation elements identified for identical characters have different durations, so the same pronunciation element corresponds to different sub-expressive features.
Since one pronunciation element corresponds to different durations, and the sub-expressive features corresponding to the pronunciation element differ across durations, the first correspondence can embody the correspondence between the various durations a pronunciation element takes and the sub-expressive features. In this way, when sub-expressive features are determined using the expression model trained according to the first correspondence, for the text features whose expressive features are to be determined, the model can determine different sub-expressive features for the same pronunciation element with different durations, which increases the variety of the speaking expression. In addition, since the speaking expression generated from the target expressive features determined by the expression model has different variation patterns for the same pronunciation element, unnatural transitions in the speaking expression are improved to a certain extent.
It can be understood that, to solve the technical problems in the traditional approach, increase the variety of the speaking expression and improve unnatural transitions in the speaking expression, the embodiments of the present application provide a new expression model training method, and generate the speaking expression corresponding to text content using the expression model. Next, the model training method for synthesizing a speaking expression and the method for synthesizing a speaking expression provided by the embodiments of the present application are introduced with reference to the accompanying drawings.
First, the model training method for synthesizing a speaking expression is introduced. Referring to Fig. 2, Fig. 2 shows a flowchart of a model training method for synthesizing a speaking expression; the method comprises:
S201: obtaining a video including a speaker's facial action expressions and the corresponding speech.
The video containing the facial action expressions and the corresponding speech may be obtained by recording, in an environment equipped with a camera, the speech uttered by the speaker while simultaneously recording the speaker's facial action expressions through the camera.
S202: acquiring, from the video, the expressive features of the speaker, the acoustic features of the speech, and the text features of the speech.
The expressive features may be obtained by performing feature extraction on the facial action expressions in the video; the acoustic features may be obtained by performing feature extraction on the speech uttered by the speaker in the video; and the text features may be obtained by performing feature extraction on the text corresponding to the speech uttered by the speaker in the video. The expressive features, the acoustic features and the text features are all obtained from the same video and share the same timeline.
The expressive features can identify the characteristics of the facial action expressions while the speaker utters the speech; through the expressive features, a video viewer can see which pronunciation element the speaker is uttering. In one implementation, the expressive features include at least mouth-shape features, which directly embody the characteristics of the facial action expressions during speech, thereby ensuring that a video viewer can perceive, through the speaker's mouth shape, that the speech in the video is indeed uttered by that speaker.
S203: determining, according to the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features.
In the present embodiment, a pronunciation element may, for example, be the syllable corresponding to a character included in the speech uttered by the speaker. A character may be the basic semantic unit in different languages: in Chinese, a character may be a Chinese character and the corresponding pronunciation element may be a pinyin syllable; in English, a character may be a word and the corresponding pronunciation element may be the corresponding phonetic symbol or phonetic symbol combination. For example, when the character is the Chinese character "你", the corresponding pronunciation element may be the pinyin syllable "ni"; when the character is the English word "ball", the corresponding pronunciation element may be the corresponding English syllable. Of course, a pronunciation element may also be the smallest pronunciation unit included in the pinyin syllable corresponding to a character; for example, if the character included in the speech uttered by the speaker is "你", the pronunciation elements may include the two elements "n" and "i".
In some cases, pronunciation elements may also need to be distinguished by tone; therefore, a pronunciation element may further include the tone. For example, in Chinese, the characters included in the speech uttered by the speaker are "you are Nini" ("你是妮妮"), where the pinyin syllable of both "你" and "妮" is "ni", but the tone of "你" is the third tone and the tone of "妮" is the first tone. Therefore, the pronunciation element corresponding to "你" includes "ni" and the third tone, the pronunciation element corresponding to "妮" includes "ni" and the first tone, and the pronunciation elements corresponding to "你" and "妮" are distinguished according to the difference in tone. In use, suitable pronunciation elements can be determined according to demand.
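The granularity choices above can be illustrated with a small sketch. This is illustrative only; the concrete representation of a pronunciation element is an assumption, not something fixed by the present embodiment.

```python
# Illustrative only: three assumed granularities for a pronunciation element
# (whole syllable, smallest pronunciation units, syllable plus tone).
def to_elements(pinyin, tone=None, split_units=False):
    if split_units:              # smallest units, e.g. "ni" -> ["n", "i"]
        return list(pinyin)
    if tone is not None:         # tone-aware element, e.g. ("ni", 3)
        return [(pinyin, tone)]
    return [pinyin]              # plain syllable element

# "你" (ni, third tone) and "妮" (ni, first tone) collide as plain syllables
# but are distinct once the tone is part of the element:
assert to_elements("ni") == to_elements("ni")
assert to_elements("ni", tone=3) != to_elements("ni", tone=1)
assert to_elements("ni", split_units=True) == ["n", "i"]
```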
It should be noted that, besides the above possibilities, the language of the characters and of the corresponding pronunciation elements may also be any other language; no restriction is placed here on the language category of the characters. For ease of description, the embodiments of the present application will mainly be explained with the characters being Chinese characters and the corresponding pronunciation elements being pinyin syllables.
S204: determining a first correspondence according to the time intervals and durations of the pronunciation elements identified by the text features, and the expressive features.
Since the acoustic features and the text features are obtained from the same video, and the acoustic features themselves include time information, the time intervals and durations of the pronunciation elements identified by the text features can be determined according to the acoustic features. The time interval of a target pronunciation element is the time interval, in the video, of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic features; the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element; and the target pronunciation element is any pronunciation element identified by the text features.
The first correspondence is used to embody the correspondence between the duration of a pronunciation element and the sub-expressive feature that corresponds, in the expressive features, to the time interval of that pronunciation element.
When determining the first correspondence, the sub-expressive feature corresponding to a pronunciation element within its time interval can be determined from the expressive features through the time interval of the pronunciation element, thereby determining the correspondence between the duration of the pronunciation element and the sub-expressive feature corresponding, in the expressive features, to its time interval.
It can be understood that the durations corresponding to the same pronunciation element may differ, but may also be identical. Not only can the sub-expressive features corresponding to the same pronunciation element differ when its durations differ; even when the durations corresponding to the same pronunciation element are identical, for reasons such as differences in the speaking tone or communication habits, the same pronunciation element with the same duration may also have different sub-expressive features.
For example, the speaker says the speech "Nini" ("妮妮") in an excited tone, and says "Nini" in an angry tone. Even if the durations of the pronunciation elements identified by the text features are identical, the speaker's speaking tone may cause the sub-expressive features corresponding to the same pronunciation element to differ.
S205: training an expression model according to the first correspondence.
The trained expression model can determine, from a to-be-determined text feature, the target expressive features corresponding to the pronunciation elements it identifies, where the text content corresponding to the to-be-determined text feature is text content for which a speaking expression needs to be synthesized, or for which a virtual voice further needs to be generated. The to-be-determined text feature and the durations of the pronunciation elements it identifies serve as the input of the expression model, and the target expressive features serve as the output of the expression model.
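The input/output contract just described can be sketched as below. The present embodiment does not fix a model architecture, so this stand-in replaces training with a nearest-duration lookup over the first-correspondence samples; it only illustrates that the same pronunciation element with different durations yields different target expressive features.

```python
# Stand-in expression model (architecture not fixed by the embodiment):
# input is (pronunciation element, duration); output is a sub-expressive
# feature, chosen by nearest training duration for that element.
class ExpressionModel:
    def __init__(self):
        self.samples = []   # (element, duration, sub_expressive_feature)

    def train(self, first_correspondence):
        self.samples = list(first_correspondence)

    def predict(self, element, duration):
        candidates = [s for s in self.samples if s[0] == element]
        best = min(candidates, key=lambda s: abs(s[1] - duration))
        return best[2]

model = ExpressionModel()
model.train([("ni", 0.1, "sub-expressive feature A"),
             ("ni", 0.2, "sub-expressive feature B")])
# the same element "ni" yields different features for different durations
assert model.predict("ni", 0.1) == "sub-expressive feature A"
assert model.predict("ni", 0.2) == "sub-expressive feature B"
```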
The training data for training the expression model is the first correspondence. In the first correspondence, the same pronunciation element with the same duration or with different durations may correspond to different sub-expressive features. In this way, when the trained expression model is subsequently used to determine target expressive features, after the to-be-determined text feature and the durations of the pronunciation elements it identifies are input into the expression model, a situation similar to the training data is obtained: the target expressive features obtained after inputting the same pronunciation element with different durations into the expression model may differ, and even inputting the same pronunciation element with the same duration into the expression model may also yield different target expressive features.
It should be noted that, in the present embodiment, the same pronunciation element may have different durations; the same pronunciation element with different durations has different expressive features, and even with the same duration it may have different expressive features. Through the contextual information corresponding to a pronunciation element, it is possible to accurately determine which duration corresponds to the pronunciation element, and which expression corresponds to that duration of the pronunciation element.
During normal speech, the characteristics with which the same pronunciation element is expressed may differ under different contexts; for example, the durations of the pronunciation element differ, and then the same pronunciation element may have different sub-expressive features. That is to say, which duration a pronunciation element corresponds to, and which expressive features the pronunciation element with that duration corresponds to, are related to the context of the pronunciation element. Therefore, in one implementation, the text features used in training the expression model can also be used to identify the pronunciation elements in the speech and the contextual information corresponding to the pronunciation elements. In this way, when the trained expression model determines expressive features, the contextual information makes it possible to accurately determine the duration of a pronunciation element and which sub-expressive feature it corresponds to.
For example, the speech uttered by the speaker is "you are Nini" ("你是妮妮"), and the pronunciation elements identified by the corresponding text features include "ni ' shi ' ni ' ni", where the pronunciation element "ni" occurs three times. The duration of the first pronunciation element "ni" is 0.1 s, corresponding to sub-expressive feature A; the duration of the second pronunciation element "ni" is 0.2 s, corresponding to sub-expressive feature B; the duration of the third pronunciation element "ni" is 0.1 s, corresponding to sub-expressive feature C. Suppose the text features can also identify the pronunciation elements in the speech and their corresponding contextual information, where the contextual information of the first pronunciation element "ni" is contextual information 1, that of the second is contextual information 2, and that of the third is contextual information 3. Then, when expressive features are determined using the trained expression model, contextual information 1 can accurately determine that the duration of the first pronunciation element "ni" is 0.1 s and that its corresponding sub-expressive feature is sub-expressive feature A, and so on.
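The disambiguating role of the contextual information in this example can be sketched as a lookup key. All names here, such as "context 1", are illustrative stand-ins for whatever context encoding the text features carry.

```python
# Illustrative sketch: contextual information as a disambiguating key for the
# repeated occurrences of the same pronunciation element in "ni shi ni ni".
first_correspondence = {
    ("ni", "context 1"): (0.1, "sub-expressive feature A"),  # first "ni"
    ("ni", "context 2"): (0.2, "sub-expressive feature B"),  # second "ni"
    ("ni", "context 3"): (0.1, "sub-expressive feature C"),  # third "ni"
}

def lookup(element, context):
    return first_correspondence[(element, context)]

# the same element, even with the same duration, resolves unambiguously:
assert lookup("ni", "context 1") == (0.1, "sub-expressive feature A")
assert lookup("ni", "context 3") == (0.1, "sub-expressive feature C")
```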
Since the contextual information can embody a person's manner of expression during normal speech, accurately determining the duration of a pronunciation element and its corresponding sub-expressive feature through the contextual information enables, when the expression model trained according to the first correspondence is used to determine the target expressive features with which a virtual object utters pronunciation elements, the manner of expression of the virtual object to better fit human expression. In addition, given the preceding pronunciation elements, the contextual information can inform what kind of sub-expressive feature corresponds to the pronunciation element the speaker is currently uttering, so that the sub-expressive feature corresponding to the current pronunciation element is linked with the sub-expressive features corresponding to the preceding pronunciation elements, improving the smoothness of the transitions in the speaking expression generated later.
As can be seen from the above technical solution, in order to determine for a virtual object a speaking expression that varies richly and changes naturally, the embodiments of the present application provide a completely new expression model training approach: from a video containing a speaker's facial action expressions and the corresponding speech, the expressive features of the speaker, the acoustic features of the speech and the text features of the speech are obtained. Since the acoustic features and the text features are obtained from the same video, the time intervals and durations of the pronunciation elements identified by the text features can be determined according to the acoustic features.
For a target pronunciation element among the identified pronunciation elements, the sub-expressive feature within its time interval can be determined from the expressive features through the time interval of the target pronunciation element, and the duration of the target pronunciation element can reflect the various durations the target pronunciation element takes in the various utterances of the video speech; therefore, the determined sub-expressive features can reflect the possible expressions with which the speaker utters the target pronunciation element in different utterances. Accordingly, with the expression model trained according to the first correspondence, for the text features whose expressive features are to be determined, the model can determine different sub-expressive features for the same pronunciation element with different durations, increasing the variety of the speaking expression. Since the speaking expression generated from the target expressive features determined by the expression model has different variation patterns for the same pronunciation element, unnatural transitions in the speaking expression are improved to a certain extent.
It can be understood that, when a speaking expression is generated by the expression model trained by the method provided in the embodiment corresponding to Fig. 2, the variety of the speaking expression can be increased and unnatural transitions in the speaking expression improved. Moreover, during human-computer interaction, not only is the speaking expression of the virtual object shown to the user; a virtual voice for the interaction may also be played. If the virtual voice is generated in an existing manner, it may fail to match the speaking expression generated by the solution provided by the embodiments of the present application. For this case, the embodiments of the present application provide a new acoustic model training method; an acoustic model trained in this way can generate a virtual voice matching the speaking expression. Referring to Fig. 3, the method comprises:
S301: determining a second correspondence between the pronunciation elements identified by the text features and the acoustic features.
S302: training an acoustic model according to the second correspondence.
The second correspondence is used to embody the correspondence between the duration of a pronunciation element and the sub-acoustic feature corresponding to the pronunciation element in the acoustic features.
The trained acoustic model can determine, from a to-be-determined text feature, the target acoustic features corresponding to the pronunciation elements it identifies. The to-be-determined text feature and the durations of the pronunciation elements it identifies serve as the input of the acoustic model, and the target acoustic features serve as the output of the acoustic model.
When determining the second correspondence, since the pronunciation elements are exactly what the speaker utters while speaking, the acoustic features and the pronunciation elements uttered by the speaker have a correspondence. In this way, by identifying the pronunciation elements according to the text features, the sub-acoustic features corresponding to the pronunciation elements in the acoustic features can be determined, so that for any acoustic features identified by the text features, the second correspondence between the pronunciation elements identified by the text features and the acoustic features can be determined.
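Determining the second correspondence mirrors the first correspondence, but slices the acoustic features instead of the expressive features; the shared alignment is what keeps voice and expression matched. A hedged sketch (frame rate and data layout are assumptions):

```python
# Illustrative sketch: the second correspondence is built like the first but
# slices the acoustic features; using the same alignment keeps the voice and
# the speaking expression matched on the same timeline.
SR = 100  # assumed acoustic frame rate (frames per second)

def build_second_correspondence(alignment, acoustic_frames, rate=SR):
    out = []
    for element, start, end in alignment:
        lo, hi = round(start * rate), round(end * rate)
        out.append((element, round(end - start, 3), acoustic_frames[lo:hi]))
    return out

alignment = [("ni", 0.0, 0.1), ("hao", 0.1, 0.3)]
acoustic_frames = list(range(30))        # dummy 0.3 s of acoustic frames
second = build_second_correspondence(alignment, acoustic_frames)
assert second[0] == ("ni", 0.1, list(range(10)))
assert second[1][1] == 0.2
```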
It can be understood that the durations corresponding to the same pronunciation element may differ, but may also be identical. Not only can the sub-acoustic features corresponding to the same pronunciation element differ when its durations differ; even when the durations corresponding to the same pronunciation element are identical, for reasons such as differences in the speaking tone or manner of expression, the same pronunciation element with the same duration may also have different sub-acoustic features.
Since a pronunciation element itself carries time information, the sub-expressive feature corresponding to the pronunciation element within its time interval can be determined from the expressive features through the time interval of the pronunciation element, thereby determining the correspondence between the duration of the pronunciation element and the sub-expressive feature corresponding, in the expressive features, to its time interval.
The training data used in training the acoustic model and the training data used in training the expression model come from the same video and correspond to the same timeline; when the speaker utters a pronunciation element, the speaker's voice and the speaker's facial action expression match. In this way, the virtual voice generated from the target acoustic features determined by the acoustic model matches the speaking expression generated from the target expressive features determined by the expression model, providing the user with a better impression and improving the user's sense of immersion.
In addition, since the training data for training the acoustic model is the second correspondence, in which the same pronunciation element with the same duration or with different durations may correspond to different sub-acoustic features, when the trained acoustic model is subsequently used to determine target acoustic features, after the to-be-determined text feature and the durations of the pronunciation elements it identifies are input into the acoustic model, a situation similar to the training data is obtained: the target acoustic features obtained after inputting the same pronunciation element with different durations into the acoustic model may differ, and even inputting the same pronunciation element with the same duration into the acoustic model may also yield different target acoustic features.
It can be seen that, with the acoustic model trained according to the second correspondence, for the text features whose acoustic features are to be determined, the model can determine different sub-acoustic features for the same pronunciation element with different durations, increasing the variation of the virtual voice. Since the virtual voice generated from the target acoustic features determined by the acoustic model has different variation patterns for the same pronunciation element, unnatural transitions in the virtual voice are improved to a certain extent.
It can be understood that, when the expression model is used to determine the target expressive features corresponding to the durations of the pronunciation elements identified by a to-be-determined text feature, the input of the expression model is the to-be-determined text feature and the durations of the pronunciation elements it identifies, where the durations of the pronunciation elements directly determine what target expressive features are determined. That is to say, in order to determine the target expressive features corresponding to the durations of the pronunciation elements, the durations of the pronunciation elements must first be determined; they can be determined in several ways.
One way of determining the durations of the pronunciation elements is to determine them according to a duration model. To this end, the present embodiment provides a training method for a duration model: the duration model is trained according to the text features and the durations of the pronunciation elements identified by the text features.
The trained duration model can determine, for a to-be-determined text feature, the durations of the pronunciation elements it identifies. The to-be-determined text feature serves as the input of the duration model, and the durations of the pronunciation elements it identifies serve as the output of the duration model.
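The duration model's contract can be sketched as follows; since the present embodiment leaves its internals open, this stand-in simply learns the mean observed duration per pronunciation element from the training video.

```python
# Stand-in duration model (internals left open by the embodiment): learns the
# mean observed duration per pronunciation element, then predicts per element.
from collections import defaultdict

class DurationModel:
    def __init__(self):
        self.mean = {}

    def train(self, samples):          # samples: (element, duration) pairs
        grouped = defaultdict(list)
        for element, duration in samples:
            grouped[element].append(duration)
        self.mean = {e: sum(d) / len(d) for e, d in grouped.items()}

    def predict(self, elements):       # elements identified by a text feature
        return [self.mean[e] for e in elements]

dm = DurationModel()
dm.train([("ni", 0.1), ("ni", 0.3), ("hao", 0.2)])
durations = dm.predict(["ni", "hao"])
assert [round(d, 3) for d in durations] == [0.2, 0.2]
```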
The training data used in training the duration model comes from the same video as the training data used in training the expression model and the acoustic model: the text features included in the training data of the duration model, and the durations of the pronunciation elements identified by those text features, are the text features and durations used in training the expression model and the acoustic model. In this way, the durations of the pronunciation elements determined using the duration model are suitable for the expression model and the acoustic model trained in the previous embodiments; the target expressive features determined by the expression model according to the durations obtained from the duration model, and the target acoustic features determined by the acoustic model according to those durations, conform to a person's manner of expression during normal speech.
Next, the method for synthesizing a speaking expression is introduced. The method for synthesizing a speaking expression provided by the embodiments of the present application can be applied to devices providing functions related to synthesizing speaking expressions, such as terminal devices and servers. A terminal device may specifically be a smart phone, a computer, a personal digital assistant (Personal Digital Assistant, PDA), a tablet computer and the like; a server may specifically be an application server or a Web server and, in actual deployment, may be an independent server or a cluster server.
The method for determining a speaking expression in voice interaction provided by the embodiments of the present application can be applied in many application scenarios; the embodiments of the present application take two application scenarios as examples.
The first application scenario may be a game scenario, in which different users communicate through virtual objects and one user can interact with the virtual object corresponding to another user. For example, user A and user B communicate through virtual objects: user A inputs text content, user B sees the speaking expression of the virtual object corresponding to user A, and user B interacts with the virtual object corresponding to user A.
The second application scenario may be an intelligent voice assistant, such as the intelligent voice assistant Siri. When the user uses Siri, while feeding interaction information back to the user, Siri can also show the user the speaking expression of a virtual object, and the user interacts with that virtual object.
To facilitate understanding of the technical solution of the present application, the method for synthesizing a speaking expression provided by the embodiments of the present application is introduced below, taking a server as the executing subject and in conjunction with a practical application scenario.
Referring to Fig. 4, Fig. 4 is a schematic diagram of an application scenario of the method for synthesizing a speaking expression provided by the embodiments of the present application. The application scenario includes a terminal device 401 and a server 402, where the terminal device 401 is used to send the text content it obtains to the server 402, and the server 402 is used to execute the method for synthesizing a speaking expression provided by the embodiments of the present application, so as to determine the target expressive features corresponding to the text content sent by the terminal device 401.
When the server 402 needs to determine the target expressive features corresponding to the text content, the server 402 first determines the text feature corresponding to the text content and the durations of the pronunciation elements identified by the text feature; then, the server 402 inputs the text feature and the durations of the identified pronunciation elements into the expression model trained in the embodiment corresponding to Fig. 2, so as to obtain the target expressive features corresponding to the text content.
Since the expression model is trained on the first correspondence relationship, which embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to the time interval of that pronunciation element in the expression features, the same pronunciation element with the same or different durations may correspond to different sub-expression features in the first correspondence relationship. Thus, when the expression model is used to determine target expression features for a text feature whose expression features are to be determined, the model can determine different sub-expression features for the same pronunciation element with the same or different durations in that text feature, which increases the variety of speaking expressions. The speaking expression generated from the target expression features determined by the expression model therefore varies for the same pronunciation element, which alleviates, to some extent, the unnatural abruptness of speaking-expression transitions.
A method for synthesizing a speaking expression provided by the embodiments of the present application is introduced below in conjunction with the accompanying drawings. Referring to Fig. 5, Fig. 5 shows a flowchart of a method for determining a speaking expression in voice interaction. The method includes:
S501: determine the text features corresponding to text content and the durations of the pronunciation elements identified by the text features.
In this embodiment, the text content refers to the text that needs to be fed back to the user interacting with the virtual object; it may differ depending on the application scenario.
In the first application scenario mentioned above, the text content may be the text corresponding to the user's input. For example, user B sees the speaking expression of the virtual object corresponding to user A and interacts with that virtual object; in that case, the text corresponding to user A's input can serve as the text content.
In the second application scenario mentioned above, the text content may be the text corresponding to the interactive information fed back in response to the user's input. For example, after the user asks "How is the weather today?", Siri may feed back to the user interactive information that includes today's weather conditions; the text corresponding to that interactive information can then serve as the text content.
It should be noted that the user's input may be text or voice. When the input is text, the text content is either obtained by the terminal device 101 directly from the user's input or is feedback generated in response to that input; when the input is voice, the text content is obtained by the terminal device 101 recognizing the user's voice or is feedback generated in response to the recognized voice.
Feature extraction can be performed on the text content to obtain the corresponding text features, which may include multiple sub-text features; the durations of the pronunciation elements identified by the text features can then be determined from the text features.
It should be noted that, when determining the durations of the pronunciation elements identified by the text features, the durations can be obtained by passing the text features through a duration model. The duration model is trained on historical text features and the durations of the pronunciation elements identified by those historical text features. Its training method follows the introduction in the foregoing embodiments and is not repeated here.
S502: obtain the target expression features corresponding to the text content through the text features, the durations of the identified pronunciation elements, and the expression model.

That is, the text features and the durations of the identified pronunciation elements serve as the input of the expression model, so that the target expression features corresponding to the text content are obtained through the expression model.
In the target expression features, the sub-expression features corresponding to a target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element. The target pronunciation element is any pronunciation element among those identified by the text features, and the expression model is trained by the method provided in the embodiment corresponding to Fig. 2.
It should be noted that, since in some cases the text features used to train the expression model can identify the pronunciation elements in the voice and the context information corresponding to those elements, the text features used when determining target expression features with the expression model can likewise identify the pronunciation elements in the text content and their corresponding context information.
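The role the duration plays at this step can be sketched as follows. Assuming, purely for illustration, that the model consumes frame-level feature vectors built by repeating each sub-text feature for its predicted number of frames, the same pronunciation element with a different duration produces different frame-level inputs and hence different sub-expression features, which is exactly the behavior this embodiment describes. The toy "model" below is a stand-in for the trained network, and all names are assumptions:

```python
# Illustrative sketch: expand each pronunciation element to frame-level
# features using its predicted duration, then map each frame to a
# sub-expression feature.

def frame_level_features(phonemes, durations):
    """One (phoneme, frame_index, duration) tuple per output frame."""
    feats = []
    for p, d in zip(phonemes, durations):
        feats.extend((p, i, d) for i in range(d))
    return feats

def expression_model(frame_feats):
    """Toy stand-in for the trained model: mouth-openness value per frame."""
    return [round(i / d, 2) for (_p, i, d) in frame_feats]

# Same phoneme "a" with durations 2 and 4 yields two different sub-expression
# feature sequences, because the frame-level input encodes the duration.
feats = frame_level_features(["a", "a"], [2, 4])
print(expression_model(feats))  # [0.0, 0.5, 0.0, 0.25, 0.5, 0.75]
```

Note how the two occurrences of the same element trace different curves: this is the duration-dependent variation the first correspondence relationship is meant to capture.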
It can be seen from the above technical solution that, since the expression model is trained on the first correspondence relationship, which embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to the time interval of that element in the expression features, the same pronunciation element with the same or different durations may correspond to different sub-expression features in the first correspondence relationship. Thus, when the expression model is used to determine target expression features for a text feature whose expression features are to be determined, the model can determine different sub-expression features for the same pronunciation element with the same or different durations in that text feature, which increases the variety of speaking expressions. The speaking expression generated from the target expression features determined by the expression model varies for the same pronunciation element, which alleviates, to some extent, the unnatural abruptness of speaking-expression transitions.
It can be understood that synthesizing a speaking expression by the method provided in the embodiment corresponding to Fig. 5 increases the variety of the speaking expression and alleviates unnatural expression transitions. In human-computer interaction, however, the speaking expression of the virtual object is not only displayed to the user; an interactive virtual sound is also played. If the virtual sound is generated in an existing way, it may fail to match the speaking expression. For this case, the embodiments of the present application provide a method for synthesizing a virtual sound that matches the speaking expression. The method may include obtaining the target acoustic features corresponding to the text content through the text features, the durations of the identified pronunciation elements, and an acoustic model.
In the target acoustic features, the sub-acoustic features corresponding to the target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element; the acoustic model is trained by the method provided in the embodiment corresponding to Fig. 3.
Because the training data of the acoustic model used to determine the target acoustic features and the training data of the expression model used to determine the target expression features come from the same video and correspond to the same time axis, the speaker's sound and the speaker's expression match whenever the speaker utters a pronunciation element. In this way, the virtual sound generated from the target acoustic features determined by the acoustic model matches the speaking expression generated from the target expression features determined by the expression model, providing the user with a better experience and improving the user's sense of immersion.
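Why the virtual sound and the speaking expression stay matched can be illustrated with a small sketch: if both models consume the same frame-level input built from one set of predicted durations, their outputs are produced frame by frame on a shared time axis. The functions below are illustrative stand-ins under that assumption, not the patent's trained models:

```python
def expand(phonemes, durations):
    """Shared frame-level input: one (phoneme, frame_index, duration) per frame."""
    return [(p, i, d) for p, d in zip(phonemes, durations) for i in range(d)]

def expr_model(frames):
    """Toy sub-expression feature per frame (e.g. mouth openness)."""
    return [i / d for (_, i, d) in frames]

def acoust_model(frames):
    """Toy sub-acoustic feature per frame (e.g. an energy envelope)."""
    return [1.0 - i / d for (_, i, d) in frames]

frames = expand(["m", "a"], [2, 3])
expr = expr_model(frames)
sound = acoust_model(frames)
print(len(expr), len(sound))  # 5 5 -- one value per frame on the shared time axis
```

Since frame i of the expression output and frame i of the acoustic output always come from the same pronunciation element at the same moment, the rendered face and the synthesized sound cannot drift apart.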
Next, based on the model training method and the methods for synthesizing a speaking expression and a virtual sound provided by the embodiments of the present application, a method for generating visualized speech synthesis in human-computer interaction is introduced in conjunction with a specific application scenario.
The application scenario may be a game scenario in which user A and user B communicate through virtual objects: user A inputs text content, and user B sees the speaking expression of the virtual object corresponding to user A, hears the virtual sound, and interacts with that virtual object. Referring to Fig. 6, Fig. 6 shows a schematic architecture diagram of a method for generating visualized speech synthesis in human-computer interaction.
As shown in Fig. 6, the architecture diagram includes a model training part and a synthesis part. In the model training part, a video containing the speaker's facial expressions and the corresponding voice can be collected. Text analysis and prosodic analysis are performed on the text corresponding to the voice spoken by the speaker to extract the text features; acoustic feature extraction is performed on the voice to extract the acoustic features; and facial feature extraction is performed on the facial expressions shown while the voice is spoken to extract the expression features. The voice is processed by a forced alignment module, which determines, from the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features.
Then, the expression model is trained from the durations of the pronunciation elements identified by the text features, the corresponding expression features, and the text features; the acoustic model is trained from those durations, the corresponding acoustic features, and the text features; and the duration model is trained from the text features and the durations of the pronunciation elements they identify. At this point, the model training part has completed the training of the required models.
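The training-side bookkeeping just described can be sketched as follows, with illustrative data: forced alignment yields a time interval per pronunciation element, and slicing the frame sequence of expression features on those intervals yields the (pronunciation element, duration, sub-expression features) triples that constitute the first correspondence relationship used to train the expression model. All values here are toy assumptions:

```python
# Sketch: build the "first correspondence relationship" from forced-alignment
# output.  Each alignment entry maps a pronunciation element to its (start,
# end) frame interval; the expression features are one value per video frame.

def first_correspondence(alignment, expr_frames):
    """Return (element, duration, sub-expression-features) per aligned element."""
    pairs = []
    for phoneme, (start, end) in alignment:
        duration = end - start
        pairs.append((phoneme, duration, expr_frames[start:end]))
    return pairs

alignment = [("n", (0, 3)), ("i", (3, 8))]   # frame intervals from forced alignment
expr_frames = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
pairs = first_correspondence(alignment, expr_frames)
print(pairs[1])  # ('i', 5, [0.4, 0.5, 0.6, 0.7, 0.8])
```

Collecting such triples over many sentences gives the same element with varying durations mapped to varying sub-expression features, which is precisely what the expression model learns from.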
The synthesis part is entered next, in which the trained expression model, acoustic model, and duration model are used to complete visualized speech synthesis. Specifically, text analysis and prosodic analysis are performed on the text content of the visualized speech to be synthesized to obtain the corresponding text features, and the text features are input into the duration model for duration prediction to obtain the durations of the pronunciation elements they identify. The frame-level feature vectors generated from the text features together with those durations are input into the expression model for expression prediction to obtain the target expression features corresponding to the text content, and into the acoustic model for acoustic prediction to obtain the target acoustic features corresponding to the text content. Finally, the obtained target expression features and target acoustic features are rendered into an animation, thereby obtaining the visualized speech.
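Putting the synthesis part of Fig. 6 together, a minimal end-to-end sketch might look like the following, with toy stand-ins for the trained duration, expression, and acoustic models; every component and name here is an illustrative assumption, not the patent's implementation:

```python
# End-to-end sketch of the synthesis pipeline: duration prediction ->
# frame-level features -> frame-aligned (expression, acoustic) outputs.

def synthesize(phonemes, duration_model):
    durations = [duration_model.get(p, 10) for p in phonemes]   # duration model
    frames = [(p, i, d) for p, d in zip(phonemes, durations)
              for i in range(d)]                                # frame-level features
    expr = [i / d for (_, i, d) in frames]                      # expression model
    acoustic = [1.0 - i / d for (_, i, d) in frames]            # acoustic model
    return list(zip(expr, acoustic))                            # pairs to "render"

out = synthesize(["n", "i"], {"n": 2, "i": 3})
print(len(out))  # 5 frame-aligned (expression, acoustic) pairs
```

A real system would render each pair into a video frame of the animated face plus a slice of waveform; the point of the sketch is only that one duration prediction drives both outputs, keeping them on the same time axis.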
On the one hand, the visualized speech obtained through the above scheme increases the variety of the speaking expression and the virtual speech and alleviates, to some extent, the unnatural abruptness of speaking-expression transitions. On the other hand, since the training data used to train the acoustic model and the training data used to train the expression model come from the same video and correspond to the same time axis, the speaker's sound and the speaker's expression match whenever the speaker utters a pronunciation element. In this way, the virtual sound generated from the target acoustic features determined by the acoustic model matches the speaking expression generated from the target expression features determined by the expression model, so the synthesized visualized speech provides the user with a better experience and improves the user's sense of immersion.
Based on the model training method for synthesizing a speaking expression and the method for synthesizing a speaking expression provided by the foregoing embodiments, the relevant apparatuses provided by the embodiments of the present application are introduced below. This embodiment provides a model training apparatus 700 for synthesizing a speaking expression. Referring to Fig. 7a, the apparatus 700 includes an acquiring unit 701, a first determination unit 702, a second determination unit 703, and a first training unit 704:
The acquiring unit 701 is configured to obtain a video that includes the speaker's facial expressions and the corresponding voice, and to obtain, from the video, the expression features of the speaker, the acoustic features of the voice, and the text features of the voice; the acoustic features include multiple sub-acoustic features.
The first determination unit 702 is configured to determine, from the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features. Any pronunciation element identified by the text features is a target pronunciation element; the time interval of the target pronunciation element is the time interval, in the video, of the sub-acoustic features corresponding to that element in the acoustic features, and the duration of the target pronunciation element is the duration of those corresponding sub-acoustic features.
The second determination unit 703 is configured to determine the first correspondence relationship from the time intervals and durations of the pronunciation elements identified by the text features and from the expression features; the first correspondence relationship embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to that element's time interval in the expression features.
The first training unit 704 is configured to train the expression model according to the first correspondence relationship; the expression model determines the corresponding target expression features from to-be-determined text features and the durations of the pronunciation elements those text features identify.
In one implementation, referring to Fig. 7b, the apparatus 700 further includes a third determination unit 705 and a second training unit 706:
The third determination unit 705 is configured to determine a second correspondence relationship between the pronunciation elements identified by the text features and the acoustic features; the second correspondence relationship embodies the correspondence between the duration of a pronunciation element and the sub-acoustic features corresponding to that element in the acoustic features.
The second training unit 706 is configured to train the acoustic model according to the second correspondence relationship; the acoustic model determines the corresponding target acoustic features from to-be-determined text features and the durations of the pronunciation elements those text features identify.
In one implementation, referring to Fig. 7c, the apparatus 700 further includes a third training unit 707: the third training unit 707 is configured to train the duration model from the text features and the durations of the pronunciation elements identified by the text features; the duration model determines, from to-be-determined text features, the durations of the pronunciation elements those text features identify.
In one implementation, the text features identify the pronunciation elements in the voice and the context information corresponding to the pronunciation elements.
The embodiments of the present application also provide an apparatus 800 for synthesizing a speaking expression. Referring to Fig. 8a, the apparatus 800 includes a determination unit 801 and a first acquisition unit 802:
The determination unit 801 is configured to determine the text features corresponding to text content and the durations of the pronunciation elements identified by the text features; the text features include multiple sub-text features.
The first acquisition unit 802 is configured to obtain, through the text features, the durations of the identified pronunciation elements, and the expression model, the target expression features corresponding to the text content. The target expression features include multiple sub-expression features; any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features the sub-expression features corresponding to the target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element.
In one implementation, the expression model is trained on the first correspondence relationship, which embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to that element's time interval in the expression features.
In one implementation, referring to Fig. 8b, the apparatus 800 further includes a second acquisition unit 803: the second acquisition unit 803 is configured to obtain, through the text features, the durations of the identified pronunciation elements, and the acoustic model, the target acoustic features corresponding to the text content. In the target acoustic features, the sub-acoustic features corresponding to the target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element.
The acoustic model is trained on the second correspondence relationship, which embodies the correspondence between the duration of a pronunciation element and the sub-acoustic features corresponding to that element in the acoustic features.
In one implementation, the determination unit 801 is specifically configured to obtain, through the text features and the duration model, the durations of the pronunciation elements identified by the text features; the duration model is trained on historical text features and the durations of the pronunciation elements identified by those historical text features.
In one implementation, the text features identify the pronunciation elements in the text content and the context information corresponding to the pronunciation elements.
It can be seen from the above technical solution that, in order to determine varied and naturally changing speaking expressions for a virtual object, the embodiments of the present application provide a completely new expression model training apparatus. From a video containing the speaker's facial expressions and the corresponding voice, the apparatus obtains the expression features of the speaker, the acoustic features of the voice, and the text features of the voice. Since the acoustic features and the text features are obtained from the same video, the time intervals and durations of the pronunciation elements identified by the text features can be determined from the acoustic features. The first correspondence relationship is determined from those time intervals and durations and from the expression features; it embodies the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to that element's time interval in the expression features. For a target pronunciation element among the identified pronunciation elements, the sub-expression features within its time interval can be determined from the expression features, and its duration can embody the various durations of that element across the various expressive sentences of the video's speech; the sub-expression features so determined can therefore embody the possible expressions with which the speaker utters the target pronunciation element in different expressive sentences. Consequently, for a text feature whose expression features are to be determined in voice interaction, the apparatus for determining a speaking expression can, through the expression model trained on the first correspondence relationship, determine different sub-expression features for the same pronunciation element with different durations, increasing the variety of speaking expressions. The speaking expression generated from the target expression features determined by the expression model varies for the same pronunciation element, which alleviates, to some extent, unnatural expression transitions.
The embodiments of the present application also provide a server, which can serve as the model training device for synthesizing a speaking expression or as the device for synthesizing a speaking expression. The server is introduced below in conjunction with the accompanying drawings. Referring to Fig. 9, the server 900 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 922 (for example, one or more processors), memory 932, and one or more storage media 930 (such as one or more mass storage devices) storing application programs 942 or data 944. The memory 932 and the storage medium 930 may provide transient or persistent storage. The program stored in the storage medium 930 may include one or more modules (not marked in the figure), each of which may include a series of instruction operations on the server. Further, the central processing unit 922 may be configured to communicate with the storage medium 930 and to execute, on the device 900, the series of instruction operations in the storage medium 930.
The device 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments can be based on the server structure shown in Fig. 9, where the CPU 922 is configured to execute the following steps:
Obtain a video that includes the speaker's facial expressions and the corresponding voice;

Obtain, from the video, the expression features of the speaker, the acoustic features of the voice, and the text features of the voice, the acoustic features including multiple sub-acoustic features; determine, from the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features, where any pronunciation element identified by the text features is a target pronunciation element, the time interval of the target pronunciation element is the time interval, in the video, of the sub-acoustic features corresponding to that element in the acoustic features, and the duration of the target pronunciation element is the duration of those corresponding sub-acoustic features;
Determine the first correspondence relationship from the time intervals and durations of the pronunciation elements identified by the text features and from the expression features, the first correspondence relationship embodying the correspondence between the duration of a pronunciation element and the sub-expression features corresponding to that element's time interval in the expression features;

Train the expression model according to the first correspondence relationship, the expression model determining the corresponding target expression features from to-be-determined text features and the durations of the pronunciation elements those text features identify.
Alternatively, the CPU 922 is configured to execute the following steps:
Determine the text features corresponding to text content and the durations of the pronunciation elements identified by the text features, the text features including multiple sub-text features;

Obtain, through the text features, the durations of the identified pronunciation elements, and the expression model, the target expression features corresponding to the text content; the target expression features include multiple sub-expression features, any pronunciation element identified by the text features is a target pronunciation element, and in the target expression features the sub-expression features corresponding to the target pronunciation element are obtained from the sub-text feature corresponding to that element in the text features and from the duration of that element.
Referring to Fig. 10, the embodiments of the present application provide a terminal device, which can serve as the device for synthesizing a speaking expression. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and the like. Taking a mobile phone as the terminal device as an example:
Fig. 10 shows a block diagram of part of the structure of a mobile phone related to the terminal device provided by the embodiments of the present application. Referring to Fig. 10, the mobile phone includes components such as a radio frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a processor 1080, and a power supply 1090. Those skilled in the art will understand that the mobile phone structure shown in Fig. 10 does not constitute a limitation on the mobile phone, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Each component of the mobile phone is introduced below with reference to Fig. 10:
The RF circuit 1010 can be used to receive and send signals during messaging or a call; in particular, after receiving downlink information from a base station, it delivers the information to the processor 1080 for processing, and it sends uplink data to the base station. In general, the RF circuit 1010 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1010 can also communicate with networks and other devices through wireless communication, which can use any communication standard or protocol, including but not limited to the Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
The memory 1020 can be used to store software programs and modules; the processor 1080 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required for at least one function (such as a sound playing function and an image playing function), and the data storage area can store data created according to the use of the mobile phone (such as audio data and a phone book). In addition, the memory 1020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The input unit 1030 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also referred to as a touch screen, collects the user's touch operations on or near it (such as operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 1031) and drives the corresponding connected apparatus according to a preset program. Optionally, the touch panel 1031 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch orientation and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 1080, and can receive and execute commands sent by the processor 1080. In addition, the touch panel 1031 can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave types. Besides the touch panel 1031, the input unit 1030 may also include other input devices 1032, which may specifically include but are not limited to one or more of a physical keyboard, function keys (such as a volume control button and a power switch button), a trackball, a mouse, a joystick, and the like.
The display unit 1040 can be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. The display unit 1040 may include a display panel 1041, which may optionally be configured in a form such as a liquid crystal display (LCD) or an organic light-emitting diode (OLED) display. Further, the touch panel 1031 can cover the display panel 1041; after detecting a touch operation on or near it, the touch panel 1031 transmits the operation to the processor 1080 to determine the type of the touch event, and the processor 1080 then provides a corresponding visual output on the display panel 1041 according to the type of the touch event. Although in Fig. 10 the touch panel 1031 and the display panel 1041 serve as two independent components to implement the input and output functions of the mobile phone, in some embodiments the touch panel 1031 and the display panel 1041 can be integrated to implement the input and output functions of the mobile phone.
The mobile phone may further include at least one sensor 1050, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (generally along three axes), can detect the magnitude and direction of gravity when the phone is static, and can be used for applications that identify the phone's posture (such as landscape/portrait switching, related games, and magnetometer pose calibration) and for vibration-recognition functions (such as a pedometer or tap detection). Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured on the mobile phone; details are not described herein.
The audio circuit 1060, a loudspeaker 1061, and a microphone 1062 may provide an audio interface between the user and the mobile phone. The audio circuit 1060 may convert received audio data into an electrical signal and transmit it to the loudspeaker 1061, which converts it into a sound signal for output. Conversely, the microphone 1062 converts a collected sound signal into an electrical signal, which the audio circuit 1060 receives and converts into audio data; after the audio data is output to the processor 1080 for processing, it is sent through the RF circuit 1010 to, for example, another mobile phone, or output to the memory 1020 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1070, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and so on; it provides wireless broadband Internet access for the user. Although Figure 10 shows the WiFi module 1070, it can be understood that the module is not an essential component of the mobile phone and may be omitted as needed within the scope that does not change the essence of the invention.
The processor 1080 is the control center of the mobile phone. It connects all parts of the entire mobile phone through various interfaces and lines, and executes the various functions of the mobile phone and processes data by running or executing software programs and/or modules stored in the memory 1020 and invoking data stored in the memory 1020, thereby monitoring the mobile phone as a whole. Optionally, the processor 1080 may include one or more processing units. Preferably, the processor 1080 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1080.
The mobile phone further includes a power supply 1090 (such as a battery) that supplies power to all components. Preferably, the power supply may be logically connected to the processor 1080 through a power management system, so that functions such as charging management, discharging management, and power-consumption management are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like; details are not described herein.
In this embodiment, the processor 1080 included in the terminal device also has the following functions:

determining the text feature corresponding to the text content and the durations of the pronunciation elements identified by the text feature, where the text feature includes multiple sub-text features;

obtaining, through the text feature, the durations of the identified pronunciation elements, and an expression model, the target expression feature corresponding to the text content, where the target expression feature includes multiple sub-expression features; any one pronunciation element identified by the text feature is a target pronunciation element, and in the target expression feature, the sub-expression feature corresponding to the target pronunciation element is determined according to the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element.
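The processor's flow above (text content → text feature with per-element durations → expression model → per-element sub-expression features) can be paraphrased as a minimal sketch. All function names and the stub models here are hypothetical; the patent does not specify the front end or the internals of the expression model. The key property illustrated is the claim's determination rule: each sub-expression feature depends on both the element's sub-text feature and its duration, so the same pronunciation element with different durations yields different sub-expression features.

```python
# Hypothetical sketch of the inference pipeline described above.

def extract_text_feature(text: str) -> list[tuple[str, str]]:
    # Placeholder front end: one (pronunciation element, sub-text feature) per character.
    return [(ch, f"ctx:{i}") for i, ch in enumerate(text)]

def predict_durations(feature: list[tuple[str, str]]) -> list[float]:
    # Placeholder duration model: a fixed 0.2 s per pronunciation element.
    return [0.2 for _ in feature]

def expression_model(sub_text_feature: str, duration: float) -> dict:
    # Stub: the sub-expression feature depends on BOTH the sub-text feature
    # and the duration, matching the determination rule in the claims.
    return {"mouth_open": min(1.0, duration * 2), "context": sub_text_feature}

def synthesize_expression(text: str) -> list[dict]:
    feature = extract_text_feature(text)
    durations = predict_durations(feature)
    return [expression_model(sub, d) for (_, sub), d in zip(feature, durations)]

target = synthesize_expression("hi")
print(len(target))  # one sub-expression feature per pronunciation element
```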
The embodiments of the present application also provide a computer-readable storage medium for storing program code, where the program code is used to execute the model training method for synthesizing speaking expressions described in the embodiments corresponding to Figures 2 to 3, or the method for synthesizing speaking expressions described in the embodiment corresponding to Figure 5.
The terms "first", "second", "third", "fourth", and so on (if any) in the specification and the above drawings of the present application are used to distinguish similar objects, and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application described herein can, for example, be implemented in an order other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
It should be understood that in the present application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate three cases: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following (items)" or a similar expression refers to any combination of those items, including any combination of single items or plural items. For example, "at least one of a, b, or c" may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and in actual implementation there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (12)
- 1. A method for synthesizing a speaking expression based on artificial intelligence, characterized in that the method includes: obtaining text content sent by a terminal; determining a text feature corresponding to the text content and durations of pronunciation elements identified by the text feature, where the text feature includes multiple sub-text features; obtaining, through an expression model, a target expression feature corresponding to the text feature and the durations of the identified pronunciation elements, where the target expression feature includes multiple sub-expression features, any one pronunciation element identified by the text feature is a target pronunciation element, and in the target expression feature the sub-expression feature corresponding to the target pronunciation element is determined according to the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element; and returning the target expression feature to the terminal.
- 2. The method according to claim 1, characterized in that the method further includes: obtaining, through the text feature, the durations of the identified pronunciation elements, and an acoustic model, a target acoustic feature corresponding to the text content, where in the target acoustic feature the sub-acoustic feature corresponding to the target pronunciation element is determined according to the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element.
- 3. The method according to claim 1, characterized in that determining the text feature corresponding to the text content and the durations of the pronunciation elements identified by the text feature includes: obtaining, through the text feature and a duration model, the durations of the pronunciation elements identified by the text feature.
- 4. The method according to any one of claims 1 to 3, characterized in that the text content is text fed back to a user interacting with a virtual object, and the text content includes text corresponding to the user's input content, or text corresponding to interactive information fed back according to the user's input content.
- 5. The method according to any one of claims 1 to 3, characterized in that the text feature is used to identify the pronunciation elements in the text content and the context information corresponding to the pronunciation elements.
- 6. The method according to any one of claims 1 to 3, characterized in that the expression features include at least mouth-shape features.
- 7. An apparatus for synthesizing a speaking expression based on artificial intelligence, characterized in that the apparatus includes an acquiring unit, a determining unit, a first acquiring unit, and a returning unit: the acquiring unit is configured to obtain text content sent by a terminal; the determining unit is configured to determine a text feature corresponding to the text content and durations of pronunciation elements identified by the text feature, where the text feature includes multiple sub-text features; the first acquiring unit is configured to obtain, through an expression model, a target expression feature corresponding to the text feature and the durations of the identified pronunciation elements, where the target expression feature includes multiple sub-expression features, any one pronunciation element identified by the text feature is a target pronunciation element, and in the target expression feature the sub-expression feature corresponding to the target pronunciation element is determined according to the sub-text feature corresponding to the target pronunciation element in the text feature and the duration of the target pronunciation element; and the returning unit is configured to return the target expression feature to the terminal.
- 8. A method for synthesizing a speaking expression based on artificial intelligence, characterized in that the method includes: obtaining a video containing a speaker's facial-action expressions and the corresponding speech; obtaining, from the video, the expression features of the speaker, the acoustic features of the speech, and the text features of the speech, where the acoustic features include multiple sub-acoustic features; determining, according to the text features and the acoustic features, the time intervals and durations of the pronunciation elements identified by the text features, where any one pronunciation element identified by the text features is a target pronunciation element, the time interval of the target pronunciation element is the time interval, in the video, of the sub-acoustic feature corresponding to the target pronunciation element in the acoustic features, and the duration of the target pronunciation element is the duration of the sub-acoustic feature corresponding to the target pronunciation element; training an expression model and an acoustic model according to the text features, the time intervals and durations of the pronunciation elements identified by the text features, the expression features, and the acoustic features, where the expression model is used to determine a corresponding target expression feature according to a to-be-determined text feature and the durations of the pronunciation elements identified by the to-be-determined text feature, and the acoustic model is used to determine a corresponding target acoustic feature according to a to-be-determined text feature and the durations of the pronunciation elements identified by the to-be-determined text feature; obtaining text content sent by a terminal; determining the text feature corresponding to the text content and the durations of the pronunciation elements identified by the text feature, where the text feature includes multiple sub-text features; obtaining, through the text feature, the durations of the identified pronunciation elements, and the expression model and the acoustic model, the target expression feature and target acoustic feature corresponding to the text content; and rendering the target expression feature and the target acoustic feature to generate an animation.
- 9. The method according to claim 8, characterized in that training the expression model and the acoustic model according to the text features, the time intervals and durations of the pronunciation elements identified by the text features, the expression features, and the acoustic features includes: determining a first correspondence according to the text features, the time intervals and durations of the identified pronunciation elements, and the expression features, where the first correspondence is used to reflect the correspondence between the duration and time interval of a pronunciation element and the corresponding sub-expression feature in the expression features; training the expression model according to the first correspondence; determining a second correspondence between the pronunciation elements identified by the text features and the acoustic features, where the second correspondence is used to reflect the correspondence between the duration of a pronunciation element and the corresponding sub-acoustic feature in the acoustic features; and training the acoustic model according to the second correspondence.
- 10. The method according to claim 8 or 9, characterized in that the method further includes: training a duration model according to the text features and the durations of the pronunciation elements identified by the text features, where the duration model is used to determine, according to a to-be-determined text feature, the durations of the pronunciation elements identified by the to-be-determined text feature; and the durations of the pronunciation elements identified by the text feature are determined as follows: obtaining, through the text feature and the duration model, the durations of the pronunciation elements identified by the text feature.
- 11. A device for synthesizing a speaking expression based on artificial intelligence, characterized in that the device includes a processor and a memory: the memory is configured to store program code and transfer the program code to the processor; and the processor is configured to execute, according to instructions in the program code, the method for synthesizing a speaking expression based on artificial intelligence according to any one of claims 1-6 or 7-10.
- 12. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store program code, and the program code is used to execute the method for synthesizing a speaking expression based on artificial intelligence according to any one of claims 1-6 or 7-10.
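The training flow of claims 8-10 can be paraphrased as a minimal sketch under stated assumptions: alignment of pronunciation elements to the audio yields per-element time intervals and durations, from which the first correspondence (duration and time interval → sub-expression feature) and second correspondence (duration → sub-acoustic feature) are built. The claims do not fix an alignment method or model family; the equal-length "forced alignment" and all names below are hypothetical placeholders.

```python
# Hypothetical sketch of the data preparation in claims 8-10.

def align(elements: list[str], element_len: float) -> list[tuple[float, float, float]]:
    # Toy forced alignment: equal-length elements -> (start, end, duration).
    return [(i * element_len, (i + 1) * element_len, element_len)
            for i in range(len(elements))]

def build_correspondences(elements, alignment, expr_frames, frame_rate=100):
    first, second = [], []
    for elem, (start, end, dur) in zip(elements, alignment):
        # First correspondence: (element, duration, time interval) -> the
        # sub-expression feature, i.e. the expression frames inside the interval.
        frames = expr_frames[int(start * frame_rate):int(end * frame_rate)]
        first.append(((elem, dur, (start, end)), frames))
        # Second correspondence: (element, duration) -> its sub-acoustic feature.
        second.append(((elem, dur), f"sub_acoustic[{start:.2f}-{end:.2f}]"))
    return first, second

elements = ["n", "i", "h", "ao"]                     # pronunciation elements
alignment = align(elements, 0.1)                     # 0.1 s per element
expr_frames = list(range(40))                        # 0.4 s of expression frames at 100 fps
first, second = build_correspondences(elements, alignment, expr_frames)
print(len(first), len(second))  # one pair per pronunciation element
```

The expression model would then be fit on `first` and the acoustic model on `second`, as claim 9 describes.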
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910745062.8A CN110288077B (en) | 2018-11-14 | 2018-11-14 | Method and related device for synthesizing speaking expression based on artificial intelligence |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811354206.9A CN109447234B (en) | 2018-11-14 | 2018-11-14 | Model training method, method for synthesizing speaking expression and related device |
CN201910745062.8A CN110288077B (en) | 2018-11-14 | 2018-11-14 | Method and related device for synthesizing speaking expression based on artificial intelligence |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811354206.9A Division CN109447234B (en) | 2018-11-14 | 2018-11-14 | Model training method, method for synthesizing speaking expression and related device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110288077A true CN110288077A (en) | 2019-09-27 |
CN110288077B CN110288077B (en) | 2022-12-16 |
Family
ID=65552918
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910745062.8A Active CN110288077B (en) | 2018-11-14 | 2018-11-14 | Method and related device for synthesizing speaking expression based on artificial intelligence |
CN201811354206.9A Active CN109447234B (en) | 2018-11-14 | 2018-11-14 | Model training method, method for synthesizing speaking expression and related device |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811354206.9A Active CN109447234B (en) | 2018-11-14 | 2018-11-14 | Model training method, method for synthesizing speaking expression and related device |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110288077B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111369687A (en) * | 2020-03-04 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Method and device for synthesizing action sequence of virtual object |
CN111369967A (en) * | 2020-03-11 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Virtual character-based voice synthesis method, device, medium and equipment |
Families Citing this family (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US8676904B2 (en) | 2008-10-02 | 2014-03-18 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US20120309363A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Triggering notifications associated with tasks items that represent tasks to perform |
US10417037B2 (en) | 2012-05-15 | 2019-09-17 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
EP3809407A1 (en) | 2013-02-07 | 2021-04-21 | Apple Inc. | Voice trigger for a digital assistant |
US10652394B2 (en) | 2013-03-14 | 2020-05-12 | Apple Inc. | System and method for processing voicemail |
US10748529B1 (en) | 2013-03-15 | 2020-08-18 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9966065B2 (en) | 2014-05-30 | 2018-05-08 | Apple Inc. | Multi-command single utterance input method |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US10200824B2 (en) | 2015-05-27 | 2019-02-05 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device |
US20160378747A1 (en) | 2015-06-29 | 2016-12-29 | Apple Inc. | Virtual assistant for media playback |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10740384B2 (en) | 2015-09-08 | 2020-08-11 | Apple Inc. | Intelligent automated assistant for media search and playback |
US10331312B2 (en) | 2015-09-08 | 2019-06-25 | Apple Inc. | Intelligent automated assistant in a media environment |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10956666B2 (en) | 2015-11-09 | 2021-03-23 | Apple Inc. | Unconventional virtual assistant interactions |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
US10726832B2 (en) | 2017-05-11 | 2020-07-28 | Apple Inc. | Maintaining privacy of personal information |
DK180048B1 (en) | 2017-05-11 | 2020-02-04 | Apple Inc. | MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK201770429A1 (en) | 2017-05-12 | 2018-12-14 | Apple Inc. | Low-latency intelligent automated assistant |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
US20180336892A1 (en) | 2017-05-16 | 2018-11-22 | Apple Inc. | Detecting a trigger of a digital assistant |
US10303715B2 (en) | 2017-05-16 | 2019-05-28 | Apple Inc. | Intelligent automated assistant for media exploration |
US10818288B2 (en) | 2018-03-26 | 2020-10-27 | Apple Inc. | Natural assistant interaction |
US11145294B2 (en) | 2018-05-07 | 2021-10-12 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US10928918B2 (en) | 2018-05-07 | 2021-02-23 | Apple Inc. | Raise to speak |
DK179822B1 (en) | 2018-06-01 | 2019-07-12 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
DK180639B1 (en) | 2018-06-01 | 2021-11-04 | Apple Inc | DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT |
US10892996B2 (en) | 2018-06-01 | 2021-01-12 | Apple Inc. | Variable latency device coordination |
US11462215B2 (en) | 2018-09-28 | 2022-10-04 | Apple Inc. | Multi-modal inputs for voice commands |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
DK201970509A1 (en) | 2019-05-06 | 2021-01-15 | Apple Inc | Spoken notifications |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
DK180129B1 (en) | 2019-05-31 | 2020-06-02 | Apple Inc. | User activity shortcut suggestions |
DK201970511A1 (en) | 2019-05-31 | 2021-02-15 | Apple Inc | Voice identification in digital assistant systems |
US11468890B2 (en) | 2019-06-01 | 2022-10-11 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
CN110531860B (en) | 2019-09-02 | 2020-07-24 | 腾讯科技(深圳)有限公司 | Animation image driving method and device based on artificial intelligence |
CN111260761B (en) * | 2020-01-15 | 2023-05-09 | 北京猿力未来科技有限公司 | Method and device for generating mouth shape of animation character |
US11593984B2 (en) | 2020-02-07 | 2023-02-28 | Apple Inc. | Using text for avatar animation |
CN111225237B (en) * | 2020-04-23 | 2020-08-21 | 腾讯科技(深圳)有限公司 | Sound and picture matching method of video, related device and storage medium |
US11038934B1 (en) | 2020-05-11 | 2021-06-15 | Apple Inc. | Digital assistant hardware abstraction |
US11061543B1 (en) | 2020-05-11 | 2021-07-13 | Apple Inc. | Providing relevant data items based on context |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11490204B2 (en) | 2020-07-20 | 2022-11-01 | Apple Inc. | Multi-device audio adjustment coordination |
US11438683B2 (en) | 2020-07-21 | 2022-09-06 | Apple Inc. | User identification using headphones |
CN112396182B (en) * | 2021-01-19 | 2021-04-16 | 腾讯科技(深圳)有限公司 | Method for training face driving model and generating face mouth shape animation |
CN113079328B (en) * | 2021-03-19 | 2023-03-28 | 北京有竹居网络技术有限公司 | Video generation method and device, storage medium and electronic equipment |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1971621A (en) * | 2006-11-10 | 2007-05-30 | 中国科学院计算技术研究所 | Generating method of cartoon face driven by voice and text together |
CN101176146A (en) * | 2005-05-18 | 2008-05-07 | 松下电器产业株式会社 | Speech synthesizer |
CN101474481A (en) * | 2009-01-12 | 2009-07-08 | 北京科技大学 | Emotional robot system |
CN104063427A (en) * | 2014-06-06 | 2014-09-24 | 北京搜狗科技发展有限公司 | Expression input method and device based on semantic understanding |
US20150042662A1 (en) * | 2013-08-08 | 2015-02-12 | Kabushiki Kaisha Toshiba | Synthetic audiovisual storyteller |
JP2015125613A (en) * | 2013-12-26 | 2015-07-06 | Kddi株式会社 | Animation generation device, data format, animation generation method and program |
CN104850335A (en) * | 2015-05-28 | 2015-08-19 | 瞬联软件科技(北京)有限公司 | Expression curve generating method based on voice input |
CN105760852A (en) * | 2016-03-14 | 2016-07-13 | 江苏大学 | Driver emotion real time identification method fusing facial expressions and voices |
CN105931631A (en) * | 2016-04-15 | 2016-09-07 | 北京地平线机器人技术研发有限公司 | Voice synthesis system and method |
US20160321243A1 (en) * | 2014-01-10 | 2016-11-03 | Cluep Inc. | Systems, devices, and methods for automatic detection of feelings in text |
CN106293074A (en) * | 2016-07-29 | 2017-01-04 | 维沃移动通信有限公司 | A kind of Emotion identification method and mobile terminal |
CN107301168A (en) * | 2017-06-01 | 2017-10-27 | 深圳市朗空亿科科技有限公司 | Intelligent robot and its mood exchange method, system |
CN107634901A (en) * | 2017-09-19 | 2018-01-26 | 广东小天才科技有限公司 | Method for pushing, pusher and the terminal device of session expression |
WO2018084305A1 (en) * | 2016-11-07 | 2018-05-11 | ヤマハ株式会社 | Voice synthesis method |
CN108320021A (en) * | 2018-01-23 | 2018-07-24 | 深圳狗尾草智能科技有限公司 | Robot motion determines method, displaying synthetic method, device with expression |
CN108597541A (en) * | 2018-04-28 | 2018-09-28 | 南京师范大学 | A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2510200B (en) * | 2013-01-29 | 2017-05-10 | Toshiba Res Europe Ltd | A computer generated head |
CN105206258B (en) * | 2015-10-19 | 2018-05-04 | 百度在线网络技术(北京)有限公司 | The generation method and device and phoneme synthesizing method and device of acoustic model |
CN108763190B (en) * | 2018-04-12 | 2019-04-02 | 平安科技(深圳)有限公司 | Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing |
2018
- 2018-11-14 CN CN201910745062.8A patent/CN110288077B/en active Active
- 2018-11-14 CN CN201811354206.9A patent/CN109447234B/en active Active
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101176146A (en) * | 2005-05-18 | 2008-05-07 | 松下电器产业株式会社 | Speech synthesizer |
US20090234652A1 (en) * | 2005-05-18 | 2009-09-17 | Yumiko Kato | Voice synthesis device |
CN1971621A (en) * | 2006-11-10 | 2007-05-30 | 中国科学院计算技术研究所 | Generating method of cartoon face driven by voice and text together |
CN101474481A (en) * | 2009-01-12 | 2009-07-08 | 北京科技大学 | Emotional robot system |
US20150042662A1 (en) * | 2013-08-08 | 2015-02-12 | Kabushiki Kaisha Toshiba | Synthetic audiovisual storyteller |
JP2015125613A (en) * | 2013-12-26 | 2015-07-06 | Kddi株式会社 | Animation generation device, data format, animation generation method and program |
US20160321243A1 (en) * | 2014-01-10 | 2016-11-03 | Cluep Inc. | Systems, devices, and methods for automatic detection of feelings in text |
CN104933113A (en) * | 2014-06-06 | 2015-09-23 | 北京搜狗科技发展有限公司 | Expression input method and device based on semantic understanding |
CN104063427A (en) * | 2014-06-06 | 2014-09-24 | 北京搜狗科技发展有限公司 | Expression input method and device based on semantic understanding |
CN104850335A (en) * | 2015-05-28 | 2015-08-19 | 瞬联软件科技(北京)有限公司 | Expression curve generating method based on voice input |
CN105760852A (en) * | 2016-03-14 | 2016-07-13 | 江苏大学 | Driver emotion real time identification method fusing facial expressions and voices |
CN105931631A (en) * | 2016-04-15 | 2016-09-07 | 北京地平线机器人技术研发有限公司 | Voice synthesis system and method |
CN106293074A (en) * | 2016-07-29 | 2017-01-04 | 维沃移动通信有限公司 | Emotion recognition method and mobile terminal |
WO2018084305A1 (en) * | 2016-11-07 | 2018-05-11 | ヤマハ株式会社 | Voice synthesis method |
CN107301168A (en) * | 2017-06-01 | 2017-10-27 | 深圳市朗空亿科科技有限公司 | Intelligent robot and emotion interaction method and system therefor |
CN107634901A (en) * | 2017-09-19 | 2018-01-26 | 广东小天才科技有限公司 | Push method and push apparatus for conversation expressions, and terminal device |
CN108320021A (en) * | 2018-01-23 | 2018-07-24 | 深圳狗尾草智能科技有限公司 | Method for determining robot motions and expressions, display synthesis method, and device |
CN108597541A (en) * | 2018-04-28 | 2018-09-28 | 南京师范大学 | Speech emotion recognition method and system with enhanced recognition of anger and happiness |
Non-Patent Citations (2)
Title |
---|
He Wenjing et al.: "Implementation of a Realistic Virtual Sign Language Presenter", Microcomputer Information * |
Chen Pengzhan et al.: "Bimodal Emotion Recognition Based on Speech Signals and Text Information", Journal of East China Jiaotong University * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111369687A (en) * | 2020-03-04 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Method and device for synthesizing action sequence of virtual object |
CN111369687B (en) * | 2020-03-04 | 2021-03-30 | 腾讯科技(深圳)有限公司 | Method and device for synthesizing action sequence of virtual object |
CN111369967A (en) * | 2020-03-11 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Virtual character-based voice synthesis method, device, medium and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109447234A (en) | 2019-03-08 |
CN110288077B (en) | 2022-12-16 |
CN109447234B (en) | 2022-10-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110288077A (en) | Method and related apparatus for synthesizing a talking expression based on artificial intelligence | |
CN110531860B (en) | Animation image driving method and device based on artificial intelligence | |
CN110853618B (en) | Language identification method, model training method, device and equipment | |
JP7312853B2 (en) | AI-BASED VOICE-DRIVEN ANIMATION METHOD AND APPARATUS, DEVICE AND COMPUTER PROGRAM | |
CN110838286B (en) | Model training method, language identification method, device and equipment | |
CN110490213B (en) | Image recognition method, device and storage medium | |
CN110853617B (en) | Model training method, language identification method, device and equipment | |
CN110381389A (en) | Caption generation method and device based on artificial intelligence | |
CN107294837A (en) | Method and system for dialogue interaction using a virtual robot | |
WO2016004266A2 (en) | Generating computer responses to social conversational inputs | |
CN110808034A (en) | Voice conversion method, device, storage medium and electronic equipment | |
CN112040263A (en) | Video processing method, video playing method, video processing device, video playing device, storage medium and equipment | |
CN111414506B (en) | Emotion processing method and device based on artificial intelligence, electronic equipment and storage medium | |
CN113421547B (en) | Voice processing method and related equipment | |
CN110322760A (en) | Voice data generation method, device, terminal and storage medium | |
CN109801618A (en) | Method and device for generating audio information | |
CN110162600A (en) | Information processing method, and conversational response method and device | |
CN114882862A (en) | Voice processing method and related equipment | |
CN114360510A (en) | Voice recognition method and related device | |
CN115866327A (en) | Background music adding method and related device | |
JPWO2019044534A1 (en) | Information processing device and information processing method | |
CN117219043A (en) | Model training method, model application method and related device | |
CN116959407A (en) | Pronunciation prediction method and device and related products | |
CN116978359A (en) | Phoneme recognition method, device, electronic equipment and storage medium | |
CN117991908A (en) | Method, device, equipment and storage medium for interacting with virtual image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |