CN107623622A - Method and electronic device for sending a speech animation - Google Patents
Method and electronic device for sending a speech animation

- Publication number: CN107623622A (application CN201610560255.2A)
- Authority: CN (China)
- Prior art keywords: animation, expression, voice information, image, voice fragment
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Landscapes: Processing Or Creating Images (AREA)
Abstract
Embodiments of the invention provide a method and an electronic device for sending a speech animation. The method includes: obtaining an image; obtaining voice information; animating the image according to the voice information to generate a speech animation; and sending the speech animation to a recipient. Because the image is animated according to the voice information to generate the speech animation, and the speech animation is sent to the recipient, the recipient can watch the speech animation. Compared with a plain voice message, a speech animation is more entertaining and adds a visual effect, which improves the user experience.
Description
Technical field
The present invention relates to the field of communications, and in particular to a method and an electronic device for sending a speech animation.
Background art
With the development of the Internet, people communicate by voice message more and more. However, current voice messages are rather dull and lack entertainment value.
Summary of the invention
Embodiments of the invention provide a method and an electronic device for sending a speech animation, so as to make voice messages more entertaining and to improve the user experience.
According to a first aspect of the invention, a method for sending a speech animation is provided. The method includes:
obtaining an image;
obtaining voice information;
animating the image according to the voice information to generate a speech animation; and
sending the speech animation to a recipient.
In a first possible implementation of the first aspect, obtaining the image includes:
obtaining a preset image or an image input by a user.
In a second possible implementation of the first aspect, obtaining the voice information includes:
obtaining voice information input by the user.
In a third possible implementation of the first aspect, animating the image according to the voice information to generate the speech animation includes:
dividing the voice information into multiple voice fragments;
obtaining a feature of each of the multiple voice fragments;
selecting, according to the feature of each voice fragment, a corresponding mouth expression;
generating, for each voice fragment, a corresponding expression frame from the corresponding mouth expression and the image; and
generating the speech animation from the expression frames corresponding to the multiple voice fragments and the voice information.
In a fourth possible implementation, based on the third possible implementation of the first aspect, dividing the voice information into multiple voice fragments includes:
dividing the voice information into multiple voice fragments according to a preset animation frame rate, where each voice fragment corresponds to one frame of the animation.
In a fifth possible implementation, based on the third possible implementation of the first aspect, selecting the mouth expression corresponding to the feature of each voice fragment includes:
selecting, from a preset expression library, the mouth expression corresponding to the feature according to the feature of the voice fragment and a preset model.
In a sixth possible implementation, based on the third possible implementation of the first aspect, selecting the mouth expression corresponding to the feature of each voice fragment includes:
selecting, from a preset expression library, the mouth expression corresponding to the feature according to the feature of the voice fragment, a preset model, and the mouth expression corresponding to the previous voice fragment.
In a seventh possible implementation, based on the third possible implementation of the first aspect, generating the expression frame corresponding to each voice fragment from the corresponding mouth expression and the image includes:
combining the mouth expression corresponding to the voice fragment with the image to generate the corresponding expression frame.
In an eighth possible implementation, based on the third possible implementation of the first aspect, generating the speech animation from the expression frames corresponding to the multiple voice fragments and the voice information includes:
arranging the expression frames corresponding to the multiple voice fragments in order to generate an expression animation; and
packaging the expression animation and the voice information into the speech animation.
According to a second aspect of the invention, an electronic device is provided. The electronic device includes:
an image obtaining module, configured to obtain an image;
a voice obtaining module, configured to obtain voice information;
a speech animation generation module, configured to animate the image according to the voice information to generate a speech animation; and
a sending module, configured to send the speech animation to another electronic device.
In a first possible implementation of the second aspect, the image obtaining module is specifically configured to:
obtain a preset image or an image input by a user.
In a second possible implementation of the second aspect, the voice obtaining module is specifically configured to:
obtain voice information input by the user.
In a third possible implementation of the second aspect, the speech animation generation module includes:
a voice splitting submodule, configured to divide the voice information into multiple voice fragments;
a feature obtaining submodule, configured to obtain a feature of each of the multiple voice fragments;
a mouth expression selection submodule, configured to select, according to the feature of each voice fragment, a corresponding mouth expression;
an expression frame generation submodule, configured to generate, for each voice fragment, a corresponding expression frame from the corresponding mouth expression and the image; and
a speech animation generation submodule, configured to generate the speech animation from the expression frames corresponding to the multiple voice fragments and the voice information.
In a fourth possible implementation, based on the third possible implementation of the second aspect, the voice splitting submodule is specifically configured to:
divide the voice information into multiple voice fragments according to a preset animation frame rate, where each voice fragment corresponds to one frame of the animation.
In a fifth possible implementation, based on the third possible implementation of the second aspect, the mouth expression selection submodule is specifically configured to:
select, from a preset expression library, the mouth expression corresponding to the feature according to the feature of the voice fragment and a preset model.
In a sixth possible implementation, based on the third possible implementation of the second aspect, the mouth expression selection submodule is specifically configured to:
select, from a preset expression library, the mouth expression corresponding to the feature according to the feature of the voice fragment, a preset model, and the mouth expression corresponding to the previous voice fragment.
In a seventh possible implementation, based on the third possible implementation of the second aspect, the expression frame generation submodule is specifically configured to:
combine the mouth expression corresponding to each voice fragment with the image to generate the corresponding expression frame.
In an eighth possible implementation, based on the third possible implementation of the second aspect, the speech animation generation submodule is specifically configured to:
arrange the expression frames corresponding to the multiple voice fragments in order to generate an expression animation; and
package the expression animation and the voice information into the speech animation.
According to a third aspect of the invention, an electronic device is provided. The electronic device includes:
a memory, an audio capture module, a network interface module, and a processor connected to the memory, the audio capture module, and the network interface module, where the memory is configured to store a set of program code, and the processor calls the program code stored in the memory to perform the following operations:
obtaining an image;
obtaining voice information;
animating the image according to the voice information to generate a speech animation; and
sending the speech animation to another electronic device.
In a first possible implementation of the third aspect, the processor calls the program code stored in the memory to perform the following operation:
obtaining a preset image or an image input by a user.
In a second possible implementation of the third aspect, the processor calls the program code stored in the memory to perform the following operation:
obtaining voice information input by the user.
In a third possible implementation of the third aspect, the processor calls the program code stored in the memory to perform the following operations:
dividing the voice information into multiple voice fragments;
obtaining a feature of each of the multiple voice fragments;
selecting, according to the feature of each voice fragment, a corresponding mouth expression;
generating, for each voice fragment, a corresponding expression frame from the corresponding mouth expression and the image; and
generating the speech animation from the expression frames corresponding to the multiple voice fragments and the voice information.
In a fourth possible implementation, based on the third possible implementation of the third aspect, the processor calls the program code stored in the memory to perform the following operation:
dividing the voice information into multiple voice fragments according to a preset animation frame rate, where each voice fragment corresponds to one frame of the animation.
In a fifth possible implementation, based on the third possible implementation of the third aspect, the processor calls the program code stored in the memory to perform the following operation:
selecting, from a preset expression library, the mouth expression corresponding to the feature according to the feature of the voice fragment and a preset model.
In a sixth possible implementation, based on the third possible implementation of the third aspect, the processor calls the program code stored in the memory to perform the following operation:
selecting, from a preset expression library, the mouth expression corresponding to the feature according to the feature of the voice fragment, a preset model, and the mouth expression corresponding to the previous voice fragment.
In a seventh possible implementation, based on the third possible implementation of the third aspect, the processor calls the program code stored in the memory to perform the following operation:
combining the mouth expression corresponding to each voice fragment with the image to generate the corresponding expression frame.
In an eighth possible implementation, based on the third possible implementation of the third aspect, the processor calls the program code stored in the memory to perform the following operations:
arranging the expression frames corresponding to the multiple voice fragments in order to generate an expression animation; and
packaging the expression animation and the voice information into the speech animation.
Embodiments of the invention provide a method and an electronic device for sending a speech animation. By animating an image according to voice information to generate a speech animation and sending the speech animation to a recipient, the recipient is enabled to watch the speech animation. Compared with a plain voice message, a speech animation is more entertaining and adds a visual effect, which improves the user experience. Furthermore, the method generates the corresponding speech animation from the voice in real time, without obtaining any facial information, and therefore has the advantages of high efficiency, high speed, few restrictions, and low resource consumption.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative effort.
Fig. 1 shows a flowchart of a method for sending a speech animation according to an embodiment of the present invention;
Fig. 2 shows a flowchart of the step of animating the image according to the voice information to generate the speech animation, according to an embodiment of the present invention;
Fig. 3 shows a block diagram of an electronic device according to an embodiment of the present invention;
Fig. 4 shows a block diagram of an electronic device according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of switching a preset image according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of switching a preset image according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of switching a preset image according to an embodiment of the present invention.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments. Apparently, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiments of the invention provide a method and an electronic device for sending a speech animation. By animating an image according to voice information to generate a speech animation and sending the speech animation to a recipient, the recipient is enabled to watch the speech animation. Compared with a plain voice message, a speech animation is more entertaining and adds a visual effect, which improves the user experience. Furthermore, the method generates the corresponding speech animation from the voice in real time, without obtaining any facial information, and therefore has the advantages of high efficiency, high speed, few restrictions, and low resource consumption.
Fig. 1 shows a flowchart of a method for sending a speech animation according to an embodiment of the present invention. The method may be performed in a terminal. The terminal may include, but is not limited to, a mobile phone, a laptop or notebook computer, a desktop computer, a personal digital assistant (PDA), or a game console. As shown in Fig. 1, the method may include the following steps:
Step 101: Obtain an image.
In one embodiment, obtaining the image includes obtaining a preset image. In one example, the preset image is a default image, which is obtained in every operation. In another example, the preset image is the image last input by the user, so that the preset image is updated each time the user inputs a new image.

In another embodiment, obtaining the image includes obtaining an image input by the user. In one example, the image input by the user is an image the user selects from a local image library, so that the user can generate an animation from any local image. In another example, the image input by the user is a photo the user takes with a camera, so that the user can generate an animation from a freshly taken photo; the photo may or may not be a selfie. In yet another example, the image input by the user is an image the user selects from a displayed group of preset images. In a further example, the user may switch between several preset images while the speech animation is being generated. As shown in Figs. 5 to 7, there are several preset images, and switching between them is possible while the expression frames are being generated: while a certain preset image is selected, the expression frames are generated from that preset image; after the user switches to another preset image, subsequent expression frames are generated from the preset image after the switch. For instance, the preset image shown in Fig. 5 is selected initially; after a while the user switches in real time to the preset image shown in Fig. 6, and after another while to the preset image shown in Fig. 7.
Step 102: Obtain voice information.
In one embodiment, obtaining the voice information includes obtaining voice information input by the user. In one example, the voice information input by the user is captured live through an audio capture module (such as a microphone) included in or coupled to the terminal. Specifically, this example may include capturing audio information through the microphone in real time and separating the voice information from the audio information. Generally, the frequency of the human voice ranges from about 300 Hz to 4000 Hz, so the audio information can be filtered to separate out the 300 Hz to 4000 Hz band as the human voice information. Optionally, the voice information can be further separated by sound intensity: because human speech typically lies between 40 dB and 60 dB, the audio information can be filtered by level to keep only the portion whose intensity is between 40 dB and 60 dB. Optionally, the separated voice information can further be denoised to obtain more accurate voice information. In another example, the voice information input by the user is selected by the user from a local voice library.
Step 103: Animate the image according to the voice information to generate the speech animation.

Specifically, step 103 may include the following steps:
dividing the voice information into multiple voice fragments;
obtaining a feature of each of the multiple voice fragments;
selecting, according to the feature of each voice fragment, a corresponding mouth expression;
generating, for each voice fragment, a corresponding expression frame from the corresponding mouth expression and the image; and
generating the speech animation from the expression frames corresponding to the multiple voice fragments and the voice information.

These steps are described in detail below with reference to Fig. 2.
Step 104: Send the speech animation to a recipient.

There may be one or more recipients. For example, the speech animation may be sent to one or more terminals associated with one or more users of a social application. As another example, the speech animation may be sent to multiple terminals associated with one or more groups of a social application. The sending may go through a relay server or be direct, without a relay server.
In one embodiment, before step 101, the method further includes: obtaining a start-speech-animation instruction. For example, the user may trigger the start-speech-animation instruction by tapping a preset button in the interface of an instant messaging application. The embodiments of the present invention do not limit the way the start-speech-animation instruction is issued.

In one embodiment, the method further includes a step of displaying the speech animation, so that the user can directly experience the speech animation being recorded.
The embodiments of the invention provide a method for sending a speech animation. By animating an image according to voice information to generate a speech animation and sending the speech animation to a recipient, the recipient is enabled to watch the speech animation. Compared with a plain voice message, a speech animation is more entertaining and adds a visual effect, which improves the user experience.
Fig. 2 shows a flowchart of the step of animating the image according to the voice information to generate the speech animation, according to an embodiment of the present invention. As shown in Fig. 2, the step includes the following steps:

Step 201: Divide the voice information into multiple voice fragments.

Specifically, this step includes: dividing the voice information into multiple voice fragments according to a preset animation frame rate, where each voice fragment corresponds to one frame of the animation. For example, when the preset animation frame rate is 30 frames per second, the length of each voice fragment is 1/30 second; when the preset animation frame rate is 60 frames per second, the length of each voice fragment is 1/60 second. The embodiments of the present invention are not limited to a specific preset frame rate or splitting scheme.
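The splitting rule can be sketched as follows, assuming the voice information is a flat list of PCM samples; the function name and signature are illustrative, not from the patent.

```python
def split_voice(samples, sample_rate, fps):
    # one voice fragment per animation frame: fragment length = sample_rate / fps
    frag_len = sample_rate // fps
    return [samples[i:i + frag_len]
            for i in range(0, len(samples) - frag_len + 1, frag_len)]
```

At 30 frames per second and a 16 kHz sampling rate, each fragment holds 533 samples, i.e. roughly 1/30 second of audio; a trailing remainder shorter than one frame is simply dropped in this sketch.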
Step 202: Obtain a feature of each of the multiple voice fragments.

For example, the feature may be an MFCC (Mel Frequency Cepstral Coefficients) feature. The embodiments of the present invention are not limited to a specific feature.
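As a rough stand-in for MFCC extraction (real MFCCs apply a mel filterbank, log compression, and a DCT, typically via a signal-processing library), the idea of turning a voice fragment into a small feature vector can be illustrated with log band energies computed by a naive DFT. The band edges and function name here are assumptions for illustration only.

```python
import math

def band_energies(frame, sample_rate, bands=((300, 1000), (1000, 2000), (2000, 4000))):
    # crude spectral feature: log energy in a few frequency bands, via a naive O(n^2) DFT
    n = len(frame)
    feats = []
    for lo, hi in bands:
        energy = 0.0
        for k in range(int(lo * n / sample_rate), int(hi * n / sample_rate)):
            re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
            im = sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
            energy += re * re + im * im
        feats.append(math.log(energy + 1e-9))
    return feats
```

A 500 Hz tone, for instance, concentrates its energy in the first band, so the feature vector distinguishes fragments by their spectral content, which is all the downstream model needs.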
Step 203: Select, according to the feature of each voice fragment, a corresponding mouth expression.

In one embodiment, this process may include:
selecting, from a preset expression library and according to the feature and a preset model, the mouth expression corresponding to the feature.
The preset model can be trained by supervised learning. The training can follow any one of the following schemes:

Scheme 1:

a. Collect training data.
Collect a large amount of data containing correspondences between voice and mouth shape, such as film and television clips.
b. Preprocess the collected data.
Pick out the video frames in the collected data that contain a human mouth.
Extract the mouth shapes in these video frames and the MFCC features of the corresponding voice information.
c. Train a random forest (Random Forest) on these mouth shapes and the corresponding MFCC features; the trained random forest serves as the preset model.
When selecting the mouth expression corresponding to the feature, the feature is input into the trained random forest, which determines the mouth shape corresponding to the feature; the mouth expression corresponding to that mouth shape is then chosen from the preset expression library as the mouth expression corresponding to the feature.
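To make the classifier's role concrete, the following toy sketch trains a "forest" of depth-1 threshold trees and classifies a feature vector by majority vote. It is a drastic simplification (a binary mouth-shape label and random stumps, where a real random forest uses bagging and deep trees from an ML library such as scikit-learn); all names are illustrative.

```python
import random

def train_stump(feats, labels, rng):
    # depth-1 "tree": threshold on one randomly chosen feature dimension
    d = rng.randrange(len(feats[0]))
    thr = rng.choice(feats)[d]
    pred = [1 if f[d] >= thr else 0 for f in feats]
    agree = sum(p == y for p, y in zip(pred, labels))
    flip = agree < len(labels) - agree   # orient the stump to match the labels
    return d, thr, flip

def train_forest(feats, labels, n_trees=50, seed=0):
    rng = random.Random(seed)
    return [train_stump(feats, labels, rng) for _ in range(n_trees)]

def forest_predict(forest, f):
    # majority vote over all stumps
    votes = 0
    for d, thr, flip in forest:
        p = 1 if f[d] >= thr else 0
        votes += (1 - p) if flip else p
    return 1 if 2 * votes >= len(forest) else 0
```

In the patent's pipeline the predicted class would index into the preset expression library to pick the matching mouth expression.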
Scheme 2:

a. Collect training data.
Collect a large amount of data containing correspondences between voice and mouth open/closed state, such as film and television clips.
b. Preprocess the collected data.
Pick out the video frames in the collected data that contain a human mouth.
Extract the open/closed states of the mouths in these video frames and the MFCC features of the corresponding voice information.
c. Train an SVM (Support Vector Machine) on these open/closed states and the corresponding MFCC features; the trained SVM serves as the preset model.
When selecting the mouth expression corresponding to the feature, the feature is input into the trained SVM, which determines whether the mouth state corresponding to the feature is open or closed. If the state is open, an expression whose mouth is open is chosen from the preset expression library as the mouth expression corresponding to the feature; if the state is closed, an expression whose mouth is closed is chosen instead.
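A linear open/closed decision rule of the kind the SVM provides can be sketched as follows. The perceptron-style update below is a stand-in for actual SVM training (which maximizes the margin and would normally use a library such as libsvm or scikit-learn); labels are 1 for open and 0 for closed, and all names are illustrative.

```python
def train_linear(feats, labels, epochs=100, lr=0.1):
    # perceptron-style training of a linear rule: w.x + b > 0 means "open"
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            t = 1 if y else -1
            if t * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # misclassified: nudge the hyperplane toward this sample
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

def mouth_open(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0
```

Because the decision is a single dot product, classifying one fragment per animation frame is cheap enough for real-time use, which matches the speed advantage claimed for the SVM below.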
Scheme 3:

a. Collect training data.
Collect a large amount of data containing correspondences between voice and mouth shape, such as film and television clips.
b. Preprocess the collected data.
Pick out the video frames in the collected data that contain a human mouth.
Extract the facial feature points corresponding to the mouth shapes in these video frames and the MFCC features of the corresponding voice information.
c. Train a GMM (Gaussian Mixture Model) on these facial feature points and the corresponding MFCC features; the trained GMM serves as the preset model.
When selecting the mouth expression corresponding to the feature, the feature is input into the trained GMM, which determines the facial feature points corresponding to the feature; the mouth expression corresponding to those facial feature points is then chosen from the preset expression library as the mouth expression corresponding to the feature.
Scheme 4:

a. Collect training data.
Collect a large amount of data containing correspondences between voice and mouth shape, such as film and television clips.
b. Preprocess the collected data.
Pick out the video frames in the collected data that contain a human mouth.
Extract the facial feature points corresponding to the mouth shapes in these video frames and the MFCC features of the corresponding voice information.
c. Train a 3-layer neural network (Neural Network) on these facial feature points and the corresponding MFCC features; the trained 3-layer neural network serves as the preset model.
When selecting the mouth expression corresponding to the feature, the feature is input into the trained 3-layer neural network, which determines the facial feature points corresponding to the feature; the mouth expression corresponding to those facial feature points is then chosen from the preset expression library as the mouth expression corresponding to the feature.
In another embodiment, the process may include:
selecting, from a preset expression library, the mouth expression corresponding to the feature according to the feature of the voice fragment, a preset model, and the mouth expression corresponding to the previous voice fragment.
In this embodiment, the preset model is trained as described above for the SVM, and the details are not repeated here.
When selecting the mouth expression corresponding to the feature, the feature is input into the trained SVM, which outputs the probability that the mouth state corresponding to the feature is open, denoted p; the probability that it is closed is then 1 - p. If p exceeds a preset threshold, the mouth state is judged to be open; otherwise it is judged to be closed. The initial value of the threshold is 0.5, and the threshold is dynamically adjusted according to the mouth state of the expression corresponding to the previous voice fragment.
For example, when the mouth state of the expression corresponding to the previous voice fragment is open, the threshold is adjusted to 0.3, so that the mouth state corresponding to the feature is judged to be open as soon as p exceeds 0.3.
If the SVM judges the state corresponding to the feature to be open, an expression whose mouth is open is selected from the preset expression library as the expression corresponding to the feature; if it judges the state to be closed, an expression whose mouth is closed is selected instead.
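The dynamic threshold amounts to a simple hysteresis rule, sketched below; the 0.5 and 0.3 values come from the example above, while the function name is illustrative.

```python
def decide_open(p_open, prev_open, base_thr=0.5, biased_thr=0.3):
    # hysteresis: if the previous fragment's mouth was open, a lower
    # probability suffices to keep it open, which smooths open/closed flicker
    thr = biased_thr if prev_open else base_thr
    return p_open > thr
```

Applied over successive fragments, this keeps the animated mouth from flickering between open and closed when p hovers near the base threshold.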
Step 204: Generate, for each voice fragment, the corresponding expression frame from the corresponding mouth expression and the image.

Specifically, the process may be as follows.

Identify the mouth region of the face in the image.
For example, the feature points of the mouth region can be obtained from the image using an Active Appearance Model, an Active Shape Model, or other methods.
It should be noted that there may be several preset images, and the user may switch between them in real time while the expression frames are being generated.
As shown in Figs. 5 to 7, there are several images, and switching between them is possible during frame generation: while a certain image is selected, the expression frames are generated from that image; after switching to another image, subsequent expression frames are generated from the image after the switch. For instance, the image shown in Fig. 5 is selected initially; after a while the image shown in Fig. 6 is switched to in real time, and after another while the image shown in Fig. 7.

Drive the mouth region of the face in the image according to the mouth expression.
For example, compute the position deviation between the mouth feature points in the mouth expression and the corresponding feature points of the mouth region in the image; from the deviation, generate a movement parameter for each feature point of the mouth region in the image; and drive the mouth region in the image according to the movement parameters.

Generate the corresponding expression frame from the image and the driven mouth region.
For example, replace the mouth region in the image before driving with the driven mouth region, generate a new image, and generate the corresponding expression frame from the new image.
In the identification of the mouth region described above, if no face is detected, a default face mouth region is used as the basis for generating the expression frames.

Drive the default mouth region according to the mouth expression.
For example, compute the position deviation between the mouth feature points in the mouth expression and the corresponding feature points of the default mouth region; from the deviation, generate a movement parameter for each feature point of the default mouth region; and drive the default mouth region according to the movement parameters.

Generate the corresponding expression frame from the default mouth region and the driven default mouth region.
For example, replace the default mouth region before driving with the driven default mouth region, generate a new image, and generate the corresponding expression frame from the new image.
Step 205: Generate the speech animation from the expression frames corresponding to the multiple speech segments and the voice information.
Specifically, this process may include the following steps:
arranging the expression frames corresponding to the multiple speech segments in chronological order to generate an expression animation; and
packaging the expression animation and the voice information into the speech animation.
Any suitable audio/video container format may be used for packaging, such as MP4, TS, MKV, or FLV.
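The arranging-and-packaging step can be sketched as building a timestamped frame timeline and bundling it with the voice track. This is only a structural model, under the assumption of a fixed animation frame rate; real packaging would hand the frames and audio to an MP4/TS/MKV/FLV muxer (e.g. via FFmpeg), and the function name is illustrative:

```python
def package_animation(frames, voice, fps=25):
    """Arrange expression frames in order with timestamps derived from
    the animation frame rate, and bundle them with the voice track into
    a container-like structure."""
    timeline = [(i / fps, frame) for i, frame in enumerate(frames)]
    return {"video": timeline, "audio": voice, "fps": fps}

anim = package_animation(["f0", "f1", "f2"], voice=b"pcm-bytes", fps=25)
```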
The embodiment of the present invention can generate a corresponding speech animation from voice in real time without acquiring facial information, and therefore has the advantages of high efficiency, high speed, few restrictions, and low resource consumption. The open/closed state of the mouth can be judged quickly by an SVM, effectively improving recognition speed. The shape of the mouth can likewise be recognized quickly by a random forest, an SVM, a GMM model, or a neural network, each of which effectively improves recognition speed. In addition, judging the mouth state of the current frame according to the mouth state of the previous frame effectively improves recognition accuracy.
Fig. 3 shows a block diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 3, the electronic device includes: an image acquisition module 301 for obtaining an image; a voice acquisition module 302 for obtaining voice information; a speech animation generation module 303 for animating the image according to the voice information to generate a speech animation; and a sending module 304 for sending the speech animation to another electronic device.
Specifically, the image acquisition module 301 is configured to obtain a preset image or an image input by the user.
Specifically, the voice acquisition module 302 is configured to obtain the voice information input by the user.
Specifically, the speech animation generation module 303 includes:
a voice splitting submodule 3031 for dividing the voice information into multiple speech segments;
a feature acquisition submodule 3032 for obtaining the feature of each speech segment among the multiple speech segments;
a mouth expression selection submodule 3033 for selecting the mouth expression corresponding to the feature of each speech segment;
an expression frame generation submodule 3034 for generating the corresponding expression frame from the mouth expression corresponding to each speech segment and the image; and
a speech animation generation submodule 3035 for generating the speech animation from the expression frames corresponding to the multiple speech segments and the voice information.
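The feature acquisition step above can be sketched with short-time energy as the per-segment feature. This is only one illustrative choice — the patent does not fix a particular feature, and spectral features such as MFCCs would be common alternatives; the function name is an assumption:

```python
def segment_feature(samples):
    """Compute a simple per-segment feature: short-time energy,
    i.e. the mean of the squared samples."""
    return sum(s * s for s in samples) / len(samples)

feat = segment_feature([0.0, 0.5, -0.5, 0.0])
```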
Specifically, the voice splitting submodule 3031 is configured to divide the voice information into multiple speech segments according to a preset animation frame rate, where each speech segment corresponds to one frame of the animation.
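Splitting at a preset animation frame rate means each segment holds `sample_rate / fps` audio samples, so that segments and animation frames line up one-to-one. A minimal sketch, with the sample rate and frame rate chosen as illustrative assumptions:

```python
def split_voice(samples, sample_rate=16000, fps=25):
    """Divide the voice samples into segments so that each segment
    corresponds to exactly one animation frame."""
    seg_len = sample_rate // fps  # samples per animation frame
    return [samples[i:i + seg_len]
            for i in range(0, len(samples), seg_len)]

# One second of hypothetical 16 kHz audio at 25 fps -> 25 segments of 640 samples.
segments = split_voice(list(range(16000)), sample_rate=16000, fps=25)
```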
Optionally, the mouth expression selection submodule 3033 is configured to select, from a preset expression library, the mouth expression corresponding to the feature of each speech segment according to that feature and a preset model.
Optionally, the mouth expression selection submodule 3033 is configured to select, from the preset expression library, the mouth expression corresponding to the feature of each speech segment according to that feature, the preset model, and the mouth expression corresponding to the previous speech segment.
Specifically, the expression frame generation submodule 3034 is configured to combine the mouth expression corresponding to each speech segment with the image to generate the corresponding expression frame.
Specifically, the speech animation generation submodule 3035 is configured to arrange the expression frames corresponding to the multiple speech segments in chronological order to generate an expression animation, and to package the expression animation and the voice information into the speech animation.
Optionally, the electronic device further includes a start-speech-animation instruction acquisition module for obtaining an instruction to start speech animation. For example, the user may trigger the instruction by clicking a preset button in an instant messaging application interface; the embodiment of the present invention does not limit the manner in which the instruction is triggered.
Optionally, the electronic device further includes a speech animation display module for displaying the speech animation, so that the user can intuitively experience the recorded speech animation.
The embodiments of the invention provide an electronic device for sending speech animation. By animating an image according to voice information to generate a speech animation, and sending the speech animation to a recipient, the recipient is enabled to watch the speech animation. Compared with a plain voice message, a speech animation is more entertaining and more visually appealing, thereby improving the user experience.
Fig. 4 shows an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device includes a memory 401, an audio acquisition module 402, a network interface module 403, and a processor 404 connected to the memory 401, the audio acquisition module 402, and the network interface module 403, where the memory 401 stores a set of program code, and the processor 404 calls the program code stored in the memory 401 to perform the following operations:
Obtain an image;
Obtain voice information;
Animate the image according to the voice information to generate a speech animation; and
Send the speech animation to another electronic device.
Specifically, the processor 404 calls the program code stored in the memory 401 to perform the following operation:
Obtain a preset image or an image input by the user.
Specifically, the processor 404 calls the program code stored in the memory 401 to perform the following operation:
Obtain the voice information input by the user.
Specifically, the processor 404 calls the program code stored in the memory 401 to perform the following operations:
Divide the voice information into multiple speech segments;
Obtain the feature of each speech segment among the multiple speech segments;
Select the mouth expression corresponding to the feature of each speech segment;
Generate the corresponding expression frame from the mouth expression corresponding to each speech segment and the image; and
Generate the speech animation from the expression frames corresponding to the multiple speech segments and the voice information.
Specifically, the processor 404 calls the program code stored in the memory 401 to perform the following operation:
Divide the voice information into multiple speech segments according to a preset animation frame rate, where each speech segment corresponds to one frame of the animation.
Specifically, the processor 404 calls the program code stored in the memory 401 to perform the following operation:
Select, from a preset expression library, the mouth expression corresponding to the feature of each speech segment according to that feature and a preset model.
Specifically, the processor 404 calls the program code stored in the memory 401 to perform the following operation:
Select, from the preset expression library, the mouth expression corresponding to the feature of each speech segment according to that feature, the preset model, and the mouth expression corresponding to the previous speech segment.
Specifically, the processor 404 calls the program code stored in the memory 401 to perform the following operation:
Combine the mouth expression corresponding to each speech segment with the image to generate the corresponding expression frame.
Specifically, the processor 404 calls the program code stored in the memory 401 to perform the following operations:
Arrange the expression frames corresponding to the multiple speech segments in chronological order to generate an expression animation; and
Package the expression animation and the voice information into the speech animation.
The embodiments of the invention provide an electronic device for sending speech animation. By animating an image according to voice information to generate a speech animation, and sending the speech animation to a recipient, the recipient is enabled to watch the speech animation. Compared with a plain voice message, a speech animation is more entertaining and more visually appealing, thereby improving the user experience.
All of the optional technical solutions above may be combined in any manner to form alternative embodiments of the present invention, which are not described one by one here.
It should be noted that when the electronic device provided by the above embodiment performs the method for sending speech animation, the division into the functional modules above is only an example; in practical applications, the functions may be assigned to different functional modules as needed, i.e., the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the electronic device provided by the above embodiment and the method embodiment for sending speech animation belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
One of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is only preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.
Claims (10)
- 1. A method for sending speech animation, characterized in that the method includes: obtaining an image; obtaining voice information; animating the image according to the voice information to generate a speech animation; and sending the speech animation to a recipient.
- 2. The method according to claim 1, characterized in that obtaining an image includes: obtaining a preset image or an image input by the user.
- 3. The method according to claim 1, characterized in that obtaining voice information includes: obtaining the voice information input by the user.
- 4. The method according to claim 1, characterized in that animating the image according to the voice information to generate a speech animation includes: dividing the voice information into multiple speech segments; obtaining the feature of each speech segment among the multiple speech segments; selecting the mouth expression corresponding to the feature of each speech segment; generating the corresponding expression frame from the mouth expression corresponding to each speech segment and the image; and generating the speech animation from the expression frames corresponding to the multiple speech segments and the voice information.
- 5. The method according to claim 4, characterized in that dividing the voice information into multiple speech segments includes: dividing the voice information into multiple speech segments according to a preset animation frame rate, where each speech segment corresponds to one frame of the animation.
- 6. An electronic device, characterized in that the device includes: an image acquisition module for obtaining an image; a voice acquisition module for obtaining voice information; a speech animation generation module for animating the image according to the voice information to generate a speech animation; and a sending module for sending the speech animation to another electronic device.
- 7. The electronic device according to claim 6, characterized in that the image acquisition module is specifically configured to obtain a preset image or an image input by the user.
- 8. The electronic device according to claim 6, characterized in that the voice acquisition module is specifically configured to obtain the voice information input by the user.
- 9. The electronic device according to claim 6, characterized in that the speech animation generation module includes: a voice splitting submodule for dividing the voice information into multiple speech segments; a feature acquisition submodule for obtaining the feature of each speech segment among the multiple speech segments; a mouth expression selection submodule for selecting the mouth expression corresponding to the feature of each speech segment; an expression frame generation submodule for generating the corresponding expression frame from the mouth expression corresponding to each speech segment and the image; and a speech animation generation submodule for generating the speech animation from the expression frames corresponding to the multiple speech segments and the voice information.
- 10. The electronic device according to claim 9, characterized in that the voice splitting submodule is specifically configured to divide the voice information into multiple speech segments according to a preset animation frame rate, where each speech segment corresponds to one frame of the animation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610560255.2A CN107623622A (en) | 2016-07-15 | 2016-07-15 | A kind of method and electronic equipment for sending speech animation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610560255.2A CN107623622A (en) | 2016-07-15 | 2016-07-15 | A kind of method and electronic equipment for sending speech animation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107623622A true CN107623622A (en) | 2018-01-23 |
Family
ID=61087410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610560255.2A Pending CN107623622A (en) | 2016-07-15 | 2016-07-15 | A kind of method and electronic equipment for sending speech animation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107623622A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110288680A (en) * | 2019-05-30 | 2019-09-27 | 盎锐(上海)信息科技有限公司 | Image generating method and mobile terminal |
CN110971502A (en) * | 2018-09-30 | 2020-04-07 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for displaying sound message in application program |
CN111200555A (en) * | 2019-12-30 | 2020-05-26 | 咪咕视讯科技有限公司 | Chat message display method, electronic device and storage medium |
CN111383642A (en) * | 2018-12-27 | 2020-07-07 | Tcl集团股份有限公司 | Voice response method based on neural network, storage medium and terminal equipment |
CN112188304A (en) * | 2020-09-28 | 2021-01-05 | 广州酷狗计算机科技有限公司 | Video generation method, device, terminal and storage medium |
CN116312612A (en) * | 2023-02-02 | 2023-06-23 | 北京甲板智慧科技有限公司 | Audio processing method and device based on deep learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1731833A (en) * | 2005-08-23 | 2006-02-08 | 孙丹 | Method for composing audio/video file by voice driving head image |
US7168953B1 (en) * | 2003-01-27 | 2007-01-30 | Massachusetts Institute Of Technology | Trainable videorealistic speech animation |
CN103218842A (en) * | 2013-03-12 | 2013-07-24 | 西南交通大学 | Voice synchronous-drive three-dimensional face mouth shape and face posture animation method |
CN105551071A (en) * | 2015-12-02 | 2016-05-04 | 中国科学院计算技术研究所 | Method and system of face animation generation driven by text voice |
- 2016-07-15 CN CN201610560255.2A patent/CN107623622A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7168953B1 (en) * | 2003-01-27 | 2007-01-30 | Massachusetts Institute Of Technology | Trainable videorealistic speech animation |
CN1731833A (en) * | 2005-08-23 | 2006-02-08 | 孙丹 | Method for composing audio/video file by voice driving head image |
CN103218842A (en) * | 2013-03-12 | 2013-07-24 | 西南交通大学 | Voice synchronous-drive three-dimensional face mouth shape and face posture animation method |
CN105551071A (en) * | 2015-12-02 | 2016-05-04 | 中国科学院计算技术研究所 | Method and system of face animation generation driven by text voice |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110971502A (en) * | 2018-09-30 | 2020-04-07 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for displaying sound message in application program |
CN110971502B (en) * | 2018-09-30 | 2021-09-28 | 腾讯科技(深圳)有限公司 | Method, device, equipment and storage medium for displaying sound message in application program |
US11895273B2 (en) | 2018-09-30 | 2024-02-06 | Tencent Technology (Shenzhen) Company Limited | Voice message display method and apparatus in application, computer device, and computer-readable storage medium |
CN111383642A (en) * | 2018-12-27 | 2020-07-07 | Tcl集团股份有限公司 | Voice response method based on neural network, storage medium and terminal equipment |
CN111383642B (en) * | 2018-12-27 | 2024-01-02 | Tcl科技集团股份有限公司 | Voice response method based on neural network, storage medium and terminal equipment |
CN110288680A (en) * | 2019-05-30 | 2019-09-27 | 盎锐(上海)信息科技有限公司 | Image generating method and mobile terminal |
CN111200555A (en) * | 2019-12-30 | 2020-05-26 | 咪咕视讯科技有限公司 | Chat message display method, electronic device and storage medium |
CN112188304A (en) * | 2020-09-28 | 2021-01-05 | 广州酷狗计算机科技有限公司 | Video generation method, device, terminal and storage medium |
CN116312612A (en) * | 2023-02-02 | 2023-06-23 | 北京甲板智慧科技有限公司 | Audio processing method and device based on deep learning |
CN116312612B (en) * | 2023-02-02 | 2024-04-16 | 北京甲板智慧科技有限公司 | Audio processing method and device based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107623622A (en) | A kind of method and electronic equipment for sending speech animation | |
CN107203953B (en) | Teaching system based on internet, expression recognition and voice recognition and implementation method thereof | |
TWI778477B (en) | Interaction methods, apparatuses thereof, electronic devices and computer readable storage media | |
CN109637518B (en) | Virtual anchor implementation method and device | |
CN111833418B (en) | Animation interaction method, device, equipment and storage medium | |
CN106339680B (en) | Face key independent positioning method and device | |
CN104780093B (en) | Expression information processing method and processing device during instant messaging | |
US11670015B2 (en) | Method and apparatus for generating video | |
CN109951654A (en) | A kind of method of Video Composition, the method for model training and relevant apparatus | |
CN109618184A (en) | Method for processing video frequency and device, electronic equipment and storage medium | |
CN108874114B (en) | Method and device for realizing emotion expression of virtual object, computer equipment and storage medium | |
CN108197185A (en) | A kind of music recommends method, terminal and computer readable storage medium | |
WO2020253128A1 (en) | Voice recognition-based communication service method, apparatus, computer device, and storage medium | |
CN110503942A (en) | A kind of voice driven animation method and device based on artificial intelligence | |
CN107623830B (en) | A kind of video call method and electronic equipment | |
CN107153496A (en) | Method and apparatus for inputting emotion icons | |
CN111459454B (en) | Interactive object driving method, device, equipment and storage medium | |
CN106296690A (en) | The method for evaluating quality of picture material and device | |
CN105989165A (en) | Method, apparatus and system for playing facial expression information in instant chat tool | |
CN109032470A (en) | Screenshot method, device, terminal and computer readable storage medium | |
CN113971828B (en) | Virtual object lip driving method, model training method, related device and electronic equipment | |
CN107291704A (en) | Treating method and apparatus, the device for processing | |
US20230368461A1 (en) | Method and apparatus for processing action of virtual object, and storage medium | |
CN107480766A (en) | The method and system of the content generation of multi-modal virtual robot | |
CN112911192A (en) | Video processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180123
WD01 | Invention patent application deemed withdrawn after publication |