CN116597858A - Voice mouth shape matching method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN116597858A
Authority
CN
China
Prior art keywords
voice
mouth shape
matched
characters
text
Prior art date
Legal status
Pending
Application number
CN202310363302.4A
Other languages
Chinese (zh)
Inventor
夏明�
郝冬宁
Current Assignee
Hubei Xingji Meizu Technology Co ltd
Original Assignee
Hubei Xingji Meizu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hubei Xingji Meizu Technology Co ltd
Priority to CN202310363302.4A
Publication of CN116597858A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice mouth shape matching method and device, a storage medium, and an electronic device, and relates to the field of computer technology. The method includes the following steps: acquiring the text corresponding to the voice to be matched and the pronunciation time corresponding to the text; generating a mouth shape graph corresponding to the text based on the mouth shape keys corresponding to the text; and displaying the mouth shape graph corresponding to the text within the pronunciation time corresponding to the text. The method and device provided by the application display the corresponding mouth shape graph within the pronunciation time of each character, so that the avatar synchronously makes mouth shape movements matching the voice, which improves the accuracy of matching the avatar's voice and mouth shape.

Description

Voice mouth shape matching method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for voice mouth shape matching, a storage medium, and an electronic device.
Background
With the rapid development of artificial intelligence and metaverse technology, avatars are widely used. When an avatar speaks, natural and smooth mouth shape movements that are synchronized with its voice improve the user's experience.
Disclosure of Invention
In a first aspect, the present application provides a method for matching a voice mouth shape, including:
acquiring characters corresponding to the voice to be matched and pronunciation time corresponding to the characters;
generating a mouth shape graph corresponding to the characters based on the mouth shape form keys corresponding to the characters;
and displaying the mouth shape graph corresponding to the text in the pronunciation time corresponding to the text.
In some embodiments, obtaining text corresponding to the voice to be matched includes:
acquiring voice to be matched;
and carrying out voice recognition on the voice to be matched, and determining characters corresponding to the voice to be matched and pronunciation time corresponding to the characters.
In some embodiments, before the generating the mouth shape graph corresponding to the text based on the mouth shape morphological key corresponding to the text, the method includes:
determining language information of the voice to be matched;
determining a mouth shape form key library corresponding to the voice to be matched based on the language information;
matching the characters with each candidate character in the mouth shape form key library, and determining a mouth shape form key corresponding to the characters;
the mouth shape form key library comprises a plurality of candidate characters and mouth shape form keys corresponding to the candidate characters.
In some embodiments, the mouth shape morphology key library is determined based on the steps of:
acquiring mouth shape graphs of each candidate character under the current language information;
generating mouth shape form keys corresponding to the candidate characters based on the mouth shape graphs of the candidate characters;
and constructing a mouth shape form key library corresponding to the current language information based on the mouth shape form keys corresponding to each candidate character.
In some embodiments, the matching the text with each candidate text in the mouth shape form key library, and determining the mouth shape form key corresponding to the text includes:
determining a mouth shape key of the character under each pronunciation under the condition that the character is a polyphone;
acquiring a voice transcription text of the voice to be matched;
inputting the voice transcription text into a multi-sound word disambiguation model to obtain pronunciation probabilities of the words output by the multi-sound word disambiguation model under various pronunciations;
and taking the mouth shape form key corresponding to the pronunciation with the largest pronunciation probability as the mouth shape form key corresponding to the characters.
In some embodiments, the generating a mouth shape graph corresponding to the text based on the mouth shape morphological key corresponding to the text includes:
Acquiring an initial mouth shape key corresponding to the text and an audio feature and/or a face image corresponding to the voice to be matched;
matching the audio features with the audio features corresponding to the emotion types, and determining a first emotion type corresponding to the voice to be matched;
matching the expression features of the face image with the expression features corresponding to the emotion types, and determining a second emotion type corresponding to the voice to be matched;
and adjusting the initial mouth shape form key based on the first emotion type and/or the second emotion type corresponding to the voice to be matched, and generating a mouth shape diagram corresponding to the text based on the mouth shape form key corresponding to the text after adjustment.
In some embodiments, the displaying the mouth shape graph corresponding to the text in the pronunciation time corresponding to the text includes:
performing smooth interpolation on a mouth shape form key corresponding to a current character and a mouth shape form key corresponding to a next character to generate a mouth shape switching animation corresponding to the current character and the next character;
determining the starting time of the mouth shape switching animation in the pronunciation time corresponding to the current character, and determining the ending time of the mouth shape switching animation in the pronunciation time corresponding to the next character;
And displaying the mouth shape switching animation in the pronunciation time determined by the starting time and the ending time.
In some embodiments, after the displaying the mouth shape graph corresponding to the text within the pronunciation time corresponding to the text, the method includes:
determining voiceprint characteristics of the voice to be matched;
matching the voiceprint characteristics with preset speaker voiceprint characteristics, and determining speaker identity information corresponding to the voiceprint characteristics;
determining an avatar corresponding to the voice to be matched based on the identity information of the speaker;
and loading the mouth shape graph corresponding to the characters to the corresponding positions in the virtual image in the pronunciation time corresponding to the characters.
In some embodiments, the performing the voice recognition on the voice to be matched, determining a text corresponding to the voice to be matched, and a pronunciation time corresponding to the text, includes:
inputting the voice to be matched into a voice recognition model to obtain characters corresponding to the voice to be matched and pronunciation time corresponding to the characters, which are output by the voice recognition model;
the voice recognition model comprises a feature extraction layer, a silence detection layer and a voice recognition layer; the silence detection layer and the voice recognition layer are respectively connected with the feature extraction layer;
The feature extraction layer is used for dividing the voice to be matched into a plurality of voice frames and extracting acoustic recognition features of each voice frame; the silence detection layer is used for determining a voice frame to be recognized in the voice to be matched and pronunciation time corresponding to the voice frame to be recognized based on acoustic recognition characteristics of each voice frame; the voice recognition layer is used for determining characters corresponding to the voice to be matched based on the acoustic recognition characteristics of the voice frame to be recognized.
In a second aspect, the present application provides a voice mouth shape matching device, comprising:
the device comprises an acquisition unit, a comparison unit and a comparison unit, wherein the acquisition unit is used for acquiring characters corresponding to the voice to be matched and pronunciation time corresponding to the characters;
the generating unit is used for generating a mouth shape graph corresponding to the characters based on the mouth shape form keys corresponding to the characters;
and the matching unit is used for displaying the mouth shape graph corresponding to the text in the pronunciation time corresponding to the text.
In a third aspect, the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method as described in any one of the preceding claims when executing the program.
In a fourth aspect, the application provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the above.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the application or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a voice mouth shape matching method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for constructing a mouth shape key according to one embodiment of the present application;
FIG. 3 is a flowchart illustrating a voice mouth shape matching method according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a voice mouth shape matching device according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like herein are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The voice mouth shape matching method provided by the embodiment of the application is suitable for terminal devices. Terminal devices include various handheld devices, vehicle-mounted devices, wearable devices, computing devices, or other processing devices connected to a wireless modem, such as cell phones, tablets, desktop and notebook computers, and smart devices that can run applications, including the central console of a smart car, etc. In particular, a terminal device may refer to user equipment (UE), an access terminal, a subscriber unit, a subscriber station, a mobile station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent, or a user device.
The terminal device may also be a satellite phone, a cellular phone, a smart phone, a wireless data card, a wireless modem, a machine type communication device, a cordless phone, a session initiation protocol (Session Initiation Protocol, SIP) phone, a wireless local loop (Wireless Local Loop, WLL) station, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device with wireless communication capabilities, a computing device or other processing device connected to a wireless modem, a vehicle-mounted device or a wearable device, a virtual reality (Virtual Reality, VR) terminal device, an augmented reality (Augmented Reality, AR) terminal device, a wireless terminal in industrial control (Industrial Control), a wireless terminal in self-driving (Self-driving), a wireless terminal in telemedicine (Remote Medical), a wireless terminal in a smart grid (Smart Grid), a wireless terminal in transportation safety (Transportation Safety), a wireless terminal in a smart city (Smart City), a wireless terminal in a smart home (Smart Home), a terminal in a 5G network or a future communication network, etc.
The terminal device may be powered by a battery, may also be attached to and powered by a power supply system of the vehicle or vessel. The power supply system of the vehicle or the ship can also charge the battery of the terminal equipment so as to prolong the communication time of the terminal equipment.
The terminal device may interact with the user through voice and display a mouth shape of the avatar. The avatar may include a character, an animal, a cartoon, etc.
Fig. 1 is a flow chart of a voice mouth shape matching method according to an embodiment of the present application, and as shown in fig. 1, the method includes steps 110, 120 and 130. The method flow steps are only one possible implementation of the application.
Step 110, obtaining characters corresponding to the voice to be matched and pronunciation time corresponding to the characters.
Specifically, the execution subject of the voice mouth shape matching method provided by the embodiment of the application is a voice mouth shape matching device, which may be a separately provided hardware device within the terminal device or a software program running on the terminal device. For example, when the terminal device is a mobile phone, the voice mouth shape matching device may be embodied as an application program on the mobile phone.
The voice to be matched is the voice for which mouth shape matching is performed for the avatar. For example, in an avatar live-streaming scene, the voice to be matched may be real-time speech uttered by a user who is presented to viewers in the form of an avatar.
The voice to be matched can be converted into characters, and the pronunciation time of the characters can be obtained.
The pronunciation time may include a voice start time, a voice end time, and a voice pronunciation duration, corresponding to the text. By determining the pronunciation time of each character, the mouth shape action of the virtual image can be consistent with the speech speed of the speech to be matched.
For example, the voice to be matched is "hello". The voice mouth shape matching device performs voice recognition on the voice to be matched, determines that the text corresponding to the voice to be matched is "hello", and obtains the voice start time and voice end time corresponding to the character "you" and the voice start time and voice end time corresponding to the character "good". From the pronunciation time corresponding to each character, the duration of the avatar's mouth shape movement for each character can be determined.
The text and its pronunciation time may also be provided directly, for example by configuring the text to be mouth-shape matched and its corresponding time information as plain text, a script, JSON, or any other textualized instruction.
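By way of a non-limiting illustration, such a textualized instruction could take the form of a small JSON document in which each character carries its start time, end time, and duration; the field names below are invented for this sketch and are not defined by the application:

```python
import json

# Hypothetical timing configuration for the utterance "hello" ("you" + "good").
timing_config = json.loads("""
[
  {"char": "you",  "start_ms": 0,   "end_ms": 220, "duration_ms": 220},
  {"char": "good", "start_ms": 220, "end_ms": 480, "duration_ms": 260}
]
""")

for entry in timing_config:
    # Each entry tells the renderer when the mouth shape for this character is shown.
    print(entry["char"], entry["start_ms"], entry["end_ms"], entry["duration_ms"])
```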
Step 120, generating a mouth shape graph corresponding to the text based on the mouth shape form key corresponding to the text.
Specifically, a mouth shape key (Shape Key) records the positional parameters of each feature point of the lips. A mouth shape graph corresponding to the text is constructed from the mouth shape keys corresponding to the text.
The mouth shape graph is a picture that shows the shape of the avatar's lips. It comprises a plurality of feature points; connecting the feature points yields a three-dimensional mesh, and rendering this mesh with a three-dimensional engine produces the mouth shape graph of the avatar.
For example, the lips of the avatar are divided into the upper lip, the lower lip, the mouth corners, the philtrum, and so on, and a plurality of feature points are set in each part. Parts with a larger range of motion during mouth shape movements can be given denser feature points.
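A minimal sketch of one possible in-memory representation of such a mouth shape key, assuming each feature point is stored as an (x, y, z) offset grouped by lip region; the class and field names are illustrative only and are not part of the application:

```python
from dataclasses import dataclass, field

@dataclass
class MouthShapeKey:
    """Positional parameters of lip feature points for one mouth shape.

    Each region maps to a list of (x, y, z) offsets; regions with a larger
    range of motion (e.g. the lower lip) carry more points.
    """
    upper_lip: list = field(default_factory=list)
    lower_lip: list = field(default_factory=list)
    mouth_corners: list = field(default_factory=list)
    philtrum: list = field(default_factory=list)

# Hypothetical shape key for an open-mouth vowel.
open_vowel = MouthShapeKey(
    upper_lip=[(0.0, 0.02, 0.0), (0.01, 0.02, 0.0)],
    lower_lip=[(0.0, -0.05, 0.0), (0.01, -0.05, 0.0), (0.02, -0.04, 0.0)],
    mouth_corners=[(-0.03, 0.0, 0.0), (0.03, 0.0, 0.0)],
    philtrum=[(0.0, 0.01, 0.0)],
)
```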
And 130, displaying the mouth shape graph corresponding to the text in the pronunciation time corresponding to the text.
Specifically, after determining the pronunciation time of each character and the mouth shape graph corresponding to each character, sequentially displaying the mouth shape graph corresponding to each character in the pronunciation time corresponding to each character according to the pronunciation sequence of each character, so as to realize the synchronization of the voice to be matched and the mouth shape action of the virtual image.
According to the voice mouth shape matching method provided by the embodiment of the application, characters corresponding to voices to be matched and pronunciation time corresponding to the characters are obtained, and a mouth shape graph corresponding to the characters is generated according to mouth shape keys corresponding to the characters; displaying a mouth shape diagram corresponding to the characters in pronunciation time corresponding to the characters; the method can enable the avatar to synchronously make the mouth shape action matched with the real-time voice sent by the user, improves the accuracy of the voice and the mouth shape matching of the avatar, improves the reality of the avatar, and improves the experience of the user in using the avatar.
It should be noted that each embodiment of the present application may be freely combined, exchanged in order, or separately executed, and does not need to rely on or rely on a fixed execution sequence.
In some embodiments, step 110 comprises:
acquiring voice to be matched;
and carrying out voice recognition on the voice to be matched, and determining characters corresponding to the voice to be matched and pronunciation time corresponding to the characters.
Specifically, there are various ways of acquiring the voice to be matched. For example, the voice mouth shape matching device monitors the voice of the current environment, and when the voice signal is monitored, the voice signal is subjected to denoising processing to obtain the voice to be matched.
The voice to be matched can be converted into corresponding characters by voice recognition through the voice recognition model.
The voice recognition model can be constructed by adopting a convolutional neural network model, a fully-connected neural network model, a cyclic neural network model, a long-term and short-term memory neural network model and the like.
The pronunciation time corresponding to the text can be obtained by means of voice activity detection (Voice Activity Detection, VAD).
The voice mouth shape matching device can also analyze the acquisition parameters of the voice to be matched, and determine the time stamp of each word corresponding to the voice to be matched so as to obtain the pronunciation time corresponding to each word.
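The voice activity detection mentioned above can be illustrated, in a deliberately simplified form, by an energy threshold applied to fixed-length frames. Real VAD modules are far more robust; the threshold and frame length below are arbitrary assumptions:

```python
def simple_vad(samples, sample_rate=16000, frame_ms=10, energy_threshold=1e-3):
    """Return (start_s, end_s) spans of frames whose mean energy exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    spans, active_start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        t = i / sample_rate
        if energy >= energy_threshold and active_start is None:
            active_start = t                     # speech begins
        elif energy < energy_threshold and active_start is not None:
            spans.append((active_start, t))      # speech ends
            active_start = None
    if active_start is not None:
        spans.append((active_start, len(samples) / sample_rate))
    return spans
```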
According to the voice mouth shape matching method provided by the embodiment of the application, the acquired voice to be matched is subjected to voice recognition, the characters corresponding to the voice to be matched and the pronunciation time corresponding to the characters are determined, the mouth shape graph corresponding to the characters can be displayed within the pronunciation time corresponding to the characters, the accuracy of matching the voice and the mouth shape of the virtual image is improved, and the user experience of using the virtual image is improved.
In some embodiments, step 120 is preceded by:
determining language information of the voice to be matched;
Determining a mouth shape key library corresponding to the voice to be matched based on language information;
matching the characters with each candidate character in the mouth shape form key library, and determining the mouth shape form key corresponding to the characters;
the mouth shape key library comprises a plurality of candidate characters and mouth shape keys corresponding to the candidate characters.
Specifically, the voice to be matched may be in various languages or dialects, such as Mandarin, Sichuan dialect, Cantonese, Henan dialect, and the like. The language information includes the phonetic characteristics, language category, and other information of the voice to be matched. The same text may be pronounced differently in different languages, that is, the mouth shape key of the same text may differ across languages.
The mouth shape form key library can be created for each language in advance, and the mouth shape form key library comprises each candidate character included in the corresponding language and mouth shape form keys of each candidate character under the pronunciation of the corresponding language.
After the voice to be matched is obtained, language analysis can be performed on the voice to be matched to obtain language information of the voice to be matched, the language information of the voice to be matched is matched with each language, the language corresponding to the voice to be matched can be obtained, and the mouth shape form key library of the language is queried, so that each candidate word in the language and the mouth shape form key corresponding to each candidate word are obtained. And matching the characters corresponding to the voice to be matched with each candidate character in the mouth shape form key library to obtain the mouth shape form key corresponding to the successfully matched candidate characters, and obtaining the mouth shape form key corresponding to the characters.
For example, the voice to be matched is Mandarin, and the text corresponding to it is "hello". After the voice to be matched is obtained, language analysis is performed on it to obtain its language information, which is matched against each language; the language of the voice to be matched is thereby determined to be Mandarin, and the mouth shape form key library corresponding to the voice to be matched is determined to be the Mandarin library. The voice to be matched is converted into the text "hello", whose characters are matched against the candidate characters in that library, yielding the mouth shape form keys corresponding to "you" and "good" in Mandarin.
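A dictionary-based sketch of this lookup; the library contents, language tags, and shape key identifiers are made up for illustration:

```python
# Hypothetical per-language libraries: language -> candidate character -> mouth shape key id.
shape_key_libraries = {
    "mandarin":  {"you": "shape_ni",  "good": "shape_hao"},
    "cantonese": {"you": "shape_nei", "good": "shape_hou"},
}

def lookup_shape_keys(language, characters):
    """Match each character against the candidate characters of the selected library."""
    library = shape_key_libraries[language]
    return [library[ch] for ch in characters if ch in library]

print(lookup_shape_keys("mandarin", ["you", "good"]))  # -> ['shape_ni', 'shape_hao']
```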
According to the voice mouth shape matching method provided by the embodiment of the application, mouth shape keys corresponding to the voices to be matched in each language can be obtained by setting the mouth shape key library in multiple languages, so that mouth shape graphs of the voices to be matched in each language are obtained, and the voice mouth shape matching method is suitable for more language environments.
In some embodiments, the mouth shape library is determined based on the following steps:
acquiring mouth shape graphs of each candidate character under the current language information;
Generating mouth shape form keys corresponding to the candidate characters based on the mouth shape graphs of the candidate characters;
and constructing a mouth shape form key library corresponding to the current language information based on the mouth shape form keys corresponding to the candidate characters.
Specifically, determining the language corresponding to the current voice to be matched according to the current language information, and obtaining each candidate character and the mouth shape graph of pronunciation of each candidate character under the language; feature point modeling is conducted on the mouth shape graph of each candidate character, mouth shape form keys corresponding to each candidate character are generated, each candidate character and each mouth shape form key are stored in a correlated mode, and a mouth shape form key library corresponding to current language information is obtained.
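A sketch of this construction step, assuming a hypothetical model_feature_points helper that stands in for feature-point modeling of a mouth shape image:

```python
def model_feature_points(mouth_image_path):
    """Hypothetical stand-in for feature-point modeling of one mouth shape image.

    A real implementation would detect lip landmarks in the image and convert
    them into positional parameters (a mouth shape key).
    """
    return {"source_image": mouth_image_path, "points": []}

def build_shape_key_library(candidate_mouth_images):
    """Associate each candidate character with the shape key derived from its mouth image."""
    return {char: model_feature_points(path) for char, path in candidate_mouth_images.items()}

# Illustrative input for one language: candidate character -> path of its mouth shape image.
library = build_shape_key_library({"you": "mouth_you.png", "good": "mouth_good.png"})
```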
According to the voice mouth shape matching method provided by the embodiment of the application, the mouth shape form keys corresponding to the candidate characters are generated according to the mouth shape graphs of the candidate characters, and finally, the mouth shape form key library corresponding to the current language information is constructed, so that the voice mouth shape matching method can be used for voice mouth shape matching, and the accuracy of voice mouth shape matching is improved.
In some embodiments, matching the text with each candidate text in the mouth shape form key library to determine a mouth shape form key corresponding to the text, including:
under the condition that the characters are polyphones, determining mouth shape keys of the characters under each pronunciation;
Acquiring a voice transcription text of a voice to be matched;
inputting the voice transfer text into a multi-phonetic word disambiguation model to obtain pronunciation probabilities of the words output by the multi-phonetic word disambiguation model under various pronunciations;
and taking the mouth shape form key corresponding to the pronunciation with the largest pronunciation probability as the mouth shape form key corresponding to the characters.
Specifically, polyphones are characters that have two or more pronunciations. Different pronunciations carry different meanings and usages, and often different parts of speech. For example, one polyphonic character can be read in the second tone, in which case it means "when" or "to be"; the same character can also be read in the fourth tone, in which case it means "to give" and the like.
The voice transcription text is a text formed by converting the voice to be matched into characters.
The polyphone disambiguation model is a statistical model for the polyphone disambiguation task, e.g., the polyphone disambiguation model may be constructed from a maximum entropy model, a conditional random field model, and the like. For another example, a polyphonic disambiguation model may be obtained by feature training through a neural network model.
The polyphone disambiguation model mainly determines the pronunciation of the polyphone based on extracted features, wherein the extracted features comprise the front and rear characters of the polyphone, the front and rear words, the word length of the front and rear words, the part of speech of the front and rear words, the relative positions of the front and rear keywords and the polyphone in sentences and the like.
The voice mouth shape matching device sends the voice transcription text to a multi-sound word disambiguation model, the multi-sound word disambiguation model decodes the received voice transcription text to obtain pronunciation probabilities of all pronunciations of the multi-sound words of the voice transcription text, and pronunciation of the multi-sound words in the context of the voice transcription text is predicted according to the pronunciation probabilities of all pronunciations of the multi-sound words. According to the predicted pronunciation probability of each pronunciation of the polyphone, the pronunciation with the largest pronunciation probability is selected to obtain the pronunciation of the polyphone; or, further judgment can be made according to the pronunciation probability of each pronunciation of the predicted polyphones and by combining linguistic pronunciation rules, so that the pronunciation of the polyphones can be determined.
For example, the voice transcription text of the voice to be matched is "placed upwards", in which the character "towards" has two pronunciations: "zhao", which can mean morning or day, and "chao", which can mean facing or toward. After the voice mouth shape matching device obtains the voice transcription text "placed upwards", the text can be decoded by the polyphone disambiguation model, which predicts the pronunciation of the polyphone "towards": the probability of the pronunciation "zhao" is 0.1 and the probability of the pronunciation "chao" is 0.9. Based on these two probabilities, the pronunciation with the higher probability can be selected, i.e. the pronunciation with probability 0.9 is selected as the pronunciation of the polyphone "towards" in "placed upwards".
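Selecting the mouth shape key by maximum pronunciation probability reduces to a one-line argmax; the probabilities and shape key identifiers below follow the illustrative example above, while the function interface is an assumption of this sketch:

```python
def select_shape_key(pronunciation_probs, shape_keys_by_pronunciation):
    """Pick the shape key of the pronunciation with the highest predicted probability."""
    best = max(pronunciation_probs, key=pronunciation_probs.get)
    return best, shape_keys_by_pronunciation[best]

# Illustrative disambiguation output for the polyphone "towards" in "placed upwards".
probs = {"zhao": 0.1, "chao": 0.9}
shape_keys = {"zhao": "shape_zhao", "chao": "shape_chao"}
print(select_shape_key(probs, shape_keys))  # -> ('chao', 'shape_chao')
```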
According to the voice mouth shape matching method provided by the embodiment of the application, the pronunciation of the polyphone under the current context can be determined through the polyphone disambiguation model, and then the mouth shape form key corresponding to the pronunciation is obtained. Even if the characters are polyphones, the virtual image can accurately make the mouth shape corresponding to the characters, and the accuracy of voice mouth shape matching is improved.
In some embodiments, step 120 comprises:
acquiring an initial mouth shape key corresponding to the characters and audio features and/or face images corresponding to the voices to be matched;
matching the audio features with the audio features corresponding to the emotion types, and determining a first emotion type corresponding to the voice to be matched;
matching the expression features of the face image with the expression features corresponding to the emotion types, and determining a second emotion type corresponding to the voice to be matched;
and adjusting the initial mouth shape form key based on the first emotion type and/or the second emotion type corresponding to the voice to be matched, and generating a mouth shape graph corresponding to the text based on the mouth shape form key corresponding to the adjusted text.
In particular, the audio features may include: energy characteristics, voicing frame characteristics, pitch frequency characteristics, formant characteristics, harmonic to noise ratio characteristics, mel cepstrum coefficient characteristics, and the like. The expressive features may include: facial features and expression amplitude features. The emotion types can be classified into happiness, anger, sadness, and happiness, or they can be classified into agitation, calm, and the like.
If the mouth shape graph can be associated with the emotion of the user, the experience of the user can be improved. Wherein, the emotion of the user can be judged through the audio characteristics of the voice of the user and/or the expression characteristics of the face of the user.
Extracting audio features of the voice to be matched, inputting the audio features into a first emotion recognition model, matching the input audio features with the audio features corresponding to the emotion types by the first emotion recognition model, and determining a first emotion type corresponding to the voice to be matched according to a matching result.
The first emotion recognition model may be obtained by training the initial model according to audio features of the plurality of sample voices and emotion types corresponding to the plurality of sample voices.
The face image of the speaker corresponding to the voice to be matched can be acquired while the voice to be matched is acquired. Extracting the expression features of the face image, inputting the expression features into a second emotion recognition model, matching the input expression features with the expression features corresponding to the emotion types by the second emotion recognition model, and determining a second emotion type corresponding to the voice to be matched according to the matching result.
The second emotion recognition model may be obtained by training the initial model according to the expression characteristics of the plurality of sample face images and emotion types corresponding to the plurality of sample face images.
The initial models corresponding to the first emotion recognition model and the second emotion recognition model can adopt a convolution neural network model, a full-connection neural network model, a circulation neural network model, a long-term and short-term memory neural network model and the like.
The initial mouth shape key can be adjusted according to the first emotion type and/or the second emotion type, and a mouth shape diagram corresponding to the text is generated according to the mouth shape key corresponding to the adjusted text.
For example, the user's emotion type is jointly determined from the first emotion type and the second emotion type to be excited. In this case, since the movement range of the user's lips is relatively large, the initial mouth shape key can be adjusted by enlarging its range of change, yielding an adjusted mouth shape key, and the corresponding mouth shape graph is generated based on the adjusted key.
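One way to read this adjustment is as a scaling of the shape key's displacement amplitudes by an emotion-dependent factor; the factor values below are invented for illustration and are not specified by the application:

```python
# Hypothetical amplitude factors per emotion type; an excited speaker opens
# the mouth more widely, so the displacements are enlarged.
EMOTION_SCALE = {"excited": 1.3, "calm": 1.0, "sad": 0.8}

def adjust_shape_key(initial_key, emotion):
    """Scale every feature-point offset of the initial shape key by the emotion factor."""
    scale = EMOTION_SCALE.get(emotion, 1.0)
    return {region: [tuple(c * scale for c in point) for point in points]
            for region, points in initial_key.items()}

initial = {"lower_lip": [(0.0, -0.05, 0.0)], "mouth_corners": [(0.03, 0.0, 0.0)]}
print(adjust_shape_key(initial, "excited"))
```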
According to the voice mouth shape matching method provided by the embodiment of the application, the initial mouth shape key corresponding to the characters is adjusted according to the audio characteristics and/or the expression characteristics corresponding to the voice to be matched, so that the mouth shape of the virtual image is more attached to the actual mouth shape of the user, and the experience of the user is improved.
In some embodiments, step 130 comprises:
performing smooth interpolation on the mouth shape form key corresponding to the current character and the mouth shape form key corresponding to the next character to generate mouth shape switching animation corresponding to the current character and the next character;
determining the starting time of the mouth shape switching animation in the pronunciation time corresponding to the current character, and determining the ending time of the mouth shape switching animation in the pronunciation time corresponding to the next character;
and displaying the mouth shape switching animation in the pronunciation time determined by the starting time and the ending time.
Specifically, if the mouth shape diagram corresponding to the current text is directly switched to the mouth shape diagram corresponding to the next text, the mouth shape switching process lacks transition, and the switching process is abrupt. In order to smoothly and naturally switch the mouth shape graph corresponding to the current character to the mouth shape graph corresponding to the next character, the embodiment of the application carries out smooth interpolation between the mouth shape key corresponding to the current character and the mouth shape key corresponding to the next character and generates a plurality of mouth shape switching graphs between the mouth shapes corresponding to the current character and the next character, and the mouth shape switching graphs form mouth shape switching animation.
For example, the current character is "you" and the next character is "good". The mouth shape form key corresponding to "you" and the mouth shape form key corresponding to "good" are obtained, smooth interpolation is performed between them, and a mouth shape switching animation between "you" and "good" is generated.
And determining the starting time of the mouth shape switching animation in the pronunciation time corresponding to the current character, and determining the ending time of the mouth shape switching animation in the pronunciation time corresponding to the next character.
For example, the pronunciation time of "you" runs from time T1 to time T2, and the pronunciation time of "good" runs from time T2 to time T3. The start time of the mouth shape switching animation is determined as T1a within the range from T1 to T2, and the end time is determined as T2a within the range from T2 to T3; the mouth shape switching animation is then displayed during the pronunciation time from T1a to T2a.
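A minimal linear-interpolation sketch of such a switching animation; the frame rate and the concrete times are placeholders, and a production system might use an easing curve instead of a straight line:

```python
def lerp_shape_keys(key_a, key_b, t):
    """Linearly interpolate two shape keys (region -> list of offsets) at t in [0, 1]."""
    return {region: [tuple(a + (b - a) * t for a, b in zip(pa, pb))
                     for pa, pb in zip(key_a[region], key_b[region])]
            for region in key_a}

def switching_animation(key_a, key_b, t_start, t_end, fps=30):
    """Generate (timestamp, interpolated shape key) frames from t_start to t_end."""
    frames = []
    n = max(1, int((t_end - t_start) * fps))
    for i in range(n + 1):
        t = i / n
        frames.append((t_start + (t_end - t_start) * t, lerp_shape_keys(key_a, key_b, t)))
    return frames

key_you = {"lower_lip": [(0.0, -0.02, 0.0)]}
key_good = {"lower_lip": [(0.0, -0.06, 0.0)]}
frames = switching_animation(key_you, key_good, t_start=0.15, t_end=0.30)
```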
According to the voice mouth shape matching method provided by the embodiment of the application, smooth switching from the mouth shape image corresponding to the current text to the mouth shape image corresponding to the next text is realized in an interpolation mode, so that the naturalness of mouth shape switching is improved, and the user has immersive experience.
In some embodiments, after step 130, the method comprises:
determining voiceprint characteristics of the voice to be matched;
matching the voiceprint characteristics with preset speaker voiceprint characteristics, and determining speaker identity information corresponding to the voiceprint characteristics;
determining an avatar corresponding to the voice to be matched based on the speaker identity information;
And loading the mouth shape graph corresponding to the characters to the corresponding position in the avatar within the pronunciation time corresponding to the characters.
Specifically, the voiceprint features are the acoustic wave spectrum carrying speech information and displayed by an electroacoustical instrument, and are biological features composed of hundreds of feature dimensions such as wavelength, frequency, intensity and the like. The identity information includes information of the sex, age, position, etc. of the speaker. The preset speaker voiceprint features are voiceprint features of a speaker with known identity information.
And extracting voiceprint features of the voice to be matched, matching the voiceprint features with a plurality of preset speaker voiceprint features, and determining speaker identity information according to the preset speaker voiceprint features successfully matched with the voiceprint features.
The identity information is analyzed to obtain the information of the gender, age, position and the like of the speaker, the virtual image of the speaker is constructed according to the predicted information of the gender, age, position and the like of the speaker, and the mouth shape diagram corresponding to the characters is loaded to the corresponding position in the virtual image within the pronunciation time corresponding to the characters, so that the synchronous change of the mouth shape and the image can be realized.
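The voiceprint matching step can be illustrated with cosine similarity over fixed-length feature vectors; the enrolled profiles, avatar labels, and similarity threshold below are fabricated for this sketch:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical enrolled speakers: voiceprint vector plus the avatar chosen
# from that speaker's identity information.
enrolled = {
    "alice": {"voiceprint": [0.9, 0.1, 0.3], "avatar": "female_adult_host"},
    "bob":   {"voiceprint": [0.2, 0.8, 0.5], "avatar": "male_adult_guest"},
}

def match_speaker(voiceprint, threshold=0.8):
    """Return the enrolled speaker most similar to the voiceprint, if above the threshold."""
    best_name, best_score = None, -1.0
    for name, profile in enrolled.items():
        score = cosine_similarity(voiceprint, profile["voiceprint"])
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, enrolled[best_name]["avatar"]) if best_score >= threshold else (None, None)

print(match_speaker([0.88, 0.12, 0.31]))  # -> ('alice', 'female_adult_host')
```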
According to the voice mouth shape matching method provided by the embodiment of the application, the voice print characteristics of the voice to be matched are determined, the speaker identity information corresponding to the voice print characteristics can be obtained, the virtual image corresponding to the voice to be matched can be constructed by analyzing the speaker identity information, and the mouth shape graph corresponding to the text is loaded to the corresponding position in the virtual image, so that synchronous change of the mouth shape and the virtual image can be realized, and immersive experience is provided for a user.
In some embodiments, the voice recognition in step 110 comprises:
inputting the voice to be matched into a voice recognition model to obtain characters corresponding to the voice to be matched output by the voice recognition model and pronunciation time corresponding to the characters;
the voice recognition model comprises a feature extraction layer, a silence detection layer and a voice recognition layer; the silence detection layer and the voice recognition layer are respectively connected with the feature extraction layer;
the feature extraction layer is used for dividing the voice to be matched into a plurality of voice frames and extracting the acoustic recognition features of each voice frame; the silence detection layer is used for determining a voice frame to be recognized in the voice to be matched and pronunciation time corresponding to the voice frame to be recognized based on the acoustic recognition characteristics of each voice frame; the voice recognition layer is used for determining characters corresponding to the voice to be matched based on the acoustic recognition characteristics of the voice frame to be recognized.
Specifically, in an actual voice interaction process, the voice to be matched uttered by the user may include a voice part and a non-voice part. The non-voice part may be a silent part or an ambient sound part. For example, if the user does not speak for more than half of the time, more than half of the collected voice to be matched is silence, and performing recognition on the silent parts wastes the computing resources of the terminal device.
The neural network model can be used as an initial model to establish a voice recognition model for processing the voice to be matched to obtain characters corresponding to the voice to be matched and pronunciation time corresponding to the characters.
In consideration of silence detection and speech recognition, when implemented using a neural network model, the method can be based on analysis of acoustic features of the speech to be matched. Therefore, the speech recognition model established by the embodiment of the application can comprise a feature extraction layer, a silence detection layer and a speech recognition layer from the model structure. The silence detection layer and the voice recognition layer are respectively connected with the feature extraction layer.
The feature extraction layer is used for dividing the voice to be matched into a plurality of voice frames and extracting the acoustic recognition features of the voice frames. First, the feature extraction layer may divide the voice to be matched into a plurality of voice frames; for example, the voice to be matched is divided into frames of 10 ms each. Next, the feature extraction layer extracts the acoustic recognition features of each voice frame. An acoustic recognition feature describes physical quantities of the voice frame in terms of its acoustic properties. For example, the acoustic recognition features may be prosodic features, timbre features, loudness features, and the like; they may also be time domain features, frequency domain features, etc. The frequency domain features may further include mel-frequency cepstrum coefficients, filter bank features, and the like.
The silence detection layer is used for determining a voice frame to be recognized in the voice to be matched according to the acoustic recognition characteristics output by the characteristic extraction layer. The voice frames to be recognized are voice frames containing voices to be matched after silence detection is carried out on each voice frame. By extracting the voice frame to be recognized, the useful part (voice part) in the voice to be matched can be extracted, and the processing of the useless part (non-voice part) is reduced, so that the calculation amount of the terminal equipment is reduced.
The voice recognition layer is used for determining a voice recognition result of the voice to be matched according to the acoustic recognition characteristics of the voice frame to be recognized.
The feature extraction layer, silence detection layer and speech recognition layer may be implemented using different initial neural network models. The types of the initial neural network models adopted by the layers can be the same or different, and the embodiment of the application is not particularly limited. The initial neural network model may include a convolutional neural network, a deep feed-forward sequence memory neural network, a long-term memory neural network, an attention neural network, and the like.
In order to simplify the model structure of the speech recognition model, the silence detection layer and the speech recognition layer may also be implemented using a partial structure of a neural network, for example, a fully connected layer. Because the tasks performed by the two layers differ, their numbers of neurons, weight parameters, and so on differ even though both are implemented with fully connected layers.
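A structural sketch of this three-layer arrangement, with shared feature extraction feeding both a silence-detection head and a recognition head; all internals are stubbed out, since the application does not fix a concrete network:

```python
class SpeechRecognitionModel:
    """Skeleton of the layout: one feature extraction layer feeds two heads."""

    def extract_features(self, samples, frame_len=160):
        # Split the speech into fixed-length frames; a real layer would also
        # compute acoustic features (e.g. filter-bank energies) per frame.
        return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

    def detect_silence(self, features):
        # Silence-detection head: keep the indices of frames judged to contain speech.
        return [i for i, frame in enumerate(features) if any(abs(x) > 0.01 for x in frame)]

    def recognize(self, features, voiced_indices):
        # Recognition head: decode text from the voiced frames only.
        # Stubbed out; a real head maps acoustic features to characters.
        return "?" * len(voiced_indices)

    def __call__(self, samples):
        features = self.extract_features(samples)
        voiced = self.detect_silence(features)
        return self.recognize(features, voiced), voiced
```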
After the non-voice part is obtained, a mouth shape graph for the non-voice part can be constructed, in which the mouth may be closed or slightly open. The duration of the non-voice part is obtained, and the mouth shape graph of the non-voice part is displayed during that time.
According to the voice mouth shape matching method provided by the embodiment of the application, the useful part (voice part) and the useless part (non-voice part) of the voice to be matched can be identified through the voice identification model, the voice to be matched is processed in a targeted manner, the calculation amount of terminal equipment is reduced, meanwhile, the mouth shape graph of the non-voice part is constructed, the virtual image can be attached to the actual situation in the mouth shape changing process, and the experience of a user is improved.
In some embodiments, fig. 2 is a schematic flow chart of creating a mouth shape key according to an embodiment of the present application, and as shown in fig. 2, the construction method includes:
step 210, creating a mouth shape form key library, which comprises a mouth shape identifier, a mouth shape form key and candidate characters; one mouth shape graph comprises a group of mouth shape keys, and simultaneously, one mouth shape graph can correspond to a plurality of candidate characters, and the corresponding mouth shape keys can be searched through characters corresponding to the voice to be matched or mouth shape identification keywords.
Step 220, traversing candidate characters in the mouth shape form key library, and judging whether the current mouth shape form key library contains characters corresponding to the voice to be matched and mouth shape form keys corresponding to the characters; if all the characters corresponding to the voice to be matched can be matched to the candidate characters in the current mouth shape form key library and the mouth shape form keys corresponding to the candidate characters, ending the creation.
Step 230, if the text corresponding to the voice to be matched is not matched with the candidate text in the mouth shape form key library, drawing a mouth shape diagram corresponding to the text corresponding to the voice to be matched, and naming a unique mouth shape identification keyword so as to facilitate subsequent inquiry.
Step 240, adjusting the mouth shape form key in the model editing tool to make the model mouth shape consistent with the mouth shape diagram corresponding to the text corresponding to the voice to be matched.
Step 250, entering a mouth shape identification keyword and a corresponding mouth shape key in a mouth shape key library.
According to the voice mouth shape matching method provided by the embodiment of the application, the mouth shape form key corresponding to the text is determined by drawing the mouth shape graph corresponding to the text corresponding to the voice to be matched, so that the mouth shape form key library is perfected, and the accuracy of voice mouth shape matching is improved.
In some embodiments, fig. 3 is a flow chart of a voice mouth shape matching method according to another embodiment of the present application, as shown in fig. 3, the method includes:
step 310, starting a voice mouth shape matching device, and rendering the virtual image by the three-dimensional engine.
Step 320, the voice mouth shape matching device monitors the sound of the current environment.
Step 330, monitoring the voice to be matched, converting the voice to be matched into characters, and recording the pronunciation time of each character;
step 340, inquiring a mouth shape form key library to obtain mouth shape form keys corresponding to the characters;
and 350, modifying the mouth shape form key of the virtual image in an interpolation mode according to the pronunciation time of each character to smoothly switch the mouth shape.
The change from one mouth shape to another is a gradual process: the mouth shape form keys corresponding to adjacent characters are interpolated so that the mouth shape graph of the current character transitions smoothly and gradually into the mouth shape graph of the next character within the interval between them.
According to the voice mouth shape matching method provided by the embodiment of the application, the mouth shape matching is carried out on the acquired voice to be matched by monitoring the voice in the current environment, and the mouth shape form key of the virtual image is modified in an interpolation mode, so that the mouth shape is smoothly switched, the accuracy of voice mouth shape matching is improved, and the user experience is improved.
The voice mouth shape matching device provided by the embodiment of the application is described below, and the voice mouth shape matching device described below and the voice mouth shape matching method described above can be referred to correspondingly.
Fig. 4 is a schematic structural diagram of a voice mouth shape matching device according to an embodiment of the present application, and as shown in fig. 4, the device includes an obtaining unit 410, a generating unit 420, and a matching unit 430.
The obtaining unit 410 is configured to obtain a text corresponding to the voice to be matched and a pronunciation time corresponding to the text.
The generating unit 420 is configured to generate a mouth shape graph corresponding to the text based on the mouth shape morphological key corresponding to the text.
The matching unit 430 is configured to display a mouth shape chart corresponding to the text within the pronunciation time corresponding to the text.
Specifically, according to an embodiment of the present application, any of the plurality of units of the acquisition unit 410, the generation unit 420, and the matching unit 430 may be combined in one unit to be implemented, or any of the plurality of units may be split into a plurality of units.
Alternatively, at least some of the functionality of one or more of the units may be combined with at least some of the functionality of other units and implemented in one unit.
According to an embodiment of the present application, at least one of the acquisition unit 410, the generation unit 420 and the matching unit 430 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable way of integrating or packaging the circuits, or in any one of or a suitable combination of any of the three.
Alternatively, at least one of the acquisition unit 410, the generation unit 420 and the matching unit 430 may be at least partially implemented as a computer program element, which when executed may perform the respective functions.
The voice mouth shape matching device provided by the embodiments of the application obtains the characters corresponding to the voice to be matched and the pronunciation time corresponding to the characters, generates a mouth shape graph corresponding to the characters according to the mouth shape form keys corresponding to the characters, and displays the mouth shape graph within the pronunciation time corresponding to the characters. The device thus enables the avatar to synchronously make mouth shape actions matched with the real-time voice uttered by the user, which improves the accuracy of matching the avatar's voice and mouth shape, improves the realism of the avatar, and improves the user's experience of using the avatar.
In some embodiments, the voice mouth shape matching device further comprises an identification unit configured to:
acquiring voice to be matched;
and carrying out voice recognition on the voice to be matched, and determining characters corresponding to the voice to be matched and pronunciation time corresponding to the characters.
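As a minimal sketch of this step, the snippet below assumes a hypothetical ASR engine object with a `transcribe` method that returns per-character segments carrying start and end times; the engine call and its result fields are illustrative assumptions, not an API disclosed by the application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TimedCharacter:
    text: str        # the recognized character
    start_s: float   # pronunciation start time within the voice to be matched
    end_s: float     # pronunciation end time

def recognize_with_times(audio_samples, sample_rate: int, asr) -> List[TimedCharacter]:
    """Run speech recognition and keep the pronunciation time of each character."""
    result = asr.transcribe(audio_samples, sample_rate)   # hypothetical engine call
    return [
        TimedCharacter(text=seg.text, start_s=seg.start, end_s=seg.end)
        for seg in result.segments                        # assumed per-character segments
    ]
```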
In some embodiments, the voice mouth shape matching device further comprises a determining unit configured to:
determining language information of the voice to be matched;
determining a mouth shape form key library corresponding to the voice to be matched based on the language information;
matching the characters with each candidate character in the mouth shape form key library, and determining the mouth shape form key corresponding to the characters;
the mouth shape form key library comprises a plurality of candidate characters and the mouth shape form keys corresponding to the candidate characters.
In some embodiments, the determining unit is specifically configured to:
acquiring mouth shape graphs of each candidate character under the current language information;
generating mouth shape form keys corresponding to the candidate characters based on the mouth shape graphs of the candidate characters;
and constructing a mouth shape form key library corresponding to the current language information based on the mouth shape form keys corresponding to the candidate characters.
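A minimal sketch of how such a language-specific library could be built and queried is shown below; the nested-dictionary layout and the `extract_key` callback (which derives a form key from a candidate character's mouth shape graph) are assumptions made for the example.

```python
from typing import Callable, Dict, Optional

# language code -> candidate character -> mouth shape form key (blend-shape name -> weight)
MouthKeyLibrary = Dict[str, Dict[str, Dict[str, float]]]

def build_language_library(
    mouth_graphs: Dict[str, object],                      # candidate character -> mouth shape graph
    extract_key: Callable[[object], Dict[str, float]],    # assumed graph-to-form-key converter
) -> Dict[str, Dict[str, float]]:
    """Generate a mouth shape form key from the mouth shape graph of every candidate character."""
    return {char: extract_key(graph) for char, graph in mouth_graphs.items()}

def lookup_mouth_key(library: MouthKeyLibrary, language: str, char: str) -> Optional[Dict[str, float]]:
    """Pick the library for the detected language and match the character against its candidates."""
    return library.get(language, {}).get(char)
```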
In some embodiments, the determining unit is specifically configured to:
under the condition that the character is a polyphone, determining the mouth shape form key of the character under each pronunciation;
acquiring a voice transcription text of the voice to be matched;
inputting the voice transcription text into a polyphone disambiguation model to obtain the pronunciation probabilities of the character under each pronunciation output by the polyphone disambiguation model;
and taking the mouth shape form key corresponding to the pronunciation with the largest pronunciation probability as the mouth shape form key corresponding to the character.
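The selection of the most probable pronunciation reduces to an argmax over the model's output; a short sketch, assuming the pronunciation probabilities are already available as a dictionary:

```python
from typing import Dict

def select_polyphone_key(
    keys_by_pronunciation: Dict[str, Dict[str, float]],  # pronunciation -> mouth shape form key
    pronunciation_probs: Dict[str, float],               # output of the disambiguation model
) -> Dict[str, float]:
    """Use the mouth shape form key of the most probable pronunciation."""
    best = max(pronunciation_probs, key=pronunciation_probs.get)
    return keys_by_pronunciation[best]
```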
In some embodiments, the generating unit is specifically configured to:
acquiring an initial mouth shape form key corresponding to the characters and the audio features and/or face image corresponding to the voice to be matched;
matching the audio features with the audio features corresponding to the emotion types, and determining a first emotion type corresponding to the voice to be matched;
matching the expression features of the face image with the expression features corresponding to the emotion types, and determining a second emotion type corresponding to the voice to be matched;
and adjusting the initial mouth shape form key based on the first emotion type and/or the second emotion type corresponding to the voice to be matched, and generating a mouth shape graph corresponding to the text based on the adjusted mouth shape form key corresponding to the text.
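One simple way to realize such an adjustment is to scale the blend-shape weights by an emotion-dependent factor; the factors below are invented purely for illustration and are not values taken from the application.

```python
from typing import Dict, Optional

# Illustrative per-emotion scaling factors for mouth openness; the values are assumptions.
EMOTION_SCALE = {"happy": 1.2, "neutral": 1.0, "sad": 0.8, "angry": 1.1}

def adjust_key_for_emotion(
    initial_key: Dict[str, float],
    first_emotion: Optional[str] = None,    # emotion type determined from audio features
    second_emotion: Optional[str] = None,   # emotion type determined from the face image
) -> Dict[str, float]:
    """Scale the initial mouth shape form key according to the detected emotion type(s)."""
    scales = [EMOTION_SCALE.get(e, 1.0) for e in (first_emotion, second_emotion) if e]
    scale = sum(scales) / len(scales) if scales else 1.0
    return {name: min(1.0, weight * scale) for name, weight in initial_key.items()}
```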
In some embodiments, the matching unit is specifically configured to:
performing smooth interpolation on the mouth shape form key corresponding to the current character and the mouth shape form key corresponding to the next character to generate mouth shape switching animation corresponding to the current character and the next character;
determining the starting time of the mouth shape switching animation in the pronunciation time corresponding to the current character, and determining the ending time of the mouth shape switching animation in the pronunciation time corresponding to the next character;
and displaying the mouth shape switching animation in the pronunciation time determined by the starting time and the ending time.
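A small sketch of how the start and end times of the switching animation could be placed across the boundary between two characters is given below; the `overlap_ratio` parameter is an assumption introduced for the example.

```python
from typing import Dict

def schedule_switch_animation(
    current_start_s: float, current_end_s: float,   # pronunciation time of the current character
    next_start_s: float, next_end_s: float,         # pronunciation time of the next character
    overlap_ratio: float = 0.3,                     # assumed fraction of each pronunciation used for the switch
) -> Dict[str, float]:
    """Place the mouth shape switching animation across the boundary of two characters."""
    start = current_end_s - (current_end_s - current_start_s) * overlap_ratio
    end = next_start_s + (next_end_s - next_start_s) * overlap_ratio
    return {"start_s": start, "end_s": end}
```

Within the returned interval, the interpolation sketch shown earlier would be played to morph the current character's mouth shape into the next character's mouth shape.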
In some embodiments, the matching unit is further configured to:
determining voiceprint characteristics of the voice to be matched;
matching the voiceprint characteristics with preset speaker voiceprint characteristics, and determining speaker identity information corresponding to the voiceprint characteristics;
determining an avatar corresponding to the voice to be matched based on the speaker identity information;
and loading the mouth shape graph corresponding to the characters to the corresponding position in the avatar within the pronunciation time corresponding to the characters.
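The speaker-to-avatar selection could, for example, use cosine similarity between voiceprint embeddings; this is one possible matching approach sketched under that assumption, not necessarily the one used by the application, and the enrollment data structures are illustrative.

```python
import numpy as np
from typing import Dict, Optional

def match_speaker(
    voiceprint: np.ndarray,
    enrolled: Dict[str, np.ndarray],   # speaker identity -> preset voiceprint vector
    threshold: float = 0.75,           # assumed minimum similarity for a match
) -> Optional[str]:
    """Compare the voiceprint of the voice to be matched with preset speaker voiceprints."""
    best_id, best_score = None, threshold
    for speaker_id, ref in enrolled.items():
        score = float(np.dot(voiceprint, ref) /
                      (np.linalg.norm(voiceprint) * np.linalg.norm(ref)))
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id

def avatar_for_speech(voiceprint, enrolled, avatars: Dict[str, object], default_avatar):
    """Pick the avatar bound to the identified speaker, falling back to a default."""
    speaker_id = match_speaker(voiceprint, enrolled)
    return avatars.get(speaker_id, default_avatar)
```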
In some embodiments, the identification unit is specifically configured to:
inputting the voice to be matched into a voice recognition model to obtain characters corresponding to the voice to be matched output by the voice recognition model and pronunciation time corresponding to the characters;
the voice recognition model comprises a feature extraction layer, a silence detection layer and a voice recognition layer; the silence detection layer and the voice recognition layer are respectively connected with the feature extraction layer;
the feature extraction layer is used for dividing the voice to be matched into a plurality of voice frames and extracting the acoustic recognition features of each voice frame; the silence detection layer is used for determining a voice frame to be recognized in the voice to be matched and pronunciation time corresponding to the voice frame to be recognized based on the acoustic recognition characteristics of each voice frame; the voice recognition layer is used for determining characters corresponding to the voice to be matched based on the acoustic recognition characteristics of the voice frame to be recognized.
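To make the division of labor between the three layers concrete, the sketch below uses simple framing and an energy-based silence check as stand-ins for the feature extraction and silence detection layers; the frame sizes, threshold, and the `recognition_layer` callback are assumptions for illustration and are simpler than a trained model would be.

```python
import numpy as np

FRAME_S = 0.025   # assumed frame length in seconds
HOP_S = 0.010     # assumed frame shift in seconds

def split_frames(samples: np.ndarray, sample_rate: int) -> np.ndarray:
    """Feature extraction layer (simplified): divide the voice to be matched into voice frames."""
    frame, hop = int(FRAME_S * sample_rate), int(HOP_S * sample_rate)
    count = max(0, 1 + (len(samples) - frame) // hop)
    if count == 0:
        return np.empty((0, frame))
    return np.stack([samples[i * hop: i * hop + frame] for i in range(count)])

def detect_speech_frames(frames: np.ndarray, energy_threshold: float = 1e-3) -> np.ndarray:
    """Silence detection layer (simplified): keep the indices of frames whose energy exceeds a threshold.

    The pronunciation time of frame i is approximately i * HOP_S seconds into the voice to be matched.
    """
    energy = (frames ** 2).mean(axis=1)
    return np.where(energy > energy_threshold)[0]

def recognize(frames: np.ndarray, speech_idx: np.ndarray, recognition_layer) -> list:
    """Speech recognition layer: decode characters only from the frames to be recognized."""
    return recognition_layer(frames[speech_idx])   # hypothetical decoder callback
```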
It should be noted that the voice mouth shape matching device provided by the embodiments of the present application can implement all the method steps implemented by the embodiments of the voice mouth shape matching method and can achieve the same technical effects; detailed descriptions of the parts and beneficial effects that are the same as those of the method embodiments are omitted here.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 5, the electronic device may include: a processor (Processor) 510, a communication interface (Communications Interface) 520, a memory (Memory) 530, and a communication bus (Communications Bus) 540, wherein the processor 510, the communication interface 520, and the memory 530 communicate with each other via the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a voice mouth shape matching method comprising:
acquiring characters corresponding to the voice to be matched and pronunciation time corresponding to the characters;
generating a mouth shape graph corresponding to the text based on the mouth shape form key corresponding to the text;
and displaying the mouth shape graph corresponding to the text in the pronunciation time corresponding to the text.
In addition, the logic instructions in the memory described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application, in essence or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The processor in the electronic device provided by the embodiments of the application can call the logic instructions in the memory to implement the above method; the specific implementation is consistent with that of the method embodiments and can achieve the same beneficial effects, so details are not repeated here.
Embodiments of the present application also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method provided by the above embodiments.
The specific embodiment is consistent with the foregoing method embodiment, and the same beneficial effects can be achieved, and will not be described herein.
The embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present application without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. A method for voice mouth shape matching, comprising:
acquiring characters corresponding to the voice to be matched and pronunciation time corresponding to the characters;
generating a mouth shape graph corresponding to the characters based on the mouth shape form keys corresponding to the characters;
and displaying the mouth shape graph corresponding to the text in the pronunciation time corresponding to the text.
2. The method for matching a voice mouth shape according to claim 1, wherein obtaining text corresponding to the voice to be matched comprises:
acquiring voice to be matched;
and carrying out voice recognition on the voice to be matched, and determining characters corresponding to the voice to be matched and pronunciation time corresponding to the characters.
3. The method for matching a voice mouth shape according to claim 1, wherein before generating a mouth shape map corresponding to the text based on a mouth shape morphological key corresponding to the text, the method comprises:
determining language information of the voice to be matched;
determining a mouth shape form key library corresponding to the voice to be matched based on the language information;
matching the characters with each candidate character in the mouth shape form key library, and determining a mouth shape form key corresponding to the characters;
The mouth shape form key library comprises a plurality of candidate characters and mouth shape form keys corresponding to the candidate characters.
4. The voice mouth shape matching method according to claim 3, wherein the mouth shape form key library is determined based on the following steps:
acquiring mouth shape graphs of each candidate character under the current language information;
generating mouth shape form keys corresponding to the candidate characters based on the mouth shape graphs of the candidate characters;
and constructing a mouth shape form key library corresponding to the current language information based on the mouth shape form keys corresponding to each candidate character.
5. The method for matching a voice mouth shape according to claim 3, wherein the step of matching the text with each candidate text in the mouth shape key library to determine a mouth shape key corresponding to the text comprises:
determining a mouth shape form key of the character under each pronunciation under the condition that the character is a polyphone;
acquiring a voice transcription text of the voice to be matched;
inputting the voice transcription text into a polyphone disambiguation model to obtain the pronunciation probabilities of the character under each pronunciation output by the polyphone disambiguation model;
and taking the mouth shape form key corresponding to the pronunciation with the largest pronunciation probability as the mouth shape form key corresponding to the characters.
6. The method for matching a voice mouth shape according to claim 1, wherein the generating a mouth shape map corresponding to the text based on a mouth shape morphological key corresponding to the text comprises:
acquiring an initial mouth shape form key corresponding to the text and an audio feature and/or a face image corresponding to the voice to be matched;
matching the audio features with the audio features corresponding to the emotion types, and determining a first emotion type corresponding to the voice to be matched;
matching the expression features of the face image with the expression features corresponding to the emotion types, and determining a second emotion type corresponding to the voice to be matched;
and adjusting the initial mouth shape form key based on the first emotion type and/or the second emotion type corresponding to the voice to be matched, and generating a mouth shape diagram corresponding to the text based on the mouth shape form key corresponding to the text after adjustment.
7. The method for matching a voice mouth shape according to claim 1, wherein displaying the mouth shape graph corresponding to the text within the pronunciation time corresponding to the text comprises:
performing smooth interpolation on a mouth shape form key corresponding to a current character and a mouth shape form key corresponding to a next character to generate a mouth shape switching animation corresponding to the current character and the next character;
Determining the starting time of the mouth shape switching animation in the pronunciation time corresponding to the current character, and determining the ending time of the mouth shape switching animation in the pronunciation time corresponding to the next character;
and displaying the mouth shape switching animation in the pronunciation time determined by the starting time and the ending time.
8. The method for matching a voice mouth shape according to claim 1, wherein after the mouth shape graph corresponding to the text is displayed within the pronunciation time corresponding to the text, the method comprises:
determining voiceprint characteristics of the voice to be matched;
matching the voiceprint characteristics with preset speaker voiceprint characteristics, and determining speaker identity information corresponding to the voiceprint characteristics;
determining an avatar corresponding to the voice to be matched based on the identity information of the speaker;
and loading the mouth shape graph corresponding to the characters to the corresponding position in the avatar within the pronunciation time corresponding to the characters.
9. The method for voice mouth shape matching according to any one of claims 2 to 8, wherein the performing voice recognition on the voice to be matched, determining a text corresponding to the voice to be matched, and a pronunciation time corresponding to the text, includes:
Inputting the voice to be matched into a voice recognition model to obtain characters corresponding to the voice to be matched and pronunciation time corresponding to the characters, which are output by the voice recognition model;
the voice recognition model comprises a feature extraction layer, a silence detection layer and a voice recognition layer; the silence detection layer and the voice recognition layer are respectively connected with the feature extraction layer;
the feature extraction layer is used for dividing the voice to be matched into a plurality of voice frames and extracting acoustic recognition features of each voice frame; the silence detection layer is used for determining a voice frame to be recognized in the voice to be matched and pronunciation time corresponding to the voice frame to be recognized based on acoustic recognition characteristics of each voice frame; the voice recognition layer is used for determining characters corresponding to the voice to be matched based on the acoustic recognition characteristics of the voice frame to be recognized.
10. A voice mouth shape matching device, comprising:
an acquisition unit, used for acquiring characters corresponding to the voice to be matched and pronunciation time corresponding to the characters;
the generating unit is used for generating a mouth shape graph corresponding to the characters based on the mouth shape form keys corresponding to the characters;
And the matching unit is used for displaying the mouth shape graph corresponding to the text in the pronunciation time corresponding to the text.
11. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the voice mouth shape matching method of any one of claims 1 to 9 when executing the program.
12. A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the voice mouth shape matching method according to any one of claims 1 to 9.
CN202310363302.4A 2023-04-06 2023-04-06 Voice mouth shape matching method and device, storage medium and electronic equipment Pending CN116597858A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310363302.4A CN116597858A (en) 2023-04-06 2023-04-06 Voice mouth shape matching method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310363302.4A CN116597858A (en) 2023-04-06 2023-04-06 Voice mouth shape matching method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116597858A true CN116597858A (en) 2023-08-15

Family

ID=87598021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310363302.4A Pending CN116597858A (en) 2023-04-06 2023-04-06 Voice mouth shape matching method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116597858A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117714763A (en) * 2024-02-05 2024-03-15 深圳市鸿普森科技股份有限公司 Virtual object speaking video generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
JP6903129B2 (en) Whispering conversion methods, devices, devices and readable storage media
CN111048062B (en) Speech synthesis method and apparatus
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN111260761B (en) Method and device for generating mouth shape of animation character
CN112099628A (en) VR interaction method and device based on artificial intelligence, computer equipment and medium
CN111583944A (en) Sound changing method and device
CN104538043A (en) Real-time emotion reminder for call
CN111508511A (en) Real-time sound changing method and device
CN112185363B (en) Audio processing method and device
CN107316635B (en) Voice recognition method and device, storage medium and electronic equipment
CN112735371B (en) Method and device for generating speaker video based on text information
CN111128118A (en) Speech synthesis method, related device and readable storage medium
CN115700772A (en) Face animation generation method and device
CN112131359A (en) Intention identification method based on graphical arrangement intelligent strategy and electronic equipment
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN116597858A (en) Voice mouth shape matching method and device, storage medium and electronic equipment
CN115148185A (en) Speech synthesis method and device, electronic device and storage medium
CN116469375A (en) End-to-end speech synthesis method, device, equipment and medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN113299270B (en) Method, device, equipment and storage medium for generating voice synthesis system
EP1271469A1 (en) Method for generating personality patterns and for synthesizing speech
CN113782052A (en) Tone conversion method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination