CN115497448A - Method and device for synthesizing voice animation, electronic equipment and storage medium


Info

Publication number
CN115497448A
Authority
CN
China
Prior art keywords
audio information
lip
voice
image
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110671977.6A
Other languages
Chinese (zh)
Inventor
曹爽
潘伟洲
曾润良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110671977.6A priority Critical patent/CN115497448A/en
Publication of CN115497448A publication Critical patent/CN115497448A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the present application provides a method and device for synthesizing a voice animation, an electronic device, and a storage medium, relating to the field of voice technology. The method includes: displaying an image acquisition control and, in response to a triggering operation on the image acquisition control, acquiring a facial image of a target user to obtain a target facial image; displaying read-along information and a read-along control and, in response to a triggering operation on the read-along control, acquiring audio information input by the target user that corresponds to the read-along information; and acquiring and displaying a voice animation that includes images of the target user's lip changes, where the lip changes are synchronized with the content of the audio information and the images of the lip changes are obtained from the lips in the target facial image and the audio information. The embodiment of the present application can help the user practice pronunciation, and the lip shape used when pronouncing, more accurately according to the voice animation, which improves the interest and efficiency of language learning.

Description

Method and device for synthesizing voice animation, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech technologies, and in particular, to a method and an apparatus for synthesizing speech animation, an electronic device, and a storage medium.
Background
In language learning, listening and reading aloud are very important: in general, a learner needs to read aloud along with standard pronunciations in order to master the various sounds of the language.
In many cases, however, a learner cannot judge on his or her own whether the pronunciation is standard. For this reason, most language-learning software also shows the learner the standard lip shape of a pronunciation, so that the learner can correct the pronunciation by adjusting his or her own lips to match the standard lip shape.
However, the standard lip shape provided in the prior art is either hand-drawn or the lip shape of a model, which learners cannot accurately imitate; this affects pronunciation accuracy and reduces enthusiasm for language learning.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, an electronic device, and a storage medium for synthesizing a voice animation, which overcome the above problems or at least partially solve the above problems.
In a first aspect, a method for synthesizing a speech animation is provided, the method comprising:
displaying an image acquisition control, and acquiring a facial image of a target user in response to the triggering operation of the image acquisition control to obtain a target facial image;
displaying read-along information and a read-along control, and, in response to a triggering operation on the read-along control, acquiring audio information input by the target user that corresponds to the read-along information;
and acquiring and displaying a voice animation including images of the target user's lip changes, wherein the lip changes of the target user are synchronized with the content of the audio information, and the images of the lip changes are obtained from the lips in the target facial image and the audio information.
In one possible implementation, a manner of obtaining speech animation includes:
obtaining at least one sample lip shape, wherein the sample lip shape is used for expressing the pronunciation of at least one phoneme; updating the lip shape of the target user in the target face image according to the sample lip shape to obtain a synthesized face image, wherein the lip shape of the target user in the synthesized face image is used for expressing the pronunciation of the phoneme expressed by the corresponding sample lip shape;
performing voice recognition on the audio information to obtain a phoneme sequence of the audio information, wherein the phoneme sequence comprises a phoneme corresponding to at least one time point in the audio information;
determining a synthetic face image corresponding to each phoneme in the phoneme sequence, and obtaining a synthetic face image sequence of lip changes of the target user according to each synthetic face image;
and acquiring an audio frame sequence corresponding to the phoneme sequence in the audio information, and synchronizing the audio frame sequence and the synthesized facial image sequence according to the time information of the audio frame sequence in the audio information to generate the voice animation.
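For illustration only, the following Python sketch outlines how these steps could be composed into one pipeline; every function name in it (update_lip, recognize_phonemes, synchronize) is a hypothetical placeholder for the corresponding operation described above, not an implementation disclosed by this application.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PhonemeSegment:
    phoneme: str   # recognized phoneme symbol
    start: float   # start time in the audio, in seconds
    end: float     # end time in the audio, in seconds


def synthesize_voice_animation(target_face, audio, sample_lips: Dict[str, object]):
    # 1. Build one synthesized facial image per sample lip shape.
    synthetic_faces = {phoneme: update_lip(target_face, lip)
                       for phoneme, lip in sample_lips.items()}
    # 2. Speech recognition: phoneme sequence with time points.
    segments: List[PhonemeSegment] = recognize_phonemes(audio)
    # 3. One synthesized facial image per phoneme, in temporal order.
    face_sequence = [synthetic_faces[seg.phoneme] for seg in segments]
    # 4. Align the image sequence with the matching audio frames.
    return synchronize(face_sequence, segments, audio)
```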
In one possible implementation, determining a synthetic face image corresponding to each phoneme in the sequence of phonemes includes:
determining a lip shape corresponding to each phoneme in the phoneme sequence to obtain a lip shape sequence;
a composite face image corresponding to each lip in the sequence of lips is determined.
In one possible implementation, after the sequence of composite face images is obtained, the method further includes:
respectively fusing the synthesized face images in the synthesized face image sequence with preset material images to obtain a fused image sequence;
synchronizing the sequence of audio frames and the sequence of synthetic facial images to obtain a speech animation, comprising:
and synchronizing the audio frame sequence and the fusion image sequence to obtain the voice animation.
In one possible implementation manner, performing speech recognition on the audio information to obtain a phoneme sequence of the audio information includes:
obtaining an initial translation text of the audio information, and determining the language of the audio information according to the initial translation text;
if the language of the audio information is the target language, acquiring a word segmentation result to be corrected and polyphones in the word segmentation result from the initial translation text;
screening correct polyphone characters from the polyphone characters, and filling the correct polyphone characters into word segmentation results to be corrected to obtain correct word segmentation results;
and acquiring the standard pronunciation of the correct word segmentation result, and performing phoneme recognition on the standard pronunciation through a preset acoustic model to obtain a phoneme sequence of the audio information.
In one possible implementation, obtaining initial translated text of audio information includes:
detecting and eliminating direct current offset in the audio information and resampling the audio information after the direct current offset is eliminated to obtain resampled audio information;
carrying out voice detection on the resampled audio information to obtain a voice audio frame in the audio information;
and carrying out voice recognition on the human voice audio frame to obtain an initial translation text.
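As a rough illustration of this preprocessing chain (a sketch only, not code from the application), the following removes the DC offset by subtracting the signal mean, resamples with scipy, and keeps frames whose energy exceeds a simple threshold as candidate voice frames; the target rate, frame length, and threshold are arbitrary assumptions.

```python
import numpy as np
from scipy.signal import resample_poly


def preprocess_audio(audio: np.ndarray, src_rate: int, dst_rate: int = 16000,
                     frame_ms: int = 25, energy_thresh: float = 1e-4):
    # Detect and eliminate the DC offset: subtract the mean amplitude.
    audio = audio - np.mean(audio)
    # Resample the offset-free signal to the target rate.
    audio = resample_poly(audio, dst_rate, src_rate)
    # Naive energy-based voice detection over fixed-length frames.
    frame_len = int(dst_rate * frame_ms / 1000)
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, frame_len)]
    voiced = [f for f in frames if np.mean(f ** 2) > energy_thresh]
    return audio, voiced
```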
In one possible implementation, obtaining and displaying a voice animation of an image including a lip change of a target user includes:
inputting the target face image and the audio information into a voice animation installation package running locally at a terminal to obtain voice animations output by the voice animation installation package;
the voice animation installation package is generated through the following steps:
acquiring a program code for acquiring a voice animation according to a target face image and audio information;
compiling the program code by using a cross tool chain to obtain a static library running aiming at a target operating system, wherein the cross tool chain is a cross compiling environment corresponding to a voice animation installation package to be generated;
and defining an external interface and a header file of the static library, and generating a voice animation installation package.
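The sketch below is a hypothetical build script rather than the tool chain actually used: the compiler and archiver names are placeholders for whatever cross tool chain targets the intended operating system, and the library and header names are made up for illustration.

```python
import pathlib
import shutil
import subprocess

# Assumed cross tool chain binaries; replace with the real target tool chain.
CC = "aarch64-linux-gnu-gcc"
AR = "aarch64-linux-gnu-ar"


def build_static_library(sources, header, out_dir="package"):
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    objects = []
    for src in sources:
        obj = out / (pathlib.Path(src).stem + ".o")
        # Compile each source file for the target operating system.
        subprocess.run([CC, "-c", src, "-o", str(obj)], check=True)
        objects.append(str(obj))
    # Archive the object files into a static library.
    subprocess.run([AR, "rcs", str(out / "libvoiceanim.a"), *objects], check=True)
    # Ship the header file that declares the external interface.
    shutil.copy(header, out / "voiceanim.h")
```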
In a second aspect, there is provided a speech animation synthesis apparatus, including:
the target face image acquisition module is used for displaying the image acquisition control, responding to the triggering operation of the image acquisition control, and acquiring a face image of a target user to obtain a target face image;
the audio information acquisition module is used for displaying read-along information and a read-along control and, in response to a triggering operation on the read-along control, acquiring audio information input by the target user that corresponds to the read-along information;
and the voice animation display module is used for acquiring and displaying a voice animation including images of the target user's lip changes, wherein the lip changes of the target user are synchronized with the content of the audio information, and the images of the lip changes are obtained from the lips in the target facial image and the audio information.
In one possible implementation, the synthesizing apparatus further includes: the voice animation synthesis module specifically comprises:
a synthesized face image generation sub-module for obtaining at least one sample lip for expressing pronunciation of at least one phoneme; updating the lip shape of the target user in the target face image according to the sample lip shape to obtain a synthesized face image, wherein the lip shape of the target user in the synthesized face image is used for expressing the pronunciation of the phoneme expressed by the corresponding sample lip shape;
the voice recognition submodule is used for carrying out voice recognition on the audio information to obtain a phoneme sequence of the audio information, wherein the phoneme sequence comprises a phoneme corresponding to at least one time point in the audio information;
the image sequence submodule is used for determining a synthetic face image corresponding to each phoneme in the phoneme sequence and obtaining a synthetic face image sequence of lip changes of the target user according to each synthetic face image;
and the synchronization submodule is used for acquiring an audio frame sequence corresponding to the phoneme sequence in the audio information, and synchronizing the audio frame sequence and the synthesized face image sequence according to the time information of the audio frame sequence in the audio information to generate the voice animation.
In one possible implementation, the image sequence sub-module includes:
the lip sequence unit is used for determining the lip corresponding to each phoneme in the phoneme sequence to obtain a lip sequence;
an image corresponding unit for determining a composite face image corresponding to each lip in the lip sequence.
In one possible implementation, the speech animation synthesis module further includes:
the fusion sequence submodule is used for fusing the synthesized face images in the synthesized face image sequence with preset material images respectively to obtain a fusion image sequence;
correspondingly, the synchronization submodule is used for synchronizing the audio frame sequence and the fusion image sequence to obtain the voice animation.
In one possible implementation, the speech recognition sub-module includes:
the initial translation unit is used for obtaining an initial translation text of the audio information and determining the language of the audio information according to the initial translation text;
the word segmentation unit is used for acquiring a word segmentation result to be corrected and polyphones in the word segmentation result from the initial translation text if the language of the audio information is determined to be the target language;
the calibration unit is used for screening out correct polyphone characters from the polyphone characters, filling the correct polyphone characters into word segmentation results to be corrected, and obtaining correct word segmentation results;
and the phoneme recognition unit is used for acquiring the standard pronunciation of the correct word segmentation result, and performing phoneme recognition on the standard pronunciation through a preset acoustic model to obtain a phoneme sequence of the audio information.
In one possible implementation, the initial translation unit includes:
the preprocessing unit is used for detecting and eliminating direct current offset in the audio information and resampling the audio information after the direct current offset is eliminated to obtain resampled audio information;
the voice detection unit is used for carrying out voice detection on the audio information after resampling to obtain a voice audio frame in the audio information;
and the voice recognition unit is used for carrying out voice recognition on the human voice audio frame to obtain an initial translation text.
In one possible implementation, the voice animation display module is specifically configured to: input the target facial image and the audio information into a voice animation installation package running locally on the terminal, to obtain the voice animation output by the voice animation installation package.
in one possible implementation manner, the synthesizing apparatus further includes an installation package generating module, where the installation package generating module includes:
a code module unit for acquiring a program code for acquiring a voice animation based on the target face image and the audio information;
the compiling unit is used for compiling the program codes by utilizing a cross tool chain to obtain a static library running aiming at a target operating system, wherein the cross tool chain is a cross compiling environment corresponding to a voice animation installation package to be generated;
and the definition unit is used for defining an external interface and a header file of the static library and generating a voice animation installation package.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method provided in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method as provided in the first aspect.
In a fifth aspect, an embodiment of the present invention provides a computer program, where the computer program includes computer instructions stored in a computer-readable storage medium, and when a processor of a computer device reads the computer instructions from the computer-readable storage medium, the processor executes the computer instructions, so that the computer device executes the steps of implementing the method provided in the first aspect.
According to the method and apparatus for synthesizing a voice animation, the electronic device, and the storage medium of the embodiments of the present application, an image acquisition control is displayed and, in response to a triggering operation on the image acquisition control, a facial image of the target user is acquired to obtain a target facial image, which lays the foundation for generating images of the target user's lip changes. Read-along information and a read-along control are displayed and, in response to a triggering operation on the read-along control, audio information input by the target user and corresponding to the read-along information is acquired. A voice animation including images of the target user's lip changes is then acquired and displayed, the lip changes being synchronized with the content of the audio information. In this way, a voice animation in which the user's own lip shape changes in step with the content of the audio information can be displayed while the user learns a language. Because the lip shapes in the voice animation are generated from preset sample lip shapes corresponding to standard pronunciation, the display effect of the voice animation is more vivid and the lip shapes are closer to standard pronunciation, which improves the interest and efficiency of language learning.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flow chart illustrating a method for synthesizing speech animation according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an embodiment of the present application updating an original lip shape to a lip shape corresponding to a different pronunciation;
FIG. 4a is a schematic diagram of an interface for capturing facial images according to an embodiment of the present disclosure;
FIG. 4b is a schematic diagram of a preparation interface before information is read according to an embodiment of the present disclosure;
FIG. 4c is a schematic diagram of an interface for reading information according to an embodiment of the present disclosure;
FIG. 4d is a schematic view of an interface for reading information according to another embodiment of the present application;
FIG. 4e is a schematic diagram of an interface after completing the read-after-follow according to an embodiment of the present disclosure;
FIG. 4f is a schematic view of an interface for voice animation as shown in an embodiment of the present application;
FIG. 5 is a schematic diagram of a face image of a model speaking English phonemes according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a process for obtaining a sequence of composite facial images according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech animation synthesis apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and are only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms referred to in this application will first be introduced and explained:
1) Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation, and the like.
2) Computer Vision technology (CV) is a science that studies how to make machines "see"; it uses cameras and computers, instead of human eyes, to identify, track, and measure targets and to perform further image processing, so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, and intelligent transportation, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
3) Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
4) Rhubarb Lip Sync is a command-line tool based on the CMU Sphinx speech recognition system; it matches phonemes to mouth shapes and outputs lip-sync animation data along a timeline.
The application provides a method, an apparatus, an electronic device and a computer-readable storage medium for synthesizing voice animation, which aim to solve the above technical problems in the prior art.
The following describes the technical solution of the present application and how to solve the above technical problems in detail by specific embodiments. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an implementation environment provided by an embodiment of the application is shown. The implementation environment may include: a terminal 110 and a server 120.
An application 111 is installed and runs on the terminal 110. The application may be a program for implementing voice animation synthesis, and when the terminal 110 runs the application 111, a user interface of the application 111 is displayed on the screen of the terminal 110. The application 111 may be a language teaching program, a multimedia entertainment program, a camera program, a social communication program, or the like; in this embodiment, the application 111 is described by taking a language teaching program as an example. The terminal 110 is the terminal used by the user 112. The application 111 displays an image acquisition control and, in response to a triggering operation on the image acquisition control, acquires a facial image of the user 112 to obtain a target facial image; it displays read-along information and a read-along control and, in response to a triggering operation on the read-along control, acquires audio information corresponding to the read-along information and input by the user; and it acquires and displays a voice animation including images of the user's lip changes, where the lip changes are synchronized with the content of the audio information and the images of the lip changes are obtained from the lips in the target facial image and the audio information.
Optionally, the terminal 110 may refer to one of multiple terminals, and this embodiment is only illustrated by the terminal 110. The device types of the terminal 110 include at least one of a smart phone, a tablet computer, an e-book reader, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a laptop portable computer, and a desktop computer.
Only one terminal is shown in fig. 1, but there are a plurality of other terminals that may access the server 120 in different embodiments. Optionally, there are one or more terminals corresponding to the developer, a development and editing platform for installing the application on the terminal, the developer may edit and update the application on the terminal, and transmit the updated application installation package to the server 120 through a wired or wireless network, and the terminal 110 may download the application installation package from the server 120 to implement the update of the application.
The first terminal 110 and other terminals are connected to the server 120 through a wireless network or a wired network.
The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), big data, and artificial intelligence platforms.
The method performed by the server in the embodiment of the present application may be implemented by means of cloud computing. Cloud computing is a computing mode that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". To users, the resources in the "cloud" appear to be infinitely expandable and can be obtained at any time, used on demand, expanded at any time, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform for short) is established, and multiple types of virtual resources are deployed in the resource pool for external customers to use as needed.
According to logical function division, a PaaS (Platform as a Service) layer may be deployed on the IaaS (Infrastructure as a Service) layer, and a SaaS (Software as a Service) layer may be deployed on the PaaS layer; SaaS may also be deployed directly on IaaS. PaaS is a platform on which software runs, such as a database or a web container. SaaS is various kinds of business software, such as a web portal or an SMS bulk sender. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
The server 120 is used to provide background services for the application. Optionally, the server 120 undertakes primary computational work and the terminals undertake secondary computational work; alternatively, the server 120 undertakes the secondary computing work and the terminal undertakes the primary computing work; alternatively, the server 120 and the terminal perform cooperative computing by using a distributed computing architecture.
In one illustrative example, the server 120 includes a memory 121, a processor 122, a user account database 123, an image processing service module 124, and an Input/Output Interface (I/O Interface) 125 for a user. The processor 122 is configured to load an instruction stored in the server 120, and process data in the user account database 123 and the image processing service module 124; the user account database 123 is configured to store data of a user account used by the terminal 110 and other terminals, such as a head portrait of the user account, a nickname of the user account, a learning record of the user account, and a service area where the user account is located; the image processing service module 124 is used for providing a plurality of teaching materials for the user to learn, such as spoken language teaching, word memory, song learning, and the like; the user-facing I/O interface 125 is used to establish communication with the first terminal 110 and/or the second terminal 130 through a wireless network or a wired network to exchange data.
Referring to fig. 2, a flow chart of a method for synthesizing a speech animation according to an embodiment of the present application is exemplarily shown, and the method includes:
s101, displaying an image acquisition control, responding to the triggering operation of the image acquisition control, acquiring a facial image of a target user, and obtaining a target facial image.
In the embodiment of the present application, the displayed image acquisition control is used to interact with the user. The manner in which the user triggers the image acquisition control is not specifically limited and may be a single click, a double click, a long press, a slide, or the like. After the user triggers the image acquisition control, a facial image of the target user is acquired by invoking an image acquisition device, such as a camera on, or a camera component connected to, the terminal, and an image containing a human face is captured in real time as the target facial image. The target facial image refers to a facial image whose quality meets a specified condition; for example, the specified condition may be that the sharpness of the face reaches a sharpness threshold.
Optionally, in order to adjust the lip shape of the target user to the lip shapes of different phonemes more quickly, the embodiment of the present application may photograph the target user while the face is expressionless or only slightly smiling, so that these states are captured in the target facial image. Further, after the facial image of the target user is collected, facial attribute detection may be performed on the target face in the image; for example, the facial attributes may include attributes such as emotion. If the emotion of the target user is determined to be smiling, expressionless, or the like, it is determined that the lip shape in the facial image meets the preset condition.
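As a sketch of this validity check (detect_face_attributes stands in for whatever face-attribute model is used; neither the function nor its label set is an API named by this application), the captured image is accepted only when the detected emotion suggests a relaxed, near-closed lip shape and the face is sharp enough:

```python
NEUTRAL_EMOTIONS = {"neutral", "smile"}  # assumed label set


def face_image_acceptable(face_image, min_sharpness: float = 0.6) -> bool:
    # Hypothetical detector returning an emotion label and a 0-1 sharpness score.
    emotion, sharpness = detect_face_attributes(face_image)
    return emotion in NEUTRAL_EMOTIONS and sharpness >= min_sharpness
```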
And S102, displaying read-along information and a read-along control, and, in response to a triggering operation on the read-along control, acquiring audio information input by the target user that corresponds to the read-along information.
In the embodiment of the present application, the execution order of steps S101 and S102 is not particularly limited; for example, step S101 may precede step S102, or step S102 may precede step S101. The read-along information is displayed so that the target user can read it aloud. Before the reading starts, in response to the triggering operation of the target user on the read-along control, the audio of the target user reading the read-along information is acquired by invoking an audio acquisition device, such as a microphone or another audio acquisition component on, or connected to, the terminal.
S103, acquiring and displaying a voice animation including images of the target user's lip changes, wherein the lip changes of the target user are synchronized with the content of the audio information, and the images of the lip changes are obtained from the lips in the target facial image and the audio information.
After the target facial image and the audio information are acquired, the target facial image can be processed, through computer vision techniques such as Rhubarb Lip Sync, into images corresponding to different lip shapes. In these images, the subject of the face is still the target user, but the lips are updated from the original lips in the target facial image to the lip shapes corresponding to different pronunciations.
Referring to fig. 3, which schematically illustrates how an embodiment of the present application updates the original lip shape to lip shapes corresponding to different pronunciations: as shown in the figure, the original lip shape is the lip shape when the user is not speaking, with the lips nearly closed. The lip shape then differs according to the characteristics of each pronunciation; for example, the lips open naturally for the pinyin sound a, are rounded for the sound o, are slightly flattened for the sound e, and are rounded and protrude into a small opening for the sound u.
After the images of different lips of the target user are obtained, the images of different lips are further synchronized with the content of the audio information, and then the voice animation of the images of the lip changes of the target user can be obtained.
According to the method for synthesizing a voice animation of the embodiment of the present application, an image acquisition control is displayed and, in response to a triggering operation on the image acquisition control, a facial image of the target user is acquired, which lays the foundation for generating images of the target user's lip changes and yields the target facial image. Read-along information and a read-along control are displayed and, in response to a triggering operation on the read-along control, audio information input by the target user and corresponding to the read-along information is acquired. A voice animation including images of the target user's lip changes is then acquired and displayed, the lip changes being synchronized with the content of the audio information, so that while the user learns a language, a voice animation in which the user's own lip shape changes in step with the content of the audio information can be displayed.
The method for synthesizing a voice animation according to the embodiment of the present application can be further understood with reference to fig. 4a to 4f.
Referring to fig. 4a, which exemplarily shows a schematic diagram of an interface for acquiring a facial image displayed in an embodiment of the present application: as shown in the figure, the dashed box 401 in the interface is used to guide the user to match the size and position of the face to the dashed box when shooting the facial image, so as to avoid the situation where the complete face of the user cannot be captured or the user's lip shape cannot be clearly recognized because the face is too small. The interface further includes a camera switching control 403; by operating, for example clicking, the camera switching control 403, the user can switch between the front camera and the rear camera of the terminal. When switching to the front camera, the captured target user generally is the user operating the terminal; when switching to the rear camera, the captured target user generally is not the user operating the terminal. The user completes the acquisition of the target facial image by clicking the image acquisition control 402 in the interface.
Fig. 4b exemplarily shows a schematic diagram of a preparation interface before the read-along information is displayed. In this interface, the prompt message 404 is used to prompt the user to get ready for reading along; the specific content of the prompt message may be "Ready?", or may be a message such as "Are you ready?" or "About to start", and so on. Also shown in this figure is a confirmation control 405; when the user operates the confirmation control 405, it indicates that the user is ready to read along. The confirmation control 405 in the embodiment of the present application may be operated by a single click, a double click, a long press, or a slide. Preferably, the confirmation control 405 is operated by a long press or a slide, which prevents a subsequent read-along from being started by an accidental touch. In the figure, the confirmation control 405 is triggered by sliding: when the user (indicated by a hand) holds the confirmation control 405 and moves it from the left side to the right side of the sliding slot 406, it is confirmed that the user is ready to read along.
Optionally, the preparation interface displayed before the read-along information in the embodiment of the present application may also enter the interface that displays the read-along information by way of a countdown, without displaying the confirmation control; the specific time information of the countdown may be shown in the preparation interface to remind the user to get ready to read along as soon as possible.
Fig. 4c exemplarily shows a schematic diagram of an interface displaying the read-along information in an embodiment of the present application. As shown in the figure, the read-along information display area 406 in the interface is used to display the read-along information. Besides text, the read-along information may include photos, pictures, animations, or other information corresponding to the text, so as to improve the user's comprehension and make the read-along more interesting; for example, the text in the read-along information display area in the figure is "dog", and a photo of a dog is also shown. When the user triggers an operation on the read-along control 407, collection of the audio information corresponding to the read-along information input by the target user starts. When there are multiple pieces of read-along information, the interface further includes a read-along entry indication area 408 that shows the position of the current read-along information within all of the read-along information; taking fig. 4c as an example, it can be seen from the read-along entry indication area 408 that the read-along information "dog" is the first piece of read-along information.
Fig. 4d exemplarily shows a schematic diagram of an interface displaying the read-along information in another embodiment of the present application. Compared with fig. 4c, in fig. 4d a new piece of read-along information, "cat", and a picture of a cat are displayed in the read-along information display area 406, and the corresponding read-along entry indication area 408 is updated to indicate the second piece of read-along information. Furthermore, the first piece of read-along information is marked as completed by a five-pointed star pattern. It can be understood that, in practical applications, read-along information that has been completed may also be identified in other forms, which is not limited in the embodiment of the present application.
Fig. 4e exemplarily shows a schematic diagram of an interface displayed after the read-along is completed in an embodiment of the present application. As shown in the figure, multiple display controls 409 are shown in the interface, and each display control corresponds to one piece of completed read-along information. In addition to showing the corresponding completed read-along information, each display control 409 further includes a play control 4091 and a first re-read control 4092. When the user triggers an operation on the play control 4091, the audio information recorded by the user for the corresponding read-along information is played; when the user triggers an operation on the re-read control 4092, the interface returns to displaying that piece of read-along information, and when the user reads it again, the newly acquired audio information overwrites the previously acquired audio information. The interface also shows a second re-read control 410 and an animation generation control 411. When the user triggers the second re-read control 410, the user needs to re-read all of the read-along information, and the interface may jump to the interface of the first piece of read-along information. When the user triggers the animation generation control 411, the voice animation including images of the target user's lip changes is generated from the target facial image and the audio information. Because this process is time-consuming, a transition screen may be displayed before jumping to the interface that shows the voice animation; the transition screen may display text such as "Generating video" and may show the percentage progress of video generation, so that the user knows the generation progress of the voice animation more precisely.
Fig. 4f schematically shows a schematic diagram of an interface displaying the voice animation in an embodiment of the present application. As shown in the figure, the interface includes a playing area 412 for playing the voice animation; by triggering an operation on the playing area 412, playback of the voice animation can be started or paused. As can be seen from the voice animation in the figure, the voice animation includes not only the facial image of the user but also a preset background material; in the figure, the background material shows a virtual character bearing the facial image of the target user dancing with a little bear. By fusing the images of the target user's lip changes with the material, a more entertaining voice animation can be obtained. The interface shown in fig. 4f further includes an avatar replacement control 413, a voice re-read control 414, a sharing control 415, and a saving control 416. Specifically, when the avatar replacement control 413 is triggered, the interface jumps to the interface shown in fig. 4a to re-acquire the facial image; when the voice re-read control 414 is triggered, the interface jumps to the interface of the first piece of read-along information to re-acquire the audio information; when the sharing control 415 is triggered, other applications with sharing permission are displayed so that the voice animation can be shared to them; and when the saving control 416 is triggered, the voice animation is saved locally on the terminal.
On the basis of the above embodiments, as an alternative embodiment, the method for obtaining the voice animation includes:
s201, obtaining at least one sample lip shape, updating the lip shape of a target user in the target face image according to the sample lip shape, obtaining a synthesized face image, wherein the lip shape of the target user in the synthesized face image is used for expressing the pronunciation of a phoneme expressed by the corresponding sample lip shape;
the lip shape of the sample in the embodiment of the application is used for expressing the pronunciation of at least one phoneme, and specifically, a face image when the phoneme of the model is sounded can be collected in advance, and then the lip shape area is cut from the face image to obtain the sample image. Please refer to fig. 5, which schematically shows a schematic diagram of a face image when a model vocalizes english phonemes in an embodiment of the present application, and fig. 5 shows a schematic diagram of 9 lips of the model in total, specifically:
the class A lips are those that produce consonants "P", "B" and "M", much as the O-shaped mouth, with only slight pressure between the lips.
The B-type lip is the lip when consonants such as 'K', 'S' and 'T' are sounded, the mouth needs to be opened and the clenched clenches are required when the consonants such as 'K', 'S' and 'T' are sounded, and the B-type lip can be used for some vowels such as 'EE' in bee.
The C-type lip is a lip when a vowel such as "EH" and "AE" is uttered, and is also used for some consonants depending on the context, and is also used as an intermediate position when transitioning from the a-type lip or the B-type lip to the D-type lip.
The lip shape of class D is a lip shape when a vowel such as "AA" in vocal gather is used.
The class E lip is a roundish mouth, which is the lip at, for example, "AO" in phoneoff and "ER" in bird, and the class E lip has a smaller mouth opening than the class C lip, and also serves as an intermediate position when transitioning from the class C or D lip to the class F lip.
The lip shape of class F is a wrinkled lip shape, and is the lip shape when the sound is emitted from youu as "UW", show as "OW" and way as "W".
The class G lip, the upper tooth contacting the lower lip, is the lip when "F" in phonation for and "V" in very.
The lip of the H-type is used for sounding the long L-shaped sound, the tongue is lifted behind the upper teeth, and the mouth is at least as long as the lip of the C-type but less than the mouth of the D-type. Lips when "P", "B" and "M" are sounded.
The lip of the X-type, which is generally present in a free position in the pronunciation for speech pauses, is almost identical to the lip of the a-type, but the pressure between the lips is somewhat less, because the lips should be in a closed and relaxed state at this time.
In general, a lip may include a plurality of key points, referred to in the embodiments of the present application as "lip key points," that describe the profile of the lip. As an implementation, the key points may be distributed on the contour line of the lip, in particular, at the two corners of the mouth, the outer edges of the upper and lower lips, and the edge inside the lips. Other numbers of keypoints may be employed in addition to this example.
In the embodiment of the present application, information such as the distance relations, angle relations, and degree of opening of the lip key points in the sample lip shape, and the proportion of the model's lips to the model's face, is determined as the lip feature of the corresponding sample lip shape. The lip key points of the target user in the target facial image are then adjusted according to this lip feature, so that the similarity between the adjusted lip feature of the target user and the lip feature of the sample lip shape meets a preset condition, and the synthesized facial image is thereby obtained.
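A minimal sketch of this key-point adjustment, assuming the lip key points are already available as (x, y) coordinates; the feature set and the blending step below are simplifications of the description above, not the exact procedure of the application.

```python
import numpy as np


def lip_features(pts: np.ndarray, face_width: float) -> np.ndarray:
    # pts: (N, 2) lip key points (mouth corners, outer and inner lip edges).
    width = pts[:, 0].max() - pts[:, 0].min()    # mouth width
    height = pts[:, 1].max() - pts[:, 1].min()   # degree of opening
    return np.array([width / face_width, height / face_width, height / width])


def adjust_lip(target_pts: np.ndarray, sample_pts: np.ndarray,
               target_face_w: float, sample_face_w: float,
               alpha: float = 1.0) -> np.ndarray:
    # Scale the sample lip to the proportions of the target face, centre it on
    # the target mouth, then move the target key points toward it; alpha = 1.0
    # copies the sample shape, smaller values blend the two shapes.
    sample_scaled = sample_pts * (target_face_w / sample_face_w)
    sample_scaled += target_pts.mean(axis=0) - sample_scaled.mean(axis=0)
    return (1 - alpha) * target_pts + alpha * sample_scaled
```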
S202, carrying out voice recognition on the audio information to obtain a phoneme sequence of the audio information, wherein the phoneme sequence comprises phonemes corresponding to at least one time point in the audio information.
In practice, phonemes are the smallest units of speech, divided according to the natural properties of speech; from an acoustic point of view, a phoneme is the smallest unit of speech divided according to sound quality. Taking Mandarin Chinese as an example, the syllable ā contains one phoneme, ài contains two phonemes, dāi contains three phonemes, and so on. Specifically, in the embodiment of the present application, the speech signal is collected in the time domain; to facilitate analysis, the time-domain speech signal is converted into a frequency-domain speech signal, from which acoustic features are extracted. The acoustic features are input into a pre-trained phoneme recognition model to obtain a phoneme sequence, where the phoneme sequence is a sequence consisting of a plurality of phonemes.
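A sketch of this recognition step, assuming librosa for acoustic feature extraction and a pre-trained frame-level classifier; phoneme_model and the PHONEMES label list are hypothetical placeholders for the pre-trained phoneme recognition model mentioned above.

```python
import librosa
import numpy as np

HOP = 160  # 10 ms hop length at a 16 kHz sampling rate


def recognize_phonemes(audio: np.ndarray, sr: int = 16000):
    # Frame-level acoustic features from the time-domain signal.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=HOP)
    # Hypothetical model: per-frame posterior over the phoneme classes.
    posteriors = phoneme_model.predict(mfcc.T)        # shape: (frames, phonemes)
    frame_ids = posteriors.argmax(axis=1)
    # Collapse runs of identical frames into (phoneme, start_s, end_s) segments.
    segments, start = [], 0
    for i in range(1, len(frame_ids) + 1):
        if i == len(frame_ids) or frame_ids[i] != frame_ids[start]:
            segments.append((PHONEMES[frame_ids[start]],
                             start * HOP / sr, i * HOP / sr))
            start = i
    return segments
```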
S203, synthetic face images corresponding to the phonemes in the phoneme sequence are determined, and a synthetic face image sequence of the lip changes of the target user is obtained according to the synthetic face images.
Specifically, the embodiment of the present application may determine a lip shape corresponding to each phoneme in the phoneme sequence to obtain a lip shape sequence, then determine a synthetic face image corresponding to each lip shape in the lip shape sequence, and arrange the determined synthetic face images according to an arrangement order of the phonemes in the phoneme sequence, so as to obtain a synthetic face image sequence of the lip shape change of the target user.
Referring to fig. 6, which schematically illustrates a flow of obtaining the synthesized facial image sequence in an embodiment of the present application: as shown in the figure, each vertical bar (e.g. 611) represents a phoneme, and all vertical bars are ordered according to the time at which the phoneme occurs in the audio information, forming the phoneme sequence 610. The synthesized facial image set 620 contains the synthesized facial images obtained in advance for the phonemes, including the synthesized facial images built from the nine sample lip shapes shown in fig. 5, denoted A' to H' and X'. A corresponding synthesized facial image can therefore be determined for each phoneme in the phoneme sequence 610, yielding the synthesized facial image sequence 630; in this sequence, the first synthesized facial image is A', the synthesized facial image corresponding to the first phoneme in the phoneme sequence.
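The phoneme-to-image mapping itself can be a simple lookup. The sketch below assumes the nine lip-shape classes of fig. 5 and an illustrative, non-authoritative assignment of a few phonemes to those classes; phoneme_segments is the timed phoneme sequence and synthetic_faces maps each class to the target user's synthesized facial image for that class.

```python
# Illustrative phoneme-to-lip-class assignment, following the classes of fig. 5.
PHONEME_TO_LIP = {
    "P": "A", "B": "A", "M": "A",
    "K": "B", "S": "B", "T": "B", "EE": "B",
    "EH": "C", "AE": "C",
    "AA": "D",
    "AO": "E", "ER": "E",
    "UW": "F", "OW": "F", "W": "F",
    "F": "G", "V": "G",
    "L": "H",
    "sil": "X",  # pause in speech / idle position
}


def face_image_sequence(phoneme_segments, synthetic_faces):
    # Map every timed phoneme first to a lip class, then to the synthesized
    # facial image of the target user prepared for that class.
    lip_sequence = [PHONEME_TO_LIP.get(p, "X") for p, _, _ in phoneme_segments]
    return [synthetic_faces[lip] for lip in lip_sequence]
```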
S204, acquiring an audio frame sequence corresponding to the phoneme sequence in the audio information, and synchronizing the audio frame sequence and the synthesized face image sequence according to the time information of the audio frame sequence in the audio information to generate the voice animation.
In order to ensure that the target user's lip changes are synchronized with the sound, the embodiment of the present application needs to determine the audio frame sequence corresponding to the phoneme sequence in the audio information. Generally, one audio frame in the audio information corresponds to one phoneme, so the audio frame sequence corresponding to the phoneme sequence is unique. According to the time information of the audio frame sequence within the audio information, the duration of each lip shape of the target user can then be determined, and the voice animation is obtained.
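A sketch of this timing step, assuming per-phoneme start and end times from the recognition stage and a fixed video frame rate; it reuses the illustrative PHONEME_TO_LIP lookup from the previous sketch, the frame rate is an arbitrary assumption, and muxing the resulting frames with the audio track (for example via ffmpeg) is omitted.

```python
def video_frames(phoneme_segments, synthetic_faces, fps: int = 25):
    # phoneme_segments: list of (phoneme, start_s, end_s) in playback order.
    # Each synthesized facial image is held on screen for the duration of its
    # phoneme, which keeps the lip changes synchronized with the audio content.
    frames = []
    for phoneme, start, end in phoneme_segments:
        face = synthetic_faces[PHONEME_TO_LIP.get(phoneme, "X")]
        frames.extend([face] * max(1, round((end - start) * fps)))
    return frames
```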
As an alternative embodiment, step S201 above may be performed by the server shown in fig. 1, while steps S202 to S204 are performed by the terminal. That is, after the terminal acquires the target facial image, the target facial image is sent to the server; the server updates the lip shape of the target user in the target facial image according to the sample lip shapes acquired in advance, obtains the synthesized facial images, and returns the synthesized facial images to the terminal. The terminal performs voice recognition on the audio information locally to obtain the phoneme sequence of the audio information, determines the synthesized facial image corresponding to each phoneme in the phoneme sequence, and obtains the synthesized facial image sequence of the target user's lip changes from the synthesized facial images. The terminal then acquires the audio frame sequence corresponding to the phoneme sequence in the audio information and synchronizes the audio frame sequence with the synthesized facial image sequence according to the time information of the audio frame sequence in the audio information, thereby generating the voice animation.
On the basis of the above embodiments, as an alternative embodiment, after obtaining the synthetic face image sequence, the method further includes:
and respectively fusing the synthesized face images in the synthesized face image sequence with preset material images to obtain a fused image sequence.
In order to further increase the interest of the voice animation, the embodiment of the application further needs to fuse the synthesized face images in the synthesized face image sequence with the preset material images, and it can be understood that the specific fusion mode may be to superimpose the synthesized face images at preset positions in the material images.
Synchronizing the sequence of audio frames and the sequence of synthetic facial images to obtain a speech animation, comprising: and synchronizing the audio frame sequence and the fusion image sequence to obtain the voice animation.
It should be understood that, since each frame of the fused image in the fused image sequence corresponds to the synthesized face image one to one, the audio frame sequence and the fused image sequence may be synchronized by the above method to obtain the speech animation, which is not described in detail in the embodiments of the present application.
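As an illustration of the fusion described above, the following sketch pastes each synthetic face image onto a preset material image at a fixed position using Pillow; the position, sizes and the use of Pillow are assumptions for illustration only.

```python
# Minimal sketch: overlay each synthetic face image at a preset position in
# the material image to obtain the fused image sequence.

from PIL import Image

def fuse_with_material(face_images, material_path, position=(100, 80)):
    material = Image.open(material_path).convert("RGBA")
    fused_sequence = []
    for face in face_images:                      # face: PIL.Image
        canvas = material.copy()
        face_rgba = face.convert("RGBA")
        # paste the synthetic face at the preset position, using its alpha mask
        canvas.paste(face_rgba, position, face_rgba)
        fused_sequence.append(canvas)
    return fused_sequence
```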
On the basis of the foregoing embodiments, as an optional embodiment, performing speech recognition on audio information to obtain a phoneme sequence of the audio information includes:
S301, obtaining an initial translation text of the audio information, and determining the language of the audio information according to the initial translation text;
According to the embodiment of the application, the initial translation text of the audio information can be obtained through a voice recognition technology. Because some characters/words in certain languages are polyphones, and polyphones can make the translation result of the audio information inaccurate, the language of the audio information is further determined; optionally, the target language is English.
S302, if the language of the audio information is the target language, acquiring a word segmentation result to be corrected and the polyphones in the word segmentation result from the initial translation text;
S303, screening the correct polyphone characters from the polyphones, and filling the correct polyphone characters into the word segmentation result to be corrected to obtain a correct word segmentation result;
S304, obtaining the standard pronunciation of the correct word segmentation result, and carrying out phoneme recognition on the standard pronunciation through a preset acoustic model to obtain a phoneme sequence of the audio information.
The acoustic model is obtained by training on audio that meets pronunciation conditions, and has the capability of performing phoneme recognition on speech; that is, the acoustic model is used for calculating the posterior probability that an acoustic feature belongs to each phoneme.
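A minimal sketch of the S301-S304 pipeline; every callable and the pronunciation dictionary passed in (speech recognizer, language detector, word segmenter, polyphone selector, acoustic model) is a hypothetical placeholder for the components described in the text, not an actual API.

```python
# Minimal sketch of steps S301-S304 wired together; the caller supplies all
# components, which are placeholders for the ones described in the patent.

def audio_to_phoneme_sequence(audio, transcribe, detect_language, segment,
                              choose_polyphone, pronunciation_dict,
                              recognize_phonemes, target_language="en"):
    initial_text = transcribe(audio)              # S301: initial translation text
    language = detect_language(initial_text)
    if language == target_language:
        # S302/S303: segment, then fill in the contextually correct polyphone readings
        words = [choose_polyphone(word) for word in segment(initial_text)]
    else:
        words = initial_text.split()
    # S304: look up the standard pronunciation and run phoneme recognition on it
    standard_pronunciation = [pronunciation_dict.get(word, word) for word in words]
    return recognize_phonemes(standard_pronunciation)
```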
The embodiment of the invention also provides a method for creating the acoustic model, which comprises the following steps:
S401, obtaining a training sample, wherein the training sample is audio of standard pronunciation; the standard pronunciation is audio with good pronunciation conditions, where pronunciation conditions cover factors such as pronunciation clarity, speaking speed, and the like.
S402, framing the audio of the training sample, and extracting the features of the framed audio to obtain audio features;
S403, generating a phoneme label for the audio of the training sample;
S404, matching the audio features with the phoneme labels to obtain a processed training sample;
S405, carrying out iterative training on the processed training sample through a neural network model to obtain the acoustic model.
The training samples may be several hours (e.g., 100 hours) of well-pronounced audio. The audio is first framed and then the audio features are extracted. For example, the audio is divided into 25 ms frames with a 10 ms frame shift, and each frame is characterized by 40-dimensional Mel-Frequency Cepstral Coefficients (MFCCs).
After the audio features are extracted, the audio text is expanded into phonemes according to a dictionary, and the frames are divided evenly over time and labeled with phoneme labels. After the audio features are matched with the phoneme labels, the initial model is trained; when the iteration reaches a certain number of rounds, the training is stopped and the final acoustic model is obtained.
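A minimal sketch of the feature extraction described above (25 ms frames, 10 ms shift, 40-dimensional MFCCs), assuming librosa is available and a 16 kHz sample rate; the embodiment does not name a specific library, so this is for illustration only.

```python
# Minimal sketch of the feature-extraction step in S402: 25 ms frames with a
# 10 ms shift, 40-dimensional MFCCs per frame.

import librosa

def extract_mfcc_features(wav_path, sample_rate=16000):
    audio, sr = librosa.load(wav_path, sr=sample_rate)
    frame_length = int(0.025 * sr)   # 25 ms -> 400 samples at 16 kHz
    hop_length = int(0.010 * sr)     # 10 ms -> 160 samples at 16 kHz
    # shape: (40, num_frames); transpose so each row is one frame's features
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40,
                                n_fft=frame_length, hop_length=hop_length)
    return mfcc.T
```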
On the basis of the above embodiments, as an alternative embodiment, obtaining the initial translation text of the audio information includes:
S501, detecting and eliminating the direct current offset in the audio information, and resampling the audio information after the direct current offset is eliminated to obtain the resampled audio information.
Audible artifacts are perceptible noise introduced into the sound output from an audio device, often caused by the operation of the audio device itself. Audible artifacts are generally undesirable and represent a deviation from the fidelity of the audio input to the device. A click is a particular type of audible artifact generated by the speaker, typically caused by a sharp transient voltage (e.g., a DC offset across the speaker that may occur when the audio power amplifier transitions between operating modes, such as a power-down mode and a power-on mode). According to the embodiment of the application, audible artifacts, including clicks, can be eliminated by detecting the direct current offset in the audio information, so that clearer audio information is obtained. Specifically, the embodiment of the present application may use a digital filter to perform DC offset cancellation on the audio information, for example an infinite impulse response (IIR) high-pass filter.
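A minimal sketch of step S501 using SciPy, assuming a Butterworth IIR high-pass filter for DC removal and polyphase resampling; the cutoff frequency, filter order and sample rates are illustrative assumptions rather than values from the embodiment.

```python
# Minimal sketch of step S501: remove DC offset with an IIR high-pass filter,
# then resample the audio to the target rate.

import numpy as np
from scipy import signal

def remove_dc_and_resample(audio, orig_sr=44100, target_sr=16000, cutoff_hz=20.0):
    # 2nd-order Butterworth high-pass: blocks DC (0 Hz) and very low frequencies
    b, a = signal.butter(2, cutoff_hz, btype="highpass", fs=orig_sr)
    audio_no_dc = signal.lfilter(b, a, audio)
    # polyphase resampling from orig_sr to target_sr
    gcd = np.gcd(orig_sr, target_sr)
    return signal.resample_poly(audio_no_dc, target_sr // gcd, orig_sr // gcd)
```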
S502, carrying out voice detection on the resampled audio information to obtain a voice audio frame in the audio information.
Specifically, the embodiment of the application can detect the human voice in the resampled audio information through a pre-trained human voice detection model. For example, the audio data is input to the human voice detection model, the model performs human voice detection on the audio data, and a detection result is output; the detection result may include the audio time periods in which human voice occurs, the human voice audio frames, and the like. In practical application, other parameters may be included as needed, which is not limited in this embodiment.
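Since the pre-trained human voice detection model itself is not specified, the following stand-in sketch flags frames by short-time energy; the frame size and threshold are assumptions, and a real implementation would substitute the trained model for this heuristic.

```python
# Minimal stand-in sketch for step S502: mark spans whose short-time energy
# exceeds a threshold as containing human voice.

import numpy as np

def detect_voice_frames(audio, sr=16000, frame_ms=25, hop_ms=10, threshold=1e-3):
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    voiced = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        frame = audio[start:start + frame_len]
        energy = float(np.mean(frame ** 2))   # short-time energy of this frame
        if energy > threshold:
            voiced.append((start / sr, (start + frame_len) / sr))
    return voiced   # list of (start_sec, end_sec) spans judged to contain voice
```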
S503, carrying out voice recognition on the human voice audio frame to obtain an initial translation text.
On the basis of the above embodiments, as an alternative embodiment, acquiring and displaying a voice animation including an image of a lip change of a target user includes:
and inputting the target face image and the audio information into a voice animation installation package running locally at the terminal to obtain voice animation output by the voice animation installation package.
According to the embodiment of the application, the voice animation installation package is compiled and run locally at the terminal, so that the voice animation of the image with the lip shape change of the target user can be generated locally at the terminal, overcoming the drawbacks that Rhubarb Lip Sync, as a command line tool, cannot be applied to a mobile terminal, is only suitable for developers, and has a high usage threshold.
The voice animation installation package is generated through the following steps:
S601, acquiring a program code for acquiring a voice animation according to the target face image and the audio information;
and S602, compiling the program code by using a cross tool chain to obtain a static library running for the target operating system, wherein the cross tool chain is a cross compiling environment corresponding to the voice animation installation package to be generated.
The cross tool chain in the embodiment of the application is a cross compiling environment corresponding to the voice animation installation package to be generated, and the target platform can comprise an android platform and an ios platform. That is, the cross tool chain in the embodiment of the present invention implements cross platform compilation.
In the compiling process, all functions and code are completed on a Linux machine, that is, under the Linux operating system, while the user terminal runs the android platform or the ios platform; therefore, cross-platform compiling is needed for the android and ios systems, which are based on unix-like operating systems.
Cross-platform compiling first builds the cross-compiling environment, namely installing and configuring the cross-compiling tool chain. The operating system components, application programs, and the like required by the target system are compiled in this environment and then uploaded to the target machine. The android and ios versions require separate cross-compiling environments.
And defining an external interface and a header file of the static library, and generating a voice animation installation package.
The method comprises the steps of compiling the program code of the voice animation into static libraries capable of running on the android and ios development platforms in a cross-platform mode, then designing the android and ios interface parts, and defining the external interfaces and header files of the static libraries, so that android and ios can call the local voice animation synthesis function.
After the voice animation installation package is generated in the embodiment of the invention, the voice animation installation package needs to be tested, and then whether the voice animation installation package is released or debugged and modified is judged according to a test result. The testing process refers to comparing a testing result obtained when the voice animation installation package is executed at the local terminal with a testing result of the server side, and judging whether the installation package is available or not.
The testing part mainly judges whether the local voice animation synthesis function is available, where availability covers three aspects: stability, accuracy and low delay. Stability means that the program does not crash midway; local voice animation synthesis is run on one thousand target face images and voice inputs, and no crash is found. Accuracy means that the result of the locally synthesized voice animation shows no deviation from the result of the voice animation synthesized by the server; one thousand voice animations are used for testing, and the average deviation score is within a tolerable range. The delay test counts the deviation between the local voice animation synthesis delay and the server-side voice animation synthesis delay; the local average delay turns out to be obviously lower than the average delay of the server. The voice animation installation package generated in the embodiment of the invention can run on the user terminal without depending on a network.
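As an illustration of the delay comparison described above, a minimal sketch that times the same test cases through a local and a server-side synthesis callable; both callables are hypothetical placeholders for the local installation package and the server interface.

```python
# Minimal sketch of the latency part of the availability test: compare average
# synthesis delay of local and server-side synthesis over the same test cases.

import time

def compare_average_latency(test_cases, synthesize_local, synthesize_on_server):
    def average_delay(synthesize):
        total = 0.0
        for face_image, audio in test_cases:
            start = time.perf_counter()
            synthesize(face_image, audio)
            total += time.perf_counter() - start
        return total / len(test_cases)

    local_avg = average_delay(synthesize_local)
    server_avg = average_delay(synthesize_on_server)
    return local_avg, server_avg, local_avg < server_avg
```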
The embodiment of the present application provides a speech animation synthesis apparatus, and as shown in fig. 7, the apparatus may include: the target facial image acquisition module 101, the audio information acquisition module 102 and the voice animation display module 103 specifically:
the target facial image acquisition module 101 is used for displaying an image acquisition control, responding to the triggering operation of the image acquisition control, and acquiring a facial image of a target user to obtain a target facial image;
the audio information acquisition module 102 is configured to display the read-after information and the read-after control, and acquire, in response to a trigger operation of the read-after control, audio information corresponding to the read-after information and input by a target user;
and the voice animation display module 103 is used for acquiring and displaying the voice animation comprising the image of the lip changes of the target user, where the lip changes of the target user are synchronized with the content of the audio information, and the image of the lip changes of the target user is obtained according to the lip in the target face image and the audio information.
The speech animation synthesis apparatus provided in the embodiment of the present invention specifically executes the process of the above method embodiments; please refer to the content of the above speech animation synthesis method embodiments for details, which are not repeated here. The voice animation synthesis device provided by the embodiment of the invention displays the image acquisition control and, in response to the triggering operation of the image acquisition control, collects the facial image of the target user to obtain the target facial image, laying a foundation for generating the image of the lip shape change of the target user; it displays the read-after information and the read-after control and, in response to the triggering operation of the read-after control, collects the audio information input by the target user and corresponding to the read-after information; it then acquires and displays the voice animation comprising the image of the lip shape change of the target user, where the lip shape change of the target user is synchronized with the content of the audio information, so that a voice animation in which the user's lip shape changes synchronously with the content of the audio information can be displayed when the user learns a language.
On the basis of the above embodiments, as an alternative embodiment, the synthesizing apparatus further includes: the voice animation synthesis module specifically comprises:
a synthesized face image generation submodule for obtaining at least one sample lip for expressing the pronunciation of at least one phoneme; updating the lip shape of the target user in the target face image according to the sample lip shape to obtain a synthesized face image, wherein the lip shape of the target user in the synthesized face image is used for expressing the pronunciation of the phoneme expressed by the corresponding sample lip shape;
the voice recognition submodule is used for carrying out voice recognition on the audio information to obtain a phoneme sequence of the audio information, and the phoneme sequence comprises phonemes corresponding to at least one time point in the audio information;
the image sequence submodule is used for determining a synthetic face image corresponding to each phoneme in the phoneme sequence and obtaining a synthetic face image sequence of lip changes of the target user according to each synthetic face image;
and the synchronization submodule is used for acquiring the audio frame sequence corresponding to the phoneme sequence in the audio information, synchronizing the audio frame sequence and the synthesized facial image sequence according to the time information of the audio frame sequence in the audio information, and generating the voice animation.
On the basis of the above embodiments, as an alternative embodiment, the image sequence sub-module includes:
the lip sequence unit is used for determining lips corresponding to all phonemes in the phoneme sequence to obtain a lip sequence;
and the image corresponding unit is used for determining a synthetic face image corresponding to each lip in the lip sequence.
On the basis of the foregoing embodiments, as an alternative embodiment, the speech animation synthesis module further includes:
the fusion sequence submodule is used for fusing the synthesized face images in the synthesized face image sequence with preset material images respectively to obtain a fusion image sequence;
correspondingly, the synchronization submodule is used for synchronizing the audio frame sequence and the fusion image sequence to obtain the voice animation.
On the basis of the above embodiments, as an alternative embodiment, the speech recognition sub-module includes:
the initial translation unit is used for obtaining an initial translation text of the audio information and determining the language of the audio information according to the initial translation text;
the word segmentation unit is used for acquiring a word segmentation result to be corrected and polyphones in the word segmentation result from the initial translation text if the language of the audio information is determined to be the target language;
the calibration unit is used for screening out correct polyphone characters from the polyphone characters, filling the correct polyphone characters into word segmentation results to be corrected, and obtaining correct word segmentation results;
and the phoneme recognition unit is used for acquiring the standard pronunciation of the correct word segmentation result, and performing phoneme recognition on the standard pronunciation through a preset acoustic model to obtain a phoneme sequence of the audio information.
On the basis of the above embodiments, as an alternative embodiment, the initial translation unit includes:
the preprocessing unit is used for detecting and eliminating the direct current offset in the audio information and resampling the audio information after the direct current offset is eliminated to obtain the resampled audio information;
the voice detection unit is used for carrying out voice detection on the audio information after resampling to obtain a voice audio frame in the audio information;
and the voice recognition unit is used for carrying out voice recognition on the human voice audio frame to obtain an initial translation text.
On the basis of the above embodiments, as an optional embodiment, the voice animation display module is specifically configured to: inputting the target face image and the audio information into a voice animation installation package running locally at a terminal to obtain voice animations output by the voice animation installation package;
on the basis of the foregoing embodiments, as an optional embodiment, the synthesizing apparatus further includes an installation package generating module, where the installation package generating module includes:
a code module unit for acquiring a program code for acquiring a voice animation based on the target face image and the audio information;
the compiling unit is used for compiling the program codes by utilizing a cross tool chain to obtain a static library running aiming at a target operating system, wherein the cross tool chain is a cross compiling environment corresponding to a voice animation installation package to be generated;
and the definition unit is used for defining an external interface and a header file of the static library and generating a voice animation installation package.
An embodiment of the present application provides an electronic device, including: a memory and a processor; at least one program is stored in the memory for execution by the processor, and when executed by the processor, implements the following: displaying the image acquisition control and, in response to the triggering operation of the image acquisition control, acquiring the facial image of the target user to obtain the target facial image, laying a foundation for generating the image of the lip shape change of the target user; displaying the follow-up reading information and the follow-up reading control and, in response to the triggering operation of the follow-up reading control, acquiring the audio information input by the target user and corresponding to the follow-up reading information; and acquiring and displaying the voice animation comprising the image of the lip shape change of the target user, where the lip shape change of the target user is synchronized with the content of the audio information, so that a voice animation in which the user's lip shape changes synchronously with the content of the audio information can be displayed when the user learns a language.
In an alternative embodiment, an electronic device is provided, as shown in fig. 8, the electronic device 4000 shown in fig. 8 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. It should be noted that the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (field programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computing function, e.g., comprising one or more microprocessors, a combination of DSPs and microprocessors, etc.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
The embodiment of the present application provides a computer readable storage medium, on which a computer program is stored; when the computer program runs on a computer, the computer is enabled to execute the corresponding content in the foregoing method embodiments. Compared with the prior art, the method displays the image acquisition control and, in response to the triggering operation of the image acquisition control, acquires the facial image of the target user, laying a foundation for generating the image of the lip shape change of the target user and obtaining the target facial image; it displays the follow-up reading information and the follow-up reading control and, in response to the triggering operation of the follow-up reading control, acquires the audio information input by the target user and corresponding to the follow-up reading information; it then acquires and displays the voice animation comprising the image of the lip shape change of the target user, where the lip shape change of the target user is synchronized with the content of the audio information, so that a voice animation in which the user's lip shape changes synchronously with the content of the audio information can be displayed when the user learns a language. Since the lip shape of the user in the voice animation is generated according to the preset sample lip shapes corresponding to standard pronunciation, the display effect of the voice animation is better and the lip shapes are closer to those of standard pronunciation; the user can therefore accurately practice lip shapes and pronunciation according to the voice animation, which improves the interest and efficiency of language learning.
Embodiments of the present application provide a computer program, which includes computer instructions stored in a computer-readable storage medium; when a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, the computer device executes the contents shown in the foregoing method embodiments. Compared with the prior art, the method displays the image acquisition control and, in response to the triggering operation of the image acquisition control, acquires the facial image of the target user, laying a foundation for generating the image of the lip shape change of the target user and obtaining the target facial image; it displays the follow-up reading information and the follow-up reading control and, in response to the triggering operation of the follow-up reading control, acquires the audio information input by the target user and corresponding to the follow-up reading information; it then acquires and displays the voice animation comprising the image of the lip shape change of the target user, where the lip shape change of the target user is synchronized with the content of the audio information, so that a voice animation in which the user's lip shape changes synchronously with the content of the audio information can be displayed when the user learns a language. Since the lip shape of the user in the voice animation is generated according to the preset sample lip shapes corresponding to standard pronunciation, the display effect of the voice animation is better and the lip shapes are closer to those of standard pronunciation; the user can therefore accurately practice lip shapes and pronunciation according to the voice animation, which improves the interest and efficiency of language learning.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing is only a partial embodiment of the present invention. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for synthesizing voice animation, comprising:
displaying an image acquisition control, and acquiring a facial image of a target user in response to the triggering operation of the image acquisition control to obtain a target facial image;
displaying follow-up reading information and a follow-up reading control, and collecting audio information which is input by the target user and corresponds to the follow-up reading information in response to the triggering operation of the follow-up reading control;
and acquiring and displaying voice animation comprising the image of the lip changes of the target user, wherein the lip changes of the target user are synchronized with the content of the audio information, and the image of the lip changes of the target user is obtained according to the lip in the target face image and the audio information.
2. The method for synthesizing speech animation according to claim 1, wherein the manner of obtaining the speech animation comprises:
obtaining at least one sample lip for expressing a pronunciation of at least one phoneme; updating the lip shape of the target user in the target face image according to the sample lip shape to obtain a synthesized face image, wherein the lip shape of the target user in the synthesized face image is used for expressing the pronunciation of the phoneme expressed by the corresponding sample lip shape;
performing voice recognition on the audio information to obtain a phoneme sequence of the audio information, wherein the phoneme sequence comprises a phoneme corresponding to at least one time point in the audio information;
determining a synthetic face image corresponding to each phoneme in the phoneme sequence, and obtaining a synthetic face image sequence of lip changes of a target user according to each synthetic face image;
and acquiring an audio frame sequence corresponding to the phoneme sequence in the audio information, and synchronizing the audio frame sequence and the synthesized facial image sequence according to the time information of the audio frame sequence in the audio information to generate the voice animation.
3. The method of synthesizing a speech animation according to claim 2, wherein the determining a synthesized face image corresponding to each phoneme in the sequence of phonemes includes:
determining a lip shape corresponding to each phoneme in the phoneme sequence to obtain a lip shape sequence;
and determining a synthetic face image corresponding to each lip shape in the lip shape sequence.
4. The method of synthesizing voice animation according to claim 2, wherein the obtaining of the sequence of synthesized face images further comprises:
respectively fusing the synthesized face images in the synthesized face image sequence with preset material images to obtain a fused image sequence;
the synchronizing the sequence of audio frames and the sequence of synthetic facial images to obtain the speech animation comprises:
and synchronizing the audio frame sequence and the fusion image sequence to obtain the voice animation.
5. The method for synthesizing speech animation according to claim 2, wherein the performing speech recognition on the audio information to obtain the phoneme sequence of the audio information comprises:
obtaining an initial translation text of audio information, and determining the language of the audio information according to the initial translation text;
if the language of the audio information is determined to be the target language, acquiring a word segmentation result to be corrected and polyphones in the word segmentation result from the initial translation text;
screening out correct polyphone characters from the polyphone characters, and filling the correct polyphone characters into the word segmentation result to be corrected to obtain a correct word segmentation result;
and acquiring the standard pronunciation of the correct word segmentation result, and performing phoneme recognition on the standard pronunciation through a preset acoustic model to obtain a phoneme sequence of the audio information.
6. The method for synthesizing voice animation according to claim 5, wherein the obtaining of the initial translation text of the audio information comprises:
detecting and eliminating direct current offset in the audio information and resampling the audio information after the direct current offset is eliminated to obtain resampled audio information;
carrying out voice detection on the resampled audio information to obtain a voice audio frame in the audio information;
and performing voice recognition on the voice audio frame to obtain the initial translation text.
7. The method for synthesizing voice animation according to any one of claims 1 to 6, wherein the acquiring and displaying the voice animation including the image of the lip change of the target user comprises:
inputting the target face image and the audio information into a voice animation installation package running locally at a terminal to obtain the voice animation output by the voice animation installation package;
wherein the voice animation installation package is generated by the following steps:
acquiring program codes for acquiring the voice animation according to the target face image and the audio information;
compiling the program code by using a cross tool chain to obtain a static library running for a target operating system, wherein the cross tool chain is a cross compiling environment corresponding to a voice animation installation package to be generated;
and defining an external interface and a header file of the static library, and generating the voice animation installation package.
8. An apparatus for synthesizing speech animation, comprising:
the target facial image acquisition module is used for displaying the image acquisition control, responding to the triggering operation of the image acquisition control, acquiring the facial image of a target user and acquiring a target facial image;
the audio information acquisition module is used for displaying the read-after information and the read-after control, responding to the triggering operation of the read-after control and acquiring the audio information which is input by the target user and corresponds to the read-after information;
and the voice animation display module is used for acquiring and displaying the voice animation comprising the image of the lip change of the target user, where the lip change of the target user is synchronized with the content of the audio information, and the image of the lip change of the target user is obtained according to the lip in the target face image and the audio information.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method for synthesizing a speech animation according to any one of claims 1 to 7 are implemented by the processor when the program is executed.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the steps of the method for synthesizing a speech animation according to any one of claims 1 to 7.
CN202110671977.6A 2021-06-17 2021-06-17 Method and device for synthesizing voice animation, electronic equipment and storage medium Pending CN115497448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110671977.6A CN115497448A (en) 2021-06-17 2021-06-17 Method and device for synthesizing voice animation, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110671977.6A CN115497448A (en) 2021-06-17 2021-06-17 Method and device for synthesizing voice animation, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115497448A true CN115497448A (en) 2022-12-20

Family

ID=84464965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110671977.6A Pending CN115497448A (en) 2021-06-17 2021-06-17 Method and device for synthesizing voice animation, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115497448A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116433807A (en) * 2023-04-21 2023-07-14 北京百度网讯科技有限公司 Animation synthesis method and device, and training method and device for animation synthesis model
CN116564338A (en) * 2023-07-12 2023-08-08 腾讯科技(深圳)有限公司 Voice animation generation method, device, electronic equipment and medium
CN116564338B (en) * 2023-07-12 2023-09-08 腾讯科技(深圳)有限公司 Voice animation generation method, device, electronic equipment and medium
CN117745902A (en) * 2024-02-20 2024-03-22 卓世科技(海南)有限公司 Digital person generation method and device for rehabilitation demonstration
CN117745902B (en) * 2024-02-20 2024-04-26 卓世科技(海南)有限公司 Digital person generation method and device for rehabilitation demonstration

Similar Documents

Publication Publication Date Title
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
US20230042654A1 (en) Action synchronization for target object
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
CN113077537B (en) Video generation method, storage medium and device
CN109801349B (en) Sound-driven three-dimensional animation character real-time expression generation method and system
CN114401438A (en) Video generation method and device for virtual digital person, storage medium and terminal
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
CN110162598B (en) Data processing method and device for data processing
CN110148406B (en) Data processing method and device for data processing
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN112668407A (en) Face key point generation method and device, storage medium and electronic equipment
CN112750187A (en) Animation generation method, device and equipment and computer readable storage medium
KR20190109651A (en) Voice imitation conversation service providing method and sytem based on artificial intelligence
CN113299312A (en) Image generation method, device, equipment and storage medium
CN113111812A (en) Mouth action driving model training method and assembly
CN114255737B (en) Voice generation method and device and electronic equipment
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
KR20230075998A (en) Method and system for generating avatar based on text
CN112331184B (en) Voice mouth shape synchronization method and device, electronic equipment and storage medium
CN115529500A (en) Method and device for generating dynamic image
CN115083371A (en) Method and device for driving virtual digital image singing
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN113990295A (en) Video generation method and device
KR102138132B1 (en) System for providing animation dubbing service for learning language

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination