CN116312458A - Acoustic model training method, voice synthesis method, device and computer equipment - Google Patents

Acoustic model training method, voice synthesis method, device and computer equipment

Info

Publication number
CN116312458A
Authority
CN
China
Prior art keywords
length
phoneme
frequency spectrum
mel
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310133141.XA
Other languages
Chinese (zh)
Inventor
殷腾龙
马明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202310133141.XA
Publication of CN116312458A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to an acoustic model training method, a speech synthesis method, a device and computer equipment, which are applied in the field of speech synthesis and can improve the accuracy of speech synthesized from text. The method comprises the following steps: acquiring a sample phoneme sequence, a standard mel spectrum corresponding to the sample phoneme sequence and the length of the standard mel spectrum; inputting the sample phoneme sequence and the length of the standard mel spectrum into an acoustic model to obtain a mel spectrum corresponding to the sample phoneme sequence, wherein the acoustic model comprises a phoneme embedding layer, an encoder, a length predictor, a length regulator and a decoder; the length predictor is used for obtaining the length ratio corresponding to each phoneme based on the intermediate vector, and determining the length of the mel spectrum corresponding to each phoneme based on the length of the standard mel spectrum and the length ratio of each phoneme; and training the acoustic model based on the mel spectrum corresponding to the sample phoneme sequence and the standard mel spectrum to obtain a trained target acoustic model.

Description

Acoustic model training method, voice synthesis method, device and computer equipment
Technical Field
Embodiments of the present application relate to the field of speech synthesis, and more particularly to an acoustic model training method, a speech synthesis method, a device and computer equipment.
Background
The mainstream back-end algorithm for speech synthesis mainly consists of two parts: an end-to-end speech synthesis model (also called the acoustic model) and a vocoder. The acoustic model converts the phonemes output by the front end into a mel spectrum; the vocoder converts the mel spectrum into speech.
Currently, in the training stage, automatic speech recognition (Automatic Speech Recognition, ASR) technology is needed to align phonemes with mel spectra (the ASR alignment method for short) and obtain the duration of each phoneme in the text. However, an acoustic model based on the ASR alignment method usually requires the ASR alignment model to be trained separately, the training process is complex, and the alignment of phonemes with the mel spectrum in the acoustic model is affected by the accuracy of the ASR alignment model.
Disclosure of Invention
In order to solve the above technical problems, or at least partially solve them, the embodiments of the present application provide an acoustic model training method, a speech synthesis method, a device and computer equipment, which can improve the accuracy of speech synthesized from text.
In a first aspect, an embodiment of the present application provides an acoustic model training method, including:
Acquiring a sample phoneme sequence, a standard Mel frequency spectrum corresponding to the sample phoneme sequence and the length of the standard Mel frequency spectrum;
inputting the sample phoneme sequence and the length of the standard mel spectrum into an acoustic model to obtain a mel spectrum corresponding to the sample phoneme sequence, wherein the acoustic model comprises a phoneme embedding layer, an encoder, a length predictor, a length regulator and a decoder; the phoneme embedding layer is used for obtaining a phoneme vector corresponding to the sample phoneme sequence; the encoder is used for encoding the phoneme vector to obtain an intermediate vector; the length predictor is used for obtaining the length ratio corresponding to each phoneme based on the intermediate vector, and determining the length of the mel spectrum corresponding to each phoneme based on the length of the standard mel spectrum and the length ratio of each phoneme, the length ratio of a phoneme being the ratio of the length of the mel spectrum corresponding to that phoneme to the length of the standard mel spectrum; the length regulator is used for determining an intermediate sequence based on the intermediate vector and the length of the mel spectrum corresponding to each phoneme; the decoder is used for decoding the intermediate sequence to obtain the mel spectrum corresponding to the sample phoneme sequence, the length of the intermediate sequence being the same as the length of the standard mel spectrum;
and training an acoustic model based on the Mel frequency spectrum and the standard Mel frequency spectrum corresponding to the sample phoneme sequence to obtain a trained target acoustic model.
In a second aspect, embodiments of the present application provide a method for synthesizing speech, including:
acquiring a voice text to be synthesized;
determining the length of a target Mel frequency spectrum corresponding to the voice text to be synthesized according to the text length of the voice text to be synthesized;
converting the voice text to be synthesized into a target phoneme sequence;
inputting the target phoneme sequence and the length of the target mel frequency spectrum into a target acoustic model to obtain a target mel frequency spectrum, wherein the target acoustic model is obtained through training by the acoustic model training method according to the first aspect;
and generating, from the target mel spectrum and through the vocoder, the target voice corresponding to the voice text to be synthesized.
In a third aspect, an embodiment of the present application provides an acoustic model training apparatus, including:
the acquisition module is used for acquiring the sample phoneme sequence, the standard Mel frequency spectrum corresponding to the sample phoneme sequence and the length of the standard Mel frequency spectrum;
the input module is used for inputting the sample phoneme sequence and the length of the standard mel spectrum into an acoustic model to obtain a mel spectrum corresponding to the sample phoneme sequence, the acoustic model comprising a phoneme embedding layer, an encoder, a length predictor, a length regulator and a decoder; the phoneme embedding layer is used for obtaining a phoneme vector corresponding to the sample phoneme sequence; the encoder is used for encoding the phoneme vector to obtain an intermediate vector; the length predictor is used for obtaining the length ratio corresponding to each phoneme based on the intermediate vector, and determining the length of the mel spectrum corresponding to each phoneme based on the length of the standard mel spectrum and the length ratio of each phoneme, the length ratio of a phoneme being the ratio of the length of the mel spectrum corresponding to that phoneme to the length of the standard mel spectrum; the length regulator is used for determining an intermediate sequence based on the intermediate vector and the length of the mel spectrum corresponding to each phoneme; the decoder is used for decoding the intermediate sequence to obtain the mel spectrum corresponding to the sample phoneme sequence, the length of the intermediate sequence being the same as the length of the standard mel spectrum;
And the training module is used for training the acoustic model based on the Mel frequency spectrum corresponding to the sample phoneme sequence and the standard Mel frequency spectrum to obtain a trained target acoustic model.
In a fourth aspect, embodiments of the present application provide a speech synthesis apparatus, including:
the acquisition module is used for acquiring a voice text to be synthesized;
the determining module is used for determining the length of a target Mel frequency spectrum corresponding to the voice text to be synthesized according to the text length of the voice text to be synthesized;
the conversion module is used for converting the voice text to be synthesized into a target phoneme sequence;
the input module is used for inputting the target phoneme sequence and the length of the target mel frequency spectrum into a target acoustic model to obtain a target mel frequency spectrum, and the target acoustic model is obtained through training by the acoustic model training method according to the first aspect;
and the generating module is used for generating, from the target mel spectrum and through the vocoder, the target voice corresponding to the voice text to be synthesized.
In a fifth aspect, embodiments of the present application provide an electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program implementing the acoustic model training method as described in the first aspect or the speech synthesis method as described in the second aspect when executed by the processor.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium comprising: the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the acoustic model training method as described in the first aspect or the speech synthesis method as described in the second aspect.
In a seventh aspect, embodiments of the present application provide a computer program product comprising: the computer program product, when run on a computer, causes the computer to implement the acoustic model training method as described in the first aspect or the speech synthesis method as described in the second aspect.
Compared with the prior art, the technical scheme provided by the embodiments of the application has the following advantages. In the embodiment of the application, a sample phoneme sequence, a standard mel spectrum corresponding to the sample phoneme sequence and the length of the standard mel spectrum are obtained; the sample phoneme sequence and the length of the standard mel spectrum are input into an acoustic model to obtain a mel spectrum corresponding to the sample phoneme sequence, where the acoustic model comprises a phoneme embedding layer, an encoder, a length predictor, a length regulator and a decoder; the phoneme embedding layer is used for obtaining a phoneme vector corresponding to the sample phoneme sequence; the encoder is used for encoding the phoneme vector to obtain an intermediate vector; the length predictor is used for obtaining the length ratio corresponding to each phoneme based on the intermediate vector, and determining the length of the mel spectrum corresponding to each phoneme based on the length of the standard mel spectrum and the length ratio of each phoneme, the length ratio of a phoneme being the ratio of the length of the mel spectrum corresponding to that phoneme to the length of the standard mel spectrum; the length regulator is used for determining an intermediate sequence based on the intermediate vector and the length of the mel spectrum corresponding to each phoneme; the decoder is used for decoding the intermediate sequence to obtain the mel spectrum corresponding to the sample phoneme sequence, the length of the intermediate sequence being the same as the length of the standard mel spectrum; and the acoustic model is trained based on the mel spectrum corresponding to the sample phoneme sequence and the standard mel spectrum to obtain a trained target acoustic model. In this way, a length predictor is added to the acoustic model to obtain the length ratio corresponding to each phoneme based on the intermediate vector and to determine the length of the mel spectrum corresponding to each phoneme based on the length of the standard mel spectrum and the length ratio of each phoneme, so that the duration of each phoneme is predicted with the help of the standard mel spectrum during training and the alignment of phonemes with the mel spectrum is achieved. The length of the intermediate sequence is thus the same as the length of the standard mel spectrum, which ensures the accuracy of converting the phoneme sequence into the mel spectrum. Because the scheme of the application does not need an ASR alignment model for alignment, the training process of the acoustic model is simplified and the effect of the acoustic model is no longer affected by the accuracy of an ASR alignment model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from these drawings.
FIG. 1 illustrates a scene architecture diagram of a speech synthesis method in accordance with some embodiments;
FIG. 2 illustrates a hardware configuration block diagram of an electronic device in accordance with some embodiments;
FIG. 3 illustrates an operating system diagram of an electronic device and a server in accordance with some embodiments;
FIG. 4 illustrates a flow diagram of an acoustic model training method in accordance with some embodiments;
FIG. 5 illustrates one of the structural schematics of an acoustic model according to some embodiments;
FIG. 6 illustrates a second structural schematic of an acoustic model according to some embodiments;
FIG. 7 illustrates a third structural schematic of an acoustic model in accordance with some embodiments;
FIG. 8 illustrates a fourth schematic structural diagram of an acoustic model, according to some embodiments;
FIG. 9 illustrates a fifth structural schematic of an acoustic model according to some embodiments;
FIG. 10 illustrates one of the flow diagrams of a speech synthesis method according to some embodiments;
FIG. 11 illustrates a second flow diagram of a speech synthesis method according to some embodiments;
FIG. 12 illustrates a frame schematic of an acoustic model training apparatus in accordance with some embodiments;
FIG. 13 illustrates a schematic frame diagram of a speech synthesis apparatus according to some embodiments;
FIG. 14 illustrates a computer device hardware schematic, according to some embodiments.
Detailed Description
For purposes of clarity and ease of implementation of the present application, exemplary implementations of the present application are described below clearly and completely with reference to the accompanying drawings in which they are illustrated. It is apparent that the described exemplary implementations are only some, but not all, of the examples of the present application.
It should be noted that the brief description of the terms in the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "first," second, "" third and the like in the description and in the claims and in the above-described figures are used for distinguishing between similar or similar objects or entities and not necessarily for limiting a particular order or sequence, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
Fig. 1 is a schematic view of a scenario architecture of a speech synthesis method according to an embodiment of the present application. As shown in fig. 1, a scenario architecture provided in an embodiment of the present application includes: server 100 and electronic device 200.
The electronic device 200 provided in the embodiment of the application may have various implementation forms, for example, it may be a smart speaker, a television, a refrigerator, a washing machine, an air conditioner, a smart curtain, a router, a set top box, a mobile phone, a personal computer (Personal Computer, PC), a smart television, a laser projection device, a display (monitor), an electronic whiteboard (electronic bulletin board), a wearable device, a vehicle-mounted device, an electronic table, and the like.
In some embodiments, when the electronic device 200 receives an instruction to synthesize text into speech, the electronic device 200 may implement the text-to-speech function by itself, or it may implement the function through the server by exchanging data with the server 100. The electronic device 200 may establish a communication connection with the server 100 through a local area network (LAN) or a wireless local area network (WLAN).
The server 100 may be a server providing various services, such as a server providing speech synthesis support for text acquired by the electronic device 200. The server may perform a synthesis process on the received text data and feed back a result of the process (e.g., synthesized voice) to the electronic device 200. The server 100 may be a server cluster, or may be a plurality of server clusters, and may include one or more types of servers.
The electronic device 200 may be hardware or software. When the electronic device 200 is hardware, it may be various electronic devices with sound playing functions, including but not limited to a smart speaker, a smart phone, a television, a tablet computer, an electronic book reader, a smart watch, a player, a computer, an AI device, a robot, a smart vehicle, etc. When the electronic apparatus 200 is software, it can be installed in the above-listed electronic apparatus. Which may be implemented as a plurality of software or software modules (e.g. for providing speech synthesis services) or as a single software or software module. The present invention is not particularly limited herein.
The electronic device 200 receives an instruction for converting a text into a voice input by a user, then sends the voice text to be synthesized to the server 100, the server 100 processes the voice text to be synthesized by using the voice synthesis method provided in the embodiment of the present application to obtain a target voice corresponding to the voice text to be synthesized, and then returns the target voice to the electronic device 200, where the electronic device 200 voice broadcasts the target voice.
It should be noted that the schematic view of the scenario shown in fig. 1 only shows one possible scenario for implementing the speech synthesis method provided in the present embodiment. The execution body of the speech synthesis method provided in the embodiment of the present application may be the above server or a functional module or a functional entity in the server for implementing the speech synthesis method. The execution body of the speech synthesis method provided in the embodiment of the present application may also be the electronic device or a functional module or a functional entity in the electronic device for implementing the speech synthesis method.
Fig. 2 shows a hardware configuration block diagram of an electronic device 200 in accordance with an example embodiment. The electronic device 200 as shown in fig. 2 includes at least one of a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, a memory, a power supply, and a user interface 280. The controller includes a central processing unit, an audio processor, a RAM, a ROM, and first to nth interfaces for input/output.
The communicator 220 is a component for communicating with external devices or servers according to various communication protocol types. For example: the communicator may include at least one of a Wifi module, a bluetooth module, a wired ethernet module, or other network communication protocol chip or a near field communication protocol chip, and an infrared receiver. The electronic device 200 may establish transmission and reception of control signals and data signals through the communicator 220 and the server 100.
The user interface 280 may be used to receive external control signals.
The detector 230 is used to collect signals of the external environment or interaction with the outside. For example, detector 230 includes a light receiver, a sensor for capturing the intensity of ambient light; alternatively, the detector 230 includes an image collector such as a camera, which may be used to collect external environmental scenes, user attributes, or user interaction gestures, or alternatively, the detector 230 includes a sound collector such as a microphone, or the like, which is used to receive external sounds.
The sound collector may be a microphone, also called "microphone", which may be used to receive the sound of a user and to convert the sound signal into an electrical signal. The electronic device 200 may be provided with at least one microphone. In other embodiments, the electronic device 200 may be provided with two microphones, and may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 200 may also be provided with three, four, or more microphones to enable collection of sound signals, noise reduction, identification of sound sources, directional recording, etc.
Further, the microphone may be built in the electronic device 200, or the microphone may be connected to the electronic device 200 by a wired or wireless means. Of course, the location of the microphone on the electronic device 200 is not limited in the embodiments of the present application. Alternatively, the electronic device 200 may not include a microphone, i.e., the microphone is not provided in the electronic device 200. The electronic device 200 may be coupled to a microphone (also referred to as a microphone) via an interface, such as the USB interface 130. The external microphone may be secured to the electronic device 200 by external fasteners such as a camera mount with a clip.
The controller 250 controls the operation of the electronic device and responds to the user's operations by various software control programs stored on the memory. The controller 250 controls the overall operation of the electronic device 200.
In some embodiments the controller includes at least one of a central processing unit (Central Processing Unit, CPU), a video processor, an audio processor, a RAM (Random Access Memory), a ROM (Read-Only Memory), first to nth interfaces for input/output, a communication bus (Bus), and the like.
In some examples, the operating system of the electronic device is an Android system, and as shown in fig. 3, the electronic device 200 may be logically divided into an application layer (Applications) 21, a kernel layer 22 and a hardware layer 23.
Wherein, as shown in fig. 3, the hardware layer 23 may include the controller 250, the communicator 220, the detector 230, etc. shown in fig. 2. The application layer 21 includes one or more applications. The application may be a system application or a third party application. For example, the application layer 21 includes a voice recognition application that can provide voice interaction interfaces and services for connection of the electronic device 200 to the server 100.
The kernel layer 22 acts as software middleware between the hardware layer 23 and the application layer 21 for managing and controlling hardware and software resources.
As shown in fig. 3, the server 100 may include: the communication control module 101, the speech synthesis module 102 and the data storage module 103 may also include other modules, which are not limited herein. Wherein the communication control module 101 is configured to communicate with the communicator 220, and the data storage module 103 is configured to store various databases, which in the present embodiment may be configured to store tabular data.
In some examples, the kernel layer 22 includes a detector driver for sending the speech text to be synthesized collected by the detector 230 to the speech synthesis application in the event that the speech synthesis application in the electronic device 200 is launched and the electronic device 200 has established a communication connection with the server 100. The speech synthesis application then sends the text containing the speech to be synthesized to the speech synthesis module 102 in the server. The speech synthesis module 102 is configured to input a speech text to be synthesized sent by the electronic device 200 into a speech synthesis model, obtain a speech synthesis result, and then transmit the speech synthesis result to the electronic device 200.
The traditional Text To Speech (TTS) model mainly includes two stages: front-end text-to-phoneme conversion and back-end phoneme-to-speech-signal conversion. The front end converts the initial text into phonemes, and the back end upsamples the phonemes into a speech signal. Building these modules requires a great deal of prior knowledge, and a great deal of time is spent on feature design. With the development of deep learning in the speech field, end-to-end speech synthesis schemes based on deep neural networks have gradually emerged. At present, the back end uses acoustic features as an intermediate representation, and end-to-end training and synthesis have basically been realized.
Text front-end processing typically goes through three basic steps: text preprocessing, conversion to normalized text, and conversion to phonemes, while prosody is predicted from the text and the normalized text. The phonemes and prosody identifiers are collectively referred to as linguistic features. The output of the text front end serves as the input of the downstream acoustic model and vocoder; if problems such as pronunciation errors occur, in most cases the phoneme sequence can be corrected directly, which greatly reduces the difficulty of problem solving.
The back end generally has two parts: phoneme-acoustic feature (mel-spectrum) conversion (acoustic model), acoustic feature-speech signal conversion (vocoder).
The reason the back end is divided into two parts is that TTS is an upsampling process and the speech signal has a high sampling rate, typically 16 kHz, so each character corresponds to a large number of audio samples. To obtain good results, it is generally necessary to use an acoustic feature as an intermediate representation and perform the upsampling in two steps; the mel spectrum is the most common intermediate acoustic feature.
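As an illustration only (not part of the claimed method), a mel spectrum can be extracted from a waveform roughly as follows, assuming the librosa library, a 16 kHz sampling rate and 80 mel bins:

```python
import librosa
import numpy as np

# Hedged sketch: extract a log-mel spectrum as the intermediate acoustic feature.
# The FFT size, hop length and number of mel bins are illustrative assumptions.
wav, sr = librosa.load("sample.wav", sr=16000)            # 16 kHz speech signal
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80   # 80-dimensional mel bins
)
log_mel = np.log(np.clip(mel, 1e-5, None))                # log compression
T = log_mel.shape[1]                                      # length of the mel spectrum in frames
```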
Currently, vocoders can be broadly classified into phase-reconstruction-based vocoders and neural-network-based vocoders. Phase-reconstruction-based vocoders use algorithms to derive phase characteristics and reconstruct the speech waveform, mainly because the acoustic features (mel features, etc.) used by TTS have lost the phase information. Neural-network-based vocoders map acoustic features to speech waveforms directly, so the quality of the synthesized speech is higher. Currently popular neural network vocoders mainly include wavenet, wavernn, melgan, waveglow, fastspeech and the like. The present application focuses on the phoneme-to-mel-spectrum conversion part, which is the core part of TTS and plays a decisive role in the synthesis effect.
The phoneme-to-mel-spectrum conversion part mainly falls into autoregressive models and non-autoregressive models. Autoregressive models, represented by Tacotron 2 and Transformer TTS, do not require phoneme alignment information; the model learns which phonemes each step corresponds to and decides when to stop. Such autoregressive models must run step by step, ending either when the model itself decides to stop or when a manually set maximum number of steps is reached, so their synthesis speed is very slow.
The non-autoregressive model is obviously faster than the autoregressive model. Non-autoregressive models, represented by FastSpeech, have a good synthesis effect and a high speed, but during training they need the result of aligning phonemes with the mel spectrum (phoneme alignment for short), which is obtained through an ASR alignment model. Typically, for the same ten-character sentence, generating the mel spectrum takes on the order of tens of milliseconds for FastSpeech and about three hundred milliseconds for Tacotron 2. The synthesis speed advantage of the non-autoregressive model is therefore considerable, but the phoneme alignment problem remains to be solved.
Speech synthesis is an "up-sampling" process, and the length of acoustic features such as mel-frequency spectrum is often much longer than the number of phonemes. Meanwhile, the voice has randomness, and the length of each phoneme pronunciation has a large change. Thus, whether training or reasoning, it is necessary to determine the length of each phoneme and find its corresponding segment in the mel spectrum, called "alignment".
The end-to-end speech synthesis model generates a mel spectrum from the phonemes output by the front end through the trained acoustic model, and converts the mel spectrum into a speech signal (synthesized speech) through the vocoder. A non-autoregressive end-to-end model needs to obtain the real duration of each phoneme in the audio during the training stage, which is generally obtained through an ASR alignment model trained on the phonemes; for example, the mainstream algorithm model FastSpeech relies on the phoneme alignment result of an ASR alignment model during training. The following problems exist: 1) the training process is complex, and an ASR alignment model generally needs to be trained separately to achieve a good effect; 2) the phoneme alignment effect is affected by the accuracy of the ASR alignment model; 3) there is a rounding error in the estimate of the number of frames per phoneme.
Therefore, in the training stage of the existing non-autoregressive end-to-end model, on the one hand, the training process is complex because an ASR alignment model needs to be trained separately; on the other hand, slight errors in the phoneme alignment produced by the ASR alignment model cause errors in the generated mel spectrum and ultimately flaws in the synthesized speech.
In order to solve the above technical problems, in some embodiments of the present application, a sample phoneme sequence, a standard mel spectrum corresponding to the sample phoneme sequence and the length of the standard mel spectrum are obtained; the sample phoneme sequence and the length of the standard mel spectrum are input into an acoustic model to obtain a mel spectrum corresponding to the sample phoneme sequence, where the acoustic model comprises a phoneme embedding layer, an encoder, a length predictor, a length regulator and a decoder; the phoneme embedding layer is used for obtaining a phoneme vector corresponding to the sample phoneme sequence; the encoder is used for encoding the phoneme vector to obtain an intermediate vector; the length predictor is used for obtaining the length ratio corresponding to each phoneme based on the intermediate vector, and determining the length of the mel spectrum corresponding to each phoneme based on the length of the standard mel spectrum and the length ratio of each phoneme, the length ratio of a phoneme being the ratio of the length of the mel spectrum corresponding to that phoneme to the length of the standard mel spectrum; the length regulator is used for determining an intermediate sequence based on the intermediate vector and the length of the mel spectrum corresponding to each phoneme; the decoder is used for decoding the intermediate sequence to obtain the mel spectrum corresponding to the sample phoneme sequence, the length of the intermediate sequence being the same as the length of the standard mel spectrum; and the acoustic model is trained based on the mel spectrum corresponding to the sample phoneme sequence and the standard mel spectrum to obtain a trained target acoustic model. In this way, a length predictor is added to the acoustic model to obtain the length ratio corresponding to each phoneme based on the intermediate vector, and the length of the mel spectrum corresponding to each phoneme is determined based on the length of the standard mel spectrum and the length ratio of each phoneme, so that the duration of each phoneme is predicted with the help of the standard mel spectrum during the training of the acoustic model and the alignment of phonemes with the mel spectrum is achieved. The length of the intermediate sequence is thus the same as the length of the standard mel spectrum, which ensures the accuracy of converting the phoneme sequence into the mel spectrum. The scheme of the application does not need an ASR alignment model for alignment, which simplifies the training process of the acoustic model, improves the accuracy of phoneme alignment, and avoids the effect of the acoustic model being influenced by the accuracy of an ASR alignment model. The training process of the TTS model is thereby improved, the accumulation of errors is avoided, a stable synthesis effect is achieved, and the synthesis effect is improved.
Fig. 4 is a flowchart of steps for implementing an acoustic model training method according to one or more embodiments of the present application, where an execution body of the acoustic model training method may be a server or an electronic device, or may be a functional module or a functional entity in the server or the electronic device that can implement the acoustic model training method, which is not limited herein. In the embodiment of the present application, the acoustic model training method may include S401 to S403 described below.
S401, acquiring a sample phoneme sequence, a standard Mel frequency spectrum corresponding to the sample phoneme sequence and the length of the standard Mel frequency spectrum.
S402, inputting the lengths of the sample phoneme sequence and the standard Mel frequency spectrum into an acoustic model to obtain the Mel frequency spectrum corresponding to the sample phoneme sequence.
Wherein, as shown in fig. 5, the acoustic model 50 includes a phoneme embedding layer 51, an encoder 52, a length predictor 53, a length adjuster 54, and a decoder 55; the phoneme embedding layer 51 is configured to obtain a phoneme vector corresponding to the sample phoneme sequence; the encoder 52 is configured to encode the phoneme vector to obtain an intermediate vector; the length predictor 53 is configured to obtain a length ratio corresponding to each phoneme based on the intermediate vector, and determine a length of the mel spectrum corresponding to each phoneme based on the length of the standard mel spectrum and the length ratio of each phoneme; the length adjuster 54 is configured to determine an intermediate sequence based on the intermediate vector and the length of the mel spectrum corresponding to each phoneme; the decoder 55 is configured to decode the intermediate sequence to obtain a mel spectrum corresponding to the sample phoneme sequence, where the length of the intermediate sequence is the same as the length of the standard mel spectrum.
Wherein the intermediate vector may be a prosodic hidden feature.
Wherein the length ratio is the ratio of the length of the mel spectrum corresponding to one phoneme to the length of the standard mel spectrum. The length predictor is specifically configured to predict the length of the mel spectrum corresponding to each phoneme based on the intermediate vector, normalize the length of the mel spectrum corresponding to each phoneme to obtain a length ratio corresponding to each phoneme, and finally determine the length of the mel spectrum corresponding to each phoneme based on the length of the standard mel spectrum and the length ratio of each phoneme.
In some embodiments of the present application, the length predictor may include a phoneme length prediction module (Portion Predictor), a normalization module (e.g., softmax), and a product module, where the phoneme length prediction module is configured to predict a length of a mel spectrum corresponding to each phoneme based on the intermediate vector, the normalization module is configured to normalize the length of the mel spectrum corresponding to each phoneme to obtain a length ratio corresponding to each phoneme, and the product module is configured to determine a length of the mel spectrum corresponding to each phoneme based on a length of a standard mel spectrum and the length ratio of each phoneme.
The phoneme length prediction module is specifically configured to predict a length of a mel spectrum corresponding to each phoneme based on the intermediate vector and the standard mel spectrum.
The main structure of the length predictor may be a convolutional network, and the specific network structure is not limited in the embodiments of the present application.
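For illustration only, a minimal PyTorch-style sketch of such a length predictor is given below; the convolution sizes, hidden dimension and module names are assumptions and do not limit the claimed structure:

```python
import torch
import torch.nn as nn

class LengthPredictor(nn.Module):
    """Hedged sketch: predict per-phoneme mel lengths from the intermediate vectors."""
    def __init__(self, hidden_dim=256, kernel_size=3):
        super().__init__()
        # Phoneme length prediction module (a small convolutional network is assumed here).
        self.portion_predictor = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, h, mel_length):
        # h: (batch, L, hidden_dim) intermediate vectors; mel_length: (batch,) standard mel length T.
        raw = self.portion_predictor(h.transpose(1, 2)).squeeze(1)  # (batch, L) unnormalized lengths
        p = torch.softmax(raw, dim=-1)                              # normalization module: ratios p_l
        T_l = p * mel_length.unsqueeze(-1)                          # product module: T_l = T * p_l
        return T_l, p
```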
In some embodiments of the present application, the length may be a time duration (in seconds), or it may be expressed as an integer multiple of a fixed time length (i.e., as a count), which may be determined according to the practical situation and is not limited herein. The fixed time length may be determined according to the practical situation; for example, it may be the time length occupied by one phoneme.
The length of the intermediate sequence is the same as the length of the standard mel spectrum, so the mel spectrum obtained for the sample phoneme sequence by decoding the intermediate sequence has the same length as the standard mel spectrum, making the mel spectrum corresponding to the sample phoneme sequence closer to the standard mel spectrum.
The network structure of the decoder and the network structure of the encoder may be the same or different, which is not limited herein.
In some embodiments of the present application, because there are many Chinese characters and the training samples can hardly cover all of them, the Chinese characters are usually converted into phonemes, and speech is synthesized from the phonemes. For example, when the complete speech "hello" is synthesized in a television scene, the text is split into the phoneme sequence "eos n i2 h aa3 uu3 eos", where "eos" marks an endpoint.
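Purely for illustration, this text-to-phoneme conversion can be sketched with a small word-level lexicon; the lexicon content and the function are assumptions that merely reproduce the example above:

```python
# Hedged sketch: split text into phonemes through a hypothetical word-level lexicon.
LEXICON = {
    "hello": ["n", "i2", "h", "aa3", "uu3"],  # assumed entry matching the example above
}

def text_to_phonemes(text):
    phonemes = ["eos"]            # leading endpoint
    phonemes.extend(LEXICON[text])
    phonemes.append("eos")        # trailing endpoint
    return phonemes

print(text_to_phonemes("hello"))  # ['eos', 'n', 'i2', 'h', 'aa3', 'uu3', 'eos']
```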
In the embodiment of the application, a sample phoneme sequence and a standard mel spectrum are input into the acoustic model. The phoneme embedding layer embeds the sample phoneme sequence into a vector space to obtain the corresponding phoneme vectors, denoted x_l, where l ∈ [0, 1, …, L] represents the position of the phoneme from left to right and L is the number of phonemes. Then x_l is passed into the encoder to generate the intermediate vector h_l; the encoder fuses the relations between neighbouring phonemes so as to model the changes a phoneme undergoes in context. The encoder typically uses N feed-forward Transformer blocks (FFT Block), with N preferably 4 to 6. Since the total length of the standard mel spectrum is known, the length of the mel spectrum corresponding to each phoneme can be determined by predicting the proportion occupied by each phoneme. The intermediate vector h_l is normalized through a Softmax layer to finally obtain the proportion p_l of each phoneme, where p_0 + p_1 + … + p_L = 1.
Since the speech is given during training, the length of the standard mel spectrum is known and is denoted as T; then T_l = T * p_l is the mel spectrum length corresponding to each phoneme. h_l and T_l are then passed into the length adjuster to obtain an intermediate sequence of equal length to the standard mel spectrum. Finally, the intermediate sequence is passed into the decoder to obtain the mel spectrum.
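Purely as an illustration of the data flow described above, the training-time forward pass can be sketched as follows, assuming the five sub-modules exist with the interfaces shown (their internals are omitted):

```python
def acoustic_model_forward(phoneme_ids, mel_length,
                           embed, encoder, length_predictor, length_regulator, decoder):
    """Hedged sketch of the training-time data flow; the five sub-modules are assumed
    to be callables with these interfaces and are not the claimed implementations."""
    x = embed(phoneme_ids)                      # phoneme embedding layer: x_l
    h = encoder(x)                              # encoder: intermediate vectors h_l
    T_l, p = length_predictor(h, mel_length)    # length ratios p_l and lengths T_l = T * p_l
    seq = length_regulator(h, T_l)              # intermediate sequence, same length as the standard mel
    mel = decoder(seq)                          # predicted mel spectrum for the sample phoneme sequence
    return mel
```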
S403, training an acoustic model based on the Mel frequency spectrum and the standard Mel frequency spectrum corresponding to the sample phoneme sequence to obtain a trained target acoustic model.
In the embodiments of the application, a length predictor is added to the acoustic model to obtain the length ratio corresponding to each phoneme based on the intermediate vector, and the length of the mel spectrum corresponding to each phoneme is determined based on the length of the standard mel spectrum and the length ratio of each phoneme. In this way, the duration of each phoneme is predicted with the help of the standard mel spectrum during the training of the acoustic model, and the alignment of phonemes with the mel spectrum is achieved, so that the length of the intermediate sequence is identical to the length of the standard mel spectrum and the accuracy of converting the phoneme sequence into the mel spectrum is ensured. The scheme of the application does not need an ASR alignment model for alignment, which simplifies the training process of the acoustic model, improves the accuracy of phoneme alignment, and avoids the effect of the acoustic model being influenced by the accuracy of an ASR alignment model. The training process of the TTS model is thereby improved, the accumulation of errors is avoided, a stable synthesis effect is achieved, and the synthesis effect is improved.
In some embodiments of the present application, the length adjuster is a Length Regulator: each phoneme encoding is copied according to its corresponding length T_l to obtain a copy sequence for each phoneme, and the copy sequences of all phonemes are then concatenated to obtain an intermediate sequence whose length is consistent with the standard mel spectrum.
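A minimal sketch of such a copy-and-concatenate length regulator, assuming the per-phoneme lengths are rounded to integer frame counts, is:

```python
import torch

def length_regulator(h, T_l):
    """Hedged sketch: repeat each phoneme encoding according to its mel length and
    concatenate, so the output length matches the standard mel spectrum."""
    # h: (L, hidden_dim) phoneme encodings; T_l: (L,) per-phoneme mel lengths.
    repeats = torch.round(T_l).long().clamp(min=0)        # rounding to frames is an assumption
    return torch.repeat_interleave(h, repeats, dim=0)     # (sum(repeats), hidden_dim)
```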
In some embodiments of the present application, in conjunction with fig. 5, as shown in fig. 6, the acoustic model 50 further includes: a standard deviation predictor 56; the standard deviation predictor 56 is configured to obtain the standard deviation corresponding to each phoneme according to the intermediate vector; the length adjuster 54 is specifically a soft length adjuster based on the Gaussian probability density function corresponding to each phoneme, where the Gaussian probability density function of a phoneme is determined by the standard deviation corresponding to that phoneme and the center position corresponding to that phoneme, and the center position of a phoneme is determined according to the lengths of the mel spectra corresponding to the phonemes.
During mel-spectrum time-frequency analysis, phonemes have no sharp boundaries, and adjacent phonemes usually partially overlap. Meanwhile, phonemes change continuously into one another during pronunciation. To model these two phenomena, a standard deviation predictor (STD Predictor) is introduced to model the overlap and the continuous variation. The intermediate vector h_l is passed into the standard deviation predictor, and the absolute value (ABS) of its output is taken to obtain the standard deviation parameter σ_l; correspondingly, h_l, σ_l and T_l are passed into the Soft Length Regulator.
The specific implementation steps of the soft length adjuster (Soft Length Regulator) in the embodiment of the application are as follows:
step1: acquiring the corresponding central position mu of each phoneme l
Figure BDA0004086130860000091
Step 2: using μ_l and the variance σ_l, construct a Gaussian probability density function D_l(t) with mean μ_l and variance σ_l, where t ∈ [0, 1, …, T]; this function takes its maximum value at the center position of the phoneme and gradually decays towards both sides;
Step 3: copy the phoneme encoding h_l (which, as described below, may also carry intermediate fundamental frequency and energy information) T times to obtain the sequence ĥ_l = [h_l, h_l, …, h_l], whose length is consistent with the mel spectrum;
step4: by D l (t) highlighting the phoneme center position to obtain the final code H of the phoneme l I.e.
Figure BDA0004086130860000095
Step 5: sum the encodings H_l of the individual phonemes to obtain the encoding H of the complete sentence: H = H_0 + H_1 + … + H_L.
In summary, in the embodiment of the present application, the standard deviation predictor and the soft length adjuster together model the overlap and the continuous change between phonemes: the Gaussian probability density function is used to highlight the center of each phoneme while treating the phoneme boundaries more softly. The intermediate sequence obtained in this way better matches the real standard mel spectrum, which can improve the accuracy of converting the phoneme sequence into the mel spectrum and further improve the accuracy of speech synthesis.
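For illustration, Steps 1 to 5 above can be sketched as follows; the broadcasting details and the use of the normalized Gaussian density are assumptions of this sketch:

```python
import math
import torch

def soft_length_regulator(h, sigma, T_l, T):
    """Hedged sketch of Steps 1-5 of the Soft Length Regulator.
    h: (L, hidden) phoneme encodings, sigma: (L,) standard deviations,
    T_l: (L,) per-phoneme mel lengths, T: total mel length (int)."""
    # Step 1: center position of each phoneme (preceding lengths plus half its own length).
    mu = torch.cumsum(T_l, dim=0) - T_l / 2                                  # (L,)
    # Step 2: Gaussian density D_l(t), maximal at the phoneme center and decaying to both sides.
    t = torch.arange(T, dtype=torch.float32)                                 # (T,)
    norm = 1.0 / (sigma.unsqueeze(1) * math.sqrt(2 * math.pi))
    D = norm * torch.exp(-0.5 * ((t.unsqueeze(0) - mu.unsqueeze(1)) / sigma.unsqueeze(1)) ** 2)  # (L, T)
    # Step 3: copy each phoneme encoding T times so its length matches the mel spectrum.
    h_tiled = h.unsqueeze(1).expand(-1, T, -1)                               # (L, T, hidden)
    # Step 4: weight the copies by D_l(t) to highlight each phoneme's center.
    H_l = D.unsqueeze(-1) * h_tiled                                          # (L, T, hidden)
    # Step 5: sum over phonemes to obtain the encoding of the complete sentence.
    return H_l.sum(dim=0)                                                    # (T, hidden)
```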
In some embodiments of the present application, in conjunction with fig. 6, as shown in fig. 7, the acoustic model 50 further includes: a fundamental frequency predictor 571 and a fundamental frequency embedding layer 572; the fundamental frequency predictor 571 is configured to determine a fundamental frequency parameter sequence according to the intermediate vector, and the fundamental frequency embedding layer 572 is configured to obtain a fundamental frequency parameter vector corresponding to the fundamental frequency parameter sequence; the length adjuster 54 is specifically configured to determine the intermediate sequence based on a first vector and the length of the mel spectrum corresponding to each phoneme, where the first vector is the sum of the intermediate vector and the fundamental frequency parameter vector.
In the embodiment of the application, in order to enrich the rhythm of the speech and avoid a flat intonation, an independent fundamental frequency predictor is constructed after the encoder to model pitch variation and encode the pitch corresponding to each phoneme, so that the mel spectrum converted from the phoneme sequence is closer to the standard mel spectrum, more accurate acoustic features are obtained, and the accuracy of speech synthesis can be further improved. The intermediate vector h_l is passed into the fundamental frequency predictor (F0 Predictor) to obtain the fundamental frequency parameter f_l. Then f_l is passed into the fundamental frequency embedding layer (F0-Embedding), and the resulting embedding is added to h_l to obtain q_l. Correspondingly, q_l, σ_l and T_l are passed into the Soft Length Regulator, and Step 3 of the soft length adjuster copies q_l T times to obtain the corresponding copy sequence.
In some embodiments of the present application, in conjunction with fig. 6, as shown in fig. 8, the acoustic model 50 further includes: an energy predictor 581 and an energy embedding layer 582; the energy predictor 581 is configured to determine an energy parameter sequence according to the intermediate vector, and the energy embedding layer 582 is configured to obtain an energy parameter vector corresponding to the energy parameter sequence; the length adjuster 54 is specifically configured to determine the intermediate sequence based on a second vector and a length of the mel spectrum corresponding to each phoneme, where the second vector is a sum of the intermediate vector and the energy parameter vector.
In the embodiment of the application, an independent energy predictor is constructed after the encoder to model the variation of speech intensity and encode the intensity variation within the phonemes, so that the mel spectrum converted from the phoneme sequence is closer to the standard mel spectrum, more accurate acoustic features are obtained, and the accuracy of speech synthesis can be further improved. The intermediate vector h_l is passed into the energy predictor (Energy Predictor) to obtain the energy parameter g_l. Then g_l is passed into the energy embedding layer (Energy-Embedding), and the resulting embedding is added to h_l to obtain w_l. Correspondingly, w_l, σ_l and T_l are passed into the Soft Length Regulator, and Step 3 of the soft length adjuster copies w_l T times to obtain the corresponding copy sequence.
In some embodiments of the present application, in conjunction with fig. 8, as shown in fig. 9, the acoustic model further includes: a fundamental frequency predictor 571 and a fundamental frequency embedding layer 572; the fundamental frequency predictor 571 is configured to determine a fundamental frequency parameter sequence according to the intermediate vector, and the fundamental frequency embedding layer 572 is configured to obtain a fundamental frequency parameter vector corresponding to the fundamental frequency parameter sequence; the length adjuster 54 is specifically configured to determine the intermediate sequence based on a third vector and the length of the mel spectrum corresponding to each phoneme, where the third vector is the sum of the intermediate vector, the fundamental frequency parameter vector and the energy parameter vector.
In the embodiment of the application, in order to enrich the rhythm of the speech and avoid a flat intonation, an independent fundamental frequency predictor is constructed after the encoder to model pitch variation and encode the pitch corresponding to each phoneme, so that the mel spectrum converted from the phoneme sequence is closer to the standard mel spectrum, more accurate acoustic features are obtained, and the accuracy of speech synthesis can be further improved. The intermediate vector h_l is passed into the fundamental frequency predictor (F0 Predictor) to obtain the fundamental frequency parameter f_l. Then f_l is passed into the fundamental frequency embedding layer (F0-Embedding), and the resulting embedding is added to w_l to obtain s_l. Correspondingly, s_l, σ_l and T_l are passed into the Soft Length Regulator, and Step 3 of the soft length adjuster copies s_l T times to obtain the corresponding copy sequence.
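A sketch of fusing the fundamental frequency and energy information into the intermediate vectors before the soft length adjuster might look as follows; the linear predictors and the bucketing of continuous values into embedding indices are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ProsodyAdapter(nn.Module):
    """Hedged sketch: add fundamental-frequency (F0) and energy embeddings to the
    intermediate vectors h_l before the soft length adjuster. The linear predictors
    and the bucketing of continuous values into embedding indices are assumptions."""
    def __init__(self, hidden_dim=256, n_bins=256):
        super().__init__()
        self.f0_predictor = nn.Linear(hidden_dim, 1)       # placeholder F0 predictor
        self.energy_predictor = nn.Linear(hidden_dim, 1)   # placeholder energy predictor
        self.f0_embedding = nn.Embedding(n_bins, hidden_dim)
        self.energy_embedding = nn.Embedding(n_bins, hidden_dim)
        self.n_bins = n_bins

    def forward(self, h):
        # h: (L, hidden_dim) intermediate vectors from the encoder.
        f = self.f0_predictor(h).squeeze(-1)               # f_l, fundamental frequency parameters
        g = self.energy_predictor(h).squeeze(-1)           # g_l, energy parameters
        f_idx = f.clamp(0, self.n_bins - 1).long()         # assumed bucketing into embedding ids
        g_idx = g.clamp(0, self.n_bins - 1).long()
        w = h + self.energy_embedding(g_idx)               # w_l = h_l + energy embedding
        s = w + self.f0_embedding(f_idx)                   # s_l = w_l + F0 embedding
        return s                                           # passed on with sigma_l and T_l
```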
In some embodiments of the present application, the acoustic model may form an end-to-end speech synthesis model with a vocoder and a front-end model, where the front-end model is used to convert sample text into a sample phoneme sequence, and the vocoder is used to convert a mel spectrum corresponding to the sample phoneme sequence into a speech signal.
Fig. 10 is a flowchart of steps for implementing a speech synthesis method according to one or more embodiments of the present application, where an execution subject of the speech synthesis method may be a server or an electronic device, or may be a functional module or a functional entity in the server or the electronic device capable of implementing the speech synthesis method, which is not limited herein. In the embodiment of the present application, the voice synthesis method may include S1001 to S1005 described below.
S1001, acquiring a voice text to be synthesized.
S1002, determining the length of a target Mel frequency spectrum corresponding to the voice text to be synthesized according to the text length of the voice text to be synthesized.
S1003, converting the voice text to be synthesized into a target phoneme sequence.
S1004, inputting the target phoneme sequence and the length of the target Mel frequency spectrum into a target acoustic model to obtain the target Mel frequency spectrum.
The target acoustic model is obtained through training by the acoustic model training method.
S1005, generating target voice corresponding to the voice text to be synthesized through the target Mel frequency spectrum by the vocoder.
In the embodiment of the application, in the process of training the acoustic model by using the acoustic model training method to obtain the acoustic model for reasoning, the accuracy of phoneme alignment is improved, a stable synthesis effect is achieved, and the synthesis accuracy is improved.
In some embodiments of the present application, as shown in fig. 11 in conjunction with fig. 10, the above S1002 may be specifically implemented by the following S1002a and S1002 b.
S1002a, determining the initial length of a target Mel frequency spectrum corresponding to the voice text to be synthesized according to the text length of the voice text to be synthesized.
S1002b, according to the received speech speed requirement of the synthesized speech, the initial length of the target Mel frequency spectrum is adjusted, and the length of the target Mel frequency spectrum is obtained.
With the above method, a more accurate Mel frequency spectrum can be obtained in the model training stage, and the speech speed of speech synthesis can be controlled in the inference stage by controlling the length T of the Mel frequency spectrum: the larger T is, the slower the speech speed; the smaller T is, the faster the speech speed.
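A minimal sketch of S1002a and S1002b under this convention is given below; the frames-per-character factor and the convention that a speech rate greater than 1.0 means faster speech are assumptions made for the illustration only.

def target_mel_length(text, speech_rate=1.0, frames_per_char=8):
    # S1002a: initial length of the target mel spectrum from the text length (assumed heuristic).
    initial_length = len(text) * frames_per_char
    # S1002b: adjust by the requested speech rate; a larger T gives slower speech, a smaller T faster speech.
    length = int(round(initial_length / speech_rate))
    return max(length, 1)

For example, target_mel_length("hello world", speech_rate=0.8) yields a larger T and thus slower synthesized speech, while speech_rate=1.25 yields a smaller T and faster speech.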
FIG. 12 is a block diagram of an acoustic model training apparatus according to an embodiment of the present application. As shown in FIG. 12, the apparatus includes:
An obtaining module 1201, configured to obtain a sample phoneme sequence, a standard mel spectrum corresponding to the sample phoneme sequence, and a length of the standard mel spectrum;
an input module 1202 for inputting the sample phoneme sequence and the length of the standard mel spectrum into an acoustic model to obtain a mel spectrum corresponding to the sample phoneme sequence, wherein the acoustic model comprises a phoneme embedding layer, an encoder, a length predictor, a length adjuster and a decoder; the phoneme embedding layer is used for obtaining a phoneme vector corresponding to the sample phoneme sequence; the encoder is used for encoding the phoneme vector to obtain an intermediate vector; the length predictor is used for obtaining the length ratio corresponding to each phoneme based on the intermediate vector, and determining the length of the Mel frequency spectrum corresponding to each phoneme based on the length of the standard Mel frequency spectrum and the length ratio, wherein the length ratio corresponding to one phoneme is the ratio of the length of the Mel frequency spectrum corresponding to the one phoneme to the length of the standard Mel frequency spectrum; the length adjuster is used for determining an intermediate sequence based on the intermediate vector and the length of the Mel frequency spectrum corresponding to each phoneme; the decoder is used for decoding the intermediate sequence to obtain a mel frequency spectrum corresponding to the sample phoneme sequence, and the length of the intermediate sequence is the same as that of the standard mel frequency spectrum;
The training module 1203 is configured to train the acoustic model based on the mel spectrum and the standard mel spectrum corresponding to the sample phoneme sequence, and obtain a trained target acoustic model.
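For illustration, a minimal training-step sketch for the above apparatus is given below, assuming the acoustic model exposes the modules described (phoneme embedding layer, encoder, length predictor, length adjuster, decoder) behind a single forward call; the mean-squared-error loss and the optimizer interface are assumed choices, not requirements of the present application.

import torch
import torch.nn.functional as F

def train_step(acoustic_model, optimizer, sample_phonemes, standard_mel):
    # standard_mel: (batch, frames, n_mels); its frame count is the length of the standard mel spectrum.
    standard_mel_length = standard_mel.size(1)
    # Forward pass: phoneme embedding -> encoder -> length predictor -> length adjuster -> decoder,
    # producing a mel spectrum whose length equals that of the standard mel spectrum.
    predicted_mel = acoustic_model(sample_phonemes, standard_mel_length)
    # Train the acoustic model on the gap between the predicted and the standard mel spectrum.
    loss = F.mse_loss(predicted_mel, standard_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()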
In some embodiments of the present application, the acoustic model further comprises: a standard deviation predictor;
the standard deviation predictor is used for obtaining standard deviations corresponding to each phoneme according to the intermediate vectors;
the length adjuster is specifically a soft length adjuster based on Gaussian probability density functions corresponding to each phoneme, wherein the Gaussian probability density function corresponding to one phoneme is determined by a standard deviation corresponding to the one phoneme and a center position corresponding to the one phoneme, and the center position corresponding to the one phoneme is determined according to the length of the mel spectrum corresponding to each phoneme.
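By way of example only, the following sketch shows one way such a soft length adjuster could weight phoneme vectors with per-phoneme Gaussian probability density functions; the center-position formula (cumulative length minus half the phoneme length) and the normalization over phonemes are assumptions made for the illustration.

import torch

def soft_length_adjust(s, lengths, sigma, total_frames):
    # s:            (phonemes, dim) phoneme-level vectors fed to the length adjuster
    # lengths:      (phonemes,)     predicted mel-spectrum length per phoneme
    # sigma:        (phonemes,)     predicted standard deviation per phoneme
    # total_frames: length of the standard (training) or target (inference) mel spectrum
    lengths = lengths.to(s.dtype)
    sigma = sigma.to(s.dtype)
    cum = torch.cumsum(lengths, dim=0)
    centers = cum - lengths / 2.0                               # assumed center position per phoneme
    t = torch.arange(total_frames, dtype=s.dtype).unsqueeze(1)  # frame positions, shape (frames, 1)
    w = torch.exp(-0.5 * ((t - centers) / sigma) ** 2)          # Gaussian density of each frame under each phoneme
    w = w / (w.sum(dim=1, keepdim=True) + 1e-8)                 # normalize over phonemes
    return w @ s                                                # (frames, dim) intermediate sequence

Because these Gaussian weights are differentiable with respect to the predicted lengths and standard deviations, a soft adjuster of this kind can be trained end to end without hard phoneme-to-frame alignments.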
In some embodiments of the present application, the acoustic model further comprises: a fundamental frequency predictor and a fundamental frequency embedding layer;
the fundamental frequency predictor is used for determining a fundamental frequency parameter sequence according to the intermediate vector, and the fundamental frequency embedding layer is used for obtaining a fundamental frequency parameter vector corresponding to the fundamental frequency parameter sequence;
the length adjuster is specifically configured to determine an intermediate sequence based on a first vector and the length of the mel spectrum corresponding to each phoneme, the first vector being a sum of the intermediate vector and the fundamental frequency parameter vector.
In some embodiments of the present application, the acoustic model further comprises: an energy predictor and energy embedding layer;
the energy predictor is used for determining an energy parameter sequence according to the intermediate vector, and the energy embedding layer is used for obtaining an energy parameter vector corresponding to the energy parameter sequence;
the length adjuster is specifically configured to determine an intermediate sequence based on a second vector and a length of a mel spectrum corresponding to each phoneme, the second vector being a sum of the intermediate vector and the energy parameter vector.
In some embodiments of the present application, the acoustic model further comprises: a fundamental frequency predictor and a fundamental frequency embedding layer;
the fundamental frequency predictor is used for determining a fundamental frequency parameter sequence according to the intermediate vector, and the fundamental frequency embedding layer is used for obtaining a fundamental frequency parameter vector corresponding to the fundamental frequency parameter sequence;
the length adjuster is specifically configured to determine an intermediate sequence based on a third vector and the length of the mel spectrum corresponding to each phoneme, the third vector being a sum of the intermediate vector, the fundamental frequency parameter vector, and the energy parameter vector.
In this embodiment of the present application, each module of the acoustic model training device may implement the acoustic model training method provided in the foregoing method embodiment, and may achieve the same technical effects, so that repetition is avoided, and details are not repeated here.
Fig. 13 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application, and as shown in fig. 13, the apparatus includes:
an obtaining module 1301, configured to obtain a voice text to be synthesized;
a determining module 1302, configured to determine, according to a text length of the to-be-synthesized voice text, a length of a target mel spectrum corresponding to the to-be-synthesized voice text;
the conversion module 1303 is configured to convert the speech text to be synthesized into a target phoneme sequence;
an input module 1304, configured to input the target phoneme sequence and the length of the target mel frequency spectrum into a target acoustic model, to obtain a target mel frequency spectrum, where the target acoustic model is obtained by training according to the above-mentioned acoustic model training method;
the generating module 1305 is configured to generate, through a vocoder, the target speech corresponding to the speech text to be synthesized from the target mel spectrum.
In some embodiments of the present application, the determining module 1302 is specifically configured to determine an initial length of a target mel spectrum corresponding to the voice text to be synthesized according to a text length of the voice text to be synthesized; and adjusting the initial length of the target Mel frequency spectrum according to the received speech speed requirement of the synthesized speech to obtain the length of the target Mel frequency spectrum.
In this embodiment of the present application, each module of the speech synthesis apparatus may implement the speech synthesis method provided in the foregoing method embodiment, and may achieve the same technical effects, so that repetition is avoided, and no further description is provided herein.
As shown in fig. 14, the embodiment of the present application further provides a computer device 1400, and the computer device 1400 may be the electronic device or the server described above. The computer device 1400 includes: a processor 1401, a memory 1402, and a computer program stored in the memory 1402 and executable on the processor 1401; when the computer program is executed by the processor 1401, it implements the respective processes performed by the acoustic model training method or the speech synthesis method described above and achieves the same technical effects, which are not repeated here.
The embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor implements each process executed by the above-mentioned acoustic model training method or speech synthesis method, and the same technical effects can be achieved, so that repetition is avoided, and no detailed description is given here.
The computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or the like.
The embodiment of the present application further provides a computer program product which, when run on a computer, causes the computer to implement the acoustic model training method or the speech synthesis method described above.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present application, and not for limiting them; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. An acoustic model training method, comprising:
acquiring a sample phoneme sequence, a standard Mel frequency spectrum corresponding to the sample phoneme sequence and the length of the standard Mel frequency spectrum;
inputting the sample phoneme sequence and the length of the standard mel spectrum into an acoustic model to obtain a mel spectrum corresponding to the sample phoneme sequence, wherein the acoustic model comprises a phoneme embedding layer, an encoder, a length predictor, a length adjuster and a decoder; the phoneme embedding layer is used for obtaining a phoneme vector corresponding to the sample phoneme sequence; the encoder is used for encoding the phoneme vector to obtain an intermediate vector; the length predictor is used for obtaining the length ratio corresponding to each phoneme based on the intermediate vector, and determining the length of the Mel frequency spectrum corresponding to each phoneme based on the length of the standard Mel frequency spectrum and the length ratio, wherein the length ratio corresponding to one phoneme is the ratio of the length of the Mel frequency spectrum corresponding to the one phoneme to the length of the standard Mel frequency spectrum; the length adjuster is used for determining an intermediate sequence based on the intermediate vector and the length of the Mel frequency spectrum corresponding to each phoneme; the decoder is used for decoding the intermediate sequence to obtain a mel frequency spectrum corresponding to the sample phoneme sequence, and the length of the intermediate sequence is the same as that of the standard mel frequency spectrum;
And training an acoustic model based on the Mel frequency spectrum and the standard Mel frequency spectrum corresponding to the sample phoneme sequence to obtain a trained target acoustic model.
2. The method of claim 1, wherein the acoustic model further comprises: a standard deviation predictor;
the standard deviation predictor is used for obtaining standard deviations corresponding to each phoneme according to the intermediate vectors;
the length adjuster is specifically a soft length adjuster based on a Gaussian probability density function corresponding to each phoneme, wherein the Gaussian probability density function corresponding to one phoneme is determined by a standard deviation corresponding to the one phoneme and a center position corresponding to the one phoneme, and the center position corresponding to the one phoneme is determined according to the length of the mel frequency spectrum corresponding to each phoneme.
3. The method according to claim 1 or 2, wherein the acoustic model further comprises: a fundamental frequency predictor and a fundamental frequency embedding layer;
the fundamental frequency predictor is used for determining a fundamental frequency parameter sequence according to the intermediate vector, and the fundamental frequency embedding layer is used for obtaining a fundamental frequency parameter vector corresponding to the fundamental frequency parameter sequence;
the length adjuster is specifically configured to determine an intermediate sequence based on a first vector and the length of the mel spectrum corresponding to each phoneme, where the first vector is a sum of the intermediate vector and the fundamental frequency parameter vector.
4. The method according to claim 1 or 2, wherein the acoustic model further comprises: an energy predictor and energy embedding layer;
the energy predictor is used for determining an energy parameter sequence according to the intermediate vector, and the energy embedding layer is used for obtaining an energy parameter vector corresponding to the energy parameter sequence;
the length adjuster is specifically configured to determine an intermediate sequence based on a second vector and a length of mel spectrum corresponding to each phoneme, where the second vector is a sum of the intermediate vector and the energy parameter vector.
5. The method of claim 4, wherein the acoustic model further comprises: a fundamental frequency predictor and a fundamental frequency embedding layer;
the fundamental frequency predictor is used for determining a fundamental frequency parameter sequence according to the intermediate vector, and the fundamental frequency embedding layer is used for obtaining a fundamental frequency parameter vector corresponding to the fundamental frequency parameter sequence;
the length adjuster is specifically configured to determine an intermediate sequence based on a third vector and the length of the mel spectrum corresponding to each phoneme, where the third vector is a sum of the intermediate vector, the fundamental frequency parameter vector, and the energy parameter vector.
6. A method of speech synthesis, the method comprising:
Acquiring a voice text to be synthesized;
determining the length of a target Mel frequency spectrum corresponding to the voice text to be synthesized according to the text length of the voice text to be synthesized;
converting the voice text to be synthesized into a target phoneme sequence;
inputting the target phoneme sequence and the length of the target mel frequency spectrum into a target acoustic model to obtain a target mel frequency spectrum, wherein the target acoustic model is obtained through training by the acoustic model training method according to any one of claims 1 to 5;
and generating, through a vocoder, target voice corresponding to the voice text to be synthesized from the target Mel frequency spectrum.
7. The method of claim 6, wherein the determining, according to the text length of the voice text to be synthesized, the length of the target mel spectrum corresponding to the voice text to be synthesized comprises:
determining the initial length of a target Mel frequency spectrum corresponding to the voice text to be synthesized according to the text length of the voice text to be synthesized;
and adjusting the initial length of the target Mel frequency spectrum according to the received speech speed requirement of the synthesized speech to obtain the length of the target Mel frequency spectrum.
8. An acoustic model training device, comprising:
The acquisition module is used for acquiring the sample phoneme sequence, the standard Mel frequency spectrum corresponding to the sample phoneme sequence and the length of the standard Mel frequency spectrum;
the input module is used for inputting the sample phoneme sequence and the length of the standard mel frequency spectrum into an acoustic model to obtain a mel frequency spectrum corresponding to the sample phoneme sequence, and the acoustic model comprises a phoneme embedding layer, an encoder, a length predictor, a length adjuster and a decoder; the phoneme embedding layer is used for obtaining a phoneme vector corresponding to the sample phoneme sequence; the encoder is used for encoding the phoneme vector to obtain an intermediate vector; the length predictor is used for obtaining the length ratio corresponding to each phoneme based on the intermediate vector, and determining the length of the Mel frequency spectrum corresponding to each phoneme based on the length of the standard Mel frequency spectrum and the length ratio, wherein the length ratio corresponding to one phoneme is the ratio of the length of the Mel frequency spectrum corresponding to the one phoneme to the length of the standard Mel frequency spectrum; the length adjuster is used for determining an intermediate sequence based on the intermediate vector and the length of the Mel frequency spectrum corresponding to each phoneme; the decoder is used for decoding the intermediate sequence to obtain a mel frequency spectrum corresponding to the sample phoneme sequence, and the length of the intermediate sequence is the same as that of the standard mel frequency spectrum;
And the training module is used for training the acoustic model based on the Mel frequency spectrum corresponding to the sample phoneme sequence and the standard Mel frequency spectrum to obtain a trained target acoustic model.
9. A speech synthesis apparatus, comprising:
the acquisition module is used for acquiring a voice text to be synthesized;
the determining module is used for determining the length of a target Mel frequency spectrum corresponding to the voice text to be synthesized according to the text length of the voice text to be synthesized;
the conversion module is used for converting the voice text to be synthesized into a target phoneme sequence;
an input module, configured to input the target phoneme sequence and the length of the target mel spectrum into a target acoustic model, to obtain a target mel spectrum, where the target acoustic model is obtained by training the acoustic model training method according to any one of claims 1 to 5;
and the generating module is used for generating, through a vocoder, the target voice corresponding to the voice text to be synthesized from the target Mel frequency spectrum.
10. A computer device, comprising: a memory and a processor, the memory for storing a computer program; the processor is configured to perform the acoustic model training method of any one of claims 1 to 5, or the speech synthesis method of claim 6 or 7, when the computer program is invoked.
CN202310133141.XA 2023-02-17 2023-02-17 Acoustic model training method, voice synthesis method, device and computer equipment Pending CN116312458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310133141.XA CN116312458A (en) 2023-02-17 2023-02-17 Acoustic model training method, voice synthesis method, device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310133141.XA CN116312458A (en) 2023-02-17 2023-02-17 Acoustic model training method, voice synthesis method, device and computer equipment

Publications (1)

Publication Number Publication Date
CN116312458A true CN116312458A (en) 2023-06-23

Family

ID=86829769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310133141.XA Pending CN116312458A (en) 2023-02-17 2023-02-17 Acoustic model training method, voice synthesis method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN116312458A (en)

Similar Documents

Publication Publication Date Title
CN112802448B (en) Speech synthesis method and system for generating new tone
EP4118641A1 (en) Speech recognition using unspoken text and speech synthesis
KR20210007786A (en) Vision-assisted speech processing
CN111161695B (en) Song generation method and device
CN112185363B (en) Audio processing method and device
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN111554281B (en) Vehicle-mounted man-machine interaction method for automatically identifying languages, vehicle-mounted terminal and storage medium
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN113053357A (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN111369968A (en) Sound reproduction method, device, readable medium and electronic equipment
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
CN112185340B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111833878A (en) Chinese voice interaction non-inductive control system and method based on raspberry Pi edge calculation
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
CN113421571B (en) Voice conversion method and device, electronic equipment and storage medium
CN116129859A (en) Prosody labeling method, acoustic model training method, voice synthesis method and voice synthesis device
CN116312458A (en) Acoustic model training method, voice synthesis method, device and computer equipment
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
CN114333758A (en) Speech synthesis method, apparatus, computer device, storage medium and product
CN113314101A (en) Voice processing method and device, electronic equipment and storage medium
Nikitaras et al. Fine-grained noise control for multispeaker speech synthesis
CN113223513A (en) Voice conversion method, device, equipment and storage medium
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination