CN107871494A - Voice synthesis method and device and electronic equipment - Google Patents

Voice synthesis method and device and electronic equipment

Info

Publication number
CN107871494A
Authority
CN
China
Prior art keywords: parameter, audio, amplitude, speech
Prior art date
Legal status
Granted
Application number
CN201610849422.5A
Other languages
Chinese (zh)
Other versions
CN107871494B (en)
Inventor
宋阳
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201610849422.5A
Publication of CN107871494A
Application granted
Publication of CN107871494B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The invention discloses a speech synthesis method, apparatus and electronic device. The method includes: extracting the fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text; performing audio limiting and filtering according to the amplitude parameters to obtain the spectral parameters of the fixed-component text audio; and, when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized. In this technical solution, audio limiting and filtering give the audio a more balanced amplitude and a more uniform character, so that the spectral parameters match the timbre of purely parametric synthesized speech (the non-fixed-component text). When speech is then synthesized based on the fundamental frequency parameters and spectral parameters of the fixed-component text, the timbre of the fixed-component text is consistent with that of the non-fixed-component text, solving the prior-art technical problem of inconsistent timbre in parametric speech synthesis.

Description

Voice synthesis method and device and electronic equipment
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a method and an apparatus for speech synthesis, and an electronic device.
Background
Parametric speech synthesis is currently a mainstream speech synthesis technology. It occupies little storage space, offers high real-time computational performance, and has broad application prospects in intelligent terminals and embedded devices.
The text to be synthesized is usually composed of fixed, invariant components (i.e., fixed-component text) and variable components (i.e., non-fixed-component text). In the prior art, during speech synthesis, the speech segments for the fixed-component text are obtained by prerecording natural speech, speech synthesis is performed on the variable-component text to obtain the other speech segments, and the two kinds of speech segments are then spliced into the final continuous speech signal. Because natural speech differs greatly in timbre from speech synthesized by an electronic device, speech obtained by splicing fixed-component text rendered as natural speech with non-fixed-component text rendered as synthesized speech suffers from inconsistent timbre.
Therefore, the prior art has the technical problem of inconsistent timbre in parametric speech synthesis.
Disclosure of Invention
Embodiments of the present invention provide a speech synthesis method, apparatus and electronic device, which are used to solve the technical problem of inconsistent timbre in parametric speech synthesis in the prior art.
An embodiment of the present application provides a speech synthesis method, comprising:
extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
Optionally, the performing audio limiting and filtering according to the amplitude parameters to obtain the spectral parameters of the fixed-component text audio includes:
normalizing the amplitude parameters, and converting the normalized parameters into a sequence of decibel values;
performing audio limiting on the sequence of decibel values;
performing inverse amplitude normalization on the audio amplitudes after the audio limiting to obtain processed amplitude parameters;
and performing filtering according to the processed amplitude parameters to obtain the spectral parameters.
Optionally, the normalizing of the amplitude parameters includes: normalizing the amplitude parameters according to a normalization formula to obtain the normalized parameter y1, where scale represents the normalization coefficient, y represents the amplitude parameter, and n represents the number of quantization bits of the fixed-component text audio.
Optionally, the converting of the normalized parameters into a sequence of decibel values includes:
converting each point x1 in the normalized parameters into a corresponding decibel value y2 according to the following formula:
y2 = 20 * log10(abs(x1))
All of the y2 values form the sequence of decibel values.
Optionally, the performing audio limiting on the sequence of decibel values includes:
processing each decibel value in the sequence of decibel values by a limiting formula in which ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the above sequence, and y3 represents the target decibel value obtained by limiting.
Optionally, the performing inverse amplitude normalization on the target decibel values after the audio limiting to obtain the processed amplitude parameters uses an inverse normalization formula in which scale represents the normalization coefficient, y3 represents the target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
Optionally, the performing filtering according to the processed amplitude parameters to obtain the spectral parameters includes:
extracting spectral envelope parameters from the processed amplitude parameters, and performing a filtering operation on the extracted spectral envelope parameters;
extracting Mel cepstrum or line spectrum pair parameters from the filtered spectral envelope parameters;
and taking the extracted Mel cepstrum or line spectrum pair parameters as the spectral parameters.
Optionally, after synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, the method further includes:
warping each point of the audio sequence of the synthesized speech using a warping formula in which y_tts represents a preset audio sequence, Y_nat represents the audio sequence of the synthesized speech, y'_nat represents each point in the warped audio sequence, and y_nat represents each point in the audio sequence before warping.
An embodiment of the present application further provides a speech synthesis apparatus, comprising:
an extraction unit, configured to extract fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
a spectrum acquisition unit, configured to perform audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
and a synthesis unit, configured to synthesize speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized when synthesizing speech.
Optionally, the spectrum acquisition unit includes:
a conversion subunit, configured to normalize the amplitude parameters and convert the normalized parameters into a sequence of decibel values;
a limiting subunit, configured to perform audio limiting on the sequence of decibel values;
the conversion subunit being further configured to perform inverse amplitude normalization on the audio amplitudes after the audio limiting to obtain processed amplitude parameters;
and a filtering subunit, configured to perform filtering according to the processed amplitude parameters to obtain the spectral parameters.
Optionally, the conversion subunit is configured to normalize the amplitude parameters according to a normalization formula to obtain the normalized parameter y1, where scale represents the normalization coefficient, y represents the amplitude parameter, and n represents the number of quantization bits of the fixed-component text audio.
Optionally, the conversion subunit is further configured to:
convert each point x1 in the normalized parameters into a corresponding decibel value y2 according to the following formula:
y2 = 20 * log10(abs(x1))
All of the y2 values form the sequence of decibel values.
Optionally, the limiting subunit is configured to process each decibel value in the sequence of decibel values by a limiting formula in which ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the sequence, and y3 represents the target decibel value obtained by limiting.
Optionally, the conversion subunit is further configured to obtain the processed amplitude parameters by an inverse normalization formula in which scale represents the normalization coefficient, y3 represents the target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
Optionally, the filtering subunit is configured to:
extract spectral envelope parameters from the processed amplitude parameters, and perform a filtering operation on the extracted spectral envelope parameters;
extract Mel cepstrum or line spectrum pair parameters from the filtered spectral envelope parameters;
and take the extracted Mel cepstrum or line spectrum pair parameters as the spectral parameters.
Optionally, the apparatus further comprises:
a warping unit, configured to, after speech is synthesized based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, warp each point of the audio sequence of the synthesized speech using a warping formula in which y_tts represents a preset audio sequence, Y_nat represents the audio sequence of the synthesized speech, y'_nat represents each point in the warped audio sequence, and y_nat represents each point in the audio sequence before warping.
Embodiments of the present application also provide an electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
One or more of the technical solutions in the embodiments of the present application have at least the following technical effects:
The method extracts the fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text, then performs audio limiting and filtering on the extracted amplitude parameters to obtain the spectral parameters of the fixed-component text audio. The limiting and filtering give the audio a more balanced amplitude and a more uniform character, so that the resulting spectral parameters match the timbre of purely parametric synthesized speech (the non-fixed-component text). When speech is then synthesized based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, the timbre of the fixed-component text is consistent with that of the non-fixed-component text, solving the prior-art problem of inconsistent timbre in parametric speech synthesis. Moreover, because the fundamental frequency parameters of the recording are used during synthesis, the prosody of the synthesized speech matches that of natural speech and is more expressive, achieving the beneficial effects of ensuring overall timbre consistency of the synthesized speech and improving its expressiveness.
Drawings
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 3 is a schematic diagram of an electronic device for implementing a speech synthesis method according to an embodiment of the present application.
Detailed Description
In the technical solution provided by the embodiments of the present application, the parameters of the fixed-component text are adjusted: the spectral parameters are obtained by applying audio limiting and filtering to the amplitude parameters extracted from the recording, so that they are consistent with the spectral parameters of the non-fixed-component text, which solves the prior-art technical problem of inconsistent timbre in parametric speech synthesis.
The main implementation principles, specific implementations, and corresponding beneficial effects of the technical solutions of the embodiments of the present application are explained in detail below with reference to the accompanying drawings.
Examples
Referring to fig. 1, an embodiment of the present application provides a speech synthesis method, including:
S101: extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
S102: performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
S103: when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
In a specific implementation, a template library may be established before speech is synthesized, storing the fundamental frequency parameters and spectral parameters of fixed-component texts. When the template library is established, commonly used texts are recorded and the fixed-component texts are extracted from them; for example, the fixed-component text "the flight is about to take off" may be extracted from a recording of the form "the flight bound for <destination> is about to take off". For the recording of the fixed-component text, S101 is executed to extract the fundamental frequency parameters and amplitude parameters of the fixed-component text audio from it.
To make the timbre of the fixed-component text consistent with that of the non-fixed-component text, this embodiment processes the extracted amplitude parameters: S102 is executed to perform audio limiting and filtering according to the extracted amplitude parameters, obtaining the spectral parameters of the fixed-component text audio. Optionally, this may be done through the following steps:
step 1: and carrying out normalization processing on the amplitude parameters, and converting the normalized parameters into decibel value sequences. Optionally, normalization processing may be performed on the extracted amplitude parameters according to the following formula one and formula two pairs, and the normalized parameter y is obtained 1
Scale represents a normalization coefficient, y represents an amplitude parameter (namely, an amplitude sequence of the fixed-component text audio, which includes a plurality of amplitude values), n represents the quantization bit number of the fixed-component text audio, abs represents an absolute value, and max represents a maximum value. Since the amplitude parameter is a sequence of multiple amplitude values, each amplitude value corresponds to a point after normalization, and for this purpose, y obtained by normalization 1 Comprising a plurality of dots. By normalizing the amplitude parameters, the effect consistency of subsequent amplitude operation on different audios is ensured.
Further, y can be performed on the normalized parameters according to the formula three 1 Convert y into 1 Each point x in (1) 1 Into a corresponding decibel value y 2
y 2 =20*log 10 (abs(x 1 ) Equation three)
From all y converted to 2 A sequence of decibel values of the extracted amplitude parameter is constructed.
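As an illustration, the following Python sketch implements step 1 under stated assumptions: formulas one and two appear as images in the original publication and are not reproduced in this text, so the sketch assumes a conventional peak normalization in which the coefficient scale maps the largest absolute sample to full scale; formula three is taken verbatim from the text. The function name to_decibel_sequence and the guard against log10(0) are illustrative additions, not part of the patent.

import numpy as np

def to_decibel_sequence(y, n_bits=16):
    # Step 1 sketch: normalize an amplitude sequence, then convert to decibels.
    # The peak normalization below is an assumed reading of formulas one and two.
    y = np.asarray(y, dtype=np.float64)
    full_scale = 2.0 ** (n_bits - 1)            # e.g. 32768 for 16-bit audio
    scale = full_scale / np.max(np.abs(y))      # assumed normalization coefficient
    y1 = y * scale / full_scale                 # normalized points x1 in [-1, 1]
    eps = np.finfo(np.float64).tiny             # avoid log10(0) on silent samples
    y2 = 20.0 * np.log10(np.maximum(np.abs(y1), eps))  # formula three
    return y1, y2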
Step 2: perform audio limiting on the converted sequence of decibel values. The audio limiting may be performed by an audio limiter, and this embodiment does not restrict the type of limiter. Taking formula four as an example, each decibel value in the converted sequence is processed, where ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the above sequence, and y3 represents the target decibel value obtained by limiting. In practice, the limiting ratio and boundary value may be adjusted by the designer; empirically, ratio is typically set to 0.7 and border to -10. Because the audio of the non-fixed-component text is predicted by a statistical model, and a statistical model has an averaging effect, its volume is relatively balanced. Audio limiting makes the quieter sounds in the recording louder and the louder sounds quieter, so the differences within the fixed-component text audio become smaller and its volume more balanced, achieving amplitude behavior consistent with the non-fixed-component text audio.
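A minimal sketch of step 2 follows. Formula four is likewise not reproduced in this text, so the sketch assumes one standard limiter form: decibel values above the boundary border are scaled toward it by ratio, while values below it pass through unchanged; limit_decibels is an illustrative name.

import numpy as np

def limit_decibels(y2, ratio=0.7, border=-10.0):
    # Step 2 sketch: assumed limiter over a sequence of decibel values.
    # ratio=0.7 and border=-10 are the empirical settings named in the text.
    y2 = np.asarray(y2, dtype=np.float64)
    y3 = np.where(y2 > border,
                  border + ratio * (y2 - border),  # compress the excess over the boundary
                  y2)                              # leave quieter values untouched
    return y3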
Step 3: perform inverse amplitude normalization on the target decibel values after audio limiting to obtain the processed amplitude parameters. Specifically, the processed amplitude parameters may be obtained according to formula five, where scale represents the normalization coefficient, y3 represents a target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
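The sketch below inverts the assumed steps above in place of formula five, which is also not reproduced. Because the decibel conversion discards sign via abs(), the sketch has to reintroduce it from the original samples; the sign argument and the function name from_decibels are illustrative.

import numpy as np

def from_decibels(y3, sign, scale, n_bits=16):
    # Step 3 sketch: inverse amplitude normalization of limited decibel values.
    full_scale = 2.0 ** (n_bits - 1)
    magnitude = 10.0 ** (np.asarray(y3, dtype=np.float64) / 20.0)  # undo formula three
    y4 = sign * magnitude * full_scale / scale   # undo the assumed normalization
    return y4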
Step 4: perform filtering according to the processed amplitude parameters to obtain the spectral parameters of the fixed-component text audio. Specifically, spectral envelope parameters may be extracted from the processed amplitude parameters and a filtering operation performed on the extracted spectral envelope parameters; Mel cepstrum or line spectrum pair parameters are then extracted from the filtered spectral envelope parameters and taken as the spectral parameters. Assuming the spectral envelope is M-dimensional with T frames in total, the filtering operation is performed on the extracted M × T matrix. This embodiment selects two-dimensional median filtering: a window of w1 × w2 is chosen (w1 and w2 may be adjusted to the actual situation; for example, w1 may be set to 81 and w2 to 5), and in the M × T matrix each point is replaced by the median of the values in its neighboring w1 × w2 window. The filtering operation may also use two-dimensional mean filtering or other filters, which are not detailed here.
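The median-filtering step maps directly onto a standard library call; the sketch below uses scipy.signal.medfilt2d with the example window sizes from the text. The zero-padded borders are a medfilt2d default the patent does not specify, and the function name smooth_spectral_envelope is illustrative.

import numpy as np
from scipy.signal import medfilt2d

def smooth_spectral_envelope(envelope, w1=81, w2=5):
    # Step 4 sketch: replace each point of the M x T spectral-envelope matrix
    # with the median of its w1 x w2 neighborhood (window sizes must be odd).
    envelope = np.asarray(envelope, dtype=np.float64)  # shape (M, T)
    return medfilt2d(envelope, kernel_size=[w1, w2])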
The fundamental frequency parameters of the fixed-component text audio extracted from the recording, together with the spectral parameters obtained by the above processing, may be stored in the template library, so that S103 can synthesize speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized. Specifically, when synthesizing speech, the text to be synthesized may be analyzed to separate the fixed-component text from the non-fixed-component text; the fundamental frequency parameters and spectral parameters corresponding to the fixed-component text are fetched from the template library, and the speech is synthesized by a vocoder together with the fundamental frequency parameters and spectral parameters of the non-fixed-component text.
In a specific implementation, the speech corresponding to the fixed-component text and the speech of the non-fixed-component text in the synthesized result may still be inconsistent in energy, producing abrupt changes in volume. For this reason, in an optional implementation of this embodiment, a regularization (volume warping) process may be performed after the speech is synthesized: each point of the audio sequence of the synthesized speech is warped using formula six, where y_tts represents a preset audio sequence, which may be speech synthesized by a conventional speech synthesis system (for example, the fundamental frequency and spectrum of both the fixed-component and non-fixed-component text are predicted by the statistical model and the audio is synthesized by the vocoder); Y_nat represents the audio sequence synthesized by the method of this embodiment, i.e., the audio sequence before warping, in which the fundamental frequency and spectrum of the fixed-component text use the parameters obtained after limiting and filtering and the audio is then synthesized by the vocoder; y'_nat represents each point in the warped audio sequence; and y_nat represents each point in the audio sequence before warping. Regularizing the synthesized speech ensures consistency of the overall speech volume.
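Formula six is not reproduced either; given the symbol definitions, one natural reading is a single global gain that matches the peak of the synthesized sequence to that of the reference sequence, which the sketch below implements. This peak-matching form is an assumption, and warp_volume is an illustrative name.

import numpy as np

def warp_volume(y_nat, y_tts):
    # Volume-regularization sketch: scale every point of the synthesized
    # sequence Y_nat by one gain so its peak matches the reference y_tts.
    y_nat = np.asarray(y_nat, dtype=np.float64)
    y_tts = np.asarray(y_tts, dtype=np.float64)
    gain = np.max(np.abs(y_tts)) / np.max(np.abs(y_nat))
    return y_nat * gain   # y'_nat: the warped audio sequence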
In the above technical solution, audio limiting and spectral filtering make the spectrum of natural speech (i.e., the recording) consistent with that of parametrically synthesized speech, so that their timbres match. At the same time, using the fundamental frequency parameters of the fixed-component text recording ensures that the prosody of the fixed-component text is the same as that of natural speech, which improves the expressiveness of the synthesized speech. Finally, regularizing the volume of the synthesized speech ensures that its volume is consistent throughout.
Referring to fig. 2, based on the speech synthesis method provided in the foregoing embodiment, an embodiment of the present application further provides a speech synthesis apparatus, including:
an extraction unit 21, configured to extract fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
a spectrum acquisition unit 22, configured to perform audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
and a synthesis unit 23, configured to synthesize speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized when synthesizing speech.
In a specific implementation, the spectrum acquisition unit 22 may include a conversion subunit, a limiting subunit, and a filtering subunit. The conversion subunit is configured to normalize the amplitude parameters and convert the normalized parameters into a sequence of decibel values; the limiting subunit is configured to perform audio limiting on the sequence of decibel values; the conversion subunit is further configured to perform inverse amplitude normalization on the audio amplitudes after the audio limiting to obtain processed amplitude parameters; and the filtering subunit is configured to perform filtering according to the processed amplitude parameters to obtain the spectral parameters.
The conversion subunit may normalize the amplitude parameters according to a normalization formula to obtain the normalized parameter y1, where scale represents the normalization coefficient, y represents the amplitude parameter, and n represents the number of quantization bits of the fixed-component text audio.
After obtaining the normalized parameters, the conversion subunit may further convert each point x1 in the normalized parameters into a corresponding decibel value y2 according to the following formula:
y2 = 20 * log10(abs(x1))
All of the y2 values form the sequence of decibel values.
The limiting subunit may be configured to process each decibel value in the sequence of decibel values by a limiting formula in which ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the sequence, and y3 represents the target decibel value obtained by limiting.
Further, the conversion subunit may be configured to obtain the processed amplitude parameters by an inverse normalization formula in which scale represents the normalization coefficient, y3 represents the target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
When obtaining the spectral parameters, the filtering subunit may extract spectral envelope parameters from the processed amplitude parameters and perform a filtering operation on the extracted spectral envelope parameters; extract Mel cepstrum or line spectrum pair parameters from the filtered spectral envelope parameters; and take the extracted Mel cepstrum or line spectrum pair parameters as the spectral parameters.
In a specific implementation, the speech synthesis apparatus provided by this embodiment may further include:
a warping unit 24, configured to, after speech is synthesized based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, warp each point of the audio sequence of the synthesized speech using a warping formula in which y_tts represents a preset audio sequence, Y_nat represents the audio sequence of the synthesized speech, y'_nat represents each point in the warped audio sequence, and y_nat represents each point in the audio sequence before warping.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
FIG. 3 is a block diagram illustrating an electronic device 800 for implementing a method of speech synthesis according to an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; it may also detect a change in the position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in its temperature. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Also provided is a non-transitory computer-readable storage medium having instructions that, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a speech synthesis method comprising: extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text; performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio; and, when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method of speech synthesis, the method comprising:
extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
2. The method of claim 1, wherein the performing audio limiting and filtering according to the amplitude parameters to obtain the spectral parameters of the fixed-component text audio comprises:
normalizing the amplitude parameters, and converting the normalized parameters into a sequence of decibel values;
performing audio limiting on the sequence of decibel values;
performing inverse amplitude normalization on the audio amplitudes after the audio limiting to obtain processed amplitude parameters;
and performing filtering according to the processed amplitude parameters to obtain the spectral parameters.
3. The method of claim 2, wherein the normalizing of the amplitude parameters comprises:
normalizing the amplitude parameters according to a normalization formula to obtain the normalized parameter y1, wherein scale represents the normalization coefficient, y represents the amplitude parameter, and n represents the number of quantization bits of the fixed-component text audio.
4. The method of claim 2, wherein the converting of the normalized parameters into a sequence of decibel values comprises:
converting each point x1 in the normalized parameters into a corresponding decibel value y2 according to the following formula:
y2 = 20 * log10(abs(x1))
wherein all of the y2 values form the sequence of decibel values.
5. The method of claim 2, wherein the performing audio limiting on the sequence of decibel values comprises:
processing each decibel value in the sequence of decibel values by a limiting formula, wherein ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the sequence, and y3 represents the target decibel value obtained by limiting.
6. The method of claim 2, wherein the performing inverse amplitude normalization on the target decibel values after the audio limiting to obtain the processed amplitude parameters uses an inverse normalization formula wherein scale represents the normalization coefficient, y3 represents the target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
7. The method of claim 2, wherein the performing filtering according to the processed amplitude parameters to obtain the spectral parameters comprises:
extracting spectral envelope parameters from the processed amplitude parameters, and performing a filtering operation on the extracted spectral envelope parameters;
extracting Mel cepstrum or line spectrum pair parameters from the filtered spectral envelope parameters;
and taking the extracted Mel cepstrum or line spectrum pair parameters as the spectral parameters.
8. The method according to any one of claims 1 to 7, wherein after synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, the method further comprises:
warping each point of the audio sequence of the synthesized speech using a warping formula, wherein y_tts represents a preset audio sequence, Y_nat represents the audio sequence of the synthesized speech, y'_nat represents each point in the warped audio sequence, and y_nat represents each point in the audio sequence before warping.
9. An apparatus for speech synthesis, comprising:
an extraction unit, configured to extract fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
a spectrum acquisition unit, configured to perform audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
and a synthesis unit, configured to synthesize speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized when synthesizing speech.
10. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
CN201610849422.5A 2016-09-23 2016-09-23 Voice synthesis method and device and electronic equipment Active CN107871494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610849422.5A CN107871494B (en) 2016-09-23 2016-09-23 Voice synthesis method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107871494A (en) 2018-04-03
CN107871494B CN107871494B (en) 2020-12-11

Family

ID=61751192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610849422.5A Active CN107871494B (en) 2016-09-23 2016-09-23 Voice synthesis method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107871494B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH037999A (en) * 1989-06-05 1991-01-16 Matsushita Electric Works Ltd Voice output device
WO2000060575A1 (en) * 1999-04-05 2000-10-12 Hughes Electronics Corporation A voicing measure as an estimate of signal periodicity for a frequency domain interpolative speech codec system
CN1835075A (en) * 2006-04-07 2006-09-20 安徽中科大讯飞信息科技有限公司 Speech synthetizing method combined natural sample selection and acaustic parameter to build mould
CN1945691A (en) * 2006-10-16 2007-04-11 安徽中科大讯飞信息科技有限公司 Method for improving template sentence synthetic effect in voice synthetic system
US20080243511A1 (en) * 2006-10-24 2008-10-02 Yusuke Fujita Speech synthesizer
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN103247295A (en) * 2008-05-29 2013-08-14 高通股份有限公司 Systems, methods, apparatus, and computer program products for spectral contrast enhancement
CN201422103Y (en) * 2009-06-18 2010-03-10 安徽汇鑫电子有限公司 Audio frequency processor
US20120053933A1 (en) * 2010-08-30 2012-03-01 Kabushiki Kaisha Toshiba Speech synthesizer, speech synthesis method and computer program product
CN103065619A (en) * 2012-12-26 2013-04-24 安徽科大讯飞信息科技股份有限公司 Speech synthesis method and speech synthesis system
CN104485099A (en) * 2014-12-26 2015-04-01 中国科学技术大学 Method for improving naturalness of synthetic speech

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584859A (en) * 2018-11-07 2019-04-05 上海指旺信息科技有限公司 Phoneme synthesizing method and device
CN110020616A (en) * 2019-03-26 2019-07-16 深兰科技(上海)有限公司 A kind of target identification method and equipment
WO2021051765A1 (en) * 2019-09-17 2021-03-25 北京京东尚科信息技术有限公司 Speech synthesis method and apparatus, and storage medium
CN110930977A (en) * 2019-11-12 2020-03-27 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN111328008A (en) * 2020-02-24 2020-06-23 广州市迪士普音响科技有限公司 Sound pressure level intelligent control method based on sound amplification system
CN111328008B (en) * 2020-02-24 2021-11-05 广州市迪士普音响科技有限公司 Sound pressure level intelligent control method based on sound amplification system
CN111883103A (en) * 2020-06-19 2020-11-03 马上消费金融股份有限公司 Method and device for synthesizing voice
CN111883103B (en) * 2020-06-19 2021-12-24 马上消费金融股份有限公司 Method and device for synthesizing voice
CN113744716A (en) * 2021-10-19 2021-12-03 北京房江湖科技有限公司 Method and apparatus for synthesizing speech
CN113744716B (en) * 2021-10-19 2023-08-29 北京房江湖科技有限公司 Method and apparatus for synthesizing speech

Also Published As

Publication number Publication date
CN107871494B (en) 2020-12-11

Similar Documents

Publication Publication Date Title
CN107871494B (en) Voice synthesis method and device and electronic equipment
CN107705783B (en) Voice synthesis method and device
US11430427B2 (en) Method and electronic device for separating mixed sound signal
CN110136692B (en) Speech synthesis method, apparatus, device and storage medium
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN110097890B (en) Voice processing method and device for voice processing
CN111583944A (en) Sound changing method and device
CN109887515B (en) Audio processing method and device, electronic equipment and storage medium
CN110890083B (en) Audio data processing method and device, electronic equipment and storage medium
CN109410973B (en) Sound changing processing method, device and computer readable storage medium
CN111326138A (en) Voice generation method and device
EP3340077B1 (en) Method and apparatus for inputting expression information
CN110677734B (en) Video synthesis method and device, electronic equipment and storage medium
CN110931028B (en) Voice processing method and device and electronic equipment
CN107437412B (en) Acoustic model processing method, voice synthesis method, device and related equipment
CN110415702A (en) Training method and device, conversion method and device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
EP4050601B1 (en) Method and apparatus for audio processing, terminal and storage medium
CN113301372A (en) Live broadcast method, device, terminal and storage medium
CN110148424B (en) Voice processing method and device, electronic equipment and storage medium
CN111373409A (en) Method and terminal for acquiring color value change
CN113409765A (en) Voice synthesis method and device for voice synthesis
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN111063365B (en) Voice processing method and device and electronic equipment
CN113113036B (en) Audio signal processing method and device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant