CN107871494A - Voice synthesis method and device and electronic equipment - Google Patents

Voice synthesis method and device and electronic equipment

Info

Publication number
CN107871494A
Authority
CN
China
Prior art keywords: parameter, audio, amplitude, speech
Prior art date
Legal status
Granted
Application number
CN201610849422.5A
Other languages
Chinese (zh)
Other versions
CN107871494B (en)
Inventor
宋阳
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201610849422.5A
Publication of CN107871494A
Application granted
Publication of CN107871494B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The invention discloses a speech synthesis method, apparatus and electronic device. The method includes: extracting the fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text; performing audio limiting and filtering according to the amplitude parameters to obtain the spectral parameters of the fixed-component text audio; and, when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized. In this technical solution, audio limiting and filtering give the audio a more balanced amplitude and a more uniform character, so that the spectral parameters match the timbre of purely parametric synthesized speech (the non-fixed-component text). When speech is then synthesized based on the fundamental frequency parameters and spectral parameters of the fixed-component text, the timbre of the fixed-component text is consistent with that of the non-fixed-component text, solving the prior-art technical problem of inconsistent timbre in parametric speech synthesis.

Description

Voice synthesis method and device and electronic equipment
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a method and an apparatus for speech synthesis, and an electronic device.
Background
Parametric speech synthesis is currently a mainstream speech synthesis technology. It occupies little storage space, offers high real-time computational performance, and has broad application prospects in intelligent terminals and embedded devices.
The text to be synthesized is usually composed of fixed, invariant components (i.e., fixed-component text) and variable components (i.e., non-fixed-component text). In the prior art, during speech synthesis, the speech segments for the fixed-component text are obtained by prerecording natural speech, speech synthesis is performed on the variable-component text to obtain the other speech segments, and the two kinds of speech segments are then spliced into the final continuous speech signal. Because natural speech differs greatly in timbre from speech synthesized by an electronic device, speech obtained by splicing fixed-component text rendered as natural speech with non-fixed-component text rendered as synthesized speech suffers from inconsistent timbre.
Therefore, the prior art has the technical problem of inconsistent timbre in parametric speech synthesis.
Disclosure of Invention
Embodiments of the present invention provide a speech synthesis method, apparatus and electronic device, which are used to solve the technical problem of inconsistent timbre in parametric speech synthesis in the prior art.
An embodiment of the present application provides a speech synthesis method, comprising:
extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
Optionally, the performing audio limiting and filtering according to the amplitude parameters to obtain the spectral parameters of the fixed-component text audio includes:
normalizing the amplitude parameters, and converting the normalized parameters into a sequence of decibel values;
performing audio limiting on the sequence of decibel values;
performing inverse amplitude normalization on the audio amplitudes after the audio limiting to obtain processed amplitude parameters;
and performing filtering according to the processed amplitude parameters to obtain the spectral parameters.
Optionally, the normalizing of the amplitude parameters includes: normalizing the amplitude parameters according to a normalization formula to obtain the normalized parameter y1, where scale represents the normalization coefficient, y represents the amplitude parameter, and n represents the number of quantization bits of the fixed-component text audio.
Optionally, the converting of the normalized parameters into a sequence of decibel values includes:
converting each point x1 in the normalized parameters into a corresponding decibel value y2 according to the following formula:
y2 = 20 * log10(abs(x1))
All of the y2 values form the sequence of decibel values.
Optionally, the performing audio limiting on the sequence of decibel values includes:
processing each decibel value in the sequence of decibel values by a limiting formula in which ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the above sequence, and y3 represents the target decibel value obtained by limiting.
Optionally, the performing inverse amplitude normalization on the target decibel values after the audio limiting to obtain the processed amplitude parameters uses an inverse normalization formula in which scale represents the normalization coefficient, y3 represents the target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
Optionally, the performing filtering according to the processed amplitude parameters to obtain the spectral parameters includes:
extracting spectral envelope parameters from the processed amplitude parameters, and performing a filtering operation on the extracted spectral envelope parameters;
extracting Mel cepstrum or line spectrum pair parameters from the filtered spectral envelope parameters;
and taking the extracted Mel cepstrum or line spectrum pair parameters as the spectral parameters.
Optionally, after synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, the method further includes:
warping each point of the audio sequence of the synthesized speech using a warping formula in which y_tts represents a preset audio sequence, Y_nat represents the audio sequence of the synthesized speech, y'_nat represents each point in the warped audio sequence, and y_nat represents each point in the audio sequence before warping.
An embodiment of the present application further provides a speech synthesis apparatus, comprising:
an extraction unit, configured to extract fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
a spectrum acquisition unit, configured to perform audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
and a synthesis unit, configured to synthesize speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized when synthesizing speech.
Optionally, the spectrum acquisition unit includes:
a conversion subunit, configured to normalize the amplitude parameters and convert the normalized parameters into a sequence of decibel values;
a limiting subunit, configured to perform audio limiting on the sequence of decibel values;
the conversion subunit being further configured to perform inverse amplitude normalization on the audio amplitudes after the audio limiting to obtain processed amplitude parameters;
and a filtering subunit, configured to perform filtering according to the processed amplitude parameters to obtain the spectral parameters.
Optionally, the conversion subunit is configured to normalize the amplitude parameters according to a normalization formula to obtain the normalized parameter y1, where scale represents the normalization coefficient, y represents the amplitude parameter, and n represents the number of quantization bits of the fixed-component text audio.
Optionally, the conversion subunit is further configured to:
convert each point x1 in the normalized parameters into a corresponding decibel value y2 according to the following formula:
y2 = 20 * log10(abs(x1))
All of the y2 values form the sequence of decibel values.
Optionally, the limiting subunit is configured to process each decibel value in the sequence of decibel values by a limiting formula in which ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the sequence, and y3 represents the target decibel value obtained by limiting.
Optionally, the conversion subunit is further configured to obtain the processed amplitude parameters by an inverse normalization formula in which scale represents the normalization coefficient, y3 represents the target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
Optionally, the filtering subunit is configured to:
extract spectral envelope parameters from the processed amplitude parameters, and perform a filtering operation on the extracted spectral envelope parameters;
extract Mel cepstrum or line spectrum pair parameters from the filtered spectral envelope parameters;
and take the extracted Mel cepstrum or line spectrum pair parameters as the spectral parameters.
Optionally, the apparatus further comprises:
a warping unit, configured to, after speech is synthesized based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, warp each point of the audio sequence of the synthesized speech using a warping formula in which y_tts represents a preset audio sequence, Y_nat represents the audio sequence of the synthesized speech, y'_nat represents each point in the warped audio sequence, and y_nat represents each point in the audio sequence before warping.
Embodiments of the present application also provide an electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
One or more of the technical solutions in the embodiments of the present application have at least the following technical effects:
The method extracts the fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text, then performs audio limiting and filtering on the extracted amplitude parameters to obtain the spectral parameters of the fixed-component text audio. The limiting and filtering give the audio a more balanced amplitude and a more uniform character, so that the resulting spectral parameters match the timbre of purely parametric synthesized speech (the non-fixed-component text). When speech is then synthesized based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, the timbre of the fixed-component text is consistent with that of the non-fixed-component text, solving the prior-art problem of inconsistent timbre in parametric speech synthesis. Moreover, because the fundamental frequency parameters of the recording are used during synthesis, the prosody of the synthesized speech matches that of natural speech and is more expressive, achieving the beneficial effects of ensuring overall timbre consistency of the synthesized speech and improving its expressiveness.
Drawings
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 3 is a schematic diagram of an electronic device for implementing a speech synthesis method according to an embodiment of the present application.
Detailed Description
In the technical solution provided by the embodiments of the present application, the parameters of the fixed-component text are adjusted: the spectral parameters are obtained by applying audio limiting and filtering to the amplitude parameters extracted from the recording, so that they are consistent with the spectral parameters of the non-fixed-component text, which solves the prior-art technical problem of inconsistent timbre in parametric speech synthesis.
The main implementation principles, specific implementations, and corresponding beneficial effects of the technical solutions of the embodiments of the present application are explained in detail below with reference to the accompanying drawings.
Examples
Referring to fig. 1, an embodiment of the present application provides a speech synthesis method, including:
S101: extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
S102: performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
S103: when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
In a specific implementation, a template library may be established before speech is synthesized, storing the fundamental frequency parameters and spectral parameters of fixed-component texts. When the template library is established, commonly used texts are recorded and the fixed-component texts are extracted from them; for example, the fixed-component text "the flight is about to take off" may be extracted from a recording of the form "the flight bound for <destination> is about to take off". For the recording of the fixed-component text, S101 is executed to extract the fundamental frequency parameters and amplitude parameters of the fixed-component text audio from it.
To make the timbre of the fixed-component text consistent with that of the non-fixed-component text, this embodiment processes the extracted amplitude parameters: S102 is executed to perform audio limiting and filtering according to the extracted amplitude parameters, obtaining the spectral parameters of the fixed-component text audio. Optionally, this may be done through the following steps:
step 1: and carrying out normalization processing on the amplitude parameters, and converting the normalized parameters into decibel value sequences. Optionally, normalization processing may be performed on the extracted amplitude parameters according to the following formula one and formula two pairs, and the normalized parameter y is obtained 1
Scale represents a normalization coefficient, y represents an amplitude parameter (namely, an amplitude sequence of the fixed-component text audio, which includes a plurality of amplitude values), n represents the quantization bit number of the fixed-component text audio, abs represents an absolute value, and max represents a maximum value. Since the amplitude parameter is a sequence of multiple amplitude values, each amplitude value corresponds to a point after normalization, and for this purpose, y obtained by normalization 1 Comprising a plurality of dots. By normalizing the amplitude parameters, the effect consistency of subsequent amplitude operation on different audios is ensured.
Further, y can be performed on the normalized parameters according to the formula three 1 Convert y into 1 Each point x in (1) 1 Into a corresponding decibel value y 2
y 2 =20*log 10 (abs(x 1 ) Equation three)
From all y converted to 2 A sequence of decibel values of the extracted amplitude parameter is constructed.
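As an illustration, the following Python sketch implements step 1 under stated assumptions: formulas one and two appear as images in the original publication and are not reproduced in this text, so the sketch assumes a conventional peak normalization in which the coefficient scale maps the largest absolute sample to full scale; formula three is taken verbatim from the text. The function name to_decibel_sequence and the guard against log10(0) are illustrative additions, not part of the patent.

import numpy as np

def to_decibel_sequence(y, n_bits=16):
    # Step 1 sketch: normalize an amplitude sequence, then convert to decibels.
    # The peak normalization below is an assumed reading of formulas one and two.
    y = np.asarray(y, dtype=np.float64)
    full_scale = 2.0 ** (n_bits - 1)            # e.g. 32768 for 16-bit audio
    scale = full_scale / np.max(np.abs(y))      # assumed normalization coefficient
    y1 = y * scale / full_scale                 # normalized points x1 in [-1, 1]
    eps = np.finfo(np.float64).tiny             # avoid log10(0) on silent samples
    y2 = 20.0 * np.log10(np.maximum(np.abs(y1), eps))  # formula three
    return y1, y2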
Step 2: perform audio limiting on the converted sequence of decibel values. The audio limiting may be performed by an audio limiter, and this embodiment does not restrict the type of limiter. Taking formula four as an example, each decibel value in the converted sequence is processed, where ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the above sequence, and y3 represents the target decibel value obtained by limiting. In practice, the limiting ratio and boundary value may be adjusted by the designer; empirically, ratio is typically set to 0.7 and border to -10. Because the audio of the non-fixed-component text is predicted by a statistical model, and a statistical model has an averaging effect, its volume is relatively balanced. Audio limiting makes the quieter sounds in the recording louder and the louder sounds quieter, so the differences within the fixed-component text audio become smaller and its volume more balanced, achieving amplitude behavior consistent with the non-fixed-component text audio.
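A minimal sketch of step 2 follows. Formula four is likewise not reproduced in this text, so the sketch assumes one standard limiter form: decibel values above the boundary border are scaled toward it by ratio, while values below it pass through unchanged; limit_decibels is an illustrative name.

import numpy as np

def limit_decibels(y2, ratio=0.7, border=-10.0):
    # Step 2 sketch: assumed limiter over a sequence of decibel values.
    # ratio=0.7 and border=-10 are the empirical settings named in the text.
    y2 = np.asarray(y2, dtype=np.float64)
    y3 = np.where(y2 > border,
                  border + ratio * (y2 - border),  # compress the excess over the boundary
                  y2)                              # leave quieter values untouched
    return y3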
Step 3: perform inverse amplitude normalization on the target decibel values after audio limiting to obtain the processed amplitude parameters. Specifically, the processed amplitude parameters may be obtained according to formula five, where scale represents the normalization coefficient, y3 represents a target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
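The sketch below inverts the assumed steps above in place of formula five, which is also not reproduced. Because the decibel conversion discards sign via abs(), the sketch has to reintroduce it from the original samples; the sign argument and the function name from_decibels are illustrative.

import numpy as np

def from_decibels(y3, sign, scale, n_bits=16):
    # Step 3 sketch: inverse amplitude normalization of limited decibel values.
    full_scale = 2.0 ** (n_bits - 1)
    magnitude = 10.0 ** (np.asarray(y3, dtype=np.float64) / 20.0)  # undo formula three
    y4 = sign * magnitude * full_scale / scale   # undo the assumed normalization
    return y4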
Step 4: perform filtering according to the processed amplitude parameters to obtain the spectral parameters of the fixed-component text audio. Specifically, spectral envelope parameters may be extracted from the processed amplitude parameters and a filtering operation performed on the extracted spectral envelope parameters; Mel cepstrum or line spectrum pair parameters are then extracted from the filtered spectral envelope parameters and taken as the spectral parameters. Assuming the spectral envelope is M-dimensional with T frames in total, the filtering operation is performed on the extracted M × T matrix. This embodiment selects two-dimensional median filtering: a window of w1 × w2 is chosen (w1 and w2 may be adjusted to the actual situation; for example, w1 may be set to 81 and w2 to 5), and in the M × T matrix each point is replaced by the median of the values in its neighboring w1 × w2 window. The filtering operation may also use two-dimensional mean filtering or other filters, which are not detailed here.
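The median-filtering step maps directly onto a standard library call; the sketch below uses scipy.signal.medfilt2d with the example window sizes from the text. The zero-padded borders are a medfilt2d default the patent does not specify, and the function name smooth_spectral_envelope is illustrative.

import numpy as np
from scipy.signal import medfilt2d

def smooth_spectral_envelope(envelope, w1=81, w2=5):
    # Step 4 sketch: replace each point of the M x T spectral-envelope matrix
    # with the median of its w1 x w2 neighborhood (window sizes must be odd).
    envelope = np.asarray(envelope, dtype=np.float64)  # shape (M, T)
    return medfilt2d(envelope, kernel_size=[w1, w2])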
The fundamental frequency parameters of the fixed-component text audio extracted from the recording, together with the spectral parameters obtained by the above processing, may be stored in the template library, so that S103 can synthesize speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized. Specifically, when synthesizing speech, the text to be synthesized may be analyzed to separate the fixed-component text from the non-fixed-component text; the fundamental frequency parameters and spectral parameters corresponding to the fixed-component text are fetched from the template library, and the speech is synthesized by a vocoder together with the fundamental frequency parameters and spectral parameters of the non-fixed-component text.
In a specific implementation, the speech corresponding to the fixed-component text and the speech of the non-fixed-component text in the synthesized result may still be inconsistent in energy, producing abrupt changes in volume. For this reason, in an optional implementation of this embodiment, a regularization (volume warping) process may be performed after the speech is synthesized: each point of the audio sequence of the synthesized speech is warped using formula six, where y_tts represents a preset audio sequence, which may be speech synthesized by a conventional speech synthesis system (for example, the fundamental frequency and spectrum of both the fixed-component and non-fixed-component text are predicted by the statistical model and the audio is synthesized by the vocoder); Y_nat represents the audio sequence synthesized by the method of this embodiment, i.e., the audio sequence before warping, in which the fundamental frequency and spectrum of the fixed-component text use the parameters obtained after limiting and filtering and the audio is then synthesized by the vocoder; y'_nat represents each point in the warped audio sequence; and y_nat represents each point in the audio sequence before warping. Regularizing the synthesized speech ensures consistency of the overall speech volume.
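Formula six is not reproduced either; given the symbol definitions, one natural reading is a single global gain that matches the peak of the synthesized sequence to that of the reference sequence, which the sketch below implements. This peak-matching form is an assumption, and warp_volume is an illustrative name.

import numpy as np

def warp_volume(y_nat, y_tts):
    # Volume-regularization sketch: scale every point of the synthesized
    # sequence Y_nat by one gain so its peak matches the reference y_tts.
    y_nat = np.asarray(y_nat, dtype=np.float64)
    y_tts = np.asarray(y_tts, dtype=np.float64)
    gain = np.max(np.abs(y_tts)) / np.max(np.abs(y_nat))
    return y_nat * gain   # y'_nat: the warped audio sequence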
In the above technical solution, audio limiting and spectral filtering make the spectrum of natural speech (i.e., the recording) consistent with that of parametrically synthesized speech, so that their timbres match. At the same time, using the fundamental frequency parameters of the fixed-component text recording ensures that the prosody of the fixed-component text is the same as that of natural speech, which improves the expressiveness of the synthesized speech. Finally, regularizing the volume of the synthesized speech ensures that its volume is consistent throughout.
Referring to fig. 2, based on the speech synthesis method provided in the foregoing embodiment, an embodiment of the present application further provides a speech synthesis apparatus, including:
an extraction unit 21, configured to extract fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
a spectrum acquisition unit 22, configured to perform audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
and a synthesis unit 23, configured to synthesize speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized when synthesizing speech.
In a specific implementation, the spectrum acquisition unit 22 may include a conversion subunit, a limiting subunit, and a filtering subunit. The conversion subunit is configured to normalize the amplitude parameters and convert the normalized parameters into a sequence of decibel values; the limiting subunit is configured to perform audio limiting on the sequence of decibel values; the conversion subunit is further configured to perform inverse amplitude normalization on the audio amplitudes after the audio limiting to obtain processed amplitude parameters; and the filtering subunit is configured to perform filtering according to the processed amplitude parameters to obtain the spectral parameters.
The conversion subunit may normalize the amplitude parameters according to a normalization formula to obtain the normalized parameter y1, where scale represents the normalization coefficient, y represents the amplitude parameter, and n represents the number of quantization bits of the fixed-component text audio.
After obtaining the normalized parameters, the conversion subunit may further convert each point x1 in the normalized parameters into a corresponding decibel value y2 according to the following formula:
y2 = 20 * log10(abs(x1))
All of the y2 values form the sequence of decibel values.
The limiting subunit may be configured to process each decibel value in the sequence of decibel values by a limiting formula in which ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the sequence, and y3 represents the target decibel value obtained by limiting.
Further, the conversion subunit may be configured to obtain the processed amplitude parameters by an inverse normalization formula in which scale represents the normalization coefficient, y3 represents the target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
When obtaining the spectral parameters, the filtering subunit may extract spectral envelope parameters from the processed amplitude parameters and perform a filtering operation on the extracted spectral envelope parameters; extract Mel cepstrum or line spectrum pair parameters from the filtered spectral envelope parameters; and take the extracted Mel cepstrum or line spectrum pair parameters as the spectral parameters.
In a specific implementation, the speech synthesis apparatus provided by this embodiment may further include:
a warping unit 24, configured to, after speech is synthesized based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, warp each point of the audio sequence of the synthesized speech using a warping formula in which y_tts represents a preset audio sequence, Y_nat represents the audio sequence of the synthesized speech, y'_nat represents each point in the warped audio sequence, and y_nat represents each point in the audio sequence before warping.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
FIG. 3 is a block diagram illustrating an electronic device 800 for implementing a method of speech synthesis according to an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; it may also detect a change in the position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in its temperature. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Also provided is a non-transitory computer-readable storage medium having instructions that, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a speech synthesis method comprising: extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text; performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio; and, when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method of speech synthesis, the method comprising:
extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
2. The method of claim 1, wherein the performing audio limiting and filtering according to the amplitude parameters to obtain the spectral parameters of the fixed-component text audio comprises:
normalizing the amplitude parameters, and converting the normalized parameters into a sequence of decibel values;
performing audio limiting on the sequence of decibel values;
performing inverse amplitude normalization on the audio amplitudes after the audio limiting to obtain processed amplitude parameters;
and performing filtering according to the processed amplitude parameters to obtain the spectral parameters.
3. The method of claim 2, wherein the normalizing of the amplitude parameters comprises:
normalizing the amplitude parameters according to a normalization formula to obtain the normalized parameter y1, wherein scale represents the normalization coefficient, y represents the amplitude parameter, and n represents the number of quantization bits of the fixed-component text audio.
4. The method of claim 2, wherein the converting of the normalized parameters into a sequence of decibel values comprises:
converting each point x1 in the normalized parameters into a corresponding decibel value y2 according to the following formula:
y2 = 20 * log10(abs(x1))
wherein all of the y2 values form the sequence of decibel values.
5. The method of claim 2, wherein the performing audio limiting on the sequence of decibel values comprises:
processing each decibel value in the sequence of decibel values by a limiting formula, wherein ratio represents the limiting ratio, 0 < ratio < 1, border represents the boundary value of the limiting, y2 represents a decibel value in the sequence, and y3 represents the target decibel value obtained by limiting.
6. The method of claim 2, wherein the performing inverse amplitude normalization on the target decibel values after the audio limiting to obtain the processed amplitude parameters uses an inverse normalization formula wherein scale represents the normalization coefficient, y3 represents the target decibel value obtained by limiting, y4 represents one of the amplitude parameters obtained by inverse normalization, and n represents the number of quantization bits of the fixed-component text audio.
7. The method of claim 2, wherein the performing filtering according to the processed amplitude parameters to obtain the spectral parameters comprises:
extracting spectral envelope parameters from the processed amplitude parameters, and performing a filtering operation on the extracted spectral envelope parameters;
extracting Mel cepstrum or line spectrum pair parameters from the filtered spectral envelope parameters;
and taking the extracted Mel cepstrum or line spectrum pair parameters as the spectral parameters.
8. The method according to any one of claims 1 to 7, wherein after synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized, the method further comprises:
warping each point of the audio sequence of the synthesized speech using a warping formula, wherein y_tts represents a preset audio sequence, Y_nat represents the audio sequence of the synthesized speech, y'_nat represents each point in the warped audio sequence, and y_nat represents each point in the audio sequence before warping.
9. An apparatus for speech synthesis, comprising:
an extraction unit, configured to extract fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
a spectrum acquisition unit, configured to perform audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
and a synthesis unit, configured to synthesize speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized when synthesizing speech.
10. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
extracting fundamental frequency parameters and amplitude parameters of fixed-component text audio from a recording of the fixed-component text;
performing audio limiting and filtering according to the amplitude parameters to obtain spectral parameters of the fixed-component text audio;
when synthesizing speech, synthesizing the speech based on the fundamental frequency parameters and spectral parameters of the fixed-component text in the speech to be synthesized.
CN201610849422.5A 2016-09-23 2016-09-23 Voice synthesis method and device and electronic equipment Active CN107871494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610849422.5A CN107871494B (en) 2016-09-23 2016-09-23 Voice synthesis method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107871494A (en) 2018-04-03
CN107871494B CN107871494B (en) 2020-12-11

Family

ID=61751192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610849422.5A Active CN107871494B (en) 2016-09-23 2016-09-23 Voice synthesis method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107871494B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH037999A (en) * 1989-06-05 1991-01-16 Matsushita Electric Works Ltd Voice output device
WO2000060575A1 (en) * 1999-04-05 2000-10-12 Hughes Electronics Corporation A voicing measure as an estimate of signal periodicity for a frequency domain interpolative speech codec system
CN1835075A (en) * 2006-04-07 2006-09-20 安徽中科大讯飞信息科技有限公司 Speech synthetizing method combined natural sample selection and acaustic parameter to build mould
CN1945691A (en) * 2006-10-16 2007-04-11 安徽中科大讯飞信息科技有限公司 Method for improving template sentence synthetic effect in voice synthetic system
US20080243511A1 (en) * 2006-10-24 2008-10-02 Yusuke Fujita Speech synthesizer
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN103247295A (en) * 2008-05-29 2013-08-14 高通股份有限公司 Systems, methods, apparatus, and computer program products for spectral contrast enhancement
CN201422103Y (en) * 2009-06-18 2010-03-10 安徽汇鑫电子有限公司 Audio frequency processor
US20120053933A1 (en) * 2010-08-30 2012-03-01 Kabushiki Kaisha Toshiba Speech synthesizer, speech synthesis method and computer program product
CN103065619A (en) * 2012-12-26 2013-04-24 安徽科大讯飞信息科技股份有限公司 Speech synthesis method and speech synthesis system
CN104485099A (en) * 2014-12-26 2015-04-01 中国科学技术大学 Method for improving naturalness of synthetic speech

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109584859A (en) * 2018-11-07 2019-04-05 上海指旺信息科技有限公司 Phoneme synthesizing method and device
CN110020616A (en) * 2019-03-26 2019-07-16 深兰科技(上海)有限公司 A kind of target identification method and equipment
WO2021051765A1 (en) * 2019-09-17 2021-03-25 北京京东尚科信息技术有限公司 Speech synthesis method and apparatus, and storage medium
CN110930977A (en) * 2019-11-12 2020-03-27 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN111328008A (en) * 2020-02-24 2020-06-23 广州市迪士普音响科技有限公司 Sound pressure level intelligent control method based on sound amplification system
CN111328008B (en) * 2020-02-24 2021-11-05 广州市迪士普音响科技有限公司 Sound pressure level intelligent control method based on sound amplification system
CN111883103A (en) * 2020-06-19 2020-11-03 马上消费金融股份有限公司 Method and device for synthesizing voice
CN111883103B (en) * 2020-06-19 2021-12-24 马上消费金融股份有限公司 Method and device for synthesizing voice
CN113744716A (en) * 2021-10-19 2021-12-03 北京房江湖科技有限公司 Method and apparatus for synthesizing speech
CN113744716B (en) * 2021-10-19 2023-08-29 北京房江湖科技有限公司 Method and apparatus for synthesizing speech

Also Published As

Publication number Publication date
CN107871494B (en) 2020-12-11

Similar Documents

Publication Publication Date Title
CN107871494B (en) Voice synthesis method and device and electronic equipment
CN107705783B (en) Voice synthesis method and device
US11430427B2 (en) Method and electronic device for separating mixed sound signal
CN110136692B (en) Speech synthesis method, apparatus, device and storage medium
CN108198569B (en) Audio processing method, device and equipment and readable storage medium
CN110097890B (en) Voice processing method and device for voice processing
CN111583944A (en) Sound changing method and device
CN109887515B (en) Audio processing method and device, electronic equipment and storage medium
CN110890083B (en) Audio data processing method and device, electronic equipment and storage medium
CN109410973B (en) Sound changing processing method, device and computer readable storage medium
CN111326138A (en) Voice generation method and device
EP3340077B1 (en) Method and apparatus for inputting expression information
CN110677734B (en) Video synthesis method and device, electronic equipment and storage medium
CN110931028B (en) Voice processing method and device and electronic equipment
CN107437412B (en) Acoustic model processing method, voice synthesis method, device and related equipment
CN110415702A (en) Training method and device, conversion method and device
CN115273831A (en) Voice conversion model training method, voice conversion method and device
EP4050601B1 (en) Method and apparatus for audio processing, terminal and storage medium
CN113301372A (en) Live broadcast method, device, terminal and storage medium
CN110148424B (en) Voice processing method and device, electronic equipment and storage medium
CN111373409A (en) Method and terminal for acquiring color value change
CN113409765A (en) Voice synthesis method and device for voice synthesis
CN110580910B (en) Audio processing method, device, equipment and readable storage medium
CN111063365B (en) Voice processing method and device and electronic equipment
CN113113036B (en) Audio signal processing method and device, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant