WO2018159402A1 - Speech synthesis system, speech synthesis program, and speech synthesis method - Google Patents

Speech synthesis system, speech synthesis program, and speech synthesis method

Info

Publication number
WO2018159402A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
speech synthesis
periodic component
speech
periodic
Prior art date
Application number
PCT/JP2018/006165
Other languages
French (fr)
Japanese (ja)
Inventor
橘 健太郎
芳則 志賀
Original Assignee
国立研究開発法人情報通信研究機構 (National Institute of Information and Communications Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立研究開発法人情報通信研究機構 (National Institute of Information and Communications Technology, NICT)
Publication of WO2018159402A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/06: Elementary speech units used in speech synthesisers; Concatenation rules
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation

Definitions

  • The present invention relates to a speech synthesis technique based on statistical parametric speech synthesis (hereinafter also abbreviated as "SPSS").
  • SPSS: statistical parametric speech synthesis
  • HMM: hidden Markov model
  • In recent years, speech synthesis based on deep neural networks (hereinafter also abbreviated as "DNN"), a type of deep learning, has attracted attention (see, for example, Non-Patent Document 1). According to the results reported in Non-Patent Document 1, DNN-based speech synthesis can generate higher-quality speech than HMM-based speech synthesis.
  • DNN: deep neural network
  • In many SPSS systems, a vocoder is used as a source-filter model when generating speech. More specifically, the source-filter model consists of a vocal tract filter and an excitation source.
  • The vocal tract filter models the vocal tract and is expressed by spectral envelope parameters.
  • The source signal, which models the excitation source (vocal fold vibration), is expressed by mixing a pulse sequence and a noise component.
  • In a commonly used vocoder, each frame of the excitation source is classified as a voiced section or an unvoiced section. If a frame is determined to be voiced, a pulse sequence at the fundamental frequency (hereinafter also abbreviated as "F0") corresponding to the voice pitch is generated; if it is determined to be unvoiced, the excitation is generated as white noise.
  • F0: fundamental frequency
  • In a typical SPSS, this F0 sequence is expressed as a discontinuous sequence that switches between a one-dimensional continuous value and a zero-dimensional discrete symbol, and for each frame a flag for switching between voiced and unvoiced (hereinafter "V/UV"; the flag is also abbreviated as the "V/UV flag") is required.
  • Quality degradation of the synthesized speech can arise from V/UV determination errors in each frame and from the difficulty of modeling an excitation source that outputs such a discontinuous sequence.
  • MSD (multi-space distribution) modeling has been proposed as a method for modeling such a sequence (see, for example, Non-Patent Document 2).
  • However, MSD modeling inherently involves the difficulty of jointly representing continuous and discrete sequences.
  • For frames with V/UV prediction errors, vocoding often degrades the quality of the synthesized speech: a frame erroneously judged voiced produces a buzzy quality, and a frame erroneously judged unvoiced produces a hoarse quality.
  • The first approach is to interpolate the discontinuous F0 sequence so that it can be treated as a continuous sequence (see, for example, Non-Patent Document 3). It has been shown that F0 can then be modeled as a continuous sequence and that quality can be improved. However, this approach still requires V/UV determination at waveform-generation time, and a discrete sequence must still be modeled.
  • As another approach, V/UV can be determined from some continuous sequence. For example, a method has been proposed that determines V/UV based on an aperiodicity index instead of the V/UV flag (see, for example, Non-Patent Document 4). While this method achieves completely continuous modeling, V/UV determination is still required at waveform-generation time, so the influence of V/UV determination errors cannot be completely avoided.
  • As yet another approach, a method has been proposed that uses the Maximum Voiced Frequency (hereinafter also abbreviated as "MVF"), which indicates the upper frequency limit of the periodicity of a speech signal, instead of the V/UV flag (see, for example, Non-Patent Document 5). Using the MVF, speech can be divided with the high-frequency band as the aperiodic component and the low-frequency band as the periodic component.
  • MVF: Maximum Voiced Frequency
  • Since the MVF is modeled continuously, it can also be made to function as a V/UV flag by setting a threshold value.
  • However, because the signal is divided into only two bands, high and low frequency, the modeling accuracy of the periodic/aperiodic components is not sufficient.
  • The present technology is intended to solve such problems, and an object of the present technology is to provide a new method that can reduce the influence on quality caused by V/UV determination errors in the acoustic model in SPSS.
  • According to one aspect, a speech synthesis system includes: a first extraction unit that extracts, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; a second extraction unit that extracts, for each unit section, a periodic component and an aperiodic component from the speech waveform; a third extraction unit that extracts the spectral envelopes of the extracted periodic and aperiodic components; a generation unit that generates a context label based on context information of the known text; and a learning unit that constructs a statistical model by learning an acoustic feature amount, including the fundamental frequency and the spectral envelopes of the periodic and aperiodic components, in association with the corresponding context label.
  • The speech synthesis system further includes a determination unit that, in response to input of an arbitrary text, determines a context label based on context information of the text, and an estimation unit that estimates, from the statistical model, the acoustic feature amount corresponding to the context label determined by the determination unit.
  • The estimated acoustic feature amount includes a fundamental frequency, a spectral envelope of the periodic component, and a spectral envelope of the aperiodic component.
  • The speech synthesis system further includes a first reconstruction unit that reconstructs the periodic component by filtering a pulse sequence, generated according to the fundamental frequency included in the estimated acoustic feature amount, with the spectral envelope of the periodic component, a second reconstruction unit that reconstructs the aperiodic component by filtering a noise sequence with the spectral envelope of the aperiodic component, and an adder unit that adds the reconstructed periodic and aperiodic components and outputs the result as a speech waveform corresponding to the input arbitrary text.
  • The second extraction unit extracts only the aperiodic component from unit sections in which the first extraction unit cannot extract a fundamental frequency, and extracts both the periodic component and the aperiodic component from the other unit sections.
  • For unit sections in which a fundamental frequency cannot be extracted, the first extraction unit determines a fundamental frequency by interpolation.
  • The pulse sequence is generated from the interpolated fundamental frequency sequence, and the noise sequence is a sequence in which noise is generated over the entire interval.
  • According to another aspect, a speech synthesis program for realizing a speech synthesis method based on SPSS is provided. The speech synthesis program causes a computer to execute the steps of: extracting, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform; extracting the spectral envelopes of the extracted periodic and aperiodic components; generating a context label based on context information of the known text; and constructing a statistical model by learning an acoustic feature amount, including the fundamental frequency and the spectral envelopes of the periodic and aperiodic components, in association with the corresponding context label.
  • According to a further aspect, a speech synthesis method includes the steps of: extracting, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform; extracting the spectral envelopes of the extracted periodic and aperiodic components; generating a context label based on context information of the known text; and constructing a statistical model by learning an acoustic feature amount, including the fundamental frequency and the spectral envelopes of the periodic and aperiodic components, in association with the corresponding context label.
  • FIG. 1 is a schematic diagram showing an outline of a multilingual translation system 1 using a speech synthesis system according to the present embodiment.
  • multilingual translation system 1 includes a service providing device 10.
  • The service providing apparatus 10 performs speech recognition, multilingual translation, and the like on the input speech (words uttered in a first language) received from the mobile terminal 30 connected via the network 2, synthesizes the corresponding words in a second language, and outputs the result to the mobile terminal 30 as output speech.
  • For example, when the user 4 utters the English phrase "Where is the station?" to the mobile terminal 30, the mobile terminal 30 generates input speech with a microphone or the like from the uttered words and transmits the input speech to the service providing apparatus 10. The service providing apparatus 10 synthesizes output speech for the Japanese phrase corresponding to "Where is the station?". When the mobile terminal 30 receives the output speech from the service providing apparatus 10, it plays back the received output speech. As a result, the conversation partner of user 4 can hear the phrase "Where is the station?" in Japanese.
  • The conversation partner of the user 4 may also have a similar mobile terminal 30.
  • When the conversation partner utters an answer meaning "go straight and turn left" in Japanese toward that terminal, the same processing as described above is executed, and the corresponding English phrase "Go straight and turn left" is output from the conversation partner's mobile terminal to the user 4.
  • In this way, translation can be performed freely between speech in the first language and speech in the second language.
  • Automatic mutual translation may also be enabled among an arbitrary number of languages, not only between two.
  • the speech synthesis system according to the present embodiment included in the service providing apparatus 10 employs one SPSS technique, as will be described later.
  • the service providing apparatus 10 includes an analysis unit 12, a learning unit 14, a DNN 16, and a speech synthesis unit 18 as components related to the speech synthesis system.
  • the service providing apparatus 10 includes a speech recognition unit 20 and a translation unit 22 as components relating to automatic translation.
  • Service providing apparatus 10 further includes a communication processing unit 24 for performing communication processing with portable terminal 30.
  • the analysis unit 12 and the learning unit 14 are in charge of machine learning for constructing the DNN 16. Details of functions and processing of the analysis unit 12 and the learning unit 14 will be described later.
  • the DNN 16 stores a neural network as a result of machine learning by the analysis unit 12 and the learning unit 14.
  • In the present embodiment, a DNN is used as an example, but instead of a DNN, a recurrent neural network (hereinafter abbreviated as "RNN"), a long short-term memory (LSTM) RNN, or a convolutional neural network (CNN) may be used.
  • RNN: recurrent neural network
  • LSTM: long short-term memory
  • CNN: convolutional neural network
  • the voice recognition unit 20 outputs voice recognition text by executing voice recognition processing on the input voice from the mobile terminal 30 received via the communication processing unit 24.
  • The translation unit 22 generates text in a specified language (also referred to as "translated text" for convenience of explanation) from the recognized text output by the speech recognition unit 20.
  • Any known method can be employed for these speech recognition and translation processes.
  • the speech synthesizer 18 performs speech synthesis on the translated text from the translator 22 with reference to the DNN 16, and transmits the output speech obtained as a result to the mobile terminal 30 via the communication processor 24.
  • In FIG. 1, the components in charge of machine learning for constructing the DNN 16 (mainly the analysis unit 12 and the learning unit 14) and the components in charge of multilingual translation using the constructed DNN 16 (mainly the speech recognition unit 20, the translation unit 22, and the speech synthesis unit 18) are mounted on the same service providing apparatus 10, but these functions may be mounted on different apparatuses.
  • For example, the DNN 16 may be constructed by machine learning on a first apparatus, while a second apparatus provides speech synthesis using the constructed DNN 16 and services using that speech synthesis.
  • an application executed on the mobile terminal 30 may be in charge of at least some functions of the speech recognition unit 20 and the translation unit 22.
  • an application executed on the mobile terminal 30 may be responsible for the functions of the components (DNN 16 and speech synthesizer 18) responsible for speech synthesis.
  • the multilingual translation system 1 and a speech synthesis system that is a part of the multilingual translation system 1 can be realized by cooperation of the service providing apparatus 10 and the mobile terminal 30 in an arbitrary form.
  • The functions shared among the respective devices may be determined appropriately according to the situation, and are not limited to the arrangement of the multilingual translation system 1 shown in FIG. 1.
  • FIG. 2 is a schematic diagram showing a hardware configuration example of the service providing apparatus 10 according to the present embodiment.
  • the service providing apparatus 10 is typically realized using a general-purpose computer.
  • The service providing apparatus 10 includes, as main hardware components, a processor 100, a main memory 102, a display 104, an input device 106, a network interface (I/F) 108, an optical drive 134, and a secondary storage device 112. These components are connected to one another via an internal bus 110.
  • The processor 100 is an arithmetic entity that executes the processing required to realize the service providing apparatus 10 according to the present embodiment by executing various programs, as described later.
  • The processor 100 is composed of one or more CPUs (central processing units) and/or GPUs (graphics processing units).
  • a CPU or GPU having a plurality of cores may be used.
  • The main memory 102 is a storage area that temporarily stores program code, working data, and the like when the processor 100 executes programs, and is composed of a volatile memory device such as a DRAM (dynamic random access memory) or an SRAM (static random access memory).
  • the display 104 is a display unit that outputs a user interface related to processing, processing results, and the like, and includes, for example, an LCD (liquid crystal display) or an organic EL (electroluminescence) display.
  • The input device 106 is a device that accepts instructions and operations from the user, and includes, for example, a keyboard, a mouse, a touch panel, or a pen. The input device 106 may further include a microphone for collecting the speech necessary for machine learning, or an interface for connecting to a sound collection device that collects such speech.
  • the network interface 108 exchanges data with the mobile terminal 30 or any information processing apparatus on the Internet or an intranet.
  • For the network interface 108, an arbitrary communication method such as Ethernet (registered trademark), wireless LAN (local area network), or Bluetooth (registered trademark) can be adopted.
  • the optical drive 134 reads information stored in an optical disk 136 such as a CD-ROM (compact disc read only memory) or DVD (digital versatile disc) and outputs the information to other components via the internal bus 110.
  • The optical disk 136 is an example of a non-transitory recording medium, and is distributed with an arbitrary program stored in it in a nonvolatile manner.
  • When the optical drive 134 reads the program from the optical disk 136 and installs it in the secondary storage device 112 or the like, the general-purpose computer functions as the service providing apparatus 10 (or a speech synthesis apparatus). Therefore, the subject matter of the present invention can also be the program itself installed in the secondary storage device 112 or the like, or a recording medium such as the optical disk 136 that stores a program for realizing the functions and processes according to the present embodiment.
  • Although FIG. 2 shows an optical recording medium such as the optical disk 136 as an example of a non-transitory recording medium, the medium is not limited to this; a semiconductor recording medium such as a flash memory, a magnetic recording medium such as a hard disk or storage tape, or a magneto-optical recording medium such as an MO (magneto-optical disk) may also be used.
  • The secondary storage device 112 is a component that stores programs executed by the processor 100, input data to be processed by those programs (including input speech and text for learning, and input speech from the mobile terminal 30), and output data generated by executing the programs (including output speech transmitted to the mobile terminal 30), and is composed of, for example, a nonvolatile storage device such as a hard disk or an SSD (solid state drive).
  • The secondary storage device 112 typically stores an OS (operating system) (not shown), an analysis program 121 for realizing the analysis unit 12, a learning program 141 for realizing the learning unit 14, a speech recognition program 201 for realizing the speech recognition unit 20, a translation program 221 for realizing the translation unit 22, and a speech synthesis program 181 for realizing the speech synthesis unit 18.
  • Some of the libraries and functional modules required when these programs are executed by the processor 100 may be replaced by libraries or functional modules provided as standard by the OS.
  • In that case, each program alone does not include all the program modules necessary for realizing the corresponding function, but the necessary function can be realized by installing the program in the OS execution environment. Even such a program, which does not include some of the libraries or functional modules, can be included in the technical scope of the present invention.
  • These programs may be distributed not only by being stored in any of the recording media described above, but also by being downloaded from a server apparatus or the like via the Internet or an intranet.
  • the secondary storage device 112 may store the input speech 130 for machine learning and the corresponding text 132 for constructing the DNN 16 in addition to the DNN 16.
  • FIG. 2 shows an example in which the service providing apparatus 10 is configured as a single computer, but the present invention is not limited to this; the multilingual translation system may be realized by a plurality of computers connected via a network cooperating explicitly or implicitly.
  • All or part of the functions realized by the computer (processor 100) executing the programs may instead be realized using hard-wired circuits such as an ASIC (application specific integrated circuit) or an FPGA (field-programmable gate array).
  • a speech synthesis system according to SPSS is provided.
  • In the present embodiment, a method is adopted in which V/UV determination is made unnecessary by decomposing the source signal representing the excitation source into a periodic component and an aperiodic component. Learning is performed by giving the DNN speech parameters that represent the periodic and aperiodic components of the source signal.
  • FIG. 3 is a schematic diagram for explaining the outline of the speech synthesis process according to the related art.
  • the speech synthesis process according to the related art includes a pulse generation unit 250, a white noise generation unit 252, a switching unit 254, and a speech synthesis filter 256.
  • the pulse generation unit 250, the white noise generation unit 252, and the switching unit 254 correspond to a part modeling the excitation source, and a source signal from the excitation source is output from the pulse generation unit 250.
  • the pulse generator 250 is given a parameter of F 0 indicating the pitch of the voice, and outputs a pulse sequence at intervals of the reciprocal of F 0 (basic period / pitch period). Although not shown, the pulse generator 250 may be provided with an amplitude parameter indicating the loudness of the voice.
  • the speech synthesis filter 256 is a part that determines the timbre of the speech, and is given a parameter indicating a spectrum envelope.
  • In the related art, the input speech waveform is divided into unit sections (for example, frames), and each unit section is determined to be either a voiced section or an unvoiced section. For a voiced section a pulse sequence is output as the source signal, and for an unvoiced section a noise sequence is output as the source signal. The parameter identifying voiced and unvoiced sections is the V/UV flag; a sketch of this switched excitation follows.
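  • For comparison, a minimal numpy sketch of this conventional V/UV-switched excitation (the frame length, sampling rate, and array layout are illustrative assumptions, not values from the patent):

```python
import numpy as np

def conventional_excitation(f0, vuv, frame_len=80, fs=16000):
    """Conventional source signal: pulse train for voiced frames, white noise otherwise.

    f0  : per-frame fundamental frequency in Hz (ignored for unvoiced frames)
    vuv : per-frame voiced/unvoiced flag (True = voiced)
    """
    source = np.zeros(len(f0) * frame_len)
    phase = 0.0
    for i, (f, voiced) in enumerate(zip(f0, vuv)):
        start = i * frame_len
        if voiced and f > 0:
            for n in range(frame_len):
                phase += f / fs                 # advance normalized phase
                if phase >= 1.0:                # emit one pulse per fundamental period
                    phase -= 1.0
                    source[start + n] = 1.0
        else:
            phase = 0.0                         # excitation switches to white noise
            source[start:start + frame_len] = np.random.randn(frame_len)
    return source
```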
  • FIG. 4 is a schematic diagram for explaining the outline of the speech synthesis process according to the present embodiment.
  • The configuration shown in FIG. 4 includes a pulse generation unit 200, a speech synthesis filter (periodic component) 202, a Gaussian noise generation unit 204, a speech synthesis filter (aperiodic component) 206, and an adder unit 208.
  • In the present embodiment, instead of switching the source signal using the V/UV flag as shown in FIG. 3, a source signal is prepared for each of the periodic component and the aperiodic component. That is, the speech signal is decomposed into a periodic component and an aperiodic component.
  • The pulse generation unit 200 and the speech synthesis filter (periodic component) 202 are the parts that generate the periodic component. The pulse generation unit 200 generates a continuous pulse sequence according to the designated F0 (a continuous F0 sequence, as described later), and the speech synthesis filter (periodic component) 202 applies a filter corresponding to the spectral envelope of the periodic component to the continuous pulse sequence, thereby outputting the periodic component contained in the synthesized speech.
  • A continuous pulse sequence can be used because the silent and unvoiced sections of the periodic component are assumed to have inaudible power; that is, the entire sequence is treated as voiced. In other words, it is assumed that the spectral envelope of the periodic component has a sufficiently small amplitude in sections without periodicity, such as silence and unvoiced sounds. Under this assumption, even if a periodic component is generated from the F0 pulse sequence in such sections, it is considered small enough to be inaudible.
  • Because the pulse sequence is generated continuously, the influence of pulse-sequence discontinuities on the synthesized speech can be reduced; a sketch of continuous pulse generation follows.
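  • A minimal numpy sketch of generating such a continuous pulse sequence from a gap-free F0 contour by phase accumulation, so that no V/UV switching occurs and the pulse spacing changes smoothly (frame length and sampling rate are assumed values):

```python
import numpy as np

def continuous_pulse_train(cont_f0, frame_len=80, fs=16000):
    """Pulse train driven by a continuous F0 sequence (one F0 value per frame)."""
    f0_samples = np.repeat(cont_f0, frame_len)   # sample-level F0 contour
    phase = np.cumsum(f0_samples / fs)           # accumulated normalized phase
    # a pulse is emitted every time the accumulated phase crosses an integer
    pulses = np.zeros_like(f0_samples)
    pulses[np.flatnonzero(np.diff(np.floor(phase)) > 0) + 1] = 1.0
    return pulses
```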
  • the Gaussian noise generation unit 204 and the speech synthesis filter (non-periodic component) 206 are parts that generate aperiodic components.
  • the Gaussian noise generation unit 204 generates Gaussian noise as an example of a continuous noise sequence.
  • the speech synthesis filter (non-periodic component) 206 multiplies the noise sequence by a filter corresponding to the spectrum envelope corresponding to the non-periodic component, thereby outputting the aperiodic component included in the synthesized speech.
  • The periodic component output from the speech synthesis filter (periodic component) 202 and the aperiodic component output from the speech synthesis filter (aperiodic component) 206 are added by the adder unit 208, and the synthesized speech is output.
  • A noise sequence covering the entire interval can be used because the aperiodic component is assumed to consist of unvoiced sounds and silence, and the entire interval is therefore treated as unvoiced.
  • In this way, by using an acoustic model that does not need to distinguish voiced from unvoiced sections and performing learning based on that model, a speech synthesis method that requires no V/UV determination can be realized (a simplified sketch of the two-branch generation follows).
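  • A simplified numpy sketch of the two-branch generation in FIG. 4: the pulse sequence is shaped by the periodic-component envelope, Gaussian noise is shaped by the aperiodic-component envelope, and the two are added. The frame-wise frequency-domain filtering with overlap-add used here, and all parameter values, are assumptions for illustration rather than the specific filter implementation of the patent:

```python
import numpy as np

def apply_envelope(excitation, envelopes, frame_len=80, fft_len=512):
    """Shape an excitation signal frame by frame with per-frame spectral envelopes.

    envelopes: (num_frames, fft_len // 2 + 1) linear-amplitude spectral envelopes
    """
    out = np.zeros(len(excitation) + fft_len)
    win = np.hanning(fft_len)
    for i, env in enumerate(envelopes):
        start = i * frame_len
        frame = excitation[start:start + fft_len]
        if len(frame) < fft_len:
            frame = np.pad(frame, (0, fft_len - len(frame)))
        spec = np.fft.rfft(frame * win) * env       # impose the envelope on this frame
        out[start:start + fft_len] += np.fft.irfft(spec)
    return out[:len(excitation)]

def synthesize(pulses, periodic_env, aperiodic_env):
    periodic = apply_envelope(pulses, periodic_env)          # periodic branch (202)
    noise = np.random.randn(len(pulses))
    aperiodic = apply_envelope(noise, aperiodic_env)         # aperiodic branch (206)
    return periodic + aperiodic                              # adder unit (208)
```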
  • FIG. 5 is a block diagram for explaining processing of a main part in the speech synthesis system according to the present embodiment.
  • The speech synthesis system includes the analysis unit 12 and the learning unit 14 for constructing the DNN 16, and the speech synthesis unit 18 that outputs a speech waveform using the DNN 16.
  • the analysis unit 12 is a part in charge of speech analysis, and generates an acoustic feature quantity sequence from a speech waveform indicated by the input speech for learning.
  • the acoustic feature quantity for each frame includes F 0 and spectrum envelope (periodic component and non-periodic component).
  • the analysis unit 12 includes an F 0 extraction unit 120, a periodic / non-periodic component extraction unit 122, and a feature amount extraction unit 124.
  • the feature quantity extraction unit 124 includes an F 0 interpolation unit 126 and a spectrum envelope extraction unit 128.
  • The F0 extraction unit 120 extracts, for each frame (unit section), the F0 of the speech waveform corresponding to a known text. That is, the F0 extraction unit 120 extracts F0 for each frame from the input speech waveform. The extracted F0 is provided to the periodic/aperiodic component extraction unit 122 and the feature amount extraction unit 124.
  • The periodic/aperiodic component extraction unit 122 extracts a periodic component and an aperiodic component for each frame (unit section) from the input speech waveform. More specifically, the periodic/aperiodic component extraction unit 122 extracts the periodic and aperiodic components based on the F0 extracted from the input speech waveform.
  • The source signal s(t) is decomposed as shown in equation (1): s(t) = s_pdc(t) + s_apd(t) for frames t in which F0 exists, and s(t) = s_apd(t) otherwise. Here, f0(t) denotes F0 in frame t of the speech waveform, the periodic signal s_pdc(t) denotes the periodic component in frame t, and the aperiodic signal s_apd(t) denotes the aperiodic component in frame t.
  • In other words, when F0 exists for frame t of the input speech waveform, the source signal is treated as containing both a periodic component and an aperiodic component; when F0 does not exist, the source signal is treated as containing only the aperiodic component. That is, the periodic/aperiodic component extraction unit 122 extracts only the aperiodic component from frames (unit sections) in which the F0 extraction unit 120 cannot extract F0, and extracts both the periodic and aperiodic components from the other frames.
  • In the present embodiment, a sinusoidal model as in equation (2) is adopted as an example of representing the harmonic components of the source signal.
  • In equation (2), J represents the number of harmonics; the frequency and amplitude of each harmonic are approximated linearly. Solving this sinusoidal model requires determining the values of its model parameters. More specifically, the values that minimize the error criterion defined by equation (3), which applies a window function of length 2N_w + 1, are taken as the solution.
  • The values minimizing the criterion of equation (3) are determined by the solution shown in Non-Patent Document 8.
  • In this way, the periodic/aperiodic component extraction unit 122 extracts the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) contained in the input speech waveform according to the mathematical solution described above (a generic least-squares sketch of this harmonic decomposition follows).
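  • The following numpy sketch illustrates this decomposition generically: for each frame with a valid F0, sinusoids at multiples of F0 are fitted to the frame by least squares, the fitted sum is taken as the periodic component, and the residual as the aperiodic component. It follows the general harmonic-plus-residual idea rather than the exact formulation of equations (2) and (3), and the frame handling and number of harmonics are assumptions:

```python
import numpy as np

def decompose_frame(x, f0, fs, num_harmonics=20):
    """Split one analysis frame x into a periodic part (least-squares fit of
    harmonics of f0) and an aperiodic residual. A frame without a valid f0
    is treated as purely aperiodic."""
    if f0 <= 0:
        return np.zeros_like(x), x.copy()
    n = np.arange(len(x))
    cols = []
    for k in range(1, num_harmonics + 1):
        if k * f0 >= fs / 2:                      # keep harmonics below Nyquist
            break
        w = 2.0 * np.pi * k * f0 / fs
        cols += [np.cos(w * n), np.sin(w * n)]
    if not cols:
        return np.zeros_like(x), x.copy()
    A = np.stack(cols, axis=1)                    # design matrix of harmonic sinusoids
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    periodic = A @ coef                           # harmonic (periodic) component
    aperiodic = x - periodic                      # residual (aperiodic) component
    return periodic, aperiodic
```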
  • the feature quantity extraction unit 124 outputs a continuous F 0 , a periodic component spectrum envelope, and a non-periodic component spectrum envelope as acoustic feature quantities.
  • As the spectral envelope representation, for example, any of LSP (line spectral pair) coefficients, LPC (linear prediction coefficients), or mel-cepstrum coefficients may be adopted.
  • In the present embodiment, the logarithm of the continuous F0 is used (hereinafter abbreviated as "continuous logF0").
  • The F0 interpolation unit 126 interpolates the F0 values that the F0 extraction unit 120 extracted for each frame from the speech waveform, and generates a continuous F0 (F0 sequence). More specifically, the F0 of a target frame can be determined, for example, from the F0 extracted in one or more neighboring frames according to a predetermined interpolation function; any known interpolation method can be adopted in the F0 interpolation unit 126 (a sketch follows).
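  • A minimal numpy sketch of producing a continuous logF0 sequence by interpolating over frames where no F0 was extracted (unvoiced frames are marked by zeros here, and linear interpolation is used as an assumed example of the interpolation function):

```python
import numpy as np

def continuous_log_f0(f0):
    """Interpolate unvoiced frames (f0 == 0) to obtain a gap-free contour,
    then take the logarithm ("continuous logF0")."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no voiced frame to interpolate from")
    idx = np.arange(len(f0))
    cont = np.interp(idx, idx[voiced], f0[voiced])   # linear interpolation over gaps
    return np.log(cont)
```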
  • The spectral envelope extraction unit 128 extracts the spectral envelopes of the extracted periodic and aperiodic components. More specifically, based on the F0 extracted by the F0 extraction unit 120, the spectral envelope extraction unit 128 extracts spectral envelopes from the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) output from the periodic/aperiodic component extraction unit 122. That is, for each frame it extracts a spectral envelope (pdc) indicating the distribution of the frequency components contained in the periodic signal s_pdc(t), and a spectral envelope (apd) indicating the distribution of the frequency components contained in the aperiodic signal s_apd(t); a simplified extraction sketch follows.
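  • A simplified, numpy-only sketch of extracting a spectral envelope from one frame of a component signal by cepstral smoothing (low-quefrency liftering of the log spectrum). The patent lists LSP, LPC, and mel-cepstrum coefficients as possible representations; this plain cepstral version is only an assumed stand-in:

```python
import numpy as np

def cepstral_envelope(frame, fft_len=512, num_coeffs=40):
    """Spectral envelope of one frame via low-quefrency cepstral liftering."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), fft_len))
    log_spec = np.log(np.maximum(spec, 1e-10))
    cep = np.fft.irfft(log_spec, fft_len)            # real cepstrum
    cep[num_coeffs:fft_len - num_coeffs + 1] = 0.0   # keep only low quefrencies
    return np.exp(np.fft.rfft(cep, fft_len).real)    # smoothed (envelope) spectrum
```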
  • FIG. 6 is a diagram showing an example of a speech waveform of a periodic component and an aperiodic component output in the speech synthesis system according to the present embodiment.
  • FIG. 6 shows, as an example, an audio signal when the speaker utters “all”.
  • the DNN 16 learns the acoustic feature amount in units of frames.
  • FIG. 6(A) shows the input speech waveform (source signal), FIG. 6(B) shows the speech waveform of the periodic component extracted from the source signal, and FIG. 6(C) shows the speech waveform of the aperiodic component extracted from the source signal. The periodic component is extracted only in the sections where F0 is extracted, as shown in FIG. 6(B), while the aperiodic component is extracted both in sections where F0 is extracted and in sections where it is not, as shown in FIG. 6(C). In the section labeled "non-F0" in FIG. 6(B), the amplitude is almost zero; this section corresponds to a section in which F0 was not extracted.
  • the configuration shown in FIG. 5 includes a text analysis unit 162 and a context label generation unit 164 as components that generate a context label sequence.
  • the text analysis unit 162 and the context label generation unit 164 generate a context label based on context information of known text.
  • Since the context label is used by both the learning unit 14 and the speech synthesis unit 18, a configuration in which the context label is generated by components shared by the two is shown; however, a component for generating context labels may instead be mounted in each of the learning unit 14 and the speech synthesis unit 18.
  • the text analysis unit 162 analyzes the input text for learning or synthesis, and outputs the context information to the context label generation unit 164.
  • The context label generation unit 164 determines a context label based on the context information from the text analysis unit 162 and outputs it to the model learning unit 140.
  • In the speech synthesis system according to the present embodiment, learning is performed using frame-wise acoustic feature amounts, so the context label generation unit 164 also generates a context label for each frame. Since context labels are generally generated in units of phonemes, the context label generation unit 164 generates frame-wise context labels by adding the position information of each frame within the phoneme, as sketched below.
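  • A toy sketch of expanding phoneme-level context labels to frame level by appending within-phoneme position information, as described above (the label representation, a feature list per phoneme plus a two-element position feature, is an assumption for illustration):

```python
def frame_level_labels(phoneme_labels, phoneme_durations):
    """Expand phoneme-level context label vectors to frame level.

    phoneme_labels    : list of feature lists, one per phoneme
    phoneme_durations : number of frames of each phoneme
    """
    frames = []
    for label, dur in zip(phoneme_labels, phoneme_durations):
        for i in range(dur):
            position = [i / dur, (dur - 1 - i) / dur]   # forward / backward position in phoneme
            frames.append(list(label) + position)
    return frames
```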
  • The model learning unit 140 receives the acoustic feature amount sequence 142 from the analysis unit 12 and the context label sequence 166 from the context label generation unit 164 as inputs, and learns an acoustic model using the DNN. In this manner, the model learning unit 140 constructs an acoustic model, which is a statistical model, by learning the acoustic feature amount, including F0 and the spectral envelopes of the periodic and aperiodic components, in association with the corresponding context label.
  • In the present embodiment, the probability distribution is modeled using a DNN that takes a context label as input for each frame and outputs the acoustic feature vector for that frame (whose elements include at least the continuous logF0, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component).
  • the model learning unit 140 learns the DNN so as to minimize the mean square error with respect to the normalized acoustic feature quantity vector.
  • Such DNN learning is equivalent to modeling the probability distribution with a normal distribution that has a mean vector changing from frame to frame and a context-independent covariance matrix, as shown in equation (4).
  • the generated probability distribution sequence has a time-varying mean vector and a time-invariant covariance matrix.
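  • A minimal numpy sketch of this interpretation, assuming the frame-wise training targets and DNN outputs are available as plain stacked arrays: the DNN output serves as the time-varying mean, and a single diagonal covariance estimated from the training residuals is shared across all frames.

```python
import numpy as np

def frame_wise_gaussian(pred_means, targets):
    """Interpret an MSE-trained DNN as a Gaussian acoustic model:
    time-varying mean = DNN output; time-invariant covariance = global
    (diagonal) covariance of the prediction residuals on the training data."""
    residual = targets - pred_means                    # (num_frames, feat_dim)
    shared_var = residual.var(axis=0)                  # one variance per dimension
    return pred_means, np.tile(shared_var, (len(pred_means), 1))
```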
  • the speech synthesizer 18 generates a context label for each frame generated from the text to be synthesized, and inputs the generated context label for each frame to the DNN 16 to estimate the probability distribution series. Then, based on the estimated probability distribution series, a speech waveform is synthesized through a process reverse to that during learning.
  • the speech synthesizer 18 includes an acoustic feature quantity estimator 180, a pulse generator 184, a periodic component generator 186, an aperiodic component generator 188, and an adder 187.
  • At synthesis time, the text analysis unit 162 analyzes the input text and outputs context information, and the context label generation unit 164 generates a context label based on that context information. That is, in response to input of an arbitrary text, the text analysis unit 162 and the context label generation unit 164 determine a context label based on the context information of the text.
  • The acoustic feature amount estimation unit 180 estimates, from the acoustic model (the statistical model built in the DNN 16), the acoustic feature amount corresponding to the determined context label. More specifically, the acoustic feature amount estimation unit 180 inputs the generated frame-wise context labels to the DNN 16 holding the learned acoustic model, and estimates the corresponding acoustic feature amounts from the DNN 16. In response to the input context label sequence, the DNN 16 outputs the acoustic feature amount sequence 182, a probability distribution sequence in which only the mean vector changes from frame to frame.
  • the interpolated continuous F 0 (F 0 sequence), the spectral envelope of the periodic component, and the spectral envelope of the non-periodic component included in the acoustic feature amount sequence 182 are estimated from the context label sequence using the DNN 16.
  • Since the interpolated continuous F0 (F0 sequence) can be expressed as a continuous distribution, the corresponding excitation is composed of a continuous pulse sequence.
  • the spectral envelope of the periodic component and the spectral envelope of the non-periodic component are modeled for each.
  • the pulse generation unit 184 and the periodic component generation unit 186 reconfigure the periodic component by filtering the pulse sequence generated according to F 0 included in the estimated acoustic feature amount according to the spectral envelope of the periodic component. . More specifically, the pulse generation unit 184 generates a pulse sequence according to F 0 (F 0 sequence) from the acoustic feature quantity estimation unit 180. The periodic component generation unit 186 generates a periodic component by filtering the pulse sequence from the pulse generation unit 184 with the spectral envelope of the periodic component.
  • the aperiodic component generation unit 188 reconstructs the aperiodic component by filtering a noise sequence such as a Gaussian noise sequence according to the spectrum envelope of the aperiodic component. More specifically, the non-periodic component generation unit 188 generates a non-periodic component by filtering Gaussian noise from an arbitrary excitation source with the spectral envelope of the non-periodic component.
  • the adder 187 reconstructs the speech waveform by adding the periodic component from the periodic component generator 186 and the aperiodic component from the aperiodic component generator 188. That is, the adding unit 187 adds the reconstructed periodic component and non-periodic component, and outputs the result as a speech waveform corresponding to the input arbitrary text.
  • In the speech synthesis unit 18, a probability distribution sequence is estimated for the frame-wise context labels, and static and dynamic feature amounts are estimated; from these, an acoustic feature amount sequence with appropriate transitions is generated, and synthesized speech is generated from the estimated acoustic feature amounts (a generic parameter-generation sketch follows).
  • In this way, a speech waveform can be generated from continuous sequences without performing any V/UV determination.
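  • The step that turns frame-wise means of static and dynamic (delta) features into a smoothly transitioning static trajectory is commonly done by maximum-likelihood parameter generation. The patent text does not spell this step out, so the following is a generic, simplified sketch under the assumption of a single first-order delta feature and diagonal variances, not the patent's specific procedure:

```python
import numpy as np

def mlpg_1d(mean_static, mean_delta, var_static, var_delta):
    """Generate a smooth static trajectory from per-frame means of the static
    and delta features, given per-frame (diagonal) variances of each stream.

    Solves (W^T S W) c = W^T S mu for the static sequence c, where W stacks the
    identity (static) and a first-order delta operator, and S is the inverse variance.
    """
    T = len(mean_static)
    # Delta operator: delta[t] = 0.5 * (c[t+1] - c[t-1]), using the edge frame at the boundaries.
    D = np.zeros((T, T))
    for t in range(T):
        D[t, max(t - 1, 0)] -= 0.5
        D[t, min(t + 1, T - 1)] += 0.5
    W = np.vstack([np.eye(T), D])                       # (2T, T)
    precision = np.concatenate([1.0 / var_static, 1.0 / var_delta])
    mu = np.concatenate([mean_static, mean_delta])
    A = W.T @ (precision[:, None] * W)                  # W^T S W
    b = W.T @ (precision * mu)                          # W^T S mu
    return np.linalg.solve(A, b)
```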
  • a system using DNN as a learning means will be described as a typical example.
  • the learning means is not limited to DNN, and any supervised learning method can be adopted.
  • an HMM or a recurrent neural network may be employed.
  • FIGS. 7 and 8 are flowcharts showing an example of a processing procedure in the speech synthesis system according to the present embodiment. Each step shown in FIGS. 7 and 8 may be realized by one or more processors (for example, processor 100 shown in FIG. 2) executing one or more programs.
  • FIG. 7 shows a prior machine learning process for constructing the DNN 16
  • FIG. 8 shows a speech synthesis process using the DNN 16.
  • processor 100 divides the input speech waveform into frames (step S102).
  • Next, a context label sequence and an acoustic feature amount sequence are generated by executing, for each frame, the processing for generating a context label from the input text (steps S110 to S112) and the processing for generating the acoustic feature amounts (steps S120 to S128).
  • the processor 100 analyzes the input text to generate context information (step S110), and determines a context label for the corresponding frame based on the generated context information (step S112).
  • The processor 100 extracts F0 in the target frame of the input speech waveform (step S120) and determines a continuous F0 by interpolating the extracted F0 values (step S122). Then, the processor 100 extracts the periodic and aperiodic components in the target frame of the input speech waveform (step S124) and extracts a spectral envelope for each component (step S126). The processor 100 takes the logarithm of the continuous F0 determined in step S122 and the spectral envelopes (periodic and aperiodic components) extracted in step S126 as the acoustic feature amounts (step S128).
  • The processor 100 gives the context label determined in step S112 and the acoustic feature amount determined in step S128 to the DNN 16 (step S130). Then, the processor 100 determines whether there is an unprocessed frame (step S132). If there is an unprocessed frame (YES in step S132), the processing of steps S110 to S112 and steps S120 to S128 is repeated. If there is no unprocessed frame (NO in step S132), the processor 100 determines whether a new text and a speech waveform corresponding to the text have been input (step S134). If a new text and the corresponding speech waveform have been input (YES in step S134), the processing from step S102 onward is repeated; if not (NO in step S134), the learning process ends.
  • In the speech synthesis process, when the text to be synthesized is input (step S200), the processor 100 analyzes the input text to generate context information (step S202) and determines a context label for each frame based on the generated context information (step S204). Then, the processor 100 estimates, from the DNN 16, the acoustic feature amount corresponding to the context label determined in step S204 (step S206).
  • the processor 100 generates a pulse sequence according to F 0 included in the estimated acoustic feature amount (step S208), and filters the generated pulse sequence with a spectrum envelope (periodic component) included in the estimated acoustic feature amount. Thus, a periodic component of the speech waveform is generated (step S210).
  • The processor 100 also generates a Gaussian noise sequence (step S212) and filters the generated Gaussian noise sequence with the spectral envelope (aperiodic component) included in the estimated acoustic feature amount, thereby generating the aperiodic component of the speech waveform (step S214).
  • the processor 100 adds the periodic component generated in step S210 and the non-periodic component generated in step S214, and outputs the result as a synthesized speech waveform (step S216). Then, the speech synthesis process for the input text ends. Note that the processing of steps S206 to S216 is repeated by the number of frames constituting the input text.
  • In the evaluation experiment, 503 ATR phonetically balanced sentences uttered by one female Japanese speaker were used; 493 sentences were used as learning data and the remaining 10 sentences as evaluation sentences.
  • the sampling frequency of audio data was 16 kHz, and the analysis period was 5 ms.
  • The spectrum and the aperiodicity index (AP) obtained by WORLD analysis of the learning speech data were each expressed as 39th-order mel-cepstrum coefficients (40th order including the 0th order).
  • the log F 0 was calculated by integrating the results of a plurality of known extraction methods, and the microprosody was removed by smoothing.
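  • As an illustration of the smoothing step only, a minimal sketch that removes microprosody with a moving average over the continuous logF0 contour (the actual smoothing method and window length are not specified in the text and are assumed here):

```python
import numpy as np

def smooth_log_f0(log_f0, window=5):
    """Remove microprosody by moving-average smoothing of the continuous logF0."""
    kernel = np.ones(window) / window
    return np.convolve(log_f0, kernel, mode="same")
```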
  • The phoneme duration model of the example uses phoneme-level context labels and a five-state, no-skip, left-to-right context-dependent phoneme HSMM (hidden semi-Markov model). In learning the acoustic model with the DNN, a continuous logF0 pattern obtained by further interpolating the unvoiced sections was used. The acoustic feature amounts were obtained by further adding first-order and second-order dynamic feature amounts to these parameters.
  • In the comparative example, V/UV information was used in addition to the above feature amounts.
  • The input vector was generated by adding the duration information obtained from the HSMM duration model to the phoneme-level context label, producing a frame-level context label expressed as a 483-dimensional vector in total.
  • the output vector was a 244-dimensional acoustic feature vector in the comparative example, and a 243-dimensional acoustic feature vector in the example.
  • Table 1 shows a list of the features and models used in the example and the comparative example. Note that the input and output vectors were both normalized to zero mean and unit variance.
  • the number of hidden layers is six, the number of units is 1024, and weights are initialized using random numbers.
  • The mini-batch size was 256, the number of epochs was 30, the learning rate was 2.5 × 10^-4, the hidden-layer activation function was ReLU (rectified linear unit), and the optimizer was Adam. Dropout with a rate of 0.5 was also used; a sketch of this configuration follows.
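  • The following PyTorch sketch mirrors the configuration listed above (six hidden layers of 1024 units, ReLU activations, dropout of 0.5, Adam, MSE loss on normalized vectors), with the input/output sizes taken from the example (483-dimensional frame-level context label, 243-dimensional acoustic feature vector) and the learning rate read as 2.5 × 10^-4. It is an assumed reconstruction of the setup, not code from the patent:

```python
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    def __init__(self, in_dim=483, out_dim=243, hidden=1024, num_layers=6, dropout=0.5):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(dropout)]
            dim = hidden
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                 # x: (batch, 483) normalized context labels
        return self.net(x)                # (batch, 243) normalized acoustic features

model = AcousticDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
criterion = nn.MSELoss()                  # minimize MSE on normalized targets

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```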
  • FIG. 9 is a diagram showing an example of evaluation results of a paired comparison experiment for the speech synthesis system according to the present embodiment.
  • the non-periodicity index (AP) of the comparative example represents non-periodicity between 0.0 and 1.0.
  • the example according to the present embodiment showed better performance even when correct V / UV information was given to the comparative example. According to such a result, it can be evaluated that modeling separated into a periodic component and an aperiodic component contributes to quality improvement.
  • the speech synthesis system employs a technique that does not require determination of V / UV with respect to the source signal when performing SPSS.
  • By expressing the source signal as a combination of a periodic component and an aperiodic component instead of determining V/UV, quality degradation of the synthesized speech due to V/UV determination errors can be suppressed. Furthermore, making the F0 sequence continuous can improve the modeling accuracy of the constructed acoustic model.

Abstract

This speech synthesis system includes: a first extraction unit that extracts, at every unit interval, a fundamental frequency of a speech waveform corresponding to a known text; a second extraction unit that extracts, at every unit interval, a periodic component and a non-periodic component from the speech waveform; a third extraction unit that extracts spectral envelopes of the extracted periodic component and non-periodic component; a generation unit that generates a context label on the basis of context information of the known text; and a learning unit that performs learning by associating an acoustic feature amount including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the non-periodic component with the corresponding context label, thereby constructing a statistical model.

Description

Speech synthesis system, speech synthesis program, and speech synthesis method
 The present invention relates to a speech synthesis technique based on statistical parametric speech synthesis (hereinafter also abbreviated as "SPSS").
 Speech synthesis technology has long been widely applied to text-to-speech applications, multilingual translation services, and the like. SPSS is known as one approach to such speech synthesis. SPSS is a framework for synthesizing speech based on a statistical model. Over the past decade or more, the main research subject in SPSS has been speech synthesis based on hidden Markov models (hereinafter also abbreviated as "HMM").
 In recent years, speech synthesis based on deep neural networks (hereinafter also abbreviated as "DNN"), a type of deep learning, has attracted attention (see, for example, Non-Patent Document 1). According to the results reported in Non-Patent Document 1, DNN-based speech synthesis can generate higher-quality speech than HMM-based speech synthesis.
 In many SPSS systems, a vocoder is used as a source-filter model when generating speech. More specifically, the source-filter model consists of a vocal tract filter and an excitation source. The vocal tract filter models the vocal tract and is expressed by spectral envelope parameters. The source signal, which models the excitation source (vocal fold vibration), is expressed by mixing a pulse sequence and a noise component.
 In a commonly used vocoder, each frame of the excitation source is classified as a voiced section or an unvoiced section. If a frame is determined to be voiced, a pulse sequence at the fundamental frequency (hereinafter also abbreviated as "F0") corresponding to the voice pitch is generated; if it is determined to be unvoiced, the excitation is generated as white noise. Here, the voiced/unvoiced decision is made based on whether F0 is non-zero (voiced) or zero (unvoiced). In a typical SPSS, this F0 sequence is expressed as a discontinuous sequence that switches between a one-dimensional continuous value and a zero-dimensional discrete symbol, and for each frame a flag for switching between voiced and unvoiced (hereinafter "V/UV"; the flag is also abbreviated as the "V/UV flag") is required.
 Quality degradation of the synthesized speech can arise from V/UV determination errors in each frame and from the difficulty of modeling an excitation source that outputs such a discontinuous sequence.
 MSD (multi-space distribution) modeling has been proposed as a method for modeling such a sequence (see, for example, Non-Patent Document 2). However, MSD modeling inherently involves the difficulty of jointly representing continuous and discrete sequences. In addition, for frames with V/UV prediction errors, vocoding often degrades the quality of the synthesized speech: a frame erroneously judged voiced produces a buzzy quality, and a frame erroneously judged unvoiced produces a hoarse quality.
 Several approaches have been proposed to address the problems described above.
 The first approach is to interpolate the discontinuous F0 sequence so that it can be treated as a continuous sequence (see, for example, Non-Patent Document 3). It has been shown that F0 can then be modeled as a continuous sequence and that quality can be improved. However, this approach still requires V/UV determination at waveform-generation time, and a discrete sequence must still be modeled.
 As another approach, V/UV can be determined from some continuous sequence. For example, a method has been proposed that determines V/UV based on an aperiodicity index instead of the V/UV flag (see, for example, Non-Patent Document 4). While this method achieves completely continuous modeling, V/UV determination is still required at waveform-generation time, so the influence of V/UV determination errors cannot be completely avoided.
 As yet another approach, a method has been proposed that uses the Maximum Voiced Frequency (hereinafter also abbreviated as "MVF"), which indicates the upper frequency limit of the periodicity of a speech signal, instead of the V/UV flag (see, for example, Non-Patent Document 5). Using the MVF, speech can be divided with the high-frequency band as the aperiodic component and the low-frequency band as the periodic component. Since the MVF is modeled continuously, it can also be made to function as a V/UV flag by setting a threshold value. However, because the signal is divided into only two bands, high and low frequency, the modeling accuracy of the periodic/aperiodic components is not sufficient.
 本技術は、このような課題を解決するためのものであり、SPSSにおいて、音響モデルにおけるV/UVの判定エラーに起因する品質への影響を低減できる新たな手法を提供することを目的としている。 The present technology is intended to solve such a problem, and an object of the present technology is to provide a new method capable of reducing the influence on quality caused by a determination error of V / UV in an acoustic model in SPSS. .
 本発明のある局面に従えば、SPSSに従う音声合成システムが提供される。音声合成システムは、既知のテキストに対応する音声波形の基本周波数を単位区間毎に抽出する第1の抽出部と、音声波形から周期成分および非周期成分を単位区間毎に抽出する第2の抽出部と、抽出された周期成分および非周期成分のスペクトル包絡を抽出する第3の抽出部と、既知のテキストの文脈情報に基づくコンテキストラベルを生成する生成部と、基本周波数、周期成分のスペクトル包絡、非周期成分のスペクトル包絡を含む音響特徴量と、対応するコンテキストラベルとを対応付けて学習することで、統計モデルを構築する学習部とを含む。 According to an aspect of the present invention, a speech synthesis system according to SPSS is provided. The speech synthesis system includes a first extraction unit that extracts a fundamental frequency of a speech waveform corresponding to a known text for each unit section, and a second extraction that extracts a periodic component and an aperiodic component from the speech waveform for each unit section. A third extraction unit that extracts a spectral envelope of the extracted periodic component and aperiodic component, a generation unit that generates a context label based on context information of a known text, and a spectral envelope of the fundamental frequency and the periodic component And a learning unit that constructs a statistical model by learning by associating the acoustic feature amount including the spectrum envelope of the non-periodic component and the corresponding context label.
Preferably, the speech synthesis system further includes: a determination unit that, in response to input of an arbitrary text, determines context labels based on the context information of that text; and an estimation unit that estimates, from the statistical model, the acoustic features corresponding to the context labels determined by the determination unit. The estimated acoustic features include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component. The speech synthesis system further includes: a first reconstruction unit that reconstructs the periodic component by filtering a pulse sequence, generated according to the fundamental frequency contained in the estimated acoustic features, with the spectral envelope of the periodic component; a second reconstruction unit that reconstructs the aperiodic component by filtering a noise sequence with the spectral envelope of the aperiodic component; and an addition unit that adds the reconstructed periodic and aperiodic components and outputs the result as the speech waveform corresponding to the input text.
Preferably, the second extraction unit extracts only the aperiodic component from unit sections for which the first extraction unit cannot extract a fundamental frequency, and extracts both the periodic and aperiodic components from the remaining unit sections.
Preferably, for unit sections from which a fundamental frequency cannot be extracted, the first extraction unit determines the fundamental frequency by interpolation.
Preferably, the pulse sequence is generated from the interpolated fundamental-frequency sequence, and the noise sequence is a sequence in which noise is generated over the entire interval.
According to yet another aspect of the present invention, a speech synthesis program for realizing a speech synthesis method based on SPSS is provided. The speech synthesis program causes a computer to execute the steps of: extracting, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform; extracting spectral envelopes of the extracted periodic and aperiodic components; generating context labels based on the context information of the known text; and constructing a statistical model by learning associations between acoustic features, which include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context labels.
According to yet another aspect of the present invention, a speech synthesis method based on SPSS is provided. The speech synthesis method includes the steps of: extracting, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform; extracting spectral envelopes of the extracted periodic and aperiodic components; generating context labels based on the context information of the known text; and constructing a statistical model by learning associations between acoustic features, which include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context labels.
According to the present technology, the impact on quality caused by V/UV determination errors in the acoustic model can be reduced in SPSS.
FIG. 1 is a schematic diagram showing an overview of a multilingual translation system that uses the speech synthesis system according to the present embodiment.
FIG. 2 is a schematic diagram showing an example hardware configuration of the service providing apparatus according to the present embodiment.
FIG. 3 is a schematic diagram for explaining an overview of speech synthesis processing according to the related art.
FIG. 4 is a schematic diagram for explaining an overview of speech synthesis processing according to the present embodiment.
FIG. 5 is a block diagram for explaining the processing of the main parts of the speech synthesis system according to the present embodiment.
FIG. 6 is a diagram showing an example of the speech waveforms of the periodic and aperiodic components output in the speech synthesis system according to the present embodiment.
FIG. 7 is a flowchart showing an example of a processing procedure in the speech synthesis system according to the present embodiment.
FIG. 8 is a flowchart showing an example of a processing procedure in the speech synthesis system according to the present embodiment.
FIG. 9 is a diagram showing example results of a paired-comparison evaluation of the speech synthesis system according to the present embodiment.
Embodiments of the present invention will be described in detail with reference to the drawings. The same or corresponding parts in the drawings are denoted by the same reference numerals, and their description will not be repeated.
[A. Application example]
First, one application example of the speech synthesis system according to the present embodiment will be described; more specifically, a multilingual translation system that uses the speech synthesis system.
FIG. 1 is a schematic diagram showing an overview of a multilingual translation system 1 that uses the speech synthesis system according to the present embodiment. Referring to FIG. 1, the multilingual translation system 1 includes a service providing apparatus 10. The service providing apparatus 10 performs speech recognition, multilingual translation, and related processing on input speech (an utterance in a first language) received from a mobile terminal 30 connected via a network 2, synthesizes the corresponding utterance in a second language, and returns the synthesized result to the mobile terminal 30 as output speech.
For example, when the user 4 says the English phrase "Where is the station ?" to the mobile terminal 30, the mobile terminal 30 generates input speech from the utterance using a microphone or the like and transmits the generated input speech to the service providing apparatus 10. The service providing apparatus 10 synthesizes output speech representing the corresponding Japanese phrase "駅はどこですか？" ("Where is the station?"). When the mobile terminal 30 receives the output speech from the service providing apparatus 10, it plays it back, so that the conversation partner of the user 4 hears the question in Japanese.
Although not shown, the conversation partner of the user 4 may also have a similar mobile terminal 30. For example, when the partner answers the question from the user 4 by speaking the Japanese equivalent of "Go straight and turn left" into his or her own terminal, the same processing is executed and the corresponding English phrase "Go straight and turn left" is output from the mobile terminal of the user 4's conversation partner.
In this way, the multilingual translation system 1 can freely translate between utterances in the first language and utterances in the second language. The system is not limited to two languages; automatic translation may be performed among any number of languages.
By using such an automatic speech translation function, foreign travel and communication with foreigners can be facilitated.
The speech synthesis system according to the present embodiment included in the service providing apparatus 10 adopts one form of SPSS, as described later. As components related to the speech synthesis system, the service providing apparatus 10 includes an analysis unit 12, a learning unit 14, a DNN 16, and a speech synthesis unit 18.
As components related to automatic translation, the service providing apparatus 10 includes a speech recognition unit 20 and a translation unit 22. The service providing apparatus 10 further includes a communication processing unit 24 for communicating with the mobile terminal 30.
More specifically, the analysis unit 12 and the learning unit 14 are responsible for the machine learning used to construct the DNN 16. The functions and processing of the analysis unit 12 and the learning unit 14 are described in detail later. The DNN 16 stores the neural network obtained as the result of the machine learning performed by the analysis unit 12 and the learning unit 14.
In the present embodiment, a DNN is used as an example; however, a recurrent neural network (hereinafter also abbreviated as "RNN"), a long short-term memory (LSTM) RNN, or a convolutional neural network (CNN) may be used instead.
The speech recognition unit 20 performs speech recognition on the input speech received from the mobile terminal 30 via the communication processing unit 24 and outputs recognized text. The translation unit 22 generates text in the designated language (for convenience, also referred to as "translated text") from the recognized text output by the speech recognition unit 20. Any known methods can be adopted for the speech recognition unit 20 and the translation unit 22.
The speech synthesis unit 18 performs speech synthesis on the translated text from the translation unit 22 by referring to the DNN 16, and transmits the resulting output speech to the mobile terminal 30 via the communication processing unit 24.
For convenience of explanation, FIG. 1 shows an example in which the components responsible for the machine learning that constructs the DNN 16 (mainly the analysis unit 12 and the learning unit 14) and the components responsible for multilingual translation using the constructed DNN 16 (mainly the speech recognition unit 20, the translation unit 22, and the speech synthesis unit 18) are implemented in the same service providing apparatus 10; however, these functions may be implemented in separate apparatuses. In that case, a first apparatus may construct the DNN 16 by performing the machine learning, and a second apparatus may perform speech synthesis using the constructed DNN 16 and provide services that use the speech synthesis.
In a multilingual translation service such as the one described above, an application running on the mobile terminal 30 may take over at least part of the functions of the speech recognition unit 20 and the translation unit 22. Likewise, an application running on the mobile terminal 30 may take over the functions of the components responsible for speech synthesis (the DNN 16 and the speech synthesis unit 18).
In this way, the multilingual translation system 1 and the speech synthesis system that forms part of it can be realized by the service providing apparatus 10 and the mobile terminal 30 cooperating in any form. How the functions are divided between the apparatuses may be decided as appropriate according to the situation, and is not limited to the arrangement of the multilingual translation system 1 shown in FIG. 1.
[B. Hardware configuration of the service providing apparatus]
Next, an example of the hardware configuration of the service providing apparatus will be described. FIG. 2 is a schematic diagram showing an example hardware configuration of the service providing apparatus 10 according to the present embodiment. The service providing apparatus 10 is typically realized using a general-purpose computer.
Referring to FIG. 2, the service providing apparatus 10 includes, as its main hardware components, a processor 100, a main memory 102, a display 104, an input device 106, a network interface (I/F) 108, an optical drive 134, and a secondary storage device 112. These components are connected to one another via an internal bus 110.
The processor 100 is the computing entity that executes the processing required to realize the service providing apparatus 10 according to the present embodiment by running various programs described later, and consists of, for example, one or more CPUs (central processing units) or GPUs (graphics processing units). A CPU or GPU having multiple cores may be used.
The main memory 102 is a storage area that temporarily holds program code, working data, and the like while the processor 100 executes programs, and consists of a volatile memory device such as a DRAM (dynamic random access memory) or an SRAM (static random access memory).
The display 104 is a display unit that outputs the user interface for the processing, processing results, and the like, and consists of, for example, an LCD (liquid crystal display) or an organic EL (electroluminescence) display.
The input device 106 accepts instructions and operations from the user and consists of, for example, a keyboard, a mouse, a touch panel, or a pen. The input device 106 may also include a microphone for collecting the speech required for machine learning, or an interface for connecting to a sound-collecting device that collects such speech.
The network interface 108 exchanges data with the mobile terminal 30 and other information processing apparatuses on the Internet or an intranet. Any communication scheme such as Ethernet (registered trademark), wireless LAN (Local Area Network), or Bluetooth (registered trademark) can be adopted for the network interface 108.
The optical drive 134 reads information stored on an optical disk 136 such as a CD-ROM (compact disc read only memory) or a DVD (digital versatile disc) and outputs it to the other components via the internal bus 110. The optical disk 136 is an example of a non-transitory recording medium and is distributed with programs stored on it in a non-volatile manner. When the optical drive 134 reads a program from the optical disk 136 and installs it in the secondary storage device 112 or the like, the general-purpose computer comes to function as the service providing apparatus 10 (or as a speech synthesis apparatus). The subject matter of the present invention may therefore also be the program itself installed in the secondary storage device 112 or the like, or a recording medium such as the optical disk 136 storing a program for realizing the functions and processing according to the present embodiment.
FIG. 2 shows an optical recording medium such as the optical disk 136 as an example of a non-transitory recording medium, but the recording medium is not limited to this; a semiconductor recording medium such as a flash memory, a magnetic recording medium such as a hard disk or storage tape, or a magneto-optical recording medium such as an MO (magneto-optical disk) may be used.
The secondary storage device 112 stores the programs executed by the processor 100, the input data processed by those programs (including input speech and text for learning and input speech from the mobile terminal 30), and the output data generated by executing the programs (including the output speech transmitted to the mobile terminal 30), and consists of a non-volatile storage device such as a hard disk or an SSD (solid state drive).
More specifically, in addition to an OS (operating system), not shown, the secondary storage device 112 typically stores an analysis program 121 for realizing the analysis unit 12, a learning program 141 for realizing the learning unit 14, a speech recognition program 201 for realizing the speech recognition unit 20, a translation program 221 for realizing the translation unit 22, and a speech synthesis program 181 for realizing the speech synthesis unit 18.
Some of the libraries and function modules required when the processor 100 executes these programs may be replaced with libraries or function modules provided as standard by the OS. In that case, each program by itself does not contain all of the program modules needed to realize the corresponding function, but the necessary functions can be realized once the program is installed in the OS execution environment. Even such a program that does not include some libraries or function modules can fall within the technical scope of the present invention.
These programs may be distributed not only stored on one of the recording media described above but also by being downloaded from a server apparatus or the like via the Internet or an intranet.
In practice, databases are required to realize the speech recognition unit 20 and the translation unit 22, but for convenience of explanation they are not drawn.
In addition to the DNN 16, the secondary storage device 112 may store the machine-learning input speech 130 and the corresponding text 132 used to construct the DNN 16.
FIG. 2 shows an example in which a single computer constitutes the service providing apparatus 10; however, the multilingual translation system 1 and the speech synthesis system that forms part of it may instead be realized by multiple computers connected via a network cooperating explicitly or implicitly.
All or part of the functions realized by the computer (processor 100) executing programs may instead be realized using hard-wired circuits such as integrated circuits, for example an ASIC (application specific integrated circuit) or an FPGA (field-programmable gate array).
Those skilled in the art will be able to realize the speech synthesis system according to the present embodiment by appropriately using the technology available at the time the invention is practiced.
[C. Overview]
In the present embodiment, a speech synthesis system based on SPSS is provided. The speech synthesis system according to the present embodiment adopts a scheme that eliminates the need for V/UV determination by decomposing the source signal representing the excitation source into a periodic component and an aperiodic component. Speech parameters representing the periodic and aperiodic components of the source signal are applied to a DNN for learning.
First, speech synthesis processing according to the related art, and its application to SPSS, will be described. FIG. 3 is a schematic diagram for explaining an overview of speech synthesis processing according to the related art. Referring to FIG. 3, the related-art speech synthesis processing includes a pulse generation unit 250, a white noise generation unit 252, a switching unit 254, and a speech synthesis filter 256. In the configuration shown in FIG. 3, the pulse generation unit 250, the white noise generation unit 252, and the switching unit 254 correspond to the part that models the excitation source: the switching unit 254 selects either the pulse sequence output from the pulse generation unit 250 or the noise sequence from the white noise generation unit 252 as the source signal and supplies it to the speech synthesis filter 256. The pulse generation unit 250 is given the parameter F0, which represents the pitch of the voice, and outputs a pulse sequence at intervals equal to the reciprocal of F0 (the fundamental period, or pitch period). Although not shown, the pulse generation unit 250 may also be given an amplitude parameter representing the loudness of the voice. The speech synthesis filter 256 determines the timbre of the speech and is given parameters representing the spectral envelope.
In the source filter model for speech generation shown in FIG. 3, the input speech waveform is divided into unit sections (for example, frames), and each unit section is judged to be either a voiced section or an unvoiced section; a pulse sequence is output as the source signal for voiced sections, and a noise sequence is output as the source signal for unvoiced sections. The parameter that distinguishes voiced sections from unvoiced sections is the V/UV flag.
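For illustration only, the following Python/NumPy sketch reproduces the frame-switched excitation generation described above; the sampling frequency, frame hop, and the frame-level arrays f0 and vuv are assumptions introduced for this example rather than values taken from the document.

import numpy as np

def conventional_excitation(f0, vuv, fs=16000, hop=80):
    """Frame-switched excitation of the related-art vocoder:
    pulses at the F0 period in voiced frames, white noise in unvoiced frames."""
    n = len(f0) * hop
    e = np.zeros(n)
    next_pulse = 0  # sample index of the next pulse
    for i, (f, voiced) in enumerate(zip(f0, vuv)):
        start, end = i * hop, (i + 1) * hop
        if voiced and f > 0:
            period = int(round(fs / f))
            while next_pulse < end:
                if next_pulse >= start:
                    e[next_pulse] = 1.0
                next_pulse += period
        else:
            e[start:end] = np.random.randn(hop)  # white noise source
            next_pulse = end  # restart the pulse train after the unvoiced frame
    return e

Here the V/UV flag vuv decides, frame by frame, whether pulses or white noise drive the synthesis filter, which is exactly the switching that the present embodiment removes.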
When the source filter model shown in FIG. 3 is applied to SPSS, F0, the V/UV flag, and the spectral envelope become the parameters to be learned, so V/UV must be determined correctly for each unit section. However, determining V/UV, and modeling a source signal that is discontinuous because it switches between a pulse sequence and a noise sequence, are not easy, and quality degradation can therefore occur in the synthesized speech.
In the present embodiment, therefore, a method is adopted that does not require V/UV to be determined for each unit section of the speech waveform. This reduces the impact on the quality of the synthesized speech of the V/UV determination errors that can arise in the related art.
FIG. 4 is a schematic diagram for explaining an overview of the speech synthesis processing according to the present embodiment. Referring to FIG. 4, the speech synthesis processing according to the present embodiment includes a pulse generation unit 200, a speech synthesis filter (periodic component) 202, a Gaussian noise generation unit 204, a speech synthesis filter (aperiodic component) 206, and an addition unit 208.
In the present embodiment, instead of switching the source signal with the V/UV flag as in FIG. 3, a source signal is prepared for each of the periodic and aperiodic components. In other words, the speech signal is decomposed into a periodic component and an aperiodic component.
More specifically, the pulse generation unit 200 and the speech synthesis filter (periodic component) 202 form the part that generates the periodic component: the pulse generation unit 200 generates pulses according to the specified F0 (a continuous pulse sequence, as described later), and the speech synthesis filter (periodic component) 202 applies a filter corresponding to the spectral envelope of the periodic component to that continuous pulse sequence, thereby outputting the periodic component contained in the synthesized speech.
A continuous pulse sequence can be used regardless of whether each unit section is voiced or unvoiced because the silent sections of the periodic component are assumed to have inaudible power, so that the entire signal can be treated as voiced. That is, in sections with no periodicity, such as silence or unvoiced speech, the spectral envelope corresponding to the periodic component is assumed to have a sufficiently small amplitude. Under this assumption, even if a periodic component is generated from the F0 pulse sequence in such silent or unvoiced sections, it is considered small enough to be inaudible. Therefore, whereas the related-art speech synthesis processing stops generating the pulse sequence in unvoiced sections, the speech synthesis processing according to the present embodiment keeps generating the pulse sequence there, which reduces the effect on the synthesized speech of discontinuities in the pulse sequence.
The Gaussian noise generation unit 204 and the speech synthesis filter (aperiodic component) 206 form the part that generates the aperiodic component: the Gaussian noise generation unit 204 generates Gaussian noise as an example of a continuous noise sequence, and the speech synthesis filter (aperiodic component) 206 applies a filter corresponding to the spectral envelope of the aperiodic component to that noise sequence, thereby outputting the aperiodic component contained in the synthesized speech.
Finally, the periodic component output from the speech synthesis filter (periodic component) 202 and the aperiodic component output from the speech synthesis filter (aperiodic component) 206 are added by the addition unit 208, and a speech waveform representing the synthesized speech is output.
A noise sequence can likewise be used regardless of whether each unit section is voiced or unvoiced because the aperiodic component is assumed to consist of unvoiced speech and silence, so that the entire signal can be treated as unvoiced. As described above, by using an acoustic model that does not need to distinguish between voiced and unvoiced sections, and by performing learning based on that acoustic model, a speech synthesis method that does not require V/UV determination can be realized.
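As a rough illustration of the structure in FIG. 4, the following Python/NumPy sketch builds a continuous pulse train from a continuous F0 contour, shapes it with the periodic-component envelope, shapes Gaussian noise with the aperiodic-component envelope, and adds the two. The frame-wise FFT multiplication with overlap-add merely stands in for the speech synthesis filters 202 and 206, and the frame hop, FFT size, and the assumption that env_pdc and env_apd are per-frame magnitude envelopes of length nfft//2 + 1 are choices made for this example.

import numpy as np

def synthesize(f0, env_pdc, env_apd, fs=16000, hop=80, nfft=512):
    """Periodic branch: continuous pulse train shaped by the periodic envelope.
    Aperiodic branch: Gaussian noise shaped by the aperiodic envelope.
    The two branches are summed; no V/UV decision is made anywhere."""
    n_frames = len(f0)
    n = n_frames * hop
    pulses = np.zeros(n + nfft)
    t = 0.0
    while t < n:
        pulses[int(t)] = 1.0
        frame = min(int(t) // hop, n_frames - 1)
        t += fs / max(f0[frame], 1e-3)   # next pulse one fundamental period later
    noise = np.random.randn(n + nfft)
    win = np.hanning(nfft)
    out = np.zeros(n + nfft)
    for i in range(n_frames):
        s = i * hop
        P = np.fft.rfft(pulses[s:s + nfft] * win) * env_pdc[i]
        A = np.fft.rfft(noise[s:s + nfft] * win) * env_apd[i]
        out[s:s + nfft] += np.fft.irfft(P + A, n=nfft)
    return out[:n]

Because the pulse train never stops and the noise never switches off, no V/UV flag appears in this sketch; voicing is carried entirely by the two spectral envelopes.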
[D. Learning processing and speech synthesis processing]
Next, the learning processing and speech synthesis processing in the speech synthesis system according to the present embodiment will be described in detail. FIG. 5 is a block diagram for explaining the processing of the main parts of the speech synthesis system according to the present embodiment.
Referring to FIG. 5, the speech synthesis system includes the analysis unit 12 and the learning unit 14 for constructing the DNN 16, and the speech synthesis unit 18 that outputs speech waveforms using the DNN 16. The processing and functions of each of these units are described in detail below.
(d1: Analysis unit 12)
First, the processing and functions of the analysis unit 12 will be described. The analysis unit 12 is responsible for speech analysis and generates an acoustic feature sequence from the speech waveform of the input speech for learning. In the speech synthesis system according to the present embodiment, the acoustic features for each frame include F0 and the spectral envelopes (of the periodic and aperiodic components).
More specifically, the analysis unit 12 includes an F0 extraction unit 120, a periodic/aperiodic component extraction unit 122, and a feature extraction unit 124. The feature extraction unit 124 includes an F0 interpolation unit 126 and a spectral envelope extraction unit 128.
The F0 extraction unit 120 extracts, for each frame (unit section), F0 of the speech waveform corresponding to a known text. That is, the F0 extraction unit 120 extracts F0 from the input speech waveform frame by frame. The extracted F0 is supplied to the periodic/aperiodic component extraction unit 122 and the feature extraction unit 124.
The periodic/aperiodic component extraction unit 122 extracts a periodic component and an aperiodic component from the input speech waveform for each frame (unit section). More specifically, the periodic/aperiodic component extraction unit 122 extracts the periodic and aperiodic components based on the F0 of the input speech waveform. In the present embodiment, the source signal s(t) is decomposed as shown in Equation (1) below.
    s(t) = s_pdc(t) + s_apd(t)    (when f_0(t) exists)
    s(t) = s_apd(t)               (when f_0(t) does not exist)        ... (1)
Here, f_0(t) denotes F0 at frame t of the speech waveform, the periodic signal s_pdc(t) denotes the periodic component at frame t of the speech waveform, and the aperiodic signal s_apd(t) denotes the aperiodic component at frame t of the speech waveform.
Thus, for each frame t of the input speech waveform, the source signal is treated as containing both a periodic and an aperiodic component when F0 exists, and as containing only an aperiodic component when F0 does not exist. That is, the periodic/aperiodic component extraction unit 122 extracts only the aperiodic component from frames (unit sections) for which the F0 extraction unit 120 cannot extract F0, and extracts both the periodic and aperiodic components from the other frames.
In the present embodiment, a sinusoidal model as shown in Equation (2) below is adopted as one example of representing the harmonic (periodic) component of the source signal.
    [Equation (2): sinusoidal model of the periodic component s_pdc(t), a sum over J harmonics of f_0 whose amplitude and frequency are approximated linearly, with parameters α_k, β_k, γ, and φ_k]
In Equation (2), J denotes the number of harmonics. That is, in the sinusoidal model of Equation (2), the frequency and amplitude at each harmonic are approximated linearly. To solve this sinusoidal model, the values of α_k, β_k, γ, and φ_k must be determined. More specifically, the values that minimize δ, defined according to Equation (3) below, are taken as the solution.
    [Equation (3): definition of the error criterion δ between the speech signal and the sinusoidal model of Equation (2) over the analysis window ω(t)]
Here, ω(t) is a window function of length 2N_w + 1. The value that minimizes δ defined according to Equation (3) is determined by the solution method given in Non-Patent Document 8.
Following the mathematical solution described above, the periodic/aperiodic component extraction unit 122 extracts the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) contained in the input speech waveform.
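The actual decomposition follows Equations (2) and (3) and the solver of Non-Patent Document 8, which lets the amplitude and frequency of each harmonic vary within the analysis window. As a simplified illustration of the same idea, the Python/NumPy sketch below fits a constant-amplitude harmonic model at f_0 to one windowed frame by least squares and treats the residual as the aperiodic component; frames without F0 are assigned entirely to the aperiodic component, as in Equation (1). The sampling frequency and the choice of harmonics up to the Nyquist frequency are assumptions for this example.

import numpy as np

def decompose_frame(s, f0, fs=16000, J=None):
    """Split one windowed frame s into a periodic part (harmonic least-squares
    fit at f0) and an aperiodic part (residual). No F0 means all aperiodic."""
    n = len(s)
    if f0 is None or f0 <= 0:
        return np.zeros(n), s.copy()            # Eq. (1): no F0 -> aperiodic only
    if J is None:
        J = max(1, int((fs / 2) // f0))         # harmonics up to the Nyquist limit
    t = np.arange(n) / fs
    cols = []                                   # design matrix of harmonic cos/sin
    for k in range(1, J + 1):
        cols.append(np.cos(2 * np.pi * k * f0 * t))
        cols.append(np.sin(2 * np.pi * k * f0 * t))
    A = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(A, s, rcond=None)
    s_pdc = A @ coef                            # periodic (harmonic) component
    s_apd = s - s_pdc                           # aperiodic residual
    return s_pdc, s_apd

A full implementation would additionally estimate within-frame amplitude and frequency slopes through α_k, β_k, and γ, as Equation (2) describes.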
The feature extraction unit 124 outputs, as acoustic features, the continuous F0, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component. As the spectral envelope representation, any of LSP (line spectral pair), LPC (linear prediction coefficients), or mel-cepstral coefficients may be adopted. As the acoustic feature, the logarithm of the continuous F0 (hereinafter also abbreviated as "continuous logF0") is used.
The F0 interpolation unit 126 interpolates the F0 values extracted frame by frame from the speech waveform by the F0 extraction unit 120 to generate a continuous F0 (an F0 sequence). More specifically, for example, F0 in a target frame can be determined from the F0 values extracted in one or more nearby frames according to a predetermined interpolation function. Any known method can be adopted for the interpolation performed by the F0 interpolation unit 126.
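Because any known interpolation method may be used here, the following is only a minimal Python/NumPy sketch; it assumes that frames in which F0 extraction failed are marked with 0 and interpolates linearly in the log domain, which also yields the continuous logF0 feature directly.

import numpy as np

def continuous_log_f0(f0):
    """Fill frames without F0 (marked 0) by linear interpolation of log F0,
    yielding the continuous logF0 contour used as an acoustic feature."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no frame with F0 to interpolate from")
    log_f0 = np.empty(len(f0))
    log_f0[voiced] = np.log(f0[voiced])
    idx = np.arange(len(f0))
    log_f0[~voiced] = np.interp(idx[~voiced], idx[voiced], log_f0[voiced])
    return log_f0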
The spectral envelope extraction unit 128 extracts the spectral envelopes of the extracted periodic and aperiodic components. More specifically, based on the F0 extracted by the F0 extraction unit 120, the spectral envelope extraction unit 128 extracts spectral envelopes from the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) output from the periodic/aperiodic component extraction unit 122. That is, for each frame, the spectral envelope extraction unit 128 extracts a spectral envelope (pdc) representing the distribution of the frequency components contained in the periodic signal s_pdc(t), and a spectral envelope (apd) representing the distribution of the frequency components contained in the aperiodic signal s_apd(t).
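The document allows LSP, LPC, or mel-cepstral coefficients as the envelope representation; purely to show the shape of the per-frame operation, the sketch below instead uses simple cepstral smoothing of a frame's log-magnitude spectrum, which is not one of the representations named above and is an assumption of this example.

import numpy as np

def cepstral_envelope(frame, nfft=512, n_ceps=40):
    """Per-frame spectral envelope by cepstral smoothing: keep only the
    low-quefrency cepstral coefficients of the log-magnitude spectrum."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft)) + 1e-10
    log_spec = np.log(spec)
    full = np.concatenate([log_spec, log_spec[-2:0:-1]])  # mirror to a full spectrum
    ceps = np.fft.ifft(full).real
    ceps[n_ceps:-n_ceps] = 0.0                  # lifter: discard high quefrencies
    smoothed = np.fft.fft(ceps).real[:nfft // 2 + 1]
    return np.exp(smoothed)                     # linear-magnitude envelope

Applied to the frame-wise outputs s_pdc(t) and s_apd(t), this kind of smoothing would produce the per-frame envelopes (pdc) and (apd) referred to above.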
FIG. 6 is a diagram showing an example of the speech waveforms of the periodic and aperiodic components output in the speech synthesis system according to the present embodiment. As an example, FIG. 6 shows the speech signal produced when a speaker utters "すべて" (subete, "all"). As described later, the DNN 16 learns the acoustic features frame by frame.
FIG. 6(a) shows the input speech waveform (the source signal), FIG. 6(b) shows the waveform of the periodic component extracted from the source signal, and FIG. 6(c) shows the waveform of the aperiodic component extracted from the source signal. The periodic component of the sections in which F0 is extracted appears as shown in FIG. 6(b), while the aperiodic component of those sections, together with the sections in which F0 is not extracted, appears as shown in FIG. 6(c). In the section labeled "non-F0" in FIG. 6(b), the amplitude is nearly zero; this section corresponds to a section in which F0 is not extracted.
(d2: Learning unit 14)
Next, the processing and functions of the learning unit 14 will be described. In SPSS, the relationship between an input text and the speech waveform corresponding to that text is learned statistically. In general, it is not easy to model this relationship directly. In the speech synthesis system according to the present embodiment, therefore, a context label sequence is generated based on the context information of the input text, and an acoustic feature sequence including F0 and the spectral envelopes is generated from the input speech waveform. By learning from the context label sequence and the acoustic feature sequence, an acoustic model is constructed that takes the context label sequence as input and outputs the acoustic feature sequence. In the present embodiment, this acoustic model, a statistical model, is constructed with a DNN. As a result, the DNN 16 stores the parameters representing the constructed acoustic model (statistical model).
In the configuration shown in FIG. 5, the components that generate the context label sequence are a text analysis unit 162 and a context label generation unit 164. The text analysis unit 162 and the context label generation unit 164 generate context labels based on the context information of the known text.
Because the context labels are used by both the learning unit 14 and the speech synthesis unit 18, the configuration shown here is one in which the learning unit 14 and the speech synthesis unit 18 share these components. However, components for generating context labels may instead be implemented separately in each of the learning unit 14 and the speech synthesis unit 18.
The text analysis unit 162 analyzes the input text, whether for learning or for synthesis, and outputs its context information to the context label generation unit 164. The context label generation unit 164 determines context labels based on the context information from the text analysis unit 162 and outputs them to the model learning unit 140.
Because the speech synthesis system according to the present embodiment learns from frame-level acoustic features, the context label generation unit 164 also generates context labels frame by frame. Context labels are generally produced in units of phonemes, so the context label generation unit 164 generates frame-level context labels by adding positional information indicating where each frame lies within its phoneme.
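The exact label format is not specified in the document; as a sketch, the Python function below expands hypothetical phoneme-level context labels to frame level by attaching each frame's relative position within its phoneme, which is the kind of positional information added by the context label generation unit 164. The 5 ms frame hop and the dictionary encoding are assumptions introduced for the example.

def frame_context_labels(phoneme_labels, phoneme_durations_ms, hop_ms=5.0):
    """Expand phoneme-level context labels to frame level by appending the
    relative position of each frame within its phoneme (a value in [0, 1))."""
    frames = []
    for label, dur_ms in zip(phoneme_labels, phoneme_durations_ms):
        n = max(1, int(round(dur_ms / hop_ms)))
        for i in range(n):
            frames.append({"context": label, "frame_position": i / n})
    return frames

# Hypothetical usage: frame_context_labels(["sil-s+u", "s-u+b"], [40.0, 55.0])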
The model learning unit 140 receives the acoustic feature sequence 142 from the analysis unit 12 and the context label sequence 166 from the context label generation unit 164 as input, and learns the acoustic model using a DNN. In this way, the model learning unit 140 constructs the acoustic model, a statistical model, by learning associations between the acoustic features, which include F0, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context labels.
In the DNN-based acoustic model training performed by the model learning unit 140, the probability distribution is modeled with a DNN that takes the context label of each frame as input and outputs the acoustic feature vector of that frame (whose elements include at least the continuous logF0, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component). Typically, the model learning unit 140 trains the DNN so as to minimize the mean squared error of the normalized acoustic feature vectors. Training the DNN in this way is equivalent to modeling the probability distribution with a normal distribution whose mean vector varies from frame to frame and whose covariance matrix is context-independent, as shown in Equation (4) below.
    p(o_t | l_t, λ) = N(o_t; μ_t, U)        ... (4)
where o_t and l_t denote the acoustic feature vector and the context label at frame t, respectively.
Here, λ denotes the DNN parameter set, U denotes the global covariance matrix, and μ_t denotes the mean vector of the speech parameters estimated by the DNN. The generated probability distribution sequence therefore has a time-varying mean vector and a time-invariant covariance matrix.
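Neither the network architecture nor the training framework is prescribed by the document. Under those assumptions, the following PyTorch sketch trains a small feed-forward DNN with the mean-squared-error criterion on z-normalized acoustic features, which corresponds to the Gaussian interpretation of Equation (4) with a context-independent global covariance; contexts and features are assumed to be float tensors of shape (frames, dimensions).

import torch
import torch.nn as nn

def train_acoustic_model(contexts, features, hidden=512, epochs=50, lr=1e-3):
    """contexts: (T, D_in) frame-level context vectors.
    features: (T, D_out) acoustic features (continuous logF0 plus the
    periodic and aperiodic spectral-envelope parameters).
    The features are z-normalized and the DNN is trained with MSE."""
    mean, std = features.mean(0), features.std(0).clamp_min(1e-6)
    targets = (features - mean) / std
    model = nn.Sequential(
        nn.Linear(contexts.shape[1], hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, features.shape[1]),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(contexts), targets)
        loss.backward()
        opt.step()
    return model, (mean, std)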
(d3: Speech synthesis unit 18)
Next, the processing and functions of the speech synthesis unit 18 will be described. The speech synthesis unit 18 generates frame-level context labels from the text to be synthesized and inputs them to the DNN 16 to estimate a probability distribution sequence. Based on the estimated probability distribution sequence, it then synthesizes a speech waveform through processing that is the reverse of that used during learning.
More specifically, the speech synthesis unit 18 includes an acoustic feature estimation unit 180, a pulse generation unit 184, a periodic component generation unit 186, an aperiodic component generation unit 188, and an addition unit 187.
When a text to be synthesized is input, the text analysis unit 162 analyzes the input text and outputs its context information, and the context label generation unit 164 generates context labels based on that context information. That is, in response to the input of an arbitrary text, the text analysis unit 162 and the context label generation unit 164 determine context labels based on the context information of that text.
The acoustic feature estimation unit 180 estimates the acoustic features corresponding to the determined context labels from the acoustic model, the statistical model built in the DNN 16. More specifically, the acoustic feature estimation unit 180 inputs the generated frame-level context labels to the DNN 16, which represents the trained acoustic model, and estimates from it the acoustic features corresponding to the input context labels. In response to the input of the context label sequence, the DNN 16 outputs an acoustic feature sequence 182, a probability distribution sequence in which only the mean vector changes from frame to frame.
The interpolated continuous F0 (F0 sequence), the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component contained in the acoustic feature sequence 182 are estimated from the context label sequence using the DNN 16.
Because the interpolated continuous F0 (F0 sequence) can be expressed as a continuous distribution, a continuous pulse sequence is constructed from it. The spectral envelope of the periodic component and the spectral envelope of the aperiodic component are each modeled separately.
The pulse generation unit 184 and the periodic component generation unit 186 reconstruct the periodic component by filtering the pulse sequence, generated according to the F0 contained in the estimated acoustic features, with the spectral envelope of the periodic component. More specifically, the pulse generation unit 184 generates a pulse sequence according to the F0 (F0 sequence) from the acoustic feature estimation unit 180, and the periodic component generation unit 186 generates the periodic component by filtering that pulse sequence with the spectral envelope of the periodic component.
The aperiodic component generation unit 188 reconstructs the aperiodic component by filtering a noise sequence, such as a Gaussian noise sequence, with the spectral envelope of the aperiodic component. More specifically, the aperiodic component generation unit 188 generates the aperiodic component by filtering Gaussian noise from an arbitrary excitation source with the spectral envelope of the aperiodic component.
The addition unit 187 reconstructs the speech waveform by adding the periodic component from the periodic component generation unit 186 and the aperiodic component from the aperiodic component generation unit 188. That is, the addition unit 187 adds the reconstructed periodic and aperiodic components and outputs the result as the speech waveform corresponding to the input text.
As described above, the speech synthesis system according to the present embodiment uses the DNN 16, constructed in advance by learning, to estimate a probability distribution sequence for the frame-level context labels, and generates an appropriately varying acoustic feature sequence by exploiting the explicit relationship between static and dynamic features. The generated acoustic feature sequence is then applied to the vocoder to produce synthesized speech from the estimated acoustic features.
Thus, the speech synthesis system according to the present embodiment can generate a speech waveform from continuous sequences without performing V/UV determination.
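To show how the pieces connect at synthesis time, the following sketch runs a trained DNN (for example, the one from the training sketch above) on frame-level context vectors, de-normalizes the outputs to recover the mean vectors of Equation (4), and splits them into the continuous logF0 and the two spectral envelopes. The feature layout assumed here (logF0 first, then the periodic and aperiodic envelopes of nfft//2 + 1 bins each) is an illustration only; the recovered F0 contour and envelopes could then drive a vocoder such as the synthesize() sketch shown earlier.

import numpy as np
import torch

def generate_features(model, stats, contexts, nfft=512):
    """Run the trained DNN on frame-level context vectors and recover the
    acoustic features (the mean vectors of Eq. (4)); the split below assumes
    the same feature layout that was used at training time."""
    mean, std = stats
    with torch.no_grad():
        y = model(contexts) * std + mean          # de-normalize the DNN output
    y = y.numpy()
    n_env = nfft // 2 + 1                         # assumed envelope dimension
    log_f0  = y[:, 0]
    env_pdc = y[:, 1:1 + n_env]                   # periodic spectral envelope
    env_apd = y[:, 1 + n_env:1 + 2 * n_env]       # aperiodic spectral envelope
    return np.exp(log_f0), env_pdc, env_apd       # inputs for the vocoder sketch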
Although this embodiment describes, as a typical example, a system that uses a DNN as the learning means, the learning means is not limited to a DNN; any supervised learning method can be adopted, for example an HMM or a recurrent neural network.
[E. Processing procedure]
FIGS. 7 and 8 are flowcharts showing an example of the processing procedures in the speech synthesis system according to the present embodiment. Each step shown in FIGS. 7 and 8 may be realized by one or more processors (for example, the processor 100 shown in FIG. 2) executing one or more programs.
FIG. 7 shows the prior machine-learning processing for constructing the DNN 16, and FIG. 8 shows the speech synthesis processing that uses the DNN 16.
Referring to FIG. 7, when a known text and the speech waveform corresponding to that text are input (step S100), the processor 100 divides the input speech waveform into frames (step S102) and, for each frame, generates the context label sequence and the acoustic feature sequence by executing the processing that generates a context label from the input text (steps S110 to S112) and the processing that generates the acoustic features (steps S120 to S128).
 That is, the processor 100 analyzes the input text to generate context information (step S110) and, based on the generated context information, determines the context label for the corresponding frame (step S112).
 The processor 100 also extracts F0 in the target frame of the input speech waveform (step S120) and determines a continuous F0 by interpolating between it and the previously extracted F0 values (step S122). The processor 100 then extracts the periodic component and the aperiodic component in the target frame of the input speech waveform (step S124) and extracts the spectral envelope of each component (step S126). The processor 100 determines the logarithm of the continuous F0 determined in step S122 and the spectral envelopes (of the periodic and aperiodic components) extracted in step S126 as the acoustic features (step S128).
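 A simple way to obtain a continuous F0 contour from frame-wise F0 values that contain unvoiced (zero) frames is linear interpolation between the surrounding voiced frames, as sketched below; the use of linear interpolation and the treatment of leading and trailing unvoiced frames are assumptions for illustration.

```python
import numpy as np

def continuous_log_f0(f0):
    """f0: array of frame-wise F0 values in Hz, 0 for unvoiced frames.
    Returns a continuous log-F0 contour with unvoiced gaps filled by
    linear interpolation between neighbouring voiced frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no voiced frames to interpolate from")
    log_f0 = np.zeros_like(f0)
    log_f0[voiced] = np.log(f0[voiced])
    idx = np.arange(len(f0))
    # Unvoiced frames (including the edges) are filled from the voiced frames
    log_f0[~voiced] = np.interp(idx[~voiced], idx[voiced], log_f0[voiced])
    return log_f0
```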
 The processor 100 adds the context label determined in step S112 and the acoustic features determined in step S128 to the DNN 16 (step S130). The processor 100 then determines whether an unprocessed frame remains (step S132). If an unprocessed frame remains (YES in step S132), the processing of steps S110 to S112 and steps S120 to S128 is repeated. If no unprocessed frame remains (NO in step S132), the processor 100 determines whether a new text and a speech waveform corresponding to that text have been input (step S134). If a new text and its corresponding speech waveform have been input (YES in step S134), the processing from step S102 onward is repeated.
 If no new text and corresponding speech waveform have been input (NO in step S134), the learning process ends.
 In the above description, an example is shown in which the context label and the acoustic features are input to the DNN 16 each time they are generated; alternatively, they may be input to the DNN 16 collectively after the generation of the context label sequence and the acoustic feature sequence from the target speech waveform has been completed.
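 The frame-wise pairs accumulated in steps S110 to S130 constitute the supervised training set for the DNN 16. The sketch below illustrates collecting the pairs before feeding them to the model at once, as mentioned above; extract_context_label and extract_acoustic_features are hypothetical stand-ins for the processing of steps S110 to S112 and S120 to S128.

```python
import numpy as np

def build_training_pairs(frames, frame_contexts,
                         extract_context_label, extract_acoustic_features):
    """Accumulate frame-wise (context label vector, acoustic feature vector)
    pairs for supervised training of the acoustic model (steps S110 to S130)."""
    inputs, targets = [], []
    for frame, context in zip(frames, frame_contexts):
        inputs.append(extract_context_label(context))      # steps S110-S112
        targets.append(extract_acoustic_features(frame))   # steps S120-S128
    return np.stack(inputs), np.stack(targets)
```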
 Next, referring to FIG. 8, when a text to be synthesized is input (step S200), the processor 100 analyzes the input text to generate context information (step S202) and, based on the generated context information, determines the context label for the corresponding frame (step S204). The processor 100 then estimates, from the DNN 16, the acoustic features corresponding to the context label determined in step S204 (step S206).
 The processor 100 generates a pulse sequence according to the F0 included in the estimated acoustic features (step S208) and filters the generated pulse sequence with the spectral envelope (of the periodic component) included in the estimated acoustic features, thereby generating the periodic component of the speech waveform (step S210).
 The processor 100 also generates a Gaussian noise sequence (step S212) and filters the generated Gaussian noise sequence with the spectral envelope (of the aperiodic component) included in the estimated acoustic features, thereby generating the aperiodic component of the speech waveform (step S214).
 Finally, the processor 100 adds the periodic component generated in step S210 and the aperiodic component generated in step S214 and outputs the result as the speech waveform of the synthesized speech (step S216). The speech synthesis process for the input text then ends. Note that the processing of steps S206 to S216 is repeated for the number of frames constituting the input text.
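 Step S208 can be illustrated by placing one impulse every fs / F0 samples according to the continuous F0 contour, as sketched below; the sampling rate, hop size, and simple impulse placement are assumptions for illustration, and reconstruct_waveform refers to the sketch given earlier.

```python
import numpy as np

def pulse_train(f0_contour, fs=16000, hop=80):
    """Generate an excitation pulse sequence from a frame-wise continuous
    F0 contour in Hz: one unit impulse every fs / F0 samples (step S208)."""
    n_samples = len(f0_contour) * hop
    excitation = np.zeros(n_samples)
    next_pulse = 0.0
    for n in range(n_samples):
        if n >= next_pulse:
            excitation[n] = 1.0
            f0 = f0_contour[min(n // hop, len(f0_contour) - 1)]
            next_pulse += fs / max(f0, 1.0)
        # samples between pulses stay zero
    return excitation

# The synthesized waveform is then obtained as in the earlier sketch:
#   noise = np.random.randn(len(excitation))
#   waveform = reconstruct_waveform(excitation, noise, periodic_env, aperiodic_env)
```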
 [F. Experimental evaluation]
 Next, an experimental evaluation conducted on the effectiveness of the speech synthesis system according to the present embodiment is described.
 (f1: Experimental conditions)
 As the comparative example against which the example according to the present embodiment was compared, general DNN speech synthesis was used.
 As the speech data, 503 ATR phonetically balanced sentences uttered by one Japanese female speaker were used. Of these, 493 sentences were used as training data and the remaining 10 sentences were used as evaluation sentences.
 The sampling frequency of the speech data was 16 kHz and the analysis period was 5 ms. The spectrum and the aperiodicity index (AP) obtained by WORLD analysis of the speech data in the training set were each expressed as 39th-order mel-cepstral coefficients (40th order including the 0th).
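 One possible realization of this analysis step, sketched below, uses the pyworld and pysptk Python bindings; the choice of these libraries, the use of the Harvest F0 estimator, and the all-pass constant alpha = 0.42 for 16 kHz are assumptions made for illustration (the embodiment itself combines several F0 extraction methods and does not prescribe specific tools).

```python
import numpy as np
import pyworld
import pysptk

def world_features(x, fs=16000, frame_period=5.0, order=39, alpha=0.42):
    """WORLD analysis followed by mel-cepstral parameterization.
    Returns (f0, mel-cepstrum of the spectrum, mel-cepstrum of the AP)."""
    x = x.astype(np.float64)
    f0, t = pyworld.harvest(x, fs, frame_period=frame_period)
    sp = pyworld.cheaptrick(x, f0, t, fs)   # spectral envelope
    ap = pyworld.d4c(x, f0, t, fs)          # aperiodicity index (AP)
    mgc = pysptk.sp2mc(sp, order=order, alpha=alpha)  # 40 coefficients
    bap = pysptk.sp2mc(ap, order=order, alpha=alpha)
    return f0, mgc, bap
```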
 log F0 was calculated by integrating the results of a plurality of known extraction methods, and microprosody was then removed by smoothing.
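 Microprosody removal by smoothing can be illustrated, for example, with a simple moving-average filter over the continuous log-F0 contour; the window length below is an arbitrary choice, and the embodiment does not specify the particular smoothing method.

```python
import numpy as np

def smooth_log_f0(log_f0, window_frames=7):
    """Remove fine microprosodic fluctuations from a continuous log-F0
    contour by moving-average smoothing (illustrative choice)."""
    kernel = np.ones(window_frames) / window_frames
    padded = np.pad(log_f0, window_frames // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```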
 As in the HMM speech synthesis of the comparative example, the phoneme duration model of the example was trained as a five-state, no-skip, left-to-right, context-dependent phoneme HSMM (hidden semi-Markov model) using phoneme-level context labels. In training the acoustic model with the DNN, a continuous log F0 pattern obtained by further interpolating the unvoiced sections was used. The acoustic features were obtained by further appending first-order and second-order dynamic features to these parameters.
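 First- and second-order dynamic features are typically computed by applying delta windows along the time axis of each static feature sequence, as sketched below; the window coefficients shown are the commonly used ones and are an assumption rather than values stated in the embodiment.

```python
import numpy as np

def append_dynamic_features(static):
    """static: array of shape [T, D]. Returns an array of shape [T, 3*D]
    with first- and second-order dynamic features appended."""
    delta_win = np.array([-0.5, 0.0, 0.5])
    accel_win = np.array([1.0, -2.0, 1.0])

    def apply_window(x, win):
        pad = len(win) // 2
        padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        return np.stack([np.correlate(padded[:, d], win, mode="valid")
                         for d in range(x.shape[1])], axis=1)

    return np.concatenate(
        [static, apply_window(static, delta_win), apply_window(static, accel_win)],
        axis=1)
```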
 For the DNN speech synthesis of the comparative example, V/UV information was used in addition to the above features. The input vector was obtained by appending the duration information obtained from the HSMM duration model to the phoneme-level context labels to generate a context label for each frame, and was expressed as a vector of 483 dimensions in total.
 The output vector was a 244-dimensional acoustic feature vector in the comparative example and a 243-dimensional acoustic feature vector in the example.
 Table 1 below lists the features and models used in the example and the comparative example. Both the input vectors and the output vectors were normalized to zero mean and unit variance.
 In the DNN network configuration, six hidden layers with 1024 units each were used, and the weights were initialized with random numbers. The mini-batch size was 256, the number of epochs was 30, the learning rate was 2.5×10⁻⁴, the activation function of the hidden layers was ReLU (rectified linear unit), and the optimizer was Adam. Dropout with a rate of 0.5 was also used.
 [Table 1: features and models used in the example and the comparative example]
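 A network with the configuration listed above can be sketched in PyTorch as follows; the framework, the mean-squared-error loss on the standardized vectors, and the interpretation of the learning rate as 2.5×10⁻⁴ are assumptions for illustration rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    """Six hidden layers of 1024 ReLU units with dropout 0.5, mapping a
    483-dimensional context label vector to a 243-dimensional acoustic
    feature vector (dimensions of the example)."""
    def __init__(self, in_dim=483, hidden=1024, out_dim=243, n_layers=6):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(0.5)]
            dim = hidden
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = AcousticDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
criterion = nn.MSELoss()
# Training: 30 epochs over mini-batches of 256 standardized
# (context label, acoustic feature) pairs.
```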
 (f2: Subjective evaluation)
 As shown in Table 1, the acoustic features differ between the example and the comparative example, so the evaluation was conducted by subjective evaluation rather than objective evaluation. More specifically, the naturalness of the synthesized speech was compared in a paired-comparison experiment.
 As described above, the 10 sentences of the 503 ATR phonetically balanced sentences that were not used as training data were used as evaluation speech. The subjects (four male and one female) listened to the synthesized speech generated by the example and by the comparative example and were asked to select the one they felt was more natural (higher speech quality). However, when no difference was perceived between the presented pair of speech samples, the option "neither" was allowed.
 In both the example and the comparative example, a postfilter was applied to the mel-cepstral coefficients of the spectral envelope.
 FIG. 9 shows an example of the evaluation results of the paired-comparison experiment for the speech synthesis system according to the present embodiment. In FIG. 9, the aperiodicity index (AP) of the comparative example expresses aperiodicity in the range from 0.0 to 1.0.
 α in FIG. 9 denotes the threshold on the AP. It is completely voiced when α = 0.0 and completely unvoiced when α = 1.0. Frames whose AP is lower than the threshold α were treated as voiced, and frames whose AP is higher were treated as unvoiced.
 As thresholds that gave low V/UV determination error rates in a preliminary experiment, α = 0.5 and α = 0.6 were used (FIGS. 9(a) and 9(b)). In addition, "reference" in FIG. 9(c) shows the result obtained when the correct V/UV determination results were given.
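 For the comparative example, the V/UV decision from the aperiodicity index amounts to a simple per-frame threshold comparison, sketched below; operating on a frame-averaged AP value is an assumption for illustration.

```python
import numpy as np

def voiced_unvoiced(ap, alpha=0.5):
    """ap: per-frame aperiodicity index in [0.0, 1.0] (e.g. averaged over
    frequency bands). Returns True for voiced frames (AP below the
    threshold alpha) and False for unvoiced frames."""
    return np.asarray(ap) < alpha
```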
 In every case shown in FIGS. 9(a) to 9(c), the p-value of the test statistic for the example relative to the comparative example satisfied p < 0.01, confirming that the example showed a statistically significant advantage.
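 The embodiment does not state which statistical test produced these p-values; purely as an illustration, preference counts from a paired-comparison experiment can be assessed with a two-sided binomial (sign) test as sketched below, discarding the "neither" responses.

```python
from scipy.stats import binomtest

def paired_comparison_p_value(wins_example, wins_comparative):
    """Two-sided sign test on paired-comparison preference counts
    (ties excluded). Illustrative only; not the test named in the text."""
    n = wins_example + wins_comparative
    return binomtest(wins_example, n, p=0.5).pvalue

# e.g. paired_comparison_p_value(80, 20) gives a p-value well below 0.01
```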
 (f3: Conclusions of the experimental evaluation)
 In the speech synthesis system according to the present embodiment, separating the input speech into periodic and aperiodic components made it possible to express the trajectories of F0 and the spectral envelope continuously. Adopting this approach is considered to have provided advantages such as improved modeling accuracy and avoidance of V/UV determination errors.
 According to the results of the subjective evaluation described above, the example according to the present embodiment showed better performance than the comparative example even when the correct V/UV information was given to the comparative example. From this result, it can be concluded that the modeling in which the speech is separated into periodic and aperiodic components contributes to the quality improvement.
 [G. Summary]
 In performing SPSS, the speech synthesis system according to the present embodiment employs a technique that does not require V/UV determination for the source signal. By expressing the source signal as a combination of a periodic component and an aperiodic component instead of determining V/UV, quality degradation of the synthesized speech caused by V/UV determination errors can be suppressed. In addition, making the F0 sequence continuous can improve the modeling accuracy of the constructed acoustic model.
 It was shown, albeit by subjective evaluation, that the quality of the synthesized speech produced by the speech synthesis system according to the present embodiment can be sufficiently improved compared with the conventional method.
 The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated not by the above description of the embodiments but by the claims, and is intended to include all modifications within the meaning and scope equivalent to the claims.
DESCRIPTION OF SYMBOLS 1 multilingual translation system, 2 network, 4 user, 10 service providing apparatus, 12 analysis unit, 14 learning unit, 18 speech synthesis unit, 20 speech recognition unit, 22 translation unit, 24 communication processing unit, 30 portable terminal, 100 processor, 102 main memory, 104 display, 106 input device, 108 network interface, 110 internal bus, 112 secondary storage device, 120 F0 extraction unit, 121 analysis program, 122 periodic/aperiodic component extraction unit, 124 feature extraction unit, 126 F0 interpolation unit, 128 spectral envelope extraction unit, 130 input speech, 132 text, 134 optical drive, 136 optical disc, 140 model learning unit, 141 learning program, 142, 182 acoustic feature sequence, 162 text analysis unit, 164 context label generation unit, 166 context label sequence, 180 acoustic feature estimation unit, 181 speech synthesis program, 184, 200, 250 pulse generation unit, 186 periodic component generation unit, 187, 208 addition unit, 188 aperiodic component generation unit, 201 speech recognition program, 204 Gaussian noise generation unit, 221 translation program, 252 white noise generation unit, 254 switching unit, 256 speech synthesis filter.

Claims (6)

  1.  A speech synthesis system according to statistical parametric speech synthesis, comprising:
     a first extraction unit that extracts, for each unit section, a fundamental frequency of a speech waveform corresponding to a known text;
     a second extraction unit that extracts, for each unit section, a periodic component and an aperiodic component from the speech waveform;
     a third extraction unit that extracts spectral envelopes of the extracted periodic component and aperiodic component;
     a generation unit that generates a context label based on context information of the known text; and
     a learning unit that constructs a statistical model by learning, in association with each other, acoustic features including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context label.
  2.  The speech synthesis system according to claim 1, further comprising:
     a determination unit that, in response to input of an arbitrary text, determines a context label based on context information of the text;
     an estimation unit that estimates, from the statistical model, acoustic features corresponding to the context label determined by the determination unit, the estimated acoustic features including a fundamental frequency, a spectral envelope of a periodic component, and a spectral envelope of an aperiodic component;
     a first reconstruction unit that reconstructs the periodic component by filtering a pulse sequence, generated according to the fundamental frequency included in the estimated acoustic features, according to the spectral envelope of the periodic component;
     a second reconstruction unit that reconstructs the aperiodic component by filtering a noise sequence according to the spectral envelope of the aperiodic component; and
     an addition unit that adds the reconstructed periodic component and aperiodic component and outputs the result as a speech waveform corresponding to the input arbitrary text.
  3.  The speech synthesis system according to claim 1 or 2, wherein the second extraction unit extracts only the aperiodic component from a unit section for which the first extraction unit cannot extract the fundamental frequency, and extracts the periodic component and the aperiodic component from the other unit sections.
  4.  The speech synthesis system according to any one of claims 1 to 3, wherein the first extraction unit determines, by interpolation processing, a fundamental frequency for a unit section for which the fundamental frequency cannot be extracted.
  5.  A speech synthesis program for realizing a speech synthesis method according to statistical parametric speech synthesis, the speech synthesis program causing a computer to execute:
     extracting, for each unit section, a fundamental frequency of a speech waveform corresponding to a known text;
     extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform;
     extracting spectral envelopes of the extracted periodic component and aperiodic component;
     generating a context label based on context information of the known text; and
     constructing a statistical model by learning, in association with each other, acoustic features including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context label.
  6.  A speech synthesis method according to statistical parametric speech synthesis, comprising:
     extracting, for each unit section, a fundamental frequency of a speech waveform corresponding to a known text;
     extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform;
     extracting spectral envelopes of the extracted periodic component and aperiodic component;
     generating a context label based on context information of the known text; and
     constructing a statistical model by learning, in association with each other, acoustic features including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context label.
PCT/JP2018/006165 2017-02-28 2018-02-21 Speech synthesis system, speech synthesis program, and speech synthesis method WO2018159402A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-037151 2017-02-28
JP2017037151A JP6802958B2 (en) 2017-02-28 2017-02-28 Speech synthesis system, speech synthesis program and speech synthesis method

Publications (1)

Publication Number Publication Date
WO2018159402A1 true WO2018159402A1 (en) 2018-09-07

Family

ID=63371228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/006165 WO2018159402A1 (en) 2017-02-28 2018-02-21 Speech synthesis system, speech synthesis program, and speech synthesis method

Country Status (2)

Country Link
JP (1) JP6802958B2 (en)
WO (1) WO2018159402A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020230926A1 (en) * 2019-05-15 2020-11-19 엘지전자 주식회사 Voice synthesis apparatus for evaluating quality of synthesized voice by using artificial intelligence, and operating method therefor
CN114360587A (en) * 2021-12-27 2022-04-15 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for identifying audio
CN114550733A (en) * 2022-04-22 2022-05-27 成都启英泰伦科技有限公司 Voice synthesis method capable of being used for chip end

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020145472A1 (en) * 2019-01-11 2020-07-16 네이버 주식회사 Neural vocoder for implementing speaker adaptive model and generating synthesized speech signal, and method for training neural vocoder
WO2020158891A1 (en) * 2019-02-01 2020-08-06 ヤマハ株式会社 Sound signal synthesis method and neural network training method
JP7359164B2 (en) 2019-02-06 2023-10-11 ヤマハ株式会社 Sound signal synthesis method and neural network training method
US11232780B1 (en) 2020-08-24 2022-01-25 Google Llc Predicting parametric vocoder parameters from prosodic features
WO2023281555A1 (en) * 2021-07-05 2023-01-12 日本電信電話株式会社 Generation method, generation program, and generation device
CN113838453B (en) * 2021-08-17 2022-06-28 北京百度网讯科技有限公司 Voice processing method, device, equipment and computer storage medium
CN114373445B (en) * 2021-12-23 2022-10-25 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05108097A (en) * 1991-10-19 1993-04-30 Ricoh Co Ltd Speech synthesizing device
JP2011247921A (en) * 2010-05-24 2011-12-08 Nippon Telegr & Teleph Corp <Ntt> Signal synthesizing method, signal synthesizing apparatus, and program
JP2012058293A (en) * 2010-09-06 2012-03-22 National Institute Of Information & Communication Technology Unvoiced filter learning apparatus, voice synthesizer, unvoiced filter learning method, and program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05108097A (en) * 1991-10-19 1993-04-30 Ricoh Co Ltd Speech synthesizing device
JP2011247921A (en) * 2010-05-24 2011-12-08 Nippon Telegr & Teleph Corp <Ntt> Signal synthesizing method, signal synthesizing apparatus, and program
JP2012058293A (en) * 2010-09-06 2012-03-22 National Institute Of Information & Communication Technology Unvoiced filter learning apparatus, voice synthesizer, unvoiced filter learning method, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"in japanese", IN JAPANESE, 1 March 2017 (2017-03-01) *
MAIA, RANNIERY ET AL.: "An Excitation Model for HMM-Based Speech Synthesis Based on Residual Modeling", 6TH ISCA WORKSHOP ON SPEECH SYNTHESIS, 22 August 2007 (2007-08-22), pages 131 - 136, XP055543103 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020230926A1 (en) * 2019-05-15 2020-11-19 엘지전자 주식회사 Voice synthesis apparatus for evaluating quality of synthesized voice by using artificial intelligence, and operating method therefor
US11705105B2 (en) 2019-05-15 2023-07-18 Lg Electronics Inc. Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same
CN114360587A (en) * 2021-12-27 2022-04-15 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for identifying audio
CN114550733A (en) * 2022-04-22 2022-05-27 成都启英泰伦科技有限公司 Voice synthesis method capable of being used for chip end
CN114550733B (en) * 2022-04-22 2022-07-01 成都启英泰伦科技有限公司 Voice synthesis method capable of being used for chip end

Also Published As

Publication number Publication date
JP2018141915A (en) 2018-09-13
JP6802958B2 (en) 2020-12-23

Similar Documents

Publication Publication Date Title
WO2018159402A1 (en) Speech synthesis system, speech synthesis program, and speech synthesis method
Oord et al. Wavenet: A generative model for raw audio
Van Den Oord et al. Wavenet: A generative model for raw audio
Takamichi et al. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis
Tokuda et al. Speech synthesis based on hidden Markov models
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
WO2018159403A1 (en) Learning device, speech synthesis system, and speech synthesis method
JP2008242317A (en) Meter pattern generating device, speech synthesizing device, program, and meter pattern generating method
Hwang et al. LP-WaveNet: Linear prediction-based WaveNet speech synthesis
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
Wang et al. fairseq s^ 2: A scalable and integrable speech synthesis toolkit
Adiga et al. Acoustic features modelling for statistical parametric speech synthesis: a review
WO2015025788A1 (en) Quantitative f0 pattern generation device and method, and model learning device and method for generating f0 pattern
JP6631883B2 (en) Model learning device for cross-lingual speech synthesis, model learning method for cross-lingual speech synthesis, program
US9058820B1 (en) Identifying speech portions of a sound model using various statistics thereof
JP2010139745A (en) Recording medium storing statistical pronunciation variation model, automatic voice recognition system, and computer program
Kathania et al. Explicit pitch mapping for improved children’s speech recognition
Shahnawazuddin et al. Studying the role of pitch-adaptive spectral estimation and speaking-rate normalization in automatic speech recognition
Nose Efficient implementation of global variance compensation for parametric speech synthesis
JP7423056B2 (en) Reasoners and how to learn them
Giacobello et al. Stable 1-norm error minimization based linear predictors for speech modeling
Van Nguyen et al. Development of Vietnamese speech synthesis system using deep neural networks
Al-Radhi et al. Continuous vocoder applied in deep neural network based voice conversion
Li et al. Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
Sunil et al. Children's Speech Recognition Under Mismatched Condition: A Review

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18761759

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18761759

Country of ref document: EP

Kind code of ref document: A1