WO2018159402A1 - Speech synthesis system, speech synthesis program, and speech synthesis method - Google Patents

Speech synthesis system, speech synthesis program, and speech synthesis method

Info

Publication number
WO2018159402A1
Authority
WO
WIPO (PCT)
Prior art keywords
unit
speech synthesis
periodic component
speech
periodic
Prior art date
Application number
PCT/JP2018/006165
Other languages
French (fr)
Japanese (ja)
Inventor
橘 健太郎
芳則 志賀
Original Assignee
国立研究開発法人情報通信研究機構 (National Institute of Information and Communications Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立研究開発法人情報通信研究機構 (National Institute of Information and Communications Technology, NICT)
Publication of WO2018159402A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/06: Elementary speech units used in speech synthesisers; Concatenation rules
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation

Definitions

  • The present invention relates to a speech synthesis technique based on statistical parametric speech synthesis (hereinafter also abbreviated as "SPSS").
  • SPSS: statistical parametric speech synthesis
  • HMM: hidden Markov model
  • In recent years, speech synthesis based on deep neural networks (hereinafter also abbreviated as "DNN"), a type of deep learning, has attracted attention (see, for example, Non-Patent Document 1). According to the results reported in Non-Patent Document 1, DNN-based speech synthesis can generate higher-quality speech than HMM-based speech synthesis.
  • DNN: deep neural network
  • In many SPSS systems, a vocoder is used as a source-filter model when generating speech. More specifically, the source-filter model consists of a vocal tract filter and an excitation source.
  • The vocal tract filter models the vocal tract and is expressed by spectral envelope parameters.
  • The source signal, which models the excitation source (vocal fold vibration), is expressed by mixing a pulse sequence and a noise component.
  • In a commonly used vocoder, each frame of the excitation source is classified as a voiced section or an unvoiced section. If a frame is determined to be voiced, a pulse sequence at the fundamental frequency (hereinafter also abbreviated as "F0") corresponding to the voice pitch is generated; if it is determined to be unvoiced, the excitation is generated as white noise.
  • F0: fundamental frequency
  • In a typical SPSS, this F0 sequence is expressed as a discontinuous sequence that switches between a one-dimensional continuous value and a zero-dimensional discrete symbol, and for each frame a flag for switching between voiced and unvoiced (hereinafter "V/UV"; the flag is also abbreviated as the "V/UV flag") is required.
  • Quality degradation of the synthesized speech can arise from V/UV determination errors in each frame and from the difficulty of modeling an excitation source that outputs such a discontinuous sequence.
  • MSD (multi-space distribution) modeling has been proposed as a method for modeling such a sequence (see, for example, Non-Patent Document 2).
  • However, MSD modeling inherently involves the difficulty of jointly representing continuous and discrete sequences.
  • For frames with V/UV prediction errors, vocoding often degrades the quality of the synthesized speech: a frame erroneously judged voiced produces a buzzy quality, and a frame erroneously judged unvoiced produces a hoarse quality.
  • The first approach is to interpolate the discontinuous F0 sequence so that it can be treated as a continuous sequence (see, for example, Non-Patent Document 3). It has been shown that F0 can then be modeled as a continuous sequence and that quality can be improved. However, this approach still requires V/UV determination at waveform-generation time, and a discrete sequence must still be modeled.
  • As another approach, V/UV can be determined from some continuous sequence. For example, a method has been proposed that determines V/UV based on an aperiodicity index instead of the V/UV flag (see, for example, Non-Patent Document 4). While this method achieves completely continuous modeling, V/UV determination is still required at waveform-generation time, so the influence of V/UV determination errors cannot be completely avoided.
  • As yet another approach, a method has been proposed that uses the Maximum Voiced Frequency (hereinafter also abbreviated as "MVF"), which indicates the upper frequency limit of the periodicity of a speech signal, instead of the V/UV flag (see, for example, Non-Patent Document 5). Using the MVF, speech can be divided with the high-frequency band as the aperiodic component and the low-frequency band as the periodic component.
  • MVF: Maximum Voiced Frequency
  • Since the MVF is modeled continuously, it can also be made to function as a V/UV flag by setting a threshold value.
  • However, because the signal is divided into only two bands, high and low frequency, the modeling accuracy of the periodic/aperiodic components is not sufficient.
  • The present technology is intended to solve such problems, and an object of the present technology is to provide a new method that can reduce the influence on quality caused by V/UV determination errors in the acoustic model in SPSS.
  • According to one aspect, a speech synthesis system includes: a first extraction unit that extracts, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; a second extraction unit that extracts, for each unit section, a periodic component and an aperiodic component from the speech waveform; a third extraction unit that extracts the spectral envelopes of the extracted periodic and aperiodic components; a generation unit that generates a context label based on context information of the known text; and a learning unit that constructs a statistical model by learning an acoustic feature amount, including the fundamental frequency and the spectral envelopes of the periodic and aperiodic components, in association with the corresponding context label.
  • The speech synthesis system further includes a determination unit that, in response to input of an arbitrary text, determines a context label based on context information of the text, and an estimation unit that estimates, from the statistical model, the acoustic feature amount corresponding to the context label determined by the determination unit.
  • The estimated acoustic feature amount includes a fundamental frequency, a spectral envelope of the periodic component, and a spectral envelope of the aperiodic component.
  • The speech synthesis system further includes a first reconstruction unit that reconstructs the periodic component by filtering a pulse sequence, generated according to the fundamental frequency included in the estimated acoustic feature amount, with the spectral envelope of the periodic component, a second reconstruction unit that reconstructs the aperiodic component by filtering a noise sequence with the spectral envelope of the aperiodic component, and an adder unit that adds the reconstructed periodic and aperiodic components and outputs the result as a speech waveform corresponding to the input arbitrary text.
  • The second extraction unit extracts only the aperiodic component from unit sections in which the first extraction unit cannot extract a fundamental frequency, and extracts both the periodic component and the aperiodic component from the other unit sections.
  • For unit sections in which a fundamental frequency cannot be extracted, the first extraction unit determines a fundamental frequency by interpolation.
  • The pulse sequence is generated from the interpolated fundamental frequency sequence, and the noise sequence is a sequence in which noise is generated over the entire interval.
  • According to another aspect, a speech synthesis program for realizing a speech synthesis method based on SPSS is provided. The speech synthesis program causes a computer to execute the steps of: extracting, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform; extracting the spectral envelopes of the extracted periodic and aperiodic components; generating a context label based on context information of the known text; and constructing a statistical model by learning an acoustic feature amount, including the fundamental frequency and the spectral envelopes of the periodic and aperiodic components, in association with the corresponding context label.
  • According to a further aspect, a speech synthesis method includes the steps of: extracting, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform; extracting the spectral envelopes of the extracted periodic and aperiodic components; generating a context label based on context information of the known text; and constructing a statistical model by learning an acoustic feature amount, including the fundamental frequency and the spectral envelopes of the periodic and aperiodic components, in association with the corresponding context label.
  • FIG. 1 is a schematic diagram showing an outline of a multilingual translation system 1 using a speech synthesis system according to the present embodiment.
  • multilingual translation system 1 includes a service providing device 10.
  • The service providing apparatus 10 performs speech recognition, multilingual translation, and the like on the input speech (words uttered in a first language) received from the mobile terminal 30 connected via the network 2, synthesizes the corresponding words in a second language, and outputs the result to the mobile terminal 30 as output speech.
  • For example, when the user 4 utters the English phrase "Where is the station?" to the mobile terminal 30, the mobile terminal 30 generates input speech with a microphone or the like from the uttered words and transmits the input speech to the service providing apparatus 10. The service providing apparatus 10 synthesizes output speech for the Japanese phrase corresponding to "Where is the station?". When the mobile terminal 30 receives the output speech from the service providing apparatus 10, it plays back the received output speech. As a result, the conversation partner of user 4 can hear the phrase "Where is the station?" in Japanese.
  • The conversation partner of the user 4 may also have a similar mobile terminal 30.
  • When the conversation partner utters an answer meaning "go straight and turn left" in Japanese toward that terminal, the same processing as described above is executed, and the corresponding English phrase "Go straight and turn left" is output from the conversation partner's mobile terminal to the user 4.
  • In this way, translation can be performed freely between speech in the first language and speech in the second language.
  • Automatic mutual translation may also be enabled among an arbitrary number of languages, not only between two.
  • the speech synthesis system according to the present embodiment included in the service providing apparatus 10 employs one SPSS technique, as will be described later.
  • the service providing apparatus 10 includes an analysis unit 12, a learning unit 14, a DNN 16, and a speech synthesis unit 18 as components related to the speech synthesis system.
  • the service providing apparatus 10 includes a speech recognition unit 20 and a translation unit 22 as components relating to automatic translation.
  • Service providing apparatus 10 further includes a communication processing unit 24 for performing communication processing with portable terminal 30.
  • the analysis unit 12 and the learning unit 14 are in charge of machine learning for constructing the DNN 16. Details of functions and processing of the analysis unit 12 and the learning unit 14 will be described later.
  • the DNN 16 stores a neural network as a result of machine learning by the analysis unit 12 and the learning unit 14.
  • In the present embodiment, a DNN is used as an example, but instead of a DNN, a recurrent neural network (hereinafter abbreviated as "RNN"), a long short-term memory (LSTM) RNN, or a convolutional neural network (CNN) may be used.
  • RNN: recurrent neural network
  • LSTM: long short-term memory
  • CNN: convolutional neural network
  • the voice recognition unit 20 outputs voice recognition text by executing voice recognition processing on the input voice from the mobile terminal 30 received via the communication processing unit 24.
  • The translation unit 22 generates text in a specified language (also referred to as "translated text" for convenience of explanation) from the recognized text output by the speech recognition unit 20.
  • Any known method can be employed for these speech recognition and translation processes.
  • the speech synthesizer 18 performs speech synthesis on the translated text from the translator 22 with reference to the DNN 16, and transmits the output speech obtained as a result to the mobile terminal 30 via the communication processor 24.
  • In FIG. 1, the components in charge of machine learning for constructing the DNN 16 (mainly the analysis unit 12 and the learning unit 14) and the components in charge of multilingual translation using the constructed DNN 16 (mainly the speech recognition unit 20, the translation unit 22, and the speech synthesis unit 18) are mounted on the same service providing apparatus 10, but these functions may be mounted on different apparatuses.
  • For example, the DNN 16 may be constructed by machine learning on a first apparatus, while a second apparatus provides speech synthesis using the constructed DNN 16 and services using that speech synthesis.
  • an application executed on the mobile terminal 30 may be in charge of at least some functions of the speech recognition unit 20 and the translation unit 22.
  • an application executed on the mobile terminal 30 may be responsible for the functions of the components (DNN 16 and speech synthesizer 18) responsible for speech synthesis.
  • the multilingual translation system 1 and a speech synthesis system that is a part of the multilingual translation system 1 can be realized by cooperation of the service providing apparatus 10 and the mobile terminal 30 in an arbitrary form.
  • The functions shared among the respective devices may be determined appropriately according to the situation, and are not limited to the arrangement of the multilingual translation system 1 shown in FIG. 1.
  • FIG. 2 is a schematic diagram showing a hardware configuration example of the service providing apparatus 10 according to the present embodiment.
  • the service providing apparatus 10 is typically realized using a general-purpose computer.
  • The service providing apparatus 10 includes, as main hardware components, a processor 100, a main memory 102, a display 104, an input device 106, a network interface (I/F) 108, an optical drive 134, and a secondary storage device 112. These components are connected to one another via an internal bus 110.
  • The processor 100 is an arithmetic entity that executes the processing required to realize the service providing apparatus 10 according to the present embodiment by executing various programs, as described later.
  • The processor 100 is composed of one or more CPUs (central processing units) and/or GPUs (graphics processing units).
  • a CPU or GPU having a plurality of cores may be used.
  • The main memory 102 is a storage area that temporarily stores program code, working data, and the like when the processor 100 executes programs, and is composed of a volatile memory device such as a DRAM (dynamic random access memory) or an SRAM (static random access memory).
  • the display 104 is a display unit that outputs a user interface related to processing, processing results, and the like, and includes, for example, an LCD (liquid crystal display) or an organic EL (electroluminescence) display.
  • The input device 106 is a device that accepts instructions and operations from the user, and includes, for example, a keyboard, a mouse, a touch panel, or a pen. The input device 106 may further include a microphone for collecting the speech necessary for machine learning, or an interface for connecting to a sound collection device that collects such speech.
  • the network interface 108 exchanges data with the mobile terminal 30 or any information processing apparatus on the Internet or an intranet.
  • For the network interface 108, an arbitrary communication method such as Ethernet (registered trademark), wireless LAN (local area network), or Bluetooth (registered trademark) can be adopted.
  • the optical drive 134 reads information stored in an optical disk 136 such as a CD-ROM (compact disc read only memory) or DVD (digital versatile disc) and outputs the information to other components via the internal bus 110.
  • The optical disk 136 is an example of a non-transitory recording medium, and is distributed with an arbitrary program stored in it in a nonvolatile manner.
  • When the optical drive 134 reads the program from the optical disk 136 and installs it in the secondary storage device 112 or the like, the general-purpose computer functions as the service providing apparatus 10 (or a speech synthesis apparatus). Therefore, the subject matter of the present invention can also be the program itself installed in the secondary storage device 112 or the like, or a recording medium such as the optical disk 136 that stores a program for realizing the functions and processes according to the present embodiment.
  • Although FIG. 2 shows an optical recording medium such as the optical disk 136 as an example of a non-transitory recording medium, the medium is not limited to this; a semiconductor recording medium such as a flash memory, a magnetic recording medium such as a hard disk or storage tape, or a magneto-optical recording medium such as an MO (magneto-optical disk) may also be used.
  • The secondary storage device 112 is a component that stores programs executed by the processor 100, input data to be processed by those programs (including input speech and text for learning, and input speech from the mobile terminal 30), and output data generated by executing the programs (including output speech transmitted to the mobile terminal 30), and is composed of, for example, a nonvolatile storage device such as a hard disk or an SSD (solid state drive).
  • The secondary storage device 112 typically stores an OS (operating system) (not shown), an analysis program 121 for realizing the analysis unit 12, a learning program 141 for realizing the learning unit 14, a speech recognition program 201 for realizing the speech recognition unit 20, a translation program 221 for realizing the translation unit 22, and a speech synthesis program 181 for realizing the speech synthesis unit 18.
  • Some of the libraries and functional modules required when these programs are executed by the processor 100 may be replaced by libraries or functional modules provided as standard by the OS.
  • In that case, each program alone does not include all the program modules necessary for realizing the corresponding function, but the necessary function can be realized by installing the program in the OS execution environment. Even such a program, which does not include some of the libraries or functional modules, can be included in the technical scope of the present invention.
  • These programs may be distributed not only by being stored in any of the recording media described above, but also by being downloaded from a server apparatus or the like via the Internet or an intranet.
  • the secondary storage device 112 may store the input speech 130 for machine learning and the corresponding text 132 for constructing the DNN 16 in addition to the DNN 16.
  • FIG. 2 shows an example in which the service providing apparatus 10 is configured as a single computer, but the present invention is not limited to this; the multilingual translation system may be realized by a plurality of computers connected via a network cooperating explicitly or implicitly.
  • All or part of the functions realized by the computer (processor 100) executing the programs may instead be realized using hard-wired circuits such as an ASIC (application specific integrated circuit) or an FPGA (field-programmable gate array).
  • a speech synthesis system according to SPSS is provided.
  • In the present embodiment, a method is adopted in which V/UV determination is made unnecessary by decomposing the source signal representing the excitation source into a periodic component and an aperiodic component. Learning is performed by giving the DNN speech parameters that represent the periodic and aperiodic components of the source signal.
  • FIG. 3 is a schematic diagram for explaining the outline of the speech synthesis process according to the related art.
  • the speech synthesis process according to the related art includes a pulse generation unit 250, a white noise generation unit 252, a switching unit 254, and a speech synthesis filter 256.
  • the pulse generation unit 250, the white noise generation unit 252, and the switching unit 254 correspond to a part modeling the excitation source, and a source signal from the excitation source is output from the pulse generation unit 250.
  • the pulse generator 250 is given a parameter of F 0 indicating the pitch of the voice, and outputs a pulse sequence at intervals of the reciprocal of F 0 (basic period / pitch period). Although not shown, the pulse generator 250 may be provided with an amplitude parameter indicating the loudness of the voice.
  • the speech synthesis filter 256 is a part that determines the timbre of the speech, and is given a parameter indicating a spectrum envelope.
  • In the related art, the input speech waveform is divided into unit sections (for example, frames), and each unit section is determined to be either a voiced section or an unvoiced section. For a voiced section a pulse sequence is output as the source signal, and for an unvoiced section a noise sequence is output as the source signal. The parameter identifying voiced and unvoiced sections is the V/UV flag; a sketch of this switched excitation follows.
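  • For comparison, a minimal numpy sketch of this conventional V/UV-switched excitation (the frame length, sampling rate, and array layout are illustrative assumptions, not values from the patent):

```python
import numpy as np

def conventional_excitation(f0, vuv, frame_len=80, fs=16000):
    """Conventional source signal: pulse train for voiced frames, white noise otherwise.

    f0  : per-frame fundamental frequency in Hz (ignored for unvoiced frames)
    vuv : per-frame voiced/unvoiced flag (True = voiced)
    """
    source = np.zeros(len(f0) * frame_len)
    phase = 0.0
    for i, (f, voiced) in enumerate(zip(f0, vuv)):
        start = i * frame_len
        if voiced and f > 0:
            for n in range(frame_len):
                phase += f / fs                 # advance normalized phase
                if phase >= 1.0:                # emit one pulse per fundamental period
                    phase -= 1.0
                    source[start + n] = 1.0
        else:
            phase = 0.0                         # excitation switches to white noise
            source[start:start + frame_len] = np.random.randn(frame_len)
    return source
```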
  • FIG. 4 is a schematic diagram for explaining the outline of the speech synthesis process according to the present embodiment.
  • The configuration shown in FIG. 4 includes a pulse generation unit 200, a speech synthesis filter (periodic component) 202, a Gaussian noise generation unit 204, a speech synthesis filter (aperiodic component) 206, and an adder unit 208.
  • In the present embodiment, instead of switching the source signal using the V/UV flag as shown in FIG. 3, a source signal is prepared for each of the periodic component and the aperiodic component. That is, the speech signal is decomposed into a periodic component and an aperiodic component.
  • The pulse generation unit 200 and the speech synthesis filter (periodic component) 202 are the parts that generate the periodic component. The pulse generation unit 200 generates a continuous pulse sequence according to the designated F0 (a continuous F0 sequence, as described later), and the speech synthesis filter (periodic component) 202 applies a filter corresponding to the spectral envelope of the periodic component to the continuous pulse sequence, thereby outputting the periodic component contained in the synthesized speech.
  • A continuous pulse sequence can be used because the silent and unvoiced sections of the periodic component are assumed to have inaudible power; that is, the entire sequence is treated as voiced. In other words, it is assumed that the spectral envelope of the periodic component has a sufficiently small amplitude in sections without periodicity, such as silence and unvoiced sounds. Under this assumption, even if a periodic component is generated from the F0 pulse sequence in such sections, it is considered small enough to be inaudible.
  • Because the pulse sequence is generated continuously, the influence of pulse-sequence discontinuities on the synthesized speech can be reduced; a sketch of continuous pulse generation follows.
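  • A minimal numpy sketch of generating such a continuous pulse sequence from a gap-free F0 contour by phase accumulation, so that no V/UV switching occurs and the pulse spacing changes smoothly (frame length and sampling rate are assumed values):

```python
import numpy as np

def continuous_pulse_train(cont_f0, frame_len=80, fs=16000):
    """Pulse train driven by a continuous F0 sequence (one F0 value per frame)."""
    f0_samples = np.repeat(cont_f0, frame_len)   # sample-level F0 contour
    phase = np.cumsum(f0_samples / fs)           # accumulated normalized phase
    # a pulse is emitted every time the accumulated phase crosses an integer
    pulses = np.zeros_like(f0_samples)
    pulses[np.flatnonzero(np.diff(np.floor(phase)) > 0) + 1] = 1.0
    return pulses
```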
  • the Gaussian noise generation unit 204 and the speech synthesis filter (non-periodic component) 206 are parts that generate aperiodic components.
  • the Gaussian noise generation unit 204 generates Gaussian noise as an example of a continuous noise sequence.
  • the speech synthesis filter (non-periodic component) 206 multiplies the noise sequence by a filter corresponding to the spectrum envelope corresponding to the non-periodic component, thereby outputting the aperiodic component included in the synthesized speech.
  • The periodic component output from the speech synthesis filter (periodic component) 202 and the aperiodic component output from the speech synthesis filter (aperiodic component) 206 are added by the adder unit 208, and the synthesized speech is output.
  • A noise sequence covering the entire interval can be used because the aperiodic component is assumed to consist of unvoiced sounds and silence, and the entire interval is therefore treated as unvoiced.
  • In this way, by using an acoustic model that does not need to distinguish voiced from unvoiced sections and performing learning based on that model, a speech synthesis method that requires no V/UV determination can be realized (a simplified sketch of the two-branch generation follows).
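  • A simplified numpy sketch of the two-branch generation in FIG. 4: the pulse sequence is shaped by the periodic-component envelope, Gaussian noise is shaped by the aperiodic-component envelope, and the two are added. The frame-wise frequency-domain filtering with overlap-add used here, and all parameter values, are assumptions for illustration rather than the specific filter implementation of the patent:

```python
import numpy as np

def apply_envelope(excitation, envelopes, frame_len=80, fft_len=512):
    """Shape an excitation signal frame by frame with per-frame spectral envelopes.

    envelopes: (num_frames, fft_len // 2 + 1) linear-amplitude spectral envelopes
    """
    out = np.zeros(len(excitation) + fft_len)
    win = np.hanning(fft_len)
    for i, env in enumerate(envelopes):
        start = i * frame_len
        frame = excitation[start:start + fft_len]
        if len(frame) < fft_len:
            frame = np.pad(frame, (0, fft_len - len(frame)))
        spec = np.fft.rfft(frame * win) * env       # impose the envelope on this frame
        out[start:start + fft_len] += np.fft.irfft(spec)
    return out[:len(excitation)]

def synthesize(pulses, periodic_env, aperiodic_env):
    periodic = apply_envelope(pulses, periodic_env)          # periodic branch (202)
    noise = np.random.randn(len(pulses))
    aperiodic = apply_envelope(noise, aperiodic_env)         # aperiodic branch (206)
    return periodic + aperiodic                              # adder unit (208)
```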
  • FIG. 5 is a block diagram for explaining processing of a main part in the speech synthesis system according to the present embodiment.
  • The speech synthesis system includes the analysis unit 12 and the learning unit 14 for constructing the DNN 16, and the speech synthesis unit 18 that outputs a speech waveform using the DNN 16.
  • the analysis unit 12 is a part in charge of speech analysis, and generates an acoustic feature quantity sequence from a speech waveform indicated by the input speech for learning.
  • the acoustic feature quantity for each frame includes F 0 and spectrum envelope (periodic component and non-periodic component).
  • the analysis unit 12 includes an F 0 extraction unit 120, a periodic / non-periodic component extraction unit 122, and a feature amount extraction unit 124.
  • the feature quantity extraction unit 124 includes an F 0 interpolation unit 126 and a spectrum envelope extraction unit 128.
  • The F0 extraction unit 120 extracts, for each frame (unit section), the F0 of the speech waveform corresponding to a known text. That is, the F0 extraction unit 120 extracts F0 for each frame from the input speech waveform. The extracted F0 is provided to the periodic/aperiodic component extraction unit 122 and the feature amount extraction unit 124.
  • The periodic/aperiodic component extraction unit 122 extracts a periodic component and an aperiodic component for each frame (unit section) from the input speech waveform. More specifically, the periodic/aperiodic component extraction unit 122 extracts the periodic and aperiodic components based on the F0 extracted from the input speech waveform.
  • The source signal s(t) is decomposed as shown in equation (1): s(t) = s_pdc(t) + s_apd(t) for frames t in which F0 exists, and s(t) = s_apd(t) otherwise. Here, f0(t) denotes F0 in frame t of the speech waveform, the periodic signal s_pdc(t) denotes the periodic component in frame t, and the aperiodic signal s_apd(t) denotes the aperiodic component in frame t.
  • In other words, when F0 exists for frame t of the input speech waveform, the source signal is treated as containing both a periodic component and an aperiodic component; when F0 does not exist, the source signal is treated as containing only the aperiodic component. That is, the periodic/aperiodic component extraction unit 122 extracts only the aperiodic component from frames (unit sections) in which the F0 extraction unit 120 cannot extract F0, and extracts both the periodic and aperiodic components from the other frames.
  • In the present embodiment, a sinusoidal model as in equation (2) is adopted as an example of representing the harmonic components of the source signal.
  • In equation (2), J represents the number of harmonics; the frequency and amplitude of each harmonic are approximated linearly. Solving this sinusoidal model requires determining the values of its model parameters. More specifically, the values that minimize the error criterion defined by equation (3), which applies a window function of length 2N_w + 1, are taken as the solution.
  • The values minimizing the criterion of equation (3) are determined by the solution shown in Non-Patent Document 8.
  • In this way, the periodic/aperiodic component extraction unit 122 extracts the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) contained in the input speech waveform according to the mathematical solution described above (a generic least-squares sketch of this harmonic decomposition follows).
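  • The following numpy sketch illustrates this decomposition generically: for each frame with a valid F0, sinusoids at multiples of F0 are fitted to the frame by least squares, the fitted sum is taken as the periodic component, and the residual as the aperiodic component. It follows the general harmonic-plus-residual idea rather than the exact formulation of equations (2) and (3), and the frame handling and number of harmonics are assumptions:

```python
import numpy as np

def decompose_frame(x, f0, fs, num_harmonics=20):
    """Split one analysis frame x into a periodic part (least-squares fit of
    harmonics of f0) and an aperiodic residual. A frame without a valid f0
    is treated as purely aperiodic."""
    if f0 <= 0:
        return np.zeros_like(x), x.copy()
    n = np.arange(len(x))
    cols = []
    for k in range(1, num_harmonics + 1):
        if k * f0 >= fs / 2:                      # keep harmonics below Nyquist
            break
        w = 2.0 * np.pi * k * f0 / fs
        cols += [np.cos(w * n), np.sin(w * n)]
    if not cols:
        return np.zeros_like(x), x.copy()
    A = np.stack(cols, axis=1)                    # design matrix of harmonic sinusoids
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    periodic = A @ coef                           # harmonic (periodic) component
    aperiodic = x - periodic                      # residual (aperiodic) component
    return periodic, aperiodic
```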
  • the feature quantity extraction unit 124 outputs a continuous F 0 , a periodic component spectrum envelope, and a non-periodic component spectrum envelope as acoustic feature quantities.
  • As the spectral envelope representation, for example, any of LSP (line spectral pair) coefficients, LPC (linear prediction coefficients), or mel-cepstrum coefficients may be adopted.
  • In the present embodiment, the logarithm of the continuous F0 is used (hereinafter abbreviated as "continuous logF0").
  • The F0 interpolation unit 126 interpolates the F0 values that the F0 extraction unit 120 extracted for each frame from the speech waveform, and generates a continuous F0 (F0 sequence). More specifically, the F0 of a target frame can be determined, for example, from the F0 extracted in one or more neighboring frames according to a predetermined interpolation function; any known interpolation method can be adopted in the F0 interpolation unit 126 (a sketch follows).
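  • A minimal numpy sketch of producing a continuous logF0 sequence by interpolating over frames where no F0 was extracted (unvoiced frames are marked by zeros here, and linear interpolation is used as an assumed example of the interpolation function):

```python
import numpy as np

def continuous_log_f0(f0):
    """Interpolate unvoiced frames (f0 == 0) to obtain a gap-free contour,
    then take the logarithm ("continuous logF0")."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no voiced frame to interpolate from")
    idx = np.arange(len(f0))
    cont = np.interp(idx, idx[voiced], f0[voiced])   # linear interpolation over gaps
    return np.log(cont)
```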
  • The spectral envelope extraction unit 128 extracts the spectral envelopes of the extracted periodic and aperiodic components. More specifically, based on the F0 extracted by the F0 extraction unit 120, the spectral envelope extraction unit 128 extracts spectral envelopes from the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) output from the periodic/aperiodic component extraction unit 122. That is, for each frame it extracts a spectral envelope (pdc) indicating the distribution of the frequency components contained in the periodic signal s_pdc(t), and a spectral envelope (apd) indicating the distribution of the frequency components contained in the aperiodic signal s_apd(t); a simplified extraction sketch follows.
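  • A simplified, numpy-only sketch of extracting a spectral envelope from one frame of a component signal by cepstral smoothing (low-quefrency liftering of the log spectrum). The patent lists LSP, LPC, and mel-cepstrum coefficients as possible representations; this plain cepstral version is only an assumed stand-in:

```python
import numpy as np

def cepstral_envelope(frame, fft_len=512, num_coeffs=40):
    """Spectral envelope of one frame via low-quefrency cepstral liftering."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), fft_len))
    log_spec = np.log(np.maximum(spec, 1e-10))
    cep = np.fft.irfft(log_spec, fft_len)            # real cepstrum
    cep[num_coeffs:fft_len - num_coeffs + 1] = 0.0   # keep only low quefrencies
    return np.exp(np.fft.rfft(cep, fft_len).real)    # smoothed (envelope) spectrum
```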
  • FIG. 6 is a diagram showing an example of a speech waveform of a periodic component and an aperiodic component output in the speech synthesis system according to the present embodiment.
  • FIG. 6 shows, as an example, an audio signal when the speaker utters “all”.
  • the DNN 16 learns the acoustic feature amount in units of frames.
  • FIG. 6(A) shows the input speech waveform (source signal), FIG. 6(B) shows the speech waveform of the periodic component extracted from the source signal, and FIG. 6(C) shows the speech waveform of the aperiodic component extracted from the source signal. The periodic component is extracted only in the sections where F0 is extracted, as shown in FIG. 6(B), while the aperiodic component is extracted both in sections where F0 is extracted and in sections where it is not, as shown in FIG. 6(C). In the section labeled "non-F0" in FIG. 6(B), the amplitude is almost zero; this section corresponds to a section in which F0 was not extracted.
  • the configuration shown in FIG. 5 includes a text analysis unit 162 and a context label generation unit 164 as components that generate a context label sequence.
  • the text analysis unit 162 and the context label generation unit 164 generate a context label based on context information of known text.
  • Since the context label is used by both the learning unit 14 and the speech synthesis unit 18, a configuration in which the context label is generated by components shared by the two is shown; however, a component for generating context labels may instead be mounted in each of the learning unit 14 and the speech synthesis unit 18.
  • the text analysis unit 162 analyzes the input text for learning or synthesis, and outputs the context information to the context label generation unit 164.
  • The context label generation unit 164 determines a context label based on the context information from the text analysis unit 162 and outputs it to the model learning unit 140.
  • In the speech synthesis system according to the present embodiment, learning is performed using frame-wise acoustic feature amounts, so the context label generation unit 164 also generates a context label for each frame. Since context labels are generally generated in units of phonemes, the context label generation unit 164 generates frame-wise context labels by adding the position information of each frame within the phoneme, as sketched below.
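  • A toy sketch of expanding phoneme-level context labels to frame level by appending within-phoneme position information, as described above (the label representation, a feature list per phoneme plus a two-element position feature, is an assumption for illustration):

```python
def frame_level_labels(phoneme_labels, phoneme_durations):
    """Expand phoneme-level context label vectors to frame level.

    phoneme_labels    : list of feature lists, one per phoneme
    phoneme_durations : number of frames of each phoneme
    """
    frames = []
    for label, dur in zip(phoneme_labels, phoneme_durations):
        for i in range(dur):
            position = [i / dur, (dur - 1 - i) / dur]   # forward / backward position in phoneme
            frames.append(list(label) + position)
    return frames
```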
  • The model learning unit 140 receives the acoustic feature amount sequence 142 from the analysis unit 12 and the context label sequence 166 from the context label generation unit 164 as inputs, and learns an acoustic model using the DNN. In this manner, the model learning unit 140 constructs an acoustic model, which is a statistical model, by learning the acoustic feature amount, including F0 and the spectral envelopes of the periodic and aperiodic components, in association with the corresponding context label.
  • In the present embodiment, the probability distribution is modeled using a DNN that takes a context label as input for each frame and outputs the acoustic feature vector for that frame (whose elements include at least the continuous logF0, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component).
  • the model learning unit 140 learns the DNN so as to minimize the mean square error with respect to the normalized acoustic feature quantity vector.
  • Such DNN learning is equivalent to modeling the probability distribution with a normal distribution that has a mean vector changing from frame to frame and a context-independent covariance matrix, as shown in equation (4).
  • the generated probability distribution sequence has a time-varying mean vector and a time-invariant covariance matrix.
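  • A minimal numpy sketch of this interpretation, assuming the frame-wise training targets and DNN outputs are available as plain stacked arrays: the DNN output serves as the time-varying mean, and a single diagonal covariance estimated from the training residuals is shared across all frames.

```python
import numpy as np

def frame_wise_gaussian(pred_means, targets):
    """Interpret an MSE-trained DNN as a Gaussian acoustic model:
    time-varying mean = DNN output; time-invariant covariance = global
    (diagonal) covariance of the prediction residuals on the training data."""
    residual = targets - pred_means                    # (num_frames, feat_dim)
    shared_var = residual.var(axis=0)                  # one variance per dimension
    return pred_means, np.tile(shared_var, (len(pred_means), 1))
```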
  • the speech synthesizer 18 generates a context label for each frame generated from the text to be synthesized, and inputs the generated context label for each frame to the DNN 16 to estimate the probability distribution series. Then, based on the estimated probability distribution series, a speech waveform is synthesized through a process reverse to that during learning.
  • the speech synthesizer 18 includes an acoustic feature quantity estimator 180, a pulse generator 184, a periodic component generator 186, an aperiodic component generator 188, and an adder 187.
  • At synthesis time, the text analysis unit 162 analyzes the input text and outputs context information, and the context label generation unit 164 generates a context label based on that context information. That is, in response to input of an arbitrary text, the text analysis unit 162 and the context label generation unit 164 determine a context label based on the context information of the text.
  • The acoustic feature amount estimation unit 180 estimates, from the acoustic model (the statistical model built in the DNN 16), the acoustic feature amount corresponding to the determined context label. More specifically, the acoustic feature amount estimation unit 180 inputs the generated frame-wise context labels to the DNN 16 holding the learned acoustic model, and estimates the corresponding acoustic feature amounts from the DNN 16. In response to the input context label sequence, the DNN 16 outputs the acoustic feature amount sequence 182, a probability distribution sequence in which only the mean vector changes from frame to frame.
  • the interpolated continuous F 0 (F 0 sequence), the spectral envelope of the periodic component, and the spectral envelope of the non-periodic component included in the acoustic feature amount sequence 182 are estimated from the context label sequence using the DNN 16.
  • Since the interpolated continuous F0 (F0 sequence) can be expressed as a continuous distribution, the corresponding excitation is composed of a continuous pulse sequence.
  • the spectral envelope of the periodic component and the spectral envelope of the non-periodic component are modeled for each.
  • the pulse generation unit 184 and the periodic component generation unit 186 reconfigure the periodic component by filtering the pulse sequence generated according to F 0 included in the estimated acoustic feature amount according to the spectral envelope of the periodic component. . More specifically, the pulse generation unit 184 generates a pulse sequence according to F 0 (F 0 sequence) from the acoustic feature quantity estimation unit 180. The periodic component generation unit 186 generates a periodic component by filtering the pulse sequence from the pulse generation unit 184 with the spectral envelope of the periodic component.
  • the aperiodic component generation unit 188 reconstructs the aperiodic component by filtering a noise sequence such as a Gaussian noise sequence according to the spectrum envelope of the aperiodic component. More specifically, the non-periodic component generation unit 188 generates a non-periodic component by filtering Gaussian noise from an arbitrary excitation source with the spectral envelope of the non-periodic component.
  • the adder 187 reconstructs the speech waveform by adding the periodic component from the periodic component generator 186 and the aperiodic component from the aperiodic component generator 188. That is, the adding unit 187 adds the reconstructed periodic component and non-periodic component, and outputs the result as a speech waveform corresponding to the input arbitrary text.
  • In the speech synthesis unit 18, a probability distribution sequence is estimated for the frame-wise context labels, and static and dynamic feature amounts are estimated; from these, an acoustic feature amount sequence with appropriate transitions is generated, and synthesized speech is generated from the estimated acoustic feature amounts (a generic parameter-generation sketch follows).
  • In this way, a speech waveform can be generated from continuous sequences without performing any V/UV determination.
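  • The step that turns frame-wise means of static and dynamic (delta) features into a smoothly transitioning static trajectory is commonly done by maximum-likelihood parameter generation. The patent text does not spell this step out, so the following is a generic, simplified sketch under the assumption of a single first-order delta feature and diagonal variances, not the patent's specific procedure:

```python
import numpy as np

def mlpg_1d(mean_static, mean_delta, var_static, var_delta):
    """Generate a smooth static trajectory from per-frame means of the static
    and delta features, given per-frame (diagonal) variances of each stream.

    Solves (W^T S W) c = W^T S mu for the static sequence c, where W stacks the
    identity (static) and a first-order delta operator, and S is the inverse variance.
    """
    T = len(mean_static)
    # Delta operator: delta[t] = 0.5 * (c[t+1] - c[t-1]), using the edge frame at the boundaries.
    D = np.zeros((T, T))
    for t in range(T):
        D[t, max(t - 1, 0)] -= 0.5
        D[t, min(t + 1, T - 1)] += 0.5
    W = np.vstack([np.eye(T), D])                       # (2T, T)
    precision = np.concatenate([1.0 / var_static, 1.0 / var_delta])
    mu = np.concatenate([mean_static, mean_delta])
    A = W.T @ (precision[:, None] * W)                  # W^T S W
    b = W.T @ (precision * mu)                          # W^T S mu
    return np.linalg.solve(A, b)
```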
  • a system using DNN as a learning means will be described as a typical example.
  • the learning means is not limited to DNN, and any supervised learning method can be adopted.
  • an HMM or a recurrent neural network may be employed.
  • FIGS. 7 and 8 are flowcharts showing an example of a processing procedure in the speech synthesis system according to the present embodiment. Each step shown in FIGS. 7 and 8 may be realized by one or more processors (for example, processor 100 shown in FIG. 2) executing one or more programs.
  • FIG. 7 shows a prior machine learning process for constructing the DNN 16
  • FIG. 8 shows a speech synthesis process using the DNN 16.
  • processor 100 divides the input speech waveform into frames (step S102).
  • Next, a context label sequence and an acoustic feature amount sequence are generated by executing, for each frame, the processing for generating a context label from the input text (steps S110 to S112) and the processing for generating the acoustic feature amounts (steps S120 to S128).
  • the processor 100 analyzes the input text to generate context information (step S110), and determines a context label for the corresponding frame based on the generated context information (step S112).
  • The processor 100 extracts F0 in the target frame of the input speech waveform (step S120) and determines a continuous F0 by interpolating the extracted F0 values (step S122). Then, the processor 100 extracts the periodic and aperiodic components in the target frame of the input speech waveform (step S124) and extracts a spectral envelope for each component (step S126). The processor 100 takes the logarithm of the continuous F0 determined in step S122 and the spectral envelopes (periodic and aperiodic components) extracted in step S126 as the acoustic feature amounts (step S128).
  • The processor 100 gives the context label determined in step S112 and the acoustic feature amount determined in step S128 to the DNN 16 (step S130). Then, the processor 100 determines whether there is an unprocessed frame (step S132). If there is an unprocessed frame (YES in step S132), the processing of steps S110 to S112 and steps S120 to S128 is repeated. If there is no unprocessed frame (NO in step S132), the processor 100 determines whether a new text and a speech waveform corresponding to the text have been input (step S134). If a new text and the corresponding speech waveform have been input (YES in step S134), the processing from step S102 onward is repeated; if not (NO in step S134), the learning process ends.
  • In the speech synthesis process, when the text to be synthesized is input (step S200), the processor 100 analyzes the input text to generate context information (step S202) and determines a context label for each frame based on the generated context information (step S204). Then, the processor 100 estimates, from the DNN 16, the acoustic feature amount corresponding to the context label determined in step S204 (step S206).
  • the processor 100 generates a pulse sequence according to F 0 included in the estimated acoustic feature amount (step S208), and filters the generated pulse sequence with a spectrum envelope (periodic component) included in the estimated acoustic feature amount. Thus, a periodic component of the speech waveform is generated (step S210).
  • The processor 100 also generates a Gaussian noise sequence (step S212) and filters the generated Gaussian noise sequence with the spectral envelope (aperiodic component) included in the estimated acoustic feature amount, thereby generating the aperiodic component of the speech waveform (step S214).
  • the processor 100 adds the periodic component generated in step S210 and the non-periodic component generated in step S214, and outputs the result as a synthesized speech waveform (step S216). Then, the speech synthesis process for the input text ends. Note that the processing of steps S206 to S216 is repeated by the number of frames constituting the input text.
  • In the evaluation experiment, 503 ATR phonetically balanced sentences uttered by one female Japanese speaker were used; 493 sentences were used as learning data and the remaining 10 sentences as evaluation sentences.
  • the sampling frequency of audio data was 16 kHz, and the analysis period was 5 ms.
  • The spectrum and the aperiodicity index (AP) obtained by WORLD analysis of the learning speech data were each expressed as 39th-order mel-cepstrum coefficients (40th order including the 0th order).
  • the log F 0 was calculated by integrating the results of a plurality of known extraction methods, and the microprosody was removed by smoothing.
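  • As an illustration of the smoothing step only, a minimal sketch that removes microprosody with a moving average over the continuous logF0 contour (the actual smoothing method and window length are not specified in the text and are assumed here):

```python
import numpy as np

def smooth_log_f0(log_f0, window=5):
    """Remove microprosody by moving-average smoothing of the continuous logF0."""
    kernel = np.ones(window) / window
    return np.convolve(log_f0, kernel, mode="same")
```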
  • The phoneme duration model of the example uses phoneme-level context labels and a five-state, no-skip, left-to-right context-dependent phoneme HSMM (hidden semi-Markov model). In learning the acoustic model with the DNN, a continuous logF0 pattern obtained by further interpolating the unvoiced sections was used. The acoustic feature amounts were obtained by further adding first-order and second-order dynamic feature amounts to these parameters.
  • In the comparative example, V/UV information was used in addition to the above feature amounts.
  • The input vector was generated by adding the duration information obtained from the HSMM duration model to the phoneme-level context label, producing a frame-level context label expressed as a 483-dimensional vector in total.
  • the output vector was a 244-dimensional acoustic feature vector in the comparative example, and a 243-dimensional acoustic feature vector in the example.
  • Table 1 shows a list of the features and models used in the example and the comparative example. Note that the input and output vectors were both normalized to zero mean and unit variance.
  • the number of hidden layers is six, the number of units is 1024, and weights are initialized using random numbers.
  • The mini-batch size was 256, the number of epochs was 30, the learning rate was 2.5 × 10^-4, the hidden-layer activation function was ReLU (rectified linear unit), and the optimizer was Adam. Dropout with a rate of 0.5 was also used; a sketch of this configuration follows.
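  • The following PyTorch sketch mirrors the configuration listed above (six hidden layers of 1024 units, ReLU activations, dropout of 0.5, Adam, MSE loss on normalized vectors), with the input/output sizes taken from the example (483-dimensional frame-level context label, 243-dimensional acoustic feature vector) and the learning rate read as 2.5 × 10^-4. It is an assumed reconstruction of the setup, not code from the patent:

```python
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    def __init__(self, in_dim=483, out_dim=243, hidden=1024, num_layers=6, dropout=0.5):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(dropout)]
            dim = hidden
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                 # x: (batch, 483) normalized context labels
        return self.net(x)                # (batch, 243) normalized acoustic features

model = AcousticDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
criterion = nn.MSELoss()                  # minimize MSE on normalized targets

def train_step(x, y):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```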
  • FIG. 9 is a diagram showing an example of evaluation results of a paired comparison experiment for the speech synthesis system according to the present embodiment.
  • the non-periodicity index (AP) of the comparative example represents non-periodicity between 0.0 and 1.0.
  • the example according to the present embodiment showed better performance even when correct V / UV information was given to the comparative example. According to such a result, it can be evaluated that modeling separated into a periodic component and an aperiodic component contributes to quality improvement.
  • the speech synthesis system employs a technique that does not require determination of V / UV with respect to the source signal when performing SPSS.
  • By expressing the source signal as a combination of a periodic component and an aperiodic component instead of determining V/UV, quality degradation of the synthesized speech due to V/UV determination errors can be suppressed. Furthermore, making the F0 sequence continuous can improve the modeling accuracy of the constructed acoustic model.

Abstract

This speech synthesis system includes: a first extraction unit that extracts, at every unit interval, a fundamental frequency of a speech waveform corresponding to a known text; a second extraction unit that extracts, at every unit interval, a periodic component and a non-periodic component from the speech waveform; a third extraction unit that extracts spectral envelopes of the extracted periodic component and non-periodic component; a generation unit that generates a context label on the basis of context information of the known text; and a learning unit that performs learning by associating an acoustic feature amount including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the non-periodic component with the corresponding context label, thereby constructing a statistical model.

Description

Speech synthesis system, speech synthesis program, and speech synthesis method
 The present invention relates to a speech synthesis technique based on statistical parametric speech synthesis (hereinafter also abbreviated as "SPSS").
 Speech synthesis technology has long been widely applied to text-to-speech applications, multilingual translation services, and the like. SPSS is known as one approach to such speech synthesis. SPSS is a framework for synthesizing speech based on a statistical model. Over the past decade or more, the main research subject in SPSS has been speech synthesis based on hidden Markov models (hereinafter also abbreviated as "HMM").
 In recent years, speech synthesis based on deep neural networks (hereinafter also abbreviated as "DNN"), a type of deep learning, has attracted attention (see, for example, Non-Patent Document 1). According to the results reported in Non-Patent Document 1, DNN-based speech synthesis can generate higher-quality speech than HMM-based speech synthesis.
 In many SPSS systems, a vocoder is used as a source-filter model when generating speech. More specifically, the source-filter model consists of a vocal tract filter and an excitation source. The vocal tract filter models the vocal tract and is expressed by spectral envelope parameters. The source signal, which models the excitation source (vocal fold vibration), is expressed by mixing a pulse sequence and a noise component.
 In a commonly used vocoder, each frame of the excitation source is classified as a voiced section or an unvoiced section. If a frame is determined to be voiced, a pulse sequence at the fundamental frequency (hereinafter also abbreviated as "F0") corresponding to the voice pitch is generated; if it is determined to be unvoiced, the excitation is generated as white noise. Here, the voiced/unvoiced decision is made based on whether F0 is non-zero (voiced) or zero (unvoiced). In a typical SPSS, this F0 sequence is expressed as a discontinuous sequence that switches between a one-dimensional continuous value and a zero-dimensional discrete symbol, and for each frame a flag for switching between voiced and unvoiced (hereinafter "V/UV"; the flag is also abbreviated as the "V/UV flag") is required.
 Quality degradation of the synthesized speech can arise from V/UV determination errors in each frame and from the difficulty of modeling an excitation source that outputs such a discontinuous sequence.
 MSD (multi-space distribution) modeling has been proposed as a method for modeling such a sequence (see, for example, Non-Patent Document 2). However, MSD modeling inherently involves the difficulty of jointly representing continuous and discrete sequences. In addition, for frames with V/UV prediction errors, vocoding often degrades the quality of the synthesized speech: a frame erroneously judged voiced produces a buzzy quality, and a frame erroneously judged unvoiced produces a hoarse quality.
 Several approaches have been proposed to address the problems described above.
 The first approach is to interpolate the discontinuous F0 sequence so that it can be treated as a continuous sequence (see, for example, Non-Patent Document 3). It has been shown that F0 can then be modeled as a continuous sequence and that quality can be improved. However, this approach still requires V/UV determination at waveform-generation time, and a discrete sequence must still be modeled.
 As another approach, V/UV can be determined from some continuous sequence. For example, a method has been proposed that determines V/UV based on an aperiodicity index instead of the V/UV flag (see, for example, Non-Patent Document 4). While this method achieves completely continuous modeling, V/UV determination is still required at waveform-generation time, so the influence of V/UV determination errors cannot be completely avoided.
 As yet another approach, a method has been proposed that uses the Maximum Voiced Frequency (hereinafter also abbreviated as "MVF"), which indicates the upper frequency limit of the periodicity of a speech signal, instead of the V/UV flag (see, for example, Non-Patent Document 5). Using the MVF, speech can be divided with the high-frequency band as the aperiodic component and the low-frequency band as the periodic component. Since the MVF is modeled continuously, it can also be made to function as a V/UV flag by setting a threshold value. However, because the signal is divided into only two bands, high and low frequency, the modeling accuracy of the periodic/aperiodic components is not sufficient.
 本技術は、このような課題を解決するためのものであり、SPSSにおいて、音響モデルにおけるV/UVの判定エラーに起因する品質への影響を低減できる新たな手法を提供することを目的としている。 The present technology is intended to solve such a problem, and an object of the present technology is to provide a new method capable of reducing the influence on quality caused by a determination error of V / UV in an acoustic model in SPSS. .
 本発明のある局面に従えば、SPSSに従う音声合成システムが提供される。音声合成システムは、既知のテキストに対応する音声波形の基本周波数を単位区間毎に抽出する第1の抽出部と、音声波形から周期成分および非周期成分を単位区間毎に抽出する第2の抽出部と、抽出された周期成分および非周期成分のスペクトル包絡を抽出する第3の抽出部と、既知のテキストの文脈情報に基づくコンテキストラベルを生成する生成部と、基本周波数、周期成分のスペクトル包絡、非周期成分のスペクトル包絡を含む音響特徴量と、対応するコンテキストラベルとを対応付けて学習することで、統計モデルを構築する学習部とを含む。 According to an aspect of the present invention, a speech synthesis system according to SPSS is provided. The speech synthesis system includes a first extraction unit that extracts a fundamental frequency of a speech waveform corresponding to a known text for each unit section, and a second extraction that extracts a periodic component and an aperiodic component from the speech waveform for each unit section. A third extraction unit that extracts a spectral envelope of the extracted periodic component and aperiodic component, a generation unit that generates a context label based on context information of a known text, and a spectral envelope of the fundamental frequency and the periodic component And a learning unit that constructs a statistical model by learning by associating the acoustic feature amount including the spectrum envelope of the non-periodic component and the corresponding context label.
Preferably, the speech synthesis system further includes: a determination unit that, in response to input of an arbitrary text, determines context labels based on the context information of that text; and an estimation unit that estimates, from the statistical model, the acoustic features corresponding to the context labels determined by the determination unit. The estimated acoustic features include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component. The speech synthesis system further includes: a first reconstruction unit that reconstructs the periodic component by filtering a pulse sequence, generated according to the fundamental frequency contained in the estimated acoustic features, with the spectral envelope of the periodic component; a second reconstruction unit that reconstructs the aperiodic component by filtering a noise sequence with the spectral envelope of the aperiodic component; and an addition unit that adds the reconstructed periodic and aperiodic components and outputs the result as the speech waveform corresponding to the input text.
Preferably, the second extraction unit extracts only the aperiodic component from unit sections for which the first extraction unit cannot extract a fundamental frequency, and extracts both the periodic and aperiodic components from the remaining unit sections.
Preferably, for unit sections from which a fundamental frequency cannot be extracted, the first extraction unit determines the fundamental frequency by interpolation.
Preferably, the pulse sequence is generated from the interpolated fundamental-frequency sequence, and the noise sequence is a sequence in which noise is generated over the entire interval.
According to yet another aspect of the present invention, a speech synthesis program for realizing a speech synthesis method based on SPSS is provided. The speech synthesis program causes a computer to execute the steps of: extracting, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform; extracting spectral envelopes of the extracted periodic and aperiodic components; generating context labels based on the context information of the known text; and constructing a statistical model by learning associations between acoustic features, which include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context labels.
According to yet another aspect of the present invention, a speech synthesis method based on SPSS is provided. The speech synthesis method includes the steps of: extracting, for each unit section, the fundamental frequency of a speech waveform corresponding to a known text; extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform; extracting spectral envelopes of the extracted periodic and aperiodic components; generating context labels based on the context information of the known text; and constructing a statistical model by learning associations between acoustic features, which include the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context labels.
According to the present technology, the impact on quality caused by V/UV determination errors in the acoustic model can be reduced in SPSS.
FIG. 1 is a schematic diagram showing an overview of a multilingual translation system that uses the speech synthesis system according to the present embodiment.
FIG. 2 is a schematic diagram showing an example hardware configuration of the service providing apparatus according to the present embodiment.
FIG. 3 is a schematic diagram for explaining an overview of speech synthesis processing according to the related art.
FIG. 4 is a schematic diagram for explaining an overview of speech synthesis processing according to the present embodiment.
FIG. 5 is a block diagram for explaining the processing of the main parts of the speech synthesis system according to the present embodiment.
FIG. 6 is a diagram showing an example of the speech waveforms of the periodic and aperiodic components output in the speech synthesis system according to the present embodiment.
FIG. 7 is a flowchart showing an example of a processing procedure in the speech synthesis system according to the present embodiment.
FIG. 8 is a flowchart showing an example of a processing procedure in the speech synthesis system according to the present embodiment.
FIG. 9 is a diagram showing example results of a paired-comparison evaluation of the speech synthesis system according to the present embodiment.
Embodiments of the present invention will be described in detail with reference to the drawings. The same or corresponding parts in the drawings are denoted by the same reference numerals, and their description will not be repeated.
[A. Application example]
First, one application example of the speech synthesis system according to the present embodiment will be described; more specifically, a multilingual translation system that uses the speech synthesis system.
FIG. 1 is a schematic diagram showing an overview of a multilingual translation system 1 that uses the speech synthesis system according to the present embodiment. Referring to FIG. 1, the multilingual translation system 1 includes a service providing apparatus 10. The service providing apparatus 10 performs speech recognition, multilingual translation, and related processing on input speech (an utterance in a first language) received from a mobile terminal 30 connected via a network 2, synthesizes the corresponding utterance in a second language, and returns the synthesized result to the mobile terminal 30 as output speech.
For example, when the user 4 says the English phrase "Where is the station ?" to the mobile terminal 30, the mobile terminal 30 generates input speech from the utterance using a microphone or the like and transmits the generated input speech to the service providing apparatus 10. The service providing apparatus 10 synthesizes output speech representing the corresponding Japanese phrase "駅はどこですか？" ("Where is the station?"). When the mobile terminal 30 receives the output speech from the service providing apparatus 10, it plays it back, so that the conversation partner of the user 4 hears the question in Japanese.
Although not shown, the conversation partner of the user 4 may also have a similar mobile terminal 30. For example, when the partner answers the question from the user 4 by speaking the Japanese equivalent of "Go straight and turn left" into his or her own terminal, the same processing is executed and the corresponding English phrase "Go straight and turn left" is output from the mobile terminal of the user 4's conversation partner.
In this way, the multilingual translation system 1 can freely translate between utterances in the first language and utterances in the second language. The system is not limited to two languages; automatic translation may be performed among any number of languages.
By using such an automatic speech translation function, foreign travel and communication with foreigners can be facilitated.
The speech synthesis system according to the present embodiment included in the service providing apparatus 10 adopts one form of SPSS, as described later. As components related to the speech synthesis system, the service providing apparatus 10 includes an analysis unit 12, a learning unit 14, a DNN 16, and a speech synthesis unit 18.
As components related to automatic translation, the service providing apparatus 10 includes a speech recognition unit 20 and a translation unit 22. The service providing apparatus 10 further includes a communication processing unit 24 for communicating with the mobile terminal 30.
More specifically, the analysis unit 12 and the learning unit 14 are responsible for the machine learning used to construct the DNN 16. The functions and processing of the analysis unit 12 and the learning unit 14 are described in detail later. The DNN 16 stores the neural network obtained as the result of the machine learning performed by the analysis unit 12 and the learning unit 14.
In the present embodiment, a DNN is used as an example; however, a recurrent neural network (hereinafter also abbreviated as "RNN"), a long short-term memory (LSTM) RNN, or a convolutional neural network (CNN) may be used instead.
The speech recognition unit 20 performs speech recognition on the input speech received from the mobile terminal 30 via the communication processing unit 24 and outputs recognized text. The translation unit 22 generates text in the designated language (for convenience, also referred to as "translated text") from the recognized text output by the speech recognition unit 20. Any known methods can be adopted for the speech recognition unit 20 and the translation unit 22.
The speech synthesis unit 18 performs speech synthesis on the translated text from the translation unit 22 by referring to the DNN 16, and transmits the resulting output speech to the mobile terminal 30 via the communication processing unit 24.
For convenience of explanation, FIG. 1 shows an example in which the components responsible for the machine learning that constructs the DNN 16 (mainly the analysis unit 12 and the learning unit 14) and the components responsible for multilingual translation using the constructed DNN 16 (mainly the speech recognition unit 20, the translation unit 22, and the speech synthesis unit 18) are implemented in the same service providing apparatus 10; however, these functions may be implemented in separate apparatuses. In that case, a first apparatus may construct the DNN 16 by performing the machine learning, and a second apparatus may perform speech synthesis using the constructed DNN 16 and provide services that use the speech synthesis.
In a multilingual translation service such as the one described above, an application running on the mobile terminal 30 may take over at least part of the functions of the speech recognition unit 20 and the translation unit 22. Likewise, an application running on the mobile terminal 30 may take over the functions of the components responsible for speech synthesis (the DNN 16 and the speech synthesis unit 18).
In this way, the multilingual translation system 1 and the speech synthesis system that forms part of it can be realized by the service providing apparatus 10 and the mobile terminal 30 cooperating in any form. How the functions are divided between the apparatuses may be decided as appropriate according to the situation, and is not limited to the arrangement of the multilingual translation system 1 shown in FIG. 1.
[B. Hardware configuration of the service providing apparatus]
Next, an example of the hardware configuration of the service providing apparatus will be described. FIG. 2 is a schematic diagram showing an example hardware configuration of the service providing apparatus 10 according to the present embodiment. The service providing apparatus 10 is typically realized using a general-purpose computer.
Referring to FIG. 2, the service providing apparatus 10 includes, as its main hardware components, a processor 100, a main memory 102, a display 104, an input device 106, a network interface (I/F) 108, an optical drive 134, and a secondary storage device 112. These components are connected to one another via an internal bus 110.
The processor 100 is the computing entity that executes the processing required to realize the service providing apparatus 10 according to the present embodiment by running various programs described later, and consists of, for example, one or more CPUs (central processing units) or GPUs (graphics processing units). A CPU or GPU having multiple cores may be used.
The main memory 102 is a storage area that temporarily holds program code, working data, and the like while the processor 100 executes programs, and consists of a volatile memory device such as a DRAM (dynamic random access memory) or an SRAM (static random access memory).
The display 104 is a display unit that outputs the user interface for the processing, processing results, and the like, and consists of, for example, an LCD (liquid crystal display) or an organic EL (electroluminescence) display.
The input device 106 accepts instructions and operations from the user and consists of, for example, a keyboard, a mouse, a touch panel, or a pen. The input device 106 may also include a microphone for collecting the speech required for machine learning, or an interface for connecting to a sound-collecting device that collects such speech.
The network interface 108 exchanges data with the mobile terminal 30 and other information processing apparatuses on the Internet or an intranet. Any communication scheme such as Ethernet (registered trademark), wireless LAN (Local Area Network), or Bluetooth (registered trademark) can be adopted for the network interface 108.
The optical drive 134 reads information stored on an optical disk 136 such as a CD-ROM (compact disc read only memory) or a DVD (digital versatile disc) and outputs it to the other components via the internal bus 110. The optical disk 136 is an example of a non-transitory recording medium and is distributed with programs stored on it in a non-volatile manner. When the optical drive 134 reads a program from the optical disk 136 and installs it in the secondary storage device 112 or the like, the general-purpose computer comes to function as the service providing apparatus 10 (or as a speech synthesis apparatus). The subject matter of the present invention may therefore also be the program itself installed in the secondary storage device 112 or the like, or a recording medium such as the optical disk 136 storing a program for realizing the functions and processing according to the present embodiment.
FIG. 2 shows an optical recording medium such as the optical disk 136 as an example of a non-transitory recording medium, but the recording medium is not limited to this; a semiconductor recording medium such as a flash memory, a magnetic recording medium such as a hard disk or storage tape, or a magneto-optical recording medium such as an MO (magneto-optical disk) may be used.
The secondary storage device 112 stores the programs executed by the processor 100, the input data processed by those programs (including input speech and text for learning and input speech from the mobile terminal 30), and the output data generated by executing the programs (including the output speech transmitted to the mobile terminal 30), and consists of a non-volatile storage device such as a hard disk or an SSD (solid state drive).
More specifically, in addition to an OS (operating system), not shown, the secondary storage device 112 typically stores an analysis program 121 for realizing the analysis unit 12, a learning program 141 for realizing the learning unit 14, a speech recognition program 201 for realizing the speech recognition unit 20, a translation program 221 for realizing the translation unit 22, and a speech synthesis program 181 for realizing the speech synthesis unit 18.
Some of the libraries and function modules required when the processor 100 executes these programs may be replaced with libraries or function modules provided as standard by the OS. In that case, each program by itself does not contain all of the program modules needed to realize the corresponding function, but the necessary functions can be realized once the program is installed in the OS execution environment. Even such a program that does not include some libraries or function modules can fall within the technical scope of the present invention.
These programs may be distributed not only stored on one of the recording media described above but also by being downloaded from a server apparatus or the like via the Internet or an intranet.
In practice, databases are required to realize the speech recognition unit 20 and the translation unit 22, but for convenience of explanation they are not drawn.
In addition to the DNN 16, the secondary storage device 112 may store the machine-learning input speech 130 and the corresponding text 132 used to construct the DNN 16.
FIG. 2 shows an example in which a single computer constitutes the service providing apparatus 10; however, the multilingual translation system 1 and the speech synthesis system that forms part of it may instead be realized by multiple computers connected via a network cooperating explicitly or implicitly.
All or part of the functions realized by the computer (processor 100) executing programs may instead be realized using hard-wired circuits such as integrated circuits, for example an ASIC (application specific integrated circuit) or an FPGA (field-programmable gate array).
Those skilled in the art will be able to realize the speech synthesis system according to the present embodiment by appropriately using the technology available at the time the invention is practiced.
[C. Overview]
In the present embodiment, a speech synthesis system based on SPSS is provided. The speech synthesis system according to the present embodiment adopts a scheme that eliminates the need for V/UV determination by decomposing the source signal representing the excitation source into a periodic component and an aperiodic component. Speech parameters representing the periodic and aperiodic components of the source signal are applied to a DNN for learning.
First, speech synthesis processing according to the related art, and its application to SPSS, will be described. FIG. 3 is a schematic diagram for explaining an overview of speech synthesis processing according to the related art. Referring to FIG. 3, the related-art speech synthesis processing includes a pulse generation unit 250, a white noise generation unit 252, a switching unit 254, and a speech synthesis filter 256. In the configuration shown in FIG. 3, the pulse generation unit 250, the white noise generation unit 252, and the switching unit 254 correspond to the part that models the excitation source: the switching unit 254 selects either the pulse sequence output from the pulse generation unit 250 or the noise sequence from the white noise generation unit 252 as the source signal and supplies it to the speech synthesis filter 256. The pulse generation unit 250 is given the parameter F0, which represents the pitch of the voice, and outputs a pulse sequence at intervals equal to the reciprocal of F0 (the fundamental period, or pitch period). Although not shown, the pulse generation unit 250 may also be given an amplitude parameter representing the loudness of the voice. The speech synthesis filter 256 determines the timbre of the speech and is given parameters representing the spectral envelope.
In the source filter model for speech generation shown in FIG. 3, the input speech waveform is divided into unit sections (for example, frames), and each unit section is judged to be either a voiced section or an unvoiced section; a pulse sequence is output as the source signal for voiced sections, and a noise sequence is output as the source signal for unvoiced sections. The parameter that distinguishes voiced sections from unvoiced sections is the V/UV flag.
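For illustration only, the following Python/NumPy sketch reproduces the frame-switched excitation generation described above; the sampling frequency, frame hop, and the frame-level arrays f0 and vuv are assumptions introduced for this example rather than values taken from the document.

import numpy as np

def conventional_excitation(f0, vuv, fs=16000, hop=80):
    """Frame-switched excitation of the related-art vocoder:
    pulses at the F0 period in voiced frames, white noise in unvoiced frames."""
    n = len(f0) * hop
    e = np.zeros(n)
    next_pulse = 0  # sample index of the next pulse
    for i, (f, voiced) in enumerate(zip(f0, vuv)):
        start, end = i * hop, (i + 1) * hop
        if voiced and f > 0:
            period = int(round(fs / f))
            while next_pulse < end:
                if next_pulse >= start:
                    e[next_pulse] = 1.0
                next_pulse += period
        else:
            e[start:end] = np.random.randn(hop)  # white noise source
            next_pulse = end  # restart the pulse train after the unvoiced frame
    return e

Here the V/UV flag vuv decides, frame by frame, whether pulses or white noise drive the synthesis filter, which is exactly the switching that the present embodiment removes.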
When the source filter model shown in FIG. 3 is applied to SPSS, F0, the V/UV flag, and the spectral envelope become the parameters to be learned, so V/UV must be determined correctly for each unit section. However, determining V/UV, and modeling a source signal that is discontinuous because it switches between a pulse sequence and a noise sequence, are not easy, and quality degradation can therefore occur in the synthesized speech.
In the present embodiment, therefore, a method is adopted that does not require V/UV to be determined for each unit section of the speech waveform. This reduces the impact on the quality of the synthesized speech of the V/UV determination errors that can arise in the related art.
FIG. 4 is a schematic diagram for explaining an overview of the speech synthesis processing according to the present embodiment. Referring to FIG. 4, the speech synthesis processing according to the present embodiment includes a pulse generation unit 200, a speech synthesis filter (periodic component) 202, a Gaussian noise generation unit 204, a speech synthesis filter (aperiodic component) 206, and an addition unit 208.
In the present embodiment, instead of switching the source signal with the V/UV flag as in FIG. 3, a source signal is prepared for each of the periodic and aperiodic components. In other words, the speech signal is decomposed into a periodic component and an aperiodic component.
More specifically, the pulse generation unit 200 and the speech synthesis filter (periodic component) 202 form the part that generates the periodic component: the pulse generation unit 200 generates pulses according to the specified F0 (a continuous pulse sequence, as described later), and the speech synthesis filter (periodic component) 202 applies a filter corresponding to the spectral envelope of the periodic component to that continuous pulse sequence, thereby outputting the periodic component contained in the synthesized speech.
A continuous pulse sequence can be used regardless of whether each unit section is voiced or unvoiced because the silent sections of the periodic component are assumed to have inaudible power, so that the entire signal can be treated as voiced. That is, in sections with no periodicity, such as silence or unvoiced speech, the spectral envelope corresponding to the periodic component is assumed to have a sufficiently small amplitude. Under this assumption, even if a periodic component is generated from the F0 pulse sequence in such silent or unvoiced sections, it is considered small enough to be inaudible. Therefore, whereas the related-art speech synthesis processing stops generating the pulse sequence in unvoiced sections, the speech synthesis processing according to the present embodiment keeps generating the pulse sequence there, which reduces the effect on the synthesized speech of discontinuities in the pulse sequence.
The Gaussian noise generation unit 204 and the speech synthesis filter (aperiodic component) 206 form the part that generates the aperiodic component: the Gaussian noise generation unit 204 generates Gaussian noise as an example of a continuous noise sequence, and the speech synthesis filter (aperiodic component) 206 applies a filter corresponding to the spectral envelope of the aperiodic component to that noise sequence, thereby outputting the aperiodic component contained in the synthesized speech.
Finally, the periodic component output from the speech synthesis filter (periodic component) 202 and the aperiodic component output from the speech synthesis filter (aperiodic component) 206 are added by the addition unit 208, and a speech waveform representing the synthesized speech is output.
A noise sequence can likewise be used regardless of whether each unit section is voiced or unvoiced because the aperiodic component is assumed to consist of unvoiced speech and silence, so that the entire signal can be treated as unvoiced. As described above, by using an acoustic model that does not need to distinguish between voiced and unvoiced sections, and by performing learning based on that acoustic model, a speech synthesis method that does not require V/UV determination can be realized.
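As a rough illustration of the structure in FIG. 4, the following Python/NumPy sketch builds a continuous pulse train from a continuous F0 contour, shapes it with the periodic-component envelope, shapes Gaussian noise with the aperiodic-component envelope, and adds the two. The frame-wise FFT multiplication with overlap-add merely stands in for the speech synthesis filters 202 and 206, and the frame hop, FFT size, and the assumption that env_pdc and env_apd are per-frame magnitude envelopes of length nfft//2 + 1 are choices made for this example.

import numpy as np

def synthesize(f0, env_pdc, env_apd, fs=16000, hop=80, nfft=512):
    """Periodic branch: continuous pulse train shaped by the periodic envelope.
    Aperiodic branch: Gaussian noise shaped by the aperiodic envelope.
    The two branches are summed; no V/UV decision is made anywhere."""
    n_frames = len(f0)
    n = n_frames * hop
    pulses = np.zeros(n + nfft)
    t = 0.0
    while t < n:
        pulses[int(t)] = 1.0
        frame = min(int(t) // hop, n_frames - 1)
        t += fs / max(f0[frame], 1e-3)   # next pulse one fundamental period later
    noise = np.random.randn(n + nfft)
    win = np.hanning(nfft)
    out = np.zeros(n + nfft)
    for i in range(n_frames):
        s = i * hop
        P = np.fft.rfft(pulses[s:s + nfft] * win) * env_pdc[i]
        A = np.fft.rfft(noise[s:s + nfft] * win) * env_apd[i]
        out[s:s + nfft] += np.fft.irfft(P + A, n=nfft)
    return out[:n]

Because the pulse train never stops and the noise never switches off, no V/UV flag appears in this sketch; voicing is carried entirely by the two spectral envelopes.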
[D. Learning processing and speech synthesis processing]
Next, the learning processing and speech synthesis processing in the speech synthesis system according to the present embodiment will be described in detail. FIG. 5 is a block diagram for explaining the processing of the main parts of the speech synthesis system according to the present embodiment.
Referring to FIG. 5, the speech synthesis system includes the analysis unit 12 and the learning unit 14 for constructing the DNN 16, and the speech synthesis unit 18 that outputs speech waveforms using the DNN 16. The processing and functions of each of these units are described in detail below.
(d1: Analysis unit 12)
First, the processing and functions of the analysis unit 12 will be described. The analysis unit 12 is responsible for speech analysis and generates an acoustic feature sequence from the speech waveform of the input speech for learning. In the speech synthesis system according to the present embodiment, the acoustic features for each frame include F0 and the spectral envelopes (of the periodic and aperiodic components).
More specifically, the analysis unit 12 includes an F0 extraction unit 120, a periodic/aperiodic component extraction unit 122, and a feature extraction unit 124. The feature extraction unit 124 includes an F0 interpolation unit 126 and a spectral envelope extraction unit 128.
The F0 extraction unit 120 extracts, for each frame (unit section), F0 of the speech waveform corresponding to a known text. That is, the F0 extraction unit 120 extracts F0 from the input speech waveform frame by frame. The extracted F0 is supplied to the periodic/aperiodic component extraction unit 122 and the feature extraction unit 124.
The periodic/aperiodic component extraction unit 122 extracts a periodic component and an aperiodic component from the input speech waveform for each frame (unit section). More specifically, the periodic/aperiodic component extraction unit 122 extracts the periodic and aperiodic components based on the F0 of the input speech waveform. In the present embodiment, the source signal s(t) is decomposed as shown in Equation (1) below.
    s(t) = s_pdc(t) + s_apd(t)    (when f_0(t) exists)
    s(t) = s_apd(t)               (when f_0(t) does not exist)        ... (1)
Here, f_0(t) denotes F0 at frame t of the speech waveform, the periodic signal s_pdc(t) denotes the periodic component at frame t of the speech waveform, and the aperiodic signal s_apd(t) denotes the aperiodic component at frame t of the speech waveform.
Thus, for each frame t of the input speech waveform, the source signal is treated as containing both a periodic and an aperiodic component when F0 exists, and as containing only an aperiodic component when F0 does not exist. That is, the periodic/aperiodic component extraction unit 122 extracts only the aperiodic component from frames (unit sections) for which the F0 extraction unit 120 cannot extract F0, and extracts both the periodic and aperiodic components from the other frames.
In the present embodiment, a sinusoidal model as shown in Equation (2) below is adopted as one example of representing the harmonic (periodic) component of the source signal.
    [Equation (2): sinusoidal model of the periodic component s_pdc(t), a sum over J harmonics of f_0 whose amplitude and frequency are approximated linearly, with parameters α_k, β_k, γ, and φ_k]
In Equation (2), J denotes the number of harmonics. That is, in the sinusoidal model of Equation (2), the frequency and amplitude at each harmonic are approximated linearly. To solve this sinusoidal model, the values of α_k, β_k, γ, and φ_k must be determined. More specifically, the values that minimize δ, defined according to Equation (3) below, are taken as the solution.
    [Equation (3): definition of the error criterion δ between the speech signal and the sinusoidal model of Equation (2) over the analysis window ω(t)]
Here, ω(t) is a window function of length 2N_w + 1. The value that minimizes δ defined according to Equation (3) is determined by the solution method given in Non-Patent Document 8.
Following the mathematical solution described above, the periodic/aperiodic component extraction unit 122 extracts the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) contained in the input speech waveform.
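The actual decomposition follows Equations (2) and (3) and the solver of Non-Patent Document 8, which lets the amplitude and frequency of each harmonic vary within the analysis window. As a simplified illustration of the same idea, the Python/NumPy sketch below fits a constant-amplitude harmonic model at f_0 to one windowed frame by least squares and treats the residual as the aperiodic component; frames without F0 are assigned entirely to the aperiodic component, as in Equation (1). The sampling frequency and the choice of harmonics up to the Nyquist frequency are assumptions for this example.

import numpy as np

def decompose_frame(s, f0, fs=16000, J=None):
    """Split one windowed frame s into a periodic part (harmonic least-squares
    fit at f0) and an aperiodic part (residual). No F0 means all aperiodic."""
    n = len(s)
    if f0 is None or f0 <= 0:
        return np.zeros(n), s.copy()            # Eq. (1): no F0 -> aperiodic only
    if J is None:
        J = max(1, int((fs / 2) // f0))         # harmonics up to the Nyquist limit
    t = np.arange(n) / fs
    cols = []                                   # design matrix of harmonic cos/sin
    for k in range(1, J + 1):
        cols.append(np.cos(2 * np.pi * k * f0 * t))
        cols.append(np.sin(2 * np.pi * k * f0 * t))
    A = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(A, s, rcond=None)
    s_pdc = A @ coef                            # periodic (harmonic) component
    s_apd = s - s_pdc                           # aperiodic residual
    return s_pdc, s_apd

A full implementation would additionally estimate within-frame amplitude and frequency slopes through α_k, β_k, and γ, as Equation (2) describes.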
The feature extraction unit 124 outputs, as acoustic features, the continuous F0, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component. As the spectral envelope representation, any of LSP (line spectral pair), LPC (linear prediction coefficients), or mel-cepstral coefficients may be adopted. As the acoustic feature, the logarithm of the continuous F0 (hereinafter also abbreviated as "continuous logF0") is used.
The F0 interpolation unit 126 interpolates the F0 values extracted frame by frame from the speech waveform by the F0 extraction unit 120 to generate a continuous F0 (an F0 sequence). More specifically, for example, F0 in a target frame can be determined from the F0 values extracted in one or more nearby frames according to a predetermined interpolation function. Any known method can be adopted for the interpolation performed by the F0 interpolation unit 126.
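Because any known interpolation method may be used here, the following is only a minimal Python/NumPy sketch; it assumes that frames in which F0 extraction failed are marked with 0 and interpolates linearly in the log domain, which also yields the continuous logF0 feature directly.

import numpy as np

def continuous_log_f0(f0):
    """Fill frames without F0 (marked 0) by linear interpolation of log F0,
    yielding the continuous logF0 contour used as an acoustic feature."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no frame with F0 to interpolate from")
    log_f0 = np.empty(len(f0))
    log_f0[voiced] = np.log(f0[voiced])
    idx = np.arange(len(f0))
    log_f0[~voiced] = np.interp(idx[~voiced], idx[voiced], log_f0[voiced])
    return log_f0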
The spectral envelope extraction unit 128 extracts the spectral envelopes of the extracted periodic and aperiodic components. More specifically, based on the F0 extracted by the F0 extraction unit 120, the spectral envelope extraction unit 128 extracts spectral envelopes from the periodic signal s_pdc(t) and the aperiodic signal s_apd(t) output from the periodic/aperiodic component extraction unit 122. That is, for each frame, the spectral envelope extraction unit 128 extracts a spectral envelope (pdc) representing the distribution of the frequency components contained in the periodic signal s_pdc(t), and a spectral envelope (apd) representing the distribution of the frequency components contained in the aperiodic signal s_apd(t).
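The document allows LSP, LPC, or mel-cepstral coefficients as the envelope representation; purely to show the shape of the per-frame operation, the sketch below instead uses simple cepstral smoothing of a frame's log-magnitude spectrum, which is not one of the representations named above and is an assumption of this example.

import numpy as np

def cepstral_envelope(frame, nfft=512, n_ceps=40):
    """Per-frame spectral envelope by cepstral smoothing: keep only the
    low-quefrency cepstral coefficients of the log-magnitude spectrum."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft)) + 1e-10
    log_spec = np.log(spec)
    full = np.concatenate([log_spec, log_spec[-2:0:-1]])  # mirror to a full spectrum
    ceps = np.fft.ifft(full).real
    ceps[n_ceps:-n_ceps] = 0.0                  # lifter: discard high quefrencies
    smoothed = np.fft.fft(ceps).real[:nfft // 2 + 1]
    return np.exp(smoothed)                     # linear-magnitude envelope

Applied to the frame-wise outputs s_pdc(t) and s_apd(t), this kind of smoothing would produce the per-frame envelopes (pdc) and (apd) referred to above.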
FIG. 6 is a diagram showing an example of the speech waveforms of the periodic and aperiodic components output in the speech synthesis system according to the present embodiment. As an example, FIG. 6 shows the speech signal produced when a speaker utters "すべて" (subete, "all"). As described later, the DNN 16 learns the acoustic features frame by frame.
FIG. 6(a) shows the input speech waveform (the source signal), FIG. 6(b) shows the waveform of the periodic component extracted from the source signal, and FIG. 6(c) shows the waveform of the aperiodic component extracted from the source signal. The periodic component of the sections in which F0 is extracted appears as shown in FIG. 6(b), while the aperiodic component of those sections, together with the sections in which F0 is not extracted, appears as shown in FIG. 6(c). In the section labeled "non-F0" in FIG. 6(b), the amplitude is nearly zero; this section corresponds to a section in which F0 is not extracted.
(d2: Learning unit 14)
Next, the processing and functions of the learning unit 14 will be described. In SPSS, the relationship between an input text and the speech waveform corresponding to that text is learned statistically. In general, it is not easy to model this relationship directly. In the speech synthesis system according to the present embodiment, therefore, a context label sequence is generated based on the context information of the input text, and an acoustic feature sequence including F0 and the spectral envelopes is generated from the input speech waveform. By learning from the context label sequence and the acoustic feature sequence, an acoustic model is constructed that takes the context label sequence as input and outputs the acoustic feature sequence. In the present embodiment, this acoustic model, a statistical model, is constructed with a DNN. As a result, the DNN 16 stores the parameters representing the constructed acoustic model (statistical model).
In the configuration shown in FIG. 5, the components that generate the context label sequence are a text analysis unit 162 and a context label generation unit 164. The text analysis unit 162 and the context label generation unit 164 generate context labels based on the context information of the known text.
Because the context labels are used by both the learning unit 14 and the speech synthesis unit 18, the configuration shown here is one in which the learning unit 14 and the speech synthesis unit 18 share these components. However, components for generating context labels may instead be implemented separately in each of the learning unit 14 and the speech synthesis unit 18.
The text analysis unit 162 analyzes the input text, whether for learning or for synthesis, and outputs its context information to the context label generation unit 164. The context label generation unit 164 determines context labels based on the context information from the text analysis unit 162 and outputs them to the model learning unit 140.
Because the speech synthesis system according to the present embodiment learns from frame-level acoustic features, the context label generation unit 164 also generates context labels frame by frame. Context labels are generally produced in units of phonemes, so the context label generation unit 164 generates frame-level context labels by adding positional information indicating where each frame lies within its phoneme.
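The exact label format is not specified in the document; as a sketch, the Python function below expands hypothetical phoneme-level context labels to frame level by attaching each frame's relative position within its phoneme, which is the kind of positional information added by the context label generation unit 164. The 5 ms frame hop and the dictionary encoding are assumptions introduced for the example.

def frame_context_labels(phoneme_labels, phoneme_durations_ms, hop_ms=5.0):
    """Expand phoneme-level context labels to frame level by appending the
    relative position of each frame within its phoneme (a value in [0, 1))."""
    frames = []
    for label, dur_ms in zip(phoneme_labels, phoneme_durations_ms):
        n = max(1, int(round(dur_ms / hop_ms)))
        for i in range(n):
            frames.append({"context": label, "frame_position": i / n})
    return frames

# Hypothetical usage: frame_context_labels(["sil-s+u", "s-u+b"], [40.0, 55.0])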
The model learning unit 140 receives the acoustic feature sequence 142 from the analysis unit 12 and the context label sequence 166 from the context label generation unit 164 as input, and learns the acoustic model using a DNN. In this way, the model learning unit 140 constructs the acoustic model, a statistical model, by learning associations between the acoustic features, which include F0, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context labels.
In the DNN-based acoustic model training performed by the model learning unit 140, the probability distribution is modeled with a DNN that takes the context label of each frame as input and outputs the acoustic feature vector of that frame (whose elements include at least the continuous logF0, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component). Typically, the model learning unit 140 trains the DNN so as to minimize the mean squared error of the normalized acoustic feature vectors. Training the DNN in this way is equivalent to modeling the probability distribution with a normal distribution whose mean vector varies from frame to frame and whose covariance matrix is context-independent, as shown in Equation (4) below.
    p(o_t | l_t, λ) = N(o_t; μ_t, U)        ... (4)
where o_t and l_t denote the acoustic feature vector and the context label at frame t, respectively.
Here, λ denotes the DNN parameter set, U denotes the global covariance matrix, and μ_t denotes the mean vector of the speech parameters estimated by the DNN. The generated probability distribution sequence therefore has a time-varying mean vector and a time-invariant covariance matrix.
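Neither the network architecture nor the training framework is prescribed by the document. Under those assumptions, the following PyTorch sketch trains a small feed-forward DNN with the mean-squared-error criterion on z-normalized acoustic features, which corresponds to the Gaussian interpretation of Equation (4) with a context-independent global covariance; contexts and features are assumed to be float tensors of shape (frames, dimensions).

import torch
import torch.nn as nn

def train_acoustic_model(contexts, features, hidden=512, epochs=50, lr=1e-3):
    """contexts: (T, D_in) frame-level context vectors.
    features: (T, D_out) acoustic features (continuous logF0 plus the
    periodic and aperiodic spectral-envelope parameters).
    The features are z-normalized and the DNN is trained with MSE."""
    mean, std = features.mean(0), features.std(0).clamp_min(1e-6)
    targets = (features - mean) / std
    model = nn.Sequential(
        nn.Linear(contexts.shape[1], hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, features.shape[1]),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(contexts), targets)
        loss.backward()
        opt.step()
    return model, (mean, std)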
(d3: Speech synthesis unit 18)
Next, the processing and functions of the speech synthesis unit 18 will be described. The speech synthesis unit 18 generates frame-level context labels from the text to be synthesized and inputs them to the DNN 16 to estimate a probability distribution sequence. Based on the estimated probability distribution sequence, it then synthesizes a speech waveform through processing that is the reverse of that used during learning.
More specifically, the speech synthesis unit 18 includes an acoustic feature estimation unit 180, a pulse generation unit 184, a periodic component generation unit 186, an aperiodic component generation unit 188, and an addition unit 187.
When a text to be synthesized is input, the text analysis unit 162 analyzes the input text and outputs its context information, and the context label generation unit 164 generates context labels based on that context information. That is, in response to the input of an arbitrary text, the text analysis unit 162 and the context label generation unit 164 determine context labels based on the context information of that text.
The acoustic feature estimation unit 180 estimates the acoustic features corresponding to the determined context labels from the acoustic model, the statistical model built in the DNN 16. More specifically, the acoustic feature estimation unit 180 inputs the generated frame-level context labels to the DNN 16, which represents the trained acoustic model, and estimates from it the acoustic features corresponding to the input context labels. In response to the input of the context label sequence, the DNN 16 outputs an acoustic feature sequence 182, a probability distribution sequence in which only the mean vector changes from frame to frame.
The interpolated continuous F0 (F0 sequence), the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component contained in the acoustic feature sequence 182 are estimated from the context label sequence using the DNN 16.
Because the interpolated continuous F0 (F0 sequence) can be expressed as a continuous distribution, a continuous pulse sequence is constructed from it. The spectral envelope of the periodic component and the spectral envelope of the aperiodic component are each modeled separately.
The pulse generation unit 184 and the periodic component generation unit 186 reconstruct the periodic component by filtering the pulse sequence, generated according to the F0 contained in the estimated acoustic features, with the spectral envelope of the periodic component. More specifically, the pulse generation unit 184 generates a pulse sequence according to the F0 (F0 sequence) from the acoustic feature estimation unit 180, and the periodic component generation unit 186 generates the periodic component by filtering that pulse sequence with the spectral envelope of the periodic component.
The aperiodic component generation unit 188 reconstructs the aperiodic component by filtering a noise sequence, such as a Gaussian noise sequence, with the spectral envelope of the aperiodic component. More specifically, the aperiodic component generation unit 188 generates the aperiodic component by filtering Gaussian noise from an arbitrary excitation source with the spectral envelope of the aperiodic component.
The addition unit 187 reconstructs the speech waveform by adding the periodic component from the periodic component generation unit 186 and the aperiodic component from the aperiodic component generation unit 188. That is, the addition unit 187 adds the reconstructed periodic and aperiodic components and outputs the result as the speech waveform corresponding to the input text.
As described above, the speech synthesis system according to the present embodiment uses the DNN 16, constructed in advance by learning, to estimate a probability distribution sequence for the frame-level context labels, and generates an appropriately varying acoustic feature sequence by exploiting the explicit relationship between static and dynamic features. The generated acoustic feature sequence is then applied to the vocoder to produce synthesized speech from the estimated acoustic features.
Thus, the speech synthesis system according to the present embodiment can generate a speech waveform from continuous sequences without performing V/UV determination.
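To show how the pieces connect at synthesis time, the following sketch runs a trained DNN (for example, the one from the training sketch above) on frame-level context vectors, de-normalizes the outputs to recover the mean vectors of Equation (4), and splits them into the continuous logF0 and the two spectral envelopes. The feature layout assumed here (logF0 first, then the periodic and aperiodic envelopes of nfft//2 + 1 bins each) is an illustration only; the recovered F0 contour and envelopes could then drive a vocoder such as the synthesize() sketch shown earlier.

import numpy as np
import torch

def generate_features(model, stats, contexts, nfft=512):
    """Run the trained DNN on frame-level context vectors and recover the
    acoustic features (the mean vectors of Eq. (4)); the split below assumes
    the same feature layout that was used at training time."""
    mean, std = stats
    with torch.no_grad():
        y = model(contexts) * std + mean          # de-normalize the DNN output
    y = y.numpy()
    n_env = nfft // 2 + 1                         # assumed envelope dimension
    log_f0  = y[:, 0]
    env_pdc = y[:, 1:1 + n_env]                   # periodic spectral envelope
    env_apd = y[:, 1 + n_env:1 + 2 * n_env]       # aperiodic spectral envelope
    return np.exp(log_f0), env_pdc, env_apd       # inputs for the vocoder sketch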
Although this embodiment describes, as a typical example, a system that uses a DNN as the learning means, the learning means is not limited to a DNN; any supervised learning method can be adopted, for example an HMM or a recurrent neural network.
[E. Processing procedure]
FIGS. 7 and 8 are flowcharts showing an example of the processing procedures in the speech synthesis system according to the present embodiment. Each step shown in FIGS. 7 and 8 may be realized by one or more processors (for example, the processor 100 shown in FIG. 2) executing one or more programs.
FIG. 7 shows the prior machine-learning processing for constructing the DNN 16, and FIG. 8 shows the speech synthesis processing that uses the DNN 16.
Referring to FIG. 7, when a known text and the speech waveform corresponding to that text are input (step S100), the processor 100 divides the input speech waveform into frames (step S102) and, for each frame, generates the context label sequence and the acoustic feature sequence by executing the processing that generates a context label from the input text (steps S110 to S112) and the processing that generates the acoustic features (steps S120 to S128).
 That is, the processor 100 analyzes the input text to generate context information (step S110) and, based on the generated context information, determines the context label for the corresponding frame (step S112).
 The processor 100 also extracts F0 in the target frame of the input speech waveform (step S120) and determines a continuous F0 by interpolating between it and the previously extracted F0 values (step S122). The processor 100 then extracts the periodic component and the aperiodic component in the target frame of the input speech waveform (step S124) and extracts the spectral envelope of each component (step S126). The processor 100 determines the logarithm of the continuous F0 determined in step S122 and the spectral envelopes (of the periodic and aperiodic components) extracted in step S126 as the acoustic features (step S128).
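 A simple way to obtain a continuous F0 contour from frame-wise F0 values that contain unvoiced (zero) frames is linear interpolation between the surrounding voiced frames, as sketched below; the use of linear interpolation and the treatment of leading and trailing unvoiced frames are assumptions for illustration.

```python
import numpy as np

def continuous_log_f0(f0):
    """f0: array of frame-wise F0 values in Hz, 0 for unvoiced frames.
    Returns a continuous log-F0 contour with unvoiced gaps filled by
    linear interpolation between neighbouring voiced frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("no voiced frames to interpolate from")
    log_f0 = np.zeros_like(f0)
    log_f0[voiced] = np.log(f0[voiced])
    idx = np.arange(len(f0))
    # Unvoiced frames (including the edges) are filled from the voiced frames
    log_f0[~voiced] = np.interp(idx[~voiced], idx[voiced], log_f0[voiced])
    return log_f0
```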
 The processor 100 adds the context label determined in step S112 and the acoustic features determined in step S128 to the DNN 16 (step S130). The processor 100 then determines whether an unprocessed frame remains (step S132). If an unprocessed frame remains (YES in step S132), the processing of steps S110 to S112 and steps S120 to S128 is repeated. If no unprocessed frame remains (NO in step S132), the processor 100 determines whether a new text and a speech waveform corresponding to that text have been input (step S134). If a new text and its corresponding speech waveform have been input (YES in step S134), the processing from step S102 onward is repeated.
 If no new text and corresponding speech waveform have been input (NO in step S134), the learning process ends.
 In the above description, an example is shown in which the context label and the acoustic features are input to the DNN 16 each time they are generated; alternatively, they may be input to the DNN 16 collectively after the generation of the context label sequence and the acoustic feature sequence from the target speech waveform has been completed.
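 The frame-wise pairs accumulated in steps S110 to S130 constitute the supervised training set for the DNN 16. The sketch below illustrates collecting the pairs before feeding them to the model at once, as mentioned above; extract_context_label and extract_acoustic_features are hypothetical stand-ins for the processing of steps S110 to S112 and S120 to S128.

```python
import numpy as np

def build_training_pairs(frames, frame_contexts,
                         extract_context_label, extract_acoustic_features):
    """Accumulate frame-wise (context label vector, acoustic feature vector)
    pairs for supervised training of the acoustic model (steps S110 to S130)."""
    inputs, targets = [], []
    for frame, context in zip(frames, frame_contexts):
        inputs.append(extract_context_label(context))      # steps S110-S112
        targets.append(extract_acoustic_features(frame))   # steps S120-S128
    return np.stack(inputs), np.stack(targets)
```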
 Next, referring to FIG. 8, when a text to be synthesized is input (step S200), the processor 100 analyzes the input text to generate context information (step S202) and, based on the generated context information, determines the context label for the corresponding frame (step S204). The processor 100 then estimates, from the DNN 16, the acoustic features corresponding to the context label determined in step S204 (step S206).
 The processor 100 generates a pulse sequence according to the F0 included in the estimated acoustic features (step S208) and filters the generated pulse sequence with the spectral envelope (of the periodic component) included in the estimated acoustic features, thereby generating the periodic component of the speech waveform (step S210).
 The processor 100 also generates a Gaussian noise sequence (step S212) and filters the generated Gaussian noise sequence with the spectral envelope (of the aperiodic component) included in the estimated acoustic features, thereby generating the aperiodic component of the speech waveform (step S214).
 Finally, the processor 100 adds the periodic component generated in step S210 and the aperiodic component generated in step S214 and outputs the result as the speech waveform of the synthesized speech (step S216). The speech synthesis process for the input text then ends. Note that the processing of steps S206 to S216 is repeated for the number of frames constituting the input text.
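 Step S208 can be illustrated by placing one impulse every fs / F0 samples according to the continuous F0 contour, as sketched below; the sampling rate, hop size, and simple impulse placement are assumptions for illustration, and reconstruct_waveform refers to the sketch given earlier.

```python
import numpy as np

def pulse_train(f0_contour, fs=16000, hop=80):
    """Generate an excitation pulse sequence from a frame-wise continuous
    F0 contour in Hz: one unit impulse every fs / F0 samples (step S208)."""
    n_samples = len(f0_contour) * hop
    excitation = np.zeros(n_samples)
    next_pulse = 0.0
    for n in range(n_samples):
        if n >= next_pulse:
            excitation[n] = 1.0
            f0 = f0_contour[min(n // hop, len(f0_contour) - 1)]
            next_pulse += fs / max(f0, 1.0)
        # samples between pulses stay zero
    return excitation

# The synthesized waveform is then obtained as in the earlier sketch:
#   noise = np.random.randn(len(excitation))
#   waveform = reconstruct_waveform(excitation, noise, periodic_env, aperiodic_env)
```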
 [F. Experimental evaluation]
 Next, an experimental evaluation conducted on the effectiveness of the speech synthesis system according to the present embodiment is described.
 (f1: Experimental conditions)
 As the comparative example against which the example according to the present embodiment was compared, general DNN speech synthesis was used.
 As the speech data, 503 ATR phonetically balanced sentences uttered by one Japanese female speaker were used. Of these, 493 sentences were used as training data and the remaining 10 sentences were used as evaluation sentences.
 The sampling frequency of the speech data was 16 kHz and the analysis period was 5 ms. The spectrum and the aperiodicity index (AP) obtained by WORLD analysis of the speech data in the training set were each expressed as 39th-order mel-cepstral coefficients (40th order including the 0th).
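 One possible realization of this analysis step, sketched below, uses the pyworld and pysptk Python bindings; the choice of these libraries, the use of the Harvest F0 estimator, and the all-pass constant alpha = 0.42 for 16 kHz are assumptions made for illustration (the embodiment itself combines several F0 extraction methods and does not prescribe specific tools).

```python
import numpy as np
import pyworld
import pysptk

def world_features(x, fs=16000, frame_period=5.0, order=39, alpha=0.42):
    """WORLD analysis followed by mel-cepstral parameterization.
    Returns (f0, mel-cepstrum of the spectrum, mel-cepstrum of the AP)."""
    x = x.astype(np.float64)
    f0, t = pyworld.harvest(x, fs, frame_period=frame_period)
    sp = pyworld.cheaptrick(x, f0, t, fs)   # spectral envelope
    ap = pyworld.d4c(x, f0, t, fs)          # aperiodicity index (AP)
    mgc = pysptk.sp2mc(sp, order=order, alpha=alpha)  # 40 coefficients
    bap = pysptk.sp2mc(ap, order=order, alpha=alpha)
    return f0, mgc, bap
```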
 log F0 was calculated by integrating the results of a plurality of known extraction methods, and microprosody was then removed by smoothing.
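 Microprosody removal by smoothing can be illustrated, for example, with a simple moving-average filter over the continuous log-F0 contour; the window length below is an arbitrary choice, and the embodiment does not specify the particular smoothing method.

```python
import numpy as np

def smooth_log_f0(log_f0, window_frames=7):
    """Remove fine microprosodic fluctuations from a continuous log-F0
    contour by moving-average smoothing (illustrative choice)."""
    kernel = np.ones(window_frames) / window_frames
    padded = np.pad(log_f0, window_frames // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```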
 As in the HMM speech synthesis of the comparative example, the phoneme duration model of the example was trained as a five-state, no-skip, left-to-right, context-dependent phoneme HSMM (hidden semi-Markov model) using phoneme-level context labels. In training the acoustic model with the DNN, a continuous log F0 pattern obtained by further interpolating the unvoiced sections was used. The acoustic features were obtained by further appending first-order and second-order dynamic features to these parameters.
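 First- and second-order dynamic features are typically computed by applying delta windows along the time axis of each static feature sequence, as sketched below; the window coefficients shown are the commonly used ones and are an assumption rather than values stated in the embodiment.

```python
import numpy as np

def append_dynamic_features(static):
    """static: array of shape [T, D]. Returns an array of shape [T, 3*D]
    with first- and second-order dynamic features appended."""
    delta_win = np.array([-0.5, 0.0, 0.5])
    accel_win = np.array([1.0, -2.0, 1.0])

    def apply_window(x, win):
        pad = len(win) // 2
        padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        return np.stack([np.correlate(padded[:, d], win, mode="valid")
                         for d in range(x.shape[1])], axis=1)

    return np.concatenate(
        [static, apply_window(static, delta_win), apply_window(static, accel_win)],
        axis=1)
```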
 For the DNN speech synthesis of the comparative example, V/UV information was used in addition to the above features. The input vector was obtained by appending the duration information obtained from the HSMM duration model to the phoneme-level context labels to generate a context label for each frame, and was expressed as a vector of 483 dimensions in total.
 The output vector was a 244-dimensional acoustic feature vector in the comparative example and a 243-dimensional acoustic feature vector in the example.
 Table 1 below lists the features and models used in the example and the comparative example. Both the input vectors and the output vectors were normalized to zero mean and unit variance.
 In the DNN network configuration, six hidden layers with 1024 units each were used, and the weights were initialized with random numbers. The mini-batch size was 256, the number of epochs was 30, the learning rate was 2.5×10⁻⁴, the activation function of the hidden layers was ReLU (rectified linear unit), and the optimizer was Adam. Dropout with a rate of 0.5 was also used.
 [Table 1: features and models used in the example and the comparative example]
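 A network with the configuration listed above can be sketched in PyTorch as follows; the framework, the mean-squared-error loss on the standardized vectors, and the interpretation of the learning rate as 2.5×10⁻⁴ are assumptions for illustration rather than details fixed by the embodiment.

```python
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    """Six hidden layers of 1024 ReLU units with dropout 0.5, mapping a
    483-dimensional context label vector to a 243-dimensional acoustic
    feature vector (dimensions of the example)."""
    def __init__(self, in_dim=483, hidden=1024, out_dim=243, n_layers=6):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(0.5)]
            dim = hidden
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = AcousticDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
criterion = nn.MSELoss()
# Training: 30 epochs over mini-batches of 256 standardized
# (context label, acoustic feature) pairs.
```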
 (f2: Subjective evaluation)
 As shown in Table 1, the acoustic features differ between the example and the comparative example, so the evaluation was conducted by subjective evaluation rather than objective evaluation. More specifically, the naturalness of the synthesized speech was compared in a paired-comparison experiment.
 As described above, the 10 sentences of the 503 ATR phonetically balanced sentences that were not used as training data were used as evaluation speech. The subjects (four male and one female) listened to the synthesized speech generated by the example and by the comparative example and were asked to select the one they felt was more natural (higher speech quality). However, when no difference was perceived between the presented pair of speech samples, the option "neither" was allowed.
 In both the example and the comparative example, a postfilter was applied to the mel-cepstral coefficients of the spectral envelope.
 FIG. 9 shows an example of the evaluation results of the paired-comparison experiment for the speech synthesis system according to the present embodiment. In FIG. 9, the aperiodicity index (AP) of the comparative example expresses aperiodicity in the range from 0.0 to 1.0.
 α in FIG. 9 denotes the threshold on the AP. It is completely voiced when α = 0.0 and completely unvoiced when α = 1.0. Frames whose AP is lower than the threshold α were treated as voiced, and frames whose AP is higher were treated as unvoiced.
 As thresholds that gave low V/UV determination error rates in a preliminary experiment, α = 0.5 and α = 0.6 were used (FIGS. 9(a) and 9(b)). In addition, "reference" in FIG. 9(c) shows the result obtained when the correct V/UV determination results were given.
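 For the comparative example, the V/UV decision from the aperiodicity index amounts to a simple per-frame threshold comparison, sketched below; operating on a frame-averaged AP value is an assumption for illustration.

```python
import numpy as np

def voiced_unvoiced(ap, alpha=0.5):
    """ap: per-frame aperiodicity index in [0.0, 1.0] (e.g. averaged over
    frequency bands). Returns True for voiced frames (AP below the
    threshold alpha) and False for unvoiced frames."""
    return np.asarray(ap) < alpha
```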
 In every case shown in FIGS. 9(a) to 9(c), the p-value of the test statistic for the example relative to the comparative example satisfied p < 0.01, confirming that the example showed a statistically significant advantage.
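 The embodiment does not state which statistical test produced these p-values; purely as an illustration, preference counts from a paired-comparison experiment can be assessed with a two-sided binomial (sign) test as sketched below, discarding the "neither" responses.

```python
from scipy.stats import binomtest

def paired_comparison_p_value(wins_example, wins_comparative):
    """Two-sided sign test on paired-comparison preference counts
    (ties excluded). Illustrative only; not the test named in the text."""
    n = wins_example + wins_comparative
    return binomtest(wins_example, n, p=0.5).pvalue

# e.g. paired_comparison_p_value(80, 20) gives a p-value well below 0.01
```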
 (f3: Conclusions of the experimental evaluation)
 In the speech synthesis system according to the present embodiment, separating the input speech into periodic and aperiodic components made it possible to express the trajectories of F0 and the spectral envelope continuously. Adopting this approach is considered to have provided advantages such as improved modeling accuracy and avoidance of V/UV determination errors.
 According to the results of the subjective evaluation described above, the example according to the present embodiment showed better performance than the comparative example even when the correct V/UV information was given to the comparative example. From this result, it can be concluded that the modeling in which the speech is separated into periodic and aperiodic components contributes to the quality improvement.
 [G. Summary]
 In performing SPSS, the speech synthesis system according to the present embodiment employs a technique that does not require V/UV determination for the source signal. By expressing the source signal as a combination of a periodic component and an aperiodic component instead of determining V/UV, quality degradation of the synthesized speech caused by V/UV determination errors can be suppressed. In addition, making the F0 sequence continuous can improve the modeling accuracy of the constructed acoustic model.
 It was shown, albeit by subjective evaluation, that the quality of the synthesized speech produced by the speech synthesis system according to the present embodiment can be sufficiently improved compared with the conventional method.
 The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated not by the above description of the embodiments but by the claims, and is intended to include all modifications within the meaning and scope equivalent to the claims.
DESCRIPTION OF SYMBOLS 1 multilingual translation system, 2 network, 4 user, 10 service providing apparatus, 12 analysis unit, 14 learning unit, 18 speech synthesis unit, 20 speech recognition unit, 22 translation unit, 24 communication processing unit, 30 portable terminal, 100 processor, 102 main memory, 104 display, 106 input device, 108 network interface, 110 internal bus, 112 secondary storage device, 120 F0 extraction unit, 121 analysis program, 122 periodic/aperiodic component extraction unit, 124 feature extraction unit, 126 F0 interpolation unit, 128 spectral envelope extraction unit, 130 input speech, 132 text, 134 optical drive, 136 optical disc, 140 model learning unit, 141 learning program, 142, 182 acoustic feature sequence, 162 text analysis unit, 164 context label generation unit, 166 context label sequence, 180 acoustic feature estimation unit, 181 speech synthesis program, 184, 200, 250 pulse generation unit, 186 periodic component generation unit, 187, 208 addition unit, 188 aperiodic component generation unit, 201 speech recognition program, 204 Gaussian noise generation unit, 221 translation program, 252 white noise generation unit, 254 switching unit, 256 speech synthesis filter.

Claims (6)

  1.  A speech synthesis system according to statistical parametric speech synthesis, comprising:
     a first extraction unit that extracts, for each unit section, a fundamental frequency of a speech waveform corresponding to a known text;
     a second extraction unit that extracts, for each unit section, a periodic component and an aperiodic component from the speech waveform;
     a third extraction unit that extracts spectral envelopes of the extracted periodic component and aperiodic component;
     a generation unit that generates a context label based on context information of the known text; and
     a learning unit that constructs a statistical model by learning, in association with each other, acoustic features including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context label.
  2.  The speech synthesis system according to claim 1, further comprising:
     a determination unit that, in response to input of an arbitrary text, determines a context label based on context information of the text;
     an estimation unit that estimates, from the statistical model, acoustic features corresponding to the context label determined by the determination unit, the estimated acoustic features including a fundamental frequency, a spectral envelope of a periodic component, and a spectral envelope of an aperiodic component;
     a first reconstruction unit that reconstructs the periodic component by filtering a pulse sequence, generated according to the fundamental frequency included in the estimated acoustic features, according to the spectral envelope of the periodic component;
     a second reconstruction unit that reconstructs the aperiodic component by filtering a noise sequence according to the spectral envelope of the aperiodic component; and
     an addition unit that adds the reconstructed periodic component and aperiodic component and outputs the result as a speech waveform corresponding to the input arbitrary text.
  3.  The speech synthesis system according to claim 1 or 2, wherein the second extraction unit extracts only the aperiodic component from a unit section for which the first extraction unit cannot extract the fundamental frequency, and extracts the periodic component and the aperiodic component from the other unit sections.
  4.  The speech synthesis system according to any one of claims 1 to 3, wherein the first extraction unit determines, by interpolation processing, a fundamental frequency for a unit section for which the fundamental frequency cannot be extracted.
  5.  A speech synthesis program for realizing a speech synthesis method according to statistical parametric speech synthesis, the speech synthesis program causing a computer to execute:
     extracting, for each unit section, a fundamental frequency of a speech waveform corresponding to a known text;
     extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform;
     extracting spectral envelopes of the extracted periodic component and aperiodic component;
     generating a context label based on context information of the known text; and
     constructing a statistical model by learning, in association with each other, acoustic features including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context label.
  6.  A speech synthesis method according to statistical parametric speech synthesis, comprising:
     extracting, for each unit section, a fundamental frequency of a speech waveform corresponding to a known text;
     extracting, for each unit section, a periodic component and an aperiodic component from the speech waveform;
     extracting spectral envelopes of the extracted periodic component and aperiodic component;
     generating a context label based on context information of the known text; and
     constructing a statistical model by learning, in association with each other, acoustic features including the fundamental frequency, the spectral envelope of the periodic component, and the spectral envelope of the aperiodic component, and the corresponding context label.
PCT/JP2018/006165 2017-02-28 2018-02-21 Speech synthesis system, speech synthesis program, and speech synthesis method WO2018159402A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017-037151 2017-02-28
JP2017037151A JP6802958B2 (en) 2017-02-28 2017-02-28 Speech synthesis system, speech synthesis program and speech synthesis method

Publications (1)

Publication Number Publication Date
WO2018159402A1 true WO2018159402A1 (en) 2018-09-07

Family

ID=63371228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2018/006165 WO2018159402A1 (en) 2017-02-28 2018-02-21 Speech synthesis system, speech synthesis program, and speech synthesis method

Country Status (2)

Country Link
JP (1) JP6802958B2 (en)
WO (1) WO2018159402A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020230926A1 (en) * 2019-05-15 2020-11-19 엘지전자 주식회사 Voice synthesis apparatus for evaluating quality of synthesized voice by using artificial intelligence, and operating method therefor
CN114360587A (en) * 2021-12-27 2022-04-15 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for identifying audio
CN114550733A (en) * 2022-04-22 2022-05-27 成都启英泰伦科技有限公司 Voice synthesis method capable of being used for chip end

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020145472A1 (en) * 2019-01-11 2020-07-16 네이버 주식회사 Neural vocoder for implementing speaker adaptive model and generating synthesized speech signal, and method for training neural vocoder
WO2020158891A1 (en) * 2019-02-01 2020-08-06 ヤマハ株式会社 Sound signal synthesis method and neural network training method
JP7359164B2 (en) 2019-02-06 2023-10-11 ヤマハ株式会社 Sound signal synthesis method and neural network training method
US11232780B1 (en) 2020-08-24 2022-01-25 Google Llc Predicting parametric vocoder parameters from prosodic features
WO2023281555A1 (en) * 2021-07-05 2023-01-12 日本電信電話株式会社 Generation method, generation program, and generation device
CN113838453B (en) * 2021-08-17 2022-06-28 北京百度网讯科技有限公司 Voice processing method, device, equipment and computer storage medium
CN114373445B (en) * 2021-12-23 2022-10-25 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05108097A (en) * 1991-10-19 1993-04-30 Ricoh Co Ltd Speech synthesizing device
JP2011247921A (en) * 2010-05-24 2011-12-08 Nippon Telegr & Teleph Corp <Ntt> Signal synthesizing method, signal synthesizing apparatus, and program
JP2012058293A (en) * 2010-09-06 2012-03-22 National Institute Of Information & Communication Technology Unvoiced filter learning apparatus, voice synthesizer, unvoiced filter learning method, and program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05108097A (en) * 1991-10-19 1993-04-30 Ricoh Co Ltd Speech synthesizing device
JP2011247921A (en) * 2010-05-24 2011-12-08 Nippon Telegr & Teleph Corp <Ntt> Signal synthesizing method, signal synthesizing apparatus, and program
JP2012058293A (en) * 2010-09-06 2012-03-22 National Institute Of Information & Communication Technology Unvoiced filter learning apparatus, voice synthesizer, unvoiced filter learning method, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"in japanese", IN JAPANESE, 1 March 2017 (2017-03-01) *
MAIA, RANNIERY ET AL.: "An Excitation Model for HMM-Based Speech Synthesis Based on Residual Modeling", 6TH ISCA WORKSHOP ON SPEECH SYNTHESIS, 22 August 2007 (2007-08-22), pages 131 - 136, XP055543103 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020230926A1 (en) * 2019-05-15 2020-11-19 엘지전자 주식회사 Voice synthesis apparatus for evaluating quality of synthesized voice by using artificial intelligence, and operating method therefor
US11705105B2 (en) 2019-05-15 2023-07-18 Lg Electronics Inc. Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same
CN114360587A (en) * 2021-12-27 2022-04-15 北京百度网讯科技有限公司 Method, apparatus, device, medium and product for identifying audio
CN114550733A (en) * 2022-04-22 2022-05-27 成都启英泰伦科技有限公司 Voice synthesis method capable of being used for chip end
CN114550733B (en) * 2022-04-22 2022-07-01 成都启英泰伦科技有限公司 Voice synthesis method capable of being used for chip end

Also Published As

Publication number Publication date
JP2018141915A (en) 2018-09-13
JP6802958B2 (en) 2020-12-23

Similar Documents

Publication Publication Date Title
WO2018159402A1 (en) Speech synthesis system, speech synthesis program, and speech synthesis method
Oord et al. Wavenet: A generative model for raw audio
Van Den Oord et al. Wavenet: A generative model for raw audio
Takamichi et al. Postfilters to modify the modulation spectrum for statistical parametric speech synthesis
Tokuda et al. Speech synthesis based on hidden Markov models
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
WO2018159403A1 (en) Learning device, speech synthesis system, and speech synthesis method
JP2008242317A (en) Meter pattern generating device, speech synthesizing device, program, and meter pattern generating method
Hwang et al. LP-WaveNet: Linear prediction-based WaveNet speech synthesis
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
Wang et al. fairseq s^ 2: A scalable and integrable speech synthesis toolkit
Adiga et al. Acoustic features modelling for statistical parametric speech synthesis: a review
WO2015025788A1 (en) Quantitative f0 pattern generation device and method, and model learning device and method for generating f0 pattern
JP6631883B2 (en) Model learning device for cross-lingual speech synthesis, model learning method for cross-lingual speech synthesis, program
US9058820B1 (en) Identifying speech portions of a sound model using various statistics thereof
JP2010139745A (en) Recording medium storing statistical pronunciation variation model, automatic voice recognition system, and computer program
Kathania et al. Explicit pitch mapping for improved children’s speech recognition
Shahnawazuddin et al. Studying the role of pitch-adaptive spectral estimation and speaking-rate normalization in automatic speech recognition
Nose Efficient implementation of global variance compensation for parametric speech synthesis
JP7423056B2 (en) Reasoners and how to learn them
Giacobello et al. Stable 1-norm error minimization based linear predictors for speech modeling
Van Nguyen et al. Development of Vietnamese speech synthesis system using deep neural networks
Al-Radhi et al. Continuous vocoder applied in deep neural network based voice conversion
Li et al. Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
Sunil et al. Children's Speech Recognition Under Mismatched Condition: A Review

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18761759

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18761759

Country of ref document: EP

Kind code of ref document: A1