WO2010104040A1 - Voice synthesis apparatus based on single-model voice recognition synthesis, voice synthesis method and voice synthesis program - Google Patents

Voice synthesis apparatus based on single-model voice recognition synthesis, voice synthesis method and voice synthesis program

Info

Publication number
WO2010104040A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
articulation
unit
sequence
speech synthesis
Prior art date
Application number
PCT/JP2010/053802
Other languages
French (fr)
Japanese (ja)
Inventor
Tsuneo Nitta (新田 恒雄)
Original Assignee
Toyohashi University of Technology (国立大学法人豊橋技術科学大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyohashi University of Technology
Priority to JP2011503812A (granted as JP5574344B2)
Publication of WO2010104040A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/14Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142Hidden Markov Models [HMMs]

Definitions

  • Two types of technology, speech recognition and speech synthesis, are known as user interfaces using speech input/output.
  • In speech recognition, pattern recognition processing using phonemes, syllables, words, and the like as recognition units has generally been performed on the results of feature analysis such as the frequency spectrum. This is based on the assumption that the human auditory nervous system has spectrum analysis capability and that higher-level language processing is performed in the cerebrum on the spectrum time series.
  • The speech recognition apparatuses developed so far classify words or word strings based on acoustic features composed of spectral time series.
  • In speech synthesis, the waveform concatenation method and the vocoder method are mainly used.
  • In the waveform concatenation method, speech is generated by connecting waveform segments in units such as phonemes.
  • The vocoder method simulates the articulatory motion of human speech production and uses the motion information of the vocal organs separately from source information such as vocal cord vibration. Specifically, parameters reflecting the movement of the vocal organs, that is, the articulatory motion, are extracted from the speech by PARCOR analysis or the like; segments consisting of this spectral envelope information are concatenated, and pitch pulses or a noise sequence are supplied to the excitation source to generate speech.
  • From recent brain research, the hypothesis that humans perceive speech as articulatory motion rather than as an acoustic signal is regarded as promising (see Non-Patent Document 1).
  • HMM: hidden Markov model
  • This method applies the HMM that is currently standard in speech recognition; the operation of the system is shown in FIG. 1.
  • The HMM training part, not shown in the figure, learns a spectral parameter sequence (here, mel-frequency cepstral coefficients, hereinafter sometimes referred to as MFCC) and pitch parameters with an HMM based on a multi-space probability distribution, using the Baum-Welch algorithm. At this time, a state duration distribution is constructed from the trellis obtained when the HMM 101 expressing the spectrum sequence of a specific speaker is trained on continuous speech.
  • MFCC: mel-frequency cepstral coefficient
  • In the synthesis part, each state of the HMM is extended according to the state duration distribution, and the excitation waveform generated from the obtained spectrum and pitch is passed through an MLSA (mel log spectrum approximation) synthesis filter to obtain the synthesized speech waveform.
  • MLSA: mel log spectrum approximation
  • In this conventional method, the synthesis part is configured by a specific-speaker HMM created from the speech spectrum information of that specific speaker.
  • An object of the present invention is to provide a speech synthesis apparatus, a speech synthesis method, and a speech synthesis program based on one-model speech recognition synthesis that realize both high speech recognition performance for unspecified speakers and clear speech synthesis for a specific individual.
  • A phoneme-unit articulatory motion storage unit stores in advance a state transition model of articulatory motion for each fixed speech unit.
  • The speech synthesis apparatus based on one-model speech recognition synthesis comprises a speech recognition part that performs speech recognition with reference to the state transition model and a speech synthesis part that performs speech synthesis while acquiring the optimal articulation sequence from the state transition model.
  • The speech recognition part includes speech acquisition means for acquiring speech, articulatory feature extraction means for extracting articulatory features of the acquired speech, first storage control means for storing the extracted articulatory features in storage means, and optimal speech unit sequence identification means for comparing the articulatory feature time-series data read from the storage means with the state transition model to identify the optimal speech unit sequence.
  • The speech synthesis part includes optimal articulatory feature sequence generation means for estimating the optimal state sequence of articulatory motion from the optimal speech unit sequence and generating an articulatory feature sequence, second storage control means for storing the generated optimal articulatory feature sequence data in storage means, speech synthesis parameter sequence conversion means for converting the articulatory feature sequence data read from the storage means into a speech synthesis parameter sequence, third storage control means for storing the converted speech synthesis parameter sequence in storage means, and means for synthesizing speech from the speech synthesis parameters read from the storage means and a driving excitation signal.
  • The phoneme-unit articulatory motion storage unit stores a coefficient set of a hidden Markov model (HMM) expressing articulatory motion, which can be referenced from the optimal speech unit sequence identification means of the speech recognition part and the optimal articulatory feature sequence generation means of the speech synthesis part.
  • HMM: hidden Markov model
  • The articulatory feature extraction means comprises an analysis filter that Fourier-analyzes the digital speech signal, a local feature extraction unit having a time-axis differential feature extraction unit and a frequency-axis differential feature extraction unit, and a discriminative phoneme feature extraction unit consisting of a multilayer neural network arranged in one or more stages.
  • The means for synthesizing speech from the speech synthesis parameters and the driving excitation signal is provided with a driving excitation codebook, means for selecting the optimal driving excitation by comparing the speech synthesized from the speech synthesis parameters and a driving excitation code with the original training speech, and means for registering the selected driving excitation code in the corresponding articulatory motion state transition model.
  • In the speech synthesis method, a phoneme-unit articulatory motion storage unit stores in advance a state transition model of articulatory motion for each fixed speech unit, and speech recognition is performed with reference to the state transition model.
  • The speech synthesis method based on one-model speech recognition synthesis comprises a speech recognition part that performs speech recognition with reference to the state transition model and a speech synthesis part that performs speech synthesis while acquiring the optimal articulation sequence from the state transition model.
  • The speech recognition part includes a speech acquisition step of acquiring speech, an articulatory feature extraction step of extracting articulatory features of the acquired speech, a first storage control step of storing the extracted articulatory features in storage means, and an optimal speech unit sequence identification step of comparing the articulatory feature time-series data read from the storage means with the state transition model to identify the optimal speech unit sequence.
  • The speech synthesis part includes an optimal articulatory feature sequence generation step of estimating the optimal state sequence of articulatory motion from the optimal speech unit sequence and generating an articulatory feature sequence, a second storage control step of storing the generated optimal articulatory feature sequence data in storage means, a speech synthesis parameter sequence conversion step of converting the articulatory feature sequence data read from the storage means into a speech synthesis parameter sequence, a third storage control step of storing the converted speech synthesis parameter sequence in storage means, and a step of synthesizing speech from the speech synthesis parameters read from the storage means and a driving excitation signal.
  • The state transition model is created using multi-speaker speech.
  • The means (or step) for converting the articulatory feature sequence data into a speech synthesis parameter sequence, initially created from the speech of the specific speaker alone or from unspecified speakers, is created by adaptive learning with the speech of the specific speaker.
  • Unlike the conventional HMM synthesizer, which uses "information based on the spectrum" of a specific speaker, the speech synthesis apparatus of the invention according to claim 1 extracts "information based on articulatory motion" and constructs the HMM synthesis part from it. Because the HMM synthesis part is composed of parameters that are essentially invariant across speakers, training speech data from each speaker is unnecessary, or only a very small amount is needed, for the HMM part. To generate speech, the articulatory motion must be converted into the motion of a specific speaker's speech organs, but this part can be realized with a small amount of speech data.
  • In conventional systems, the spectrum varies greatly with the speaker, the phonetic context, the surrounding noise, and so on, so designing the HMM used to obtain the acoustic likelihood required a large amount of speech data.
  • Because the articulatory feature is the input feature to the HMM, sufficient phoneme recognition performance can be obtained even with a small number of training speakers, and the number of HMM mixture components can be reduced.
  • In the speech synthesis apparatus of the invention, since the HMM coefficient set expressing articulatory motion is stored in the phoneme-unit articulatory motion storage unit, the optimal speech unit sequence identification means and the optimal articulatory feature sequence generation means that reference it realize speech recognition and speech synthesis with parameters that are essentially speaker-invariant.
  • The speech synthesis apparatus according to claim 4 extracts "information based on the articulatory motion of many unspecified speakers" instead of the "information based on the spectrum of a specific speaker" used by the conventional HMM synthesizer, and configures the HMM synthesis apparatus from it.
  • The HMM synthesis part can therefore be shared across speakers, with the advantage that per-speaker training speech data is in principle unnecessary for the HMM part.
  • Speech synthesis is thus separated into the articulatory motion command part directed at the speech organs and the part related to the speech organs and their motions, which differ from person to person; the former is made speaker-independent by using articulatory feature data from many speakers.
  • Unlike the conventional HMM synthesis method, which uses "information based on the spectrum" of a specific speaker, the speech synthesis method of the invention according to claim 6 extracts "information based on articulatory motion" and constructs the HMM synthesis method from it.
  • Because the HMM synthesis part is composed of articulatory-motion parameters that are essentially speaker-invariant, per-speaker training speech data is unnecessary, or only a very small amount is required, for the HMM part.
  • In conventional methods, the spectrum varies greatly with the speaker, the phonetic context, the surrounding noise, and so on, so designing the HMM used to obtain the acoustic likelihood required a large amount of speech data.
  • Because the articulatory feature is the input feature to the HMM, sufficient phoneme recognition performance can be obtained even with a small number of training speakers, and the number of HMM mixture components can be reduced.
  • Since the articulatory feature extraction step consists of the local feature extraction step and the discriminative phoneme feature extraction step, discriminative features based on articulatory motion become the input features to the HMM, so sufficient phoneme recognition performance can be obtained with a small number of training speakers.
  • The speech synthesis method according to claim 9 extracts "information based on the articulatory motion of many unspecified speakers" instead of the "information based on the spectrum of a specific speaker" used in the conventional HMM synthesis method, and configures the HMM synthesis method from it.
  • The HMM synthesis part can therefore be shared across speakers, with the advantage that per-speaker training speech data is in principle unnecessary for the HMM part.
  • Speech synthesis is thus separated into the articulatory motion command part directed at the speech organs and the part related to the speech organs and their motions, which differ from person to person; the former is made speaker-independent by using articulatory feature data from many speakers.
  • For the driving excitation signal, which greatly affects the quality of synthesized speech, the speech synthesis method of the invention according to claim 10 applies a concept similar to the closed-loop learning of CELP widely used in speech communication (see Non-Patent Document 4) and to the PSOLA technique widely used for waveform synthesis (see Non-Patent Document 5).
  • The optimal driving excitation code is selected and registered in the corresponding articulatory motion state transition model, and high-quality speech can be obtained by referring to it during speech synthesis.
  • Since the speech synthesis program of the invention according to claim 11 can operate a computer as the speech synthesis processing means according to any of claims 1 to 5, it can produce the effects of the inventions according to claims 1 to 5.
  • Since the speech synthesis program of the invention according to claim 12 can operate a computer as each processing step of the speech synthesis method according to any of claims 6 to 10, it can produce the effects of the inventions according to claims 6 to 10.
  • FIG. 2 shows an electrical configuration of the speech synthesizer 1.
  • the speech synthesizer 1 includes a central processing unit 11, an input device 12, an output device 13, a storage device 14, and an external storage device 15.
  • the central processing unit 11 is provided for performing processing such as numerical computation and control, and performs computation and processing according to the processing procedure described in the present embodiment.
  • a CPU or the like can be used.
  • the input device 12 is configured by a microphone, a keyboard, or the like, and inputs a voice uttered by a user or a character string input by a key.
  • the output device 13 includes a display, a speaker, and the like, and outputs a voice synthesis result or information obtained by processing the voice synthesis result.
  • the storage device 14 stores processing procedures (speech synthesis program) executed by the central processing unit 11 and temporary data necessary for the processing.
  • ROM: read-only memory
  • RAM: random-access memory
  • The external storage device 15 stores the articulatory feature sequence set used for the speech synthesis process, the neural network weight coefficient set used for articulatory feature extraction, and the coefficient set used for converting articulatory feature sequence data into a speech synthesis parameter sequence.
  • A hard disk drive (HDD) can be used as the external storage device 15, and these components are electrically connected through a bus.
  • HDD: hard disk drive
  • The hardware configuration of the speech synthesizer 1 of the present invention is not limited to the configuration shown in FIG. 2; for example, a communication I/F connected to a communication network such as the Internet may be provided.
  • the speech synthesizer 1 and the speech synthesis program have a configuration independent of other systems, but the present invention is not limited to this configuration. Therefore, a configuration incorporated as a part of another device or a configuration incorporated as a part of another program may be employed. Further, the input in that case is indirectly performed through the other devices and programs described above.
  • The stored data is divided into areas of the external storage device 15; as shown in FIG. 2, these include the articulation feature storage area 16 in which articulatory features are stored, the hidden Markov model storage area 17 in which the hidden Markov models are stored, and the further storage areas described below.
  • the articulation feature storage area 16 stores a discrimination feature series of speech.
  • Distinctive features were proposed to classify phonemes based on structural features related to articulation, such as voiced/unvoiced, continuant, semivowel, plosive, and fricative.
  • many methods for directly extracting articulatory features such as discriminative features from speech have been proposed, including a method using a neural network (see Non-Patent Document 6).
  • the hidden Markov model storage area 17 stores a hidden Markov model that is referred to when speech recognition or speech synthesis is performed in the central processing unit 11.
  • the optimum articulation feature sequence storage area 18 stores an optimum articulation feature sequence as a result of searching the central processing unit 11 with reference to the hidden Markov model.
  • the input voice storage area 19 stores voice data input via the input device 12.
  • the speech synthesis parameter storage area 20 stores a speech synthesis parameter as a result calculated by the central processing unit 11 with reference to the weighting coefficient (coefficient storage area 23) of the neural network.
  • the synthesized speech storage area 21 stores the synthesized speech data obtained as a result of referring to the speech synthesis parameter 20 and the driving sound source codebook set in the coefficient storage area 23 in the central processing 11.
  • the processing result storage area 22 stores data obtained as a result of various processes executed in the central processing unit 11.
  • The coefficient storage area 23 stores the neural network weight coefficient set for extracting articulatory features, the neural network weight coefficient set used for converting articulatory feature sequence data into speech synthesis parameters, and the driving excitation codebook set used for speech synthesis. Details of these data will be described later.
  • the discriminative phoneme features used for the discriminative feature series stored in the articulation feature storage area 16 will be described in detail.
  • Japanese phonemes are shown in FIG. 3 as distinctive phonemic features (hereinafter, sometimes referred to as DPF).
  • the discriminative phoneme feature is one method of expressing articulatory features.
  • In FIG. 3, the vertical axis lists the distinctive features and the horizontal axis lists the individual phonemes; (+) means the phoneme has that distinctive feature and (-) means it does not.
  • For distinctive phoneme features of languages other than Japanese, distinctive features or phonemes specific to that language are considered in addition to the ones shown here.
  • nil (high/low) assigns a distinctive feature to phonemes whose tongue position is neither high nor low, and nil (front/back) assigns a distinctive feature to phonemes in which the position where the tongue rises is neither front nor back; these indicate newly added features.
  • These additions improve speech recognition performance by balancing the feature assignments across phonemes.
  • IPA: International Phonetic Alphabet
  • This IPA table is divided into consonant and vowel tables, and the consonants are classified by the articulation position and articulation method.
  • The place of articulation includes the lips, alveolar ridge, hard palate, soft palate, glottis, and so on, and the manner of articulation includes plosive, fricative, flap, nasal, semivowel, and so on.
  • Each is further divided into voiced and unvoiced; for example, /p/ is a consonant classified as an unvoiced, labial, plosive sound.
  • Vowels are classified according to the place where the tongue is highest and the size of the space between the tongue and the palate.
  • The place where the tongue is highest is distinguished as front, central, or back, and the space between the tongue and the palate is classified as close, half-close, half-open, or open; for example, /i/ is a front, close vowel.
  • The articulatory features a phoneme possesses (for /p/, for example: consonant, unvoiced, labial, plosive) are marked +, and the remaining features are marked -.
  • The spectrum fluctuates greatly with the speaker, the phonetic context, the ambient noise, and so on, so designing the HMM used to obtain the acoustic likelihood required a large amount of speech data.
  • In conventional HMMs, the speech spectrum is used as the input feature, and the fluctuation of each vector element is expressed by a plurality of normal distributions.
  • MFCC: mel-frequency cepstrum
  • DCT: discrete cosine transform
  • FIG. 4 shows a graph comparing the phoneme recognition performance obtained when phoneme HMMs are trained with MFCC as the input feature and the performance obtained when articulatory features (specifically, the distinctive phonemic features (DPF) described later) are the input features to the HMM.
  • The horizontal axis indicates the number of mixture components (1, 2, 4, 8, 16 from the left) needed to express the HMM; the amount of computation required for recognition also increases with the number of mixtures.
  • The bars shown for each mixture number indicate the number of male speakers used for HMM training: from the left, 1, 2, 4, 8, and 33 speakers, with x indicating 100 speakers.
  • HMMs: articulatory motion state transition models
  • FIG. 5 is a functional block diagram showing speech recognition and speech synthesis processing executed by the speech synthesizer 1.
  • As shown in this figure, the functional blocks necessary for the speech recognition and speech synthesis processing executed in the speech synthesizer 1 include an input unit 201, an A/D conversion unit 202, an articulation feature extraction unit 210, and a speech recognition unit 220, among others.
  • the articulation feature calculation storage unit 207 stores various coefficient sets 2071 for speech analysis, neural network weighting coefficient sets for articulation feature calculation, and the like.
  • the phoneme unit articulation movement storage unit 225 stores a coefficient set 2251 of an HMM model expressing the articulation movement.
  • The stored coefficient set 2251 can be referenced by the speech recognition unit 220 and by the optimal articulation feature sequence / speech synthesis parameter conversion unit 230.
  • the speech synthesis storage unit 235 stores a speech synthesis parameter set 2351 that is a calculation result of the optimum articulation feature sequence / speech synthesis parameter conversion unit 230 and a driving excitation codebook 2352.
  • the speech synthesizer 240 configures a digital filter using a speech synthesis parameter (corresponding to a change in the vocal tract shape) as a coefficient, and synthesizes speech using the drive excitation input read from the drive excitation codebook 2352.
  • the synthesized voice is sent to the output unit 205 via the D / A conversion unit 206 and sent out from the speaker.
  • the input unit 201 is provided for receiving sound input from the outside and converting it into an analog electric signal.
  • the A / D conversion unit 202 is provided to convert an analog signal received by the input unit 201 into a digital signal.
  • The articulatory feature extraction unit 210 is provided to extract the predetermined feature quantities necessary for speech recognition; from the time-series feature data extracted by the analysis filter, it extracts time-series data of articulatory features (hereinafter, the "articulation feature series").
  • the speech recognition unit 220 is provided to search for phonemes, syllables, words, and the like included in speech from the articulation feature series obtained from the articulation feature extraction unit 210.
  • the articulation feature extraction unit 210 that extracts the articulation feature from the digital signal includes an analysis filter 211, a local feature extraction unit 212, and a discriminative (phoneme) feature extraction unit 213.
  • the digital signal converted by the A / D converter 202 is subjected to Fourier analysis (using a Hamming window having a window width of 24 to 32 msec). Next, it is passed through a band pass filter of about 24 channels to extract frequency components. As a result, a speech spectrum sequence and a speech power sequence at intervals of 5 to 10 msec are extracted. The obtained speech spectrum sequence and speech power sequence are output to local feature extraction section 212.
  • the time axis differential feature extraction unit 2121 and the frequency axis differential feature extraction unit 2122 extract differential features in the time axis direction and the frequency direction.
  • the time axis differential feature of the audio power sequence is calculated separately.
  • linear regression calculation is used to suppress the influence of noise fluctuations and the like.
  • the extracted local features are output to the discriminative phoneme feature extraction unit 213.
  • the discriminative phoneme feature extraction unit 213 extracts the articulation feature series based on the local features extracted by the local feature extraction unit 212.
  • the discriminative phoneme feature extraction unit 213 includes two-stage neural networks 2131 and 2132.
  • The neural network constituting the discriminative phoneme feature extraction unit 213 consists of a two-stage circuit comprising a first multilayer neural network 2131 in the first stage and a second multilayer neural network 2132 in the next stage. The first multilayer neural network 2131 extracts an articulation feature sequence from the correlation between local features obtained from the speech spectrum sequence and the speech power sequence. The second multilayer neural network 2132 extracts a meaningful subspace from the context information of the articulation feature series, that is, from the interdependence between frames, and obtains an accurate articulation feature series.
  • FIG. 7 shows an example of the articulation feature extraction result calculated by the discriminative phoneme feature extraction unit 213. This figure shows the articulation feature extraction result obtained for the utterance “jinkose” which is the Japanese reading of “artificial satellite”. In this way, it is understood that the articulation features extracted by the two-stage neural networks 2131 and 2132 have high accuracy.
  • the configuration of the neural network for obtaining the articulatory feature sequence may be a one-stage configuration at the expense of performance (see Non-Patent Document 3).
  • Each neural network has a hierarchical structure, and has one or two hidden layers excluding an input layer and an output layer (this is called a multilayer neural network).
  • a so-called recurrent neural network having a structure that feeds back from the output layer or hidden layer to the input layer may be used.
  • the results calculated in each neural network are not significantly different.
  • These neural networks function as articulatory feature extractors through learning of the weighting coefficient shown in Non-Patent Document 7 (see Non-Patent Document 7).
  • learning by the neural network of the discriminative phoneme feature extraction unit 213 is performed by adding voice local feature data to the input layer and giving the voice articulation feature to the output layer as a teacher signal.
  • the input unit 201 corresponds to the voice acquisition unit of the invention according to the speech synthesizer
  • the articulation feature extraction unit 210 corresponds to the articulation feature extraction unit.
  • the voice recognition unit 220 corresponds to an optimum voice unit sequence identification unit
  • the central processing unit 11 corresponds to each storage control unit
  • the external storage unit 15 corresponds to each storage unit.
  • the phoneme unit articulation motion storage unit 225 corresponds to the phoneme unit articulation motion storage unit
  • the HMM based on the articulation characteristics of the unspecified speaker stored therein corresponds to the state transition model of articulation motion.
  • the steps processed based on these functions correspond to the steps in the speech recognition unit of the invention according to the speech synthesis method.
  • The optimal articulation feature sequence / speech synthesis parameter conversion unit 230 generates speech synthesis parameters while referring to the HMM coefficient set 2251 expressing the articulatory motion stored in the phoneme unit articulation motion storage unit 225, and outputs them to the speech synthesis unit 240. Note that text data (or speech data) input through the input unit 201 is used as the data to be synthesized.
  • FIG. 8 is an explanatory diagram of the operation of the optimal articulation feature sequence / speech synthesis parameter converter 230 in HMM speech synthesis.
  • The articulation feature sequence / speech synthesis parameter (here, PARCOR coefficient) conversion unit 230 is configured by training a neural network whose inputs are the articulation features and whose teacher data are the corresponding PARCOR coefficients.
  • the HMM is a probabilistic model that expresses a non-stationary time series signal by making a state transition between a plurality of stationary signal sources, and is suitable for the expression of a time series that varies due to various factors such as speech.
  • a multidimensional normal mixed distribution represented by a weighted sum of multidimensional normal distributions is often used, and this embodiment is also the same. As a result, it is possible to finely model complex fluctuations caused by the speaker and the surrounding environment.
  • The training of the HMM model parameters $\lambda$ is formulated, as shown in Equation 1, as finding the $\lambda$ that maximizes the observation likelihood: $\hat{\lambda} = \operatorname*{arg\,max}_{\lambda}\, P(O \mid \lambda)$.
  • the driving sound source illustrated in FIG. 8 is created by multi-streams of articulation feature sequences and driving sound source codes when performing HMM learning using learning speech data.
  • The (residual) segment with the smallest error is selected, and its driving excitation code is simultaneously registered in the corresponding articulatory motion state.
  • In this way, high-quality synthesized speech can be obtained: the speech waveforms obtained by passing each candidate driving excitation through the synthesis filter (PARCOR synthesis filter) are compared with the original waveform, and the driving excitation code with the smallest error is selected.
  • A compact and efficient driving excitation codebook can be configured by clustering the training speech data, registering representative segments, and giving the registered codebook a tree structure.
  • the portion of the optimal articulation feature sequence / speech synthesis parameter conversion unit 230 that acquires the optimal articulation feature sequence with reference to the HMM coefficient set 2251 (see FIG. 8) is related to the speech synthesizer. It corresponds to an optimal articulation feature sequence generation unit, and a PARCOR coefficient conversion unit corresponds to a speech synthesis parameter sequence conversion unit. Further, the speech synthesizer (PARCOR synthesis filter) 240 corresponds to means for synthesizing speech from speech synthesis parameters and drive sound source signals.
  • the central processing unit 11 corresponds to each storage control unit
  • the external storage unit 15 corresponds to each storage unit
  • The phoneme unit articulation motion storage unit 225 corresponds to the phoneme-unit articulatory motion storage unit, and, as in the case of the speech recognition apparatus, the HMM based on the articulation features of unspecified speakers stored in it corresponds to the state transition model of articulatory motion.
  • the steps processed based on these functions correspond to the steps in the speech synthesizer of the invention relating to the speech synthesis method.
  • The excitation waveform created from the driving excitation codebook of this embodiment was compared with the original waveform: (a) is the residual excitation waveform extracted from the original speech, (b) is the conventionally used excitation waveform approximated by a pulse train and noise, and (c) is the excitation waveform created from the driving excitation codebook of this embodiment. It can be seen that the excitation waveform created from the excitation codebook is close to the residual waveform obtained when the original speech is subjected to PARCOR analysis.
  • FIG. 11A shows the spectrum of the original speech
  • FIG. 11B shows the spectrum of the synthesized speech obtained by converting the articulation feature series into the speech synthesis parameters (PARCOR coefficient sequence) based on the articulation features obtained from the speech
  • FIG. 11(c) shows the spectrum of the synthesized speech of this embodiment (HMM / DPF / PARCOR analysis).
  • Although the high-frequency part of the spectrum of the synthesized speech of this embodiment is smoothed by the HMM, it can be seen that the spectral shape of the original speech is maintained.
  • The spectrum of (b) is also similar to (c), and it can be used in talkback to check the articulation feature extraction result of the input speech when confirming a speech recognition result.
  • Synthesized speech waveforms were also compared. In FIG. 12, (a) is the original speech waveform, (b) is the speech waveform synthesized using the excitation waveform approximated by a pulse train and noise, and (c) and (d) are speech waveforms synthesized using the driving excitation codebook. Note that (c) is based on the driving excitation codebook of a specific speaker, and (d) is based on the driving excitation codebook of unspecified speakers. As is clear from this figure, (c) and (d) yield waveforms close to the original speech.
  • In (d) the driving excitation codebook is created from the voices of an unspecified number of speakers, whereas in (c) the codebook is created only from the specific speaker whose speech (via articulation feature extraction) was used to train the multilayer neural network for speech synthesis parameter conversion. A slight degradation is seen in (d) compared with (c), so a tuning process for the specific speaker is required; the sound quality can be improved by including a small amount of the specific speaker's speech when training the codebook created from a large number of unspecified speakers' voices.
  • Similarly, the conversion accuracy can be improved by adaptively training with a small amount of the user's (specific speaker's) speech on top of a large amount of unspecified-speaker speech.
  • In the embodiment described above, speech is acquired, the articulation feature series is extracted, the optimal articulation sequence is obtained from the HMM articulatory motion model, converted into speech synthesis parameters, and synthesized speech is output.
  • The present invention is not limited to such use; as an ordinary speech synthesizer does, a kanji-kana mixed sentence input from a keyboard can also be converted into a kana sequence and then synthesized.
  • As is easily seen, the distinctive phoneme features used as articulation features have a one-to-one correspondence with kana characters, so speech can easily be synthesized through kana-character to articulation-feature-series conversion.
  • FIG. 13 shows three possible usage forms: first, synthesizing speech from text input from a keyboard; second, displaying the recognition result text obtained by speech recognition on the display and re-synthesizing the recognition result as speech; and third, converting the output of the articulation feature extraction unit 40 (the extracted articulation features) with the articulation feature / vocal tract parameter conversion unit 43 for confirmation by voice (path 47 in the figure).
  • In the second usage form, the text of the speech recognition result is output and processed in the same manner as keyboard-input text; that is, the recognition result text (a word, word string, or sentence) is returned to the user as synthesized speech through the same process as in the first usage form.

Abstract

Disclosed are a voice synthesis apparatus, voice synthesis method and voice synthesis program capable of implementing voice synthesis of a specified individual with high quality using few items of learned voice data. The voice synthesis apparatus learns a transition model (225) of articulatory movement stored for each of fixed voice units such as phonemes, from an unspecified large number of speakers. The voice synthesis apparatus is provided with means (230) for converting to voice synthesis parameters that carry vocal tract shape information whereby a series of articulatory features is adapted to individuals and an optimum voice unit series is obtained at the same time by comparing this model with the input voice. In addition, the voice synthesis apparatus obtains high-quality synthesised voice for a specified individual by registering sound source code in a state transition model of articulatory movement using closed loop learning employing a drive sound source codebook.

Description

Speech synthesis apparatus, speech synthesis method, and speech synthesis program based on one-model speech recognition synthesis
 The present invention relates to a speech synthesis apparatus based on one-model speech recognition synthesis, a speech synthesis method based on one-model speech recognition synthesis, and a speech synthesis program based on one-model speech recognition synthesis. More specifically, it relates to a speech synthesis apparatus, speech synthesis method, and speech synthesis program that extract articulatory features from speech utterances, construct a state transition model of articulatory motion that can be used for speech recognition, and synthesize speech using that same articulatory motion state transition model. Here, "one model" means that a common (that is, a single) state transition model is used for both speech recognition and speech synthesis.
 Two technologies, speech recognition and speech synthesis, are known as user interfaces using speech input/output. In speech recognition, pattern recognition processing using phonemes, syllables, words, and the like as recognition units has generally been performed on the results of feature analysis such as the frequency spectrum. This is based on the assumption that the human auditory nervous system has spectrum analysis capability and that higher-level language processing is performed in the cerebrum on the spectrum time series. The speech recognition apparatuses developed so far therefore classify words or word strings based on acoustic features consisting of spectral time series.
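As a concrete illustration of the kind of front-end feature analysis described above, the following minimal sketch computes a framewise band-energy (spectrum) sequence and power sequence with NumPy. The window length, frame shift, and channel count loosely follow the figures given later in the embodiment (a 24-32 ms Hamming window, roughly 24 channels, 5-10 ms shift); the function name and exact parameter values are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def framewise_spectrum(x, fs=16000, win_ms=25, shift_ms=10, n_channels=24):
    """Split speech into Hamming-windowed frames and return a coarse
    band-energy (spectrum) sequence plus a frame-power sequence."""
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(win)
    n_frames = max(0, (len(x) - win) // shift + 1)
    n_fft = 512
    # Group FFT bins into n_channels roughly equal bands (a stand-in for
    # the ~24-channel band-pass filter bank described in the embodiment).
    edges = np.linspace(0, n_fft // 2 + 1, n_channels + 1, dtype=int)
    spec_seq, power_seq = [], []
    for t in range(n_frames):
        frame = x[t * shift: t * shift + win] * window
        mag = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        bands = np.array([mag[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
        spec_seq.append(np.log(bands + 1e-10))
        power_seq.append(np.log(frame @ frame + 1e-10))
    return np.array(spec_seq), np.array(power_seq)
```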
 Next, in speech synthesis technology, the waveform concatenation method and the vocoder method are mainly used. In the waveform concatenation method, speech is generated by connecting waveform segments in units such as phonemes. The vocoder method simulates the articulatory motion of human speech production and uses the motion information of the vocal organs separately from source information such as vocal cord vibration. Specifically, parameters reflecting the movement of the vocal organs, that is, the articulatory motion, are extracted from the speech by PARCOR analysis or the like; segments consisting of this spectral envelope information are concatenated, and pitch pulses or a noise sequence are supplied to the excitation source to generate speech.
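To make the vocoder idea concrete, the sketch below feeds a simple excitation (a pitch-pulse train for voiced sounds, white noise otherwise) through an all-pole lattice filter defined by PARCOR (reflection) coefficients. This is a generic textbook-style lattice written as an assumption about how such a synthesis stage could look; the coefficient values, sign convention, and function names are illustrative and are not taken from the patent.

```python
import numpy as np

def simple_excitation(n_samples, fs=16000, f0=120.0, voiced=True):
    """Pitch-pulse train for voiced sounds, white noise otherwise (the
    classical vocoder-style excitation mentioned in the text)."""
    if not voiced:
        return np.random.randn(n_samples) * 0.1
    e = np.zeros(n_samples)
    e[::int(fs / f0)] = 1.0
    return e

def parcor_synthesize(k, excitation):
    """All-pole lattice (PARCOR) synthesis filter: the excitation drives a
    lattice whose stages are defined by reflection coefficients k (|k| < 1).
    Sign conventions for lattice filters vary; this is one common choice."""
    k = np.asarray(k, dtype=float)
    M = len(k)
    b_prev = np.zeros(M)                 # delayed backward errors b_0..b_{M-1}
    y = np.zeros(len(excitation))
    for n, e in enumerate(excitation):
        f = e                            # f_M(n): excitation enters the top stage
        b_new = np.zeros(M)
        for i in range(M, 0, -1):        # work down the lattice stages
            f = f + k[i - 1] * b_prev[i - 1]
            if i <= M - 1:               # store b_i(n) only if reused next sample
                b_new[i] = b_prev[i - 1] - k[i - 1] * f
        b_new[0] = f                     # b_0(n) = y(n)
        b_prev = b_new
        y[n] = f
    return y

# toy usage: three reflection coefficients, 0.1 s of voiced excitation
y = parcor_synthesize([0.5, -0.3, 0.2], simple_excitation(1600))
```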
 Thus, current speech recognition and speech synthesis are realized as two different systems. In contrast, recent brain research has made promising the hypothesis that humans perceive speech as articulatory motion rather than as an acoustic signal (see Non-Patent Document 1).
 Regarding the processing of spoken language in the human brain, it was discovered in 1861 by P. P. Broca in France that Broca's area, which governs the movements of the articulatory muscles during speech, is deeply involved in utterance. When this area is damaged, Broca's aphasia (motor aphasia), in which the fluency of speech is lost, is observed, so the area was thought to be responsible mainly for the speech production system. Subsequently, Wernicke's area, which is involved in understanding the content of utterances, was discovered in 1884 by C. Wernicke in Germany. When this area is diseased, Wernicke's aphasia (sensory aphasia), in which fluent but error-ridden sentences are uttered, is observed, so it was considered to be mainly related to the speech understanding system. Since humans thus have two sets of organs, the speech organs and the auditory organs, and since the different functions of the two brain areas were observed as described above, the 2-system theory long prevailed. When H. Dudley first built the vocoder for speech synthesis in 1928, he depicted the articulation commands from the brain in his figures and realized, with vacuum tube circuits, a device that extracts the movement of the vocal organs with a bank of band filters while simultaneously extracting and transmitting the sound source. The idea of the vocoder was later completed as linear predictive coding (LPC) by F. Itakura and B. Atal in 1969 and forms the basis of present-day speech communication.
 Then, in 1976, H. McGurk discovered the McGurk effect. In this experiment, when a video of a person uttering /ga/ is shown on a screen while the sound /ba/ is simultaneously presented from a loudspeaker, listeners judge the utterance to be /da/ or /ga/; this supported the view that human speech production and understanding are processed in the brain by a single system (1-system) responsible for articulatory motion. The controversy over whether human speech production and understanding is 1-system or 2-system continued for a long time, but in recent years brain research has advanced greatly through fMRI and related techniques, and according to current findings the production and understanding of speech involve a global processing mechanism including cooperation between Broca's and Wernicke's areas, so the 1-system view has become dominant. In recent years, research on accurately extracting commands related to articulatory motion has been active in the field of speech recognition, while speech synthesis from articulation commands is at the stage of being observed with fMRI and the like.
 Although the 1-system view is thus gaining ground, there are many obstacles to putting such a system to practical use. The system closest to realization is hidden Markov model (Hidden Markov Model; hereinafter sometimes referred to as HMM) synthesis (see Non-Patent Document 2).
 This method applies the HMM that is currently standard in speech recognition; the operation of the system is shown in FIG. 1. The HMM training part, not shown in the figure, learns a spectral parameter sequence (here, mel cepstrum coefficients (Mel Frequency Cepstrum Coefficients; hereinafter sometimes referred to as MFCC)) and pitch parameters with an HMM based on a multi-space probability distribution, using the Baum-Welch algorithm. At this time, a state duration distribution is constructed from the trellis obtained when the HMM 101 expressing the spectrum sequence of a specific speaker is trained on continuous speech. In the synthesis part, text is input and prosodic information is assigned by text analysis; each state of the HMM is then extended according to the state duration distribution, and the excitation waveform generated from the obtained spectrum and pitch is passed through an MLSA (mel log spectrum approximation) synthesis filter 102 to obtain the synthesized speech waveform.
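The synthesis step just described, in which each HMM state is held for its expected duration and its output parameters are emitted frame by frame, can be caricatured in a few lines. The sketch below simply repeats each state's mean parameter vector for a given number of frames; it is a deliberately crude stand-in (assumed names, toy values) for maximum-likelihood parameter generation and the MLSA filtering stage, which are not reproduced here.

```python
import numpy as np

def generate_parameter_track(state_means, state_durations):
    """Hold each HMM state's mean parameter vector for its expected duration,
    giving a frame-by-frame synthesis parameter sequence."""
    frames = [np.tile(mu, (int(d), 1)) for mu, d in zip(state_means, state_durations)]
    return np.vstack(frames)

# toy example: 3 states, 2-dimensional parameter vectors
means = np.array([[0.2, -0.1], [0.5, 0.0], [0.1, 0.3]])
durations = [4, 6, 3]                                  # frames per state
track = generate_parameter_track(means, durations)     # shape (13, 2)
```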
 On the other hand, humans, from infancy, come to understand the voices of an unspecified number of other people after listening to only a very small number of human voices, such as the speech of their parents. This fact suggests that the human brain listens to speech by converting it into an invariant feature pattern, namely articulatory motion.
 Because the method disclosed in Non-Patent Document 2 builds its synthesis part from a specific-speaker HMM created from the speech spectrum information of that speaker, it has the drawback of requiring a large amount of speech data from the specific speaker in order to realize high-quality speech. Moreover, when this HMM is used for speech recognition, since it was designed from the speech of one specific speaker, only poor recognition results are obtained for the many speakers other than that speaker.
 The present invention was made to solve the above problems, and its object is to provide a speech synthesis apparatus, a speech synthesis method, and a speech synthesis program based on one-model speech recognition synthesis that realize functions that conflict in conventional methods: high speech recognition performance for unspecified speakers and clear speech synthesis for a specific individual.
 To solve the above problems, the speech synthesis apparatus of the invention according to claim 1 comprises a phoneme-unit articulatory motion storage unit that stores in advance a state transition model of articulatory motion for each fixed speech unit, a speech recognition part that performs speech recognition with reference to the state transition model, and a speech synthesis part that performs speech synthesis while acquiring the optimal articulation sequence from the state transition model. The speech recognition part includes speech acquisition means for acquiring speech, articulatory feature extraction means for extracting articulatory features of the acquired speech, first storage control means for storing the extracted articulatory features in storage means, and optimal speech unit sequence identification means for comparing the articulatory feature time-series data read from the storage means with the state transition model to identify the optimal speech unit sequence. The speech synthesis part includes optimal articulatory feature sequence generation means for estimating the optimal state sequence of articulatory motion from the optimal speech unit sequence and generating an articulatory feature sequence, second storage control means for storing the generated optimal articulatory feature sequence data in storage means, speech synthesis parameter sequence conversion means for converting the articulatory feature sequence data read from the storage means into a speech synthesis parameter sequence, third storage control means for storing the converted speech synthesis parameter sequence in storage means, and means for synthesizing speech from the speech synthesis parameters read from the storage means and a driving excitation signal.
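The overall structure of claim 1, a single phoneme-unit articulatory-motion model store referenced by both the recognizer and the synthesizer, can be sketched as follows. The class and function names are invented for illustration, the per-phoneme model is reduced to a small array of state-mean articulatory-feature vectors, and the scoring and conversion steps are toy stand-ins for the HMM likelihood computation and the neural-network parameter conversion described later.

```python
import numpy as np

class PhonemeUnitModelStore:
    """Shared store of per-phoneme articulatory-motion models (here reduced
    to a few state-mean articulatory-feature vectors; the patent uses HMMs)."""
    def __init__(self, models):
        self.models = models              # phoneme -> (n_states, dpf_dim) array

class Recognizer:
    def __init__(self, store):
        self.store = store
    def best_unit(self, dpf_frames):
        """Pick the phoneme whose state means are closest to the input
        articulatory-feature frames (stand-in for HMM likelihood scoring)."""
        def score(states):
            d = np.linalg.norm(dpf_frames[:, None, :] - states[None, :, :], axis=-1)
            return d.min(axis=1).sum()    # each frame matched to its nearest state
        return min(self.store.models, key=lambda p: score(self.store.models[p]))

class Synthesizer:
    def __init__(self, store, dpf_to_parcor):
        self.store = store
        self.dpf_to_parcor = dpf_to_parcor        # speaker-dependent conversion
    def articulatory_sequence(self, phonemes, frames_per_state=3):
        seqs = [np.repeat(self.store.models[p], frames_per_state, axis=0)
                for p in phonemes]
        return np.vstack(seqs)
    def synthesis_parameters(self, phonemes):
        return self.dpf_to_parcor(self.articulatory_sequence(phonemes))

# toy usage: the same store drives both recognition and synthesis
store = PhonemeUnitModelStore({
    "a": np.array([[1.0, 0.0], [0.8, 0.1]]),
    "i": np.array([[0.0, 1.0], [0.1, 0.9]]),
})
rec = Recognizer(store)
syn = Synthesizer(store, dpf_to_parcor=lambda dpf: 0.5 * dpf)   # placeholder mapping
unit = rec.best_unit(np.array([[0.9, 0.05], [0.85, 0.1]]))      # -> "a"
params = syn.synthesis_parameters([unit])
```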
 In the speech synthesis apparatus of the invention according to claim 2, the phoneme-unit articulatory motion storage unit stores a coefficient set of a hidden Markov model (HMM) expressing articulatory motion, which can be referenced from the optimal speech unit sequence identification means of the speech recognition part and the optimal articulatory feature sequence generation means of the speech synthesis part.
 In the speech synthesis apparatus of the invention according to claim 3, the articulatory feature extraction means comprises an analysis filter that Fourier-analyzes the digital speech signal, a local feature extraction unit having a time-axis differential feature extraction unit and a frequency-axis differential feature extraction unit, and a discriminative phoneme feature extraction unit consisting of a multilayer neural network arranged in one or more stages.
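A minimal sketch of the claim 3 front end is given below: a time-axis differential computed by linear regression over neighboring frames, a frequency-axis differential over adjacent channels, and a one-hidden-layer multilayer perceptron that maps the resulting local features to articulatory (distinctive phoneme) feature estimates. The regression width, activation functions, and weight shapes are assumptions chosen for illustration; the embodiment uses one or two hidden layers and, optionally, a second network stage.

```python
import numpy as np

def delta_time(feat, width=2):
    """Time-axis differential by linear regression over +/-width frames
    (a regression-based delta that resists noise fluctuations)."""
    T, D = feat.shape
    num = np.zeros((T, D))
    den = 2 * sum(k * k for k in range(1, width + 1))
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    for k in range(1, width + 1):
        num += k * (padded[width + k: width + k + T] - padded[width - k: width - k + T])
    return num / den

def delta_freq(feat):
    """Frequency-axis differential: difference between adjacent channels."""
    return np.diff(feat, axis=1, prepend=feat[:, :1])

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP mapping local features to articulatory
    (distinctive phoneme) feature estimates in [0, 1]."""
    h = np.tanh(x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))     # one sigmoid output per DPF

# typical use: stack spectrum, its deltas, and power deltas as the local features
# local = np.hstack([spec, delta_time(spec), delta_freq(spec)])
```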
 In the speech synthesis apparatus according to claim 4, the state transition model is created using multi-speaker speech, and the means for converting the articulatory feature sequence data into a speech synthesis parameter sequence, created from the speech of the specific speaker alone or from unspecified speakers, is created by adaptive learning with the speech of the specific speaker.
 In the speech synthesis apparatus of the invention according to claim 5, the means for synthesizing speech from the speech synthesis parameters and the driving excitation signal is provided with a driving excitation codebook, means for selecting the optimal driving excitation by comparing speech synthesized from the speech synthesis parameters and a driving excitation code with the original training speech, and means for registering the selected driving excitation code in the corresponding articulatory motion state transition model.
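The closed-loop selection of claim 5 can be illustrated with the short sketch below: every candidate excitation in the codebook is passed through the synthesis filter, the output is compared with the original training waveform, and the code with the smallest squared error is returned so that it can be registered with the corresponding articulatory-motion state. The function signature and the squared-error criterion are assumptions for illustration; the synthesis filter is passed in as a callable (for example, the PARCOR lattice sketched earlier).

```python
import numpy as np

def select_excitation(codebook, synthesis_filter, target_waveform):
    """Closed-loop selection: synthesize with every candidate excitation and
    keep the code whose output is closest to the original training waveform."""
    errors = []
    for excitation in codebook:
        y = synthesis_filter(excitation)
        n = min(len(y), len(target_waveform))
        errors.append(np.sum((y[:n] - target_waveform[:n]) ** 2))
    best_code = int(np.argmin(errors))
    return best_code, errors[best_code]

# toy usage; the chosen best_code would then be registered with the
# articulatory-motion (HMM) state that produced this segment.
codebook = [np.random.randn(160) for _ in range(8)]
target = np.random.randn(160)
best_code, err = select_excitation(codebook, lambda e: 0.9 * e, target)
```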
 The speech synthesis method of the invention according to claim 6 is based on one-model speech recognition synthesis and uses a phoneme-unit articulatory motion storage unit that stores in advance a state transition model of articulatory motion for each fixed speech unit, a speech recognition part that performs speech recognition with reference to the state transition model, and a speech synthesis part that performs speech synthesis while acquiring the optimal articulation sequence from the state transition model. The speech recognition part includes a speech acquisition step of acquiring speech, an articulatory feature extraction step of extracting articulatory features of the acquired speech, a first storage control step of storing the extracted articulatory features in storage means, and an optimal speech unit sequence identification step of comparing the articulatory feature time-series data read from the storage means with the state transition model to identify the optimal speech unit sequence. The speech synthesis part includes an optimal articulatory feature sequence generation step of estimating the optimal state sequence of articulatory motion from the optimal speech unit sequence and generating an articulatory feature sequence, a second storage control step of storing the generated optimal articulatory feature sequence data in storage means, a speech synthesis parameter sequence conversion step of converting the articulatory feature sequence data read from the storage means into a speech synthesis parameter sequence, a third storage control step of storing the converted speech synthesis parameter sequence in storage means, and a step of synthesizing speech from the speech synthesis parameters read from the storage means and a driving excitation signal.
 また、請求項7に係る発明の音声合成方法では、前記音素単位調音運動記憶部は、調音運動を表現した隠れマルコフモデル(HMM)の係数セットが記憶され、前記音声認識部の最適音声単位系列識別ステップおよび前記音声合成部の最適調音特徴系列生成ステップにおいて参照可能であることを特徴としている。 In the speech synthesis method according to the seventh aspect of the present invention, the phoneme unit articulation motion storage unit stores a coefficient set of a hidden Markov model (HMM) expressing articulation motion, and the optimal speech unit sequence of the speech recognition unit. It can be referred to in the identification step and the optimum articulation feature sequence generation step of the speech synthesizer.
 また、請求項8に係る発明の音声合成方法では、前記調音特徴抽出ステップは、音声のデジタル信号をフーリエ分析する分析フィルタと、時間軸微分特徴抽出ステップおよび周波数軸微分特徴抽出ステップを有する局所特徴抽出ステップと、多層ニューラルネットワークにより処理される弁別的音素特徴抽出ステップとを備えたことを特徴としている。 In the speech synthesis method according to the eighth aspect of the present invention, the articulation feature extraction step includes a local feature including an analysis filter that performs Fourier analysis on a digital signal of speech, a time axis differential feature extraction step, and a frequency axis differential feature extraction step. It is characterized by comprising an extraction step and a discrimination phoneme feature extraction step processed by a multilayer neural network.
 また、請求項9に係る発明の音声合成方法では、前記状態遷移モデルが、多数話者音声を用いて作成されるとともに、前記調音特徴系列データを音声合成パラメータ系列に変換するステップを、特定話者の音声のみ、もしくは不特定話者で作成した前記調音特徴系列データを音声合成パラメータ系列に変換する手段を、特定話者の音声で適応学習して作成されることを特徴としている。 In the speech synthesis method of the invention according to claim 9, the state transition model is created using a multi-speaker speech, and the step of converting the articulation feature sequence data into a speech synthesis parameter sequence includes: The means for converting the articulation feature series data created by only the voice of the speaker or the unspecified speaker into a speech synthesis parameter series is created by adaptive learning with the voice of the specific speaker.
 また、請求項10に係る発明の音声合成方法では、前記音声合成パラメータと駆動音源信号から音声を合成するステップにおいて、駆動音源符号帳を設けるとともに、音声合成パラメータと駆動音源符号から合成された音声を元の学習音声と比較して最適な駆動音源を選択するステップと、前記選択された駆動音源符号を対応する調音運動の状態遷移モデルに登録するステップを備えたことを特徴としている。 In the speech synthesis method according to claim 10, in the step of synthesizing speech from the speech synthesis parameter and the driving excitation signal, a driving excitation codebook is provided, and the speech synthesized from the speech synthesis parameter and the driving excitation code is provided. Are compared with the original learning speech, and an optimum driving sound source is selected, and the selected driving sound source code is registered in a corresponding articulatory motion state transition model.
 請求項11に係る発明の音声合成プログラムでは、請求項1ないし5のいずれかに記載の音声合成装置の各処理手段としてコンピュータを駆動させている。 In the speech synthesis program of the invention according to claim 11, a computer is driven as each processing means of the speech synthesis apparatus according to claim 1.
 また、請求項12に係る発明の音声合成プログラムでは、請求項6ないし10のいずれかに記載の音声合成方法の各処理ステップとしてコンピュータを駆動させている。 In the speech synthesis program according to the twelfth aspect, the computer is driven as each processing step of the speech synthesis method according to any one of the sixth to tenth aspects.
 請求項1に係る発明の音声合成装置は、従来のHMM合成装置が使用していた特定話者の「スペクトルに基づく情報」と異なり、「調音運動に基づく情報」を抽出してHMM合成装置を構成する。このため、HMM合成の部分を調音運動という話者に対して基本的に不変なパラメータから構成するため、HMM部分に関して個々の話者の学習音声データが不要もしくは極少量で済むという利点がある。また、音声を生成するには、調音運動を特定話者の発話器官の運動に変換する必要があるが、この部分は少量の音声データで実現できる。話者の音声は調音運動の状態遷移モデルとして不変量と見做し、特定話者の発話動作は音声合成パラメータ系列に変換されることから、両者を分離して把握することができる。このように、音声合成を、不変量と見做すことのできる発話器官への調音動作指令部分(調音運動の状態遷移モデルおよび音素単位調音運動記憶部)と、個人毎に異なる発話器官とその動作に係わる部分(最適音声単位系列識別手段および最適調音特徴系列生成手段)に分離したことにより、個人の発話器官の特性に合わせた高品質な音声合成装置を実現することができる。 The speech synthesizer of the invention according to claim 1 is different from the “information based on spectrum” of the specific speaker used by the conventional HMM synthesizer, and extracts the “information based on articulatory motion” to extract the HMM synthesizer. Constitute. For this reason, since the HMM synthesis part is composed of parameters that are basically invariant to the articulatory speaker, there is an advantage that the learning speech data of each speaker is unnecessary or very small for the HMM part. In order to generate speech, it is necessary to convert articulatory motion into motion of a specific speaker's speech organ, but this portion can be realized with a small amount of speech data. Since the speech of the speaker is regarded as an invariant as a state transition model of articulatory movement, and the speech operation of the specific speaker is converted into a speech synthesis parameter series, both can be grasped separately. In this way, the articulatory motion command part (articulatory motion state transition model and phoneme unit articulatory memory unit) to speech organs that can be regarded as invariant speech synthesis, different speech organs for each individual and their By separating the parts related to the operation (optimum speech unit sequence identification means and optimum articulation feature sequence generation means), it is possible to realize a high-quality speech synthesizer that matches the characteristics of the individual speech organs.
 特に、従来の音声スペクトル由来の特徴を使用する音声認識では、話者や発話時の文脈または周囲の騒音等によって、スペクトルが大きく変動してしまうため、音響的な尤度を求める際に使用するHMMの設計に多くの音声データを必要としていた。これに対し、調音特徴をHMMへの入力特徴とする場合、少ない学習話者でも十分な音素認識性能を得ることができ、かつHMMの混合分布数も少なくて済むという利点を有する。 In particular, conventional speech recognition using features derived from the speech spectrum requires a large amount of speech data to design the HMMs used for computing acoustic likelihoods, because the spectrum varies greatly with the speaker, the phonetic context, the surrounding noise, and so on. In contrast, when articulation features are used as the input features to the HMMs, sufficient phoneme recognition performance can be obtained even with a small number of training speakers, and the number of HMM mixture components can be kept small.
 請求項2に係る発明の音声合成装置は、音素単位調音運動記憶部に調音運動を表現したHMMの係数セットが記憶されていることから、これを参照する最適音声単位系列識別手段および最適調音特徴系列生成手段では、話者に対して基本的に不変なパラメータにより音声認識処理および音声合成処理が実現される。 In the speech synthesizer of the invention according to claim 2, since the HMM coefficient set expressing the articulation motion is stored in the phoneme unit articulation motion storage unit, the optimum speech unit sequence identifying means and the optimum articulation feature referencing this In the sequence generation means, speech recognition processing and speech synthesis processing are realized by parameters that are basically unchanged for the speaker.
 請求項3に係る発明の音声合成装置は、局所特徴抽出部と弁別的音素特徴抽出部とによって調音特徴抽出部が構成されていることから、調音運動に基づく弁別特徴をHMMへの入力特徴とすることができ、少ない学習話者により十分な音素認識性能を得ることができる。 In the speech synthesizer of the invention according to claim 3, since the articulatory feature extracting unit is configured by the local feature extracting unit and the discriminative phoneme feature extracting unit, the discriminating feature based on the articulatory motion is input to the HMM. Therefore, sufficient phoneme recognition performance can be obtained with a small number of learning speakers.
 請求項4に係る発明の音声合成装置は、従来のHMM合成装置が使用していた「特定話者のスペクトルに基づく情報」ではなく、「不特定多数話者の調音運動の基づく情報」を抽出してHMM合成装置を構成するものである。これにより、上記発明の効果に加えて、HMM合成の部分を話者に対し共通化することができ、個々の話者はHMM部分に関して学習音声データが原則不要にできるという利点がある。また、音声合成を、発話器官への調音動作指令部分と、個人毎に異なる発話器官とその動作に係わる部分に分離し、かつ前者を多数話者の調音特徴データを使用して、話者に対しより不変な調音動作指令として構成したことにより、個人の発話器官の特性に合わせた高品質音声合成と、高い音声認識性能の双方を達成することができる。 The speech synthesizer according to the invention of claim 4 extracts “information based on articulatory motion of unspecified majority speakers” instead of “information based on the spectrum of specific speakers” used by the conventional HMM synthesizer. Thus, the HMM synthesizing apparatus is configured. Thereby, in addition to the effect of the above invention, the HMM synthesis part can be made common to the speakers, and each speaker has the advantage that the learning speech data can be made unnecessary for the HMM part in principle. In addition, the speech synthesis is separated into the articulatory motion command part for the speech organs and the speech organs and parts related to the motions that are different for each person, and the former is used for the speaker by using the articulation feature data of many speakers. On the other hand, it is possible to achieve both high-quality speech synthesis and high speech recognition performance in accordance with the characteristics of the individual speech organs by configuring as a more invariant articulation operation command.
 また、個人の音声に適応した合成音を少ないデータで得られることを可能にするため、高い音素認識性能の実現と相俟って、音声対話で問題となっている未知語に、人間同士が行っていると同様の対応を可能にする。すなわち、未知語が出現した際、未知語部分に対応する調音特徴系列を利用し、問い返しの確認発話を容易に合成することができる。 In addition, in order to make it possible to obtain synthesized speech adapted to individual voices with less data, coupled with the realization of high phoneme recognition performance, unknown words that are problematic in voice dialogue are Enable the same response as you do. That is, when an unknown word appears, it is possible to easily synthesize a confirmation utterance for answering using an articulation feature sequence corresponding to the unknown word part.
 請求項5に係る発明の音声合成装置は、合成音の音質に大きな影響を与える駆動音源信号に、音声通信で広く利用されているCELP(Code Excited Linear Prediction)の閉ループ学習の考え方(非特許文献4参照)と、同じく波形合成に広く利用されているPSOLA(Pitch Synchronous Overlap and Add)の技術(非特許文献5参照)を導入することにより、上記発明の効果に加えて、最適な駆動音源符号を選択して対応する調音運動の状態遷移モデルに登録し、これを参照しつつ音声合成することによって高品質音声を得ることができる。 The speech synthesizer of the invention according to claim 5 is a closed loop learning concept of CELP (Code Excited Linear Prediction) widely used in speech communication for driving sound source signals that greatly affect the sound quality of synthesized sound (non-patent document). 4) and the technology of PSOLA (Pitch Synchronous Overlap and Add) (see Non-Patent Document 5), which is also widely used for waveform synthesis, in addition to the effects of the above invention, the optimum driving excitation code Is selected and registered in the corresponding articulatory motion state transition model, and high-quality speech can be obtained by synthesizing speech while referring to the model.
 請求項6に係る発明の音声合成方法は、従来のHMM合成方法が使用していた特定話者の「スペクトルに基づく情報」と異なり、「調音運動に基づく情報」を抽出してHMM合成方法を構成する。このため、HMM合成の部分を調音運動という話者に対して基本的に不変なパラメータから構成するため、個々の話者はHMM部分に関して学習音声データが不要もしくは極少量で済むという利点がある。また、音声を生成するには、調音運動を特定話者の発話器官の運動に変換する必要があるが、この部分は少量の音声データで実現できる。話者の音声は調音運動の状態遷移モデルとして不変量と見做し、特定話者の発話動作は音声合成パラメータ系列に変換されることから、両者を分離して把握することができる。このように、音声合成を、不変量と見做すことのできる発話器官への調音動作指令部分(調音運動の状態遷移モデルおよび音素単位調音運動記憶部)と、個人毎に異なる発話器官とその動作に係わる部分(最適音声単位系列識別ステップおよび最適調音特徴系列生成ステップ)に分離したことにより、個人の発話器官の特性に合わせた高品質な音声合成方法を実現することができる。 The speech synthesis method of the invention according to claim 6 is different from the “information based on spectrum” of a specific speaker used in the conventional HMM synthesis method, and extracts the “information based on articulatory motion” to extract the HMM synthesis method. Constitute. For this reason, since the HMM synthesis part is composed of parameters that are basically invariant to the speaker, which is an articulatory movement, each speaker has the advantage that learning speech data is unnecessary or requires a very small amount for the HMM part. In order to generate speech, it is necessary to convert articulatory motion into motion of a specific speaker's speech organ, but this portion can be realized with a small amount of speech data. Since the speech of the speaker is regarded as an invariant as a state transition model of articulatory movement, and the speech operation of the specific speaker is converted into a speech synthesis parameter series, both can be grasped separately. In this way, the articulatory motion command part (articulatory motion state transition model and phoneme unit articulatory memory unit) to speech organs that can be regarded as invariant speech synthesis, different speech organs for each individual and their By separating the operation-related parts (optimum speech unit sequence identification step and optimum articulation feature sequence generation step), it is possible to realize a high-quality speech synthesis method that matches the characteristics of the individual utterance organs.
 特に、従来の音声スペクトル由来の特徴を使用する音声認識では、話者や発話時の文脈または周囲の騒音等によって、スペクトルが大きく変動してしまうため、音響的な尤度を求める際に使用するHMMの設計に多くの音声データを必要としていた。これに対し、調音特徴をHMMへの入力特徴とする場合、少ない学習話者でも十分な音素認識性能を得ることができ、かつHMMの混合分布数も少なくて済むという利点を有する。 In particular, in speech recognition using features derived from the conventional speech spectrum, the spectrum varies greatly depending on the speaker, the context at the time of speech or the surrounding noise, etc., so it is used when obtaining the acoustic likelihood. The HMM design required a lot of voice data. On the other hand, when the articulatory feature is an input feature to the HMM, there are advantages that even a small number of learning speakers can obtain sufficient phoneme recognition performance and the number of HMM mixture distributions can be reduced.
 請求項7に係る発明の音声合成方法は、音素単位調音運動記憶部に調音運動を表現したHMMの係数セットが記憶されていることから、これを参照する最適音声単位系列識別ステップおよび最適調音特徴系列生成ステップでは、話者に対して基本的に不変なパラメータにより音声認識処理および音声合成処理が実現される。 In the speech synthesis method of the invention according to claim 7, since the HMM coefficient set expressing the articulation motion is stored in the phoneme unit articulation motion storage unit, the optimum speech unit sequence identification step and the optimum articulation feature referencing this In the sequence generation step, speech recognition processing and speech synthesis processing are realized by parameters that are basically unchanged for the speaker.
 請求項8に係る発明の音声合成方法は、局所特徴抽出ステップと弁別的音素特徴抽出ステップとによって調音特徴抽出ステップが構成されていることから、調音運動に基づく弁別特徴をHMMへの入力特徴とすることができ、少ない学習話者により十分な音素認識性能を得ることができる。 In the speech synthesis method of the invention according to claim 8, since the articulatory feature extraction step is configured by the local feature extraction step and the discriminative phoneme feature extraction step, the discrimination feature based on the articulatory motion is the input feature to the HMM. Therefore, sufficient phoneme recognition performance can be obtained with a small number of learning speakers.
 請求項9に係る発明の音声合成方法は、従来のHMM合成方法が使用していた「特定話者のスペクトルに基づく情報」ではなく、「不特定多数話者の調音運動の基づく情報」を抽出してHMM合成方法を構成するものである。これにより、上記発明の効果に加えて、HMM合成の部分を話者に対し共通化することができ、個々の話者はHMM部分に関して学習音声データが原則不要にできるという利点がある。また、音声合成を、発話器官への調音動作指令部分と、個人毎に異なる発話器官とその動作に係わる部分に分離し、かつ前者を多数話者の調音特徴データを使用して、話者に対しより不変な調音動作指令として構成したことにより、個人の発話器官の特性に合わせた高品質音声合成と、高い音声認識性能の双方を達成することができる。 The speech synthesis method of the invention according to claim 9 extracts “information based on articulatory motion of unspecified majority speakers” instead of “information based on the spectrum of specific speakers” used in the conventional HMM synthesis method. Thus, the HMM synthesis method is configured. Thereby, in addition to the effect of the above invention, the HMM synthesis part can be made common to the speakers, and each speaker has the advantage that the learning speech data can be made unnecessary for the HMM part in principle. In addition, the speech synthesis is separated into the articulatory motion command part for the speech organs and the speech organs and parts related to the motions that are different for each person, and the former is used for the speaker by using the articulation feature data of many speakers. On the other hand, it is possible to achieve both high-quality speech synthesis and high speech recognition performance in accordance with the characteristics of the individual speech organs by configuring as a more invariant articulation operation command.
 また、個人の音声に適応した合成音を少ないデータで得られることを可能にするため、高い音素認識性能の実現と相俟って、音声対話で問題となっている未知語に、人間同士が行っていると同様の対応を可能にする。すなわち、未知語が出現した際、未知語部分に対応する調音特徴系列を利用し、問い返しの確認発話を容易に合成することができる。 In addition, because synthesized speech adapted to an individual's voice can be obtained from a small amount of data, and combined with the realization of high phoneme recognition performance, the method makes it possible to handle unknown words, which are a problem in spoken dialogue, in the same way humans do with each other. That is, when an unknown word appears, a confirming utterance asking back about it can easily be synthesized using the articulation feature sequence corresponding to the unknown word portion.
 請求項10に係る発明の音声合成方法は、合成音の音質に大きな影響を与える駆動音源信号に、音声通信で広く利用されているCELPの閉ループ学習の考え方(非特許文献4参照)と、同じく波形合成に広く利用されているPSOLAの技術(非特許文献5参照)を導入することにより、最適な駆動音源符号を選択して対応する調音運動の状態遷移モデルに登録し、これを参照しつつ音声合成することによって高品質音声を得ることができる。 The speech synthesis method of the invention according to claim 10 is similar to the CELP closed loop learning concept widely used in speech communication (see Non-Patent Document 4) for driving sound source signals that greatly affect the sound quality of synthesized speech. By introducing the PSOLA technology widely used for waveform synthesis (see Non-Patent Document 5), the optimum driving excitation code is selected and registered in the corresponding articulatory motion state transition model, while referring to this High-quality speech can be obtained by speech synthesis.
 請求項11に係る発明の音声合成プログラムは、請求項1ないし5のいずれかに記載の音声合成処理手段としてコンピュータを駆動させることが可能となるから、請求項1ないし5に係る発明の効果を奏することができる。 Since the speech synthesis program of the invention according to claim 11 can drive a computer as the speech synthesis processing means according to any of claims 1 to 5, the effects of the invention according to claims 1 to 5 can be obtained. Can play.
 請求項12に係る発明の音声合成プログラムは、請求項6ないし10のいずれかに記載の音声合成方法の各処理ステップとしてコンピュータを駆動させることが可能となるから、請求項6ないし10に係る発明の効果を奏することができる。 Since the speech synthesis program of the invention according to claim 12 can drive a computer as each processing step of the speech synthesis method according to any of claims 6 to 10, the invention according to claims 6 to 10. The effect of can be produced.
特定話者のスペクトル情報に基づくHMM音声合成処理を示す模式図である。A schematic diagram showing HMM speech synthesis processing based on the spectral information of a specific speaker.
音声合成装置の電気的構成を示す模式図である。A schematic diagram showing the electrical configuration of the speech synthesizer.
調音特徴を表す弁別的音素特徴の一例を示す図である。A diagram showing an example of the distinctive phoneme features representing articulation features.
MFCC特徴と調音特徴を用いた際の音素認識性能を比較した図である。A diagram comparing phoneme recognition performance when using MFCC features and when using articulation features.
音声合成装置にて実行される音声合成処理を示す機能ブロック図である。A functional block diagram showing the speech synthesis processing executed by the speech synthesizer.
調音特徴抽出部の機能詳細を示すブロック図である。A block diagram showing the functional details of the articulation feature extraction unit.
弁別的音素特徴抽出部にて得られる調音特徴の一例を示す図である。A diagram showing an example of the articulation features obtained by the discriminative phoneme feature extraction unit.
調音特徴に基づくHMM音声合成の動作を説明する図である。A diagram explaining the operation of HMM speech synthesis based on articulation features.
音声合成で利用する駆動音源符号帳からの符号選択を説明する図である。A diagram explaining code selection from the driving excitation codebook used in speech synthesis.
音声合成部で用いた音源波形を原音声の残差としての音源波形と比較した図である。A diagram comparing the excitation waveform used in the speech synthesis unit with the excitation waveform obtained as the residual of the original speech.
音声合成部で生成された合成音声のスペクトル包絡と原音声のスペクトル包絡を比較した図である。A diagram comparing the spectral envelope of the synthesized speech generated by the speech synthesis unit with that of the original speech.
音声合成部で生成された合成音声波形と原音声を比較した図である。A diagram comparing the synthesized speech waveform generated by the speech synthesis unit with the original speech.
1モデル音声認識合成システムの構成例を示した図である。A diagram showing a configuration example of the one-model speech recognition synthesis system.
 以下、本発明の音声合成装置および音声合成方法の実施の形態について、図面を参照して説明する。なお、これらの図面は、本発明が採用しうる技術的特徴を説明するために用いられるものであり、記載されている装置の構成、各種処理のフローなどは、特に特定的な記載がない限り、それのみに限定する趣旨ではなく、単なる説明例である。 Embodiments of the speech synthesis apparatus and speech synthesis method of the present invention will be described below with reference to the drawings. These drawings are used to explain technical features that the present invention can adopt, and the device configurations, process flows, and the like described therein are merely illustrative examples and are not intended to be limiting unless specifically stated otherwise.
 はじめに、図2を参照し、音声合成装置1の電気的構成について説明する。図2は、音声合成装置1の電気的構成を示している。この図に示すように、音声合成装置1は、中央演算処理装置11、入力装置12、出力装置13、記憶装置14および外部記憶装置15から構成されている。 First, the electrical configuration of the speech synthesizer 1 will be described with reference to FIG. FIG. 2 shows an electrical configuration of the speech synthesizer 1. As shown in this figure, the speech synthesizer 1 includes a central processing unit 11, an input device 12, an output device 13, a storage device 14, and an external storage device 15.
 中央演算処理装置11は、数値演算・制御などの処理を行うために設けられており、本実施の形態において説明する処理手順に従って演算・処理を行う。例えばCPU等が使用可能である。入力装置12は、マイクロホンやキーボード等で構成され、利用者が発声した音声やキー入力された文字列が入力される。出力装置13は、ディスプレイやスピーカ等で構成され、音声合成結果、あるいは音声合成結果を処理することによって得られた情報が出力される。記憶装置14は、中央演算処理装置11によって実行される処理手順(音声合成プログラム)や、その処理に必要な一時データが格納される。例えば、ROM(リード・オンリー・メモリ)やRAM(ランダム・アクセス・メモリ)が使用可能である。 The central processing unit 11 is provided for performing processing such as numerical computation and control, and performs computation and processing according to the processing procedure described in the present embodiment. For example, a CPU or the like can be used. The input device 12 is configured by a microphone, a keyboard, or the like, and inputs a voice uttered by a user or a character string input by a key. The output device 13 includes a display, a speaker, and the like, and outputs a voice synthesis result or information obtained by processing the voice synthesis result. The storage device 14 stores processing procedures (speech synthesis program) executed by the central processing unit 11 and temporary data necessary for the processing. For example, ROM (Read Only Memory) or RAM (Random Access Memory) can be used.
 また、外部記憶装置15は、音声合成処理に使用される調音特徴系列セット、調音特徴抽出処理に使用されるニューラルネットの重み係数セット、調音特徴系列データから音声合成パラメータ系列への変換処理に使用されるニューラルネットの重み係数セット、調音運動のHMM状態遷移モデルセット、最適調音特徴系列データ、音声認識処理に必要なモデル、入力された音声のデータ、音声合成パラメータ系列データ、駆動音源用符号帳セット、解析結果データ等を記憶するために設けられている。例えば、ハードディスクドライブ(HDD)が使用可能である。そして、これらは、互いにデータの送受信が可能なように、バス22を介して電気的に接続されている。 The external storage device 15 is used for the articulation feature series set used for the speech synthesis process, the neural network weight coefficient set used for the articulation feature extraction process, and the conversion process from the articulation feature series data to the speech synthesis parameter series. Set of neural network weight coefficients, HMM state transition model set of articulation motion, optimal articulation feature sequence data, model necessary for speech recognition processing, input speech data, speech synthesis parameter sequence data, drive sound source codebook It is provided for storing sets, analysis result data, and the like. For example, a hard disk drive (HDD) can be used. And these are electrically connected through the bus | bath 22 so that transmission / reception of data mutually is possible.
 なお、本発明の音声合成装置1のハードウエア構成は、図2に示す構成に限定されるものではない。従って、インターネット等の通信ネットワークと接続する通信I/Fを備えていても構わない。 The hardware configuration of the speech synthesizer 1 of the present invention is not limited to the configuration shown in FIG. Accordingly, a communication I / F connected to a communication network such as the Internet may be provided.
 また、本実施の形態では、音声合成装置1および音声合成プログラムは他のシステムから独立した構成を有しているが、本発明はこの構成に限定されるものではない。従って、他の装置の一部として組込まれた構成や、他のプログラムの一部として組込まれた構成とすることも可能である。また、その場合における入力は、上述の他の装置やプログラムを介して間接的に行われることになる。 In this embodiment, the speech synthesizer 1 and the speech synthesis program have a configuration independent of other systems, but the present invention is not limited to this configuration. Therefore, a configuration incorporated as a part of another device or a configuration incorporated as a part of another program may be employed. Further, the input in that case is indirectly performed through the other devices and programs described above.
 次に、外部記憶装置15に記憶されている記憶データについて説明する。記憶データは各領域に区分されて外部記憶装置15に記憶されており、図2に示すように、調音特徴が記憶されている調音特徴記憶領域16、隠れマルコフモデルが記憶されている隠れマルコフモデル記憶領域17、最適調音特徴系列が記憶されている最適調音特徴系列記憶領域18、入力された音声が記憶される入力音声記憶領域19、音声合成パラメータが記憶される音声合成パラメータ記憶領域20、合成された音声が記憶される合成音声記憶領域21、処理後のデータが記憶される処理結果記憶領域22、各処理時に使用される係数が記憶されている係数記憶領域23、およびその他の領域が設けられている。 Next, the storage data stored in the external storage device 15 will be described. The stored data is divided into each area and stored in the external storage device 15, and as shown in FIG. 2, the articulation feature storage area 16 in which the articulation features are stored, and the hidden Markov model in which the hidden Markov model is stored. A storage area 17, an optimal articulation feature sequence storage area 18 in which an optimal articulation feature sequence is stored, an input voice storage area 19 in which input speech is stored, a speech synthesis parameter storage area 20 in which speech synthesis parameters are stored, and synthesis A synthesized speech storage area 21 for storing the processed speech, a processing result storage area 22 for storing processed data, a coefficient storage area 23 for storing coefficients used in each processing, and other areas. It has been.
 調音特徴記憶領域16には、音声の弁別的特徴系列が記憶されている。弁別特徴は、調音に関わる構造的な特徴を基に音素(音韻)を分類するために提案されたもので、有声性/非有声性/連続性/半母音性/破裂性/摩擦性/破擦性/舌端性/鼻音性/高舌性/低舌性/(舌の盛上る位置が)前方性/後方性/・・・(Distinctive Feature:DF)などがある。また、音声から弁別的特徴などの調音特徴を直接抽出する方法も、ニューラルネットワークを利用する手法など多く提案されている(非特許文献6参照)。 The articulation feature storage area 16 stores discriminative feature sequences of speech. Distinctive features (DF) were proposed for classifying phonemes based on structural features related to articulation, and include voiced / unvoiced / continuant / semivowel / plosive / fricative / affricate / coronal / nasal / high / low / (position where the tongue rises) front / back, and so on. Many methods for extracting articulation features such as distinctive features directly from speech have also been proposed, including approaches that use neural networks (see Non-Patent Document 6).
 隠れマルコフモデル記憶領域17には、中央演算処理装置11において音声認識や音声合成が行われる場合に参照される隠れマルコフモデルが記憶されている。最適調音特徴系列記憶領域18には、中央演算処理装置11において隠れマルコフモデルを参照して探索した結果の最適な調音特徴系列が記憶されている。入力音声記憶領域19には、入力装置12を介して入力された音声データが記憶される。音声合成パラメータ記憶領域20には、中央演算処理装置11においてニューラルネットの重み係数(係数記憶領域23)を参照して計算された結果の音声合成パラメータが記憶されている。合成音声記憶領域21には、中央演算処理11において音声合成パラメータ20と係数記憶領域23上の駆動音源用符号帳セットを参照して計算された結果の合成音声データが記憶される。処理結果記憶領域22には、中央演算処理装置11において実行される各種処理の結果得られたデータが記憶される。係数記憶領域23には、調音特徴抽出のためのニューラルネットの重み係数セット、調音特徴系列データから音声合成パラメータへの変換処理に使用されるニューラルネットの重み係数セット、および音声合成に使用される駆動音源用符号帳セットが記憶される。なお、これらのデータの詳細は後述する。 The hidden Markov model storage area 17 stores a hidden Markov model that is referred to when speech recognition or speech synthesis is performed in the central processing unit 11. The optimum articulation feature sequence storage area 18 stores an optimum articulation feature sequence as a result of searching the central processing unit 11 with reference to the hidden Markov model. The input voice storage area 19 stores voice data input via the input device 12. The speech synthesis parameter storage area 20 stores a speech synthesis parameter as a result calculated by the central processing unit 11 with reference to the weighting coefficient (coefficient storage area 23) of the neural network. The synthesized speech storage area 21 stores the synthesized speech data obtained as a result of referring to the speech synthesis parameter 20 and the driving sound source codebook set in the coefficient storage area 23 in the central processing 11. The processing result storage area 22 stores data obtained as a result of various processes executed in the central processing unit 11. The coefficient storage area 23 is used for a neural network weighting coefficient set for extracting articulation features, a neural network weighting coefficient set used for converting articulation feature series data into speech synthesis parameters, and used for speech synthesis. A codebook set for driving sound source is stored. Details of these data will be described later.
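 The storage areas enumerated above are essentially a keyed partition of the external storage device. Purely as an illustration (the key names below are hypothetical and not part of the embodiment), the layout could be modeled as a simple mapping:

```python
# Hypothetical sketch of the external-storage layout described above.
# Keys are illustrative; the embodiment only specifies what each area holds.
external_storage = {
    "articulation_features": [],      # discriminative feature sequences (area 16)
    "hmm_models": {},                 # hidden Markov model coefficient sets (area 17)
    "optimal_articulation_seq": [],   # optimal articulation feature sequence (area 18)
    "input_speech": [],               # digitized input speech (area 19)
    "synthesis_parameters": [],       # e.g. PARCOR coefficient sequences (area 20)
    "synthesized_speech": [],         # generated waveforms (area 21)
    "processing_results": {},         # results of the various processes (area 22)
    "coefficients": {                 # area 23
        "dpf_nn_weights": None,       # neural-net weights for articulation feature extraction
        "parcor_nn_weights": None,    # weights for feature-to-parameter conversion
        "excitation_codebook": None,  # driving excitation codebook set
    },
}
```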
 ここで、調音特徴記憶領域16に記憶されている弁別的特徴系列に使用される弁別的音素特徴について詳述する。日本語の音素を例として、その弁別的音素特徴(Distinctive Phonemic Feature;以下、DPFと記述する場合がある)を図3に示す。ここで、弁別的音素特徴とは、調音特徴の表現方法の一つである。図は、縦欄が弁別的特徴を示しており、横欄が個々の音素を示している。図中(+)は各音素についての弁別的特徴を有していることを意味し、(-)はその特徴を有しないことを意味する。なお、日本語以外の言語について弁別的音素特徴を把握する場合には、これらの弁別的特徴および音素に加えて、当該言語に特有の弁別的特徴または音素についても考慮されることとなる。 Here, the discriminative phoneme features used for the discriminative feature series stored in the articulation feature storage area 16 will be described in detail. As an example, Japanese phonemes are shown in FIG. 3 as distinctive phonemic features (hereinafter, sometimes referred to as DPF). Here, the discriminative phoneme feature is one method of expressing articulatory features. In the figure, the vertical column shows the distinguishing features, and the horizontal column shows the individual phonemes. In the figure, (+) means having a distinguishing feature for each phoneme, and (-) means not having that feature. In addition, when grasping discriminative phoneme features for languages other than Japanese, in addition to these discriminative features and phonemes, discriminative features or phonemes specific to the language are also considered.
 そして、この表から一つの音素を生成する際に必要な発声器官の動作を知ることができる。図3のうちnil(高/低)は、高舌性/低舌性のどちらにも属さない音素に対して弁別特徴を割り当て、nil(前/後)は、(舌の盛上る位置が)前方性/後方性のどちらにも属さない音素に対して弁別特徴を割り当てるためのものであり、新たに追加した特徴であることを示す。このように、音素間のバランスをとることで、音声認識性能が向上することが知られている。 And, from this table, it is possible to know the operation of the vocal organs necessary for generating one phoneme. In FIG. 3, nil (high / low) assigns a distinguishing feature to phonemes that do not belong to either high or low tongue, and nil (front / rear) is (the position where the tongue rises) This is for assigning a discrimination feature to a phoneme that does not belong to either forward or backward, and indicates a newly added feature. Thus, it is known that the speech recognition performance is improved by balancing the phonemes.
 なお、調音特徴の表現としては、国際音声記号(International Phonetic Alphabet;以下、IPAと称する)として広く使用されている表に記載されたものを用いてもよい。このIPAの表は、子音と母音の表に分かれ、子音では、調音位置および調音方法で分類されている。調音位置とは、唇、歯茎、硬口蓋、軟口蓋、声門などであり、調音方法とは破裂、摩擦、破擦、弾音、鼻音、半母音などである。また、それぞれについて有声と無声がある。例えば、/p/は、子音で、無声音、唇音、破裂音に分類される。一方、母音では、舌が最も盛上る場所および舌と口蓋との空間の広さで分類されている。舌が最も盛上る場所は、前(前舌)、後(後舌)または中(中舌)に区別され、舌と口蓋との空間の広さは、狭、半狭、半広または広に区分される。例えば、/i/は、前舌母音で狭母音(せまぼいん)である。IPAを使用する場合は、図3に示した弁別特徴の表と同様に、調音特徴のある個所(/p/を例にとると、子音、無声音、唇音、破裂音の個所)が+となり、それ以外では-となる。 In addition, as an expression of the articulation feature, those described in a table widely used as an international phonetic alphabet (hereinafter referred to as IPA) may be used. This IPA table is divided into consonant and vowel tables, and the consonants are classified by the articulation position and articulation method. The articulation position includes lips, gums, hard palate, soft palate, glottis and the like, and the articulation method includes rupture, friction, rubbing, bullet, nasal sound, semi-vowel and the like. There is voiced and unvoiced for each. For example, / p / is a consonant, and is classified into unvoiced sound, lip sound, and plosive sound. On the other hand, vowels are classified according to the place where the tongue is most prominent and the size of the space between the tongue and the palate. The place where the tongue is most prominent is distinguished from the front (front tongue), back (rear tongue) or middle (middle tongue), and the space between the tongue and the palate can be narrow, semi-narrow, half-wide or wide. It is divided. For example, / i / is a front vowel and a narrow vowel. In the case of using IPA, as in the discrimination feature table shown in FIG. 3, the part having the articulatory feature (the part of consonant, unvoiced sound, lip sound, burst sound is taken as +, for example, / p /), Otherwise-.
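 As a concrete illustration of the +/- table of FIG. 3, a distinctive phoneme feature vector can be thought of as one binary entry per feature and per phoneme. The fragment below is hypothetical and uses only a few well-known assignments (e.g. /p/ as an unvoiced labial plosive, /i/ as a high front vowel); the actual inventory and values are those of FIG. 3.

```python
# Hypothetical fragment of a distinctive-phoneme-feature (DPF) table:
# +1 means the phoneme has the feature, -1 means it does not.
DPF_NAMES = ["voiced", "continuant", "plosive", "fricative", "nasal",
             "high", "low", "front", "back"]

DPF_TABLE = {
    #      voiced cont plos fric nasal high low front back
    "p": [  -1,   -1,  +1,  -1,  -1,   -1,  -1,  -1,  -1],
    "i": [  +1,   +1,  -1,  -1,  -1,   +1,  -1,  +1,  -1],
    "a": [  +1,   +1,  -1,  -1,  -1,   -1,  +1,  -1,  -1],
    "m": [  +1,   -1,  -1,  -1,  +1,   -1,  -1,  -1,  -1],
}

def dpf_vector(phoneme: str) -> list[int]:
    """Return the articulatory (distinctive) feature vector of a phoneme."""
    return DPF_TABLE[phoneme]
```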
 従来の音声スペクトル由来の特徴を使用する音声認識では、話者や発話時の文脈、周囲騒音等によってスペクトルが大きく変動してしまうため、音響的な尤度を求める際に使用するHMMの設計に多くの音声データを必要としていた。近年のHMMに基づく音声認識装置では、音声スペクトルを入力特徴として使用し、個々のベクトル要素の変動を複数の正規分布から表現する。なお、実際に多用される音声スペクトルは、音声スペクトルを聴覚特性に合わせて周波数をメル尺度化するとともに、スペクトルの対数値を離散コサイン変換(DCT)したメルケプストラム(MFCC)が使用される。また、複数の正規分布は混合分布と呼ばれ、この数は前述した様々な変形に対処するため、近年では60~70の分布を使用するものが現れている。このように、厖大なメモリと演算が必要になった原因は、音声中に隠された変数を特定せずに、音素や単語を分類しようとした結果といえる。これに対し、調音特徴を用いると、HMMの混合数を数個程度で済ませることができる(非特許文献3参照)。 Conventional speech recognition using features derived from the speech spectrum requires a large amount of speech data to design the HMMs used for computing acoustic likelihoods, because the spectrum fluctuates greatly with the speaker, the phonetic context, ambient noise, and so on. Recent HMM-based speech recognizers use the speech spectrum as the input feature and express the variation of each vector element with multiple normal distributions. In practice, the widely used representation is the mel-frequency cepstrum (MFCC), obtained by warping the frequency axis of the speech spectrum onto the mel scale to match auditory characteristics and then applying a discrete cosine transform (DCT) to the log spectrum. The set of normal distributions is called a mixture distribution, and to cope with the various kinds of variation mentioned above, systems using 60 to 70 mixture components have appeared in recent years. The need for such huge memory and computation can be seen as the consequence of trying to classify phonemes and words without identifying the variables hidden in the speech. In contrast, when articulation features are used, only a few HMM mixture components are needed (see Non-Patent Document 3).
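 The MFCC computation summarized above (mel-scaled filterbank on the power spectrum, then a DCT of the log band energies) can be sketched as follows. This is a generic illustration rather than the specific analysis of the embodiment, and the mel filterbank matrix is assumed to be given.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame: np.ndarray, mel_fb: np.ndarray, n_ceps: int = 12) -> np.ndarray:
    """Compute MFCCs for one windowed speech frame.

    frame  : windowed time-domain samples
    mel_fb : (n_mels, n_fft//2 + 1) mel filterbank matrix (assumed given)
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2          # power spectrum
    mel_energies = mel_fb @ spectrum                     # mel-scaled band energies
    log_mel = np.log(mel_energies + 1e-10)               # log compression
    return dct(log_mel, type=2, norm="ortho")[:n_ceps]   # DCT -> cepstral coefficients
```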
 そこで、図4にMFCCを用いて音素単位のHMMを学習した際の音素認識性能と、調音特徴(具体的には弁別特徴(DPF、後述)を使用)をHMMへの入力特徴とした場合の音素認識性能とを比較したグラフを示す。この図において、横軸はHMMを表現する際に必要とした分布の混合数(左から1、2、4、8、16)を示しており、混合数が増加するほど認識に必要な演算量も増加している。混合数毎に示した棒グラフは、HMM学習に用いた男性話者の数を示し、それぞれの混合数毎に左から1名、2名、4名、8名、33名で×印は100名である。この時の変化を折れ線グラフで示す(破線がMFCCで、実線がDPFを示す)。この図から明らかなとおり、従来法では、学習人数を増やすほど、音素認識性能も向上するが、HMMの分布混合数を増やさないと性能は飽和していくことがわかる。このように、従来のMFCCを特徴パラメータとする音声認識は、高い音素認識を達成するために、多くの話者データを必要とするとともに、認識に必要とされる演算量も膨大であった。これに対し、DPFを使用した場合では、図からも明らかなとおり、少ない学習話者(1名)でも十分な音素認識性能を示しており、また、HMMの混合分布数も少なくて済むことが明らかである。音声認識では、話者の違いのほかに、騒音の重畳等があるため、これらに対してHMMの混合数を上げる必要はあるものの、図示のように、少なくとも話者に対しては調音特徴が不変量であることを理解することができる。そこで、このような不変量の調音特徴を調音運動の状態遷移モデル(HMM)として記憶させ、音声認識および音声合成において共通に参照可能にしているのである。 Therefore, the phoneme recognition performance and the articulation feature (specifically using the discriminating feature (DPF, which will be described later)) when learning the HMM in phonemes using MFCC in FIG. 4 are input features to the HMM. The graph which compared phoneme recognition performance is shown. In this figure, the horizontal axis indicates the number of mixed distributions (1, 2, 4, 8, 16 from the left) necessary for expressing the HMM, and the amount of computation required for recognition as the number of mixtures increases. Has also increased. The bar graph shown for each mixture number indicates the number of male speakers used for HMM learning. For each mixture number, one person, two persons, four persons, eight persons, and 33 persons from the left, and x indicates 100 persons It is. The change at this time is shown by a line graph (the broken line is MFCC and the solid line is DPF). As is apparent from this figure, in the conventional method, the phoneme recognition performance improves as the number of learners increases, but it can be seen that the performance saturates unless the number of HMM distribution mixture is increased. As described above, the conventional speech recognition using MFCC as a characteristic parameter requires a large amount of speaker data in order to achieve high phoneme recognition, and the amount of calculation required for the recognition is enormous. On the other hand, when the DPF is used, as is apparent from the figure, even a small number of learning speakers (one person) shows sufficient phoneme recognition performance, and the number of HMM mixture distributions may be small. it is obvious. In speech recognition, in addition to speaker differences, there is noise superposition, etc., so it is necessary to increase the number of HMMs to be mixed. However, as shown in the figure, at least the speaker has articulation characteristics. It can be understood that it is an invariant. Therefore, such invariant articulatory features are stored as articulatory motion state transition models (HMMs) so that they can be commonly referenced in speech recognition and speech synthesis.
 次に、音声合成装置1にて実行される音声認識処理および音声合成処理について、図5~図12を参照して説明する。図5は、音声合成装置1にて実行される音声認識および音声合成の処理を示す機能ブロック図である。この図に示すように、音声合成装置1において実行される音声認識処理および音声合成処理に必要な機能ブロックとして、入力部201、A/D変換部202、調音特徴抽出部210、音声認識部220、最適調音特徴・音声合成パラメータ変換部(図では、最適調音特徴系列(右矢印)音声合成パラメータ変換部と記載している)230、音声合成部240、D/A変換部206、出力部205、調音特徴計算用記憶部207、音素単位調音運動記憶部225および音声合成用記憶部235が設けられている。 Next, speech recognition processing and speech synthesis processing executed by the speech synthesizer 1 will be described with reference to FIGS. FIG. 5 is a functional block diagram showing speech recognition and speech synthesis processing executed by the speech synthesizer 1. As shown in this figure, as a functional block necessary for speech recognition processing and speech synthesis processing executed in the speech synthesizer 1, an input unit 201, an A / D conversion unit 202, an articulation feature extraction unit 210, and a speech recognition unit 220 are illustrated. , Optimum articulation feature / speech synthesis parameter conversion unit (in the figure, described as optimum articulation feature sequence (right arrow) speech synthesis parameter conversion unit) 230, speech synthesis unit 240, D / A conversion unit 206, output unit 205 , An articulation feature calculation storage unit 207, a phoneme unit articulation movement storage unit 225, and a speech synthesis storage unit 235 are provided.
 調音特徴計算用記憶部207には、音声分析のための各種係数セット2071、調音特徴計算のためのニューラルネット重み係数セット等が記憶されている。音素単位調音運動記憶部225には、調音運動を表現したHMMモデルの係数セット2251が記憶され、ここに記憶されている係数セット2251は、音声認識部220、および、最適調音特徴系列・音声合成パラメータ変換部230より参照可能な状態となっている。音声合成用記憶部235には、最適調音特徴系列・音声合成パラメータ変換部230の計算結果である音声合成パラメータセット2351と、駆動音源符号帳2352が記憶されている。そして、音声合成部240は、音声合成パラメータ(声道形状の変化に相当)を係数とするデジタルフィルタを構成し、駆動音源符号帳2352から読み出された駆動音源入力により音声を合成する。合成音声はD/A変換部206を経て、出力部205に送られ、スピーカから音声を送出する。 The articulation feature calculation storage unit 207 stores various coefficient sets 2071 for speech analysis, neural network weight coefficient sets for articulation feature calculation, and the like. The phoneme-unit articulatory motion storage unit 225 stores a coefficient set 2251 of HMMs expressing articulatory motion, and this coefficient set 2251 can be referred to by the speech recognition unit 220 and by the optimal articulation feature sequence / speech synthesis parameter conversion unit 230. The speech synthesis storage unit 235 stores a speech synthesis parameter set 2351, which is the output of the optimal articulation feature sequence / speech synthesis parameter conversion unit 230, and a driving excitation codebook 2352. The speech synthesis unit 240 constructs a digital filter whose coefficients are the speech synthesis parameters (corresponding to changes in vocal tract shape) and synthesizes speech from the driving excitation input read from the driving excitation codebook 2352. The synthesized speech passes through the D/A conversion unit 206, is sent to the output unit 205, and is output from a loudspeaker.
 入力部201は、外部から入力される音声を受け付け、アナログ電気信号に変換するために設けられている。A/D変換部202は、入力部201にて受け付けられたアナログ信号をデジタル信号に変換するために設けられている。調音特徴抽出部210は、音声認識のために必要となる所定の特徴量を抽出するために設けられ、また、分析フィルタにより抽出された特徴量の時系列データから、調音特徴の時系列データ(以下、「調音特徴系列」という)を抽出するために設けられている。音声認識部220は、調音特徴抽出部210より得られる調音特徴系列から、音声に含まれる音素・音節・単語などを探索するために設けられている。この探索の際には、音素単位調音運動記憶部225の調音運動モデル係数セット2251が参照される。出力部205は、音声認識部220において探索された結果の音素・音節・単語(列)を出力すると同時に、後述する合成音声を出力するために設けられている。 The input unit 201 is provided for receiving sound input from the outside and converting it into an analog electric signal. The A / D conversion unit 202 is provided to convert an analog signal received by the input unit 201 into a digital signal. The articulatory feature extraction unit 210 is provided to extract a predetermined feature amount necessary for speech recognition. Also, the articulatory feature extraction unit 210 extracts time-series data of articulatory features (from the time-series data of feature amounts extracted by the analysis filter). Hereinafter, it is provided for extracting “articulation feature series”. The speech recognition unit 220 is provided to search for phonemes, syllables, words, and the like included in speech from the articulation feature series obtained from the articulation feature extraction unit 210. In this search, the articulatory motion model coefficient set 2251 of the phoneme unit articulation motion storage unit 225 is referred to. The output unit 205 is provided to output phonemes, syllables, and words (sequences) obtained as a result of the search performed by the speech recognition unit 220, and at the same time, output synthesized speech that will be described later.
 音声認識処理では、入力部201から入力された未知の音声がA/D変換部202を通して離散化され、デジタル信号に変換される。そして、変換されたデジタル信号は、調音特徴抽出部210に出力される。デジタル信号から調音特徴を抽出する調音特徴抽出部210は、図6に示すように、分析フィルタ211、局所特徴抽出部212および弁別的(音素)特徴抽出部213から構成されている。 In the speech recognition process, unknown speech input from the input unit 201 is discretized through the A / D conversion unit 202 and converted into a digital signal. The converted digital signal is output to the articulation feature extraction unit 210. As shown in FIG. 6, the articulation feature extraction unit 210 that extracts the articulation feature from the digital signal includes an analysis filter 211, a local feature extraction unit 212, and a discriminative (phoneme) feature extraction unit 213.
 分析フィルタ211では、はじめに、A/D変換部202にて変換されたデジタル信号がフーリエ分析(窓幅24~32msecのハミング窓使用)される。次いで、24チャンネル程度の帯域通過フィルタに通されて周波数成分が抽出される。これにより、5~10msec間隔の音声スペクトル系列および音声パワー系列が抽出される。そして、得られた音声スペクトル系列および音声パワー系列は、局所特徴抽出部212に対して出力される。 In the analysis filter 211, first, the digital signal converted by the A / D converter 202 is subjected to Fourier analysis (using a Hamming window having a window width of 24 to 32 msec). Next, it is passed through a band pass filter of about 24 channels to extract frequency components. As a result, a speech spectrum sequence and a speech power sequence at intervals of 5 to 10 msec are extracted. The obtained speech spectrum sequence and speech power sequence are output to local feature extraction section 212.
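 A minimal sketch of this analysis stage, assuming a 16 kHz sampling rate, a 25 ms Hamming window, a 10 ms frame shift, and 24 uniformly spaced bands (the embodiment leaves the exact band layout to the filter design):

```python
import numpy as np

def analysis_filterbank(signal: np.ndarray, fs: int = 16000,
                        win_ms: float = 25.0, shift_ms: float = 10.0,
                        n_bands: int = 24):
    """Return per-frame band energies and frame power (illustrative layout only)."""
    win_len = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(win_len)
    n_frames = 1 + max(0, (len(signal) - win_len) // shift)

    n_bins = win_len // 2 + 1
    edges = np.linspace(0, n_bins, n_bands + 1, dtype=int)   # simple uniform bands
    band_energies = np.zeros((n_frames, n_bands))
    power = np.zeros(n_frames)

    for t in range(n_frames):
        frame = signal[t * shift : t * shift + win_len] * window
        spec = np.abs(np.fft.rfft(frame)) ** 2
        power[t] = np.log(spec.sum() + 1e-10)                # speech power series
        for b in range(n_bands):
            band_energies[t, b] = np.log(spec[edges[b]:edges[b + 1]].sum() + 1e-10)
    return band_energies, power
```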
 局所特徴抽出部212では、時間軸微分特徴抽出部2121および周波数軸微分特徴抽出部2122により、時間軸方向および周波数方向の微分特徴が抽出される。また、図示していないが、別途音声パワー系列の時間軸微分特徴が計算される。これらの微分特徴(以下、「局所特徴」という)の抽出にあたっては、ノイズ変動などの影響を抑えるため線形回帰演算が用いられる。抽出された局所特徴は、弁別的音素特徴抽出部213に出力される。なお、弁別的音素特徴抽出部213に出力されるデータとしては、上述の局所特徴以外にも、性能は若干劣るが、音声スペクトル、あるいは音声スペクトルを直交化したケプストラム(実際には周波数軸をメル尺度化して求めるメルケプストラムが用いられる)を使用してもよい。 In the local feature extraction unit 212, the time axis differential feature extraction unit 2121 and the frequency axis differential feature extraction unit 2122 extract differential features in the time axis direction and the frequency direction. In addition, although not shown, the time axis differential feature of the audio power sequence is calculated separately. In extracting these differential features (hereinafter referred to as “local features”), linear regression calculation is used to suppress the influence of noise fluctuations and the like. The extracted local features are output to the discriminative phoneme feature extraction unit 213. The data output to the discriminative phoneme feature extraction unit 213 is a little inferior in performance other than the above-mentioned local features, but the speech spectrum or a cepstrum obtained by orthogonalizing the speech spectrum (actually the frequency axis is a A mel cepstrum obtained by scaling) may be used.
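 The time-axis derivative in this step is typically obtained by fitting a local linear regression over a few neighboring frames (which suppresses noise better than a simple frame difference), and the frequency-axis derivative can be computed the same way along the channel axis. A small sketch under that assumption:

```python
import numpy as np

def delta_time(features: np.ndarray, k: int = 2) -> np.ndarray:
    """Time-axis derivative of a (frames x channels) matrix by linear regression
    over +/- k neighboring frames (standard delta-feature formula)."""
    T = len(features)
    denom = 2 * sum(i * i for i in range(1, k + 1))
    padded = np.pad(features, ((k, k), (0, 0)), mode="edge")
    return np.stack([
        sum(i * (padded[t + k + i] - padded[t + k - i]) for i in range(1, k + 1)) / denom
        for t in range(T)
    ])

def delta_freq(features: np.ndarray, k: int = 2) -> np.ndarray:
    """Frequency-axis derivative: the same regression applied across channels."""
    return delta_time(features.T, k).T
```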
 弁別的音素特徴抽出部213では、局所特徴抽出部212にて抽出された局所特徴に基づき、調音特徴系列が抽出される。弁別的音素特徴抽出部213は、二段のニューラルネットワーク2131,2132で構成されている。 The discriminative phoneme feature extraction unit 213 extracts the articulation feature series based on the local features extracted by the local feature extraction unit 212. The discriminative phoneme feature extraction unit 213 includes two-stage neural networks 2131 and 2132.
 この弁別的音素特徴抽出部213を構成するニューラルネットワークは、図6に示されているように、初段の第一多層ニューラルネット2131と、次段の第二多層ニューラルネット2132との二段から構成される。第一多層ニューラルネット2131では、音声スペクトル系列および音声パワー系列より求めた局所特徴間の相関から、調音特徴系列を抽出する。また、第二多層ニューラルネット2132では、調音特徴系列が持つ文脈情報、すなわちフレーム間の相互依存関係から意味のある部分空間を抽出し、精度の高い調音特徴系列を求める。 As shown in FIG. 6, the neural network constituting the discriminative phoneme feature extraction unit 213 is a two-stage circuit including a first multilayer neural network 2131 at the first stage and a second multilayer neural network 2132 at the next stage. Consists of The first multilayer neural network 2131 extracts an articulatory feature sequence from the correlation between local features obtained from the speech spectrum sequence and the speech power sequence. Further, the second multilayer neural network 2132 extracts a meaningful subspace from the context information of the articulation feature series, that is, the interdependence between frames, and obtains an accurate articulation feature series.
 弁別的音素特徴抽出部213にて算出された調音特徴抽出結果の一例を図7に示す。この図は、「人工衛星」の日本語読みである「jinkoese」という発話に対して求められた調音特徴抽出結果を示している。このように、二段のニューラルネットワーク2131,2132により抽出された調音特徴は、高い精度であることが理解される。 FIG. 7 shows an example of the articulation feature extraction result calculated by the discriminative phoneme feature extraction unit 213. This figure shows the articulation feature extraction result obtained for the utterance “jinkose” which is the Japanese reading of “artificial satellite”. In this way, it is understood that the articulation features extracted by the two-stage neural networks 2131 and 2132 have high accuracy.
 なお、調音特徴系列を求めるニューラルネットワークの構成は、図6にて示した二段構成のほかに、性能を犠牲にすることとなるが一段構成とすることも可能である(非特許文献3参照)。個々のニューラルネットワークは階層構造を持っており、入力層と出力層を除く隠れ層を1から2層持っている(これを多層ニューラルネットワークという)。また、出力層や隠れ層から入力層にフィードバックする構造を持ついわゆるリカレントニューラルネットワークが利用されることもある。調音特徴抽出に対する性能という点で比較すると、其々のニューラルネットワークにおいて算出された結果にそれほど大きな差はない。これらのニューラルネットワークは、非特許文献7に示される重み係数の学習を通して調音特徴抽出器として機能する(非特許文献7参照)。 In addition to the two-stage configuration shown in FIG. 6, the configuration of the neural network for obtaining the articulatory feature sequence may be a one-stage configuration at the expense of performance (see Non-Patent Document 3). ). Each neural network has a hierarchical structure, and has one or two hidden layers excluding an input layer and an output layer (this is called a multilayer neural network). A so-called recurrent neural network having a structure that feeds back from the output layer or hidden layer to the input layer may be used. When compared in terms of performance for articulatory feature extraction, the results calculated in each neural network are not significantly different. These neural networks function as articulatory feature extractors through learning of the weighting coefficient shown in Non-Patent Document 7 (see Non-Patent Document 7).
 また、弁別的音素特徴抽出部213のニューラルネットワークでの学習は、入力層に音声の局所特徴データを加え、出力層には、音声の調音特徴を教師信号として与えることで行われる。 Further, learning by the neural network of the discriminative phoneme feature extraction unit 213 is performed by adding voice local feature data to the input layer and giving the voice articulation feature to the output layer as a teacher signal.
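 A minimal sketch of the two-stage structure described above, assuming already-trained weight matrices and a one-frame context window on each side (training itself is ordinary backpropagation with the local features at the input layer and the articulation features as the teacher signal, as stated in the text):

```python
import numpy as np

def mlp(x: np.ndarray, weights: list, biases: list) -> np.ndarray:
    """One multilayer perceptron: hidden layers use tanh, the output uses a sigmoid
    so each articulation feature becomes a value in (0, 1)."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(x @ W + b)
    y = x @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-y))

def extract_dpf(local_features: np.ndarray, net1, net2, context: int = 1) -> np.ndarray:
    """Two-stage extraction: net1 maps local features to a first DPF estimate frame by
    frame; net2 refines it using the context of neighboring frames."""
    stage1 = np.stack([mlp(f, *net1) for f in local_features])
    padded = np.pad(stage1, ((context, context), (0, 0)), mode="edge")
    return np.stack([
        mlp(padded[t:t + 2 * context + 1].ravel(), *net2)
        for t in range(len(stage1))
    ])
```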
 このように、調音特徴抽出部210によって抽出された調音特徴系列は、音声認識部220に出力され、音素単位調音運動記憶部225の調音運動モデル係数セット2251を参照しつつ最適音声単位系列が得られると同時に、後述の音声合成パラメータによる音声合成に使用され、調音特徴系列を個人に特化した音声に合成される(図5参照)。 As described above, the articulation feature sequence extracted by the articulation feature extraction unit 210 is output to the speech recognition unit 220, and an optimal speech unit sequence is obtained while referring to the articulation motion model coefficient set 2251 of the phoneme unit articulation motion storage unit 225. At the same time, it is used for speech synthesis using speech synthesis parameters, which will be described later, and the articulation feature series is synthesized into speech specialized for an individual (see FIG. 5).
 以上が音声認識部に関する説明である。上記説明において、入力部201が音声合成装置にかかる発明の音声取得手段に相当し、調音特徴抽出部210が調音特徴抽出手段に相当する。また、音声認識部220が最適音声単位系列識別手段に相当し、中央演算処理装置11が各記憶制御手段に、外部記憶装置15が各記憶手段に相当する。そして、音素単位調音運動記憶部225が音素単位調音運動記憶部に相当し、これに記憶されている不特定話者の調音特徴に基づくHMMが、調音運動の状態遷移モデルに相当する。さらに、これらの機能に基づいて処理されるステップは、音声合成方法にかかる発明の音声認識部における各ステップに相当する。 This completes the explanation of the voice recognition unit. In the above description, the input unit 201 corresponds to the voice acquisition unit of the invention according to the speech synthesizer, and the articulation feature extraction unit 210 corresponds to the articulation feature extraction unit. The voice recognition unit 220 corresponds to an optimum voice unit sequence identification unit, the central processing unit 11 corresponds to each storage control unit, and the external storage unit 15 corresponds to each storage unit. The phoneme unit articulation motion storage unit 225 corresponds to the phoneme unit articulation motion storage unit, and the HMM based on the articulation characteristics of the unspecified speaker stored therein corresponds to the state transition model of articulation motion. Furthermore, the steps processed based on these functions correspond to the steps in the speech recognition unit of the invention according to the speech synthesis method.
 次に、調音特徴に基づくHMM音声合成の動作について説明する。図5において示したように、音声合成処理では、最適調音特徴系列・音声合成パラメータ変換部230が、音素単位調音運動記憶部225に記憶されている調音運動を表現したHMMモデルの係数セット2251を参照しつつ、音声合成パラメータを生成し、音声合成部240に出力する。なお、合成の対象となるデータは、入力部201で入力されたテキストデータ(または音声データ)が使用される。 Next, the operation of HMM speech synthesis based on articulation features will be described. As shown in FIG. 5, in the speech synthesis process, the optimum articulation feature sequence / speech synthesis parameter conversion unit 230 generates an HMM model coefficient set 2251 representing the articulation motion stored in the phoneme unit articulation motion storage unit 225. While referencing, a speech synthesis parameter is generated and output to the speech synthesis unit 240. Note that text data (or voice data) input by the input unit 201 is used as data to be combined.
 図8は、HMM音声合成における最適調音特徴系列・音声合成パラメータ変換部230の動作説明図である。この図に示すように、不特定話者の調音特徴に基づくHMMから、Viterbiパス上の最適調音特徴系列が与えられると、次に時刻tを挟んで前後の計3フレームの調音特徴を3層ニューラルネットワークに入力し、対応するPARCOR係数を教師データとして、調音特徴系列・音声合成パラメータ(ここではPARCOR係数)変換部230が構成されている。 FIG. 8 is an explanatory diagram of the operation of the optimal articulation feature sequence / speech synthesis parameter converter 230 in HMM speech synthesis. As shown in this figure, when an optimum articulation feature sequence on the Viterbi path is given from the HMM based on the articulation feature of an unspecified speaker, next, three layers of articulation features of a total of three frames before and after the time t are placed. An articulatory feature series / speech synthesis parameter (here, PARCOR coefficient) conversion unit 230 is configured using the PARCOR coefficient corresponding to the teacher data as input to the neural network.
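 A sketch of this conversion step under the stated three-frame framing; since PARCOR coefficients lie in (-1, 1), a tanh output layer is a natural choice (this nonlinearity is an assumption, as the text does not fix it):

```python
import numpy as np

def articulation_to_parcor(articulation_seq: np.ndarray, net, order: int = 12) -> np.ndarray:
    """Map the optimal articulation feature sequence to a PARCOR coefficient sequence.
    For each time t the frames t-1, t and t+1 are concatenated and fed to a
    three-layer network whose teacher data were the PARCOR coefficients."""
    padded = np.pad(articulation_seq, ((1, 1), (0, 0)), mode="edge")
    weights, biases = net
    out = []
    for t in range(len(articulation_seq)):
        x = padded[t:t + 3].ravel()                        # 3-frame context window
        for W, b in zip(weights[:-1], biases[:-1]):
            x = np.tanh(x @ W + b)
        out.append(np.tanh(x @ weights[-1] + biases[-1])[:order])   # PARCOR in (-1, 1)
    return np.stack(out)
```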
 HMMは、複数の定常信号源間を状態遷移することで、非定常な時系列信号を表現する確率モデルで、音声のように様々な要因で変動する時系列の表現に適している。出力確率分布としては、多次元正規分布の重み付き和で表わされる多次元正規混合分布が用いられることが多く、本実施形態も同様である。これによって、話者や前後環境に起因する複雑な変動を細かくモデル化することが可能である。 An HMM is a probabilistic model that represents a non-stationary time-series signal by making state transitions among multiple stationary signal sources, and it is well suited to representing time series that, like speech, vary with many factors. As the output probability distribution, a multidimensional Gaussian mixture, expressed as a weighted sum of multidimensional normal distributions, is commonly used, and this embodiment does the same. This makes it possible to finely model the complex variation caused by the speaker and the surrounding context.
 すなわち、HMMのモデルパラメータλの学習は、与えられた学習のベクトル系列Oに対して、観測尤度P(O|λ)を最大にするλを求める形で数1に示すように定式化されている。 That is, learning of the HMM model parameters λ is formulated, as shown in Equation 1, as finding the λ that maximizes the observation likelihood P(O|λ) for a given training vector sequence O.

数1 (Equation 1):
\[ \hat{\lambda} = \arg\max_{\lambda} P(O \mid \lambda) \]
 なお、このλは、EM(Expectation Maximization)アルゴリズムに基づいて導出できる。 The λ can be derived based on an EM (Expectation Maximization) algorithm.
 音素の初期モデルは、学習用音声データに音素ラベルが付与されていれば、セグメンタルk-means法によって得ることができる。また、音素境界が与えられていない場合には、ラベルが付与された少量のデータから初期モデルを作成し、その後、音素境界の付与されていない大量の音素データを使用して連結学習を行うことができる。音声認識では、未知のベクトル系列Oが観測されたとき、それがどのモデルλから生成されたかを推定する(Ρ(O|λ))。これはベイズの判定式から求めることができる。 The initial phoneme model can be obtained by the segmental k-means method if a phoneme label is assigned to the speech data for learning. In addition, if no phoneme boundary is given, an initial model is created from a small amount of data with a label, and then connected learning is performed using a large amount of phoneme data without a phoneme boundary. Can do. In speech recognition, when an unknown vector sequence O is observed, it is estimated from which model λ it is generated (Ρ (O | λ)). This can be obtained from a Bayesian judgment formula.
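 Recognition, as described here, amounts to scoring the observed sequence O against each unit model λ and choosing the best one. A minimal sketch using single-Gaussian, diagonal-covariance states and a Viterbi approximation of P(O|λ) (an assumption made to keep the example short; the embodiment uses mixture distributions):

```python
import numpy as np

def log_gauss(o, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def viterbi_loglik(O, log_trans, means, variances, log_init):
    """Viterbi approximation of log P(O | lambda) for a left-to-right HMM."""
    N = len(means)
    delta = log_init + np.array([log_gauss(O[0], means[j], variances[j]) for j in range(N)])
    for o in O[1:]:
        delta = np.array([
            np.max(delta + log_trans[:, j]) + log_gauss(o, means[j], variances[j])
            for j in range(N)
        ])
    return np.max(delta)

def recognize(O, models):
    """Bayes decision with equal priors: choose the model maximizing P(O | lambda).
    `models` maps a unit name to (log_trans, means, variances, log_init)."""
    return max(models, key=lambda name: viterbi_loglik(O, *models[name]))
```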
 次に、音声合成について説明する。音声合成の場合は、あるモデルλが最も高い確率で生成するパラメータ時系列を与える問題になる。連続出力分布型HMMλが与えられたとき、λから長さTの出力ベクトル系列(数2参照)を生成するため、尤度最大の意味で最適な音声パラメータ列を求めると、数3に示す式を得る。 Next, speech synthesis is described. In the case of speech synthesis, the problem is to give the parameter time series that a given model λ generates with the highest probability. When a continuous-output-distribution HMM λ is given, finding the speech parameter sequence that is optimal in the maximum-likelihood sense for generating an output vector sequence of length T from λ (see Equation 2) yields the expression shown in Equation 3.

数2 (Equation 2):
\[ O = (o_1, o_2, \ldots, o_T) \]

数3 (Equation 3):
\[ \hat{O} = \arg\max_{O} P(O \mid \lambda) = \arg\max_{O} \sum_{\mathrm{all}\ q} P(O, q \mid \lambda) \]
 さらに、ここでは、問題を簡単化するため、混合分布サブステートに分解した上でViterbiパス上の確率を示すと、数4の式となり、この式において、Oに関して最大化する。 Further, here, in order to simplify the problem, when the probability on the Viterbi path is shown after being decomposed into the mixed distribution substate, the equation 4 is obtained, and in this equation, O is maximized.
数4 (Equation 4):
\[ \hat{O} = \arg\max_{O}\, \max_{q}\, P(O \mid q, \lambda)\, P(q \mid \lambda) \]
 なお、oは、数5に示す静的特徴cのみを考慮する場合、個々のフレームでの出力は、前後のフレームでの出力とは独立に、そのフレームに対応する分布の平均となるため、ある状態から次の状態に遷移する部分でスペクトルに不連続が生じる。 Incidentally, o T when considering only static characteristics c t shown in Formula 5, the output of the individual frames, independently of the output before and after the frame, the average of the distribution corresponding to the frame Therefore, a discontinuity occurs in the spectrum at the transition from one state to the next state.
 Equation 5:  o_t = c_t
 To avoid such discontinuities, dynamic features are introduced into the output parameters.
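 The sketch below shows, under simplifying assumptions, how introducing delta features turns the step-wise state means into a smooth trajectory: given per-frame means and diagonal variances along the Viterbi path, the static trajectory c is obtained by solving (WᵀΣ⁻¹W)c = WᵀΣ⁻¹μ. The one-dimensional stream and the delta window [-0.5, 0, +0.5] are assumptions for illustration.

```python
import numpy as np

def mlpg_1d(mean_static, mean_delta, var_static, var_delta):
    """Maximum-likelihood parameter generation for one static stream with deltas."""
    T = len(mean_static)
    # W maps the static trajectory c (length T) to [statics; deltas] (length 2T).
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)                       # static part
    for t in range(T):                         # delta part: 0.5*(c[t+1] - c[t-1])
        if t > 0:
            W[T + t, t - 1] = -0.5
        if t < T - 1:
            W[T + t, t + 1] = 0.5
    mu = np.concatenate([mean_static, mean_delta])
    prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])   # diagonal Sigma^-1
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)               # smooth static parameter trajectory

# Step-wise state means alone would give a staircase; MLPG returns a smoothed curve.
mean_static = np.repeat([0.0, 1.0, 0.3], 20)   # per-frame means from the Viterbi states
mean_delta = np.zeros_like(mean_static)
c = mlpg_1d(mean_static, mean_delta,
            var_static=np.full_like(mean_static, 0.1),
            var_delta=np.full_like(mean_static, 0.01))
print(c[:5])
```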
 The driving excitation shown in FIG. 8 is created as a multi-stream of the articulation feature sequence and driving excitation codes when HMM training is performed on the training speech data. As shown in FIG. 9, by applying the closed-loop training algorithm used in CELP codebook selection, the (residual) segment with the smallest error is selected and, at the same time, the driving excitation code is registered to the corresponding articulatory motion state, so that high-quality synthesized speech can be obtained. That is, the speech waveform obtained by passing each candidate driving excitation through the synthesis filter (PARCOR synthesis filter) is compared with the original waveform, and the driving excitation code with the smallest error is selected. A compact and efficient driving excitation codebook can be constructed by registering representative segments obtained by clustering the training speech data and by organizing the registered codebook into a tree structure.
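 A minimal sketch of this closed-loop (analysis-by-synthesis) selection follows. The codebook contents, the PARCOR values, and the sign convention used in the step-up recursion from PARCOR (reflection) coefficients to LPC coefficients are assumptions; it only illustrates the idea of synthesizing with every candidate excitation and keeping the one closest to the target waveform.

```python
import numpy as np
from scipy.signal import lfilter

def parcor_to_lpc(k):
    """Step-up recursion from reflection/PARCOR coefficients to LPC coefficients a,
    assuming the convention A(z) = 1 + a[0] z^-1 + ... (sign conventions vary)."""
    a = np.zeros(0)
    for k_m in k:
        a = np.concatenate([a + k_m * a[::-1], [k_m]])
    return a

def select_excitation(codebook, parcor, target):
    """Closed-loop selection: synthesize with every candidate excitation and
    keep the code whose output waveform is closest to the target waveform."""
    denom = np.concatenate([[1.0], parcor_to_lpc(parcor)])   # all-pole filter 1/A(z)
    errors = [np.sum((lfilter([1.0], denom, exc) - target) ** 2)
              for exc in codebook]
    best = int(np.argmin(errors))
    return best, errors[best]

rng = np.random.default_rng(2)
codebook = rng.standard_normal((16, 160))       # 16 candidate residual segments
parcor = np.array([0.6, -0.3, 0.1, 0.05])       # PARCOR coefficients of one frame
target = lfilter([1.0], np.concatenate([[1.0], parcor_to_lpc(parcor)]),
                 codebook[7])                   # pretend segment 7 produced the speech
code, err = select_excitation(codebook, parcor, target)
print(code, err)                                # -> 7, ~0.0
```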
 This completes the description of the speech synthesis unit. In the above description, the part of the optimal articulation feature sequence / speech synthesis parameter conversion unit 230 that obtains the optimal articulation feature sequence by referring to the HMM coefficient set 2251 (see FIG. 8) corresponds to the optimal articulation feature sequence generation means of the present invention relating to the speech synthesis apparatus, and the PARCOR coefficient conversion part corresponds to the speech synthesis parameter sequence conversion means. The speech synthesis unit (PARCOR synthesis filter) 240 corresponds to the means for synthesizing speech from the speech synthesis parameters and the driving excitation signal. As in the case of the speech recognition apparatus, the central processing unit 11 corresponds to each storage control means, the external storage device 15 corresponds to each storage means, the phoneme unit articulatory motion storage unit 225 corresponds to the phoneme unit articulatory motion storage unit, and the HMM stored therein, based on the articulation features of unspecified speakers, corresponds to the state transition model of articulatory motion. Furthermore, the steps processed on the basis of these functions correspond to the respective steps in the speech synthesis unit of the invention relating to the speech synthesis method.
 The excitation waveform created from the driving excitation codebook of this embodiment was compared with the original waveform. In FIG. 10, (a) shows the residual excitation waveform extracted from the original speech, (b) shows the waveform approximated by the conventionally used pulse train and noise, and (c) shows the excitation waveform created from the driving excitation codebook of this embodiment. It can be seen that the excitation waveform created from the excitation codebook is close to the residual waveform obtained when the original speech is subjected to PARCOR analysis.
 The spectra obtained by PARCOR analysis of the synthesized speech of this embodiment and of the original speech were also compared. In FIG. 11, (a) shows the spectrum of the original speech, (b) shows the spectrum of synthesized speech obtained by converting the articulation feature sequence, derived from the articulation features extracted from the speech, into speech synthesis parameters (a PARCOR coefficient sequence), and (c) shows the spectrum of the synthesized speech of this embodiment (HMM/DPF, PARCOR analysis). As is clear from a comparison of (a) and (c) in FIG. 11, the high-frequency part of the spectrum of the synthesized speech of this embodiment is smoothed by the HMM smoothing, but the original spectral shape is well preserved even with a relatively small amount of training speech data. The spectrum in (b) is also close to that in (c), and can be used to inspect the articulation feature extraction result for the input speech, for example in talkback when confirming a speech recognition result.
 Furthermore, the synthesized speech waveforms were compared. In FIG. 12, (a) shows the original speech waveform, (b) shows a speech waveform synthesized using an excitation waveform approximated by a pulse train and noise, and (c) and (d) show speech waveforms synthesized using the driving excitation codebook, where (c) uses a codebook built from a specific speaker and (d) uses a codebook built from unspecified speakers. As is clear from this figure, (c) and (d) yield waveforms close to the original speech. However, since the codebook in (d) is created from the speech of an unspecified large number of speakers, slight degradation is observed in (d) compared with (c), whose codebook is created only from the speech of the specific speaker (the speaker whose articulation features were extracted and used for training the multilayer neural network for speech synthesis parameter conversion). Processing to tune the system to a specific speaker is therefore required. Sound quality can be improved by including a small amount of the specific speaker's speech when training the codebook created from a large amount of unspecified-speaker speech. Likewise, for the multilayer neural network that converts articulation features into speech synthesis parameters, conversion accuracy can be improved by additionally training with a small amount of the target speaker's speech on top of the large amount of unspecified-speaker speech.
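 A hedged sketch of this adaptation idea is shown below: a conversion network trained on many unspecified speakers is further trained on a small amount of the target speaker's data. Reusing `net` and `stack_context` from the earlier sketch, and the amount of adaptation data, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
target_dpf = rng.random((50, 24))               # small amount of target-speaker frames
target_parcor = rng.uniform(-1, 1, (50, 12))

X_adapt = stack_context(target_dpf)
for _ in range(20):                             # a few extra gradient passes
    net.partial_fit(X_adapt, target_parcor)     # nudges the weights toward the target speaker
```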
 In the above description, speech is acquired, an articulation feature sequence is extracted, the optimal articulation sequence is obtained from the HMM articulatory motion model, and this is further converted into speech synthesis parameters to output synthesized speech.
However, the present invention is not limited to such use. For a kanji-kana mixed sentence entered from a keyboard as well, once it has been converted into a kana sequence and phonetic symbols have been obtained, as an ordinary speech synthesizer does, speech can easily be synthesized through kana-character / articulation-feature-sequence conversion, because the distinctive phoneme features used as articulation features correspond one-to-one to kana characters, as illustrated in the sketch that follows.
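 The sketch referred to above expands kana into phonemes and then into distinctive-feature vectors. The feature labels and the tiny tables are hypothetical placeholders, not the feature set or dictionary of the embodiment.

```python
# Hypothetical kana -> phoneme -> distinctive-feature mapping (illustrative only).
KANA_TO_PHONEMES = {"か": ["k", "a"], "な": ["n", "a"], "た": ["t", "a"]}
PHONEME_TO_FEATURES = {
    "k": {"voiced": 0, "nasal": 0, "plosive": 1, "back": 1},
    "t": {"voiced": 0, "nasal": 0, "plosive": 1, "back": 0},
    "n": {"voiced": 1, "nasal": 1, "plosive": 0, "back": 0},
    "a": {"voiced": 1, "nasal": 0, "plosive": 0, "back": 1},
}

def kana_to_feature_sequence(kana_text):
    """Expand each kana into phonemes, then into distinctive-feature vectors."""
    return [PHONEME_TO_FEATURES[p]
            for ch in kana_text
            for p in KANA_TO_PHONEMES[ch]]

print(kana_to_feature_sequence("かな"))   # feature vectors for /k a n a/
```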
 FIG. 13 shows three possible usage modes: first, a mode in which speech is synthesized from text entered from a keyboard; second, a mode in which the text of a recognition result obtained from input speech is displayed on a display and the recognition result is re-synthesized and confirmed as speech; and third, a mode in which the output of the articulation feature extraction unit 40 (the extracted articulation features) is converted by the articulation feature / vocal tract parameter conversion unit 43 for confirmation as speech (path 47 in the figure).
 In the first usage mode, the text-to-phoneme conversion unit 46 in FIG. 13 converts the text into a phoneme sequence using a word dictionary (not shown). The word dictionary stores a reading, part of speech, and accent for each word entry; the text is first divided into morphemes (words) by referring to the word dictionary, and then the phoneme sequence, accent positions, and intonation of the whole sentence are determined from the word readings. The phoneme and prosody sequence is sent to the articulation feature / vocal tract parameter conversion unit 43, and articulation features and excitation segments are read out from each state of the speaker-independent articulation model 42 stored in phoneme units, i.e., the HMM (see FIGS. 8 and 9). The articulation features are then converted into vocal tract parameters such as PARCOR coefficients, and these, together with the driving excitation (residual signal), are sent to the speech synthesis unit 45 and converted into synthesized speech.
 In the second usage mode, the text of the speech recognition result is output and then processed in the same way as text entered by key operation; synthesized speech is therefore returned to the user from the recognition result text (a word or a sentence, i.e., a word string) through the same processing as in the first usage mode.
 In the third usage mode, as described above, the articulation features are supplied via path 47 (FIG. 13), so the vocal tract parameters are obtained via the articulation feature / vocal tract parameter conversion unit 43. As for the excitation signal that the speech synthesizer also requires, a residual signal calculation unit (not shown), which computes the residual of PARCOR analysis of the speech, extracts the residual signal from the input speech, and this is sent to the speech synthesis unit 45 together with the vocal tract parameters to obtain synthesized speech. In this third usage mode, the computer can determine whether the user's speech was extracted as a correct articulatory motion, so the user can obtain information about misjudgments in the speech recognition processing; as a more active use, it can also be applied to pronunciation training (in particular, pronunciation training for foreign languages).
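 The residual extraction mentioned here can be sketched with plain LPC inverse filtering (PARCOR coefficients are the reflection coefficients of the same all-pole model). The file name, sampling rate, and analysis order below are assumptions; passing the residual back through the synthesis filter reconstructs the speech, which is what the third usage mode relies on.

```python
import librosa
import numpy as np
from scipy.signal import lfilter

speech, sr = librosa.load("input.wav", sr=16000)   # hypothetical input file
a = librosa.lpc(speech, order=12)                  # [1, a1, ..., a12] of A(z)
residual = lfilter(a, [1.0], speech)               # inverse filtering: e[n] = A(z) s[n]

# Re-synthesis through 1/A(z) recovers the original waveform up to numerical error.
reconstructed = lfilter([1.0], a, residual)
print(np.max(np.abs(speech - reconstructed)))
```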
DESCRIPTION OF SYMBOLS
1 Speech synthesis apparatus
11 Central processing unit
12 Input device
13 Output device
14 Storage device
15 External storage device
201 Input unit
202 A/D conversion unit
205 Output unit
206 D/A conversion unit
207 Storage unit for articulation feature calculation
210 Articulation feature extraction unit
211 Analysis filter
212 Local feature extraction unit
213 Discriminative phoneme feature extraction unit
220 Speech recognition unit
230 Optimal articulation feature sequence / speech synthesis parameter conversion unit
235 Storage unit for speech synthesis
240 Speech synthesis unit

Claims (12)

  1.  A speech synthesis apparatus based on single-model speech recognition synthesis, comprising a phoneme unit articulatory motion storage unit that stores in advance a state transition model of articulatory motion stored for each fixed speech unit, a speech recognition unit that performs speech recognition with reference to the state transition model, and a speech synthesis unit that performs speech synthesis while obtaining an optimal articulation sequence from the state transition model, wherein
     the speech recognition unit includes speech acquisition means for acquiring speech, articulation feature extraction means for extracting articulation features of the speech acquired by the speech acquisition means, first storage control means for storing the articulation features extracted by the articulation feature extraction means in storage means, and optimal speech unit sequence identification means for comparing the articulation feature time-series data read out from the articulation feature storage means with the state transition model to identify an optimal speech unit sequence, and
     the speech synthesis unit includes optimal articulation feature sequence generation means for estimating an optimal state sequence of articulatory motion from the optimal speech unit sequence and generating an articulation feature sequence, second storage control means for storing the optimal articulation feature sequence data generated by the optimal articulation feature sequence generation means in storage means, speech synthesis parameter sequence conversion means for converting the articulation feature sequence data read out from the optimal articulation feature sequence data storage means into a speech synthesis parameter sequence, third storage control means for storing the speech synthesis parameter sequence converted by the speech synthesis parameter sequence conversion means in storage means, and means for synthesizing speech from the speech synthesis parameters read out from the speech synthesis parameter sequence storage means and a driving excitation signal.
  2.  The speech synthesis apparatus according to claim 1, wherein the phoneme unit articulatory motion storage unit stores a coefficient set of a hidden Markov model (HMM) representing articulatory motion, and can be referred to by the optimal speech unit sequence identification means of the speech recognition unit and by the optimal articulation feature sequence generation means of the speech synthesis unit.
  3.  The speech synthesis apparatus according to claim 1 or 2, wherein the articulation feature extraction means comprises an analysis filter that performs Fourier analysis of the digital speech signal, a local feature extraction unit having a time-axis differential feature extraction unit and a frequency-axis differential feature extraction unit, and a discriminative phoneme feature extraction unit configured as one or more stages of multilayer neural networks.
  4.  The speech synthesis apparatus according to any one of claims 1 to 3, wherein the state transition model is created using the speech of many speakers, and the means for converting the articulation feature sequence data into a speech synthesis parameter sequence is created either from the speech of a specific speaker alone, or by adaptive training, with the speech of the specific speaker, of a means for converting the articulation feature sequence data into a speech synthesis parameter sequence that was created from unspecified speakers.
  5.  The speech synthesis apparatus according to any one of claims 1 to 4, wherein the means for synthesizing speech from the speech synthesis parameters and the driving excitation signal is provided with a driving excitation codebook, and comprises means for selecting an optimal driving excitation by comparing speech synthesized from the speech synthesis parameters and a driving excitation code with the original training speech, and means for registering the selected driving excitation code in the corresponding state transition model of articulatory motion.
  6.  A speech synthesis method based on single-model speech recognition synthesis, using a phoneme unit articulatory motion storage unit that stores in advance a state transition model of articulatory motion stored for each fixed speech unit, a speech recognition unit that performs speech recognition with reference to the state transition model, and a speech synthesis unit that performs speech synthesis while obtaining an optimal articulation sequence from the state transition model, wherein
     the speech recognition unit performs a speech acquisition step of acquiring speech, an articulation feature extraction step of extracting articulation features of the speech acquired in the speech acquisition step, a first storage control step of storing the articulation features extracted in the articulation feature extraction step in storage means, and an optimal speech unit sequence identification step of comparing the articulation feature time-series data read out from the articulation feature storage means with the state transition model to identify an optimal speech unit sequence, and
     the speech synthesis unit performs an optimal articulation feature sequence generation step of estimating an optimal state sequence of articulatory motion from the optimal speech unit sequence and generating an articulation feature sequence, a second storage control step of storing the optimal articulation feature sequence data generated in the optimal articulation feature sequence generation step in storage means, a speech synthesis parameter sequence conversion step of converting the articulation feature sequence data read out from the optimal articulation feature sequence data storage means into a speech synthesis parameter sequence, a third storage control step of storing the speech synthesis parameter sequence converted in the speech synthesis parameter sequence conversion step in storage means, and a step of synthesizing speech from the speech synthesis parameters read out from the speech synthesis parameter sequence storage means and a driving excitation signal.
  7.  The speech synthesis method according to claim 6, wherein the phoneme unit articulatory motion storage unit stores a coefficient set of a hidden Markov model (HMM) representing articulatory motion, and can be referred to in the optimal speech unit sequence identification step of the speech recognition unit and in the optimal articulation feature sequence generation step of the speech synthesis unit.
  8.  The speech synthesis method according to claim 6 or 7, wherein the articulation feature extraction step uses an analysis filter that performs Fourier analysis of the digital speech signal, and comprises a local feature extraction step having a time-axis differential feature extraction step and a frequency-axis differential feature extraction step, and a discriminative phoneme feature extraction step processed by a multilayer neural network.
  9.  The speech synthesis method according to any one of claims 6 to 8, wherein the state transition model is created using the speech of many speakers, and the step of converting the articulation feature sequence data into a speech synthesis parameter sequence is created either from the speech of a specific speaker alone, or by adaptive training, with the speech of the specific speaker, of a step of converting the articulation feature sequence data into a speech synthesis parameter sequence that was created from unspecified speakers.
  10.  The speech synthesis method according to any one of claims 6 to 9, wherein the step of synthesizing speech from the speech synthesis parameters and the driving excitation signal uses a driving excitation codebook, and comprises a step of selecting an optimal driving excitation by comparing speech synthesized from the speech synthesis parameters and a driving excitation code with the original training speech, and a step of registering the selected driving excitation code in the corresponding state transition model of articulatory motion.
  11.  A speech synthesis program for causing a computer to operate as each processing means of the speech synthesis apparatus according to any one of claims 1 to 5.
  12.  A speech synthesis program for causing a computer to execute each processing step of the speech synthesis method according to any one of claims 6 to 10.
PCT/JP2010/053802 2009-03-09 2010-03-08 Voice synthesis apparatus based on single-model voice recognition synthesis, voice synthesis method and voice synthesis program WO2010104040A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2011503812A JP5574344B2 (en) 2009-03-09 2010-03-08 Speech synthesis apparatus, speech synthesis method and speech synthesis program based on one model speech recognition synthesis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009-055784 2009-03-09
JP2009055784 2009-03-09

Publications (1)

Publication Number Publication Date
WO2010104040A1 true WO2010104040A1 (en) 2010-09-16

Family

ID=42728329

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2010/053802 WO2010104040A1 (en) 2009-03-09 2010-03-08 Voice synthesis apparatus based on single-model voice recognition synthesis, voice synthesis method and voice synthesis program

Country Status (2)

Country Link
JP (1) JP5574344B2 (en)
WO (1) WO2010104040A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000066694A (en) * 1998-08-21 2000-03-03 Sanyo Electric Co Ltd Voice synthesizer and voice synthesizing method
JP2002351791A (en) * 2001-05-30 2002-12-06 Mitsubishi Electric Corp Electronic mail communication equipment, electronic mail communication method and electronic mail communication program
JP2003271182A (en) * 2002-03-18 2003-09-25 Toshiba Corp Device and method for preparing acoustic model
JP2004012584A (en) * 2002-06-04 2004-01-15 Nippon Telegr & Teleph Corp <Ntt> Method for creating information for voice recognition, method for creating acoustic model, voice recognition method, method for creating information for voice synthesis, voice synthesis method, apparatus therefor, program, and recording medium with program recorded thereon

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUN HIROI ET AL.: "Very Low Bit Rate Speech Coding Based on HMMs", IEICE TECHNICAL REPORT, vol. 98, no. 264, 11 September 1998 (1998-09-11), pages 39 - 44 *
KEIICHI TOKUDA: "Speech Syntehsis Based on Hidden Markov Models", IEICE TECHNICAL REPORT, vol. 99, no. 255, 5 August 1999 (1999-08-05), pages 47 - 54 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014056235A (en) * 2012-07-18 2014-03-27 Toshiba Corp Voice processing system
JP2022516784A (en) * 2019-01-11 2022-03-02 ネイバー コーポレーション Neural vocoder and neural vocoder training method to realize speaker adaptive model and generate synthetic speech signal
JP7274184B2 (en) 2019-01-11 2023-05-16 ネイバー コーポレーション A neural vocoder that implements a speaker-adaptive model to generate a synthesized speech signal and a training method for the neural vocoder
KR20210014526A (en) * 2019-07-30 2021-02-09 주식회사 케이티 Server, device and method for providing speech systhesis service
KR102479899B1 (en) * 2019-07-30 2022-12-21 주식회사 케이티 Server, device and method for providing speech systhesis service
CN110751940A (en) * 2019-09-16 2020-02-04 百度在线网络技术(北京)有限公司 Method, device, equipment and computer storage medium for generating voice packet

Also Published As

Publication number Publication date
JP5574344B2 (en) 2014-08-20
JPWO2010104040A1 (en) 2012-09-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10750793

Country of ref document: EP

Kind code of ref document: A1

DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2011503812

Country of ref document: JP

122 Ep: pct application non-entry in european phase

Ref document number: 10750793

Country of ref document: EP

Kind code of ref document: A1