WO2010104040A1

WO2010104040A1 - Voice synthesis apparatus based on single-model voice recognition synthesis, voice synthesis method and voice synthesis program

Info

Publication number: WO2010104040A1
Application number: PCT/JP2010/053802
Authority: WO
Inventors: 恒雄新田
Original assignee: 国立大学法人豊橋技術科学大学
Priority date: 2009-03-09
Filing date: 2010-03-08
Publication date: 2010-09-16
Also published as: JP5574344B2; JPWO2010104040A1

Abstract

Disclosed are a voice synthesis apparatus, voice synthesis method and voice synthesis program capable of implementing voice synthesis of a specified individual with high quality using few items of learned voice data. The voice synthesis apparatus learns a transition model (225) of articulatory movement stored for each of fixed voice units such as phonemes, from an unspecified large number of speakers. The voice synthesis apparatus is provided with means (230) for converting to voice synthesis parameters that carry vocal tract shape information whereby a series of articulatory features is adapted to individuals and an optimum voice unit series is obtained at the same time by comparing this model with the input voice. In addition, the voice synthesis apparatus obtains high-quality synthesised voice for a specified individual by registering sound source code in a state transition model of articulatory movement using closed loop learning employing a drive sound source codebook.

Description

Speech synthesis apparatus, speech synthesis method and speech synthesis program based on one model speech recognition synthesis

The present invention relates to a speech synthesizer based on 1-model speech recognition synthesis, a speech synthesis method based on 1-model speech recognition synthesis, and a speech synthesis program based on 1-model speech recognition synthesis. More specifically, a model speech that extracts articulation features from speech utterances, constructs a state transition model related to articulatory motion that can be used for speech recognition, and synthesizes speech using the same articulatory motion state transition model The present invention relates to a speech synthesizer based on recognition synthesis, a speech synthesis method, and a speech synthesis program. One model means that a common (that is, one) state transition model is used for both speech recognition and speech synthesis.

Two types of speech recognition technology and speech synthesis technology are known as user interfaces using speech input / output. In speech recognition technology, pattern recognition processing using phonemes, syllables, words, and the like as recognition units has been generally performed based on the result of feature analysis processing such as frequency spectrum. This is based on the assumption that the human auditory nervous system has spectrum analysis capability, and that higher-level language processing is performed on the cerebrum for the spectrum time series. The speech recognition apparatuses developed so far classify words or word strings based on acoustic features composed of spectral time series.

Next, in speech synthesis technology, the waveform connection method and vocoder method are mainly used. In the waveform connection method, voices are generated by connecting waveform segments in units of phonemes and the like. The vocoder method is a method that simulates articulatory motion in human speech generation, and uses separately the sound source information such as vocal organ vibration information and vocal cord vibration. Specifically, parameters reflecting the movement of the vocal organs, that is, the articulatory motion, are extracted from the speech by PARCOR analysis, etc., and the segments consisting of these spectral envelope information are connected and pitch pulses or noise sequences are added to the excitation source. To generate audio.

Thus, current speech recognition and speech synthesis are realized as two different systems. On the other hand, from recent brain research, a hypothesis that humans perceive speech as articulatory motion instead of speech as an acoustic signal is promising (see Non-Patent Document 1).

Regarding the processing of spoken language in the human brain, it is first related to the French P.A. in 1861 that the Broca area, which governs the movement of articulatory muscles, is deeply involved in speech. P. Discovered by Broca. When this part is damaged, broker aphasia (motor aphasia), in which the fluency of speech is lost, is observed, so it was thought to be responsible mainly for the speech generation system. Subsequently, Wernickeno, who was involved in understanding the content of the utterance, Discovered by Wernicke. In this part of the disease, Wernicke aphasia (sensory aphasia), which utters fluent but error-prone sentences, is observed, so it was considered to be mainly related to the speech understanding system. In this way, in the case of human beings, there are two organs, the speech organ and the auditory organ, and as described above, the different functions of the two brain regions have been observed, and the 2-system theory has become dominant. The vocoder in speech synthesis described earlier was also developed in 1928 by H.C. When Dudley first made a device, the articulation command from the brain was shown in the figure, and the device that extracts the movement of the vocal organs with a band filter group and simultaneously extracts and transmits the sound source is realized with a vacuum tube circuit . The idea of this vocoder was later developed in 1969 by F.C. Itakura and B.I. It has been completed as a linear predictive coding (LPC) by Atal and is the basis of current voice communication.

Then, in 1976, H.C. McGurk discovered the McGark effect. This is because, for example, when an image uttered as / ga / is displayed on the screen and at the same time the voice of / ba / is presented from a speaker, it is determined that it is / da / or / ga /. This supported the theory that understanding was handled by the 1-system responsible for articulatory movements in the brain. The controversy over whether human speech generation and understanding is 1-system or 2-system has continued for a long time, but in recent years, brain research has greatly progressed with fMRI, etc. The global processing mechanism including the cooperation between the broker and Wernicke fields is related to the understanding, and the 1-system theory is dominant. In recent years, research on accurately extracting commands related to articulatory movements has been active in the field of speech recognition, while speech synthesis from articulatory commands is being observed by fMRI or the like.

In this way, the 1-system theory is becoming prominent, but there are many obstacles in putting such a system into practical use. As a system closest to realization, there is a hidden Markov model (Hidden Markov Model; hereinafter referred to as HMM) synthesis (see Non-Patent Document 2).

This system applies the HMM currently used as a standard in voice recognition, and the operation of the system is shown in FIG. The learning part of the HMM not shown in the figure uses a spectral parameter sequence (here, Mel Cepstrum Coefficient (hereinafter sometimes referred to as MFCC)) and pitch parameters based on a probability distribution in multiple spaces. Learning using the Baum-Welch algorithm by HMM. At this time, a state duration distribution is constructed from a trellis or the like obtained when the HMM 101 expressing the spectrum sequence of a specific speaker is continuously learned. In the synthesizer, after text is input and prosodic information is given by text analysis, each state of the HMM is continued based on the state duration distribution, and an excitation waveform generated from the obtained spectrum and pitch is expressed by MLSA (Mel Log). : Mel logarithm) The synthesized speech waveform is obtained through the synthesis filter 102.

On the other hand, human beings can listen to an unspecified number of human voices by listening to very few human voices, such as the parent's voice waveform, since they were young children. This fact suggests that the human brain listens by converting speech into an invariant feature pattern called articulation.

In the method disclosed in Non-Patent Document 2, the synthesis unit is configured by the specific speaker HMM created from the speech spectrum information of the specific speaker. There is a drawback of requiring data. Further, when this HMM is used for speech recognition, since it is an HMM designed with the speech of a specific speaker, only a low speech recognition result can be obtained for a large number of speakers other than the speaker.

The present invention has been made to solve the above-described problems, and realizes a function that conflicts with the conventional methods of high speech recognition performance for unspecified speakers and clear speech synthesis for specific individuals 1 An object of the present invention is to provide a speech synthesizer based on model speech recognition synthesis, a speech synthesis method, and a speech synthesis program.

In order to solve the above-described problem, in the speech synthesizer according to the first aspect of the present invention, a phoneme unit articulation motion storage unit that stores in advance a state transition model of articulation motion stored for each predetermined speech unit; A speech synthesizer based on one-model speech recognition synthesis, comprising: a speech recognition unit that performs speech recognition with reference to a state transition model; and a speech synthesis unit that performs speech synthesis while acquiring an optimal articulation sequence from the state transition model. The voice recognition unit includes voice acquisition means for acquiring voice, articulation feature extraction means for extracting the articulation feature of the voice acquired by the voice acquisition means, and articulation extracted by the articulation feature extraction means. The first storage control means for storing the features in the storage means, the articulation feature time series data read from the articulation feature storage means and the state transition model are compared to identify the optimum speech unit series An optimal speech unit sequence identifying unit, wherein the speech synthesis unit estimates an optimal state sequence related to articulation motion from the optimal speech unit sequence and generates an articulation feature sequence; and the optimal articulation feature sequence generation unit A second storage control unit that stores in the storage unit the optimum articulation feature sequence data generated in step (b), and converts the articulation feature sequence data read from the storage unit of the optimum articulation feature sequence data into a speech synthesis parameter sequence. Read out from the speech synthesis parameter series conversion means, the third storage control means for storing the speech synthesis parameter series converted by the speech synthesis parameter series conversion means in the storage means, and the speech synthesis parameter series storage means. And a means for synthesizing speech from the driving sound source signal.

In the speech synthesizer of the invention according to claim 2, the phoneme unit articulation motion storage unit stores a coefficient set of a hidden Markov model (HMM) expressing articulation motion, and the optimal speech unit sequence of the speech recognition unit. It can be referred to from the discriminating means and the optimum articulation feature sequence generating means of the speech synthesizer.

Further, in the speech synthesizer of the invention according to claim 3, the articulation feature extraction means includes an analysis filter for Fourier-analyzing the digital signal of speech, a local feature having a time axis differential feature extraction unit and a frequency axis differential feature extraction unit. It is characterized by comprising an extracting unit and a discriminative phoneme feature extracting unit having a multi-layer neural network configured in one or more stages.

In the speech synthesizer according to claim 4, the state transition model is created using a multi-speaker voice, and means for converting the articulation feature series data into a speech synthesis parameter series is provided. The means for converting the articulation feature series data created by only the voice or by the unspecified speaker into a voice synthesis parameter series is created by adaptive learning with the voice of the specific speaker.

In the speech synthesizer of the invention according to claim 5, the means for synthesizing speech from the speech synthesis parameter and the driving excitation signal provides a driving excitation codebook and the speech synthesized from the speech synthesis parameter and the driving excitation code. And a means for selecting an optimal driving sound source by comparing with the original learning speech and a means for registering the selected driving sound source code in a corresponding articulatory motion state transition model.

In the speech synthesis method according to the sixth aspect of the present invention, a phoneme unit articulation motion storage unit that stores in advance a state transition model of articulation motion stored for each predetermined speech unit, and speech recognition with reference to the state transition model. A speech synthesis method based on one-model speech recognition synthesis, comprising: a speech recognition unit to perform; and a speech synthesis unit that performs speech synthesis while acquiring an optimal articulation sequence from the state transition model,
The speech recognition unit stores a speech acquisition step for acquiring speech, a articulation feature extraction step for extracting the articulation feature of the speech acquired in the speech acquisition step, and a articulation feature extracted in the articulation feature extraction step. A first storage control step for storing in the means; and an optimum speech unit sequence identification step for comparing the articulation feature time-series data read from the articulation feature storage means and the state transition model to identify the optimum speech unit sequence. A speech synthesis unit that estimates an optimal state sequence related to articulation motion from the optimal speech unit sequence and generates an articulation feature sequence; and an optimal articulation feature sequence generation step that generates the optimal articulation feature sequence A second storage control step for storing the articulatory feature sequence data in the storage unit; and the optimum articulation feature sequence data read from the storage unit A speech synthesis parameter sequence conversion step for converting the sound feature sequence data into a speech synthesis parameter sequence; a third storage control step for storing in the storage means the speech synthesis parameter sequence converted in the speech synthesis parameter sequence conversion step; And synthesizing speech from the speech synthesis parameters read from the speech synthesis parameter series storage means and the driving sound source signal.

In the speech synthesis method according to the seventh aspect of the present invention, the phoneme unit articulation motion storage unit stores a coefficient set of a hidden Markov model (HMM) expressing articulation motion, and the optimal speech unit sequence of the speech recognition unit. It can be referred to in the identification step and the optimum articulation feature sequence generation step of the speech synthesizer.

In the speech synthesis method according to the eighth aspect of the present invention, the articulation feature extraction step includes a local feature including an analysis filter that performs Fourier analysis on a digital signal of speech, a time axis differential feature extraction step, and a frequency axis differential feature extraction step. It is characterized by comprising an extraction step and a discrimination phoneme feature extraction step processed by a multilayer neural network.

In the speech synthesis method of the invention according to claim 9, the state transition model is created using a multi-speaker speech, and the step of converting the articulation feature sequence data into a speech synthesis parameter sequence includes: The means for converting the articulation feature series data created by only the voice of the speaker or the unspecified speaker into a speech synthesis parameter series is created by adaptive learning with the voice of the specific speaker.

In the speech synthesis method according to claim 10, in the step of synthesizing speech from the speech synthesis parameter and the driving excitation signal, a driving excitation codebook is provided, and the speech synthesized from the speech synthesis parameter and the driving excitation code is provided. Are compared with the original learning speech, and an optimum driving sound source is selected, and the selected driving sound source code is registered in a corresponding articulatory motion state transition model.

In the speech synthesis program of the invention according to claim 11, a computer is driven as each processing means of the speech synthesis apparatus according to claim 1.

In the speech synthesis program according to the twelfth aspect, the computer is driven as each processing step of the speech synthesis method according to any one of the sixth to tenth aspects.

The speech synthesizer of the invention according to claim 1 is different from the “information based on spectrum” of the specific speaker used by the conventional HMM synthesizer, and extracts the “information based on articulatory motion” to extract the HMM synthesizer. Constitute. For this reason, since the HMM synthesis part is composed of parameters that are basically invariant to the articulatory speaker, there is an advantage that the learning speech data of each speaker is unnecessary or very small for the HMM part. In order to generate speech, it is necessary to convert articulatory motion into motion of a specific speaker's speech organ, but this portion can be realized with a small amount of speech data. Since the speech of the speaker is regarded as an invariant as a state transition model of articulatory movement, and the speech operation of the specific speaker is converted into a speech synthesis parameter series, both can be grasped separately. In this way, the articulatory motion command part (articulatory motion state transition model and phoneme unit articulatory memory unit) to speech organs that can be regarded as invariant speech synthesis, different speech organs for each individual and their By separating the parts related to the operation (optimum speech unit sequence identification means and optimum articulation feature sequence generation means), it is possible to realize a high-quality speech synthesizer that matches the characteristics of the individual speech organs.

In particular, in speech recognition using features derived from the conventional speech spectrum, the spectrum varies greatly depending on the speaker, the context at the time of speech or the surrounding noise, etc., so it is used when obtaining the acoustic likelihood. The HMM design required a lot of voice data. * On the other hand, when the articulatory feature is an input feature to the HMM, there are advantages that even a small number of learning speakers can obtain sufficient phoneme recognition performance and the number of HMM mixture distributions can be reduced. *

In the speech synthesizer of the invention according to claim 2, since the HMM coefficient set expressing the articulation motion is stored in the phoneme unit articulation motion storage unit, the optimum speech unit sequence identifying means and the optimum articulation feature referencing this In the sequence generation means, speech recognition processing and speech synthesis processing are realized by parameters that are basically unchanged for the speaker.

In the speech synthesizer of the invention according to claim 3, since the articulatory feature extracting unit is configured by the local feature extracting unit and the discriminative phoneme feature extracting unit, the discriminating feature based on the articulatory motion is input to the HMM. Therefore, sufficient phoneme recognition performance can be obtained with a small number of learning speakers.

The speech synthesizer according to the invention of claim 4 extracts “information based on articulatory motion of unspecified majority speakers” instead of “information based on the spectrum of specific speakers” used by the conventional HMM synthesizer. Thus, the HMM synthesizing apparatus is configured. Thereby, in addition to the effect of the above invention, the HMM synthesis part can be made common to the speakers, and each speaker has the advantage that the learning speech data can be made unnecessary for the HMM part in principle. In addition, the speech synthesis is separated into the articulatory motion command part for the speech organs and the speech organs and parts related to the motions that are different for each person, and the former is used for the speaker by using the articulation feature data of many speakers. On the other hand, it is possible to achieve both high-quality speech synthesis and high speech recognition performance in accordance with the characteristics of the individual speech organs by configuring as a more invariant articulation operation command.

In addition, in order to make it possible to obtain synthesized speech adapted to individual voices with less data, coupled with the realization of high phoneme recognition performance, unknown words that are problematic in voice dialogue are Enable the same response as you do. That is, when an unknown word appears, it is possible to easily synthesize a confirmation utterance for answering using an articulation feature sequence corresponding to the unknown word part.

The speech synthesizer of the invention according to claim 5 is a closed loop learning concept of CELP (Code Excited Linear Prediction) widely used in speech communication for driving sound source signals that greatly affect the sound quality of synthesized sound (non-patent document). 4) and the technology of PSOLA (Pitch Synchronous Overlap and Add) (see Non-Patent Document 5), which is also widely used for waveform synthesis, in addition to the effects of the above invention, the optimum driving excitation code Is selected and registered in the corresponding articulatory motion state transition model, and high-quality speech can be obtained by synthesizing speech while referring to the model.

The speech synthesis method of the invention according to claim 6 is different from the “information based on spectrum” of a specific speaker used in the conventional HMM synthesis method, and extracts the “information based on articulatory motion” to extract the HMM synthesis method. Constitute. For this reason, since the HMM synthesis part is composed of parameters that are basically invariant to the speaker, which is an articulatory movement, each speaker has the advantage that learning speech data is unnecessary or requires a very small amount for the HMM part. In order to generate speech, it is necessary to convert articulatory motion into motion of a specific speaker's speech organ, but this portion can be realized with a small amount of speech data. Since the speech of the speaker is regarded as an invariant as a state transition model of articulatory movement, and the speech operation of the specific speaker is converted into a speech synthesis parameter series, both can be grasped separately. In this way, the articulatory motion command part (articulatory motion state transition model and phoneme unit articulatory memory unit) to speech organs that can be regarded as invariant speech synthesis, different speech organs for each individual and their By separating the operation-related parts (optimum speech unit sequence identification step and optimum articulation feature sequence generation step), it is possible to realize a high-quality speech synthesis method that matches the characteristics of the individual utterance organs.

In particular, in speech recognition using features derived from the conventional speech spectrum, the spectrum varies greatly depending on the speaker, the context at the time of speech or the surrounding noise, etc., so it is used when obtaining the acoustic likelihood. The HMM design required a lot of voice data. On the other hand, when the articulatory feature is an input feature to the HMM, there are advantages that even a small number of learning speakers can obtain sufficient phoneme recognition performance and the number of HMM mixture distributions can be reduced.

In the speech synthesis method of the invention according to claim 7, since the HMM coefficient set expressing the articulation motion is stored in the phoneme unit articulation motion storage unit, the optimum speech unit sequence identification step and the optimum articulation feature referencing this In the sequence generation step, speech recognition processing and speech synthesis processing are realized by parameters that are basically unchanged for the speaker.

In the speech synthesis method of the invention according to claim 8, since the articulatory feature extraction step is configured by the local feature extraction step and the discriminative phoneme feature extraction step, the discrimination feature based on the articulatory motion is the input feature to the HMM. Therefore, sufficient phoneme recognition performance can be obtained with a small number of learning speakers.

The speech synthesis method of the invention according to claim 9 extracts “information based on articulatory motion of unspecified majority speakers” instead of “information based on the spectrum of specific speakers” used in the conventional HMM synthesis method. Thus, the HMM synthesis method is configured. Thereby, in addition to the effect of the above invention, the HMM synthesis part can be made common to the speakers, and each speaker has the advantage that the learning speech data can be made unnecessary for the HMM part in principle. In addition, the speech synthesis is separated into the articulatory motion command part for the speech organs and the speech organs and parts related to the motions that are different for each person, and the former is used for the speaker by using the articulation feature data of many speakers. On the other hand, it is possible to achieve both high-quality speech synthesis and high speech recognition performance in accordance with the characteristics of the individual speech organs by configuring as a more invariant articulation operation command.

The speech synthesis method of the invention according to claim 10 is similar to the CELP closed loop learning concept widely used in speech communication (see Non-Patent Document 4) for driving sound source signals that greatly affect the sound quality of synthesized speech. By introducing the PSOLA technology widely used for waveform synthesis (see Non-Patent Document 5), the optimum driving excitation code is selected and registered in the corresponding articulatory motion state transition model, while referring to this High-quality speech can be obtained by speech synthesis.

Since the speech synthesis program of the invention according to claim 11 can drive a computer as the speech synthesis processing means according to any of claims 1 to 5, the effects of the invention according to claims 1 to 5 can be obtained. Can play.

Since the speech synthesis program of the invention according to claim 12 can drive a computer as each processing step of the speech synthesis method according to any of claims 6 to 10, the invention according to claims 6 to 10. The effect of can be produced.

It is a schematic diagram which shows the HMM speech synthesis process based on the spectrum information of a specific speaker. It is a schematic diagram which shows the electrical structure of a speech synthesizer. It is a figure which shows an example of the discriminative phoneme characteristic showing an articulation characteristic. It is the figure which compared the phoneme recognition performance at the time of using an MFCC feature and an articulation feature. It is a functional block diagram which shows the speech synthesis process performed with a speech synthesizer. It is a block diagram which shows the functional detail of an articulation feature extraction part. It is a figure which shows an example of the articulation feature obtained in a discrimination phoneme feature extraction part. It is a figure explaining the operation | movement of the HMM speech synthesis based on an articulation feature. It is a figure explaining the code | symbol selection from the drive excitation codebook utilized by speech synthesis. It is the figure which compared the sound source waveform used in the speech synthesizer with the sound source waveform as the residual of the original speech. It is the figure which compared the spectrum envelope of the synthetic | combination speech produced | generated by the speech synthesizer, and the spectrum envelope of the original speech. It is the figure which compared the synthetic speech waveform produced | generated in the speech synthesizer with the original speech. It is the figure which showed the example of a structure of 1 model speech recognition synthesis system.

Embodiments of the present speech synthesis apparatus and speech synthesis method will be described below with reference to the drawings. These drawings are used to explain the technical features that can be adopted by the present invention, and the configuration of the apparatus described, the flow of various processes, etc., unless otherwise specified. It is not intended to be limited to that, but merely an illustrative example.

First, the electrical configuration of the speech synthesizer 1 will be described with reference to FIG. FIG. 2 shows an electrical configuration of the speech synthesizer 1. As shown in this figure, the speech synthesizer 1 includes a central processing unit 11, an input device 12, an output device 13, a storage device 14, and an external storage device 15.

The central processing unit 11 is provided for performing processing such as numerical computation and control, and performs computation and processing according to the processing procedure described in the present embodiment. For example, a CPU or the like can be used. The input device 12 is configured by a microphone, a keyboard, or the like, and inputs a voice uttered by a user or a character string input by a key. The output device 13 includes a display, a speaker, and the like, and outputs a voice synthesis result or information obtained by processing the voice synthesis result. The storage device 14 stores processing procedures (speech synthesis program) executed by the central processing unit 11 and temporary data necessary for the processing. For example, ROM (Read Only Memory) or RAM (Random Access Memory) can be used.

The external storage device 15 is used for the articulation feature series set used for the speech synthesis process, the neural network weight coefficient set used for the articulation feature extraction process, and the conversion process from the articulation feature series data to the speech synthesis parameter series. Set of neural network weight coefficients, HMM state transition model set of articulation motion, optimal articulation feature sequence data, model necessary for speech recognition processing, input speech data, speech synthesis parameter sequence data, drive sound source codebook It is provided for storing sets, analysis result data, and the like. For example, a hard disk drive (HDD) can be used. And these are electrically connected through the bus | bath 22 so that transmission / reception of data mutually is possible.

The hardware configuration of the speech synthesizer 1 of the present invention is not limited to the configuration shown in FIG. Accordingly, a communication I / F connected to a communication network such as the Internet may be provided.

In this embodiment, the speech synthesizer 1 and the speech synthesis program have a configuration independent of other systems, but the present invention is not limited to this configuration. Therefore, a configuration incorporated as a part of another device or a configuration incorporated as a part of another program may be employed. Further, the input in that case is indirectly performed through the other devices and programs described above.

Next, the storage data stored in the external storage device 15 will be described. The stored data is divided into each area and stored in the external storage device 15, and as shown in FIG. 2, the articulation feature storage area 16 in which the articulation features are stored, and the hidden Markov model in which the hidden Markov model is stored. A storage area 17, an optimal articulation feature sequence storage area 18 in which an optimal articulation feature sequence is stored, an input voice storage area 19 in which input speech is stored, a speech synthesis parameter storage area 20 in which speech synthesis parameters are stored, and synthesis A synthesized speech storage area 21 for storing the processed speech, a processing result storage area 22 for storing processed data, a coefficient storage area 23 for storing coefficients used in each processing, and other areas. It has been.

The articulation feature storage area 16 stores a discrimination feature series of speech. Discrimination features were proposed to classify phonemes (phonemes) based on structural features related to articulation, and are voiced / non-voiced / continuous / semi-vowel / bursting / friction / friction. Sexual / lingual / nasal / high tongue / low tongue / (position where tongue rises) anterior / posterior / ...; (Distinctive Fe
(character: DF). In addition, many methods for directly extracting articulatory features such as discriminative features from speech have been proposed, including a method using a neural network (see Non-Patent Document 6).

The hidden Markov model storage area 17 stores a hidden Markov model that is referred to when speech recognition or speech synthesis is performed in the central processing unit 11. The optimum articulation feature sequence storage area 18 stores an optimum articulation feature sequence as a result of searching the central processing unit 11 with reference to the hidden Markov model. The input voice storage area 19 stores voice data input via the input device 12. The speech synthesis parameter storage area 20 stores a speech synthesis parameter as a result calculated by the central processing unit 11 with reference to the weighting coefficient (coefficient storage area 23) of the neural network. The synthesized speech storage area 21 stores the synthesized speech data obtained as a result of referring to the speech synthesis parameter 20 and the driving sound source codebook set in the coefficient storage area 23 in the central processing 11. The processing result storage area 22 stores data obtained as a result of various processes executed in the central processing unit 11. The coefficient storage area 23 is used for a neural network weighting coefficient set for extracting articulation features, a neural network weighting coefficient set used for converting articulation feature series data into speech synthesis parameters, and used for speech synthesis. A codebook set for driving sound source is stored. Details of these data will be described later.

Here, the discriminative phoneme features used for the discriminative feature series stored in the articulation feature storage area 16 will be described in detail. As an example, Japanese phonemes are shown in FIG. 3 as distinctive phonemic features (hereinafter, sometimes referred to as DPF). Here, the discriminative phoneme feature is one method of expressing articulatory features. In the figure, the vertical column shows the distinguishing features, and the horizontal column shows the individual phonemes. In the figure, (+) means having a distinguishing feature for each phoneme, and (-) means not having that feature. In addition, when grasping discriminative phoneme features for languages other than Japanese, in addition to these discriminative features and phonemes, discriminative features or phonemes specific to the language are also considered.

And, from this table, it is possible to know the operation of the vocal organs necessary for generating one phoneme. In FIG. 3, nil (high / low) assigns a distinguishing feature to phonemes that do not belong to either high or low tongue, and nil (front / rear) is (the position where the tongue rises) This is for assigning a discrimination feature to a phoneme that does not belong to either forward or backward, and indicates a newly added feature. Thus, it is known that the speech recognition performance is improved by balancing the phonemes.

In addition, as an expression of the articulation feature, those described in a table widely used as an international phonetic alphabet (hereinafter referred to as IPA) may be used. This IPA table is divided into consonant and vowel tables, and the consonants are classified by the articulation position and articulation method. The articulation position includes lips, gums, hard palate, soft palate, glottis and the like, and the articulation method includes rupture, friction, rubbing, bullet, nasal sound, semi-vowel and the like. There is voiced and unvoiced for each. For example, / p / is a consonant, and is classified into unvoiced sound, lip sound, and plosive sound. On the other hand, vowels are classified according to the place where the tongue is most prominent and the size of the space between the tongue and the palate. The place where the tongue is most prominent is distinguished from the front (front tongue), back (rear tongue) or middle (middle tongue), and the space between the tongue and the palate can be narrow, semi-narrow, half-wide or wide. It is divided. For example, / i / is a front vowel and a narrow vowel. In the case of using IPA, as in the discrimination feature table shown in FIG. 3, the part having the articulatory feature (the part of consonant, unvoiced sound, lip sound, burst sound is taken as +, for example, / p /), Otherwise-.

In the conventional speech recognition using features derived from the speech spectrum, the spectrum greatly fluctuates depending on the speaker, the context at the time of speech, the ambient noise, etc., so that the design of the HMM used when obtaining the acoustic likelihood is used. I needed a lot of audio data. In recent speech recognition apparatuses based on HMM, a speech spectrum is used as an input feature, and fluctuations of individual vector elements are expressed from a plurality of normal distributions. Note that, as a speech spectrum that is frequently used in practice, a mel cepstrum (MFCC) is used in which the speech spectrum is scaled according to the auditory characteristics, and the logarithmic value of the spectrum is discrete cosine transformed (DCT). Further, a plurality of normal distributions are called mixed distributions, and in order to cope with the various variations described above, in recent years, those using distributions of 60 to 70 have appeared. Thus, it can be said that the reason why the enormous memory and calculation are necessary is the result of trying to classify phonemes and words without specifying the variables hidden in the speech. On the other hand, if the articulation feature is used, the number of HMMs can be reduced to about several (see Non-Patent Document 3).

Therefore, the phoneme recognition performance and the articulation feature (specifically using the discriminating feature (DPF, which will be described later)) when learning the HMM in phonemes using MFCC in FIG. 4 are input features to the HMM. The graph which compared phoneme recognition performance is shown. In this figure, the horizontal axis indicates the number of mixed distributions (1, 2, 4, 8, 16 from the left) necessary for expressing the HMM, and the amount of computation required for recognition as the number of mixtures increases. Has also increased. The bar graph shown for each mixture number indicates the number of male speakers used for HMM learning. For each mixture number, one person, two persons, four persons, eight persons, and 33 persons from the left, and x indicates 100 persons It is. The change at this time is shown by a line graph (the broken line is MFCC and the solid line is DPF). As is apparent from this figure, in the conventional method, the phoneme recognition performance improves as the number of learners increases, but it can be seen that the performance saturates unless the number of HMM distribution mixture is increased. As described above, the conventional speech recognition using MFCC as a characteristic parameter requires a large amount of speaker data in order to achieve high phoneme recognition, and the amount of calculation required for the recognition is enormous. On the other hand, when the DPF is used, as is apparent from the figure, even a small number of learning speakers (one person) shows sufficient phoneme recognition performance, and the number of HMM mixture distributions may be small. it is obvious. In speech recognition, in addition to speaker differences, there is noise superposition, etc., so it is necessary to increase the number of HMMs to be mixed. However, as shown in the figure, at least the speaker has articulation characteristics. It can be understood that it is an invariant. Therefore, such invariant articulatory features are stored as articulatory motion state transition models (HMMs) so that they can be commonly referenced in speech recognition and speech synthesis.

Next, speech recognition processing and speech synthesis processing executed by the speech synthesizer 1 will be described with reference to FIGS. FIG. 5 is a functional block diagram showing speech recognition and speech synthesis processing executed by the speech synthesizer 1. As shown in this figure, as a functional block necessary for speech recognition processing and speech synthesis processing executed in the speech synthesizer 1, an input unit 201, an A / D conversion unit 202, an articulation feature extraction unit 210, and a speech recognition unit 220 are illustrated. , Optimum articulation feature / speech synthesis parameter conversion unit (in the figure, described as optimum articulation feature sequence (right arrow) speech synthesis parameter conversion unit) 230, speech synthesis unit 240, D / A conversion unit 206, output unit 205 , An articulation feature calculation storage unit 207, a phoneme unit articulation movement storage unit 225, and a speech synthesis storage unit 235 are provided.

The articulation feature calculation storage unit 207 stores various coefficient sets 2071 for speech analysis, neural network weighting coefficient sets for articulation feature calculation, and the like. The phoneme unit articulation movement storage unit 225 stores a coefficient set 2251 of an HMM model expressing the articulation movement. The coefficient set 2251 stored therein is a voice recognition unit 220 and an optimal articulation feature sequence / voice synthesis. The parameter conversion unit 230 can refer to it. The speech synthesis storage unit 235 stores a speech synthesis parameter set 2351 that is a calculation result of the optimum articulation feature sequence / speech synthesis parameter conversion unit 230 and a driving excitation codebook 2352. Then, the speech synthesizer 240 configures a digital filter using a speech synthesis parameter (corresponding to a change in the vocal tract shape) as a coefficient, and synthesizes speech using the drive excitation input read from the drive excitation codebook 2352. The synthesized voice is sent to the output unit 205 via the D / A conversion unit 206 and sent out from the speaker.

The input unit 201 is provided for receiving sound input from the outside and converting it into an analog electric signal. The A / D conversion unit 202 is provided to convert an analog signal received by the input unit 201 into a digital signal. The articulatory feature extraction unit 210 is provided to extract a predetermined feature amount necessary for speech recognition. Also, the articulatory feature extraction unit 210 extracts time-series data of articulatory features (from the time-series data of feature amounts extracted by the analysis filter). Hereinafter, it is provided for extracting “articulation feature series”. The speech recognition unit 220 is provided to search for phonemes, syllables, words, and the like included in speech from the articulation feature series obtained from the articulation feature extraction unit 210. In this search, the articulatory motion model coefficient set 2251 of the phoneme unit articulation motion storage unit 225 is referred to. The output unit 205 is provided to output phonemes, syllables, and words (sequences) obtained as a result of the search performed by the speech recognition unit 220, and at the same time, output synthesized speech that will be described later.

In the speech recognition process, unknown speech input from the input unit 201 is discretized through the A / D conversion unit 202 and converted into a digital signal. The converted digital signal is output to the articulation feature extraction unit 210. As shown in FIG. 6, the articulation feature extraction unit 210 that extracts the articulation feature from the digital signal includes an analysis filter 211, a local feature extraction unit 212, and a discriminative (phoneme) feature extraction unit 213.

In the analysis filter 211, first, the digital signal converted by the A / D converter 202 is subjected to Fourier analysis (using a Hamming window having a window width of 24 to 32 msec). Next, it is passed through a band pass filter of about 24 channels to extract frequency components. As a result, a speech spectrum sequence and a speech power sequence at intervals of 5 to 10 msec are extracted. The obtained speech spectrum sequence and speech power sequence are output to local feature extraction section 212.

In the local feature extraction unit 212, the time axis differential feature extraction unit 2121 and the frequency axis differential feature extraction unit 2122 extract differential features in the time axis direction and the frequency direction. In addition, although not shown, the time axis differential feature of the audio power sequence is calculated separately. In extracting these differential features (hereinafter referred to as “local features”), linear regression calculation is used to suppress the influence of noise fluctuations and the like. The extracted local features are output to the discriminative phoneme feature extraction unit 213. The data output to the discriminative phoneme feature extraction unit 213 is a little inferior in performance other than the above-mentioned local features, but the speech spectrum or a cepstrum obtained by orthogonalizing the speech spectrum (actually the frequency axis is a A mel cepstrum obtained by scaling) may be used.

The discriminative phoneme feature extraction unit 213 extracts the articulation feature series based on the local features extracted by the local feature extraction unit 212. The discriminative phoneme feature extraction unit 213 includes two-stage

neural networks

2131 and 2132.

As shown in FIG. 6, the neural network constituting the discriminative phoneme feature extraction unit 213 is a two-stage circuit including a first multilayer neural network 2131 at the first stage and a second multilayer neural network 2132 at the next stage. Consists of The first multilayer neural network 2131 extracts an articulatory feature sequence from the correlation between local features obtained from the speech spectrum sequence and the speech power sequence. Further, the second multilayer neural network 2132 extracts a meaningful subspace from the context information of the articulation feature series, that is, the interdependence between frames, and obtains an accurate articulation feature series.

FIG. 7 shows an example of the articulation feature extraction result calculated by the discriminative phoneme feature extraction unit 213. This figure shows the articulation feature extraction result obtained for the utterance “jinkose” which is the Japanese reading of “artificial satellite”. In this way, it is understood that the articulation features extracted by the two-stage

neural networks

2131 and 2132 have high accuracy.

In addition to the two-stage configuration shown in FIG. 6, the configuration of the neural network for obtaining the articulatory feature sequence may be a one-stage configuration at the expense of performance (see Non-Patent Document 3). ). Each neural network has a hierarchical structure, and has one or two hidden layers excluding an input layer and an output layer (this is called a multilayer neural network). A so-called recurrent neural network having a structure that feeds back from the output layer or hidden layer to the input layer may be used. When compared in terms of performance for articulatory feature extraction, the results calculated in each neural network are not significantly different. These neural networks function as articulatory feature extractors through learning of the weighting coefficient shown in Non-Patent Document 7 (see Non-Patent Document 7).

Further, learning by the neural network of the discriminative phoneme feature extraction unit 213 is performed by adding voice local feature data to the input layer and giving the voice articulation feature to the output layer as a teacher signal.

As described above, the articulation feature sequence extracted by the articulation feature extraction unit 210 is output to the speech recognition unit 220, and an optimal speech unit sequence is obtained while referring to the articulation motion model coefficient set 2251 of the phoneme unit articulation motion storage unit 225. At the same time, it is used for speech synthesis using speech synthesis parameters, which will be described later, and the articulation feature series is synthesized into speech specialized for an individual (see FIG. 5).

This completes the explanation of the voice recognition unit. In the above description, the input unit 201 corresponds to the voice acquisition unit of the invention according to the speech synthesizer, and the articulation feature extraction unit 210 corresponds to the articulation feature extraction unit. The voice recognition unit 220 corresponds to an optimum voice unit sequence identification unit, the central processing unit 11 corresponds to each storage control unit, and the external storage unit 15 corresponds to each storage unit. The phoneme unit articulation motion storage unit 225 corresponds to the phoneme unit articulation motion storage unit, and the HMM based on the articulation characteristics of the unspecified speaker stored therein corresponds to the state transition model of articulation motion. Furthermore, the steps processed based on these functions correspond to the steps in the speech recognition unit of the invention according to the speech synthesis method.

Next, the operation of HMM speech synthesis based on articulation features will be described. As shown in FIG. 5, in the speech synthesis process, the optimum articulation feature sequence / speech synthesis parameter conversion unit 230 generates an HMM model coefficient set 2251 representing the articulation motion stored in the phoneme unit articulation motion storage unit 225. While referencing, a speech synthesis parameter is generated and output to the speech synthesis unit 240. Note that text data (or voice data) input by the input unit 201 is used as data to be combined.

FIG. 8 is an explanatory diagram of the operation of the optimal articulation feature sequence / speech synthesis parameter converter 230 in HMM speech synthesis. As shown in this figure, when an optimum articulation feature sequence on the Viterbi path is given from the HMM based on the articulation feature of an unspecified speaker, next, three layers of articulation features of a total of three frames before and after the time t are placed. An articulatory feature series / speech synthesis parameter (here, PARCOR coefficient) conversion unit 230 is configured using the PARCOR coefficient corresponding to the teacher data as input to the neural network.

The HMM is a probabilistic model that expresses a non-stationary time series signal by making a state transition between a plurality of stationary signal sources, and is suitable for the expression of a time series that varies due to various factors such as speech. As the output probability distribution, a multidimensional normal mixed distribution represented by a weighted sum of multidimensional normal distributions is often used, and this embodiment is also the same. As a result, it is possible to finely model complex fluctuations caused by the speaker and the surrounding environment.

That is, the learning of the model parameter λ of the HMM is formulated as shown in Equation 1 in the form of obtaining λ that maximizes the observation likelihood Ρ (O | λ) for a given learning vector sequence O. ing.

The λ can be derived based on an EM (Expectation Maximization) algorithm.

The initial phoneme model can be obtained by the segmental k-means method if a phoneme label is assigned to the speech data for learning. In addition, if no phoneme boundary is given, an initial model is created from a small amount of data with a label, and then connected learning is performed using a large amount of phoneme data without a phoneme boundary. Can do. In speech recognition, when an unknown vector sequence O is observed, it is estimated from which model λ it is generated (Ρ (O | λ)). This can be obtained from a Bayesian judgment formula.

Next, speech synthesis will be described. In the case of speech synthesis, there is a problem of giving a parameter time series that a certain model λ generates with the highest probability. When a continuous output distribution type HMMλ is given, an output vector sequence (see Equation 2) having a length T is generated from λ. Get.

Further, here, in order to simplify the problem, when the probability on the Viterbi path is shown after being decomposed into the mixed distribution substate, the equation 4 is obtained, and in this equation, O is maximized.

Incidentally, o _T when considering only static characteristics c _t shown in Formula 5, the output of the individual frames, independently of the output before and after the frame, the average of the distribution corresponding to the frame Therefore, a discontinuity occurs in the spectrum at the transition from one state to the next state.

In order to avoid such discontinuities, dynamic features are introduced into output parameters.

The driving sound source illustrated in FIG. 8 is created by multi-streams of articulation feature sequences and driving sound source codes when performing HMM learning using learning speech data. At this time, as shown in FIG. 9, by applying a closed-loop learning algorithm used in the CELP codebook selection, the (residual) segment with the smallest error is selected and simultaneously driven to the corresponding articulation motion state. By registering the sound source code, high-quality synthesized speech can be obtained. That is, the speech waveform obtained by passing all the drive excitations through the synthesis filter (PARCOR synthesis filter) is compared with the original waveform, and the drive excitation code with less error is selected. The driving excitation codebook can configure a compact and efficient codebook by registering representative segments by clustering from learning speech data and by making the registered codebook a tree structure.

This completes the explanation of the speech synthesis unit. In the above description, the portion of the optimal articulation feature sequence / speech synthesis parameter conversion unit 230 that acquires the optimal articulation feature sequence with reference to the HMM coefficient set 2251 (see FIG. 8) is related to the speech synthesizer. It corresponds to an optimal articulation feature sequence generation unit, and a PARCOR coefficient conversion unit corresponds to a speech synthesis parameter sequence conversion unit. Further, the speech synthesizer (PARCOR synthesis filter) 240 corresponds to means for synthesizing speech from speech synthesis parameters and drive sound source signals. The central processing unit 11 corresponds to each storage control unit, the external storage unit 15 corresponds to each storage unit, and the phoneme unit articulation motion storage unit 225 corresponds to the phoneme unit articulation motion storage unit, and is stored in this. The point that the HMM based on the articulation characteristics of the unspecified speaker is equivalent to the state transition model of articulation movement is the same as in the case of the speech recognition apparatus. Furthermore, the steps processed based on these functions correspond to the steps in the speech synthesizer of the invention relating to the speech synthesis method.

The excitation waveform created from the driving excitation codebook as in this embodiment was compared with the original waveform. In FIG. 10, (a) is a residual excitation waveform extracted from the original speech, (b) is a speech waveform approximated from a pulse train and noise conventionally used, and (c) is from the driving excitation codebook of this embodiment. The created sound source waveform is shown. It can be seen that the excitation waveform created from the excitation codebook is close to the residual waveform when the original speech is subjected to PARCOR analysis.

Also, the spectrums of the synthesized speech and the original speech according to the present embodiment were compared in the PARCOR analysis. 11A shows the spectrum of the original speech, FIG. 11B shows the spectrum of the synthesized speech obtained by converting the articulation feature series into the speech synthesis parameters (PARCOR coefficient sequence) based on the articulation features obtained from the speech, ) Shows the spectrum of the synthesized speech (HMM / DPF / PARCOR analysis) of this embodiment. As is clear from comparison between FIGS. 11A and 11C, the synthesized speech of the present embodiment has a high-frequency spectrum smoothed by the smoothing of the HMM. It can be seen that the original speech spectrum shape is maintained. Further, the spectrum of (b) is also similar to (c), and can be used to know the articulation feature extraction result of the input voice in talkback when confirming the voice recognition result.

Furthermore, the synthesized speech waveforms were compared. 12, (a) is the original speech waveform, (b) is the speech waveform synthesized using the excitation waveform approximated from the pulse train and noise, and (c) and (d) are synthesized using the driving excitation codebook. It is a voice waveform. Note that (c) is based on the driving excitation codebook of a specific speaker, and (d) is based on the driving excitation codebook of an unspecified speaker. As is clear from this figure, (c) and (d) obtain waveforms close to the original speech. However, (d) creates a driving excitation codebook from the voices of an unspecified number of speakers, and extracts the voices of specific speakers (articulation features are extracted and used for multi-layer neural network learning for speech synthesis parameter conversion. In the case of a codebook created only from the person (c), a slight deterioration is seen in (d) compared with (c). Therefore, a process for tuning a specific speaker is required. Therefore, the sound quality can be improved by learning by including a small amount of specific speaker voice in the code book created from a large number of unspecified many speaker voices. In addition, for multi-layer neural networks that simultaneously convert articulation features into speech synthesis parameters, the conversion accuracy can be improved by learning a small amount of specific speaker speech as a user for a large amount of unspecified speaker speech. Can do.

In the above description, the voice is acquired, the articulation feature series is extracted, the optimal articulation series is obtained from the articulation motion model of the HMM, further converted into the voice synthesis parameters, and the synthesized voice is output.
However, the present invention is not limited to such use, and a kanji-kana mixed sentence input from a keyboard is also converted into a kana sequence after being converted into a kana sequence, as a normal speech synthesizer performs. , The distinctive phoneme feature as the articulation feature has a one-to-one correspondence with the kana character so that it can be easily understood, and it is possible to easily synthesize speech through conversion of the kana character / articulation feature series. .

FIG. 13 is a usage form in which voice is synthesized by first inputting text from a keyboard, and secondly, the recognition result text is displayed on the display through voice recognition from the voice, and the recognition result is re-synthesized and voiced. And third, a usage mode in which the output from the articulation feature extraction unit 40 (extracted articulation feature) is converted by the articulation feature / vocal tract parameter conversion unit 43 and voice confirmation is performed (path 47 in the figure). Is possible.

In the first usage mode, the text-phoneme conversion unit 46 in FIG. 13 uses a word dictionary (not shown) to convert the text into a phoneme sequence. In the word dictionary, “reading, part of speech, accent” is stored for each word notation item, the text is first divided into morphemes (words) with reference to the word dictionary, and then the phoneme sequence from the word reading And the accent position and intonation of the whole sentence are determined. The phoneme and prosody sequence is sent to the articulation feature / vocal tract parameter conversion unit 43, and the articulation features and sound source segments are read out from each state of the HMM, which is the common articulation model 42 stored in units of phonemes. (See FIGS. 8 and 9). Subsequently, the articulation feature is converted into a sound path parameter such as a PARCOR coefficient, and this and a driving sound source (residual signal) are sent to the speech synthesizer 45 and converted into synthesized speech.

In the second usage mode, the text of the speech recognition result is output and processed in the same manner as the key-operated text. Therefore, the recognition result text (word or sentence) is the same as in the first usage mode. (Word string)), the synthesized speech is returned to the user through the same process as the first usage pattern.

In the third usage mode, as described above, since the articulation feature is given as shown by the path 47 (FIG. 13), the vocal tract parameter is obtained via the articulation feature / vocal tract parameter conversion unit 43. It is done. For the other sound source signal necessary for the speech synthesizer, a residual signal is extracted from the input speech by a residual signal calculator (not shown) that calculates the residual when the speech is subjected to PARCOR analysis. It is sent to the speech synthesizer 45 together with the vocal tract parameters to obtain synthesized speech. In this third usage mode, since the computer can know whether or not the user's voice has been extracted as a correct articulation operation, the user can obtain information on misjudgment of voice recognition processing, There is an advantage that it can be applied to pronunciation training (particularly pronunciation training for foreign languages) as an active use.

DESCRIPTION OF SYMBOLS 1 Speech synthesizer 11 Central processing unit 12 Input device 13 Output device 14 Storage device 15 External storage device 201 Input unit 202 A / D conversion unit 205 Output unit 206 D / A conversion unit 207 Articulation feature calculation storage unit 210 Articulation feature Extraction unit 211 Analysis filter 212 Local feature extraction unit 213 Discriminative phoneme feature extraction unit 220 Speech recognition unit 230 Optimal articulation feature sequence / speech synthesis parameter conversion unit 235 Speech synthesis storage unit 240 Speech synthesis unit

Claims

Optimal from the phoneme unit articulation motion storage unit that prestores the state transition model of articulation motion stored for each fixed speech unit, the speech recognition unit that performs speech recognition with reference to the state transition model, and the state transition model A speech synthesizer based on a one-model speech recognition synthesis comprising a speech synthesizer that performs speech synthesis while acquiring an articulation sequence,
The voice recognition unit stores voice acquisition means for acquiring voice, articulation feature extraction means for extracting the articulation feature of the voice acquired by the voice acquisition means, and the articulation feature extracted by the articulation feature extraction means. First storage control means for storing in the means, and optimum speech unit sequence identification means for comparing the articulation feature time-series data read from the articulation feature storage means and the state transition model to identify the optimum speech unit sequence Including
The speech synthesizer estimates an optimal state sequence related to articulation movement from the optimal speech unit sequence and generates an articulation feature sequence, and an optimal articulation feature sequence generated by the optimal articulation feature sequence generation unit Second storage control means for storing data in the storage means; speech synthesis parameter sequence conversion means for converting the articulation feature sequence data read from the storage means for the optimal articulation feature sequence data into a speech synthesis parameter sequence; From the third storage control means for storing the speech synthesis parameter series converted by the speech synthesis parameter series conversion means in the storage means, the speech synthesis parameters read from the speech synthesis parameter series storage means, and the driving sound source signal A speech synthesizer comprising: means for synthesizing speech.
The phoneme unit articulation motion storage unit stores a coefficient set of a hidden Markov model (HMM) expressing articulation motion, and an optimum speech unit sequence identification unit of the speech recognition unit and an optimal articulation feature sequence generation unit of the speech synthesis unit The speech synthesizer according to claim 1, wherein the speech synthesizer can be referred to.
The articulation feature extraction means comprises an analysis filter for Fourier analysis of a digital audio signal, a local feature extraction unit having a time axis differential feature extraction unit and a frequency axis differential feature extraction unit, and a multilayer neural network in one or more stages. The speech synthesizer according to claim 1, further comprising a discriminative phoneme feature extraction unit.
The state transition model is created using a multi-speaker voice, and means for converting the articulation feature series data into a speech synthesis parameter series is created by only a specific speaker's voice or an unspecified speaker. 4. The speech synthesizer according to claim 1, wherein the means for converting the articulation feature sequence data into a speech synthesis parameter sequence is created by adaptive learning with the voice of a specific speaker.
In the means for synthesizing speech from the speech synthesis parameter and the driving excitation signal, a driving excitation codebook is provided, and an optimum driving excitation is determined by comparing the speech synthesized from the speech synthesis parameter and the driving excitation code with the original learning speech. 5. The speech synthesizer according to claim 1, further comprising means for selecting and means for registering the selected drive excitation code in a corresponding articulatory motion state transition model.
Optimal from the phoneme unit articulation motion storage unit that prestores the state transition model of articulation motion stored for each fixed speech unit, the speech recognition unit that performs speech recognition with reference to the state transition model, and the state transition model A speech synthesis method based on a one-model speech recognition synthesis comprising a speech synthesis unit that performs speech synthesis while acquiring an articulation sequence,
The speech recognition unit stores a speech acquisition step for acquiring speech, a articulation feature extraction step for extracting the articulation feature of the speech acquired in the speech acquisition step, and a articulation feature extracted in the articulation feature extraction step. A first storage control step for storing in the means; and an optimum speech unit sequence identification step for comparing the articulation feature time-series data read from the articulation feature storage means and the state transition model to identify the optimum speech unit sequence. Including
The speech synthesizer estimates an optimal state sequence related to articulation movement from the optimal speech unit sequence and generates an articulation feature sequence, and an optimal articulation feature sequence generated in the optimal articulation feature sequence generation step A second storage control step for storing data in a storage unit; a speech synthesis parameter sequence conversion step for converting the articulation feature sequence data read from the storage unit for the optimal articulation feature sequence data into a speech synthesis parameter sequence; From the third storage control step of storing the speech synthesis parameter sequence converted in the speech synthesis parameter sequence conversion step in the storage means, from the speech synthesis parameters read from the storage means of the speech synthesis parameter series and the driving sound source signal A speech synthesis method comprising: synthesizing speech.
The phoneme unit articulation motion storage unit stores a coefficient set of a Hidden Markov Model (HMM) expressing articulation motion, an optimal speech unit sequence identification step of the speech recognition unit, and an optimal articulation feature sequence generation step of the speech synthesis unit The speech synthesis method according to claim 6, wherein the speech synthesis method can be referred to.
The articulation feature extraction step includes an analysis filter for Fourier analysis of a digital signal of speech, a local feature extraction step having a time axis differential feature extraction step and a frequency axis differential feature extraction step, and a discriminative phoneme processed by a multilayer neural network. The speech synthesis method according to claim 6, further comprising a feature extraction step.
The state transition model is created using a multi-speaker voice, and the step of converting the articulation feature series data into a speech synthesis parameter series is created by only a specific speaker's voice or an unspecified speaker. 9. The speech synthesis method according to claim 6, wherein the step of converting the articulation feature sequence data into a speech synthesis parameter sequence is created by adaptive learning with the speech of a specific speaker.
In the step of synthesizing speech from the speech synthesis parameter and the drive excitation signal, a drive excitation codebook is provided, and an optimum drive excitation is determined by comparing the speech synthesized from the speech synthesis parameter and the drive excitation code with the original learning speech. 10. The speech synthesis method according to claim 6, further comprising a step of selecting and a step of registering the selected driving excitation code in a corresponding articulatory motion state transition model.
A speech synthesis program for driving a computer as each processing means of the speech synthesizer according to any one of claims 1 to 5.
A speech synthesis program for driving a computer as each processing step of the speech synthesis method according to claim 6.