JPH10274992A

JPH10274992A - Method and device for forming voice model learning data

Info

Publication number: JPH10274992A
Application number: JP9270247A
Authority: JP
Inventors: Yasunaga Miyazawa; 康永宮沢; Sunao Aizawa; 直相澤; Mitsuhiro Inazumi; 満広稲積; Hiroo Hasegawa; 浩男長谷川
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1997-01-30
Filing date: 1997-10-02
Publication date: 1998-10-13

Abstract

PROBLEM TO BE SOLVED: To create a speech model for a new word that is absent from the database quickly and inexpensively. SOLUTION: Of the utterance data obtained from many speakers and held in advance as a database, the utterance data of at least one speaker are used as standard speaker data and the rest as learning speaker data (step s1), and a conversion function from the standard speaker data space to the learning speaker data space is created from the word data already held (step s2). To create learning data for a new word, the data obtained when the standard speaker utters the new word are converted into the learning speaker data space using the conversion function, and the learning data for the new word are created from the result (steps s3, s4).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] Field of the Invention: The present invention relates to a method and an apparatus for creating the learning data used to train a speech model for unspecified-speaker (speaker-independent) speech recognition.

[0002] Reference numerals: 44, standard speaker code data storage unit; 45, learning speaker code data storage unit; 46, code data conversion processing unit.

[0003] Description of the Related Art: One speech recognition technique for unspecified speakers uses DRNN (Dynamic Recurrent Neural Networks) speech models. (The present applicant has already filed applications on DRNN-based speech recognition, including Japanese Unexamined Patent Publications Nos. H6-4079 and H6-119476.)

[0004] In a DRNN speech model, the weights between units and the biases are determined according to a predetermined learning rule so that, when the feature vector sequence of a word is input as time-series data, an appropriate output for that word is obtained. As a result, for the speech data of a word uttered by an unspecified speaker, the model produces an output close to the teacher output for that word.

[0005] For example, when the time-series feature vector sequence of the word "ohayou" ("good morning") uttered by an unspecified speaker is input, the data for each dimension of the feature vector at each time are given to the corresponding input units and transformed by the weights and biases set according to the learning rule, so as to obtain an output close to the ideal output (teacher output) for that word. This processing is carried out in time order, one time step after another, over the whole feature vector sequence of the single word that is input as time-series data. In this way, for the speech data of a word uttered by an unspecified speaker, an output close to the teacher output for that word is obtained.

[0006] A DRNN speech model is prepared for every word to be recognized. The learning rule that changes the weights so that each model yields an appropriate output for its word is described on pages 17 to 24 of IEICE Technical Report SP92-125 (1993-01), published by the Institute of Electronics, Information and Communication Engineers.

[0007] Not only for the DRNN speech model but for unspecified-speaker speech models in general, the usual practice is to have a large number of unspecified speakers (several hundred) utter each word (for example, about 200 words), hold the learning data created from the resulting per-word utterance data as a database, and create the speech models by training on that database.

[0008] Problems to Be Solved by the Invention: However, a user may request a speech model for a word that is not in the database. Conventionally, creating a speech model for such a word required having several hundred speakers utter the word, creating learning data for the word from that utterance data, and then creating the speech model from the learning data.

[0009] Thus, creating a speech model for a new word requires collecting utterance data from several hundred speakers before the learning data for training the model can be created. Creating the speech model therefore takes a long time and is costly.

[0010] An object of the present invention is therefore to provide a speech model learning data creation method and apparatus that, when a speech model is created for a word not in the database, can create learning data equivalent to several hundred speakers from the utterance data of one or a few speakers selected as standard speakers, so that a speech model for the new word can be provided quickly and inexpensively.

[0011] Means for Solving the Problems: As set forth in claim 1, the speech model learning data creation method of the present invention is a method for creating learning data for training a speech model for speech recognition, characterized in that: of the utterance data obtained from many speakers and held in advance as a database, the utterance data of at least one speaker are used as standard speaker data and the rest as learning speaker data; a conversion function from the standard speaker data space to the learning speaker data space is created from the word data already held; and, when learning data are created for a new word, the data obtained when the standard speaker utters the new word are converted into the learning speaker data space using the conversion function to create the learning data for the new word.

[0012] Consequently, when a speech model is created for a new word not in the database, learning data for that word can be created from the utterance data of a small number of standard speakers. It is no longer necessary, as it was conventionally, to collect utterance data from several hundred speakers to create the learning data for the new word's speech model, so the speech model can be created quickly and inexpensively.

[0013] The invention of claim 2 is the invention of claim 1, characterized in that the data in the standard speaker data space and the learning speaker data space are feature vectors for each word obtained by frequency analysis of the speech signal, and that the conversion of the data obtained when the standard speaker utters the new word into the learning speaker data space using the conversion function is performed using difference vectors between the feature vectors that make up each word in the standard speaker data space and the feature vectors that make up the same word in the learning speaker data space.

[0014] Because the invention of claim 2 uses the feature vectors obtained by frequency analysis of the speech signal as they are (for example, data expressed as 10-dimensional LPC cepstrum coefficients), highly accurate data are obtained. Moreover, because the data are converted from the standard speaker data space to the learning speaker data space using difference vectors obtained in advance, the conversion is both simple and accurate.

[0015] The invention of claim 3 is the invention of claim 1, characterized in that the data in the standard speaker data space and the learning speaker data space are code data obtained by vector-quantizing the feature vectors for each word obtained by frequency analysis of the speech signal, and that the conversion of the data obtained when the standard speaker utters the new word into the learning speaker data space using the conversion function is performed by vector-quantizing the new word data in the standard speaker data space to obtain code data and mapping that code data into the learning speaker data space.

[0016] That is, the invention of claim 3 vector-quantizes the feature vectors obtained by frequency analysis of the speech signal before processing them. The data become somewhat coarser, but the processing is simpler and the processing time is shorter.

[0017] As set forth in claim 4, the speech model learning data creation apparatus of the present invention is an apparatus for creating learning data for training a speech model for speech recognition, comprising: a pseudo-learning word data creation unit that has a standard speaker data storage unit storing, as standard speaker data, the utterance data of at least one speaker out of the utterance data obtained from many speakers and held in advance as a database, a learning speaker data storage unit storing the utterance data of the speakers other than the standard speaker as learning speaker data, and a data conversion unit that converts data from the standard speaker data space to the learning speaker data space using a conversion function obtained in advance; and a learning data storage unit that stores the data created by the pseudo-learning word data creation unit. The apparatus is characterized in that, when learning data are created for a new word, the data obtained when the standard speaker utters the new word are converted into the learning speaker data space using the conversion function to create pseudo-learning word data for the new word, and the learning data are created from that pseudo-learning word data.

[0018] Consequently, when a speech model is created for a new word not in the database, learning data for that word can be created from the utterance data of a small number of standard speakers. It is no longer necessary, as it was conventionally, to collect utterance data from several hundred speakers to create the learning data for the new word's speech model, so the speech model can be created in a short time.

[0019] The invention of claim 5 is the invention of claim 4, characterized in that the standard speaker data stored in the standard speaker data storage unit and the learning speaker data stored in the learning speaker data storage unit are feature vectors for each word obtained by frequency analysis of the speech signal, and that the conversion of the data obtained when the standard speaker utters the new word into the learning speaker data space using the conversion function is performed using difference vectors between the feature vectors that make up each word in the standard speaker data space and the feature vectors that make up the same word in the learning speaker data space.

[0020] Because the invention of claim 5 uses the feature vectors obtained by frequency analysis of the speech signal as they are, highly accurate data are obtained. Moreover, because the data are converted from the standard speaker data space to the learning speaker data space using difference vectors obtained in advance, the conversion is both simple and accurate.

[0021] The invention of claim 6 is the invention of claim 4, characterized in that the standard speaker data stored in the standard speaker data storage unit and the learning speaker data stored in the learning speaker data storage unit are code data obtained by vector-quantizing the feature vectors for each word obtained by frequency analysis of the speech signal, and that the conversion of the data obtained when the standard speaker utters the new word into the learning speaker data space using the conversion function is performed by vector-quantizing the new word data in the standard speaker data space to obtain code data and mapping that code data into the learning speaker data space.

[0022] That is, the invention of claim 6 vector-quantizes the feature vectors obtained by frequency analysis of the speech signal before processing them. The data become somewhat coarser, but the processing is simpler, the processing time is shorter, and the memory that stores the standard speaker data and learning speaker data can be small.

[0023] Embodiments of the Invention: Embodiments of the present invention are described below.

[0024] (First Embodiment) FIG. 1 is a flowchart outlining an embodiment of the present invention. In FIG. 1, from the utterance data of several hundred speakers (here, 300) already held as a database, a few arbitrarily chosen speakers (for example, five) are selected as standard speakers and their utterance data are used as standard speaker data; the remaining speakers are treated as learning speakers and their utterance data are used as learning speaker data (step s1).

[0025] Conversion functions from each standard speaker data space to the learning speaker data spaces are then created using the data for the words already in the database (about 200 words) (step s2).

[0026] In other words, based on the word data held as a database, a conversion function from a given standard speaker data space to each learning speaker data space is created in advance using all of the words. This is done for every standard speaker, so that a conversion function from each standard speaker data space to each learning speaker data space is available.

[0027] To create learning data for a new word not in the database, each of the standard speakers utters the new word, and the utterance data are converted into each learning speaker data space using the conversion functions, producing pseudo-learning word data equivalent to several hundred speakers (step s3). The pseudo-learning word data created in this way are then used to create the learning data for the new word (step s4).

[0028] Thus, when a speech model is created for a new word not in the database, learning data for that word can be created from the utterance data of a few standard speakers. This processing is described in more detail below.

[0029] FIG. 2 shows the utterance data (for example, feature vectors expressed as 10-dimensional LPC cepstrum coefficients) for several words from one speaker selected as a standard speaker and from one speaker other than the standard speakers (a learning speaker), taken from the utterance data of the several hundred speakers held as a database. Data for about 200 words actually exist, but feature vectors for only three words are shown here.

[0030] To simplify the explanation, FIG. 2 shows only the data of one standard speaker (standard speaker data 10) and one learning speaker (learning speaker data 20). The feature vectors of three words A, B, and C in the standard speaker data space have already been put into correspondence with the feature vectors of the same words A, B, and C in the learning speaker data space. From the word data held in the database for the standard speaker data space and for the learning speaker data space, a data conversion function from the standard speaker data space to the learning speaker data space is created in advance. Here, difference vectors are used as this data conversion function.

[0031] For example, the feature vector Vk1 of word A in the standard speaker data space and the feature vector Vt1 of word A in the learning speaker data space are related, with V1 as their difference vector, by

Vt1 = Vk1 + V1

Likewise, the feature vector Vk2 of word A in the standard speaker data space and the feature vector Vt2 of word A in the learning speaker data space are related, with V2 as their difference vector, by

Vt2 = Vk2 + V2

That is, adding the difference vector V1 to the standard speaker feature vector Vk1 gives the learning speaker feature vector Vt1, and adding the difference vector V2 to Vk2 gives Vt2; this is how data are converted from the standard speaker data space to the learning speaker data space. In the same way, adding V3 to Vk3 gives Vt3, adding V4 to Vk4 gives Vt4, and adding V5 to Vk5 gives Vt5.
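The difference-vector conversion function described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the patent: feature vectors are plain lists of floats, the standard speaker and learning speaker vectors are assumed to be already paired one-to-one, the toy data are 2-dimensional rather than 10-dimensional, and the names `difference_vectors` and `apply_difference` are hypothetical.

```python
from typing import List

Vector = List[float]


def difference_vectors(standard: List[Vector], learning: List[Vector]) -> List[Vector]:
    """Return V_i = Vt_i - Vk_i for each aligned pair of feature vectors."""
    if len(standard) != len(learning):
        raise ValueError("standard and learning vectors must be paired one-to-one")
    return [[t - k for k, t in zip(vk, vt)] for vk, vt in zip(standard, learning)]


def apply_difference(vk: Vector, v: Vector) -> Vector:
    """Map a standard speaker vector into the learning speaker space: Vt = Vk + V."""
    return [k + d for k, d in zip(vk, v)]


# Toy 2-dimensional data (assumed values, for illustration only).
standard_vectors = [[1.0, 2.0], [3.0, 1.0]]  # Vk1, Vk2
learning_vectors = [[1.5, 2.5], [2.0, 2.0]]  # Vt1, Vt2

diffs = difference_vectors(standard_vectors, learning_vectors)  # V1, V2
restored = [apply_difference(vk, v) for vk, v in zip(standard_vectors, diffs)]
```

Storing one difference vector per stored feature vector means the conversion function is a lookup table rather than a fitted parametric model, which matches the description of the difference vectors being obtained in advance.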

[0032] Using this conversion function, the utterance data of a new word spoken by a standard speaker are converted into the learning speaker data space to obtain pseudo-learning word data for the new word. This is explained with reference to FIG. 3.

[0033] FIG. 3(a) shows, in one standard speaker data space, the feature vector sequence obtained when that standard speaker utters a new word N that is to be added to the database. The sequence is assumed to consist of the feature vectors Vkn1, Vkn2, ..., Vkn5 (shown as white circles in the figure).

[0034] Of the feature vectors Vkn1, Vkn2, ..., Vkn5, consider Vkn1 first.

[0035] In FIG. 3(a), several of the feature vectors lying closest to Vkn1 are selected from the feature vectors around it (the feature vectors of the roughly 200 words already held in the database). That is, the many feature vectors that make up the feature vector sequences of, for example, 200 words are scattered through the standard speaker data space, and a few that are close to the feature vector Vkn1 of word N are chosen from among them. Here, three feature vectors close to Vkn1 are selected, and they are assumed to be Vk10, Vk20, and Vk30 (shown as black circles in the figure).

[0036] Each of the three selected feature vectors Vk10, Vk20, and Vk30 has a difference vector to the learning speaker data space; call these V10, V20, and V30. They are used to determine the difference vector Vn1 for the feature vector Vkn1:

Vn1 = μ1·V10 + μ2·V20 + μ3·V30

Here μ1, μ2, and μ3 are weighting coefficients: μ1 depends on the distance between Vkn1 and Vk10, μ2 on the distance between Vkn1 and Vk20, and μ3 on the distance between Vkn1 and Vk30. The smaller the distance, the larger the weight, and the values are set so that μ1 + μ2 + μ3 = 1.

[0037] With the difference vector Vn1 for Vkn1 determined in this way,

Vtn1 = Vkn1 + Vn1

converts Vkn1, one of the feature vectors in the new word's feature vector sequence, into the vector Vtn1 in the learning speaker data space.
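The neighbor-weighted conversion of a single feature vector can be sketched as follows. The text states only that closer neighbors receive larger weights and that the weights sum to 1; the inverse-distance weighting used here is an assumption, as are the Euclidean distance measure, the hypothetical names `interpolate_difference` and `convert`, and the 2-dimensional toy data.

```python
import math
from typing import List, Tuple

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def interpolate_difference(
    vkn: Vector,
    database: List[Tuple[Vector, Vector]],
    k: int = 3,
    eps: float = 1e-9,
) -> Vector:
    """Estimate the difference vector Vn for a new standard speaker vector Vkn.

    `database` holds (Vk, V) pairs: a stored standard speaker feature vector
    and its difference vector toward one learning speaker.
    """
    nearest = sorted(database, key=lambda pair: euclidean(vkn, pair[0]))[:k]
    # Assumption: inverse-distance weights, normalized so that they sum to 1.
    raw = [1.0 / (euclidean(vkn, vk) + eps) for vk, _ in nearest]
    total = sum(raw)
    mus = [r / total for r in raw]
    return [
        sum(mu * v[i] for mu, (_, v) in zip(mus, nearest))
        for i in range(len(vkn))
    ]


def convert(vkn: Vector, database: List[Tuple[Vector, Vector]], k: int = 3) -> Vector:
    """Vtn = Vkn + Vn: move Vkn into the learning speaker data space."""
    vn = interpolate_difference(vkn, database, k)
    return [a + b for a, b in zip(vkn, vn)]


# Toy data: three equidistant neighbors and one distant vector that is ignored.
database = [
    ([1.0, 0.0], [3.0, 0.0]),       # (Vk10, V10)
    ([0.0, 1.0], [0.0, 3.0]),       # (Vk20, V20)
    ([-1.0, 0.0], [3.0, 3.0]),      # (Vk30, V30)
    ([10.0, 10.0], [100.0, 100.0]),
]
vkn1 = [0.0, 0.0]
vn1 = interpolate_difference(vkn1, database)
vtn1 = convert(vkn1, database)
```

With three equidistant neighbors the weights are all 1/3, so the estimated difference vector is simply the mean of V10, V20, and V30. The small `eps` term only guards against division by zero when a new vector coincides with a stored one.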

[0038] In the same way, the difference vectors Vn2, Vn3, Vn4, and Vn5 are obtained for the remaining feature vectors Vkn2, Vkn3, Vkn4, and Vkn5 and are used to convert them, so that the whole sequence becomes the learning speaker data space vectors Vtn1, Vtn2, Vtn3, Vtn4, and Vtn5. FIG. 3(b) shows the feature vector sequence of the new word after conversion into the learning speaker data space. This converted sequence is referred to here as pseudo-learning word data.

[0039] The processing described above is between the data of one standard speaker and the data of one learning speaker. In practice, there are standard speaker data for several speakers and learning speaker data for several hundred speakers, and the same processing is performed for each of them.
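Repeating the conversion over every standard speaker and learning speaker could be organized as in the sketch below. It is a hedged illustration only: the per-pair converter is passed in as a callable, `shift_by` is a stand-in that adds a fixed offset instead of the neighbor-weighted difference vectors, and the speaker identifiers, the function name `make_pseudo_word_data`, and the toy data are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Sequence = List[Vector]
PairConverter = Callable[[Vector], Vector]


def make_pseudo_word_data(
    utterances: Dict[str, Sequence],
    converters: Dict[Tuple[str, str], PairConverter],
) -> Dict[Tuple[str, str], Sequence]:
    """Convert each standard speaker's utterance of the new word into every
    learning speaker's data space, one feature vector at a time."""
    pseudo: Dict[Tuple[str, str], Sequence] = {}
    for (std_id, learn_id), convert in converters.items():
        pseudo[(std_id, learn_id)] = [convert(v) for v in utterances[std_id]]
    return pseudo


def shift_by(offset: Vector) -> PairConverter:
    """Stand-in converter that adds a fixed offset (illustration only)."""
    return lambda v: [a + b for a, b in zip(v, offset)]


# New-word utterances keyed by standard speaker id (toy 2-dimensional data).
utterances = {"std1": [[0.0, 0.0], [1.0, 1.0]], "std2": [[2.0, 2.0]]}
# One converter per (standard speaker, learning speaker) pair.
converters = {
    ("std1", "spkA"): shift_by([1.0, 0.0]),
    ("std1", "spkB"): shift_by([0.0, 1.0]),
    ("std2", "spkA"): shift_by([0.5, 0.5]),
}
pseudo = make_pseudo_word_data(utterances, converters)
```

Keying the result by speaker pair keeps each converted sequence traceable to its source utterance; how the sequences from different standard speakers are combined into the final learning data is not specified in the text and is left open here.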

[0040] Through this processing, for a word that was not originally in the database, pseudo-learning word data are created in the learning speaker data spaces of several hundred speakers from the data uttered by a few standard speakers. Learning data are then created from the pseudo-learning word data and used to train, for example, the DRNN speech model described above.

[0041] FIG. 4 shows an example of an apparatus configuration for implementing the first embodiment. In outline, it comprises a speech input unit 1, an A/D conversion unit 2, a speech analysis unit 3, a pseudo-learning word data creation unit 4, an input data storage unit 5, a learning data storage unit 6, and so on.

[0042] The pseudo-learning word data creation unit 4 has a standard speaker data storage unit 41 that stores standard speaker data, a learning speaker data storage unit 42 that stores learning speaker data, and a data conversion processing unit 43, and performs the processing described with reference to FIGS. 1 to 3.

[0043] That is, from the utterance data (feature vectors) of several hundred speakers already held as a database, a few arbitrarily chosen speakers are taken as standard speakers and their utterance data are stored in the standard speaker data storage unit 41; the others are taken as learning speakers and their utterance data (feature vectors) are stored in the learning speaker data storage unit 42. Conversion functions (difference vectors) from each standard speaker data space to the learning speaker data spaces are created in advance for each word, using about 200 words, and the data conversion processing unit 43 converts data on the basis of these conversion functions by the processing described above.

[0044] The processing for creating learning data for a new word not in the database with this configuration is described next.

[0045] First, when a standard speaker utters the new word, the speech is input through the speech input unit 1 to the A/D conversion unit 2 and A/D converted, after which the speech analysis unit 3 converts it into feature vectors expressed as, for example, 10-dimensional LPC cepstrum coefficients, which are stored in the input data storage unit 5.

[0046] The input data conversion unit 43 looks, within the standard speaker data space, at the standard speaker feature vectors that lie around the input speech feature vector stored in the input data storage unit 5, selects the several feature vectors closest to it, and converts the input into each learning speaker's data using the difference vectors obtained in advance. Pseudo-learning word data is thereby created in each learning data space, and this pseudo-learning word data is stored in the learning data storage unit 6.
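
The conversion step described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the paragraph does not say how the difference vectors of the selected neighbours are combined, so the unweighted mean used here, the default `k=3`, and the function name `convert_to_learning_space` are all hypothetical.

```python
import math

def convert_to_learning_space(x, standard_vectors, difference_vectors, k=3):
    """Shift input vector x into one learning speaker's data space.

    standard_vectors[i] is a stored standard speaker feature vector and
    difference_vectors[i] is the precomputed difference vector that carries
    that same vector into the learning speaker's space.
    """
    # Rank the stored standard speaker vectors by Euclidean distance to x.
    order = sorted(range(len(standard_vectors)),
                   key=lambda i: math.dist(x, standard_vectors[i]))
    nearest = order[:k]
    # Assumption: combine the neighbours' difference vectors by a plain mean.
    shift = [sum(difference_vectors[i][d] for i in nearest) / len(nearest)
             for d in range(len(x))]
    return [x[d] + shift[d] for d in range(len(x))]
```

Running the function once per learning speaker, each with its own set of difference vectors, would give one pseudo-learning feature vector per learning speaker for every input frame.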

[0047] Thus, according to the first embodiment, when a speech model is to be created for a new word that is not in the database, learning data for that word can be created from utterance data spoken by only a few standard speakers. It is therefore no longer necessary, as in the conventional approach, to collect utterance data from several hundred speakers to build learning data for the new word's speech model, and the speech model can be created in a short period.

[0048] (Second Embodiment) In the first embodiment, processing was performed directly on the feature vectors represented by, for example, 10-dimensional LPC cepstrum coefficients. In the second embodiment, processing is performed on code vectors obtained by vector-quantizing the feature vectors represented by LPC cepstrum coefficients.

[0049] Specifically, when a standard speaker utters each of about 200 words and each word yields about 30 feature vectors, about 6000 feature vectors are obtained. These are vector-quantized into, for example, 256 code vectors to form a standard speaker codebook. Each code vector of this codebook is then mapped into the learning speaker data space to form a learning speaker codebook, and processing is performed using these codebooks. This processing is described with reference to FIG. 5.
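
One way to picture how a learning speaker codebook could be derived from a standard speaker codebook is sketched below in Python. The paragraph above does not specify how each code vector is mapped, so the rule used here (the mean of the learning speaker vectors whose paired standard speaker vectors fall in the same cell) is purely an assumption, as are the one-to-one pairing of frames between speakers and the names `nearest_index` and `build_learning_codebook`.

```python
import math

def nearest_index(v, codebook):
    """Index of the code vector in `codebook` closest to v (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(v, codebook[i]))

def build_learning_codebook(standard_codebook, paired_frames):
    """Derive a learning speaker codebook with the same indexing.

    paired_frames is a list of (standard_vector, learning_vector) pairs.
    Assumption: the two speakers' frames are already aligned one to one.
    """
    dim = len(standard_codebook[0])
    sums = [[0.0] * dim for _ in standard_codebook]
    counts = [0] * len(standard_codebook)
    for std_vec, learn_vec in paired_frames:
        i = nearest_index(std_vec, standard_codebook)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += learn_vec[d]
    # Cells that received no data fall back to the standard code vector.
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else list(standard_codebook[i])
        for i in range(len(standard_codebook))
    ]
```

Because the result keeps the indexing of the standard codebook, entry i of one codebook corresponds directly to entry i of the other.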

[0050] The standard speaker codebook 50 in FIG. 5 contains the 256 code vectors that make up the code vector sequences of the individual words (about 200 words); only the code vector sequences for word A, word B, and word C are shown here. The learning speaker codebook 60 holds the code vectors mapped from the standard speaker codebook 50 for each word. The code vectors are associated one to one: for example, code vector Ck1 of the standard speaker codebook 50 corresponds to code vector Ct1 of the learning speaker codebook 60, Ck2 corresponds to Ct2, and Ck3 corresponds to Ct3.

[0051] Using the standard speaker codebook 50 and the learning speaker codebook 60, a new word is vector-quantized with the standard speaker codebook 50, and the resulting code vector sequence is mapped to the learning speaker codebook 60 to create pseudo-learning word data. This is described with reference to FIG. 6.

[0052] As shown in FIG. 6, let Vk1, Vk2, Vk3, ..., Vk7 be the feature vector sequence of the utterance data obtained when a standard speaker utters a word N that is to be newly added to the database. This sequence is vector-quantized as follows.

[0053] Here the standard speaker codebook has a size of 256, that is, it consists of 256 unspecified-speaker code vectors. The code vectors of the standard speaker codebook 50 are denoted Ck1, Ck2, Ck3, ..., Ck256. Although the codebook actually contains 256 code vectors, FIG. 6 shows only ten of them, Ck11, Ck12, ..., Ck20, to keep the drawing simple (these code vectors are drawn as black circles in the figure).

[0054] The feature vector sequence of the new word N obtained from the standard speaker's utterance (drawn as white circles in the figure and consisting of feature vectors Vk1, Vk2, Vk3, ..., Vk7) is vector-quantized against this standard speaker codebook 50.

[0055] That is, the distance between each feature vector in the feature vector sequence of word N and each of the code vectors Ck1, Ck2, Ck3, ..., Ck256 scattered through the standard speaker codebook 50 is calculated, and for each feature vector the code vector at the shortest distance is selected. Suppose, for example, that the first and second feature vectors Vk1 and Vk2 are associated with code vector Ck11, the third feature vector Vk3 with code vector Ck13, the fourth feature vector Vk4 with code vector Ck14, and the fifth, sixth, and seventh feature vectors Vk5, Vk6, and Vk7 with code vector Ck15.

[0056] The feature vector sequence of word N is thereby replaced by the code vector sequence Ck11, Ck11, Ck13, Ck14, Ck15, Ck15, Ck15.
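
The nearest-code-vector selection described above can be sketched in Python as follows. The two-dimensional coordinates are invented toy values (the document works with, for example, 10-dimensional LPC cepstrum vectors and 256 code vectors), Euclidean distance is an assumption since the distance measure is not named, and `quantize`, `codebook`, and `frames` are hypothetical names.

```python
import math

def quantize(sequence, codebook):
    """Return, for each feature vector, the label of its nearest code vector."""
    labels = []
    for v in sequence:
        # Pick the code vector at the shortest distance from this frame.
        best = min(codebook, key=lambda name: math.dist(v, codebook[name]))
        labels.append(best)
    return labels

# Toy stand-ins for a few code vectors of the standard speaker codebook.
codebook = {
    "Ck11": (0.0, 0.0),
    "Ck12": (0.0, 5.0),
    "Ck13": (2.0, 0.0),
    "Ck14": (4.0, 0.0),
    "Ck15": (6.0, 0.0),
}
# Toy stand-ins for the seven feature vectors Vk1 ... Vk7 of word N.
frames = [(0.1, 0.2), (0.4, -0.1), (1.8, 0.3), (4.2, 0.1),
          (5.6, 0.0), (6.1, 0.2), (6.3, -0.2)]

labels = quantize(frames, codebook)
```

With these toy coordinates, `labels` reproduces the code vector sequence given in the text: Ck11, Ck11, Ck13, Ck14, Ck15, Ck15, Ck15.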

[0057] Once the feature vector sequence of the new word has been quantized in this way, the resulting code vectors of word N are mapped to the learning speaker codebook 60. Because the code vectors of the standard speaker codebook 50 and the learning speaker codebook 60 correspond to one another, code vector Ck11 of word N in the standard speaker codebook 50 is mapped to code vector Ct11 of the learning speaker codebook 60, Ck13 to Ct13, Ck14 to Ct14, and Ck15 to Ct15. The code vector sequence of word N in the learning speaker codebook 60 is therefore Ct11, Ct11, Ct13, Ct14, Ct15, Ct15, Ct15.
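
Since the two codebooks share their indexing, the mapping step amounts to a table lookup. The Python sketch below illustrates this for several learning speakers at once; the dictionary layout and the name `map_to_learning_codebooks` are assumptions made for the example.

```python
def map_to_learning_codebooks(code_sequence, learning_codebooks):
    """Replace each standard code index by the same-index learning code vector.

    code_sequence: indices obtained by quantizing the new word,
        e.g. [11, 11, 13, 14, 15, 15, 15].
    learning_codebooks: {speaker_id: {index: code_vector}}, one codebook
        per learning speaker (hypothetical layout).
    """
    pseudo_data = {}
    for speaker, codebook in learning_codebooks.items():
        # Look up the corresponding code vector for every frame of the word.
        pseudo_data[speaker] = [codebook[index] for index in code_sequence]
    return pseudo_data
```

Each entry of the returned dictionary is one learning speaker's pseudo-learning word data for the new word.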

[0058] As described above, standard speaker codebooks for a few speakers and learning speaker codebooks for the remaining several hundred speakers are created in advance, one per speaker, based on the learning speaker speech data held in the database. The feature vector sequence obtained when a standard speaker utters a new word is vector-quantized into a code vector sequence, and that sequence is mapped to each learning speaker codebook to create pseudo-learning word data for each learning speaker codebook. Learning data is then created from this pseudo-learning word data and used to train, for example, the DRNN speech model described earlier.

[0059] FIG. 7 shows an example of a device configuration for implementing the second embodiment. In outline, the device comprises a voice input unit 1, an A/D conversion unit 2, a voice analysis unit 3, a pseudo-learning word data creation unit 4, an input data storage unit 5, a learning data creation unit 6, and so on.

[0060] The pseudo-learning word data creation unit 4 has a standard speaker code data storage unit 44 for storing standard speaker data (code vectors), a learning speaker code data storage unit 45 for storing learning speaker data (code vectors), and a code data conversion processing unit 46, and performs the processing described so far.

[0061] That is, of the utterance data of several hundred speakers already held as a database, a few arbitrarily chosen speakers are designated as standard speakers, the utterance data of each standard speaker is vector-quantized to create a standard speaker codebook, and the standard speaker codebook data is stored in the standard speaker code data storage unit 44. The speakers other than the standard speakers are designated as learning speakers, and the utterance data of each learning speaker is used as learning speaker data. The code vectors for about 200 words are mapped from each standard speaker codebook onto the learning speaker data to create learning speaker codebooks, and the learning speaker codebook data is stored in the learning speaker code data storage unit 45.

[0062] When learning data is to be created for a word that is not in the database, each standard speaker utters the new word. The speech is A/D converted by the A/D conversion unit 2, converted by the voice analysis unit 3 into a feature vector signal represented by, for example, 10-dimensional LPC cepstrum coefficients, and stored in the input data storage unit 5.

[0063] The code data conversion processing unit 46 then vector-quantizes these feature vectors using the standard speaker codebook and maps the result to each learning speaker codebook. The pseudo-learning word data created in this way for each learning speaker is stored in the learning data storage unit 6.

[0064] Thus, as in the first embodiment, when a speech model is to be created for a new word that is not in the database, learning data for that word can be created from utterance data spoken by a few standard speakers. It is therefore no longer necessary, as in the conventional approach, to collect utterance data from several hundred speakers to build learning data for the new word's speech model, and the speech model can be created in a short period.

[0065] Because the second embodiment processes vector-quantized data, the data is somewhat coarser, but the amount of processing can be greatly reduced. The first embodiment, by contrast, requires more processing but yields more accurate data.

[0066] In the first and second embodiments described above, when a standard speaker utters a new word to obtain its utterance data, it is better to have the speaker utter the word several times rather than once and to collect the utterance data from each repetition. This yields more varied data that reflects variation in utterance length.

[0067] When the learning data created for a new word is used to train a speech model, it is preferable to train not only on the learning speaker data but also on the data uttered by the standard speakers.

[0068] In each of the embodiments described above, a few speakers were chosen as standard speakers, but a single standard speaker is also possible. Using several standard speakers, however, yields more varied learning data for the same word and therefore better training results.

[0069] When there are several standard speakers, a standard speaker may be selected for each learning speaker. With, for example, 5 standard speakers and 200 learning speakers, pseudo-learning word data with 1000 variations would be created for each word newly added to the database. This does provide many variations, but training then takes a long processing time. Training can be made more efficient by selecting, for each learning speaker, a standard speaker at a short distance and creating learning data from the pseudo-speaker word data obtained by mapping the utterance data of the selected standard speaker.

[0070] Take the first learning speaker as an example. If the utterance data of the first of the five standard speakers is judged to be closer to this learning speaker than the utterance data of the other standard speakers, learning data is created from the pseudo-speaker word data obtained by mapping the utterance data of the first standard speaker only.

[0071] If, in this way, the single closest standard speaker's data is selected for each learning speaker and pseudo-learning-speaker word data is created only for the selected standard speaker, subsequent processing becomes far more efficient.
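
The per-learning-speaker selection can be sketched in Python as below. The text does not define how the distance between two speakers is measured, so the mean Euclidean distance between paired vectors (for example, same-index code vectors) is an assumption, and `speaker_distance` and `select_standard_speaker` are hypothetical names.

```python
import math

def speaker_distance(standard_data, learning_data):
    """Mean Euclidean distance between paired vectors of two speakers.

    Assumption: both lists hold corresponding vectors in the same order.
    """
    total = sum(math.dist(s, t) for s, t in zip(standard_data, learning_data))
    return total / len(standard_data)

def select_standard_speaker(learning_data, standard_speakers):
    """Return the id of the standard speaker closest to one learning speaker.

    standard_speakers: {speaker_id: list of vectors} (hypothetical layout).
    """
    return min(standard_speakers,
               key=lambda name: speaker_distance(standard_speakers[name],
                                                 learning_data))
```

With one standard speaker chosen per learning speaker, the example of 5 standard speakers and 200 learning speakers yields 200 pseudo-learning sequences per word instead of 5 × 200 = 1000.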

[0072] The present invention is of course applicable not only to training DRNN speech models but also to training speech models for other speech recognition means, such as speech models for other neural networks and speech models based on the hidden Markov model (HMM) method. Furthermore, the present invention is not limited to the embodiments described above, and various modifications are possible without departing from the gist of the invention.

[0073] The processing program that performs the processing of the present invention described above can be stored on a storage medium such as a floppy disk, an optical disk, or a hard disk, and the present invention includes such a storage medium. The data may also be obtained from a network.

[0074]

[Effects of the Invention] As described above, according to the present invention, the utterance data of a small number of speakers is selected as standard speaker data from utterance data obtained from a large number of speakers and held in advance as a database, and the rest is used as learning speaker data. When learning data is to be created for a new word that is not in the database, the data uttered by the standard speakers for that word is converted into the learning speaker data space using a predetermined conversion function to create the learning data for the new word. It is therefore no longer necessary, as in the conventional approach, to collect utterance data from several hundred speakers to build learning data for the new word's speech model, and the speech model can be created quickly and inexpensively. For example, to create a new speech model for children, there is no need to collect data from several hundred children: an adult woman who is close at hand can be chosen as the standard speaker, and her speech data can be mapped onto the children's data already held in the database to create children's learning data for the new word, from which a speech model for the new word can then be created.

[0075] Thus, the present invention removes the conventional need to collect utterance data from several hundred speakers to build learning data for a new word's speech model, so a new speech model can be created in a short time and at low cost. As a result, when a user orders a speech model for a word that is not held in the database, it can be delivered quickly and inexpensively.

[Brief description of the drawings]

[FIG. 1] A flowchart illustrating the general processing of the present invention.

[FIG. 2] A diagram illustrating the conversion function from standard speaker data to learning speaker data in the first embodiment of the present invention.

[FIG. 3] A diagram illustrating the process of creating pseudo-learning word data in the first embodiment.

[FIG. 4] A block diagram showing an example of the device configuration of the first embodiment.

[FIG. 5] A diagram illustrating the code mapping from standard speaker data to learning speaker data in the second embodiment.

[FIG. 6] A diagram illustrating the process in the second embodiment of vector-quantizing the input data and mapping the resulting code vectors to a learning speaker codebook.

[FIG. 7] A block diagram showing an example of the device configuration of the second embodiment.

[Description of Reference Signs]
1: Voice input unit
2: A/D conversion unit
3: Frequency analysis unit
4: Pseudo-learning word data creation unit
5: Input data storage unit
6: Learning data creation unit
41: Standard speaker data storage unit
42: Learning speaker data storage unit
43: Data conversion processing unit
Vk1, Vk2, ...: Standard speaker feature vectors
Vt1, Vt2, ...: Learning speaker feature vectors
V1, V2, ...: Difference vectors
N: New word
Vkn1, Vkn2, ...: Feature vectors of the new word
Vtn1, Vtn2, ...: Pseudo-learning word data
Ck1, Ck2, ...: Standard speaker code vectors
Ct1, Ct2, ...: Learning speaker code vectors

Continued from the front page: (72) Inventor Hiroo Hasegawa, c/o Seiko Epson Corporation, 3-3-5 Owa, Suwa-shi, Nagano

Claims

[Claims]

1. A speech model learning data creation method for creating learning data used to train a speech model for speech recognition, wherein: of utterance data obtained from a large number of speakers and held in advance as a database, the utterance data of at least one speaker is used as standard speaker data and the rest is used as learning speaker data; a conversion function from the standard speaker data space to the learning speaker data space is created based on word data held in advance; and, when learning data for a new word is to be created, data obtained by the standard speaker uttering the new word is converted into the learning speaker data space using the conversion function to create the learning data for the new word.

2. The speech model learning data creation method according to claim 1, wherein the data in the standard speaker data space and the learning speaker data space are feature vectors for each word obtained by frequency analysis of a speech signal, and the process of converting the data obtained by the standard speaker uttering the new word into the learning speaker data space using the conversion function is performed using difference vectors between the feature vectors constituting each word in the standard speaker data space and the feature vectors constituting each word in the learning speaker data space.

3. The speech model learning data creation method according to claim 1, wherein the data in the standard speaker data space and the learning speaker data space are code data obtained by vector-quantizing the feature vectors for each word obtained by frequency analysis of a speech signal, and the process of converting the data obtained by the standard speaker uttering the new word into the learning speaker data space using the conversion function vector-quantizes the new word data in the standard speaker data space to obtain code data and maps that code data to the learning speaker data space, thereby converting the data from the standard speaker data space to the learning speaker data space.

4. A speech model learning data creation apparatus for creating learning data used to train a speech model for speech recognition, comprising: a pseudo-learning word data creation unit having a standard speaker data storage unit that stores, as standard speaker data, the utterance data of at least one speaker out of utterance data obtained from a large number of speakers and held in advance as a database, a learning speaker data storage unit that stores the utterance data of the speakers other than the standard speaker as learning speaker data, and a data conversion unit that converts data from the standard speaker data space to the learning speaker data space using a conversion function obtained in advance; and a learning data storage unit that stores the data created by the pseudo-learning word data creation unit, wherein, when learning data for a new word is to be created, data obtained by the standard speaker uttering the new word is converted into the learning speaker data space using the conversion function to create pseudo-learning word data for the new word, and learning data is created using the pseudo-learning word data.

5. The speech model learning data creation apparatus according to claim 4, wherein the standard speaker data stored in the standard speaker data storage unit and the learning speaker data stored in the learning speaker data storage unit are feature vectors for each word obtained by frequency analysis of a speech signal, and the process of converting the data obtained by the standard speaker uttering the new word into the learning speaker data space using the conversion function is performed using difference vectors between the feature vectors constituting each word in the standard speaker data space and the feature vectors constituting each word in the learning speaker data space.

6. The speech model learning data creation apparatus according to claim 4, wherein the standard speaker data stored in the standard speaker data storage unit and the learning speaker data stored in the learning speaker data storage unit are code data obtained by vector-quantizing the feature vectors for each word obtained by frequency analysis of a speech signal, and the process of converting the data obtained by the standard speaker uttering the new word into the learning speaker data space using the conversion function vector-quantizes the new word data in the standard speaker data space to obtain code data and maps that code data to the learning speaker data space, thereby converting the data from the standard speaker data space to the learning speaker data space.