JPH1195789A

JPH1195789A - Voice recognition system and speaker adaptive method in the same

Info

Publication number: JPH1195789A
Application number: JP9259844A
Authority: JP
Inventors: Nobuo Hataoka (畑岡 信夫); Yasunari Obuchi (大淵 康成); Toshiyuki Odaka (小高 俊之); Akio Amano (天野 明雄); Masakazu Ejiri (江尻 正員); Shinya Oba (大場 信弥)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-09-25
Filing date: 1997-09-25
Publication date: 1999-04-09

Abstract

PROBLEM TO BE SOLVED: To enable anyone to use a voice recognition system effectively by detecting physical feature information of a speaker, estimating the features of the speaker's voice from it, and performing voice recognition with an acoustic model suited to those features, thereby compensating for the degradation in recognition rate caused by dependence on the features of the user's voice. SOLUTION: A voice 100, or the measurement results of a mat type scale 160 and/or a seat position detector 170, are used as input information. From the voice 100, the features of the user's voice are extracted in a voice personal information extracting part 310. When the measurements of the mat type scale 160 and/or the seat position detector 170 are used as the input information, they are first converted into weight or stature information in a physical feature converting part 320 and then converted into features of the user's voice in a voice feature conversion part 330. Based on the personal information obtained in this manner, speaker adaptation is executed in a speaker adaptation part 350 for the acoustic model so that the acoustic model used in voice recognition becomes one suited to the user.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、音声認識システム
に関し、特に、話者に適合して認識率を向上させること
ができる話者適応方法および装置に関する。1. Field of the Invention The present invention relates to a speech recognition system, and more particularly to a speaker adaptation method and apparatus capable of improving the recognition rate by adapting to the speaker.

【０００２】[0002]

【従来の技術】図１は、本発明が生まれるに至った、従
来の話者適応の概念を示す図である。従来の話者適応
は、入力音声１００を用いて、その話者に合った音響モ
デルを適応する処理３００を行い、従来の音声認識処理
５００を行い、認識結果９９９を出力する。音響モデル
適応処理３００は、個人性の情報を抽出する処理３１０
とその結果から既存の音響モデルを適応する処理３５０
とから構成される。2. Description of the Related Art FIG. 1 is a diagram showing the conventional concept of speaker adaptation that led to the present invention. In conventional speaker adaptation, the input speech 100 is used in a process 300 that adapts an acoustic model to suit the speaker, after which a conventional speech recognition process 500 is performed and a recognition result 999 is output. The acoustic model adaptation process 300 consists of a process 310 that extracts personality information and a process 350 that adapts an existing acoustic model based on the result.

【０００３】[0003]

【発明が解決しようとする課題】従来の音声認識システ
ムでは、依然、認識率の話者依存性が大きく、かつ話者
適応を行う場合もどのような特徴をもって話者適応を行
うかが明確でないという問題があった。In conventional speech recognition systems, the recognition rate still depends heavily on the speaker, and even when speaker adaptation is performed, it is not clear which features should be used for the adaptation.

【０００４】本発明の目的は、より確実な話者適応によ
り、使用者の音声の特徴に依存して起こる認識率の劣化
を補償して、誰でもが音声認識システムを有効に使用で
きる話者方法及び装置を提供することにある。SUMMARY OF THE INVENTION It is an object of the present invention to provide a speaker adaptation method and apparatus that, through more reliable speaker adaptation, compensate for the degradation in recognition rate that depends on the characteristics of a user's voice, so that anyone can use a speech recognition system effectively.

【０００５】[0005]

【課題を解決するための手段】上記目的を達成するため
に、本発明による音声認識システムにおける話者適応方
法では、話者の身体的特徴情報を検出し、該身体的特徴
情報に応じて当該話者の音声の特徴を推測し、該推測し
た音声の特徴に適した音声認識システムの音響モデルを
用いて音声認識を行うことを特徴とする。In order to achieve the above object, a speaker adaptation method in a speech recognition system according to the present invention is characterized by detecting physical characteristic information of a speaker, estimating the characteristics of the speaker's voice in accordance with the physical characteristic information, and performing speech recognition using an acoustic model of the speech recognition system suited to the estimated voice characteristics.

【０００６】身体的特徴情報とは、例えば、話者の体重
情報および身長情報の少なくとも一方である。The physical characteristic information is, for example, at least one of the speaker's weight information and height information.

【０００７】このような身体的特徴は、音声発声器官の
大小、長短の尺度と基本的には比例的な関係があり、こ
れらの情報から、音声発声器官の特徴を推測することが
可能である。Such physical characteristics are basically proportional to the size and length of the vocal organs, so the characteristics of the vocal organs can be inferred from this information.

【０００８】本発明によって、従来のように単に話者の
音声による話者適応では行えないような、より適切な話
者適応を行うことが可能になり、その結果として、任意
の話者の音声に対する認識率を向上させることができ
る。According to the present invention, it becomes possible to perform more appropriate speaker adaptation than can be achieved by conventional speaker adaptation based solely on the speaker's speech, and as a result the recognition rate for the speech of an arbitrary speaker can be improved.

【０００９】このような方法を実現するシステムとして
の、本発明による音声認識機能を搭載した音声認識シス
テムは、話者の身体的特徴情報を取得する手段と、話者
の身体的特徴情報と音声の特徴とを対応づけた対応テー
ブルと、前記取得された身体的特徴情報を前記対応テー
ブルに照らして前記話者の音声の特徴を抽出する手段
と、音声認識のための複数の音響モデルと、前記抽出さ
れた音声の特徴に適合した音響モデルを前記複数の音響
モデルの中から選択または変換する音響モデル適応手段
と、該音響モデル適応手段により得られた音響モデルを
用いて話者の音声認識処理を行う音声認識処理手段とを
備えたことを特徴とする。A speech recognition system with a speech recognition function according to the present invention, as a system realizing such a method, is characterized by comprising: means for acquiring physical characteristic information of a speaker; a correspondence table that associates physical characteristic information of speakers with voice characteristics; means for extracting the characteristics of the speaker's voice by checking the acquired physical characteristic information against the correspondence table; a plurality of acoustic models for speech recognition; acoustic model adaptation means for selecting from among the plurality of acoustic models, or obtaining by conversion, an acoustic model suited to the extracted voice characteristics; and speech recognition processing means for performing speech recognition on the speaker's speech using the acoustic model obtained by the acoustic model adaptation means.

【００１０】前記音声認識システムがカーナビゲーショ
ン装置に搭載される場合、前記体重情報を取得する手段
は話者の座席に配置されるマット型体重計を有し、前記
身長情報を取得する手段は話者の座席の位置を検出する
シート位置検出計を有する。When the speech recognition system is mounted on a car navigation device, the means for acquiring the weight information includes a mat-type weight scale placed on the speaker's seat, and the means for acquiring the height information includes a seat position detector that detects the position of the speaker's seat.

【００１１】カーナビゲーションのように身長情報の取
得のためにシート位置検出計を利用できない場合、前記
身長情報を取得する手段としては話者の画像を取得する
ビデオカメラおよび当該画像の処理により身長情報を得
る処理手段により構成することができる。When, unlike in car navigation, a seat position detector cannot be used to obtain height information, the means for acquiring the height information can consist of a video camera that captures an image of the speaker and processing means that obtains height information by processing the image.

【００１２】身体的特徴に加えて、話者による予め指定
された適応語の発声内容から話者の音声の特徴を識別
し、該音声の特徴と、前記話者の身体的特徴情報に基づ
いて得られた音声の特徴とを組合せて用いてもよい。こ
の場合、前記音響モデル適応手段は、当該組み合わせた
音声の特徴に基づいて前記音響モデルの選択または変換
を行う。In addition to the physical characteristics, the characteristics of the speaker's voice may be identified from the speaker's utterance of adaptation words specified in advance, and these voice characteristics may be used in combination with the voice characteristics obtained from the speaker's physical characteristic information. In this case, the acoustic model adaptation means selects or converts the acoustic model based on the combined voice characteristics.

【００１３】前記音響モデルは、少なくとも男性用、女
性用の２種類が設けられる。さらには、予め身体的特徴
ごとに別個の音響モデルを設けてもよい。At least two types of acoustic model are provided, one for men and one for women. Furthermore, a separate acoustic model may be provided in advance for each physical characteristic.

【００１４】話者の音声認識に先立ち前記音響モデルの
話者適応を適用するか否かをユーザが選択する選択手段
を設ければ、必要な場合にだけ話者適応を行うことが可
能となる。If selection means is provided that lets the user choose whether or not to apply speaker adaptation of the acoustic model prior to recognition of the speaker's speech, speaker adaptation can be performed only when necessary.

【００１５】[0015]

【発明の実施の形態】以下、実施の形態を詳細に説明す
る。Embodiments of the present invention will be described below in detail.

【００１６】図２は、本発明である話者適応の方式を示
す処理の概念図である。話者適応処理３００は、図１で
説明した従来の音声の特徴からの話者適応処理の他に、
音声を特徴付ける話者の身長や体重などの身体的な特徴
情報から、音声の個人的な特徴情報を抽出して、既存の
音響モデルを適応する方式となっている。FIG. 2 is a conceptual diagram of processing showing the speaker adaptation method of the present invention. In addition to the conventional speaker adaptation based on voice features described with reference to FIG. 1, the speaker adaptation process 300 extracts personal characteristic information of the voice from physical characteristic information, such as the speaker's height and weight, that characterizes the voice, and adapts an existing acoustic model accordingly.

【００１７】具体的には、音声１００、あるいはマット
型体重計１６０、及び／またはシート位置検出計１７０
の測定結果を入力情報として、音声１００からは音声の
個人性情報抽出部３１０において、使用者の音声の特徴
が抽出される。また、マット型体重計１６０、及び／ま
たはシート位置検出計１７０を入力情報とした場合は、
まず身体的な特徴変換部３２０において、体重、あるい
は身長の情報へ変換され、音声特徴変換部３３０にて、
その使用者の音声の特徴へ変換される。以上の手段にて
抽出された音声の個人性情報をもとに、音響モデルの話
者適応部３５０にて、認識に使用される音響モデルが使
用者に適した音響モデルとなるように話者適応が実行さ
れる。その後、通常の音声認識５００が実行され、認識
結果９９９が出力される。Specifically, the speech 100, or the measurement results of the mat-type weight scale 160 and/or the seat position detector 170, are used as input information. From the speech 100, the voice personality information extraction unit 310 extracts the characteristics of the user's voice. When the mat-type weight scale 160 and/or the seat position detector 170 provide the input information, the measurements are first converted into weight or height information in the physical feature conversion unit 320 and then converted into characteristics of the user's voice in the voice feature conversion unit 330. Based on the voice personality information extracted by these means, the acoustic model speaker adaptation unit 350 performs speaker adaptation so that the acoustic model used for recognition becomes one suited to the user. Thereafter, normal speech recognition 500 is performed, and a recognition result 999 is output.

【００１８】図３は、本発明の主点である、話者の身体
的な特徴を検出する手段の一例を示した図である。本実
施の形態では、カーナビゲーション装置への音声認識の
応用を取り上げて説明している。FIG. 3 is a diagram showing an example of means for detecting the physical characteristics of a speaker, which is the main point of the present invention. The present embodiment is described using the application of speech recognition to a car navigation device as an example.

【００１９】まず、カーナビゲーション装置を搭載した
車輌１０００において、図に示すように座席シート１５
０の下に、マット型体重計１６０が備え付けられてお
り、座席使用者、すなわち通常は運転者の体重が自動的
に検出される。また、運転者が座席の位置合わせをする
ことにより、備え付けられたシート位置検出計１７０か
ら運転者のおおよその身長が検出されることになる。こ
れらの体重と身長の尺度は、音声発声器官の大小、長短
の尺度と基本的には比例的な関係があり、体重や身長か
ら、音声発声器官の特徴を推測することが可能である。First, in a vehicle 1000 equipped with a car navigation device, a mat-type weight scale 160 is installed under the seat 150 as shown in the figure, and the weight of the seat occupant, normally the driver, is detected automatically. In addition, when the driver adjusts the seat position, the driver's approximate height is detected by the installed seat position detector 170. These measures of weight and height are basically proportional to the size and length of the vocal organs, so the characteristics of the vocal organs can be inferred from weight and height.

【００２０】図４は、本発明の構成の一例を詳細に示す
ブロック図である。音声信号１００を入力として、音声
入力部２００のＬＰＦ（Low Pass Filter）２０１０と
Ａ／Ｄコンバータ２０２０にて、音声信号のサンプリン
グが行われ、アナログの音声情報がデジタルの音声波形
情報へと変換される。その後、音声分析部２１０にて、
音声の特徴パラメータが抽出される。音声パラメータに
関しては、例えば線形予測分析により求まるＬＰＣケプ
ストラムなどがある。詳細は、例えば、文献「音声情報
処理の基礎」（斉藤収三、中田和男共著、オーム社）を
参照されたい。その後、個人性情報抽出部３１０にて、
音声の個人的な情報（個人性情報）が抽出される。本実
施の形態ではこの個人性情報は、声の高低の情報を規定
している基本周波数（ピッチ）情報である。この音声信
号１００に基づいて得られた個人性情報は、後述する身
体的な特徴から求められた個人性情報との併合処理３４
０により、組み合わされる。この併合処理３４０では、
例えば音声から得た個人性情報と身体的特徴から得た個
人性情報とを、ある重み付けで平均化する、等の方法を
採用する。または、ある変換テーブルを用いて２つの情
報を併合する。具体的には、ピッチ情報でみた時は併合
された結果のピッチ情報の値となる。FIG. 4 is a block diagram showing an example of the configuration of the present invention in detail. The speech signal 100 is input and sampled by the LPF (Low Pass Filter) 2010 and the A/D converter 2020 of the speech input unit 200, converting the analog speech information into digital speech waveform information. The speech analysis unit 210 then extracts feature parameters of the speech. An example of such speech parameters is the LPC cepstrum obtained by linear predictive analysis. For details, see, for example, "Basics of Speech Information Processing" (Shozo Saito and Kazuo Nakata, Ohmsha). Next, the personality information extraction unit 310 extracts personal information of the voice (personality information). In the present embodiment, this personality information is fundamental frequency (pitch) information, which determines how high or low the voice is. The personality information obtained from the speech signal 100 is combined, by the merging process 340, with the personality information obtained from physical characteristics as described later. The merging process 340 uses a method such as averaging the personality information obtained from the speech and the personality information obtained from the physical characteristics with certain weights. Alternatively, the two pieces of information are merged using a conversion table. Concretely, in terms of pitch information, the result is the merged pitch value.
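
As an illustration only, the weighted averaging mentioned for the merging process 340 might be sketched as follows. The function name `merge_pitch`, the parameter `voice_weight`, and its default of 0.5 are assumptions introduced here; the specification says only that the two pieces of personality information are averaged "with certain weights" and gives no values.

```python
def merge_pitch(voice_pitch_hz, body_pitch_hz, voice_weight=0.5):
    """Weighted average of two pitch estimates in Hz.

    voice_pitch_hz: pitch extracted from the speech signal (unit 310).
    body_pitch_hz:  pitch inferred from weight/height (unit 330).
    voice_weight:   hypothetical weight given to the speech-based estimate.
    """
    if not 0.0 <= voice_weight <= 1.0:
        raise ValueError("voice_weight must be in [0, 1]")
    return voice_weight * voice_pitch_hz + (1.0 - voice_weight) * body_pitch_hz


# Trust the measured speech three times as much as the body-based estimate.
merged = merge_pitch(120.0, 100.0, voice_weight=0.75)  # 115.0
```

A weight of 1.0 reduces this to conventional speech-only adaptation, and a weight of 0.0 uses the physical characteristics alone.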

【００２１】身体的な情報からの個人性情報の求め方と
しては、本発明では体重と身長とを取り上げた。すなわ
ち、体重情報１１０はマット型体重計１６０にて体重の
尺度として計量され、シート位置信号１２０は、シート
検出計１７０にて身長の尺度として計量される。その
後、体重や身長と音声との相関的な関係を利用して、音
声特徴変換部３３０にて、音声の個人性情報へと変換さ
れる。この個人性情報としては、認識に用いる音響モデ
ルの分布（特徴パラメータの値の分布）や、あるいは単
純には声の高低の情報を規定しているピッチ情報などが
ある。その後、すでに説明したように、音声から求めら
れた個人性情報との併合処理３４０にて、併合が行わ
れ、音響モデル適応部３５０にて、音響モデルの適応が
行われる。さらに音声認識処理部５００にて、音声認識
処理が実行され、認識結果９９９が出力される。音響モ
デルの制約（選択）、変換と音声認識の実行の際には、
音響モデル４００が読み込まれ、音響モデルの変換では
音響モデルの修正、音声認識では利用が行われる。As the way of obtaining personality information from physical information, the present invention takes up weight and height. That is, the weight information 110 is measured by the mat-type weight scale 160 as a measure of weight, and the seat position signal 120 is measured by the seat position detector 170 as a measure of height. Then, using the correlation between weight or height and the voice, the voice feature conversion unit 330 converts these into voice personality information. This personality information may be, for example, the distribution of the acoustic models used for recognition (the distribution of feature parameter values) or, more simply, pitch information, which determines how high or low the voice is. Thereafter, as already described, the merging process 340 merges it with the personality information obtained from the speech, and the acoustic model adaptation unit 350 adapts the acoustic model. The speech recognition processing unit 500 then executes speech recognition and outputs the recognition result 999. For constraining (selecting) or converting the acoustic model and for executing speech recognition, the acoustic models 400 are read in; conversion modifies the acoustic model, and speech recognition uses it.

【００２２】図５は、本発明による音響モデル適応部３
５０の一例を示すブロック図である。併合処理３４０に
て併合が行われ、個人性情報を入力として音声特徴比較
部３５１０にて、既に格納されている標準的な複数の音
響モデル４００と該個人性情報との比較が行われ、次の
適応判定部３５２０にて、該音響モデルの選択、変換な
どの音響モデル適応実行部３５３０の具体的な処理の判
定が決定される。本実施の形態では、適応処理として
は、例えば、音響モデル制約部３５３１あるいは音響モ
デル変換部３５３２がある。音響モデル制約部３５３１
では、複数格納されている音響モデルの中から、使用者
の個人性情報から得られた音声の特徴に類似している音
響モデルのセットが選択されて、設定される。音響モデ
ル変換部３５３２では、使用者の音声の特徴をもとに、
既存の音響モデルが変換されて、新しい音響モデルに設
定される。新しく設定された音響モデルは、音響モデル
格納部４００に新たに格納されて、次の音声認識処理部
にて使用される。FIG. 5 is a block diagram showing an example of the acoustic model adaptation unit 350 according to the present invention. The personality information merged in the merging process 340 is input to the voice feature comparison unit 3510, where it is compared with the plurality of standard acoustic models 400 already stored. The subsequent adaptation determination unit 3520 then decides which specific processing, such as selection or conversion of the acoustic model, the acoustic model adaptation execution unit 3530 is to perform. In the present embodiment, the adaptation processing is performed by, for example, the acoustic model constraint unit 3531 or the acoustic model conversion unit 3532. The acoustic model constraint unit 3531 selects and sets, from among the plurality of stored acoustic models, the set of acoustic models that resembles the voice characteristics obtained from the user's personality information. The acoustic model conversion unit 3532 converts an existing acoustic model based on the characteristics of the user's voice and sets the result as a new acoustic model. The newly set acoustic model is stored in the acoustic model storage unit 400 and used by the subsequent speech recognition processing unit.
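
The selection performed by the acoustic model constraint unit 3531 could be sketched, under simplifying assumptions, as a nearest-match search. Here each stored acoustic model is summarized by a single reference pitch, and similarity is the absolute pitch difference; both are assumptions made for illustration, since the specification does not state how similarity is measured. The model names and pitch values are hypothetical.

```python
def select_acoustic_model(stored_models, user_pitch_hz):
    """Return the name of the stored model whose reference pitch is closest.

    stored_models: dict mapping a model name to a reference pitch in Hz
                   (a hypothetical one-number summary of each model).
    user_pitch_hz: the merged personality information for the current user.
    """
    if not stored_models:
        raise ValueError("no stored acoustic models")
    return min(stored_models,
               key=lambda name: abs(stored_models[name] - user_pitch_hz))


# Hypothetical model inventory; the values are placeholders.
models = {"male_low": 100.0, "male_high": 140.0, "female": 220.0}
chosen = select_acoustic_model(models, 150.0)  # "male_high"
```

The conversion path of unit 3532 is not sketched, because the specification does not describe the transformation it applies.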

【００２３】なお、音響モデル制約部３５３１と音響モ
デル変換部３５３２の使い分けは例えば次のように行わ
れる。すなわち、複数の音響モデルを記録しておけるメ
モリ規模が十分にある装置では、数多くの使用者に対応
した数多くの音響モデルを記録しておき、音響モデル制
約部により、実際の使用者に合致した音響モデルが決定
される。十分な記録メモリがない装置においては、記録
されている音響モデルを変換処理して、使用者の音声に
合致した音響モデルを算出する。The acoustic model constraint unit 3531 and the acoustic model conversion unit 3532 are used selectively, for example, as follows. In a device with enough memory to store multiple acoustic models, many acoustic models corresponding to many users are stored, and the acoustic model constraint unit determines the acoustic model that matches the actual user. In a device without enough storage memory, a stored acoustic model is converted to compute an acoustic model that matches the user's voice.

【００２４】図６は、音声認識処理部５００の一例を示
すブロック図である。本実施の形態では、音声認識処理
部の例としては、連続型ヒドン・マルコフ・モデル（Hi
ddenMarkov Models）を使用した場合を考える。連続型
ヒドン・マルコフ・モデルに関しては、文献「確率モデ
ルによる音声認識」（中川聖一著、（社）電子情報通信
学会編）に詳細に説明されている。図４にて説明したよ
うに、入力音声信号をサンプリングして得られた音声の
特徴パターンを入力として、本実施の形態により話者に
適応された音響モデル４００を用いて、ヒドン・マルコ
フ・モデルによる音響照合が実行される。すなわち、確
率分布計算部５１０にて入力音声の特徴パターンに対し
て、音響モデルの分布確率が計算される。更に、確率累
積部５２０にて、単語辞書６００に記述された単語の系
列に対応した音響モデルの確率分布が累積されて、単語
辞書の各エントリィの累積確率が求まる。その後、判定
部５３０にて、確率がもっとも高い単語エントリィが認
識結果９９９として出力される。本実施の形態では、単
語認識を例にして説明したが、例えば文節認識や文章認
識も、単語辞書を文節や文章の文字系列とすることで、
単語認識と同様にして実現される。FIG. 6 is a block diagram showing an example of the speech recognition processing unit 500. In the present embodiment, the case of using continuous Hidden Markov Models is considered as an example of the speech recognition processing unit. Continuous Hidden Markov Models are described in detail in "Speech Recognition by Probabilistic Models" (Seiichi Nakagawa, edited by the Institute of Electronics, Information and Communication Engineers). As described with reference to FIG. 4, the speech feature pattern obtained by sampling the input speech signal is taken as input, and acoustic matching by Hidden Markov Models is executed using the acoustic model 400 adapted to the speaker according to the present embodiment. That is, the probability distribution calculation unit 510 calculates the probability of the feature pattern of the input speech under the distributions of the acoustic models. The probability accumulation unit 520 then accumulates the probabilities of the acoustic models corresponding to the sequence making up each word described in the word dictionary 600, giving the cumulative probability of each entry in the word dictionary. Finally, the determination unit 530 outputs the word entry with the highest probability as the recognition result 999. Although the present embodiment has been described using word recognition as an example, phrase recognition and sentence recognition, for example, can be realized in the same way as word recognition by making the word dictionary hold character sequences of phrases or sentences.
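
The three stages described above (probability calculation 510, accumulation 520, decision 530) can be illustrated with a toy sketch. It makes several assumptions that are not in the specification: features are one-dimensional, each state has a single Gaussian, every word is a left-to-right model with fixed 0.5 transition probabilities, and accumulation follows the best path (a Viterbi-style approximation) in the log domain. All names and numbers are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian (stands in for calculation unit 510)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def word_log_score(frames, states):
    """Best-path log score of a left-to-right HMM (accumulation unit 520).

    states: list of (mean, var) pairs, one per state, in order.
    Toy assumption: staying and advancing both have probability 0.5.
    """
    neg_inf = float("-inf")
    if not frames:
        return neg_inf
    log_t = math.log(0.5)
    score = [neg_inf] * len(states)
    score[0] = log_gauss(frames[0], *states[0])  # must start in state 0
    for x in frames[1:]:
        new = [neg_inf] * len(states)
        for j, (mean, var) in enumerate(states):
            prev = max(score[j], score[j - 1] if j > 0 else neg_inf)
            if prev > neg_inf:
                new[j] = prev + log_t + log_gauss(x, mean, var)
        score = new
    return score[-1]  # must end in the final state


def recognize(frames, dictionary):
    """Return the best-scoring dictionary entry (decision unit 530)."""
    scores = {w: word_log_score(frames, s) for w, s in dictionary.items()}
    return max(scores, key=scores.get), scores


# Two hypothetical word models whose state means run in opposite order.
dictionary = {"hai": [(1.0, 0.25), (3.0, 0.25)],
              "iie": [(3.0, 0.25), (1.0, 0.25)]}
word, scores = recognize([1.1, 0.9, 3.2, 2.9], dictionary)  # word == "hai"
```

Speaker adaptation, in this picture, changes the `(mean, var)` entries before `recognize` is called; the scoring itself is unchanged.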

【００２５】なお、一旦話者適応された後は、基本的に
は話者が変わらない限り話者適応を行う必要はないの
で、個人性情報の抽出および音響モデル適用の処理の実
行は不要である。Once speaker adaptation has been performed, there is basically no need to perform it again unless the speaker changes, so the processes of extracting personality information and adapting the acoustic model need not be executed again.

【００２６】図７は、本実施の形態による身長・体重等
検出部３２０及び音声特徴変換部３３０の一例を示す図
である。体重信号（体重値）１１０及びシート位置信号
１２０は、それぞれ、身長・体重等検出部３２０のマッ
ト型体重計１６０とシート位置検出計１７０にて生成さ
れる。シート位置信号１２０は身長変換部１８０にて身
長値情報１３０へ変換される。その後、これら２つの情
報は、音声特徴変換部３３０にて音声の特徴パターンに
変換される。具体的には、音声特徴比較部３３１０に
て、音声特徴と体重・身長の値との対応関係を記述した
音声特徴対応データ３３２０を使って、体重値と身長値
にもっとも対応した音声特徴が男女別に抽出され、これ
に対応して音声特徴変換パラメータ算出部３３３０にて
音声特徴パラメータが求められることになる。FIG. 7 is a diagram showing an example of the height/weight detection unit 320 and the voice feature conversion unit 330 according to the present embodiment. The weight signal (weight value) 110 and the seat position signal 120 are generated by the mat-type weight scale 160 and the seat position detector 170, respectively, of the height/weight detection unit 320. The seat position signal 120 is converted into height value information 130 by the height conversion unit 180. These two pieces of information are then converted into a voice feature pattern by the voice feature conversion unit 330. Specifically, the voice feature comparison unit 3310 uses the voice feature correspondence data 3320, which describes the correspondence between voice features and weight and height values, to extract, separately for men and women, the voice features that best correspond to the weight value and the height value, and the voice feature conversion parameter calculation unit 3330 then obtains the corresponding voice feature parameters.

【００２７】音声特徴対応データ３３２０としては、例
えば、図１２に示すように、男女別に、それぞれ体重と
身長の区分ごとに対応する音響特徴としての基本周波数
（ピッチ）範囲を定めたテーブルが予め設けられてい
る。As the voice feature correspondence data 3320, for example, as shown in FIG. 12, a table is provided in advance that defines, separately for men and women, the fundamental frequency (pitch) range serving as the acoustic feature for each weight and height category.
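
A lookup against such a table might look like the sketch below. FIG. 12 is not reproduced in this text, so every category boundary and pitch range here is an invented placeholder, and the row layout and the name `lookup_pitch_ranges` are assumptions. Only the overall shape (sex, height and weight category, pitch range) follows the description.

```python
INF = float("inf")

# Hypothetical stand-in for the table of FIG. 12. All numbers are placeholders.
# Row: (sex, max_height_cm, max_weight_kg, (pitch_low_hz, pitch_high_hz))
PITCH_TABLE = [
    ("male", 165, 60, (120, 160)),
    ("male", 165, INF, (110, 150)),
    ("male", INF, 75, (100, 140)),
    ("male", INF, INF, (90, 130)),
    ("female", 155, 50, (230, 280)),
    ("female", 155, INF, (220, 270)),
    ("female", INF, 60, (200, 250)),
    ("female", INF, INF, (190, 240)),
]


def lookup_pitch_ranges(height_cm, weight_kg):
    """Return {sex: (low_hz, high_hz)} using the first matching row per sex."""
    result = {}
    for sex, max_h, max_w, pitch_range in PITCH_TABLE:
        if sex not in result and height_cm <= max_h and weight_kg <= max_w:
            result[sex] = pitch_range
    return result


ranges = lookup_pitch_ranges(170, 70)
# ranges == {"male": (100, 140), "female": (190, 240)}
```

Returning one range per sex mirrors the statement that voice features are extracted separately for men and women; deciding between the two is left to a later stage.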

【００２８】音声特徴パラメータとしては、例えば特定
の音節、音韻などの音響モデルに対応する特徴パラメー
タであり、この情報をもとに音響モデル全体の分布を音
声認識使用者の音声に適応した音響モデルへと適応させ
ることになる。なお、この場合、男声と女声用に別々の
音響モデルを用いて音声認識が行われ、結果の良好な方
が選ばれることになる。The voice feature parameters are, for example, feature parameters corresponding to acoustic models of specific syllables, phonemes, and the like; based on this information, the distribution of the acoustic model as a whole is adapted into an acoustic model suited to the voice of the speech recognition user. In this case, speech recognition is performed using separate acoustic models for male and female voices, and the one giving the better result is chosen.
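
Running recognition with both models and keeping the better outcome could be sketched as follows. The specification does not define "better"; this sketch assumes it means the higher recognition score. The `recognize` callable and the toy models are hypothetical stand-ins for the speech recognition processing unit 500 and the adapted acoustic models.

```python
def recognize_with_best_model(features, models, recognize):
    """Try every model and keep the highest-scoring result.

    models:    dict mapping a label (e.g. "male", "female") to a model object.
    recognize: callable (features, model) -> (word, score); higher is better.
    Returns (label, word, score) for the winning model.
    """
    best = None
    for label, model in models.items():
        word, score = recognize(features, model)
        if best is None or score > best[2]:
            best = (label, word, score)
    return best


# Toy recognizer: scores a model by how close its mean is to the input mean.
def toy_recognize(features, model):
    avg = sum(features) / len(features)
    return model["word"], -abs(avg - model["mean"])


toy_models = {"male": {"word": "migi", "mean": 1.0},
              "female": {"word": "hidari", "mean": 3.0}}
label, word, score = recognize_with_best_model([2.8, 3.1], toy_models, toy_recognize)
# label == "female", word == "hidari"
```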

【００２９】図８は、本実施の形態をカーナビゲーショ
ンへ応用した時の概念を示す図である。カーナビゲーシ
ョン装置５０００は、表示部５０１０、スピーカ５０２
０、イァフォンジャク５０３０、セレクタ５０４０、及
びマイクロホン５０５０で少なくとも構成されている。
話者適応の概念は、マイクロホン５０５０から入力され
たカーナビゲーション装置の使用者の音声をもとに既存
の音響モデルを適応することと、車の運転席などに備え
付けられた体重マットやシート位置検出計の入力をもと
に音響モデルが適応されることになる。使用者の音声を
もとに既存の音響モデルを適応する場合は、例えば「お
はよう」とか「今日は」などのような挨拶の音声を入力
として、その話者の音声に近くなるように音響モデルを
適応化することが考えられる。スピーカ５０２０やイァ
フォンジャク５０３０は、システムからの応答音声を出
力する装置として働く。さらに、セレクタ５０４０は話
者の音声認識に先立って話者適応手段を実行するかどう
かの設定を行うために設けられている。FIG. 8 is a diagram showing the concept of applying the present embodiment to car navigation. The car navigation device 5000 comprises at least a display unit 5010, a speaker 5020, an earphone jack 5030, a selector 5040, and a microphone 5050. Speaker adaptation here means both adapting an existing acoustic model based on the voice of the car navigation device's user input from the microphone 5050, and adapting the acoustic model based on input from the weight mat and the seat position detector installed in the driver's seat or elsewhere in the car. When an existing acoustic model is adapted based on the user's voice, a spoken greeting such as "ohayo" (good morning) or "konnichiwa" (hello) can be used as input, and the acoustic model adapted so that it comes close to that speaker's voice. The speaker 5020 and the earphone jack 5030 serve as devices that output the system's spoken responses. The selector 5040 is provided for setting whether or not speaker adaptation is to be executed prior to recognition of the speaker's speech.

【００３０】図９は、音声認識における話者依存を示す
音声認識性能評価結果の一例を示す図である。認識率
は、話者一律に同じ値ではなく、通常は発声の仕方や、
音声の訛りなどのくせにより、話者に依存して異なった
性能値を示す。図９では、認識率は８０％以下から１０
０％近くまで分布しており、９０％近辺でもっとも話者
が多い結果となっていることがわかる。本実施の形態
は、既存の音響モデル等の話者適応を効率良く行い、認
識率の悪い話者に対しても認識理性能の向上を実現する
ことができる。FIG. 9 is a diagram showing an example of speech recognition performance evaluation results that illustrate speaker dependence in speech recognition. The recognition rate is not the same for every speaker; it normally varies from speaker to speaker because of habits such as manner of speaking and accent. In FIG. 9, the recognition rate is distributed from below 80% to nearly 100%, with the largest number of speakers around 90%. The present embodiment can efficiently perform speaker adaptation of existing acoustic models and the like, and can improve recognition performance even for speakers with a poor recognition rate.

【００３１】図１０は、本実施の形態で使用する話者適
応方式の一例を示す概念図である。既存のＨＭＭの音響
モデルの特徴空間６１００が図に示したような分布とな
っているとき、本実施の形態による適応化によって、特
定話者の音響モデル空間６２００へと変換適応される。
本実施の形態では、音響モデルの音声単位を音韻よりも
短い音素片６０００としている。本実施の形態では、こ
のような音素片６０００などを表現した音響モデルの話
者適応を行う。FIG. 10 is a conceptual diagram showing an example of the speaker adaptation method used in the present embodiment. When the feature space 6100 of the existing HMM acoustic models has the distribution shown in the figure, adaptation according to the present embodiment converts it to the acoustic model space 6200 of a specific speaker. In the present embodiment, the speech unit of the acoustic model is the phoneme segment 6000, which is shorter than a phoneme. The present embodiment performs speaker adaptation of acoustic models that represent such phoneme segments 6000 and the like.

【００３２】図１１は、話者適応を施した結果として、
認識性能向上の一例を示す概念図である。本例では、予
め決められた単語の発声を条件として音響モデルを話者
適応する例を挙げている。横軸は適応語数を示し、適応
単語が増えるに従って、平均認識率と、最も認識率が悪
い最下位話者に対する認識率も向上することがわかる。
とくに、適応単語数がさらに増加すると、最下位話者の
認識率が平均の認識率へ近づき、話者適応の効果が大き
いことがわかる。本例は、実際の話者適応の結果得られ
た評価結果である。本発明では、これに対してさらに体
重および身長等の個人性情報を考慮した話者適応を行う
ことによりさらに認識率を向上させることができる。FIG. 11 is a conceptual diagram showing an example of the improvement in recognition performance resulting from speaker adaptation. This example adapts the acoustic model to the speaker on the condition that predetermined words are uttered. The horizontal axis shows the number of adaptation words; as the number of adaptation words increases, both the average recognition rate and the recognition rate for the lowest-ranked speaker, the one with the worst recognition rate, improve. In particular, as the number of adaptation words increases further, the recognition rate of the lowest-ranked speaker approaches the average recognition rate, showing that speaker adaptation is highly effective. These are evaluation results obtained from actual speaker adaptation. In the present invention, the recognition rate can be improved further by additionally performing speaker adaptation that takes into account personality information such as weight and height.

【００３３】図４では個人性情報抽出部３１０でピッチ
情報を抽出する例を説明したが、ピッチ情報に加えて、
またはピッチ情報に代えて、性別を抽出するようにして
もよい。この変形例について図１３および図１４により
説明する。FIG. 4 illustrated an example in which the personality information extraction unit 310 extracts pitch information, but sex may be extracted in addition to, or instead of, pitch information. This modification will be described with reference to FIGS. 13 and 14.

【００３４】図１３（ａ）（ｂ）は、それぞれ母音／ａ
／と／ｉ／のフォルマントを模式的に表した周波数分析
結果を示すグラフである。フォルマントとは、周波数軸
上での共振周波数スペクトルの山（ピーク）のことであ
る。図１３では、周波数の低い側から第１フォルマント
（Ｆ１）、第２フォルマント（Ｆ２）、第３フォルマン
ト（Ｆ３）としている。このフォルマントおよびスペク
トル全体の形状や傾斜は、性別によって異なり、また個
人個人によっても異なる。FIGS. 13(a) and 13(b) are graphs showing frequency analysis results that schematically represent the formants of the vowels /a/ and /i/, respectively. A formant is a peak of the resonance frequency spectrum on the frequency axis. In FIG. 13, the formants are labeled, from the low-frequency side, the first formant (F1), the second formant (F2), and the third formant (F3). These formants, and the shape and slope of the spectrum as a whole, differ by sex and also from individual to individual.

【００３５】図１４に、第１フォルマントを横軸、第２
フォルマントを縦軸にとったＦ１−Ｆ２平面上での５母
音の分布の模式図を示す。各母音のＦ１−Ｆ２プロット
点は個人個人で異なる。また一般に、各母音についての
プロット点は、概ね図１４の長方形の領域で示すような
範囲内に包含される。通常、成人男声は成人女声に比べ
てそのＦ１，Ｆ２はともに、より低い側にあることが知
られている。このことから、図４の個人性情報抽出部３
１０で入力音声が男声が女声かを推測することができ
る。したがって、この例では、音響モデル適応部３５０
で入力音声の判定結果に基づいて男声または女声用の音
響モデルを選択することができる。FIG. 14 is a schematic diagram of the distribution of the five vowels on the F1-F2 plane, with the first formant on the horizontal axis and the second formant on the vertical axis. The F1-F2 plot point of each vowel differs from individual to individual. In general, the plot points for each vowel fall roughly within the range indicated by the rectangular regions in FIG. 14. It is known that F1 and F2 of adult male voices are normally both lower than those of adult female voices. From this, the personality information extraction unit 310 in FIG. 4 can estimate whether the input speech is a male voice or a female voice. In this example, therefore, the acoustic model adaptation unit 350 can select the acoustic model for male voices or for female voices based on this determination for the input speech.
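
One way to turn the F1/F2 observation into a decision is a nearest-reference comparison on the F1-F2 plane, sketched below. The reference formant values are illustrative placeholders and are not taken from FIG. 14 or from Japanese speech data; the Euclidean distance and the name `estimate_sex` are likewise assumptions, since the specification does not state the decision rule.

```python
import math

# Placeholder (F1, F2) reference points in Hz for two vowels; not from FIG. 14.
REFERENCE_FORMANTS = {
    "a": {"male": (730.0, 1090.0), "female": (850.0, 1220.0)},
    "i": {"male": (270.0, 2290.0), "female": (310.0, 2790.0)},
}


def estimate_sex(vowel, f1_hz, f2_hz):
    """Return "male" or "female", whichever reference point is nearer."""
    refs = REFERENCE_FORMANTS[vowel]
    return min(refs, key=lambda sex: math.dist((f1_hz, f2_hz), refs[sex]))


guess = estimate_sex("a", 700.0, 1050.0)  # "male"
```

The returned label could then drive the choice between the male-voice and female-voice acoustic models in unit 350.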

【００３６】以上は、本発明をカーナビゲーションへ適
用した例を示したが、本発明はこれに限定されるもので
はない。例えば、公共サービス端末に適用することもで
きる。（ここで、「公共サービス」とは私企業によるサ
ービスを排除する意図ではなく、公共的に行われるサー
ビスを意味している。）図１５にこのような公共サービス端末１５００の構成例
を示す。この公共サービス端末１５００は、ユーザ１５
１０が表示部１５０３の画面に対してタッチパネルや操
作ボタン等（図示せず）の操作により特定のサービスを
受けるためのものであり、ユーザの位置する床面にマッ
ト型体重計１５０５、ユーザ１５１０の前面上部にビデ
オカメラ１５０２が配置されている。また、ユーザ１５
１０の音声を収集するためのマイクロホン１５０４も設
けられている。これらの各要素は処理装置１５０１の下
で集中管理され、体重情報、画像情報、音声情報が取り
込まれる。処理装置１５０１では、画像情報から大凡の
ユーザの身長情報を求める。したがって、求められた体
重情報および身長情報に基づいて、上記の例と同様に話
者適応が行える。The above shows an example in which the present invention is applied to car navigation, but the present invention is not limited to this. For example, it can also be applied to a public service terminal. (Here, "public service" is not intended to exclude services provided by private companies; it means services provided to the public.) FIG. 15 shows a configuration example of such a public service terminal 1500. With this public service terminal 1500, a user 1510 receives a particular service by operating a touch panel, operation buttons, or the like (not shown) in response to the screen of the display unit 1503; a mat-type weight scale 1505 is placed on the floor where the user stands, and a video camera 1502 is placed above and in front of the user 1510. A microphone 1504 for picking up the voice of the user 1510 is also provided. These elements are centrally managed by the processing device 1501, which takes in weight information, image information, and speech information. The processing device 1501 obtains the user's approximate height from the image information. Speaker adaptation can therefore be performed, as in the example above, based on the weight information and height information thus obtained.

【００３７】以上、本発明の好適な実施の形態について
説明したが、本発明の要旨を逸脱することなく種々の変
形・変更を行うことが可能である。例えば、上記のカー
ナビゲーション装置への応用例では、運転者の発声の音
声認識を行う構成としたが、搭乗者席においてもその体
重および座席位置を検出する機能を設け、運転者と搭乗
者のいずれが発声するかに応じてユーザがその切替を行
う手段を設けるようにしてもよい。The preferred embodiment of the present invention has been described above, but various modifications and changes can be made without departing from the gist of the present invention. For example, the application to a car navigation device described above is configured to recognize the driver's speech, but a function for detecting weight and seat position may also be provided for the passenger seat, together with means by which the user switches between the two depending on whether the driver or the passenger is speaking.

[0038]

[Effects of the Invention] According to the present invention, more reliable speaker adaptation becomes possible and the recognition rate is improved, so that a speech recognition application system with good operability can be provided.

[Brief description of the drawings]

FIG. 1 is a diagram showing the concept of conventional speaker adaptation, which motivated the present invention.

FIG. 2 is a conceptual diagram of the processing in the speaker adaptation scheme of the present invention.

FIG. 3 is a diagram showing an example of means for detecting the physical characteristics of a speaker, which is the main point of the present invention.

FIG. 4 is a block diagram showing an example of the configuration of the present invention in detail.

FIG. 5 is a block diagram showing an example of the acoustic model adaptation unit shown in FIG. 4.

FIG. 6 is a block diagram showing an example of the speech recognition processing unit shown in FIG. 4.

FIG. 7 is a diagram showing an example of the height/weight detection unit and the voice feature conversion unit according to the present invention.

FIG. 8 is a diagram showing the concept of applying the present invention to car navigation.

FIG. 9 is a diagram showing an example of speech recognition performance evaluation results that illustrate speaker dependence in speech recognition.

FIG. 10 is a conceptual diagram showing an example of the speaker adaptation scheme used in the present invention.

FIG. 11 is a conceptual diagram showing an example of improved recognition performance.

FIG. 12 is an explanatory diagram showing an example of the voice feature correspondence data in the present invention.

FIG. 13 is an explanatory diagram of the formant structure of the vowels /a/ and /i/.

FIG. 14 is an explanatory diagram showing the F1-F2 distribution of vowels for men and women.

FIG. 15 is a diagram showing the configuration when the present invention is applied to a public service terminal.

[Explanation of symbols]

100: voice signal, 110: weight signal, 120: seat position signal, 150: seat, 160: mat-type weight scale, 170: seat position detector, 180: height conversion unit, 200: voice input unit, 210: voice analysis unit, 300: acoustic model adaptation processing, 310: voice personality information extraction processing, 320: physical feature detection unit, 330: voice feature conversion unit, 3310: voice feature comparison unit, 3320: voice feature correspondence data, 3330: voice feature conversion parameter calculation unit, 340: personality information fusion unit, 350: acoustic model adaptation unit, 3510: acoustic feature comparison unit, 3520: adaptation determination unit, 3530: acoustic model adaptation execution unit, 3531: acoustic model constraint unit, 3532: acoustic model conversion unit, 400: acoustic model, 500: voice recognition processing, 510: probability distribution calculation unit, 520: probability accumulation unit, 530: determination unit, 600: word dictionary, 999: recognition result, 2000: voice recognition system with speaker adaptation function, 5000: car navigation device, 5010: display unit, 5020: speaker, 5030: earphone jack, 5040: selector, 5050: microphone, 6000: phoneme segment, 6100: feature space of the HMM model, 6200: feature space of the specific-speaker model.

Continuation of front page
(72) Inventor: Akio Amano, Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: Masakazu Ejiri, Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: Shinya Oba, Semiconductor Division, Hitachi, Ltd., 5-20-1 Josuihoncho, Kodaira-shi, Tokyo

Claims

[Claims]

1. A speaker adaptation method in a speech recognition system equipped with a speech recognition function, comprising: detecting physical characteristic information of a speaker; estimating features of the speaker's voice in accordance with the physical characteristic information; and performing speech recognition using an acoustic model of the speech recognition system suited to the estimated voice features.

2. The method according to claim 1, wherein the physical characteristic information is at least one of weight information and height information of the speaker.

3. The speaker adaptation method in a speech recognition system described in a preceding claim, wherein voice features of the speaker are identified from the speaker's utterance of adaptation words specified in advance, and the acoustic model is selected or converted based on voice features that combine these voice features with the voice features obtained from the physical characteristic information of the speaker.

4. A speech recognition system equipped with a speech recognition function, comprising: means for acquiring physical characteristic information of a speaker; a correspondence table in which physical characteristic information of speakers is associated with voice features; means for extracting features of the speaker's voice by checking the acquired physical characteristic information against the correspondence table; a plurality of acoustic models for speech recognition; acoustic model adaptation means for selecting, or obtaining by conversion, from among the plurality of acoustic models an acoustic model suited to the extracted voice features; and speech recognition processing means for performing speech recognition on the speaker's speech using the acoustic model obtained by the acoustic model adaptation means.

5. The speech recognition system according to claim 4, wherein said physical characteristic information is at least one of weight information and height information of a speaker.

6. The speech recognition system according to claim 5, wherein the speech recognition system is mounted on a car navigation device, the means for acquiring weight information includes a mat-type weight scale disposed on the speaker's seat, and the means for acquiring height information includes a seat position detector for detecting the position of the speaker's seat.

7. The speech recognition system according to claim 5, wherein the means for acquiring weight information includes a mat-type weight scale placed where the speaker is located, and the means for acquiring height information includes a video camera for capturing an image of the speaker and processing means for obtaining height information by processing the image from the video camera.

8. The speech recognition system according to claim 4, wherein voice features of the speaker are identified from the speaker's utterance of adaptation words specified in advance, and the acoustic model adaptation means selects or converts the acoustic model based on voice features that combine these voice features with the voice features obtained from the physical characteristic information of the speaker.

9. The speech recognition system according to claim 4, wherein at least two types of said acoustic models are provided for men and women.

10. The speech recognition system according to claim 4, further comprising selection means for allowing a user to select, before speech recognition of the speaker, whether or not to apply speaker adaptation of the acoustic model.
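As an informal illustration of the arrangement recited in claims 4 and 8, the correspondence table lookup, the combination with utterance-derived features, and the model selection can be sketched as below. This is a sketch under assumptions, not the claimed implementation: the table contents, the two-value (F1, F2) feature representation, the weighted-average fusion, the nearest-reference-point selection rule, and all names are hypothetical, and the numbers are placeholders rather than measured data.

```python
import math

# Hypothetical correspondence table:
# (min_height_cm, max_height_cm) -> (F1_Hz, F2_Hz). Placeholder values only.
CORRESPONDENCE_TABLE = [
    ((0.0, 160.0), (850.0, 1400.0)),
    ((160.0, 175.0), (780.0, 1300.0)),
    ((175.0, 300.0), (700.0, 1200.0)),
]

# Hypothetical acoustic models, each summarized by one reference feature point.
ACOUSTIC_MODELS = {
    "female": (850.0, 1400.0),
    "male": (700.0, 1200.0),
}


def features_from_height(height_cm: float) -> tuple:
    """Look up voice features for a height in the correspondence table."""
    for (low, high), feats in CORRESPONDENCE_TABLE:
        if low <= height_cm < high:
            return feats
    raise ValueError("height outside the table range")


def fuse(physical: tuple, spoken: tuple, spoken_weight: float = 0.5) -> tuple:
    """Combine table-derived and utterance-derived features (assumed weighted mean)."""
    w = spoken_weight
    return tuple((1.0 - w) * p + w * s for p, s in zip(physical, spoken))


def select_model(feats: tuple) -> str:
    """Pick the model whose reference point is nearest to the features."""
    return min(ACOUSTIC_MODELS,
               key=lambda name: math.dist(ACOUSTIC_MODELS[name], feats))


# Made-up inputs: a 180 cm speaker and features from an adaptation word.
combined = fuse(features_from_height(180.0), (720.0, 1250.0))
chosen = select_model(combined)
```

A real system would select or convert full acoustic models (for example HMM parameters) rather than single reference points; the nearest-point rule here only stands in for that step.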