JP2004138914A

JP2004138914A - Device and method for adjusting dictionary for voice recognition

Info

Publication number: JP2004138914A
Application number: JP2002304970A
Authority: JP
Inventors: Masaharu Harada; 原田　将治
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-10-18
Filing date: 2002-10-18
Publication date: 2004-05-13
Anticipated expiration: 2022-10-18
Also published as: JP3992586B2

Abstract

PROBLEM TO BE SOLVED: To provide a device and a method for adjusting a speech recognition dictionary that can maintain the dictionary efficiently while improving recognition accuracy.

SOLUTION: A speech signal uttered by a user is input and recognized using a first speech recognition dictionary organized in word units, and at least the speech signal of each recognized word unit and the occurrence time of that signal are extracted from the recognition result. For each extracted occurrence time, the corresponding speech signal is recognized again using a second speech recognition dictionary organized in sub-word units. Based on that result, each word-unit speech signal and its reading information are stored as a pair of data, and the first dictionary is updated using the stored pairs.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識装置における音声認識用辞書のメンテナンスを効率的に行うことができる音声認識用辞書調整装置及び方法に関する。
【０００２】
【従来の技術】
従来の音声認識装置では、認識精度を高めることを目的として、最も評価値の高い単語及びそれに対応する読み情報を一対のデータとして集約することによって、入力された音声信号に対する音声認識用辞書を構成している。
【０００３】
しかし、従来の音声認識装置においては、音声認識用辞書に登録されるべき認識対象となる単語の読み情報については、専門家の手作業によることが多かったが、かかる作業は工数的にも煩雑な作業であることから、特定の自動変換ルール等を定めることによって予め読み情報を付与しておく技術が広く用いられている。
【０００４】
しかし、単語の読みには、発声者が同じであっても、個人ごとに揺らぎが存在する。例えば同じ発声者であっても、同じ単語、すなわち“敬語”と表記される単語が、「けいご」と発声されたり、「けえご」と発声されたりすることも考えられる。したがって、このような場合であっても正確に音声認識を行うためには、専門家の手作業によって各単語に読みを追加する、あるいは変換ルールを変更する等によって、音声認識精度の改善を行うチューニング作業が必要不可欠となっていた。
【０００５】
しかし、かかる作業は、たとえ専門家であっても手作業で行うのは工数的にも実用的ではない。したがって、特定のシステムを採用することによって自動的にチューニングを施すための方策が多々考えられている。
【０００６】
例えば（特許文献１）においては、音声認識モードと音声登録モードを切り替え、音声登録モードにおいては、単語単位に登録すべき音声信号についても、構成する一語単位に分割して音声認識用辞書に登録することによって、１回の登録作業で多数の登録データを得ることができる技術が開示されている。
【０００７】
また、（特許文献２）においては、一つの文字（単語）に対して複数の読みを自動的に付与することによって、登録作業を行っていない読みに対しても確実に認識文字を推定することができる技術が開示されている。
【０００８】
【特許文献１】
特開平１１−２８２４８６号公報
【０００９】
【特許文献２】
特開２０００−４７６８４号公報
【００１０】
【発明が解決しようとする課題】
しかし、（特許文献１）に開示されている方法では、単語を一語単位に展開する精度自体に問題があり、場合によっては登録されていない音節や音素も生じるおそれがあることから、場合によっては認識精度がかえって下がってしまうという問題点があった。
【００１１】
また、（特許文献２）に開示されている方法では、どの程度まで複数の読みを付加すれば認識精度が向上するのか判断することが難しく、結局は専門家が音声認識用辞書をメンテナンスするのと同等の作業工数となってしまうという問題点があった。
【００１２】
本発明は、上記問題点を解決するために、認識精度を向上させながら音声認識用辞書のメンテナンスを効率的に行うことができる音声認識用辞書調整装置及び方法を提供することを目的とする。
【００１３】
【課題を解決するための手段】
上記目的を達成するために本発明にかかる音声認識用辞書調整装置は、利用者の発する音声信号を入力する音声信号入力部と、入力された音声信号をひとまとまりの言葉単位に構成された第１の音声認識用辞書を用いて認識する第１の音声認識部と、第１の音声認識部における認識結果から少なくとも認識対象となったひとまとまりの言葉単位の音声信号と音声信号の発生時間を抽出する音声信号情報抽出部と、発生時間ごとに、対応する音声信号を用いて一語単位に構成された第２の音声認識用辞書を用いて再度認識する第２の音声認識部と、第２の音声認識部における認識結果に基づいて、ひとまとまりの言葉単位の音声信号と対応する読み情報を一対のデータとして保存する認識結果保存部と、保存されている一対のデータを用いて第１の音声認識用辞書を更新する認識辞書更新部とを含むことを特徴とする。
【００１４】
かかる構成により、単語単位で音声認識処理を行うことで単語単位の音声データを抽出し、その後単語に対応する音声データについて一語単位による音声認識を再度行うことにより、登録されていない音節や音素等が生じることがなく、利用者が用いた単語について確実に音声認識用辞書に追加することができることから、無駄な読み情報を登録することなく、音声認識精度の高い効率的な音声認識用辞書となるよう調整することが可能となる。
【００１５】
また、本発明にかかる音声認識用辞書調整装置は、ひとまとまりの言葉ごとに、第１の音声認識用辞書を更新した回数を集計する更新回数集計部を含むことが好ましい。更新回数の少ない認識結果は誤認識である可能性が高いと考えられることから、かかる認識結果を音声認識用辞書に反映させるのを防ぐことができるからである。
【００１６】
また、本発明にかかる音声認識用辞書調整装置は、第１の音声認識用辞書を利用者ごとに保存することが好ましい。あるいは、本発明にかかる音声認識用辞書調整装置は、第１の音声認識用辞書を利用者の使用する環境ごとに保存することも好ましい。
【００１７】
また、本発明にかかる音声認識用辞書調整装置は、第２の音声認識エンジンを複数個使用することも好ましい。
【００１８】
また、本発明は、上記のような音声認識用辞書調整装置の機能をコンピュータの処理ステップとして実行するソフトウェアを特徴とするものであり、具体的には、利用者の発する音声信号を入力する第一の工程と、入力された音声信号をひとまとまりの言葉単位に構成された第１の音声認識用辞書を用いて認識する第二の工程と、第二の工程における認識結果から少なくとも認識対象となったひとまとまりの言葉単位の音声信号と音声信号の発生時間を抽出する第三の工程と、発生時間ごとに、対応する音声信号を用いて一語単位に構成された第２の音声認識用辞書を用いて再度認識する第四の工程と、第四の工程における認識結果に基づいて、ひとまとまりの言葉単位の音声信号と対応する読み情報を一対のデータとして保存する第五の工程と、保存されている一対のデータを用いて第１の音声認識用辞書を更新する第六の工程とを含む音声認識用辞書調整方法並びにそのような工程を具現化するコンピュータ実行可能なプログラムであることを特徴とする。
【００１９】
かかる構成により、コンピュータ上へ当該プログラムをロードさせ実行することで、単語単位で音声認識処理を行うことで単語単位の音声データを抽出し、その後単語に対応する音声データについて一語単位による音声認識を再度行うことにより、登録されていない音節や音素等が生じることがなく、利用者が用いた単語について確実に音声認識用辞書に追加することができることから、無駄な読み情報を登録することなく、音声認識精度の高い効率的な音声認識用辞書となるよう調整することができる音声認識用辞書調整装置を実現することが可能となる。
【００２０】
【発明の実施の形態】
以下、本発明の実施の形態にかかる音声認識用辞書調整装置について、図面を参照しながら説明する。図１は本発明の実施の形態にかかる音声認識用辞書調整装置の構成図である。
【００２１】
図１において、音声を入力する利用者１１が、「今日の天気」という言葉を音声入力した場合、まず第１の音声認識エンジン１２において、ひとまとまりの単語単位で構成されている第１の音声認識用辞書１３を参照しながら音声認識を行う。
【００２２】
図２に本発明の実施の形態にかかる音声認識用辞書調整装置における第１の音声認識用辞書１３のデータ構成の例示図を示す。図２の例では、単語の品詞単位に、読み情報との対応を示すデータの集合として第１の音声認識用辞書１３を構成している。もちろん、品詞単位に限定されるものではない。
【００２３】
次に、認識結果抽出部１４において、認識結果として出力される単語と、当該出力単語に対応する音声信号を抽出する。音声信号の抽出は、出力される単語に対応する時間的信号区間によって行われる。例えば図１の例では、「今日」に対応する信号区間が１０ｍｓ〜２００ｍｓの間の区間、「の」に対応する信号区間が２００ｍｓ〜２５０ｍｓの間の区間、「天気」に対応する信号区間が２５０ｍｓ〜５００ｍｓの間の区間、というように抽出する。
【００２４】
そして、第２の音声認識エンジン１５において、音節や音素等の一語単位で構成された第２の音声認識用辞書１６を参照しながら、各々の音声データ区間に対し、読み情報単位で音声認識を行う。例えば、音節単位で音声認識するものとすると、図１の例では、「今日」に対応する１０ｍｓ〜２００ｍｓの間の区間における音声信号に基づいて、「きょ」と「お」の２つの音節に対する読み情報を第２の音声認識エンジン１５において認識することになる。
【００２５】
図３に本発明の実施の形態にかかる音声認識用辞書調整装置における第２の音声認識用辞書１６のデータ構成の例示図を示す。図３の例では、「あ」、「い」、「う」等の各音節単位で読み情報との対応を示す一対のデータの集合として第２の音声認識用辞書１６を構成している。もちろん、音節単位に限定されるものではなく、例えば音素単位であっても良い。
【００２６】
そして、認識結果保存部１７では、第２の音声認識エンジン１５による認識結果を、認識の対象となった信号区間に相当する音声信号に対応する単語と対応付けた一対のデータとして保存する。すなわち、図４に示すように、認識結果保存部１７には、単語「今日」に対して読み情報として「きょお」が保存されることになる。同様に、入力された音声信号全てに対して、第２の音声認識エンジン１５による認識結果を、認識の対象となった信号区間に相当する音声信号に対応する単語と対応付けた一対のデータとして保存することになる。
【００２７】
最後に、認識辞書更新部１８において、第１の音声認識用辞書１３に対して、認識結果保存部１７に保存されている一対のデータに基づいて音声認識用辞書の内容を更新することになる。
【００２８】
ここで、図５に示すように、第２の音声認識エンジン１５として、複数個の音声認識エンジンを用いることも考えられる。図５の例においては、第２の音声認識エンジン１５に、さらに第３の音声認識エンジン１９を追加している。
【００２９】
第２の音声認識エンジン１５と第３の音声認識エンジン１９とは、異なる音響モデルを利用している。そして、音節や音素等の一語単位で構成された第２の音声認識用辞書１６を参照しながら、各々の音声データ区間に対し、おのおの読み情報単位で音声認識を行うことになる。
【００３０】
例えば、図５の例において、単語「気温」と対応付けられた音声データ区間に対して、第２の音声認識エンジン１５では「きょぐ」と、第３の音声認識エンジン１９では「てんき」と、異なる読み情報で認識される場合、複数の音声認識エンジンにおいて認識結果が異なっていることから、かかる認識結果が誤っているものと判定され、第１の音声認識用辞書１３に対して更新しないようにするものである。
【００３１】
一方、単語「今日」と対応付けられた音声データ区間に対し、第２の音声認識エンジン１５では「きょお」、第３の音声認識エンジン１９でも「きょお」と認識される場合には、複数の音声認識エンジンにおいて認識結果が一致していることから、かかる認識結果は正しいものと判定され、第１の音声認識用辞書１３を更新することになる。
【００３２】
なお、第３の音声認識エンジン１９による「気温：てんき」という認識結果のうち、読み情報の部分について第１の音声認識用辞書１３と照合すると、読み情報の部分についてはすでに登録されている「天気：てんき」という一対のデータと一致していることも考えられる。このような場合には、第１の音声認識用辞書１３を更新しないようにしても良い。
【００３３】
すなわち、認識辞書更新部１８において第１の音声認識用辞書１３を更新する際、第１の音声認識用辞書１３に存在する他の単語に対応する読み情報と重なるものがないか照合を行うことになる。
【００３４】
図６（ａ）に、かかる照合を行う場合における第１の音声認識用辞書１３のデータ構成例を、図６（ｂ）に認識結果保存部１７のデータ構成例を、それぞれ示す。
【００３５】
図６（ｂ）に示すように、例えば認識結果保存部１７における一対のデータ「太郎：じろう」の読み情報部分である「じろう」は、既に第１の音声認識用辞書１３に登録されている他の一対のデータ「次郎：じろう」にも存在することが検出できる。この場合、一対のデータ「太郎：じろう」の方を削除しても良いし、読み情報が類似している単語として「太郎」と「次郎」を提示するようにしても良い。
【００３６】
また、第１の音声認識用辞書１３に同じ読み情報が存在する場合、例えば図６（ｂ）に示すように、認識結果保存部１７に「山田：がまだ」、「鎌田：がまだ」という２つの一対のデータが存在する場合には、両方のデータを削除しても良い。あるいは、同じ読み情報が存在する単語であっても、文法上で同時に出現することの有無を検証することによって、削除するか否かを決定することも考えられる。
【００３７】
さらに、認識辞書更新部１８において第１の音声認識用辞書１３を更新する際、既に保存されている単語と対応する読み情報という一対のデータの中に、更新しようとする一対のデータと同じ一対のデータが存在する場合も考えられる。この場合、当該一対のデータが保存された頻度を集計し、集計された更新頻度に基づいて、第１の音声認識用辞書１３に追加する単語と当該単語の読み情報との一対のデータを決定することも考えられる。
【００３８】
図７（ａ）に、かかる照合を行う場合における第１の音声認識用辞書１３のデータ構成例を、図７（ｂ）に認識結果保存部１７のデータ構成例を、それぞれ示す。
【００３９】
例えば、認識結果保存部１７に保存されている一対のデータの中に、更新頻度が１００回以上である６つの一対のデータが含まれているが、このうち第１の音声認識用辞書１３に含まれていない、「今日：ｋｙ　ｏ　ｏ」、「明日：ａ　ｓｕ　ｔａ」、「天気：ｔ　ｅ　ｇ　ｋ　ｉ」の３つの一対のデータについて、第１の音声認識用辞書１３に登録することになる。このように更新頻度の高いデータについてのみ第１の音声認識用辞書１３の更新の対象とすることにより、偶然誤って認識した結果等が第１の音声認識用辞書１３に反映されることがないことから、更新された第１の音声認識用辞書１３の音声認識精度を落とすことなく辞書の更新を行うことが可能となる。
【００４０】
また、第１の音声認識用辞書１３を利用者ごとに保持することも考えられる。すなわち、図８に示すように、利用者ＩＤごとに第１の音声認識用辞書１３を構成しておき、例えば利用者ＩＤが記述されたボタンを選択するような利用者ＩＤ認識部２０によって認識された利用者ＩＤごとに第１の音声認識用辞書１３を更新することになる。
【００４１】
また、利用者ごとに保持することに特に限定されるものではなく、例えば利用者の使用環境における背景雑音のレベルや、使用する電話回線の種類、マイクの種類等の音声信号を入力する環境に関する情報に基づいて、第１の音声認識用辞書１３を複数個設けるものであっても良い。
【００４２】
次に、本発明の実施の形態にかかる音声認識用辞書調整装置を実現するプログラムの処理の流れについて説明する。図９に本発明の実施の形態にかかる音声認識用辞書調整装置を実現するプログラムの処理の流れ図を示す。
【００４３】
図９において、まず利用者の音声信号を受信し（ステップＳ９０１）、第１の音声認識エンジン１２において、ひとまとまりの単語単位で構成されている第１の音声認識用辞書１３を参照しながら音声認識を行う（ステップＳ９０２）。
【００４４】
次に、認識結果として出力される単語と、当該出力単語に対応する音声信号を信号区間として抽出する（ステップＳ９０３）。そして、第２の音声認識エンジン１５において、音節や音素等の一語単位で構成された第２の音声認識用辞書１６を参照しながら、各々の音声データ区間に対し、読み情報単位で音声認識を行う（ステップＳ９０４）。
【００４５】
そして、第２の音声認識エンジン１５による認識結果を、認識の対象となった音声データに対応する単語と対応付けて保存し（ステップＳ９０５）、第１の音声認識用辞書１３の内容を、保存されている内容に基づいて更新する（ステップＳ９０６）。
【００４６】
なお、本実施の形態においては、第１の音声認識エンジン１２と第２の音声認識エンジン１５、あるいは追加される他の音声認識エンジンとは、同じ音声認識エンジンを用いても良いし、異なる音声認識エンジンを用いても良い。また、本実施の形態においては、最良であると判断された読み情報のみに基づいて第１の音声認識用辞書１３を更新しているが、複数個の読み情報候補について第１の音声認識用辞書１３を更新しても良い。
【００４７】
以上のように本実施の形態によれば、単語単位で音声認識処理を行うことで単語単位の音声データを抽出し、その後単語に対応する音声データについて一語単位による音声認識を再度行うことにより、登録されていない音節や音素等が生じることがなく、利用者が用いた単語について確実に音声認識用辞書に追加することができることから、無駄な読み情報を登録することなく、音声認識精度の高い効率的な音声認識用辞書となるよう調整することが可能となる。
【００４８】
本発明の実施の形態にかかる音声認識用辞書調整装置を実現するプログラムは、図１０に示すように、ＣＤ−ＲＯＭ１０２−１やフレキシブルディスク１０２−２等の可搬型記録媒体１０２だけでなく、通信回線の先に備えられた他の記憶装置１０１や、コンピュータ１０３のハードディスクやＲＡＭ等の記録媒体１０４のいずれに記憶されるものであっても良く、プログラム実行時には、プログラムはローディングされ、主メモリ上で実行される。
【００４９】
また、本発明の実施の形態にかかる音声認識用辞書調整装置により用いられる第１の音声認識用辞書や第２の音声認識用辞書等についても、図１０に示すように、ＣＤ−ＲＯＭ１０２−１やフレキシブルディスク１０２−２等の可搬型記録媒体１０２だけでなく、通信回線の先に備えられた他の記憶装置１０１や、コンピュータ１０３のハードディスクやＲＡＭ等の記録媒体１０４のいずれに記憶されるものであっても良く、例えば本発明にかかる音声認識用辞書調整装置を利用する際にコンピュータ１０３により読み取られる。
【００５０】
（付記１）　利用者の発する音声信号を入力する音声信号入力部と、
入力された前記音声信号をひとまとまりの言葉単位に構成された第１の音声認識用辞書を用いて認識する第１の音声認識部と、
前記第１の音声認識部における認識結果から少なくとも認識対象となった前記ひとまとまりの言葉単位の音声信号と前記音声信号の発生時間を抽出する音声信号情報抽出部と、
前記発生時間ごとに、対応する前記音声信号を用いて一語単位に構成された第２の音声認識用辞書を用いて再度認識する第２の音声認識部と、
前記第２の音声認識部における認識結果に基づいて、前記ひとまとまりの言葉単位の音声信号と対応する読み情報を一対のデータとして保存する認識結果保存部とを含むことを特徴とする音声認識用辞書調整装置。
【００５１】
（付記２）　保存されている前記一対のデータを用いて前記第１の音声認識用辞書を更新する認識辞書更新部をさらに含む付記１に記載の音声認識用辞書調整装置。
【００５２】
（付記３）　前記ひとまとまりの言葉ごとに、前記第１の音声認識用辞書を更新した回数を集計する更新回数集計部を含む付記１又は２に記載の音声認識用辞書調整装置。
【００５３】
（付記４）　前記第１の音声認識用辞書を前記利用者ごとに保存する付記１又は２に記載の音声認識用辞書調整装置。
【００５４】
（付記５）　前記第１の音声認識用辞書を前記利用者の使用する環境ごとに保存する付記１又は２に記載の音声認識用辞書調整装置。
【００５５】
（付記６）　前記第２の音声認識エンジンを複数個使用する付記１又は２に記載の音声認識用辞書調整装置。
【００５６】
（付記７）　利用者の発する音声信号を入力する第一の工程と、
入力された前記音声信号をひとまとまりの言葉単位に構成された第１の音声認識用辞書を用いて認識する第二の工程と、
前記第二の工程における認識結果から少なくとも認識対象となった前記ひとまとまりの言葉単位の音声信号と前記音声信号の発生時間を抽出する第三の工程と、
前記発生時間ごとに、対応する前記音声信号を用いて一語単位に構成された第２の音声認識用辞書を用いて再度認識する第四の工程と、
前記第四の工程における認識結果に基づいて、前記ひとまとまりの言葉単位の音声信号と対応する読み情報を一対のデータとして保存する第五の工程とを含むことを特徴とする音声認識用辞書調整方法。
【００５７】
（付記８）　利用者の発する音声信号を入力する第一の処理ステップと、
入力された前記音声信号をひとまとまりの言葉単位に構成された第１の音声認識用辞書を用いて認識する第二の処理ステップと、
前記第二の処理ステップにおける認識結果から少なくとも認識対象となった前記ひとまとまりの言葉単位の音声信号と前記音声信号の発生時間を抽出する第三の処理ステップと、
前記発生時間ごとに、対応する前記音声信号を用いて一語単位に構成された第２の音声認識用辞書を用いて再度認識する第四の処理ステップと、
前記第四の処理ステップにおける認識結果に基づいて、前記ひとまとまりの言葉単位の音声信号と対応する読み情報を一対のデータとして保存する第五の処理ステップとを含む音声認識用辞書調整方法を具現化することを特徴とするコンピュータ実行可能なプログラム。
【００５８】
【発明の効果】
以上のように本発明にかかる音声認識用辞書調整装置によれば、単語単位で音声認識処理を行うことで単語単位の音声データを抽出し、その後単語に対応する音声データについて一語単位による音声認識を再度行うことにより、登録されていない音節や音素等が生じることがなく、利用者が用いた単語について確実に音声認識用辞書に追加することができることから、無駄な読み情報を登録することなく、音声認識精度の高い効率的な音声認識用辞書となるよう調整することが可能となる。
【図面の簡単な説明】
【図１】本発明の実施の形態にかかる音声認識用辞書調整装置の構成図
【図２】本発明の実施の形態にかかる音声認識用辞書調整装置における第１の音声認識用辞書のデータ構成例示図
【図３】本発明の実施の形態にかかる音声認識用辞書調整装置における第２の音声認識用辞書のデータ構成例示図
【図４】本発明の実施の形態にかかる音声認識用辞書調整装置における認識結果保存部のデータ構成例示図
【図５】本発明の実施の形態にかかる音声認識用辞書調整装置の他の構成図
【図６】本発明の実施の形態にかかる音声認識用辞書調整装置における第１の音声認識用辞書及び認識結果保存部のデータ構成例示図
【図７】本発明の実施の形態にかかる音声認識用辞書調整装置における第１の音声認識用辞書及び認識結果保存部のデータ構成例示図
【図８】本発明の実施の形態にかかる音声認識用辞書調整装置の他の構成図
【図９】本発明の実施の形態にかかる音声認識用辞書調整装置における処理の流れ図
【図１０】コンピュータ環境の例示図
【符号の説明】
１１　利用者
１２　第１の音声認識エンジン
１３　第１の音声認識用辞書
１４　認識結果抽出部
１５　第２の音声認識エンジン
１６　第２の音声認識用辞書
１７　認識結果保存部
１８　認識辞書更新部
１９　第３の音声認識エンジン
２０　利用者ＩＤ認識部
１０１　回線先の記憶装置
１０２　ＣＤ−ＲＯＭやフレキシブルディスク等の可搬型記録媒体
１０２−１　ＣＤ−ＲＯＭ
１０２−２　フレキシブルディスク
１０３　コンピュータ
１０４　コンピュータ上のＲＡＭ／ハードディスク等の記録媒体
[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition dictionary adjustment device and method capable of efficiently maintaining a speech recognition dictionary in a speech recognition device.
[0002]
[Prior art]
In a conventional speech recognition device, to improve recognition accuracy, the speech recognition dictionary applied to the input speech signal is built by collecting pairs of data, each consisting of the word with the highest evaluation value and the reading information corresponding to that word.
[0003]
In conventional speech recognition devices, however, the reading information for the words to be registered in the speech recognition dictionary was often assigned manually by an expert. Because that work is laborious in terms of man-hours, techniques that assign reading information in advance by defining specific automatic conversion rules are widely used.
[0004]
However, the reading of a word fluctuates, even for the same speaker. For example, the same speaker may utter the word written as "敬語" (honorific language) as "keigo" on one occasion and as "keego" on another. To recognize speech accurately even in such cases, tuning work was indispensable: an expert had to improve recognition accuracy by manually adding readings to individual words or by changing the conversion rules.
[0005]
However, performing such work manually is impractical in terms of man-hours, even for an expert. Many approaches have therefore been proposed for tuning automatically by means of a dedicated system.
[0006]
For example, Patent Document 1 discloses a technique that switches between a speech recognition mode and a speech registration mode. In the registration mode, a speech signal to be registered as a word is also divided into its constituent sub-word units and registered in the speech recognition dictionary, so that a large amount of registration data is obtained from a single registration operation.
[0007]
Patent Document 2 discloses a technique that automatically assigns a plurality of readings to one character (word), so that the recognized character can be reliably estimated even for readings that have not been registered.
[0008]
[Patent Document 1]
JP-A-11-282486
[0009]
[Patent Document 2]
JP-A-2000-47684
[0010]
[Problems to be solved by the invention]
However, in the method disclosed in Patent Document 1, the accuracy of decomposing a word into sub-word units is itself problematic, and unregistered syllables or phonemes may arise. As a result, recognition accuracy may in some cases actually decrease.
[0011]
In the method disclosed in Patent Document 2, it is difficult to judge how many readings must be added before recognition accuracy improves, so in the end the man-hours required are equivalent to those of an expert maintaining the speech recognition dictionary.
[0012]
To solve the above problems, an object of the present invention is to provide a speech recognition dictionary adjustment device and method that can maintain a speech recognition dictionary efficiently while improving recognition accuracy.
[0013]
[Means for Solving the Problems]
To achieve the above object, a speech recognition dictionary adjustment device according to the present invention comprises: a speech signal input unit that inputs a speech signal uttered by a user; a first speech recognition unit that recognizes the input speech signal using a first speech recognition dictionary organized in word units; a speech signal information extraction unit that extracts, from the recognition result of the first speech recognition unit, at least the speech signal of each recognized word unit and the occurrence time of that speech signal; a second speech recognition unit that, for each occurrence time, recognizes the corresponding speech signal again using a second speech recognition dictionary organized in sub-word units; a recognition result storage unit that, based on the recognition result of the second speech recognition unit, stores each word-unit speech signal and its corresponding reading information as a pair of data; and a recognition dictionary update unit that updates the first speech recognition dictionary using the stored pairs of data.
[0014]
With this configuration, speech recognition is first performed in word units to extract the speech data for each word, and the speech data corresponding to each word is then recognized again in sub-word units. As a result, no unregistered syllables or phonemes arise, and the words actually used by the user are reliably added to the speech recognition dictionary. The dictionary can therefore be adjusted into an efficient one with high recognition accuracy, without registering useless reading information.
[0015]
The speech recognition dictionary adjustment device according to the present invention preferably also includes an update count tallying unit that counts, for each word unit, the number of times the first speech recognition dictionary has been updated. A recognition result with a small update count is likely to be a misrecognition, so tallying the counts makes it possible to keep such results from being reflected in the speech recognition dictionary.
[0016]
Further, the speech recognition dictionary adjustment device according to the present invention preferably stores the first speech recognition dictionary for each user. Alternatively, the speech recognition dictionary adjustment device according to the present invention preferably stores the first speech recognition dictionary for each environment used by the user.
[0017]
It is also preferable that the speech recognition dictionary adjustment device according to the present invention uses a plurality of second speech recognition engines.
[0018]
The present invention is also characterized by software that executes the functions of the above speech recognition dictionary adjustment device as processing steps of a computer. Specifically, it is a speech recognition dictionary adjustment method, and a computer-executable program embodying its steps, comprising: a first step of inputting a speech signal uttered by a user; a second step of recognizing the input speech signal using a first speech recognition dictionary organized in word units; a third step of extracting, from the recognition result of the second step, at least the speech signal of each recognized word unit and the occurrence time of that speech signal; a fourth step of recognizing, for each occurrence time, the corresponding speech signal again using a second speech recognition dictionary organized in sub-word units; a fifth step of storing, based on the recognition result of the fourth step, each word-unit speech signal and its corresponding reading information as a pair of data; and a sixth step of updating the first speech recognition dictionary using the stored pairs of data.
[0019]
With this configuration, loading the program onto a computer and executing it realizes a speech recognition dictionary adjustment device in which speech recognition is first performed in word units to extract the speech data for each word, and the speech data corresponding to each word is then recognized again in sub-word units. No unregistered syllables or phonemes arise, and the words actually used by the user are reliably added to the speech recognition dictionary, so the dictionary can be adjusted into an efficient one with high recognition accuracy without registering useless reading information.
[0020]
[Embodiments of the Invention]
Hereinafter, a dictionary adjustment device for speech recognition according to an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a configuration diagram of a speech recognition dictionary adjustment device according to an embodiment of the present invention.
[0021]
In FIG. 1, when a user 11 utters the phrase "今日の天気" ("today's weather"), the first speech recognition engine 12 first performs speech recognition with reference to the first speech recognition dictionary 13, which is organized in word units.
[0022]
FIG. 2 shows an example of a data configuration of the first speech recognition dictionary 13 in the speech recognition dictionary adjustment device according to the embodiment of the present invention. In the example of FIG. 2, the first speech recognition dictionary 13 is configured as a set of data indicating the correspondence with the reading information for each part of speech of a word. Of course, it is not limited to a part of speech unit.
[0023]
Next, the recognition result extraction unit 14 extracts each word output as a recognition result together with the speech signal corresponding to that word. The speech signal is extracted as the time interval (signal section) that corresponds to the output word. In the example of FIG. 1, the signal section corresponding to "今日" (today) is 10 ms to 200 ms, the section corresponding to "の" (the particle "no") is 200 ms to 250 ms, and the section corresponding to "天気" (weather) is 250 ms to 500 ms.
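As a rough illustration of this extraction step, the following Python sketch cuts a sampled signal into the three sections of the FIG. 1 example. The `WordSegment` type, the `slice_samples` helper, and the 16 kHz sampling rate are assumptions made for the example; the source does not specify how the signal is represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSegment:
    word: str      # word output by the first recognizer
    start_ms: int  # start of the signal section
    end_ms: int    # end of the signal section


def slice_samples(samples, sample_rate_hz, segment):
    """Return the samples that fall inside the segment's time interval."""
    start = segment.start_ms * sample_rate_hz // 1000
    end = segment.end_ms * sample_rate_hz // 1000
    return samples[start:end]


# Time intervals taken from the FIG. 1 example.
segments = [
    WordSegment("今日", 10, 200),
    WordSegment("の", 200, 250),
    WordSegment("天気", 250, 500),
]

sample_rate_hz = 16000                      # assumed sampling rate
samples = list(range(sample_rate_hz // 2))  # 500 ms of dummy samples
clips = {s.word: slice_samples(samples, sample_rate_hz, s) for s in segments}
```

Each clip can then be handed to the sub-word recognizer on its own, which is what ties a reading back to exactly one word.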
[0024]
Then, the second speech recognition engine 15 performs speech recognition on each speech data section in units of reading information, with reference to the second speech recognition dictionary 16, which is organized in sub-word units such as syllables or phonemes. For example, if recognition is performed in syllable units, in the example of FIG. 1 the second speech recognition engine 15 recognizes the reading information for the two syllables "kyo" and "o" from the speech signal in the 10 ms to 200 ms section corresponding to "今日" (today).
[0025]
FIG. 3 shows an example of the data configuration of the second speech recognition dictionary 16 in the speech recognition dictionary adjustment device according to the embodiment of the present invention. In the example of FIG. 3, the second speech recognition dictionary 16 is configured as a set of data pairs, each associating a syllable such as "a", "i", or "u" with its reading information. Of course, the dictionary is not limited to syllable units and may, for example, use phoneme units.
[0026]
The recognition result storage unit 17 then stores each recognition result of the second speech recognition engine 15 as a pair of data, associating it with the word that corresponds to the speech signal of the recognized signal section. That is, as shown in FIG. 4, the reading "kyoo" is stored in the recognition result storage unit 17 for the word "今日" (today). The same is done for every part of the input speech signal.
[0027]
Finally, the recognition dictionary update unit 18 updates the contents of the first speech recognition dictionary 13 based on the pairs of data stored in the recognition result storage unit 17.
[0028]
Here, as shown in FIG. 5, a plurality of speech recognition engines may be used as the second speech recognition engine 15. In the example of FIG. 5, a third speech recognition engine 19 is further added to the second speech recognition engine 15.
[0029]
The second speech recognition engine 15 and the third speech recognition engine 19 use different acoustic models. Each of them performs speech recognition on each speech data section in units of reading information, with reference to the second speech recognition dictionary 16, which is organized in sub-word units such as syllables or phonemes.
[0030]
For example, in the example of FIG. 5, suppose that for the speech data section associated with the word "気温" (temperature), the second speech recognition engine 15 outputs the reading "kyogu" while the third speech recognition engine 19 outputs "tenki". Because the recognition results of the engines differ, the result is judged to be incorrect, and the first speech recognition dictionary 13 is not updated.
[0031]
On the other hand, when the speech data section associated with the word "今日" (today) is recognized as "kyoo" by the second speech recognition engine 15 and also as "kyoo" by the third speech recognition engine 19, the recognition results of the engines match. The result is therefore judged to be correct, and the first speech recognition dictionary 13 is updated.
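The agreement check described for FIG. 5 can be sketched as follows. This is only an illustration: the function name `agreed_reading` and the data layout are assumptions, and the readings are the kana given in the original Japanese text for the two example words.

```python
def agreed_reading(readings):
    """Return the common reading if every engine produced the same one, else None."""
    if not readings:
        return None
    first = readings[0]
    return first if all(r == first for r in readings) else None


# Readings from the FIG. 5 example, listed as [engine 15, engine 19].
candidates = {
    "気温": ["きょぐ", "てんき"],  # engines disagree, so the result is rejected
    "今日": ["きょお", "きょお"],  # engines agree, so the result is accepted
}

accepted = {}
for word, readings in candidates.items():
    reading = agreed_reading(readings)
    if reading is not None:
        accepted[word] = reading
```

Only the entries in `accepted` would be passed on to the dictionary update; requiring unanimity is the simplest policy, and the same helper works unchanged if more engines are added.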
[0032]
Note that when the reading portion of the third speech recognition engine 19's result "気温: tenki" is checked against the first speech recognition dictionary 13, it may turn out to match the reading of an already registered pair, "天気 (weather): tenki". In such a case, the first speech recognition dictionary 13 may be left without being updated.
[0033]
In other words, when the recognition dictionary update unit 18 updates the first speech recognition dictionary 13, it checks whether the new reading information overlaps with reading information registered for any other word in the first speech recognition dictionary 13.
[0034]
FIG. 6A shows an example of the data configuration of the first dictionary for speech recognition 13 in the case of performing such matching, and FIG. 6B shows an example of the data configuration of the recognition result storage unit 17, respectively.
[0035]
As shown in FIG. 6B, for example, the reading "jirou" of the pair "太郎 (Taro): jirou" in the recognition result storage unit 17 can be detected as also being present in another pair, "次郎 (Jiro): jirou", that is already registered in the first speech recognition dictionary 13. In this case, the pair "太郎: jirou" may be deleted, or "太郎" and "次郎" may be presented as words with similar reading information.
[0036]
The same reading information may also appear more than once among the candidates. For example, as shown in FIG. 6B, when the recognition result storage unit 17 contains the two pairs "山田 (Yamada): gamada" and "鎌田 (Kamata): gamada", both pairs may be deleted. Alternatively, even for words that share the same reading information, whether to delete them may be decided by verifying whether the words can appear at the same position in the grammar.
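A minimal sketch of this reading-collision check, under assumed data structures: the first dictionary is modeled as a mapping from word to a set of readings, and pending candidates as a list of (word, reading) pairs. The function name `split_by_reading_conflict` is hypothetical, and the policy shown (set aside every conflicting pair) is just one of the options the text mentions.

```python
def split_by_reading_conflict(dictionary, pending):
    """Separate pending (word, reading) pairs into safe and conflicting ones.

    A pair conflicts if its reading is already registered for a different word
    in the dictionary, or if another pending pair assigns the same reading to
    a different word.
    """
    # reading -> set of words that claim it, across dictionary and pending pairs
    owners = {}
    for word, readings in dictionary.items():
        for reading in readings:
            owners.setdefault(reading, set()).add(word)
    for word, reading in pending:
        owners.setdefault(reading, set()).add(word)

    safe, conflicting = [], []
    for word, reading in pending:
        target = conflicting if len(owners[reading]) > 1 else safe
        target.append((word, reading))
    return safe, conflicting


dictionary = {"次郎": {"じろう"}, "天気": {"てんき"}}
pending = [("太郎", "じろう"), ("山田", "がまだ"), ("鎌田", "がまだ"), ("今日", "きょお")]
safe, conflicting = split_by_reading_conflict(dictionary, pending)
```

The conflicting pairs could then be deleted, shown to a maintainer as similar-sounding words, or filtered further by a grammar check, as the description suggests.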
[0037]
Furthermore, when the recognition dictionary update unit 18 updates the first speech recognition dictionary 13, the pairs of word and reading information already stored may include a pair identical to the one about to be added. In this case, the number of times each pair has been stored may be tallied, and the pairs of word and reading information to be added to the first speech recognition dictionary 13 may be decided based on the tallied update frequency.
[0038]
FIG. 7A shows an example of a data configuration of the first dictionary 13 for speech recognition in the case of performing such matching, and FIG. 7B shows an example of a data configuration of the recognition result storage unit 17, respectively.
[0039]
For example, the pairs of data stored in the recognition result storage unit 17 include six pairs whose update frequency is 100 or more. Of these, the three pairs not yet contained in the first speech recognition dictionary 13, namely "今日 (today): ky o o", "明日 (tomorrow): a su ta", and "天気 (weather): t e g k i", are registered in the first speech recognition dictionary 13. Because only frequently updated data is used to update the first speech recognition dictionary 13, results that were misrecognized by chance are not reflected in it, and the dictionary can be updated without lowering its recognition accuracy.
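The frequency-based selection can be illustrated with a short sketch. The threshold of 100 and the three readings come from the example above; the individual counts, the extra entries, and the name `pairs_to_register` are made up for illustration.

```python
from collections import Counter


def pairs_to_register(stored_pairs, dictionary, min_count=100):
    """Return the (word, reading) pairs stored at least min_count times
    that the dictionary does not contain yet."""
    counts = Counter(stored_pairs)
    return {
        (word, reading)
        for (word, reading), n in counts.items()
        if n >= min_count and reading not in dictionary.get(word, set())
    }


# Hypothetical storage contents: counts are invented for this example.
stored_pairs = (
    [("今日", "ky o o")] * 120
    + [("明日", "a su ta")] * 100
    + [("天気", "t e g k i")] * 150
    + [("の", "n o")] * 300       # frequent, but already registered
    + [("気温", "ky o g u")] * 2  # rare, likely a chance misrecognition
)
dictionary = {"の": {"n o"}}
new_pairs = pairs_to_register(stored_pairs, dictionary)
```

Raising `min_count` makes the update more conservative at the cost of adapting more slowly to a speaker's habitual pronunciations.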
[0040]
It is also conceivable to hold a first speech recognition dictionary 13 for each user. That is, as shown in FIG. 8, a first speech recognition dictionary 13 is provided for each user ID, and the dictionary belonging to the user ID recognized by the user ID recognition unit 20 is updated. The user ID recognition unit 20 may, for example, have the user select a button labeled with the user ID.
[0041]
The dictionaries need not be held per user. For example, a plurality of first speech recognition dictionaries 13 may be provided based on information about the environment in which the speech signal is input, such as the background noise level in the user's environment, the type of telephone line used, or the type of microphone.
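One way to picture per-user or per-environment dictionaries is a store keyed by whatever identifies the context. The sketch below is an assumption about layout only: the class name `DictionaryStore`, the key formats, and the example identifiers are hypothetical and do not appear in the source.

```python
from collections import defaultdict


class DictionaryStore:
    """Holds one word-to-readings dictionary per key (user ID or environment)."""

    def __init__(self):
        # key -> word -> set of readings
        self._dicts = defaultdict(lambda: defaultdict(set))

    def add_reading(self, key, word, reading):
        self._dicts[key][word].add(reading)

    def readings(self, key, word):
        # Use .get so that a lookup never creates empty entries.
        return set(self._dicts.get(key, {}).get(word, ()))


store = DictionaryStore()
# Keyed by a hypothetical user ID.
store.add_reading("user-001", "今日", "きょお")
# Keyed by a hypothetical environment description (line type, microphone type).
store.add_reading(("mobile-line", "headset"), "今日", "きょう")
```

Because the key is opaque to the store, the same structure covers both the per-user case of FIG. 8 and the per-environment variant.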
[0042]
Next, a description will be given of a processing flow of a program for realizing the speech recognition dictionary adjustment device according to the embodiment of the present invention. FIG. 9 shows a flowchart of the processing of a program for realizing the speech recognition dictionary adjustment device according to the embodiment of the present invention.
[0043]
In FIG. 9, the user's speech signal is first received (step S901), and the first speech recognition engine 12 performs speech recognition with reference to the first speech recognition dictionary 13, which is organized in word units (step S902).
[0044]
Next, each word output as a recognition result and the speech signal corresponding to that word are extracted as a signal section (step S903). The second speech recognition engine 15 then performs speech recognition on each speech data section in units of reading information, with reference to the second speech recognition dictionary 16, which is organized in sub-word units such as syllables or phonemes (step S904).
[0045]
The recognition result of the second speech recognition engine 15 is then stored in association with the word corresponding to the recognized speech data (step S905), and the contents of the first speech recognition dictionary 13 are updated based on the stored contents (step S906).
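The flow of steps S901 to S906 can be sketched end to end with stand-in recognizers. Everything named here is an assumption for illustration: the two recognizers are passed in as plain functions, and the "signal" is a list of syllable strings standing in for audio so that the example runs without a real engine.

```python
def adjust_dictionary(signal, recognize_words, recognize_readings, word_dict, store):
    """Run steps S902 to S906 on one received signal (S901 is the caller's input)."""
    # S902/S903: word-level recognition yields (word, start, end) sections.
    for word, start, end in recognize_words(signal, word_dict):
        # S904: re-recognize the section in sub-word units.
        reading = recognize_readings(signal[start:end])
        # S905: store the (word, reading) pair.
        store.append((word, reading))
    # S906: update the first dictionary from the stored pairs.
    for word, reading in store:
        word_dict.setdefault(word, set()).add(reading)
    return word_dict


# Stand-in recognizers (hypothetical): fixed sections and syllable joining.
def fake_word_recognizer(signal, word_dict):
    return [("今日", 0, 2), ("の", 2, 3), ("天気", 3, 6)]


def fake_syllable_recognizer(section):
    return "".join(section)


signal = ["きょ", "お", "の", "て", "ん", "き"]  # stands in for the audio
word_dict = {"今日": {"きょう"}, "の": {"の"}, "天気": {"てんき"}}
store = []
adjust_dictionary(signal, fake_word_recognizer, fake_syllable_recognizer, word_dict, store)
```

After the call, the word "今日" carries both its original reading and the observed variant, which is the effect the embodiment aims for. Filters such as engine agreement or a frequency threshold would sit between S905 and S906.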
[0046]
In the present embodiment, the first speech recognition engine 12, the second speech recognition engine 15, and any additional speech recognition engines may be the same speech recognition engine or different ones. Also, in the present embodiment the first speech recognition dictionary 13 is updated based only on the reading information judged to be the best, but it may instead be updated with a plurality of reading information candidates.
[0047]
As described above, according to the present embodiment, speech recognition is first performed in word units to extract the speech data for each word, and the speech data corresponding to each word is then recognized again in sub-word units. No unregistered syllables or phonemes arise, and the words actually used by the user are reliably added to the speech recognition dictionary. The dictionary can therefore be adjusted into an efficient one with high recognition accuracy, without registering useless reading information.
[0048]
As shown in FIG. 10, the program that realizes the speech recognition dictionary adjustment device according to the embodiment of the present invention may be stored not only on a portable recording medium 102 such as a CD-ROM 102-1 or a flexible disk 102-2, but also in another storage device 101 at the far end of a communication line, or on a recording medium 104 such as the hard disk or RAM of a computer 103. When executed, the program is loaded and runs in main memory.
[0049]
Also, as shown in FIG. 10, the first speech recognition dictionary and the second speech recognition dictionary used by the speech recognition dictionary adjustment device according to the embodiment of the present invention may likewise be stored not only on a portable recording medium 102 such as a CD-ROM 102-1 or a flexible disk 102-2, but also in another storage device 101 provided at the end of a communication line, or on a recording medium 104 such as the hard disk or RAM of the computer 103. They are read by the computer 103, for example, when the speech recognition dictionary adjustment device according to the present invention is used.
[0050]
(Supplementary Note 1) A speech recognition dictionary adjustment device comprising:
a speech signal input unit for inputting a speech signal uttered by a user;
a first speech recognition unit that recognizes the input speech signal using a first speech recognition dictionary configured in units of words;
a speech signal information extraction unit that extracts, from the recognition result of the first speech recognition unit, at least the speech signal of each recognized word and the occurrence time of that speech signal;
a second speech recognition unit that, for each extracted occurrence time, recognizes the corresponding speech signal again using a second speech recognition dictionary configured in sub-word units; and
a recognition result storage unit that stores, based on the recognition result of the second speech recognition unit, the word-unit speech signal and the corresponding reading information as a pair of data.
[0051]
(Supplementary Note 2) The speech recognition dictionary adjustment device according to supplementary note 1, further comprising a recognition dictionary updating unit that updates the first speech recognition dictionary using the stored pair of data.
[0052]
(Supplementary Note 3) The speech recognition dictionary adjustment device according to Supplementary Note 1 or 2, further including an update count totalizing unit that counts, for each word, the number of times the first speech recognition dictionary has been updated.
[0053]
(Supplementary note 4) The speech recognition dictionary adjustment device according to supplementary note 1 or 2, wherein the first speech recognition dictionary is stored for each user.
[0054]
(Supplementary note 5) The speech recognition dictionary adjustment device according to Supplementary note 1 or 2, wherein the first speech recognition dictionary is stored for each environment used by the user.
[0055]
(Supplementary note 6) The speech recognition dictionary adjustment device according to Supplementary note 1 or 2, wherein a plurality of the second speech recognition engines are used.
[0056]
(Supplementary Note 7) A speech recognition dictionary adjustment method comprising:
a first step of inputting a speech signal uttered by a user;
a second step of recognizing the input speech signal using a first speech recognition dictionary configured in units of words;
a third step of extracting, from the recognition result of the second step, at least the speech signal of each recognized word and the occurrence time of that speech signal;
a fourth step of recognizing, for each extracted occurrence time, the corresponding speech signal again using a second speech recognition dictionary configured in sub-word units; and
a fifth step of storing, based on the recognition result of the fourth step, the word-unit speech signal and the corresponding reading information as a pair of data.
[0057]
(Supplementary Note 8) A computer-executable program that causes a computer to execute:
a first processing step of inputting a speech signal uttered by a user;
a second processing step of recognizing the input speech signal using a first speech recognition dictionary configured in units of words;
a third processing step of extracting, from the recognition result of the second processing step, at least the speech signal of each recognized word and the occurrence time of that speech signal;
a fourth processing step of recognizing, for each extracted occurrence time, the corresponding speech signal again using a second speech recognition dictionary configured in sub-word units; and
a fifth processing step of storing, based on the recognition result of the fourth processing step, the word-unit speech signal and the corresponding reading information as a pair of data.
[0058]
[Effects of the Invention]
As described above, according to the speech recognition dictionary adjustment device of the present invention, speech data is extracted in units of words by performing speech recognition processing in units of words, and the speech data corresponding to each word is then recognized again in sub-word units. As a result, unregistered syllables and phonemes do not occur, and words used by the user can be reliably added to the speech recognition dictionary. In addition, the dictionary can be adjusted to be an efficient speech recognition dictionary with high speech recognition accuracy.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a speech recognition dictionary adjustment device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a data configuration example of the first speech recognition dictionary in the speech recognition dictionary adjustment device according to the embodiment of the present invention.
FIG. 3 is a diagram illustrating a data configuration example of the second speech recognition dictionary in the speech recognition dictionary adjustment device according to the embodiment of the present invention.
FIG. 4 is a diagram illustrating a data configuration example of the recognition result storage unit in the speech recognition dictionary adjustment device according to the embodiment of the present invention.
FIG. 5 is another configuration diagram of the speech recognition dictionary adjustment device according to the embodiment of the present invention.
FIG. 6 is a diagram illustrating a data configuration example of the first speech recognition dictionary and the recognition result storage unit in the speech recognition dictionary adjustment device according to the embodiment of the present invention.
FIG. 7 is a diagram illustrating a data configuration example of the first speech recognition dictionary and the recognition result storage unit in the speech recognition dictionary adjustment device according to the embodiment of the present invention.
FIG. 8 is another configuration diagram of the speech recognition dictionary adjustment device according to the embodiment of the present invention.
FIG. 9 is a flowchart of processing in the speech recognition dictionary adjustment device according to the embodiment of the present invention.
FIG. 10 is an illustration of a computer environment.
[Explanation of reference numerals]
11 User
12 First speech recognition engine
13 First speech recognition dictionary
14 Recognition result extraction unit
15 Second speech recognition engine
16 Second speech recognition dictionary
17 Recognition result storage unit
18 Recognition dictionary update unit
19 Third speech recognition engine
20 User ID recognition unit
101 Storage device at the end of a communication line
102 Portable recording medium such as a CD-ROM or flexible disk
102-1 CD-ROM
102-2 Flexible disk
103 Computer
104 Recording medium such as RAM or hard disk on the computer

Claims

A speech recognition dictionary adjustment device comprising:
a speech signal input unit for inputting a speech signal uttered by a user;
a first speech recognition unit that recognizes the input speech signal using a first speech recognition dictionary configured in units of words;
a speech signal information extraction unit that extracts, from the recognition result of the first speech recognition unit, at least the speech signal of each recognized word and the occurrence time of that speech signal;
a second speech recognition unit that, for each extracted occurrence time, recognizes the corresponding speech signal again using a second speech recognition dictionary configured in sub-word units; and
a recognition result storage unit that stores, based on the recognition result of the second speech recognition unit, the word-unit speech signal and the corresponding reading information as a pair of data.

The speech recognition dictionary adjustment device according to claim 1, further comprising a recognition dictionary updating unit that updates the first speech recognition dictionary using the stored pair of data.

The speech recognition dictionary adjustment device according to claim 1, wherein the first speech recognition dictionary is stored for each user.

The speech recognition dictionary adjustment device according to claim 1, wherein a plurality of the second speech recognition engines are used.

A speech recognition dictionary adjustment method comprising:
a first step of inputting a speech signal uttered by a user;
a second step of recognizing the input speech signal using a first speech recognition dictionary configured in units of words;
a third step of extracting, from the recognition result of the second step, at least the speech signal of each recognized word and the occurrence time of that speech signal;
a fourth step of recognizing, for each extracted occurrence time, the corresponding speech signal again using a second speech recognition dictionary configured in sub-word units; and
a fifth step of storing, based on the recognition result of the fourth step, the word-unit speech signal and the corresponding reading information as a pair of data.

A computer-executable program that causes a computer to execute:
a first processing step of inputting a speech signal uttered by a user;
a second processing step of recognizing the input speech signal using a first speech recognition dictionary configured in units of words;
a third processing step of extracting, from the recognition result of the second processing step, at least the speech signal of each recognized word and the occurrence time of that speech signal;
a fourth processing step of recognizing, for each extracted occurrence time, the corresponding speech signal again using a second speech recognition dictionary configured in sub-word units; and
a fifth processing step of storing, based on the recognition result of the fourth processing step, the word-unit speech signal and the corresponding reading information as a pair of data.