JPH11202890A - Speech retrieval device

Info

Publication number: JPH11202890A
Application number: JP10022629A
Authority: JP
Inventors: Tetsuya Muroi; 哲也室井
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1998-01-20
Filing date: 1998-01-20
Publication date: 1999-07-30

Abstract

PROBLEM TO BE SOLVED: To provide a speech retrieval device that can accurately retrieve speech data from a speech database. SOLUTION: The device has a speech database 1 in which speech data to be searched is stored, a first speech recognition unit 2 that detects specific speech data by spotting among the speech data in the speech database 1, a second speech recognition unit 3 that recognizes a keyword from speech uttered by a user, and a speech dictionary creation unit 4 that creates a speech dictionary for the first speech recognition unit 2 from the keyword recognized by the second speech recognition unit 3. The first speech recognition unit 2 uses the speech dictionary created by the speech dictionary creation unit 4 to detect, by spotting, speech data corresponding to the speech dictionary among the speech data in the speech database 1.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech retrieval device that searches the contents of a speech database.

[0002]

[Prior Art] A speech retrieval device such as the one disclosed in JP-A-6-175698 is conventionally known for searching speech data, such as answering-machine messages and voice mail, or speech data accompanied by images, such as video, for a keyword used for speech recognition. In a first method of this device, when the user utters a keyword, the portions of the speech data in the speech database that correspond to the uttered keyword are detected one after another by word spotting (recognition), and the speech data the user intends is thereby retrieved from the speech database. In a second method, the speech database has a speech data storage unit and a keyword storage unit; speech data such as answering-machine messages or electronic mail is stored in the speech data storage unit, and keywords for that speech data are stored in advance in the keyword storage unit. When the user utters a keyword, it is matched against the keywords (keyword candidates) stored in the keyword storage unit of the speech database, and when a matching keyword is found, the speech data (speech signal) in the speech data storage unit corresponding to that keyword is extracted.

[0003]

[Problems to Be Solved by the Invention] In the first method, however, the user and the speaker (creator) of the speech database are usually different people, so their voices usually differ, and recognition performance (retrieval performance) is consequently poor. In particular, when the speakers differ in gender, for example when the user is a woman and the speech database is news read by a male announcer, accurate recognition becomes difficult.

[0004] The second method does not compare the speaker's voice directly with the speech in the speech database and so does not have the above drawback of the first method, but it has the drawback that keywords must be stored in the speech database in advance. This is not merely a matter of input effort: the keywords (search tags) that users are likely to need must be predicted in advance and registered as keyword candidates, so unless the prediction is highly accurate, the keyword needed at search time may be missing.

[0005] An object of the present invention is to provide a speech retrieval device capable of accurately retrieving speech data in a speech database.

[0006]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 comprises: a speech database in which speech data to be searched is stored; a first speech recognition unit for detecting predetermined speech data from the speech data in the speech database by spotting; a second speech recognition unit for recognizing a keyword from speech uttered by a user; and a speech dictionary creation unit that creates a speech dictionary for the first speech recognition unit from the keyword recognized by the second speech recognition unit, wherein the first speech recognition unit uses the speech dictionary created by the speech dictionary creation unit to detect, by spotting, speech data corresponding to the speech dictionary from the speech data in the speech database.

[0007] The invention according to claim 2 is the speech retrieval device according to claim 1, further comprising keyword storage means for storing keyword candidates, wherein the second speech recognition unit recognizes a keyword candidate from the speech uttered by the user and stores the keyword candidate in the keyword storage means, and the speech dictionary creation unit spots the keyword candidates stored in the keyword storage means and creates the speech dictionary for the first speech recognition unit from the spotted keyword.

[0008] The invention according to claim 3 is the speech retrieval device according to claim 2, wherein the keyword candidates stored in the keyword storage means are described in a grammar having a hierarchical structure of at least two levels, a major category and a minor category, and the speech dictionary creation unit creates the speech dictionary for the first speech recognition unit from the recognition result of a keyword described in a minor-category item.

[0009]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 shows a configuration example of the speech retrieval device according to the present invention. Referring to FIG. 1, the device has a speech database 1 in which speech data to be searched is stored; a first speech recognition unit 2 for detecting predetermined speech data from the speech data in the speech database 1 by spotting; a second speech recognition unit 3 for recognizing a keyword from speech uttered by a user; and a speech dictionary creation unit 4 that creates a speech dictionary for the first speech recognition unit 2 from the keyword recognized by the second speech recognition unit 3. The first speech recognition unit 2 uses the speech dictionary created by the speech dictionary creation unit 4 to detect, by spotting, speech data corresponding to the speech dictionary from the speech data in the speech database.

[0010] FIG. 2 shows a specific example of the second speech recognition unit 3. In the example of FIG. 2, the second speech recognition unit 3 performs speech recognition using a dictionary for the second speech recognition unit. Standard speech patterns of the word or words of interest to the user, for example "いぬ" (inu, dog), "ねこ" (neko, cat), "にほん" (nihon, Japan), and so on, are registered in this dictionary in advance. When the user utters "いぬ" and this speech is input to the second speech recognition unit 3, the unit extracts the feature pattern of the speech "いぬ", matches it against the standard speech patterns in the dictionary, and outputs the entry with the highest similarity as the recognition result.
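
The matching step described above can be sketched as a nearest-pattern lookup. This is only an illustration under stated assumptions: the source does not specify the feature representation or the similarity measure, so the three-element feature vectors, the cosine similarity, and the names `STANDARD_PATTERNS` and `recognize` are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical standard patterns: one toy feature vector per registered word.
STANDARD_PATTERNS = {
    "いぬ": [0.9, 0.1, 0.2],
    "ねこ": [0.1, 0.8, 0.3],
    "にほん": [0.2, 0.3, 0.9],
}

def recognize(feature_pattern, patterns=STANDARD_PATTERNS):
    """Return (word, similarity) for the registered pattern most similar to the input."""
    return max(
        ((word, cosine_similarity(feature_pattern, ref)) for word, ref in patterns.items()),
        key=lambda pair: pair[1],
    )

word, score = recognize([0.85, 0.15, 0.25])
print(word)  # いぬ
```

A real recognizer would compare sequences of feature vectors rather than a single vector per word; the single-vector form is used here only to keep the lookup visible.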

[0011] In the above example the dictionary is built in units of words, but it may instead be built in units of phonemes. It may be a speaker-dependent dictionary or a speaker-independent one.

[0012] FIG. 3 shows a configuration example of the speech dictionary creation unit 4. In this example, the speech dictionary creation unit 4 has: a kana-phoneme conversion table 11; a kana-phoneme conversion unit 12 that, when a character string (kana) that is the recognition result of the second speech recognition unit 3 is input, obtains the phoneme sequence corresponding to the character string from the kana-phoneme conversion table 11; a phoneme dictionary 13; and a speech dictionary conversion unit 14 that, when given a phoneme sequence by the kana-phoneme conversion unit 12, refers to the phoneme dictionary 13, converts the phoneme sequence into a parameter sequence, and uses this parameter sequence as a speech dictionary 15.

[0013] FIG. 4 shows an example of the kana-phoneme conversion table 11, in which the phonemes /a/, /i/, ... corresponding to the kana "あ" (a), "い" (i), ... are registered in advance. When, for example, "いぬ" (inu) is input as a kana character string, the kana-phoneme conversion unit 12 uses the kana-phoneme conversion table 11 of FIG. 4 to convert it into the phoneme sequence /i/, /n/, /u/.
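
The table lookup performed by the kana-phoneme conversion unit 12 can be sketched as follows. The table below is a small illustrative subset, not the contents of FIG. 4; the phoneme symbols (including "N" for "ん") and the names `KANA_TO_PHONEMES` and `kana_to_phonemes` are assumptions made for this example.

```python
# Illustrative subset of a kana-to-phoneme table (hypothetical entries and symbols).
KANA_TO_PHONEMES = {
    "あ": ["a"], "い": ["i"], "う": ["u"], "え": ["e"], "お": ["o"],
    "ぬ": ["n", "u"], "ね": ["n", "e"], "こ": ["k", "o"],
    "に": ["n", "i"], "ほ": ["h", "o"], "ん": ["N"],
}

def kana_to_phonemes(kana, table=KANA_TO_PHONEMES):
    """Convert a kana string into a flat phoneme sequence by table lookup."""
    phonemes = []
    for ch in kana:
        if ch not in table:
            raise KeyError(f"kana not in conversion table: {ch!r}")
        phonemes.extend(table[ch])
    return phonemes

print(kana_to_phonemes("いぬ"))  # ['i', 'n', 'u']
```

A per-character lookup does not cover contracted sounds written with two kana (such as "きゃ"); a fuller table would need longest-match lookup, which is left out of this sketch.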

[0014] In FIG. 3, speaker-independent speech parameters are registered in advance in the phoneme dictionary 13. That is, for each phoneme, feature vectors (usually several) and a duration parameter are registered in the phoneme dictionary 13. When the kana-phoneme conversion unit 12 outputs a phoneme sequence, for example /i/, /n/, /u/, the speech dictionary conversion unit 14 creates a speech dictionary by arranging the parameters in the phoneme dictionary 13 for each phoneme in the order of the phoneme sequence (/i/, /n/, /u/). In this example the phoneme sequence (/i/, /n/, /u/) is that of the word "いぬ" (inu), so the speech dictionary created in this way is a word dictionary.
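
A minimal sketch of how the speech dictionary conversion unit 14 might arrange phoneme parameters into a word dictionary is shown below. The source states only that each phoneme has feature vectors and a duration parameter; the `PhonemeEntry` type, the numeric values, and the assumption that duration is counted in frames are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhonemeEntry:
    feature_vectors: tuple  # one or more feature vectors for the phoneme
    duration: float         # duration parameter (unit assumed here: frames)

# Hypothetical speaker-independent phoneme dictionary with toy values.
PHONEME_DICTIONARY = {
    "i": PhonemeEntry(((0.2, 0.9), (0.3, 0.8)), 8.0),
    "n": PhonemeEntry(((0.6, 0.1),), 5.0),
    "u": PhonemeEntry(((0.4, 0.4), (0.5, 0.3)), 7.0),
}

def build_speech_dictionary(phonemes, phoneme_dict=PHONEME_DICTIONARY):
    """Arrange each phoneme's parameters in the order of the phoneme sequence."""
    return [(p, phoneme_dict[p]) for p in phonemes]

word_dictionary = build_speech_dictionary(["i", "n", "u"])
total_duration = sum(entry.duration for _, entry in word_dictionary)
print([p for p, _ in word_dictionary], total_duration)  # ['i', 'n', 'u'] 20.0
```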

[0015] FIG. 5 shows a configuration example of a speech dictionary created by the speech dictionary creation unit 4. The example of FIG. 5 is the speech dictionary obtained when a single recognition result, for example "にほん" (nihon), is input to the speech dictionary creation unit 4 unchanged as a character string (that is, input as "にほん").

[0016] Next, the processing operation of the speech retrieval device of FIG. 1 is described. When the user utters a keyword and the user's speech (keyword) is input from a speech input unit (not shown) such as a microphone, the input speech (keyword) is converted into the feature parameters required for speech recognition, and speech recognition is performed by the second speech recognition unit 3. Any technique may be used for the feature parameters and matching method of the speech recognition in the second speech recognition unit 3. For example, the techniques described in "Speech Recognition by Probabilistic Models" (Seiichi Nakagawa, IEICE) may be used.

[0017] Because the second speech recognition unit 3 only needs to recognize the user's speech, a device that recognizes the user's speech most accurately or most quickly can be used as the second speech recognition unit 3. For example, a speaker-dependent speech recognition device that uses phoneme models trained only on the user's speech data can be used. Alternatively, a speaker-independent recognition device with a speaker adaptation function can be selected and used after it has been sufficiently adapted to the user's voice.

[0018] The keyword recognized by the second speech recognition unit 3 (usually expressed in kana characters) is converted by the speech dictionary creation unit 4 into a speech dictionary for the first speech recognition unit 2. To create this speech dictionary, the approach commonly used in speech recognition methods that are not word-unit methods, for example continuous speech recognition or phoneme-based large-vocabulary speech recognition, can be used. As described above, such non-word-unit methods have a phoneme dictionary and perform recognition by automatically building, from the spelling, a speech dictionary (word dictionary), that is, a dictionary in which phonemes are arranged in order. In this case, a word dictionary need not be created explicitly; the order of the phonemes may simply be specified instead.

[0019] In this approach, as in the example above, the speech dictionary /i/, /n/, /u/ is created from the character string "いぬ" (inu). Each of the phonemes /i/, /n/, /u/ carries an acoustic dictionary suited to recognition processing in the first speech recognition unit 2 (a speaker-independent dictionary that is not fitted to the user's voice). The first speech recognition unit 2 can therefore search the speech database 1 using this phoneme sequence /i/, /n/, /u/ (the word dictionary) as its dictionary.

[0020] That is, the speech dictionary created as described above is given to the first speech recognition unit 2, which matches the speech dictionary from the speech dictionary creation unit 4 against the speech data stored in the speech database 1 and detects speech data whose matching score is higher than a predetermined threshold. The threshold may need to be changed depending on the quality of the speech database (the quality of the stored speech data, such as whether it is clear) and on the required detection accuracy, so it is desirable that the threshold be variable. If the threshold is set low, positions that do not match the keyword are also detected; conversely, if the threshold is high, a position may go undetected even though speech matching the keyword is present. The threshold therefore needs to be set to an appropriate value.
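
The threshold trade-off described above can be illustrated with a toy sliding-window spotter. This is a sketch under simplifying assumptions, not the matching method of the first speech recognition unit 2: frames are single numbers, the score is a hypothetical similarity derived from the mean absolute difference, and no time warping is performed.

```python
def match_score(window, template):
    """Similarity in (0, 1]: 1.0 for an exact match, lower as frames diverge."""
    mean_abs_diff = sum(abs(w - t) for w, t in zip(window, template)) / len(template)
    return 1.0 / (1.0 + mean_abs_diff)

def spot(frames, template, threshold):
    """Return (start_index, score) for every window scoring above the threshold."""
    hits = []
    for start in range(len(frames) - len(template) + 1):
        score = match_score(frames[start:start + len(template)], template)
        if score > threshold:
            hits.append((start, score))
    return hits

# Toy data: a clean occurrence at index 2 and a noisier one at index 6.
frames = [0.0, 0.1, 1.0, 2.0, 3.0, 0.2, 1.1, 2.4, 3.3, 0.0]
template = [1.0, 2.0, 3.0]

strict = spot(frames, template, threshold=0.9)  # misses the noisy occurrence
medium = spot(frames, template, threshold=0.7)  # finds both occurrences
loose = spot(frames, template, threshold=0.5)   # also reports non-matching positions

print([s for s, _ in strict], [s for s, _ in medium], [s for s, _ in loose])
# [2] [2, 6] [1, 2, 5, 6]
```

With this toy data, the high threshold drops a genuine but noisy occurrence, while the low threshold adds positions that only partly overlap the keyword, which is the behaviour the paragraph warns about.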

[0021] Thus, in the speech retrieval device of FIG. 1, the speech data uttered by the user is not matched directly against the speech data in the speech database (the user's uttered speech is not used directly for the search), so accurate recognition is possible. That is, as the second speech recognition unit 3, which recognizes the user's speech, it is possible to use a speaker-dependent speech recognition device in which the user's speech is registered, or a speaker-independent speech recognition device that has been sufficiently adapted to the user's voice. Because the user's speech is recognized by such a second speech recognition unit 3 and the speech dictionary for the first speech recognition unit 2 is then created from the recognition result, a speech recognition device best suited to the speech database 1 can likewise be used as the first speech recognition unit 2 when recognizing (spotting) the speech database 1. In other words, the best recognition device can be selected both for recognizing the user's speech and for recognizing the speech database 1, which makes accurate retrieval of speech data from the speech database 1 possible.
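
The decoupling described in this paragraph can be sketched as a pipeline in which the query passes through text before it reaches the database. The function `retrieve` and the three stub components are hypothetical stand-ins chosen for illustration; the point is only that each stage can be swapped independently.

```python
from typing import Callable, List, Sequence

def retrieve(
    user_audio: Sequence[float],
    recognize_user: Callable[[Sequence[float]], str],
    make_dictionary: Callable[[str], List[str]],
    spot_database: Callable[[List[str]], List[int]],
) -> List[int]:
    """Run the query through text so user audio never meets database audio."""
    keyword = recognize_user(user_audio)           # role of second speech recognition unit 3
    speech_dictionary = make_dictionary(keyword)   # role of speech dictionary creation unit 4
    return spot_database(speech_dictionary)        # role of first speech recognition unit 2

# Stub components standing in for real recognizers (hypothetical behaviour).
def stub_user_recognizer(audio):
    return "いぬ"

def stub_dictionary(keyword):
    return {"いぬ": ["i", "n", "u"]}[keyword]

def stub_database_spotter(phonemes):
    # The "database" is a toy phoneme stream instead of audio.
    database = ["k", "o", "i", "n", "u", "g", "a", "i", "n", "u"]
    n = len(phonemes)
    return [i for i in range(len(database) - n + 1) if database[i:i + n] == phonemes]

positions = retrieve([0.0], stub_user_recognizer, stub_dictionary, stub_database_spotter)
print(positions)  # [2, 7]
```

Because the only value passed between the user-side and database-side stages is the keyword, a speaker-dependent recognizer can serve the user while a recognizer tuned to the database's speakers serves the search.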

[0022] FIG. 6 shows a modification of the speech retrieval device according to the present invention. Referring to FIG. 6, this device is the speech retrieval device of FIG. 1 further provided with keyword storage means 5 for storing keyword candidates. In this case, the second speech recognition unit 3 recognizes a keyword candidate from the speech uttered by the user and stores the keyword candidate in the keyword storage means 5, and the speech dictionary creation unit 4 spots the keyword candidates stored in the keyword storage means 5 and creates the speech dictionary for the first speech recognition unit 2 from the spotted keyword.

[0023] FIG. 7 shows an example of the keyword storage means 5. In the example of FIG. 7, the keyword candidates are described in a grammar having a hierarchical structure of at least two levels, a major category and a minor category. In this case, the speech dictionary creation unit 4 creates the speech dictionary for the first speech recognition unit 2 from the recognition result of a keyword described in a minor-category item.

[0024] For example, when searching for animal names, if there are many of them and the second speech recognition unit 3 tries to recognize them all at once, the large number of animal types (the large number of words to be recognized) may degrade recognition performance and prevent efficient retrieval. In such a case, describing the keyword candidates in a grammar with a major-category/minor-category structure, as shown in FIG. 7, allows the second speech recognition unit 3 to recognize in stages, so recognition becomes accurate and fast. By using a grammar with a hierarchical structure such as "country name - city name" or "occupation name - person's name" in this way, the user's speech can be spotted with higher recognition accuracy than when all the minor-category keywords needed for the search are recognized at once. Specifically, when searching for a dog named "ポチ" (Pochi), for example, a grammar with the major category "犬" (dog) and the minor category "ポチ" is prepared; when the user utters "犬のポチです" ("It is Pochi the dog"), "犬" in the major category is recognized first and "ポチ" in the minor category next, hierarchically.

[0025] Next, the processing operation of the speech retrieval device of FIG. 6 is described. Suppose the speech uttered by the user (the input speech) is, for example, "鳥の孔雀です" ("It is a peacock, a bird"). The second speech recognition unit 3 first attempts to spot the major-category items of the classification table of FIG. 7 over the entire input speech. Having obtained the recognition result "鳥" (bird), it then spots the minor-category names contained in the major category "鳥", namely "すずめ" (sparrow), "つばめ" (swallow), "孔雀" (peacock), and so on, over the part of the input speech excluding the segment in which "鳥" was spotted, and can thereby recognize "孔雀". Recognizing "孔雀" in two stages in this way (by performing spotting) reduces the vocabulary per recognition pass and allows accurate, fast recognition.
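
The two-stage procedure above can be sketched on text as a stand-in for acoustic spotting. Everything here is an assumption made for illustration: the utterance is represented as a kana string rather than audio, spotting is reduced to substring search, and the grammar contents and function names are hypothetical.

```python
# Hypothetical two-level grammar: major category -> minor-category keywords (in kana).
GRAMMAR = {
    "とり": ["すずめ", "つばめ", "くじゃく"],
    "いぬ": ["ぽち"],
}

def spot_first(text, candidates):
    """Return (candidate, start) for the earliest candidate found in text, or None."""
    found = [(text.find(c), c) for c in candidates if c in text]
    if not found:
        return None
    start, candidate = min(found)
    return candidate, start

def two_stage_recognize(utterance, grammar=GRAMMAR):
    """Spot a major category, then spot its minor keywords in the remaining text."""
    major_hit = spot_first(utterance, grammar.keys())
    if major_hit is None:
        return None
    major, start = major_hit
    # Exclude the segment where the major category was spotted.
    remainder = utterance[:start] + " " + utterance[start + len(major):]
    minor_hit = spot_first(remainder, grammar[major])
    return (major, minor_hit[0]) if minor_hit else (major, None)

print(two_stage_recognize("とりのくじゃくです"))  # ('とり', 'くじゃく')
```

Each pass compares the input against only one level of the grammar (two major categories, then three minor keywords here), which is the vocabulary reduction the paragraph describes; the filler syllables "の" and "です" are simply ignored.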

[0026] The speech dictionary creation unit 4 creates a speech dictionary for "孔雀" (peacock), the result recognized by the second speech recognition unit 3 in this way, and the first speech recognition unit 2 performs spotting on the speech database 1 using this speech dictionary, so that the portions containing the speech "孔雀" can be retrieved.

[0027] Thus, in addition to the advantage of the speech retrieval device of FIG. 1 (namely, that accurate recognition is possible because the speech data uttered by the user is not matched directly against the speech data in the speech database), the speech retrieval device of FIG. 6 stores keyword candidates in the keyword storage means 5 and can recognize keywords by spotting, so the user need not worry about adding extraneous words and can speak freely.

[0028] Furthermore, in the speech retrieval device of FIG. 6, as shown in FIG. 7, the keyword candidates stored in the keyword storage means 5 are described in a grammar having a hierarchical structure of at least two levels, a major category and a minor category, and the speech dictionary creation unit 4 creates the speech dictionary for the first speech recognition unit 2 from the recognition result of a keyword described in a minor-category item. Recognizing the user's speech in multiple stages reduces the vocabulary per recognition pass and allows accurate, fast recognition.

[0029] In the above example (the example of FIG. 7), the keywords were described as classified in two levels, but a classification with more levels may be used.

[0030] In each of the above examples, the speech dictionary was described as being created, as in FIG. 5, by inputting the single recognition result as a character string when the second speech recognition unit 3 recognizes, for example, "にほん" (nihon). However, the speech dictionary may instead be created as in FIG. 8, for example: when the second speech recognition unit 3 recognizes "にほん", this recognition result is input as a first character string and the similar "にっぽん" (nippon) is input as a second character string (that is, two character strings are input).

[0031] That is, in the example of FIG. 8, when the user utters, for example, "にほん" and the second speech recognition unit 3 recognizes this input speech as "にほん" (the keyword "にほん" is obtained), the speech dictionary creation unit 4 creates "にほん" and "にっぽん" as the speech dictionary corresponding to the keyword "にほん". The first speech recognition unit 2 can then search (word-spot) the speech database 1 using the speech dictionary for "にほん" and "にっぽん". This enables a more accurate search.
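
The idea of searching with several readings of one keyword can be sketched as below. The source does not say how the similar string "にっぽん" is obtained, so the lookup table `READING_VARIANTS` is a hypothetical mechanism, and the database is again represented as a kana transcript rather than audio.

```python
# Hypothetical table of alternative readings for a recognized keyword.
READING_VARIANTS = {
    "にほん": ["にっぽん"],
}

def expand_keyword(keyword, variants=READING_VARIANTS):
    """Return the recognized keyword followed by any registered similar readings."""
    return [keyword] + list(variants.get(keyword, []))

def spot_any(transcript, keywords):
    """Return sorted (position, keyword) hits for every keyword found in transcript."""
    hits = []
    for kw in keywords:
        pos = transcript.find(kw)
        while pos != -1:
            hits.append((pos, kw))
            pos = transcript.find(kw, pos + 1)
    return sorted(hits)

keywords = expand_keyword("にほん")
print(spot_any("にっぽんのにゅーすとにほんのてんき", keywords))
# [(0, 'にっぽん'), (10, 'にほん')]
```

Searching with "にほん" alone would miss the first occurrence in this toy transcript, which illustrates why adding the second character string can improve retrieval.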

[0032] FIG. 9 shows a hardware configuration example of the speech retrieval device of FIG. 1 or FIG. 6. Referring to FIG. 9, the device is implemented on, for example, a personal computer and has a CPU 21 that controls the whole device, a ROM 22 in which the control program of the CPU 21 and the like are stored, a RAM 23 used as a work area of the CPU 21 and the like, a speech input device 24 for inputting the user's speech, a data storage device 25 in which speech data and the like are stored, and a result output device 26 (for example, a display or a printer) that outputs information on the result of the speech retrieval.

[0033] Here, the CPU 21 has the functions of the first and second speech recognition units 2 and 3 and the speech dictionary creation unit 4 of FIGS. 1 and 6.

[0034] These functions of the CPU 21 as the first and second speech recognition units 2 and 3, the speech dictionary creation unit 4, and so on can be provided, for example, in the form of a software package (specifically, an information recording medium such as a CD-ROM). For this reason, the example of FIG. 9 is provided with a medium drive device 31 that drives an information recording medium 30 when the medium is loaded.

[0035] In other words, the speech retrieval device of the present invention can also be implemented in a configuration in which a general-purpose computer system equipped with a speech input device, a display, and the like reads a program recorded on an information recording medium such as a CD-ROM, and the microprocessor of the general-purpose computer system executes the speech retrieval processing. In this case, the program for executing the speech retrieval processing of the present invention (that is, the program used by the hardware system) is provided recorded on a medium. The information recording medium on which the program and the like are recorded is not limited to a CD-ROM; a ROM, a RAM, a flexible disk, a memory card, or the like may be used. The program recorded on the medium is installed in a storage device built into the hardware system, for example a hard disk device, and can then be executed to realize the speech retrieval function of the present invention.

【００３６】[0036]

[Effects of the Invention] As described above, according to the inventions of claims 1 to 3, the voice search device has a voice database storing the voice data to be searched, a first speech recognition unit for detecting predetermined voice data by spotting from among the voice data in the voice database, a second speech recognition unit for recognizing a keyword from the voice uttered by the user, and a speech dictionary creation unit that creates a speech dictionary for the first speech recognition unit from the keyword recognized by the second speech recognition unit. The first speech recognition unit uses the speech dictionary created by the speech dictionary creation unit to detect, by spotting, the voice data corresponding to that speech dictionary from among the voice data in the voice database. Because the most suitable recognizer can be selected both for recognizing the user's voice and for recognizing the voice database, voice data can be retrieved accurately from the voice database.
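The data flow summarized above can be illustrated with a minimal sketch. It is not the patented implementation: the keyword is assumed to have already been recognized by the second speech recognition unit and is passed in as a kana string, the stored voice data are stood in for by phoneme lists rather than acoustic signals, and the names `KANA_TO_PHONEMES`, `Recording`, `create_speech_dictionary`, and `spot` are hypothetical. The kana-to-phoneme table only loosely mirrors the kind of table shown in FIG. 4.

```python
from dataclasses import dataclass

# Hypothetical kana-to-phoneme table; a real table would cover every kana.
KANA_TO_PHONEMES = {"か": ["k", "a"], "い": ["i"], "ぎ": ["g", "i"]}


@dataclass
class Recording:
    name: str
    phonemes: list  # stand-in for acoustic data in the voice database


def create_speech_dictionary(keyword_kana):
    """Speech dictionary creation unit: keyword (kana) -> phoneme sequence."""
    phonemes = []
    for kana in keyword_kana:
        phonemes.extend(KANA_TO_PHONEMES[kana])
    return phonemes


def spot(recordings, dictionary):
    """First speech recognition unit: names of recordings that contain
    the dictionary's phoneme sequence as a contiguous run."""
    n = len(dictionary)
    hits = []
    for rec in recordings:
        last_start = len(rec.phonemes) - n
        if any(rec.phonemes[i:i + n] == dictionary for i in range(last_start + 1)):
            hits.append(rec.name)
    return hits


database = [
    Recording("msg1", ["a", "s", "u", "k", "a", "i", "g", "i", "d", "e", "s", "u"]),
    Recording("msg2", ["k", "o", "N", "n", "i", "ch", "i", "w", "a"]),
]
dictionary = create_speech_dictionary("かいぎ")
matches = spot(database, dictionary)
```

Exact sequence matching is used here only to keep the sketch short; an actual spotting recognizer would score acoustic similarity rather than compare symbols.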

[0037] In particular, according to the invention of claim 2, keyword storage means for storing keyword candidates is further provided. The second speech recognition unit recognizes keyword candidates from the voice uttered by the user and stores them in the keyword storage means, and the speech dictionary creation unit spots the keyword candidates stored in the keyword storage means and creates a speech dictionary for the first speech recognition unit from the spotted keywords. The user therefore does not need to worry about adding unnecessary words and can speak freely.
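As a rough illustration of why unnecessary words stop mattering, the following sketch stores every recognized candidate and then keeps only those that match a keyword vocabulary. The vocabulary `KNOWN_KEYWORDS`, the class `KeywordStore`, and the function `spot_keywords` are hypothetical, and matching text tokens against a set is an assumed simplification of the spotting step.

```python
# Hypothetical keyword vocabulary used when spotting among stored candidates.
KNOWN_KEYWORDS = {"かいぎ", "しゅっちょう", "たなか"}


class KeywordStore:
    """Stand-in for the keyword storage means: holds recognized candidates."""

    def __init__(self):
        self._candidates = []

    def store(self, candidates):
        self._candidates.extend(candidates)

    def candidates(self):
        return list(self._candidates)


def spot_keywords(store, vocabulary=KNOWN_KEYWORDS):
    """Keep, in spoken order, only the candidates found in the vocabulary."""
    return [word for word in store.candidates() if word in vocabulary]


store = KeywordStore()
# Candidates as the second recognition unit might output them, fillers included.
store.store(["えーと", "たなか", "さん", "の", "かいぎ"])
keywords = spot_keywords(store)
```

Only the spotted keywords would then be passed on to speech dictionary creation, so fillers such as "えーと" never reach the first speech recognition unit.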

[0038] Further, according to the invention of claim 3, the keyword candidates stored in the keyword storage means are described in a grammar having a hierarchical structure of at least two levels, a large classification and a small classification, and the speech dictionary creation unit creates the speech dictionary for the first speech recognition unit from the recognition result of the keywords described in the small-classification items. The number of target words per recognition pass can therefore be reduced, enabling accurate, high-speed recognition.
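The vocabulary reduction can be made concrete with a small sketch. It assumes one possible reading of the hierarchy, in which each recognition pass only considers the entries at the current level; the grammar contents and the names `GRAMMAR`, `active_vocabulary`, and `total_keywords` are hypothetical and are not taken from FIG. 7.

```python
# Hypothetical two-level grammar: large classification -> small classification -> keywords.
GRAMMAR = {
    "person": {"family": ["はは", "ちち"], "colleague": ["たなか", "すずき"]},
    "topic": {"work": ["かいぎ", "しゅっちょう"], "private": ["りょこう"]},
}


def active_vocabulary(grammar, large=None, small=None):
    """Words a single recognition pass has to distinguish at the given level."""
    if large is None:
        return sorted(grammar)            # choose among large classifications
    if small is None:
        return sorted(grammar[large])     # choose among small classifications
    return list(grammar[large][small])    # choose among keywords of one item


def total_keywords(grammar):
    """Vocabulary size if all keywords were recognized in one flat pass."""
    return sum(len(words) for smalls in grammar.values() for words in smalls.values())


flat_size = total_keywords(GRAMMAR)
work_words = active_vocabulary(GRAMMAR, "topic", "work")
```

In this toy grammar a flat pass would have to distinguish seven keywords, while each hierarchical pass distinguishes at most two entries.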

[Brief description of the drawings]

FIG. 1 is a diagram showing a configuration example of a voice search device according to the present invention.

FIG. 2 is a diagram showing a specific example of the second speech recognition unit.

FIG. 3 is a diagram showing a configuration example of the speech dictionary creation unit.

FIG. 4 is a diagram showing an example of a kana-phoneme conversion table.

FIG. 5 is a diagram showing an example of a speech dictionary created by the speech dictionary creation unit.

FIG. 6 is a diagram showing a configuration example of a voice search device according to the present invention.

FIG. 7 is a diagram showing an example of the keyword storage means.

FIG. 8 is a diagram showing another example of a speech dictionary created by the speech dictionary creation unit.

FIG. 9 is a diagram showing an example of the hardware configuration of the voice search device of FIG. 1 or FIG. 6.

[Explanation of symbols]

1 Voice database
2 First speech recognition unit
3 Second speech recognition unit
4 Speech dictionary creation unit
5 Keyword storage means
21 CPU
22 ROM
23 RAM
24 Voice input device
25 Data storage device
26 Result output device
30 Information storage medium
31 Medium drive device

Claims

1. A voice search device comprising: a voice database in which voice data to be searched is stored; a first speech recognition unit for detecting predetermined voice data by spotting from among the voice data in the voice database; a second speech recognition unit for recognizing a keyword from voice uttered by a user; and a speech dictionary creation unit for creating a speech dictionary for the first speech recognition unit from the keyword recognized by the second speech recognition unit, wherein the first speech recognition unit uses the speech dictionary created by the speech dictionary creation unit to detect, by spotting, voice data corresponding to the speech dictionary from among the voice data in the voice database.

2. The voice search device according to claim 1, wherein keyword storage means for storing keyword candidates is further provided, the second speech recognition unit recognizes keyword candidates from the voice uttered by the user and stores the keyword candidates in the keyword storage means, and the speech dictionary creation unit spots the keyword candidates stored in the keyword storage means and creates the speech dictionary for the first speech recognition unit from the spotted keywords.

3. The voice search device according to claim 2, wherein the keyword candidates stored in the keyword storage means are described in a grammar having a hierarchical structure of at least two levels, a large classification and a small classification, and the speech dictionary creation unit creates the speech dictionary for the first speech recognition unit from the recognition result of the keywords described in the small-classification items.