JP2006209077A - Voice interactive device and method

Info

Publication number: JP2006209077A
Application number: JP2005260406A
Authority: JP
Inventors: Kengo Suzuki (鈴木 堅悟); Hiroshi Saito (斎藤 浩)
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2004-12-28
Filing date: 2005-09-08
Publication date: 2006-08-10

Abstract

PROBLEM TO BE SOLVED: To play shiritori (a word-chain game) with a speaker by speech-recognizing a spoken word and outputting a word that starts with the same character as the ending character of the recognized word. SOLUTION: A controller 106 speech-recognizes the beginning and the ending of a word input via a microphone 101 and, based on the recognized beginning and ending, determines whether the input word was spoken in accordance with the shiritori rules. When it determines that the word follows the rules, it extracts a response word that starts with the same character as the ending of the recognized word and outputs it as speech via a loudspeaker 105. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a voice interaction apparatus and method that speech-recognizes a word uttered by a speaker and plays shiritori (a Japanese word-chain game) with the speaker.

The following shiritori game machine is known from Patent Document 1. This machine speech-recognizes a word uttered by the speaker by matching it against words stored in a word storage means, and then extracts the beginning and ending characters of the recognized word. It then reads from the word storage means a word that starts with the recognized ending character and does not end with "ん" ("n"), and outputs it as speech.

Patent Document 1: JP-A-6-269534 (Japanese Unexamined Patent Application Publication No. H6-269534)

However, because the conventional shiritori game machine extracts the beginning and ending of a word only after speech-recognizing the word uttered by the speaker, it must first recognize the entire word. As the number of words stored in the word storage means grows, the time required for speech recognition increases and the misrecognition rate rises.

The present invention is a voice interaction apparatus and method for playing shiritori with a speaker. It extracts and speech-recognizes the beginning and ending of a word input via a voice input means; determines, based on the recognized beginning and ending, whether the input word was uttered in accordance with the shiritori rules; and, when it determines that it was, extracts a response word that starts with the same character as the ending of the recognized word and outputs it as speech.

According to the present invention, the beginning and ending of the input word are speech-recognized. Because rule determination in shiritori uses only the beginning and ending of a word, recognizing only these key parts improves processing speed and reduces the misrecognition rate.

-First embodiment-
FIG. 1 is a block diagram showing the configuration of one embodiment of the voice interaction apparatus according to the first embodiment. The voice interaction apparatus 100 includes: a microphone 101 for inputting a word uttered by the speaker; a voice input operation switch 102 for instructing the start of voice input, that is, the start of shiritori; a recognition syllable dictionary 103 that stores all syllables of the Japanese syllabary (the "50 sounds"), from "あ" ("a") to "ん" ("n"), as standby syllables for recognizing the beginning and ending of a word input via the microphone 101; a response word database 104 that stores response words for shiritori; a loudspeaker 105 for outputting response words to the speaker as speech; a control device 106 that executes various processes such as recognizing the beginning and ending of the input word and deciding the response word; and a history memory 107 for storing the past shiritori history, that is, the history of the voice data of the words uttered by the speaker and of the response words with which the voice interaction apparatus 100 replied.

When voice data of a word is input via the microphone 101, the control device 106 first extracts the beginning and ending of the input word. That is, as shown in FIG. 2, it extracts from the input voice data a portion 2a corresponding to the beginning of the word and a portion 2b corresponding to the ending. It then matches the extracted beginning and ending voice data against the standby syllables stored in the recognition syllable dictionary 103 and calculates, for each standby syllable, the recognition likelihood (that is, the degree of certainty) of the beginning and of the ending. For each of the beginning and the ending, the standby syllable with the highest recognition likelihood is extracted and taken as the recognition result candidate.
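The candidate selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `best_candidate` and the score tables are hypothetical, and the acoustic matching that would actually produce the likelihoods is not modeled.

```python
from typing import Dict, Tuple


def best_candidate(likelihoods: Dict[str, float]) -> Tuple[str, float]:
    """Return the standby syllable with the highest recognition likelihood."""
    syllable = max(likelihoods, key=likelihoods.get)
    return syllable, likelihoods[syllable]


# Hypothetical likelihoods (0 to 1) for the extracted beginning and ending
# segments; a real system would score every standby syllable in the dictionary.
head_scores = {"り": 0.7, "い": 0.4, "に": 0.2}
tail_scores = {"ご": 0.8, "こ": 0.5, "お": 0.3}

head = best_candidate(head_scores)  # candidate for the beginning of the word
tail = best_candidate(tail_scores)  # candidate for the ending of the word
```

Only the two short segments are scored against the syllable set, so the amount of matching work in this sketch does not depend on how many whole words the game knows.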

In the first embodiment, the recognition likelihood is expressed as a value from 0 to 1, for example: 0 is calculated when the extracted beginning or ending does not match the standby syllable at all, and 1 when it matches completely. In other words, the higher the degree of match between the beginning or ending and the standby syllable, the larger the recognition likelihood. For example, when the speaker utters "りんご" ("ringo", apple) and it is input as voice, the recognition result candidate for the beginning is "り" ("ri") and the candidate for the ending is "ご" ("go"), as shown in FIG. 3. Suppose that the recognition likelihoods calculated at that time are (A), for example 0.7, and (B), for example 0.8, respectively. The content of the response to the speaker is then varied according to the magnitudes of the calculated beginning and ending recognition likelihoods.

That is, it is determined whether the recognition likelihood of the beginning and the recognition likelihood of the ending are each greater than a predetermined value set in advance. The predetermined value is chosen so that, when a recognition likelihood exceeds it, the match between the beginning or ending and the standby syllable is high enough for the candidate to be adopted as the speech recognition result. The results of comparing the beginning and ending recognition likelihoods with this predetermined value are classified into the four patterns, Pattern 1 to Pattern 4, shown in FIG. 4, and the content of the response to the speaker is varied for each pattern as described in (1) to (4) below.

(1) Pattern 1: the recognition likelihoods of both the beginning and the ending are greater than the predetermined value
In this case, because the recognition likelihood of the ending is greater than the predetermined value, the ending of the word uttered by the speaker can be identified, so a response word starting with the same character as this ending can be output. However, it is still necessary to determine whether the word was uttered in accordance with the general rules of shiritori, so processing proceeds as follows. The general shiritori rules checked are: whether the ending of the uttered word is not "ん" ("n"); whether the uttered word begins with the same character as the ending of the immediately preceding response word; and whether a word that has already appeared is being repeated.

To this end, it is first determined whether the ending identified by speech recognition is not "ん". It is also determined whether the recognition result candidate for the beginning matches the ending of the response word that the voice interaction apparatus 100 output immediately before, which is stored in the history memory 107. If, based on the result of the rule determination up to this point (the first rule determination), it is determined that the uttered word does not follow the shiritori rules, a response message prompting the speaker to utter a word again, such as "Please speak again", is output via the loudspeaker 105.
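A minimal sketch of the first rule determination is shown below. The function and parameter names are hypothetical; the two checks themselves (the ending is not "ん", and the beginning equals the ending of the previous response word) are the ones described in the text.

```python
def passes_first_rule(head: str, tail: str, previous_response_tail: str) -> bool:
    """First rule determination on the recognized beginning (head) and ending (tail)."""
    # A word ending in "ん" is not allowed in shiritori.
    if tail == "ん":
        return False
    # The word must begin with the character that ended the previous response word.
    return head == previous_response_tail
```

For instance, if the previous response word was "パパイヤ" (ending "や"), an utterance recognized with beginning "や" and ending "ね" passes, while one with beginning "り" does not.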

If, on the other hand, the first rule determination finds that the uttered word follows the shiritori rules, a second rule determination is executed to check whether a word that has already appeared is being repeated. Because the first embodiment extracts and speech-recognizes only the beginning and ending of the uttered word, as described above, the second rule determination also uses only the beginning and ending of the word. Specifically, it is determined whether, among the beginning/ending pairs of all words that have appeared so far in the shiritori history stored in the history memory 107 (that is, the words already uttered by the speaker and the response words already output by the voice interaction apparatus 100), at least a predetermined number match the beginning/ending pair extracted from the word just uttered.

If it is determined that, among the beginning/ending pairs of all previously appearing words stored in the history memory 107, at least the predetermined number match the beginning/ending pair of the word the speaker has just uttered, the speaker is judged to have repeated a word that has already appeared. A response message prompting the speaker to utter a different word, such as "That word has already been said. Please say another word", is then output via the loudspeaker 105.

Conversely, if it is determined that fewer than the predetermined number of beginning/ending pairs of previously appearing words stored in the history memory 107 match the beginning/ending pair of the word the speaker has just uttered, the uttered word is accepted. An arbitrary response word that starts with the recognized ending and is not present in the history memory 107 is extracted from the response word database 104 and output via the loudspeaker 105. If no such response word exists in the response word database 104, the speaker wins.

With this method of detecting repeated words, a word the speaker has never uttered before may be judged to break the rules if words with the same beginning/ending combination have already appeared the predetermined number of times or more. Conversely, a word that really is a repeat may be judged to follow the rules if words with the same beginning/ending combination have appeared fewer than the predetermined number of times. In the first embodiment, however, only the beginning and ending of the uttered word are extracted and speech-recognized, so the rule determination is applied uniformly on the basis of the beginning/ending combination alone.
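The second rule determination can be sketched as a simple count over beginning/ending pairs. The names below and the example value of the predetermined number (`limit`) are assumptions for illustration; the source does not state a concrete value.

```python
from collections import Counter
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (beginning, ending) of a word


def is_repeated(pair: Pair, history: Iterable[Pair], limit: int) -> bool:
    """Treat the word as a repeat if at least `limit` past words share its pair."""
    return Counter(history)[pair] >= limit


# Hypothetical history of (beginning, ending) pairs from earlier turns.
history = [("り", "ご"), ("ご", "ら"), ("ら", "り"), ("り", "ご")]
```

With `limit=2`, the pair `("り", "ご")` is rejected in this history, even if the new word is a different word that merely shares the same beginning and ending. That is the trade-off the text acknowledges: the check is cheap because it never compares whole words.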

(2) Pattern 2: the recognition likelihood of the beginning is less than or equal to the predetermined value, and the recognition likelihood of the ending is greater than the predetermined value
In this case too, because the recognition likelihood of the ending is greater than the predetermined value, the ending of the word uttered by the speaker can be identified. The same processing as in Pattern 1 above is therefore performed.

(3) Pattern 3: the recognition likelihood of the beginning is greater than the predetermined value, and the recognition likelihood of the ending is less than or equal to the predetermined value
In this case, because the recognition likelihood of the ending is less than or equal to the predetermined value, the ending of the word uttered by the speaker cannot be identified, and so the response word to output to the speaker cannot be decided. A response message prompting the speaker to utter a word again, such as "Please speak again", is therefore output via the loudspeaker 105.

(4) Pattern 4: the recognition likelihoods of both the beginning and the ending are less than or equal to the predetermined value
In this case, neither the beginning nor the ending can be recognized properly, so the speaker may not have correctly understood the response word that the voice interaction apparatus 100 output immediately before, and may not be speaking clearly. A response message that presents the immediately preceding response word to the speaker again and prompts a new utterance is therefore output via the loudspeaker 105. For example, if the immediately preceding response word stored in the history memory 107 is "パパイヤ" ("papaya"), a response message such as "Please think of a word starting with the 'ya' of 'papaya'" is output.
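The four-way classification in (1) to (4) above can be sketched as follows. The function name and the threshold value used in the usage note are hypothetical; the source says only that a predetermined value is set in advance.

```python
def classify_pattern(head_likelihood: float, tail_likelihood: float,
                     threshold: float) -> int:
    """Return the pattern number (1 to 4) for a pair of recognition likelihoods."""
    head_ok = head_likelihood > threshold  # strictly greater, as in the text
    tail_ok = tail_likelihood > threshold
    if tail_ok:
        return 1 if head_ok else 2  # ending identified: run the rule checks
    return 3 if head_ok else 4      # ending unknown: ask the speaker again


# What the apparatus does for each pattern, paraphrased from the text.
ACTIONS = {
    1: "run rule checks, then answer",
    2: "run rule checks, then answer",
    3: "ask the speaker to say the word again",
    4: "re-present the previous response word and ask again",
}
```

With an assumed threshold of 0.6, the "りんご" example (0.7 for the beginning, 0.8 for the ending) falls into Pattern 1. A likelihood exactly equal to the threshold counts as not exceeding it.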

FIG. 5 is a flowchart showing the processing of the voice interaction apparatus 100 in the first embodiment. The processing shown in FIG. 5 is executed by the control device 106 when the voice interaction apparatus 100 is powered on. In step S10, it is determined whether the speaker has pressed the voice input operation switch 102 to instruct the start of shiritori. If it is determined that the voice input operation switch 102 has been pressed, the process proceeds to step S20. In step S20, an arbitrary response word stored in the response word database 104 is extracted and output as speech from the loudspeaker 105. The process then proceeds to step S30.

In step S30, the voice data of the output response word is stored in the history memory 107, and the process proceeds to step S40. In step S40, it is determined whether the speaker has uttered a word and it has been input via the microphone 101. If it is determined that the speaker has uttered a word, the process proceeds to step S50. In step S50, the portion 2a corresponding to the beginning and the portion 2b corresponding to the ending are extracted from the voice data of the input word, and the process proceeds to step S60. In step S60, the extracted beginning and ending voice data are matched against the standby syllables stored in the recognition syllable dictionary 103, and the recognition likelihood of the beginning and of the ending is calculated for each standby syllable. The process then proceeds to step S70.

In step S70, it is determined whether the calculated recognition likelihood of the ending is greater than the predetermined value described above. If it is, the process proceeds to step S80. In step S80, because the situation corresponds to Pattern 1 or Pattern 2 above, the first rule determination is performed: whether the ending of the uttered word is not "ん", and whether the uttered word begins with the same character as the ending of the immediately preceding response word. The process then proceeds to step S90 and branches according to the result of this first rule determination.

If it is determined that the word uttered by the speaker does not satisfy the first rule, the process proceeds to step S130, and a response message prompting the speaker to utter a word again is output via the loudspeaker 105. If it is determined that the word satisfies the first rule, the process proceeds to step S100. In step S100, as the second rule determination of whether a previously appearing word is being repeated, it is determined whether, among the beginning/ending pairs of all previously appearing words stored in the history memory 107, at least a predetermined number match the beginning/ending pair of the word the speaker has just uttered. The process then proceeds to step S110 and branches according to the result.

If it is determined that the speaker has repeated a word that has already appeared, the process proceeds to step S130 described above, and a response message prompting the speaker to utter a word again is output via the loudspeaker 105. If it is determined that the speaker has not repeated a previously appearing word, the process proceeds to step S150. In step S150, the word uttered by the speaker is accepted, an arbitrary response word that starts with the recognized ending and is not present in the history memory 107 is extracted from the response word database 104, and the process proceeds to step S160.

In step S160, it is determined whether a response word could be extracted from the response word database 104. If a response word was extracted, shiritori can continue, so the process proceeds to step S170 and the voice data of the word input by the speaker is stored in the history memory 107. The process then returns to step S20, and the extracted response word is output via the loudspeaker 105. If it is determined that no response word could be extracted from the response word database 104, the process proceeds to step S180, the user is judged to have won, and the processing ends.

Next, the processing performed when step S70 determines that the recognition likelihood of the ending is less than or equal to the predetermined value is described. In this case, the process proceeds to step S120. In step S120, it is determined whether the recognition likelihood of the beginning is greater than the predetermined value. If it is, the situation corresponds to Pattern 3 above, so the process proceeds to step S130, and a response message prompting the speaker to utter a word again is output via the loudspeaker 105. If the recognition likelihood of the beginning is less than or equal to the predetermined value, the process proceeds to step S140. This situation corresponds to Pattern 4 above, so a response message that presents the immediately preceding response word to the speaker again and prompts a new utterance is output via the loudspeaker 105.
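The branching from step S70 onward can be traced with a small function that returns the label of the step reached for one utterance. This is a sketch, not the patented implementation: the function name, the default threshold of 0.6, and the default repeat limit of 1 are assumptions, and whether a response word is available is passed in as a flag rather than looked up in a database.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (beginning, ending)


def next_step(head: str, head_lh: float, tail: str, tail_lh: float,
              prev_tail: str, history: List[Pair],
              threshold: float = 0.6, repeat_limit: int = 1,
              response_available: bool = True) -> str:
    """Return the flowchart step reached after steps S70 to S160."""
    # S70: can the ending be trusted?
    if tail_lh <= threshold:
        # S120: Pattern 3 (ask again) or Pattern 4 (re-present previous word)
        return "S130" if head_lh > threshold else "S140"
    # S80/S90: first rule determination
    if tail == "ん" or head != prev_tail:
        return "S130"
    # S100/S110: second rule determination on beginning/ending pairs
    if history.count((head, tail)) >= repeat_limit:
        return "S130"
    # S150/S160: continue (S170) if a response word exists, else the user wins (S180)
    return "S170" if response_available else "S180"
```

For example, `next_step("や", 0.9, "ね", 0.9, "や", [])` reaches S170, after which the flowchart stores the speaker's word and returns to S20 to output the response word.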

The first embodiment described above provides the following effects.
(1) Only the beginning and ending of the word input by the speaker are extracted, and the extracted portions are matched against the standby syllables stored in the recognition syllable dictionary 103 to calculate the recognition likelihood of the beginning and ending for each standby syllable and thereby perform speech recognition. This limits the target of speech recognition to the beginning and ending, which are the key points in shiritori, and reduces the recognition processing load.

(2) Because the characters between the beginning and the ending need not be recognized, there is no need to prepare a large vocabulary of standby words for speech recognition, and any word uttered by the speaker can be handled.

(3) When the recognition likelihood of the ending is less than or equal to the predetermined value, the speaker is prompted to utter the word again. This prevents misrecognition.

(4) When the recognition likelihoods of both the beginning and the ending are less than or equal to the predetermined value, the immediately preceding response word is presented to the speaker to prompt a new utterance. When the speaker may not have understood the preceding response word, presenting it again tells the speaker which character the next word must begin with, so shiritori can continue smoothly.

(5) As the first rule determination, it is determined whether the ending identified by speech recognition is not "ん", and whether the recognition result candidate for the beginning matches the ending of the response word that the voice interaction apparatus 100 output immediately before, which is stored in the history memory 107. This makes it easy to determine whether the word uttered by the speaker follows the general rules of shiritori.

(6) When at least a predetermined number of the beginning/ending pairs of all previously appearing words stored in the history memory 107 match the beginning/ending pair of the word the speaker has just uttered, the speaker is judged to have repeated a word that has already appeared. There is thus no need to match the whole uttered word against the whole of every previously appearing word stored in the history memory 107; only the beginning/ending combinations need to be matched, which reduces the load of the determination processing.

-Second embodiment-
In the first embodiment described above, response words for shiritori are stored in the response word database 104, and a response word stored in the database is extracted on the basis of the word uttered by the speaker and output via the loudspeaker 105. In the second embodiment, by contrast, the response words in the response word database 104 are classified into categories, and a difficulty (level) is attached to each response word.

The user can then control the words output by the voice interaction apparatus 100 by specifying a category as the range of words used in shiritori and by setting the difficulty of the game (adjusting the level).

In the second embodiment, the following figures are the same as in the first embodiment and their description is omitted: FIG. 2, which shows a specific example of extracting the beginning and ending of a word from the waveform of the uttered voice data; FIG. 3, which shows a specific example of the results of matching the beginning and ending of a word against the standby syllables; and FIG. 4, which shows the results of comparing the beginning and ending recognition likelihoods with the predetermined value.

The following description also assumes that the shiritori rule determination described in the first embodiment has been applied to the word input by the user via the microphone 101 and that the uttered word has been judged to follow the shiritori rules.

FIG. 6 is a block diagram showing the configuration of one embodiment of the voice interaction apparatus according to the second embodiment. In FIG. 6, components identical to those of the first embodiment shown in FIG. 1 are given the same reference numerals, and the description focuses on the differences. The voice interaction apparatus 100 further includes a monitor 108 that displays the response words output to the speaker, various menus, and the like.

The monitor 108 has a touch panel 108a operated by the user. By touching any item on a menu displayed on the monitor 108 with a finger, the user can select that item and instruct the voice interaction apparatus 100 to execute the corresponding process. This embodiment describes an example in which the user operates the touch panel 108a to input commands to the voice interaction apparatus 100, but other input devices such as a remote control or hardware switches may be provided so that the user inputs commands through them. Voice commands may also be input via the microphone 101.

FIG. 7 is a diagram schematically showing the response words stored in the response word database 104 in the second embodiment. The case described here is one in which the ending of the word uttered by the user has been determined to be "あ" ("a"). The example in FIG. 7 therefore shows only response words that begin with "あ".

As shown in FIG. 7, each response word is associated with a label (headword) 7a indicating the reading of the word, a part of speech 7b of the response word, a category 7c to which the response word belongs, and a level 7d, which is described later.
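
The record structure described above can be pictured as a small data type. The following is only a sketch: the class name and field names are assumptions introduced for illustration, and only the pairing of the reading "akiruno" (あきるの) with the category "station name" comes from the description; the part of speech and level values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseWord:
    """One entry of the response word database 104 (field names are assumed)."""
    label: str           # 7a: reading (headword) of the word
    part_of_speech: str  # 7b: part of speech
    category: str        # 7c: category, e.g. "station name"
    level: int           # 7d: numeric difficulty level


# The label/category pairing follows the description; the part of speech
# and level values are placeholders chosen for illustration.
akiruno = ResponseWord(label="あきるの", part_of_speech="noun",
                       category="station name", level=1)

print(akiruno.label[0])  # first character of the reading
```

Making the record immutable is a design choice of this sketch, reflecting that the database entries are only read during a game.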

The category 7c is information on the range of response words used in shiritori. It indicates the result of classifying each response word by meaning into a category such as station name, place name, plant, nature, or animal. When the user specifies "station name" as the category in order to limit the range of words used in shiritori, only the response word whose category 7c is "station name", namely the one whose label 7a is "akiruno" (あきるの), is extracted as a candidate response word.

The level 7d is information on the difficulty of shiritori and is expressed as a numerical value corresponding to that difficulty. There are three difficulty settings: "easy", "normal", and "difficult". The easier the setting, the fewer words the voice interaction apparatus 100 can respond with, which raises the user's chance of winning. Conversely, the more difficult the setting, the more words the voice interaction apparatus 100 can respond with, which lowers the user's chance of winning. In this embodiment, "1" is set as the level 7d corresponding to "easy", "2" as the level 7d corresponding to "normal", and "3" as the level 7d corresponding to "difficult".

When the user sets the difficulty to "easy", only response words whose level 7d is "1" are used in shiritori, so that the voice interaction apparatus 100 has few words to respond with. When the user sets the difficulty to "normal", response words whose level 7d is "1" or "2" are used, so that the voice interaction apparatus 100 has more words to respond with than in the "easy" case. When the user sets the difficulty to "difficult", all response words whose level 7d is "1" to "3" are used, so that the voice interaction apparatus 100 has the most words to respond with.
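
The relationship between the difficulty setting and the usable levels can be sketched as a lookup table. The function name, the tuple layout, and the sample words below are assumptions for illustration; the sample levels are not taken from FIG. 7.

```python
# Levels usable at each difficulty setting, as described in the text.
USABLE_LEVELS = {
    "easy": {1},
    "normal": {1, 2},
    "difficult": {1, 2, 3},
}


def usable_words(words, difficulty):
    """Return the (label, level) entries the apparatus may answer with."""
    allowed = USABLE_LEVELS[difficulty]
    return [(label, level) for label, level in words if level in allowed]


# Hypothetical entries; the levels are not taken from FIG. 7.
words = [("あり", 1), ("あさがお", 2), ("あざらし", 3)]

for difficulty in ("easy", "normal", "difficult"):
    print(difficulty, len(usable_words(words, difficulty)))
```

Because each harder setting is a superset of the easier one, the vocabulary available to the apparatus only grows as the difficulty rises.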

The following describes a specific example in which the user sets the response word category and the shiritori difficulty in order to control the words output from the voice interaction apparatus 100 as described above. FIG. 8 shows a specific example of the setting screen displayed on the monitor 108, on which the user specifies the response word category and sets the shiritori difficulty. The user operates the touch panel 108a and, on the setting screen shown in FIG. 8, sets the response word category and the shiritori difficulty in advance as the extraction conditions for response words.

FIG. 8 shows a specific example in which "difficult" is set in the difficulty setting 8a as the shiritori difficulty and "station name" is specified in the category setting 8b as the response word category. When the response word category and the shiritori difficulty are set on the setting screen of FIG. 8 in this way, the control device 106 uses them as the extraction conditions. From the response words stored in the response word database 104 described with reference to FIG. 7, it extracts as matching words only those that satisfy the extraction conditions.

That is, in the example of FIG. 8, the user has specified "difficult" in the difficulty setting 8a and "station name" in the category setting 8b. The control device 106 therefore extracts as matching words all response words stored in the response word database 104 described with reference to FIG. 7 whose category 7c is "station name" and whose level 7d is "1" to "3". As a result, of the response words shown in FIG. 7, only the response word whose label 7a is "akiruno" is extracted as a matching word.
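
The extraction with the FIG. 8 settings can be sketched as a filter on category and level. This is an illustrative sketch only: the function name and dictionary keys are assumptions, and apart from "akiruno" being a station name, the sample entries and all level values are hypothetical rather than taken from FIG. 7.

```python
USABLE_LEVELS = {"easy": {1}, "normal": {1, 2}, "difficult": {1, 2, 3}}


def extract_matching_words(database, category, difficulty):
    """Return the entries that satisfy both extraction conditions."""
    levels = USABLE_LEVELS[difficulty]
    return [w for w in database
            if w["category"] == category and w["level"] in levels]


# Hypothetical database of words beginning with "あ"; levels are illustrative.
database = [
    {"label": "あきるの", "category": "station name", "level": 2},
    {"label": "あり", "category": "animal", "level": 1},
    {"label": "あさがお", "category": "plant", "level": 2},
    {"label": "あらし", "category": "nature", "level": 3},
]

matching = extract_matching_words(database, "station name", "difficult")
print([w["label"] for w in matching])
```

With a category that has no entry in this sample, such as "place name", the function returns an empty list, which corresponds to the case in which no matching word is extracted.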

If the response word database 104 contains no response word that satisfies the extraction conditions set by the user on the setting screen of FIG. 8, that is, if not a single response word is extracted as a matching word, the speaker wins. The control device 106 then outputs, via the loudspeaker 105, guidance declaring its defeat, for example, "I cannot think of a word. I lose."

When one or more response words have been extracted as matching words by the above processing, it is determined whether each extracted response word is identical to a word already uttered by the speaker or to a word with which the voice interaction apparatus 100 has already responded. To this end, the processing of the second rule determination, which was used in the first embodiment to determine whether the speaker repeated a word that had already appeared, is applied to each extracted matching word. That is, the beginning-and-ending pair of each extracted matching word is compared with the beginning-and-ending pairs of all previously appearing words stored in the history memory 107 to determine whether each matching word has already been used. Response words that have already been used are then excluded from the extracted matching words.
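
The pair comparison against the history can be sketched as follows. This is a simplified illustration, not the apparatus's actual recognition processing: the beginning and ending are taken here as the first and last characters of a kana reading, and the function names and the `threshold` parameter (standing in for a predetermined number of matches) are assumptions.

```python
def head_tail(reading):
    """Beginning and ending of a kana reading (first and last character)."""
    return reading[0], reading[-1]


def already_used(candidate, history, threshold=1):
    """Judge a candidate as used when enough history pairs match its pair."""
    pair = head_tail(candidate)
    matches = sum(1 for past in history if head_tail(past) == pair)
    return matches >= threshold


def exclude_used(matching_words, history, threshold=1):
    """Drop the matching words that are judged to be already used."""
    return [w for w in matching_words
            if not already_used(w, history, threshold)]


history = ["あきるの", "のはら"]          # words that have already appeared
candidates = ["あきるの", "あさがお"]      # extracted matching words
print(exclude_used(candidates, history))
```

Note that a comparison based only on the beginning-and-ending pair also treats a different word with the same pair, such as "あさの" against "あきるの", as already used.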

If no matching words remain as a result, the speaker wins, so the control device 106 outputs the guidance declaring defeat described above via the loudspeaker 105. If exactly one matching word remains, that word is output via the loudspeaker 105 as the response. If two or more matching words remain, one word is selected at random from among them and output via the loudspeaker 105 as the response.

FIG. 9 is a flowchart showing the processing of the voice interaction apparatus 100 in the second embodiment. The processing shown in FIG. 9 is executed by the control device 106 when the voice interaction apparatus 100 is powered on. In FIG. 9, processing identical to that of the voice interaction apparatus 100 in the first embodiment shown in FIG. 5 is given the same step numbers, and the description focuses on the differences. In step S151, the response word extraction processing shown in FIG. 10 is executed.

FIG. 10 is a flowchart showing the response word extraction processing in the second embodiment. In step S210, the extraction conditions for response words set in advance by the user on the setting screen described with reference to FIG. 8, that is, the set response word category and the specified shiritori difficulty, are read. The processing then proceeds to step S220, where, based on the set extraction conditions, only those response words stored in the response word database 104 described with reference to FIG. 7 that satisfy the extraction conditions are extracted as matching words. The processing then proceeds to step S230.

In step S230, it is determined whether any matching word has been extracted. If not a single matching word has been extracted, it is determined that no response word can be extracted, and the processing returns to that shown in FIG. 9. If a matching word has been extracted, the processing proceeds to step S240. In step S240, the beginning and ending of each extracted matching word are extracted, and the processing proceeds to step S250. In step S250, as in the second rule determination of the first embodiment, the beginning-and-ending pair of each extracted matching word is compared with the beginning-and-ending pairs of all previously appearing words stored in the history memory 107 to determine whether each matching word has already been used.

The processing then proceeds to step S260, where response words that have already been used are excluded from the extracted matching words, and then to step S270. In step S270, the number of matching words remaining after the exclusion is determined. If no matching words remain, it is determined that no response word can be extracted, and the processing returns to that shown in FIG. 9.

If exactly one matching word remains, the processing proceeds to step S280, where that word is determined to be the response word. The processing then returns to that shown in FIG. 9. If a plurality of matching words remain, the processing proceeds to step S290, where one word is selected at random from among them and determined to be the response word. The processing then returns to that shown in FIG. 9.
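
The flow of steps S210 to S290 can be summarized in one function. The following is a minimal sketch under stated assumptions: the function and key names are hypothetical, the beginning and ending are approximated by the first and last characters of a kana reading, the database is assumed to hold only words that begin with the ending of the user's word, and the sample entries and levels are illustrative rather than taken from FIG. 7.

```python
import random

MAX_LEVEL = {"easy": 1, "normal": 2, "difficult": 3}


def head_tail(reading):
    """Beginning and ending of a kana reading (first and last character)."""
    return reading[0], reading[-1]


def extract_response_word(database, settings, history, rng=random):
    """Return a response word, or None when no response word can be extracted."""
    # S210: read the extraction conditions set in advance by the user.
    category = settings["category"]
    max_level = MAX_LEVEL[settings["difficulty"]]
    # S220: keep only the response words that satisfy the conditions.
    matching = [w["label"] for w in database
                if w["category"] == category and w["level"] <= max_level]
    # S230: no matching word, so extraction is not possible.
    if not matching:
        return None
    # S240/S250: compare beginning-and-ending pairs with those in the history.
    used_pairs = {head_tail(word) for word in history}
    # S260: exclude the words judged to be already used.
    remaining = [word for word in matching if head_tail(word) not in used_pairs]
    # S270: branch on the number of remaining matching words.
    if not remaining:
        return None
    if len(remaining) == 1:
        return remaining[0]          # S280: the single remaining word
    return rng.choice(remaining)     # S290: random choice among several


# Hypothetical database; levels are illustrative, not taken from FIG. 7.
database = [
    {"label": "あきるの", "category": "station name", "level": 2},
    {"label": "あびこ", "category": "station name", "level": 3},
    {"label": "あり", "category": "animal", "level": 1},
]
settings = {"category": "station name", "difficulty": "normal"}
print(extract_response_word(database, settings, history=["どあ"]))
```

A return value of None corresponds to the two branches in which extraction is judged impossible, after which the caller would output the guidance declaring defeat.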

According to the second embodiment described above, the following effects are obtained in addition to those of the first embodiment.
(1) From the response words stored in the response word database 104, only those whose category matches the response word category set by the user are extracted as matching words and used as candidates for the response to the user. This makes it possible to play shiritori restricted to words of a specific category (genre).

(2) From the response words stored in the response word database 104, matching words are extracted according to the difficulty specified by the user and used as candidates for the response to the user. This allows the user to set the difficulty of the shiritori game freely, so that the game can serve a wide range of users from beginners to advanced players.

Modifications
The voice interaction apparatus of the embodiments described above may also be modified as follows.
(1) In the first and second embodiments described above, the recognition likelihood is expressed as a value from 0 to 1: 0 is calculated when the beginning or ending extracted from the word uttered by the speaker does not match the standby syllable at all, and 1 is calculated when they match completely. However, the invention is not limited to this, and the recognition likelihood may be calculated by another method.

(2) In the first and second embodiments described above, the speaker is determined to have repeated a word that has already appeared when, among the beginning-and-ending pairs of all previously appearing words stored in the history memory 107, at least a predetermined number match the beginning-and-ending pair extracted from the word uttered by the speaker. However, the invention is not limited to this. The waveform of the speech data of the word uttered by the speaker may instead be compared with the waveforms of the speech data of all previously appearing words stored in the history memory 107, and the speaker may be determined to have repeated a word that has already appeared when the history memory 107 contains a word whose waveform is similar to that of the uttered word. The same applies to the determination, in the second embodiment, of whether an extracted matching word has already been used.

(3) In the second embodiment described above, the user specifies a word category in order to specify the range of words used in shiritori. However, the invention is not limited to this. For example, the range of words used in shiritori may be specified by the part of speech 7b shown in FIG. 7, or other conditions for specifying the range may be made settable. In that case, information serving as the condition for specifying the range of words used in shiritori is attached to the response words stored in the response word database 104.
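
Specifying the range by part of speech or by another attribute, as in modification (3), can be sketched as a filter over arbitrary attribute conditions. The function name, the dictionary keys, and the sample entries are assumptions for illustration and are not taken from FIG. 7.

```python
def extract_by_conditions(database, conditions):
    """Keep entries whose attributes equal every requested condition."""
    return [w for w in database
            if all(w.get(key) == value for key, value in conditions.items())]


# Hypothetical entries carrying the attributes used as range conditions.
database = [
    {"label": "あきるの", "part_of_speech": "noun", "category": "station name"},
    {"label": "あるく", "part_of_speech": "verb", "category": "action"},
    {"label": "あおい", "part_of_speech": "adjective", "category": "color"},
]

nouns = extract_by_conditions(database, {"part_of_speech": "noun"})
print([w["label"] for w in nouns])
```

Passing the conditions as a mapping means that a newly attached attribute can be used as a range condition without changing the filter itself.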

The present invention is in no way limited to the configurations of the embodiments described above, as long as its characteristic functions are not impaired.

The correspondence between the elements of the claims and the embodiments is as follows. The microphone 101 corresponds to the voice input means, the response word database 104 to the storage means, and the history memory 107 to the history storage means. The control device 106 corresponds to the speech recognition means, the rule determination means, the output means, and the appeared-word determination means. The touch panel 108a corresponds to the difficulty setting means and the range setting means. This correspondence is only an example, and it varies with the configuration of the embodiment.

FIG. 1 is a block diagram showing the configuration of one embodiment of the voice interaction apparatus.
FIG. 2 is a diagram showing a specific example of extracting the beginning and ending of a word based on the waveform of the speech data uttered by the speaker.
FIG. 3 is a diagram showing a specific example of the result of matching the beginning and ending of a word against the standby syllables.
FIG. 4 is a diagram showing the result of comparing the recognition likelihoods of the beginning and ending with a predetermined value.
FIG. 5 is a flowchart showing the processing of the voice interaction apparatus 100 in the first embodiment.
FIG. 6 is a block diagram showing the configuration of one embodiment of the voice interaction apparatus in the second embodiment.
FIG. 7 is a diagram schematically showing the response words stored in the response word database 104 in the second embodiment.
FIG. 8 is a diagram showing a specific example of the setting screen for specifying the response word category and setting the shiritori difficulty.
FIG. 9 is a flowchart showing the processing of the voice interaction apparatus 100 in the second embodiment.
FIG. 10 is a flowchart showing the response word extraction processing in the second embodiment.

Explanation of symbols

100 Voice interaction apparatus
101 Microphone
102 Voice input operation switch
103 Recognition syllable dictionary
104 Response word database
105 Loudspeaker
106 Control device
107 History memory
108 Monitor
108a Touch panel

Claims

A voice interaction apparatus that interacts with a speaker, comprising:
speech recognition means for performing speech recognition by extracting the beginning and ending of a word input via voice input means;
storage means for storing response words for shiritori;
rule determination means for determining, based on the beginning and ending of the word recognized by the speech recognition means, whether the input word was uttered in accordance with the rules of shiritori; and
output means for, when the rule determination means determines that the input word was uttered in accordance with the rules of shiritori, extracting from the storage means a response word that begins with the same character as the ending of the word recognized by the speech recognition means, and outputting the response word as speech.

The voice interaction apparatus according to claim 1,
wherein the speech recognition means extracts the beginning and ending from the speech data of the word input via the voice input means, matches the extracted beginning and ending against standby syllables to calculate a recognition likelihood for each, and thereby recognizes the beginning and ending of the word.

The voice interaction apparatus according to claim 2,
wherein the speaker is prompted to utter the word again when the recognition likelihood of the ending calculated by the speech recognition means is equal to or less than a predetermined value.

The voice interaction apparatus according to claim 2 or 3,
wherein, when the recognition likelihood of the beginning calculated by the speech recognition means is equal to or less than a predetermined value and the recognition likelihood of the ending is also equal to or less than a predetermined value, information on the beginning of the word that the speaker should utter is presented to the speaker, and the speaker is prompted to utter a word again.

The voice interaction apparatus according to any one of claims 1 to 4,
further comprising history storage means for storing, as a shiritori history, the history of the words input via the voice input means and the response words output via the output means,
wherein the rule determination means includes appeared-word determination means that refers to the shiritori history stored in the history storage means and determines whether the word input via the voice input means is a word that has already appeared, and
when the appeared-word determination means determines that the word input via the voice input means is a word that has already appeared, the speaker is prompted to utter a word again.

The voice interaction apparatus according to claim 5,
wherein the appeared-word determination means determines that the word input via the voice input means has already appeared when, among the beginning-and-ending pairs of all words included in the shiritori history stored in the history storage means, at least a predetermined number match the beginning-and-ending pair of the word recognized by the speech recognition means.

The voice interaction apparatus according to claim 5,
wherein the appeared-word determination means determines that the word input via the voice input means has already appeared when the speech data waveforms of the words included in the shiritori history stored in the history storage means include one that is similar to the waveform of the speech data of the word input via the voice input means.

The voice interaction apparatus according to any one of claims 1 to 7,
further comprising difficulty setting means for setting the difficulty of shiritori,
wherein information on the difficulty of shiritori is attached to each response word for shiritori stored in the storage means, and
the output means extracts from the storage means, based on the information on the difficulty of shiritori, a response word corresponding to the difficulty set by the difficulty setting means, and outputs it as speech.

The voice interaction apparatus according to any one of claims 1 to 8,
further comprising range setting means for setting the range of response words used in shiritori,
wherein information on the range of response words used in shiritori is attached to each response word for shiritori stored in the storage means, and
the output means extracts from the storage means, based on the information on the range of response words used in shiritori, a response word corresponding to the range set by the range setting means, and outputs it as speech.

A voice interaction method for interacting with a speaker, comprising:
performing speech recognition by extracting the beginning and ending of a word input via voice input means;
determining, based on the beginning and ending of the recognized word, whether the input word was uttered in accordance with the rules of shiritori; and
when it is determined that the input word was uttered in accordance with the rules of shiritori, extracting a response word that begins with the same character as the ending of the recognized word, and outputting the response word as speech.