JP5701935B2

JP5701935B2 - Speech recognition system and method for controlling speech recognition system

Info

Publication number: JP5701935B2
Application number: JP2013122552A
Authority: JP
Inventors: 正樹渋谷; 潤西岡; 町田　和彦; 和彦町田; 正和戸室
Original assignee: Fuji Soft Inc
Current assignee: Fuji Soft Inc
Priority date: 2013-06-11
Filing date: 2013-06-11
Publication date: 2015-04-15
Anticipated expiration: 2033-06-11
Also published as: JP2014240864A

Description

The present invention relates to a speech recognition system and a method for controlling a speech recognition system.

In conversation between people, a given utterance does not have a single fixed response (reply); people respond in various ways depending on the situation. Dialogue between a person and a robot has therefore also been designed so that a single utterance can receive varied responses (Patent Documents 1 to 4).

In Patent Document 1, an input sentence from the user is classified, and a response sentence is generated according to rules defined for each class. In the prior art described in Patent Document 1, input sentences are classified by word; for example, the reply to "konnichiwa" or "hello" is changed according to the emotion at that time (FIG. 8 of Patent Document 1). Input sentences can also be classified as "greeting", "question", "impression", and so on, so that a question from the user is not answered with another question (FIG. 21 of Patent Document 1).

In Patent Document 2, the structure of the input sentence is analyzed to determine whether a word extracted from the input sentence is one for which it is appropriate to ask about emotion, frequency, or a parallel object. If the word is determined to be appropriate, a response sentence is generated.

In Patent Document 3, a plurality of responses to the conversation partner's utterance are generated according to response generation rules, and one response is selected from the generated responses based on a predetermined rule.

In Patent Document 4, when a response sentence contains a word whose choice depends on a positional relationship, the word corresponding to the current positional relationship is selected to generate the response sentence.

Patent Document 1: JP 2009-151314 A
Patent Document 2: JP 2010-224608 A
Patent Document 3: JP 2003-255990 A
Patent Document 4: JP 2001-188551 A

The prior art of Patent Document 1 classifies input sentences into types such as "greeting", "question", and "impression", but the classification is used merely to keep the response sentence grammatically and logically connected to the input sentence, so it is difficult to achieve a natural response.

The prior art described in Patent Document 2 classifies the question format according to the extracted word and generates a response sentence that matches the question format. However, because the response depends on the structure of the sentence, it is difficult to achieve a natural response.

Because the prior art described in Patent Document 3 selects a response probabilistically, it is difficult to achieve a natural response that reflects the state of the dialogue.

The prior art described in Patent Document 4 merely takes the positional relationship into account when responding, so it is difficult to achieve a natural response.

Thus, a technique that probabilistically selects a response from a plurality of responses registered in advance for a given utterance cannot respond in a way that correctly reflects the situation. A technique that responds according to a scenario requires scenarios covering various situations to be created in advance, which takes considerable effort.

Incidentally, even when the wording differs, reacting in the same way whenever the meaning and the situation are the same gives a sense of consistency and therefore feels natural.

The present invention has been made in view of the above problems. One object of the invention is to provide a speech recognition system, and a method for controlling a speech recognition system, with which a system capable of varied yet consistent responses to voice input sentences can be realized relatively easily. Another object of the invention is to provide a speech recognition system, and a method for controlling a speech recognition system, that can respond consistently according to the meaning and type of the voice input sentence.

A system according to one aspect of the present invention is a speech recognition system that recognizes speech and responds to it, comprising: a speech recognition unit that recognizes a voice input sentence input from a voice input unit; a determination unit that determines, based on the recognition result of the voice input sentence by the speech recognition unit and on predetermined information, whether the voice input sentence is speech directed at the system itself; an input sentence classification unit that classifies the voice input sentence into one of a plurality of input sentence types prepared in advance; a reaction level determination unit that determines, based on the input sentence type into which the voice input sentence was classified and on the determination result of the determination unit for the voice input sentence, a reaction level for the voice input sentence from among a plurality of reaction levels prepared in advance for each input sentence type; a response determination unit that determines, based on the determined reaction level, a response to the voice input sentence from among predefined responses; and a response output unit that outputs the determined response.

The input sentence classification unit may classify the voice input sentence into one of a plurality of meaning groups prepared in advance according to meaning, and may further classify it into, among a plurality of input sentence types each associated with one or more meaning groups, the input sentence type corresponding to the meaning group into which the voice input sentence was classified.

The system may include an image output unit that outputs an image of a predetermined range around the system, and the reaction level determination unit may determine the reaction level for the voice input sentence from among the plurality of reaction levels based on the input sentence type, the determination result, and the analysis result of the image output by the image output unit.

The response determination unit may also determine the response to the voice input sentence based on the determined reaction level and the analysis result of the image output by the image output unit.

The drawings, in order, are as follows.
A block diagram showing the configuration of the speech recognition system.
An explanatory diagram showing a method of extracting a plurality of parameters for evaluating the reliability of a user's utterance.
An explanatory diagram showing (a) the association between keywords and verbs and (b) the relationships between keywords.
An explanatory diagram showing an example of a database for classifying input sentences.
An explanatory diagram showing an example of a database that determines the reaction level of a response according to the input sentence type, the reliability, and the like.
An explanatory diagram showing an example of a database that determines the response content according to the meaning of the input sentence, the reaction level, and the like.
A flowchart showing the overall process from recognizing speech to responding.
A flowchart showing the process of classifying an input sentence.

In this embodiment, as described in detail below, the types of voice input sentence (for example, a greeting, an utterance addressed to the system, or conversation) are classified according to the situation, and the degree (strength) of the typical reaction in each situation is defined. In addition, voice input sentences are grouped by meaning, and for each meaning group a response is defined in advance according to the degree of reaction (and the situation obtained from the image analysis result). As a result, the speech recognition system of this embodiment can respond in varied ways, with an appropriate degree of reaction, according to the type of voice input sentence and the situation.

Furthermore, the system does not need a scenario to be prepared for each individual scene and can be realized with a relatively simple program, which reduces the cost of building the system.

In this embodiment, voice input sentences are grouped according to meaning, and the input sentence type is then determined for each meaning. Therefore, even when a new word appears, it can be supported simply by registering the new word in the group that matches its meaning.

FIG. 1 is a block diagram showing the overall configuration of the speech recognition system 1 of this embodiment. The speech recognition system 1 is configured as a computer system. The speech recognition system 1 can be mounted on an object 2 of various shapes, such as a robot shape, a cylinder, or a rectangular parallelepiped. The speech recognition system 1 can also be mounted on various electrical products 3, such as a display device, a vacuum cleaner, a refrigerator, a mobile phone, a portable information terminal, a home energy management system (HEMS), or a washing machine. In this embodiment, the case where the speech recognition system 1 is mounted on a humanoid robot 2 is described as an example.

The entire speech recognition system 1 may be provided inside the robot 2, or part of the speech recognition system 1 may be provided outside the robot 2. Alternatively, almost the entire speech recognition system 1 may be provided on an external server, with only the man-machine interface for exchanging information with the user (voice input unit 21, camera 23, voice output unit 41, display unit 42) provided in the robot 2. However, providing the entire speech recognition system 1 in the robot 2 prevents a time lag from arising in conversation with the user and allows more natural communication.

The speech recognition system 1 includes, for example, a speech recognition unit 11, an utterance reliability determination unit 12, an input sentence classification unit 13, a reaction level determination unit 14, a response determination unit 15, and a response output unit 16. The speech recognition system 1 further includes a voice input unit 21, an image analysis unit 22, a camera 23, an acoustic model database 24, a grammar database 25, a dictionary database 26, a keyword-verb database 27, a recognition result history database 28, a voice output unit 41, a display unit 42, and an operation mechanism 43.

The voice input unit 21 includes, for example, one or more microphone devices and an A/D (analog/digital) conversion circuit. In this embodiment, the voice input unit 21 is assumed to also include a sound source direction microphone capable of detecting the direction of a sound source.

The image analysis unit 22 analyzes image data captured by the camera 23, which is an example of the "image output unit", and outputs the analysis result. The image analysis unit 22 analyzes, for example, whether a face appears in the captured image data, whether that face is seen from the front or in profile, and whether that face matches the face of a user registered in the system 1. The image analysis result is used by the utterance reliability determination unit 12, the reaction level determination unit 14, and the response determination unit 15.

The camera 23 is attached, for example, to the head or chest of the robot 2, photographs users and other subjects around the robot 2, and outputs image data. The camera 23 does not necessarily have to be attached to the robot 2. The camera 23 may be placed away from the robot 2, and the image information it captures may be sent to the speech recognition system 1 via wireless communication or the like. The image data captured by the camera 23 and/or the image analysis results of the image analysis unit 22 can be accumulated for a predetermined length of time.

The speech recognition unit 11 analyzes the speech input from the voice input unit 21 using the acoustic model database 24, the grammar database 25, and the dictionary database 26, and converts it into words.

The acoustic model database 24 stores text (readings) in association with the waveforms produced when the text is pronounced, and thereby defines which waveform is recognized as which word. The grammar database 25 stores how words are ordered (grammar) and the like. In the dictionary database 26, various words, including the predetermined keywords, are registered together with their readings. The history of speech recognition results from the speech recognition unit 11 is stored in the recognition result history database 28.

The utterance reliability determination unit 12 extracts predetermined parameters that serve as information for determining the utterance reliability. It obtains these parameters using the recognition result of the speech recognition unit 11, the image information captured by the camera 23, the sound source direction information, the keyword-verb database 27, and the recognition result history database 28.

Based on the predetermined parameters, the utterance reliability determination unit 12 further determines whether the recognition result of the speech recognition unit 11 is speech from a user directed at the system itself. More specifically, the utterance reliability determination unit 12 determines whether the recognition result is voice input from the user to the system and whether it is a voice instruction to the system. A voice instruction basically consists of a combination of a predetermined keyword and a predetermined verb. However, this is not limiting, and a predetermined keyword alone can also be determined to be a voice instruction.

The input sentence classification unit 13 takes a voice input sentence that has been recognized by the speech recognition unit 11 and judged by the utterance reliability determination unit 12 to have an utterance reliability at or above a lower limit, and classifies it into one of a plurality of input sentence classes prepared in advance in the input sentence classification database 29.

The reaction level determination unit 14 searches the reaction level database 30 based on the classification result of the voice input sentence and the utterance reliability, and thereby selects, from the reaction levels prepared in advance, the one reaction level that corresponds to the voice input sentence.

The response determination unit 15 searches the response definition database 31 based on the reaction level determined by the reaction level determination unit 14 and on the voice input sentence, and thereby determines the response to the voice input sentence from among the predefined responses.

The response output unit 16 executes the response determined by the response determination unit 15. The response output unit 16 can respond to the user using, for example, one or more of the voice output unit 41, the display unit 42, and the operation mechanism 43.
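The flow from classification through reaction level to response can be sketched as three table lookups. This is only a minimal illustration: the dictionaries standing in for databases 29, 30, and 31, their keys, and every entry in them are hypothetical, since the actual database contents are not given in this part of the description.

```python
from typing import Optional

# Hypothetical stand-ins for databases 29, 30 and 31. All entries are invented.
INPUT_CLASSIFICATION_DB = {
    # sentence -> (meaning group, input sentence type)
    "hello": ("greeting_group", "greeting"),
}
REACTION_LEVEL_DB = {
    # (input sentence type, utterance reliability) -> reaction level
    ("greeting", "high"): 3,
    ("greeting", "medium"): 2,
    ("greeting", "low"): 1,
}
RESPONSE_DB = {
    # (meaning group, reaction level) -> response
    ("greeting_group", 3): "Hello! Nice to see you!",
    ("greeting_group", 2): "Hello.",
    ("greeting_group", 1): "(nods)",
}


def respond(sentence: str, reliability: str) -> Optional[str]:
    """Classify the sentence, pick a reaction level, then pick a response."""
    entry = INPUT_CLASSIFICATION_DB.get(sentence)
    if entry is None:
        return None  # sentence could not be classified
    meaning_group, sentence_type = entry
    level = REACTION_LEVEL_DB.get((sentence_type, reliability))
    if level is None:
        return None  # no reaction level defined for this combination
    return RESPONSE_DB.get((meaning_group, level))
```

Keeping the three tables separate mirrors the division into units 13, 14, and 15: a new word only needs an entry in the first table, while the reaction levels and responses stay shared across the meaning group.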

The voice output unit 41 consists of, for example, a synthesized speech output device and a speaker, and replies to the user by voice. The display unit 42 consists of, for example, a lamp and a display provided on the robot 2, and displays messages or lights the lamp. The operation mechanism 43 is, for example, a mechanism for moving the robot's neck, limbs, and the like. In addition to voice output, display output, and motion, output to a printer or the like may also be performed.

The predetermined parameters extracted by the utterance reliability determination unit 12 are described with reference to FIG. 2. As described below, the utterance reliability determination unit 12 can extract all or some of a first parameter, a second parameter, and a third parameter. It does not always need to detect all of the first to third parameters. The method of determining the utterance reliability described below is only an example, and the present invention is not limited to it. Hereinafter, a voice input sentence may be referred to simply as an input sentence.

The first parameter, shown in FIG. 2 (1), is obtained by analyzing the latest input sentence input from the voice input unit 21 to the speech recognition unit 11. The first parameter can further include a plurality of (for example, three) subparameters (1A) to (1C).

The first subparameter (1A) is the proportion of the latest input sentence made up of predetermined keywords (and predetermined verbs). The speech recognition unit 11 recognizes the sound input to the voice input unit 21 (the user's voice instructions, surrounding conversation, ambient noise, and so on) by matching it against the dictionary database 26. Whether the utterance reliability is high or low can be judged from the proportion of predetermined keywords (and predetermined verbs) among the recognized words.

A predetermined keyword is, among general keywords, a keyword that indicates a service the speech recognition system 1 can provide (more precisely, a service that can be provided by the system on which the speech recognition system 1 is mounted, here the robot 2). The predetermined keywords also include keywords the user may utter when using a service, for example keywords corresponding to replies such as "yes" and "no".

The predetermined keywords are registered in the speech recognition system 1 in advance as keywords that serve as voice instructions when a service is used. Specifically, keywords relating to the services that the robot 2 carrying the speech recognition system 1 can provide to the user are registered in the speech recognition system 1 as predetermined keywords.

For example, if the robot 2 can provide services such as dancing, reading news or stories, setting quizzes, forecasting the weather, and telling fortunes, the keywords that identify those services ("dance", "news", "story", "quiz", "weather", "fortune-telling") are registered as predetermined keywords.

Reference is first made to FIG. 3. FIG. 3 (a) is an example of a table T10 that manages the association between predetermined keywords and predetermined verbs. The table T10 is an example of the keyword-verb database 27 shown in FIG. 1.

Each predetermined keyword is associated in advance with predetermined verbs related to that keyword. For the keyword "dance", for example, predetermined verbs frequently used with the keyword include "yatte" ("do it"), "misete" ("show me"), and "odotte" ("dance"). For the keyword "news", the predetermined verbs may include "yonde" ("read it"), "kikasete" ("let me hear it"), and "oshiete" ("tell me").

Returning to FIG. 2, the utterance reliability can be determined from the proportion of predetermined keywords among the words making up the latest input sentence, which is extracted as the first subparameter (1A).

For example, when the latest input sentence contains both a predetermined keyword and a predetermined verb, the utterance reliability can be determined to be high. When the latest input sentence contains only a predetermined keyword and no predetermined verb, the utterance reliability can be determined to be medium. When no predetermined keyword is contained, or when the proportion of meaningless words other than predetermined keywords (non-keywords) exceeds a predetermined reference value, the utterance reliability can be determined to be low.
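The three-way judgment for subparameter (1A) can be sketched as follows. This is a minimal illustration under stated assumptions: `KEYWORD_VERBS` is a hypothetical stand-in for the keyword-verb table, using English glosses for the keywords and romanized forms for the verbs given in the text; the reference value `max_noise_ratio` of 0.8 is invented; and treating a high noise ratio as taking priority over the presence of a verb is an interpretation, not something the description states.

```python
# Hypothetical stand-in for the keyword-verb table. Keywords are English
# glosses and verbs are romanized forms of the examples in the text.
KEYWORD_VERBS = {
    "dance": {"yatte", "misete", "odotte"},
    "news": {"yonde", "kikasete", "oshiete"},
}


def keyword_reliability(words, max_noise_ratio=0.8):
    """Return "high", "medium" or "low" for one recognized input sentence.

    max_noise_ratio is an assumed reference value, not taken from the source.
    """
    keywords = [w for w in words if w in KEYWORD_VERBS]
    if not keywords:
        return "low"  # no predetermined keyword at all
    # Only verbs associated with a keyword that was actually found count.
    verbs_for_found = set().union(*(KEYWORD_VERBS[k] for k in keywords))
    verbs = [w for w in words if w in verbs_for_found]
    noise_ratio = (len(words) - len(keywords) - len(verbs)) / len(words)
    if noise_ratio > max_noise_ratio:
        return "low"  # sentence is dominated by non-keywords
    return "high" if verbs else "medium"
```

For instance, `keyword_reliability(["uuu", "dance", "yatte"])` yields "high", while the keyword alone yields "medium".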

The second subparameter (1B) of the first parameter is described next. The second subparameter (1B) determines whether a predetermined keyword is present based on the relationship between the average sound intensity of the entire latest input sentence and the sound intensity of each word.

The "uuu" shown in FIG. 2 (1) is a mechanical sound (servo sound) emitted by, for example, an air conditioner, a refrigerator, or a washing machine, and is a meaningless non-keyword. Because the speech recognition unit 11 tries to convert the input sound data into some word wherever possible, even a mere mechanical sound is recognized as some word, such as "uuu".

However, a mechanical sound is not something the user said but merely environmental noise, so its sound level is low. By contrast, words spoken by the user have a relatively high sound level. In particular, when the user utters a predetermined keyword expecting a response from the robot 2, the sound level can be expected to be relatively high.

The second subparameter (1B) therefore compares the average sound intensity of the words in the entire input sentence with the sound intensity of each word, and accepts as predetermined keywords only those that were uttered strongly. In the illustrated example, the sound level of "dance" is 3000 and that of "yatte" is 1000, both clearly higher than the sound intensity of the other words (450 to 600). Conversely, even a word recognized as a predetermined keyword is ignored if its sound level is low compared with the average sound intensity of the entire input sentence, because a keyword with a low sound level is likely to be an accidental misrecognition caused by a combination of ambient noises.
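A minimal sketch of the comparison in subparameter (1B) is shown below. The levels 3000 and 1000 are taken from the illustrated example; the individual noise levels are hypothetical values chosen inside the 450 to 600 range mentioned in the text, and using a strict "greater than the mean" test as the acceptance rule is an assumption.

```python
from statistics import mean


def loud_keywords(words, keywords):
    """Keep only keywords whose sound level exceeds the sentence average.

    words is a list of (text, sound_level) pairs for one input sentence.
    """
    if not words:
        return []
    average = mean(level for _, level in words)
    return [text for text, level in words if text in keywords and level > average]


# Noise levels below are hypothetical; "dance" and "yatte" use the levels
# quoted in the illustrated example.
sentence = [
    ("uuu", 450), ("uuu", 500), ("dance", 3000), ("yatte", 1000),
    ("uuu", 600), ("uuu", 550), ("uuu", 480), ("uuu", 520),
]
accepted = loud_keywords(sentence, {"dance", "yatte", "news"})
```

Note that the outcome depends on how many quiet words surround the keywords: with only a few noise words, a single very loud keyword can raise the average enough to push a moderately loud one below it.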

The third subparameter (1C) of the first parameter is described next. The third subparameter (1C) determines whether each word was spoken by the user based on the sound length (frame length) of each word making up the latest input sentence. In other words, the third subparameter (1C) removes recognition results caused by noise and the like based on the length of the sound. The length of one frame is obtained from the sampling frequency of the voice input unit 21.

When a user utters a given predetermined keyword, a certain amount of time is required because of the characteristics of the human voice. By contrast, when ambient noises happen to combine and are misrecognized as a predetermined keyword, the result often lasts for less time than a person would need to utter the word. The third subparameter (1C) therefore removes words of unnatural length from the recognition result of the latest input sentence and determines that words of natural length were uttered by the user.
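The length check in subparameter (1C) can be sketched as a simple duration filter. The frame size of 160 samples, the 16 kHz sampling frequency, and the 0.15 s minimum duration are all hypothetical values chosen for illustration; the description only states that the frame length follows from the sampling frequency. The sketch also only rejects words that are too short, which is the case the text emphasizes.

```python
def frame_seconds(samples_per_frame, sampling_rate_hz):
    """Duration of one frame, derived from the sampling frequency."""
    return samples_per_frame / sampling_rate_hz


def drop_unnaturally_short(words, frame_sec, min_sec=0.15):
    """Keep words long enough to have been spoken by a person.

    words is a list of (text, frame_count) pairs; min_sec is an assumed
    threshold, not a value from the source.
    """
    return [text for text, frames in words if frames * frame_sec >= min_sec]


frame_sec = frame_seconds(160, 16000)  # hypothetical: 10 ms per frame
kept = drop_unnaturally_short([("dance", 40), ("quiz", 5)], frame_sec)
```

Here "dance" spans about 0.4 s and is kept, while "quiz" spans about 0.05 s and is discarded as a likely misrecognition.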

FIG. 2 (2) shows the second parameter. The second parameter is obtained by analyzing the recognition results over a predetermined period up to the present (for example, from a few seconds to a little over ten seconds). The recognition results within the predetermined period may be weighted equally, or older recognition results may be given lower weight. For example, if the predetermined period is 7 seconds, words recognized between 7 and 5 seconds ago may be multiplied by 0.4, words recognized between 4 and 2 seconds ago by 0.7, and the words recognized 1 second ago and most recently by 1.0.
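The recency weighting in this example can be sketched as a step function of a word's age. The weights 0.4, 0.7, and 1.0 and the 7-second window come from the text; how ages that fall between the stated ranges (for example 4.5 seconds) are handled is not specified, so the continuous boundaries at 5 and 2 seconds used here are an assumption, as is the idea of summing the weights into a single score.

```python
def recency_weight(age_sec, window_sec=7.0):
    """Weight for a word recognized age_sec seconds ago.

    The boundaries at 5 s and 2 s are assumed; the text gives the ranges
    7-5 s, 4-2 s, and 1 s or newer without covering the gaps between them.
    """
    if age_sec < 0 or age_sec > window_sec:
        return 0.0  # outside the predetermined period
    if age_sec >= 5:
        return 0.4
    if age_sec >= 2:
        return 0.7
    return 1.0


def weighted_keyword_score(recognized, keywords):
    """Sum the recency weights of all keywords in the window.

    recognized is a list of (word, age_sec) pairs.
    """
    return sum(recency_weight(age) for word, age in recognized if word in keywords)
```

With this weighting, a keyword heard just now contributes 2.5 times as much as the same keyword heard six seconds ago.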

As described below, the second parameter can include a plurality of (for example, four) subparameters (2A) to (2D).

With the first subparameter (2A) of the second parameter, when a predetermined keyword is detected after a continuous run of mechanical sounds such as "uuu", it is determined that the user has spoken to the system. That is, when a predetermined keyword is uttered in an environment where only mechanical sounds had been heard, it is determined to be speech from the user.
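One way to sketch subparameter (2A) is to check whether the most recent word is a keyword and the words immediately before it were all mechanical sounds. The required run length `min_noise_run` of 3 and the explicit set of noise words are hypothetical; the description does not say how many mechanical sounds count as "continuous".

```python
def keyword_after_noise(history, keywords, noise_words, min_noise_run=3):
    """True if the latest word is a keyword preceded only by mechanical sounds.

    history is the list of recognized words, oldest first. min_noise_run is
    an assumed value, not taken from the source.
    """
    if not history or history[-1] not in keywords:
        return False
    preceding = history[:-1][-min_noise_run:]
    return len(preceding) == min_noise_run and all(w in noise_words for w in preceding)
```

For example, the history `["uuu", "uuu", "uuu", "dance"]` satisfies the check, while `["uuu", "dance"]` does not because the preceding run is too short.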

In the second subparameter (2B) of the second parameter, when unrelated words are recognized in succession, they are judged to be not a voice instruction from the user but a conversation taking place around the robot 2. Examples of such surrounding conversation include conversations between people, conversations between a person and another machine (one that can be controlled by voice instructions), and audio from a television, radio, or the like.

FIG. 3(b) schematically shows the relationships between keywords. Around a predetermined keyword there are one or more related keywords that are semantically related to it. For example, around the predetermined keyword "quiz" there are related keywords such as "question" and "answer". There are also other keywords that have little relation to the predetermined keyword "quiz"; these include other predetermined keywords and the keywords related to them. By analyzing the relationships between keywords (words) in advance in this way, it is possible to determine whether unrelated words have been input in succession.
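The check for unrelated words arriving in succession can be sketched as follows. This is only an illustration under stated assumptions: the group table, the function names, and the threshold of 3 are hypothetical, and only the "quiz" group comes from the text. Two words are treated as related when they belong to the group of the same predetermined keyword.

```python
from typing import Dict, List, Optional, Set

# Predetermined keyword -> itself plus its related keywords.
KEYWORD_GROUPS: Dict[str, Set[str]] = {
    "quiz": {"quiz", "question", "answer"},  # example from the text
    "dance": {"dance", "music"},             # hypothetical second group
}


def group_of(word: str) -> Optional[str]:
    """Return the predetermined keyword whose group contains `word`, if any."""
    for keyword, members in KEYWORD_GROUPS.items():
        if word in members:
            return keyword
    return None


def are_related(a: str, b: str) -> bool:
    """Two words are related when both fall in the same keyword group."""
    group = group_of(a)
    return group is not None and group == group_of(b)


def longest_unrelated_run(words: List[str]) -> int:
    """Length of the longest run in which each word is unrelated to the previous one."""
    longest = current = 1 if words else 0
    for prev, word in zip(words, words[1:]):
        current = 1 if are_related(prev, word) else current + 1
        longest = max(longest, current)
    return longest


def is_ambient_conversation(words: List[str], threshold: int = 3) -> bool:
    """Assumed rule: `threshold` or more unrelated words in a row means ambient talk."""
    return longest_unrelated_run(words) >= threshold
```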

Returning to FIG. 2, in the third subparameter (2C) of the second parameter, the utterance reliability is judged to be high when the same predetermined keyword appears frequently. For example, when the same predetermined keyword recurs at high frequency, as in "dance", "dance", "dance", the user is likely to be speaking. In particular, users whose speech is indistinct, such as infants and elderly people, may simply repeat a predetermined keyword without any accompanying verb.
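A frequency check of this kind might look like the sketch below. The function name and the minimum count of 3 are assumptions chosen to match the "dance", "dance", "dance" example; the text gives no specific threshold.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def repeated_keyword(words: Iterable[str], keywords: Set[str],
                     min_count: int = 3) -> Optional[str]:
    """Return the predetermined keyword heard at least `min_count` times, if any."""
    counts = Counter(w for w in words if w in keywords)
    if not counts:
        return None
    keyword, count = counts.most_common(1)[0]
    return keyword if count >= min_count else None
```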

In the fourth subparameter (2D) of the second parameter, when a predetermined keyword is detected after a predetermined silent period has elapsed, it is judged to be a voice instruction from the user to the robot 2. In other words, the fourth subparameter (2D) is a variation of the first subparameter (2A): when a predetermined keyword is detected in a relatively quiet environment, it is judged to be a voice instruction from the user.

The third parameter will now be described. The third parameter, shown in FIG. 2(3), is extracted from image information captured by the camera 23 and from sound source direction information identified by a sound source direction microphone included in the voice input unit 21. The third parameter may include a plurality of (for example, three) subparameters.

The first subparameter (3A) is the presence or absence of a face. When the camera 23 has captured a human face, the recognition result at that time is likely to be a voice instruction from the user. The second subparameter (3B) is the orientation of the face. When the camera 23 has captured the user's face from the front, the user is likely to be speaking to the robot 2. The third subparameter (3C) is the sound source direction. When the voice comes from directly in front of the robot 2, it is likely to be a voice instruction from the user.

The utterance reliability determination unit 12 may calculate the utterance reliability by taking all of the parameters described above into account, or based on only one or some of them. The utterance reliability may be calculated by weighting the determination result of each parameter, or simply as the ratio of the total of the determination results to the number of determinations. For example, the utterance reliability may be calculated as the ratio of the number pn of parameters that judged the input to be a voice input to the own system to the total number tp of parameters used (utterance reliability = pn / tp).

The utterance reliability is compared with a plurality of predetermined determination values. The system 1 treats an input sentence whose utterance reliability is below a preset lower limit value, which serves as a first determination value, as an input sentence outside the scope of processing rather than as an input to the own system. When the utterance reliability of an input sentence is equal to or greater than a preset normal value, which serves as a second determination value, the system 1 judges the utterance reliability of that input sentence to be "normal". When the utterance reliability of an input sentence is less than the second determination value, the system 1 judges it to be "low". Thus, in this embodiment the utterance reliability of an input sentence is handled as one of "out of scope", "low", and "normal"; however, the configuration is not limited to this, and finer ranks such as "out of scope", "low", "medium", and "high" may be used.
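The ratio pn / tp and the comparison against the two determination values can be sketched as follows. The function names and the numeric thresholds (0.3 for the lower limit value, 0.6 for the normal value) are assumptions for illustration only; the text does not give concrete values.

```python
from typing import Iterable


def utterance_reliability(votes: Iterable[bool]) -> float:
    """pn / tp: share of the parameters used that judged the input to be for the own system."""
    votes = list(votes)
    if not votes:
        raise ValueError("at least one parameter must be used")
    pn = sum(1 for vote in votes if vote)  # parameters that voted "input to own system"
    tp = len(votes)                        # total number of parameters used
    return pn / tp


def reliability_rank(reliability: float, lower_limit: float = 0.3,
                     normal_value: float = 0.6) -> str:
    """Map a reliability to 'out of scope', 'low', or 'normal' (threshold values assumed)."""
    if reliability < lower_limit:      # below the first determination value
        return "out of scope"
    if reliability >= normal_value:    # at or above the second determination value
        return "normal"
    return "low"
```

For example, if three of the four parameters used judge the input to be addressed to the system, the reliability is 3 / 4 = 0.75, which ranks as "normal" under the assumed thresholds.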

FIG. 4 shows an example of the input sentence classification database 29 used by the input sentence classification unit 13. At the left edge of FIG. 4, input sentences, which are speech recognition results, are shown in katakana. On the right side of FIG. 4, input sentence types indicating the type of each input sentence are shown, for example "Talk 1", "Talk 2", and "Command 1". Between the input sentences and the input sentence types, meaning labels for grouping the input sentences by meaning are shown. A meaning label may also be called a semantic group into which input sentences are classified by meaning. Because an input sentence type is a group for determining the response level (the reaction level described later), it may also be called a response level determination group or a response control group.

For example, the input sentences "ナマエハ" (namae wa), "オナマエハ" (onamae wa), and "ナンテナマエ" (nante namae) all ask the name of the party being addressed, so the meaning label "What is your name?" is set for them. The input sentences "アナタダレ" (anata dare), "ダレデスカ" (dare desu ka), and "ダレ" (dare) all ask who the party being addressed is, so the meaning label "Who?" is set for them. The meaning labels "What is your name?" and "Who?" are then associated with their superordinate group, "Talk 1 'What is your name?'".

An input sentence type is defined by associating a type name used to classify meaning labels according to reaction level (for example, talk, command, or greeting) with a representative meaning label chosen from the one or more meaning labels belonging to that type. In the above example, the meaning labels "What is your name?" and "Who?" are both defined as belonging to the representative meaning label "What is your name?" within the type "Talk 1".

The input sentences "ヘンジシテ" (henji shite), "ムシシナイデ" (mushi shinaide), and "ハナシキイテ" (hanashi kiite) are associated with the meaning label "Answer me". The input sentences "ハンノウシテ" (hannou shite) and "ハンノウハ" (hannou wa) are associated with the meaning label "React". The meaning labels "Answer me" and "React" are both associated with the input sentence type "Talk 1 'Answer me'".

The input sentences "アタマイイ" (atama ii), "リコウ" (rikou), and "カシコイ" (kashikoi) are associated with the meaning label "You're smart", and that meaning label is associated with the input sentence type "Talk 2 'You're smart'".

Input sentences belonging to "Talk 1" and those belonging to "Talk 2" are both words spoken by the user to the own system, but they are defined as separate types because their reaction levels differ.

The input sentences "キョウナンニチ" (kyou nan-nichi) and "ヒニチオシエテ" (hinichi oshiete) are associated with the meaning label "What's the date today?", and that meaning label is associated with the input sentence type "Command 1 'What's the date today?'". The type "Command" indicates a voice instruction for using a service installed in the robot 2 (for example, dancing, reading out the news, or answering the date).

The input sentences "ダンスオドッテ" (dansu odotte), "ダンスミセテ" (dansu misete), and "ダンスヤッテ" (dansu yatte) are associated with the meaning label "Dance for me", and that meaning label is associated with the input sentence type "Command 2 'Dance for me'". Input sentences belonging to "Command 1" and those belonging to "Command 2" are both voice instructions for using a service of the robot 2, but they are defined as separate types because their reaction levels differ (that is, so that the reaction level can be varied).

The input sentences "ハジメマシテ" (hajimemashite) and "オハツデス" (ohatsu desu) shown at the bottom of FIG. 4 are associated with the meaning label "Nice to meet you", and that meaning label is associated with the input sentence type "Greeting 1 'Nice to meet you'".

FIG. 4 shows only some of the input sentences that the system 1 can recognize; in principle, every recognizable input sentence is associated with one of the meaning labels. In this embodiment, input sentences, which are speech recognition results, are grouped by meaning label and are then classified, via the meaning label, into one of the input sentence types. An input sentence that is not registered in the input sentence classification database 29 may be ignored without any response, or may be classified into an input sentence type set as a default.

To register a new input sentence (which may be a single word) in the input sentence classification database 29, it suffices to associate it with one of the meaning labels. The new input sentence is then classified into the corresponding input sentence type via the meaning label set for it. If no suitable meaning label exists for the new input sentence, a new meaning label may be created and associated with it, and that new meaning label may in turn be associated with one of the input sentence types.

To change the reaction level for a registered input sentence, the meaning label to which that input sentence belongs is associated with an input sentence type having the desired reaction level. For example, to change the reaction level of the input sentence "ヘンジシテ" (henji shite), the meaning label "Answer me" set for that input sentence can be moved from the input sentence type "Talk 1" to "Talk 2". Alternatively, a new input sentence type may be defined and the meaning label in question associated with it.
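The two-step lookup of FIG. 4 (input sentence to meaning label, then meaning label to input sentence type) can be sketched with two dictionaries. This is a minimal illustration, not the actual database layout: the names `SENTENCE_TO_LABEL`, `LABEL_TO_TYPE`, and `classify` are hypothetical, the labels are English renderings of the Japanese labels, and only a subset of the entries from the text is included.

```python
from typing import Dict, Optional, Tuple

# Step 1: recognized input sentence (katakana) -> meaning label.
SENTENCE_TO_LABEL: Dict[str, str] = {
    "ナマエハ": "What is your name?",
    "オナマエハ": "What is your name?",
    "ダレ": "Who?",
    "ヘンジシテ": "Answer me",
    "ハンノウシテ": "React",
    "アタマイイ": "You're smart",
    "キョウナンニチ": "What's the date today?",
    "ダンスオドッテ": "Dance for me",
    "ハジメマシテ": "Nice to meet you",
}

# Step 2: meaning label -> input sentence type.
LABEL_TO_TYPE: Dict[str, str] = {
    "What is your name?": "Talk 1",
    "Who?": "Talk 1",
    "Answer me": "Talk 1",
    "React": "Talk 1",
    "You're smart": "Talk 2",
    "What's the date today?": "Command 1",
    "Dance for me": "Command 2",
    "Nice to meet you": "Greeting 1",
}


def classify(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (meaning label, input sentence type), or None if the sentence is unregistered."""
    label = SENTENCE_TO_LABEL.get(sentence)
    if label is None:
        return None  # here unregistered sentences are simply ignored
    return label, LABEL_TO_TYPE[label]
```

With this layout, changing the reaction level of every sentence under one label is a single reassignment such as `LABEL_TO_TYPE["Answer me"] = "Talk 2"`, and registering a new sentence only requires adding one entry to `SENTENCE_TO_LABEL`.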

In the input sentence classification database 29, input sentences are first grouped by meaning and then associated with input sentence types, so input sentences that are worded differently can be answered at the same reaction level as long as their meanings are the same or similar. Because input sentences with similar meanings are answered at the same reaction level, responses with a sense of consistency can be achieved. Here, similar meanings are, for example, different meaning labels associated with the same input sentence type (such as "What is your name?" and "Who?", or "Answer me" and "React").

Furthermore, a new input sentence can be classified into the appropriate input sentence type simply by setting a meaning label for it, so there is no need to define a reaction level directly for each input sentence. New words and expressions can therefore be accommodated flexibly and quickly, which makes the system easy to build and operate.

FIG. 5 shows an example of the reaction level database 30. The reaction level database 30 is used by the reaction level determination unit 14, and one or more reaction levels are set in it for each input sentence type.

In this embodiment, three reaction levels are set: "no reaction", "safe reaction", and "original reaction". "No reaction" means not responding to the input sentence. "Safe reaction" means responding to the input sentence, but with a vague or noncommittal response that differs from the original reaction to it. "Original reaction" means giving the proper reaction to the input sentence, that is, a response that holds up as ordinary dialogue, or the ordinary response that the user who spoke would presumably expect.

For each input sentence type, conditions are set for selecting some or all of the reaction levels. For example, for the input sentence type "Talk 1", the settings are such that "no reaction" is selected when the utterance reliability is low and "original reaction" is selected when the utterance reliability is normal; "safe reaction" is not set.

For the input sentence type "Talk 2", "no reaction" is selected when the utterance reliability is low, "safe reaction" is selected when the utterance reliability is normal and no face can be detected, and "original reaction" is selected when the utterance reliability is normal and a face has been detected.

For the input sentence type "Command 1", "original reaction" is always selected; "no reaction" and "safe reaction" are not set. When the system 1 recognizes a voice instruction classified as "Command 1", it executes the response corresponding to that command regardless of the situation.

For the input sentence type "Command 2", "safe reaction" is selected when the utterance reliability is low and no face can be detected, and "original reaction" is selected when the utterance reliability is normal and a face has been detected; "no reaction" is not set.

Examples of a "safe reaction" to a voice instruction include asking the user to confirm the instruction (for example, responding "A dance?" to "Do a dance") and asking to put off execution (for example, "A dance? Please give me a moment to get ready").

For the input sentence type "Greeting 1", "no reaction" is selected when the utterance reliability is low and "original reaction" is selected when the utterance reliability is normal; "safe reaction" is not set.

For the input sentence type "Greeting 2", "safe reaction" is selected when the utterance reliability is low and no face can be detected, and "original reaction" is selected when the utterance reliability is normal and a face has been detected; "no reaction" is not set.

An example of a "safe reaction" to a greeting is to return a commonplace greeting that can be taken in various ways, such as "Hi" or "Yes".

As described above, the reaction level database 30 can be used to determine the reaction level for the response to an input sentence. Even entirely different input sentences are answered with the same degree of strength if their reaction level is the same. Conversely, even the same input sentence receives a different reaction level if the situation (whether or not a face was detected) differs, and the strength of the response changes as a result.
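The per-type conditions described for FIG. 5 can be written as a small rule table. The sketch below is an illustration only: the table layout and function name are hypothetical, and `None` in a rule means "any value". Combinations that the text does not describe (for example, "Command 2" with normal reliability and no face) are deliberately left undefined and return `None` rather than being filled in.

```python
from typing import Dict, List, Optional, Tuple

NO_REACTION = "no reaction"
SAFE = "safe reaction"
ORIGINAL = "original reaction"

# input sentence type -> list of (reliability rank, face detected, reaction level).
# A None entry matches any value.
Rule = Tuple[Optional[str], Optional[bool], str]
REACTION_RULES: Dict[str, List[Rule]] = {
    "Talk 1": [("low", None, NO_REACTION), ("normal", None, ORIGINAL)],
    "Talk 2": [("low", None, NO_REACTION), ("normal", False, SAFE),
               ("normal", True, ORIGINAL)],
    "Command 1": [(None, None, ORIGINAL)],
    "Command 2": [("low", False, SAFE), ("normal", True, ORIGINAL)],
    "Greeting 1": [("low", None, NO_REACTION), ("normal", None, ORIGINAL)],
    "Greeting 2": [("low", False, SAFE), ("normal", True, ORIGINAL)],
}


def decide_reaction_level(input_type: str, reliability: str,
                          face_detected: bool) -> Optional[str]:
    """Return the first matching reaction level, or None if the text defines none."""
    for rule_reliability, rule_face, level in REACTION_RULES.get(input_type, []):
        if rule_reliability is not None and rule_reliability != reliability:
            continue
        if rule_face is not None and rule_face != face_detected:
            continue
        return level
    return None
```

Keeping the rules in data rather than in branching code mirrors the database-driven design: the same input sentence type yields a different level when only the face condition changes.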

FIG. 6 shows an example of the response definition database 31 used by the response determination unit 15. For each meaning label, the response definition database 31 defines response content according to the reaction level, or according to the reaction level and the situation.

An input sentence belonging to the meaning label "What is your name?" is answered with the "original reaction" when the utterance reliability is normal. The system 1 performs personal authentication of the speaker and, if the speaker is unregistered, responds "My name is ○○". If the speaker is registered, it responds along the lines of "My name is ○○. I thought you already knew". Whether the speaker is registered in the system 1 can be judged from the similarity to a pre-registered user voice or face image.

An input sentence belonging to the meaning label "Answer me" is classified into the input sentence type "Talk 1", so when its utterance reliability is normal, the system responds to it with an "original reaction" such as "Oh! I was spacing out".

An input sentence belonging to the meaning label "You're smart" is classified into the input sentence type "Talk 2", so when the utterance reliability is normal and no face can be detected, the system responds with a "safe reaction" such as "Am I being praised, by any chance?". When the utterance reliability is normal and a face has been detected, it responds with an "original reaction" such as "I still have a lot to learn".

An input sentence belonging to the meaning label "What's the date today?" is classified into the input sentence type "Command 1", so the system always responds to it with the "original reaction", for example "It is the ○th of month ○", regardless of the situation.

An input sentence belonging to the meaning label "Dance for me" is classified into the input sentence type "Command 2", so when the utterance reliability is low and no face can be detected, the system gives a "safe reaction" that asks the user to confirm the instruction, such as "A dance?". When the utterance reliability is normal and a face has been detected, it gives the "original reaction": it runs a dance program stored in advance and dances.

An input sentence belonging to the meaning label "Nice to meet you" is classified into the input sentence type "Greeting 1", so a response is returned as the "original reaction" when the utterance reliability is normal. The system 1 responds by taking into account not only the reaction level but also the circumstances in which the voice input sentence was detected. For example, the system 1 determines whether the image analysis unit 22 has detected a face in the image data, then performs personal authentication of the speaker, and further determines whether the detected face is in profile. Personal authentication may be based on the similarity between a registered user voice and the input sentence or, where possible, on the degree of match between registered face data of the user and the face data detected by the image analysis unit 22.

The first situation is that of an unregistered speaker whose face cannot be detected, or an unregistered speaker for whom only a profile has been detected. In the first situation, the system responds with an "original reaction" such as "Nice to meet you, my name is ○○".

The second situation is one in which a frontal face has been detected and the speaker is unregistered. In the second situation, the system responds with an "original reaction" such as "Nice to meet you, my name is ○○. Please shake hands with me".

The third situation is one in which a frontal face has been detected and the speaker is registered. In the third situation, the system responds with an "original reaction" such as "Nice to meet you, my name is ○○. Huh? You already know that, don't you?".

In this way, response content can be defined for each meaning label based on the reaction level and the situation. As a result, uniform responses can be returned at a consistent reaction level, and natural dialogue can be achieved.
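The three situations for the meaning label "Nice to meet you" can be sketched as a function of the face detection result and the registration status. The function name, the `face` values ("none", "profile", "front"), and the English wording of the replies are assumptions for illustration. A registered speaker without a frontal face is not covered by the three situations, so the sketch returns `None` for that case instead of guessing.

```python
from typing import Optional


def greeting_response(face: str, registered: bool, name: str = "○○") -> Optional[str]:
    """Original reaction to 'Nice to meet you' for the three described situations."""
    if face not in ("none", "profile", "front"):
        raise ValueError("face must be 'none', 'profile', or 'front'")
    base = f"Nice to meet you, my name is {name}."
    if not registered and face in ("none", "profile"):
        return base                                      # first situation
    if not registered and face == "front":
        return base + " Please shake hands with me."     # second situation
    if registered and face == "front":
        return base + " Huh? You already know that, don't you?"  # third situation
    return None  # registered speaker without a frontal face: not described
```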

FIG. 7 is a flowchart showing the overall processing of the speech recognition system 1. When the system 1 detects some sound via the voice input unit 21 (S1), it analyzes the input sound, extracts the predetermined parameters described with reference to FIG. 2 (S2), and determines the utterance reliability based on those parameters (S3).

The system 1 determines whether the utterance reliability calculated for the input sentence is equal to or greater than the predetermined lower limit value (S4), and ends this processing if it determines that the reliability is below the lower limit value (S4: NO). If the system 1 determines that the input sentence has an utterance reliability equal to or greater than the lower limit value (S4: YES), it uses the input sentence classification database 29 to classify the input sentence by meaning label and input sentence type (S5).

The system 1 uses the reaction level database 30 to determine the reaction level for the input sentence type into which the input sentence has been classified (S6). The system 1 then uses the response definition database 31 to determine the response content based on the meaning label set for the input sentence and the determined reaction level (S7). The system 1 outputs the determined response content using one or more of the voice output unit 41, the display unit 42, and the operating mechanism 43 (S8).
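Steps S4 to S7 of FIG. 7 can be strung together as in the sketch below. It is a hedged outline rather than the actual control flow: the function name and signature are hypothetical, the three database lookups are passed in as callables, the threshold values are supplied by the caller, and unregistered sentences are simply ignored (one of the two options the description allows). Sound detection, parameter extraction, and output (S1 to S3 and S8) are left to the caller.

```python
from typing import Callable, Optional, Tuple


def process_utterance(
    sentence: str,
    reliability: float,
    face_detected: bool,
    *,
    lower_limit: float,
    normal_value: float,
    classify: Callable[[str], Optional[Tuple[str, str]]],
    decide_level: Callable[[str, str, bool], Optional[str]],
    choose_response: Callable[[str, str], str],
) -> Optional[str]:
    """Return the response content, or None when the system should stay silent."""
    # S4: drop input whose utterance reliability is below the lower limit value.
    if reliability < lower_limit:
        return None
    rank = "normal" if reliability >= normal_value else "low"

    # S5: classify the sentence by meaning label and input sentence type.
    classified = classify(sentence)
    if classified is None:
        return None  # unregistered sentence: ignored in this sketch
    label, input_type = classified

    # S6: decide the reaction level for the input sentence type.
    level = decide_level(input_type, rank, face_detected)
    if level is None or level == "no reaction":
        return None

    # S7: decide the response content from the meaning label and reaction level.
    return choose_response(label, level)
```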

FIG. 8 is a flowchart showing an example of the input sentence classification processing S5 in FIG. 7. The input sentence classification unit 13 of the system 1 searches the input sentence classification database 29 to determine whether there is a meaning label corresponding to the input sentence (S50). If the input sentence classification unit 13 determines that there is a corresponding meaning label (S50: YES), it sets that meaning label for the input sentence (S51). The input sentence classification unit 13 then determines the input sentence type to which the selected meaning label belongs as the classification destination of the input sentence (S52).

According to this embodiment configured as described above, an input sentence is classified into an input sentence type according to its meaning, a reaction level is defined for each input sentence type in consideration of the utterance reliability and the situation, and response content is determined for each meaning label according to the reaction level and the situation.

Thus, in this embodiment the processing for determining the reaction level (the input sentence classification unit 13 and the reaction level determination unit 14; steps S5 and S6) is separated from the processing for determining the response content (the response determination unit 15; step S7), and the reaction level is set for each input sentence type. Uniform, consistent responses can therefore be returned according to the meaning of the input sentence, and more natural dialogue can be achieved.

Furthermore, because response content corresponding to the reaction level and the situation is defined for each meaning of the input sentence, responses that are both uniform and flexible can be returned.

In this embodiment, input sentences are grouped by setting for each a meaning label according to its meaning, so new words and expressions can be accommodated flexibly and quickly.

In this embodiment, the face image obtained by analyzing the image data is used not only as one of the parameters for calculating the utterance reliability but also as situation determination information for judging the state of the dialogue. The system can therefore judge the situation from the presence or absence of a face and the like before responding, as most people actually do, and more natural dialogue can be achieved.

The present invention is not limited to the embodiment described above. A person skilled in the art can make various additions, modifications, and the like within the scope of the present invention.

1: speech recognition system, 2: robot, 3: home appliance, 11: speech recognition unit, 12: utterance reliability determination unit, 13: input sentence classification unit, 14: reaction level determination unit, 15: response determination unit, 16: response output unit, 21: voice input unit, 22: image analysis unit, 23: camera, 28: recognition result history database, 41: voice output unit, 42: display unit, 43: operating mechanism

Claims

A speech recognition system that recognizes and responds to speech,
A speech recognition unit that recognizes a speech input sentence input from the speech input unit;
A determination unit that determines whether the voice input sentence is a voice input to the system based on a recognition result of the voice input sentence by the voice recognition unit and predetermined information;
An input sentence classification unit for classifying the voice input sentence into any of a plurality of input sentence types prepared in advance;
A reaction level determination unit that determines, based on the input sentence type into which the voice input sentence is classified and the determination result of the voice input sentence by the determination unit, a reaction level for the voice input sentence from among a plurality of reaction levels that are prepared in advance for each input sentence type and defined according to the degree of reaction;
A response determination unit that determines a response to the voice input sentence from among predetermined responses based on the determined reaction level;
A response output unit for outputting the determined response;
A speech recognition system comprising:

The input sentence classification unit classifies the voice input sentence into one of a plurality of meaning groups prepared in advance according to sentence meaning, and further classifies the voice input sentence into, from among a plurality of input sentence types each associated with one or more meaning groups, the input sentence type corresponding to the meaning group into which the voice input sentence has been classified,
The speech recognition system according to claim 1.

Further comprising an image output unit that outputs an image of a predetermined range around the system itself,
The reaction level determination unit determines the reaction level for the voice input sentence from among the plurality of reaction levels based on the input sentence type, the determination result, and an analysis result of the image output from the image output unit,
The speech recognition system according to claim 1.

The response determination unit determines the response to the voice input sentence based on the determined reaction level and the analysis result of the image output by the image output unit,
The speech recognition system according to claim 3.

The predetermined information includes the analysis result of the image from the image output unit and/or sound source direction information indicating the input direction of the voice input sentence,
The determination unit makes the determination based on a first parameter obtained from an analysis result of the latest voice input sentence recognized by the speech recognition unit, a second parameter obtained from a history of recognition results by the speech recognition unit, and a third parameter obtained from the analysis result of the image and/or the sound source direction information,
The speech recognition system according to claim 3.

The first parameter includes at least one of: a keyword rate indicating the proportion of predetermined keywords contained in the latest input sentence, the sound intensity of each word constituting the latest input sentence, and the sound duration of each word constituting the latest input sentence,
The speech recognition system according to claim 5.

The second parameter includes a likelihood indicating whether a conversation unrelated to a voice instruction from the user is taking place nearby,
The third parameter includes at least one of whether the user's face has been detected and the orientation of the detected face,
The speech recognition system according to claim 5 or 6.

A method for controlling a speech recognition system that recognizes and responds to speech,
A voice recognition step for recognizing a voice input sentence input from the voice input unit;
A determination step of determining whether or not the voice input sentence is a voice input to the own system based on a recognition result of the voice input sentence by the voice recognition step and predetermined information;
An input sentence classification step for classifying the voice input sentence into any of a plurality of input sentence types prepared in advance;
A reaction level determination step of determining, based on the input sentence type into which the voice input sentence is classified and the determination result of the voice input sentence in the determination step, a reaction level for the voice input sentence from among a plurality of reaction levels that are prepared in advance for each input sentence type and defined according to the degree of reaction;
A response determination step of determining a response to the voice input sentence from among predetermined responses based on the determined reaction level;
A response output step of outputting the determined response;
A method for controlling a speech recognition system comprising:

A computer program for causing a computer to function as a speech recognition system that recognizes and responds to speech,
A voice recognition unit for recognizing a voice input sentence input from a voice input unit connected to the computer;
A determination unit that determines whether the voice input sentence is a voice input to the system based on a recognition result of the voice input sentence by the voice recognition unit and predetermined information;
An input sentence classification unit for classifying the voice input sentence into any of a plurality of input sentence types prepared in advance;
A reaction level determination unit that determines, based on the input sentence type into which the voice input sentence is classified and the determination result of the voice input sentence by the determination unit, a reaction level for the voice input sentence from among a plurality of reaction levels that are prepared in advance for each input sentence type and defined according to the degree of reaction;
A response determination unit that determines a response to the voice input sentence from among predetermined responses based on the determined reaction level;
A response output unit for outputting the determined response;
A computer program for realizing the above on the computer.