JP6643468B2

JP6643468B2 - Response control device, control program, information processing method, and communication system

Info

Publication number: JP6643468B2
Application number: JP2018518113A
Authority: JP
Inventors: 田上　文俊; 文俊田上; 拓也小柳津
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2016-05-18
Filing date: 2017-03-07
Publication date: 2020-02-12
Anticipated expiration: 2037-03-07
Also published as: WO2017199545A1; JPWO2017199545A1

Description

The present invention relates to a response control device, a control program, an information processing method, and a communication system including a server and a response control device. This application claims priority based on Japanese Patent Application No. 2016-099488, filed on May 18, 2016. The entire contents of that Japanese patent application are incorporated herein by reference.

Voice interaction devices for conducting spoken dialogue with a user are conventionally known. To handle a wide range of user questions, such a device must store a vast number of questions and responses in advance. Even with that preparation, users may ask unanticipated questions, so handling every question perfectly from the outset has been impossible.

For this reason, for example, Japanese Patent Laying-Open No. 2004-109323 (Patent Document 1) discloses a voice interaction device that, upon recognizing a question not stored in advance, asks the user what the response to that question should be, stores the user's answer as the response content, and learns to use it in subsequent dialogues.

Patent Document 1: JP 2004-109323 A

However, when a conventional voice interaction device recognizes a question that is not stored in advance, it immediately asks the user for the response content. Consequently, even if the user's actual question is a stored one, the device asks back whenever the question is misrecognized as an unstored question, for example because speech recognition failed.

As a result, unless speech recognition is 100% accurate, the conventional voice interaction device cannot perform the ask-back process as intended, which may disappoint the user. Moreover, when the ask-back is triggered by a speech recognition failure, its content is based on the misrecognition and is not what the user intended. Unless that content happens to be something the user anticipated, the user's answer may be inappropriate.

The present disclosure has been made in view of the above problems. An object in one aspect is to provide a response control device, a control program, an information processing method, and a communication system that can prevent inappropriate asking back and learning.

According to one aspect, a response control device includes: voice receiving means for receiving voice input; response process executing means for executing, when a phrase identified from the voice received by the voice receiving means is one of a plurality of predetermined phrases, a response process corresponding to that phrase; and learning process executing means for executing, when the phrase identified from the received voice is none of the plurality of phrases and the frequency with which that phrase has been identified has reached a predetermined frequency, a learning process that makes it possible thereafter to identify a response process corresponding to that phrase.

According to another aspect, a control program causes a computer to function as the response control device, that is, as each of the above means.

According to yet another aspect, an information processing method includes the steps of: receiving voice input; executing, when a phrase identified from the received voice is one of a plurality of predetermined phrases, a response process corresponding to that phrase; and executing, when the phrase identified from the received voice is none of the plurality of phrases and the frequency with which that phrase has been identified has reached a predetermined frequency, a learning process that makes it possible thereafter to identify a response process corresponding to that phrase.

According to yet another aspect, a communication system includes a server and a response control device capable of communicating with the server. The response control device includes: voice receiving means for receiving voice input; communication means for transmitting voice information corresponding to the voice received by the voice receiving means and receiving response information from the server; and response process executing means for executing a response process based on the received response information. The server includes: storage means for storing response information corresponding to each of a plurality of predetermined phrases; response information transmitting means for transmitting, when a phrase identified from the voice information from the response control device is one of the plurality of phrases, the response information corresponding to that phrase; and learning process executing means for executing, when the phrase identified from the voice information from the response control device is none of the plurality of phrases and the frequency with which that phrase has been identified has reached a predetermined frequency, a learning process that makes it possible thereafter to identify a response process corresponding to that phrase.

According to one aspect, inappropriate asking back and learning can be prevented.

FIG. 1 is a diagram for explaining a schematic configuration of a communication system.
FIG. 2 is a diagram showing an example of a conversation between a user and a terminal.
FIG. 3 is a diagram showing an example of a hardware configuration of the terminal.
FIG. 4 is a diagram showing an example of a hardware configuration of a server device.
FIG. 5 is a functional block diagram for explaining a functional configuration of the terminal.
FIG. 6 is a diagram for explaining a schematic configuration of a response phrase DB, a speech recognition result DB, and a learning result DB stored in a storage unit.
FIG. 7 is a flowchart for explaining the flow of the voice input response process.
FIG. 8 is a functional block diagram for explaining functional configurations of the terminal and the server device.
FIG. 9 is a diagram for explaining a schematic configuration of a response phrase DB in which response phrases are stored for regular phrases and for approximate phrases.

Embodiments are described below with reference to the drawings. In the following description, identical components are given identical reference numerals. Their names and functions are also the same, so detailed descriptions of them are not repeated.

[Embodiment 1]
<A. System Configuration>
FIG. 1 is a diagram for explaining a schematic configuration of the communication system according to the present embodiment. Referring to FIG. 1, the communication system 1 includes a mobile terminal 100 (hereinafter also referred to as the terminal 100) and a server device 200. The terminal 100 is an example of a response control device and performs a process of outputting a response phrase to the user's voice (the voice input response process). In the following, the terminal 100 is described taking as an example a terminal that can automatically move the movable parts of its housing by executing a program (a so-called robot-type terminal).

Specifically, the terminal 100 has hands, feet, a head, a torso, and the like. The terminal 100 is typically configured as an autonomous mobile body capable of walking. The head is rotatable within a predetermined angle relative to the torso, and a camera is built into the head. The terminal 100 is not limited to a humanoid robot of this kind.

The terminal 100 is carried around by the user 700 and is therefore used in various places. The terminal 100 communicates with the server device 200 via the base station 500 and the network 600.

<B. Overview of Processing>
An overview of the processing in the communication system 1 follows. The terminal 100 performs speech recognition on the voice uttered by the user 700 and identifies a phrase. A phrase here means, for example, a clause, a word, or a group of words. Speech recognition means converting input voice data into the corresponding phrase. The terminal 100 stores in advance response phrases corresponding to a plurality of expected phrases (also called regular phrases). When the terminal 100 has a response phrase stored for the phrase identified from the voice, it outputs that response phrase.

The terminal 100 may also misrecognize part of the voice because of the user's pronunciation, ambient noise, and the like. To guard against such misrecognition, the terminal 100 also stores in advance response phrases for phrases that approximate the expected regular phrases. An approximate phrase is one that differs from a regular phrase only in, for example, the presence or absence of a dakuten (voiced sound mark, 「゛」), a sokuon (small tsu, 「ッ」), or a long vowel mark (「ー」). Thus, even when the terminal 100 has no response phrase stored for the identified phrase, it outputs a response phrase if one is stored for a phrase that approximates the identified phrase.
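The approximation criterion above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the source describes storing approximate phrases in advance, whereas this sketch swaps in a computed comparison key that drops dakuten, sokuon, and long vowel marks. The names `approx_key` and `is_approximate` are hypothetical, and the key is a simplification that also treats phrases differing in several such marks as approximate.

```python
import unicodedata

DAKUTEN = "\u3099"  # combining voiced sound mark, exposed by NFD decomposition
SOKUON = "ッ"        # small tsu
CHOON = "ー"         # long vowel mark


def approx_key(phrase: str) -> str:
    """Reduce a katakana phrase to a key that ignores dakuten, sokuon, and long vowel marks."""
    # NFD splits a voiced kana such as ダ into タ plus the combining dakuten.
    decomposed = unicodedata.normalize("NFD", phrase)
    kept = [ch for ch in decomposed if ch not in (DAKUTEN, SOKUON, CHOON)]
    return unicodedata.normalize("NFC", "".join(kept))


def is_approximate(recognized: str, regular: str) -> bool:
    """True when the two phrases differ, but only in the ignored marks (assumed criterion)."""
    return recognized != regular and approx_key(recognized) == approx_key(regular)
```

With this key, the katakana strings ダイジュー (daijū) and タイジュー (taijū) compare as approximate, while an identical pair or an unrelated pair does not.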

Furthermore, when a learning condition is satisfied, the terminal 100 performs a learning process for learning a response phrase. The learning condition is satisfied, for example, when an unstored phrase has been identified at a predetermined frequency (for example, twice in succession) or when a predetermined learning start phrase has been identified. An overview follows.

FIG. 2 shows an example of a conversation between the user 700 and the terminal 100. The speech balloons in FIG. 2 (steps M01 to M12) represent voice uttered by the user 700 or voice output from the terminal 100. The boxes in FIG. 2 (steps S01 to S13) outline the processing executed by the terminal 100.

First, the case where the terminal 100 has a response phrase stored for the phrase identified from the voice is described.

As shown in step M01, suppose the user 700 says "Shinchō wa?" ("What is your height?") to the terminal 100. The terminal 100 performs speech recognition on the message from the user 700, then extracts and outputs the response phrase corresponding to the recognition result. The terminal 100 also stores the phrase identified as the speech recognition result in a history.

In FIG. 2, as shown in step S01, the terminal 100 recognizes the voice from the user 700 as "Shinchō wa?" and, from that phrase, recognizes that it is being asked about "shinchō" (height).

In step S02, the terminal 100 extracts the response phrase that matches "shinchō", the subject of the question. In this example, "My height is about 19 cm." is stored as the response phrase matching "shinchō". Accordingly, as shown in step M02, the terminal 100 outputs the response phrase "My height is about 19 cm."

Next, the case is described where the terminal 100 has no response phrase stored for the phrase identified from the voice but does have one stored for a phrase approximating it (also written "phrase (approximate)").

As shown in step M03, suppose the user 700 says "Taijū wa?" ("What is your weight?") to the terminal 100. As shown in step S03, suppose the terminal 100 misrecognizes this as "Daijū wa?", in which the first character carries a dakuten.

However, no phrase matching the meaningless word "daijū" was anticipated at the design stage, so none is stored. In such a case, the terminal 100 determines whether a phrase approximating the phrase identified from the voice is stored and, if so, extracts and outputs the response phrase corresponding to that approximate phrase. Specifically, the terminal 100 determines whether a stored phrase differs from the phrase identified from the speech recognition result only in the presence or absence of a dakuten, a sokuon, a long vowel mark, or the like.

In the example of FIG. 2, suppose "taijū", which is "daijū" without the dakuten on its first character "da", is stored. In this case, the terminal 100 determines that "daijū" approximates "taijū" and, as shown in step M04, extracts the response phrase that matches the approximate phrase "taijū". In this example, "Do you perhaps mean weight? My weight is about 300 g." is stored as the response phrase matching the approximate phrase "taijū". Accordingly, as shown in step M04, the terminal 100 outputs the response phrase "Do you perhaps mean weight? My weight is about 300 g."

Next, the case is described where neither a response phrase for the phrase identified from the voice nor a response phrase for an approximate phrase is stored.

As shown in step M05, suppose the user 700 says "Ashi no ōkisa wa?" ("How big are your feet?") to the terminal 100. As shown in step S05, suppose the terminal 100 correctly recognizes the voice from the user 700 as "Ashi no ōkisa wa?".

However, if being asked "Ashi no ōkisa wa?" was not anticipated at the design stage, neither the phrase "Ashi no ōkisa wa?" nor any phrase approximating it is stored.

In this case, the terminal 100 determines in step S06 that there is no response phrase corresponding to the identified phrase and, in step S07, stores the speech recognition result "Ashi no ōkisa wa?" itself in the history. The terminal 100 then outputs the response phrase defined for the case where an unknown phrase is identified. That response phrase prompts the user to speak again; for example, as shown in step M06, the phrase "I couldn't hear you well." is defined. The terminal 100 outputs this response phrase and drives its head so as to strike a head-tilting pose.

As shown in step M07, suppose that after an unknown phrase has been identified, the user 700 says "Ashi no ōkisa wa?" again. As shown in steps S08 and S09, the terminal 100 determines, as before, that no response phrase is stored. It then determines whether the current speech recognition result matches the most recent (previous) speech recognition result stored in step S07.

As shown in step S10, when the terminal 100 determines that the current speech recognition result matches the previous one, it is highly probable that the voice was not misrecognized and that the user 700 intentionally said "Ashi no ōkisa wa?". The terminal 100 therefore performs the learning process described below.
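The repeat check in steps S07 to S10 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patent's implementation: the class `RecognitionHistory` and its method `should_learn` are hypothetical names, the history keeps only the most recent recognition result, and the predetermined frequency is fixed at two consecutive identifications, which is the example the text gives.

```python
from typing import Optional


class RecognitionHistory:
    """Keeps the previous recognition result (hypothetical helper)."""

    def __init__(self) -> None:
        self._previous: Optional[str] = None

    def should_learn(self, phrase: str, known: bool) -> bool:
        """Record the phrase and report whether the learning process should start.

        Learning starts only when the phrase has no stored response (known is
        False) and it equals the immediately preceding recognition result.
        """
        repeated = (not known) and phrase == self._previous
        self._previous = phrase  # every result is stored, as in steps S01 and S07
        return repeated
```

A single unknown result therefore only produces the "speak again" prompt; a second identical result is what starts learning.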

First, as shown in step M08, the terminal 100 outputs, based on the identified unknown phrase, a response phrase such as "What should I answer when asked 'Ashi no ōkisa wa?'?". Because the terminal 100 parrots the phrase back in this way, it avoids asking the user something the user cannot make sense of. Suppose that in response to this question the user 700 says "5 cm da yo." ("It's 5 cm."), as shown in step M09.

As shown in step S11, the terminal 100 recognizes the voice from the user 700 as "Go senchimētoru da yo" and, based on that result, outputs a response phrase such as "Should I answer 'Go senchimētoru da yo'?", as shown in step M10. Suppose that in response the user 700 says "Ōkē" ("OK"), as shown in step M11. When the terminal 100 recognizes that voice as "Ōkē", as shown in step S12, it stores the phrase "Go senchimētoru da yo" as the response phrase for "Ashi no ōkisa wa?", as shown in step S13, and then outputs a response phrase such as "Got it.", as shown in step M12.
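The learning dialogue in steps M08 to M12 can be sketched as a short routine. This Python sketch is illustrative only: `run_learning`, `listen`, `speak`, and `CONFIRM_PHRASE` are hypothetical names, speech input and output are passed in as plain callables, and returning without storing anything when the user does not confirm is an assumption, since this section does not describe that branch.

```python
from typing import Callable, Dict

CONFIRM_PHRASE = "オーケー"  # "Ōkē" (OK), the confirmation recognized in step S12


def run_learning(
    unknown: str,
    listen: Callable[[], str],
    speak: Callable[[str], None],
    learned: Dict[str, str],
) -> bool:
    """Ask for a response to an unknown phrase, confirm it, then store it."""
    # Step M08: echo the unknown phrase back and ask what to answer.
    speak(f'What should I answer when asked "{unknown}"?')
    answer = listen()  # steps M09 and S11
    # Step M10: read the candidate answer back for confirmation.
    speak(f'Should I answer "{answer}"?')
    if listen() != CONFIRM_PHRASE:  # steps M11 and S12
        return False  # assumption: nothing is stored without confirmation
    learned[unknown] = answer  # step S13
    speak("Got it.")  # step M12
    return True
```

Passing `listen` and `speak` as parameters keeps the sketch independent of any particular speech recognition or synthesis library.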

Once this learning process has been performed, the terminal 100 can output the stored response phrase "Go senchimētoru da yo" whenever it identifies "Ashi no ōkisa wa?" as the speech recognition result. In addition, when the speech recognition result approximates "ashi no ōkisa" (for example, "aji no ōkisa"), the terminal 100 may output "Do you perhaps mean ashi no ōkisa? Ashi no ōkisa is go senchimētoru da yo." as the response phrase.

As described above, when a response phrase is stored for the phrase identified from the voice uttered by the user 700, the terminal 100 outputs that response phrase (steps S01, S02, M02). Even when no response phrase is stored for the identified phrase, the terminal 100 outputs a response phrase if one is stored for a phrase approximating the identified phrase (steps S03, S04, M04). Furthermore, when no response phrase is stored for an approximate phrase either, and the identified phrase has been recognized at a predetermined frequency (for example, twice in succession), the terminal 100 performs the learning process for learning a response phrase corresponding to that phrase (steps S05 to S13, M06 to M12).
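The three cases summarized above can be put together in one small dispatcher. This Python sketch rests on several assumptions: `Responder`, `handle`, and `approx_key` are hypothetical names, approximation is computed with a normalization key rather than looked up among pre-stored approximate phrases as the source describes, and the same response text is returned for an approximate match, whereas the source stores a separate "Do you perhaps mean ...?" response.

```python
import unicodedata
from typing import Dict, Optional, Tuple


def approx_key(phrase: str) -> str:
    """Key that ignores dakuten, sokuon (ッ), and long vowel marks (ー)."""
    decomposed = unicodedata.normalize("NFD", phrase)
    kept = [ch for ch in decomposed if ch not in ("\u3099", "ッ", "ー")]
    return unicodedata.normalize("NFC", "".join(kept))


class Responder:
    def __init__(self, responses: Dict[str, str]) -> None:
        self.responses = dict(responses)
        self.previous: Optional[str] = None

    def handle(self, phrase: str) -> Tuple[str, Optional[str]]:
        """Return (action, response); action is exact, approximate, retry, or learn."""
        previous, self.previous = self.previous, phrase
        if phrase in self.responses:  # exact match with a stored phrase
            return "exact", self.responses[phrase]
        key = approx_key(phrase)
        for regular, response in self.responses.items():
            if approx_key(regular) == key:  # approximate match
                return "approximate", response
        if phrase == previous:  # same unknown phrase twice in succession
            return "learn", None
        return "retry", None  # prompt the user to speak again
```

For example, with responses stored for シンチョー and タイジュー, the input ダイジュー yields an approximate match, and an unknown phrase yields `retry` the first time and `learn` the second.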

<C. Hardware Configuration>
FIG. 3 shows an example of the hardware configuration of the terminal 100. Referring to FIG. 3, the terminal 100 includes, as its main components, a CPU (Central Processing Unit) 151 that executes programs, a ROM (Read-Only Memory) 152 that stores data in a nonvolatile manner, a RAM (Random Access Memory) 153 that stores, in a volatile manner, data generated when the CPU 151 executes a program or data input via an input device, a flash memory 154 that stores data in a nonvolatile manner, an LED (Light Emitting Diode) 155, operation keys 156, a switch 157, a GPS (Global Positioning System) receiver 158, a communication IF (Interface) 159, a power supply circuit 160, a touch screen 161, a microphone 162, a speaker 163, a camera 164, a driving device 165, and antennas 1581 and 1591. The components are connected to one another by a data bus.

The touch screen 161 consists of a display 1611 and a touch panel 1612. The antenna 1581 is the antenna for the GPS receiver 158. The antenna 1591 is the antenna for the communication IF 159.

The LED 155 represents various indicator lamps showing the operating state of the terminal 100. For example, the LED 155 indicates whether the main power supply of the terminal 100 is on or off and whether the flash memory 154 is being read from or written to.

The operation keys 156 are keys (operation buttons) with which the user of the terminal 100 turns the main power supply on or off, among other operations. The switch 157 represents the main power switch, which switches whether power is supplied to the power supply circuit 160, and various other push-button switches.

The GPS receiver 158 acquires position information on the current position of the terminal 100 based on radio waves from four or more GPS satellites. The position information acquired by the GPS receiver 158 is transmitted to the server device 200 via the communication IF 159. The timing at which the terminal 100 starts acquiring position information is described later.

The communication IF 159 transmits data to the server device 200 and receives data transmitted from the server device 200.

The power supply circuit 160 steps down the voltage of the commercial power received via an outlet and supplies power to each part of the terminal 100.

The touch screen 161 is a device for displaying various data and accepting input. The display 1611 includes a screen for displaying images.

The microphone 162 picks up sound around the terminal 100. For example, the microphone 162 picks up the voice uttered by the user 700.

The speaker 163 outputs the voice corresponding to a response phrase. In one aspect, the speaker 163 produces utterances for communicating with the user and others.

The camera 164 is an imaging device for capturing images of subjects around the terminal 100. Image data captured by the camera 164 is transmitted to the server device 200 via the communication IF 159.

The driving device 165 is a drive mechanism for driving the hands, feet, and head of the terminal 100. The terminal 100 walks when the driving device 165 drives its feet. When the driving device 165 rotates the head relative to the torso, the orientation of the camera 164 changes. By having the driving device 165 change the angle of the head, the terminal 100 can also strike a head-tilting pose.

The processing in the terminal 100 (for example, the voice input response process) is realized by the hardware components and by software (a control program) executed by the CPU 151. Such software may be stored in the flash memory 154 in advance. It may also be stored on another storage medium and distributed as a program product, or it may be provided as a downloadable program product by an information provider connected to the Internet. Such software is read from the storage medium by a reading device, or downloaded via the communication IF 159 or the like, and is then stored in the flash memory 154. The CPU 151 reads the software from the flash memory 154 and stores it in the RAM 153 in the form of an executable program. The CPU 151 then executes that program.

The components of the terminal 100 shown in FIG. 3 are conventional. It can therefore be said that the essential part of the present disclosure is the software stored in the RAM 153, the flash memory 154, or a storage medium, or the software downloadable via a network. Because the operation of each piece of hardware in the terminal 100 is well known, it is not described in detail.

The recording medium is not limited to a DVD (Digital Versatile Disc)-RAM and may be any medium that carries the program in a fixed manner, such as a DVD-ROM, a CD (Compact Disc)-ROM, an FD (Flexible Disc), a hard disk, a magnetic tape, a cassette tape, an optical disc, or a semiconductor memory such as an EEPROM (Electrically Erasable Programmable ROM) or a flash ROM. The recording medium is a non-transitory medium from which a computer can read the program and the like. The term program here covers not only programs directly executable by the CPU but also programs in source form, compressed programs, encrypted programs, and the like.

FIG. 4 is a diagram illustrating an example of the hardware configuration of the server device 200. Referring to FIG. 4, the server device 200 includes, as its main components, a CPU 251 that executes programs, a ROM 252 that stores data in a nonvolatile manner, a RAM 253 that stores, in a volatile manner, data generated by the CPU 251 executing a program or data entered via an input device, an HDD (Hard Disc Drive) 254 that stores data in a nonvolatile manner, an LED 255, a switch 256, a communication IF (Interface) 257, a power supply circuit 258, a display 259, and operation keys 260. The components are connected to one another by a data bus.

The power supply circuit 258 steps down the voltage of the commercial power received via an outlet and supplies power to each part of the server device 200. The switch 256 comprises a main power switch for switching whether power is fed to the power supply circuit 258, and various other push-button switches. The display 259 is a device for displaying various data.

The communication IF 257 transmits data to the terminal 100 and receives data transmitted from the terminal 100.

The LED 255 comprises various indicator lamps that show the operating state of the server device 200. For example, the LED 255 indicates whether the main power of the server device 200 is on or off, and whether the HDD 254 is being read from or written to. The operation keys 260 are keys (a keyboard) that a user of the server device 200 uses to enter data into the server device 200.

Processing in the server device 200 is realized by the hardware and by software executed by the CPU 251. Such software may be stored in the HDD 254 in advance. The software may also be stored on another storage medium and distributed as a program product, or it may be provided as a downloadable program product by an information provider connected to the Internet. Such software is read from the storage medium by a reading device, or downloaded via the communication IF 257 or the like, and is then temporarily stored in the HDD 254. The CPU 251 reads the software from the HDD 254, stores it in the RAM 253 in the form of an executable program, and executes the program.

The components of the server device 200 shown in the figure are general-purpose ones. The essential part of the present disclosure can therefore be said to be the software stored in the RAM 253, the HDD 254, or a storage medium, or software that can be downloaded via a network. Since the operation of each piece of hardware of the server device 200 is well known, a detailed description will not be repeated.

The recording medium is not limited to a DVD-RAM and may be any medium that holds the program in a fixed manner, such as a DVD-ROM, a CD-ROM, an FD, a hard disk, a magnetic tape, a cassette tape, an optical disc, or a semiconductor memory such as an EEPROM or a flash ROM. The recording medium is a non-transitory medium from which a computer can read the program and the like. The term "program" here covers not only programs directly executable by a CPU but also programs in source form, compressed programs, encrypted programs, and the like.

<D. Functional configuration>
FIG. 5 is a functional block diagram for describing the functional configuration of the terminal 100. Referring to FIG. 5, the terminal 100 includes a control unit 111, a storage unit 112, a drive unit 115, a voice input unit 116, a voice output unit 117, and a communication processing unit 118. The terminal 100 also includes a position information acquisition unit 113, an imaging unit 114, and the like, and may further include other functional components.

The voice input unit 116 collects sound around the terminal 100 and sends the collected sound to the control unit 111 as voice data. The voice input unit 116 is constituted by, for example, the microphone 162. The voice output unit 117 outputs the voice corresponding to a response phrase. The voice output unit 117 is constituted by, for example, the speaker 163.

The storage unit 112 stores various control programs and holds a response phrase DB (Data Base) 1121, a speech recognition result DB 1122, and a learning result DB 1123. The storage unit 112 is constituted by, for example, the RAM 153.

The response phrase DB 1121 stores response phrases corresponding to the multiple types of phrases anticipated at the design stage (regular phrases), response phrases used when an input approximates one of those regular phrases, and the like. The speech recognition result DB 1122 stores the results of speech recognition performed on the voice uttered by the user. The learning result DB 1123 stores response phrases corresponding to phrases newly added by the learning process. A specific example is described below.

FIG. 6 is a diagram for describing the schematic configuration of the response phrase DB 1121, the speech recognition result DB 1122, and the learning result DB 1123 stored in the storage unit 112. Referring to FIG. 6(a), the response phrase DB 1121 contains phrases and the response phrase corresponding to each phrase.

The stored phrases include, for example, multiple types of phrases such as "Shinchō" (シンチョー, height) and "Taijū" (タイジュー, weight) (phrase (match)), phrases that approximate each of those phrases (phrase (approximate)), and the learning start phrase "Henji oboete" (ヘンジオボエテ). A response phrase is stored for each phrase. For example, for "Shinchō (match)", the message "My height is about 19 cm." is stored, as shown at M02 in FIG. 2. For "Taijū (approximate)", the message "Do you mean weight, by any chance? My weight is about 300 g." is stored, as shown at M04 in FIG. 2. For the learning start phrase, the message "OK, first tell me the phrase to remember." is stored. In addition, for unknown phrases and for phrases used during the learning process, response phrases such as those shown at M06, M08, M10, and M12 in FIG. 2 are stored.

Referring to FIG. 6(b), the speech recognition result DB 1122 contains the phrases specified as results of speech recognition. In the example of FIG. 6(b), the phrases specified by speech recognition in steps S01, S03, S05, S08, S11, and S12 of FIG. 2 are stored.

Referring to FIG. 6(c), the learning result DB 1123 contains phrases added by learning and the response phrase corresponding to each such phrase. In the example of FIG. 6(c), as a result of step S13 of FIG. 2, the response phrase "Go senchimētoru da yo." (ゴセンチメートルダヨ。) is stored for the added phrase "Ashi no ōkisa wa?" (アシノオオキサハ？).
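The three databases of FIG. 6 can be pictured with the following minimal sketch. The container types, the "match"/"approximate" keys, and the `learn` helper are illustrative assumptions and are not taken from the disclosure; only the example phrases and responses come from the text.

```python
# Hypothetical in-memory stand-ins for the databases of FIG. 6.
# Container types and key names are assumptions made for illustration.

# Response phrase DB 1121: regular phrase -> responses by kind of hit.
response_phrase_db = {
    "シンチョー": {
        "match": "My height is about 19 cm.",
        "approximate": "Do you mean height, by any chance? My height is about 19 cm.",
    },
    "タイジュー": {
        "approximate": "Do you mean weight, by any chance? My weight is about 300 g.",
    },
    "ヘンジオボエテ": {"match": "OK, first tell me the phrase to remember."},
}

# Speech recognition result DB 1122: history of recognized phrases, oldest first.
speech_recognition_result_db = []

# Learning result DB 1123: phrase added by learning -> response phrase.
learning_result_db = {}


def learn(phrase, response):
    """Record a newly learned phrase/response pair (hypothetical helper)."""
    learning_result_db[phrase] = response


# The pair stored in the example of FIG. 6(c).
learn("アシノオオキサハ？", "ゴセンチメートルダヨ。")
```

A dictionary keyed by phrase keeps the lookup for an exact match to a single operation, which is why it is used here; a real implementation could equally use a table in persistent storage.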

The information stored in the response phrase DB 1121 is updated based on update data transmitted periodically from the server device 200. This allows the phrases and response phrases to be updated. The update data is data for specifying the phrases and response phrases entered by, for example, an administrator who manages the server device 200.

The information stored in the learning result DB 1123 can also be transmitted to the server device 200. The server device 200 transmits update data that includes the information from the learning result DB 1123 of the terminal 100 to other terminals. In this way, what has been learned on the terminal 100 can be reflected on other terminals as well.

Returning to FIG. 5, the control unit 111 controls the overall operation of the terminal 100. The control unit 111 includes a speech recognition unit 1110, an utterance content determination unit 1111, an approximation determination unit 1112, a learning function unit 1113, a drive control unit 1114, and a display control unit 1115. The control unit 111 is constituted by, for example, the CPU 151.

The speech recognition unit 1110 has a function of performing speech recognition to specify a phrase based on the voice data input through the voice input unit 116.

The utterance content determination unit 1111 has a function of determining the response phrase to be output from the voice output unit 117. Specifically, the utterance content determination unit 1111 determines whether a response phrase corresponding to the phrase specified by the speech recognition unit 1110 is stored in the response phrase DB 1121 or the learning result DB 1123 and, if so, selects that response phrase.

The approximation determination unit 1112 determines whether a regular phrase that approximates the phrase specified by the speech recognition unit 1110 is stored in the response phrase DB 1121 or the learning result DB 1123. Specifically, the approximation determination unit 1112 determines whether a stored phrase differs from the phrase specified by the speech recognition unit 1110 only in the presence or absence of a dakuten (voicing mark), a sokuon (geminate consonant), a chōonpu (long-vowel mark), or the like.

Even when no response phrase corresponding to the specified phrase is stored, if the approximation determination unit 1112 determines that a phrase approximating it is stored, the utterance content determination unit 1111 selects the response phrase corresponding to that approximate phrase.
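One way to picture the approximation check is to normalize both phrases by stripping the marks that are allowed to differ and then compare the results. The sketch below is an assumption about how such a check could be written, not the method of the disclosure; the function names are hypothetical, and treating the handakuten like the dakuten is an added assumption covered only by the "or the like" in the text.

```python
import unicodedata

# Sokuon (small tsu) and the long-vowel mark are dropped before comparison.
_DROPPED = {"ッ", "っ", "ー"}
# Combining dakuten and handakuten, as produced by NFD decomposition.
# Handling the handakuten here is an assumption.
_VOICING_MARKS = {"\u3099", "\u309a"}


def normalize(phrase):
    """Strip voicing marks, sokuon, and long-vowel marks from a kana phrase."""
    decomposed = unicodedata.normalize("NFD", phrase)
    return "".join(
        ch for ch in decomposed
        if ch not in _DROPPED and ch not in _VOICING_MARKS
    )


def find_approximate(phrase, registered):
    """Return a registered phrase that differs from `phrase` only in the
    stripped marks, or None when there is no such phrase."""
    key = normalize(phrase)
    for candidate in registered:
        if candidate != phrase and normalize(candidate) == key:
            return candidate
    return None
```

With the registered phrases シンチョー and タイジュー, the input ジンチョー resolves to シンチョー and the input タイジュ resolves to タイジュー, while an unrelated phrase yields `None`.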

The learning function unit 1113 has a function of executing the learning process when a learning condition is satisfied. When the learning function unit 1113 determines that the learning condition is satisfied, for example because a phrase stored in neither the response phrase DB 1121 nor the learning result DB 1123 has been recognized twice in a row, it stores a response phrase corresponding to that phrase in the learning result DB 1123.

The drive control unit 1114 has a function of driving the drive unit 115 of the terminal 100. This allows the terminal 100 to move its movable parts. The display control unit 1115 has a function of displaying various kinds of information on the display unit 119 of the terminal 100.

The communication processing unit 118 is used for communication with the server device 200 via the network 600. The communication processing unit 118 includes a transmission unit 1181 for transmitting data to the server device 200 and a reception unit 1182 for receiving data from the server device 200.

<E. Details of processing>
FIG. 7 is a flowchart for describing the flow of the voice-input response process executed by the CPU 151 of the terminal 100. The CPU 151 executes the voice-input response process when the user 700 speaks and voice data is input from the voice input unit 116. The speech recognition unit 1110 of the CPU 151 performs speech recognition on the input voice data and specifies a phrase.

Referring to FIG. 7, in step S100 the specified phrase is searched for. Specifically, it is determined whether the specified phrase, or a phrase approximating it, is stored in the response phrase DB 1121 or the learning result DB 1123.

In step S101, the CPU 151 determines whether a phrase that exactly matches the specified phrase is stored. When it is determined in step S101 that a matching phrase is stored, the CPU 151 outputs, in step S102, the response phrase stored for that phrase. The response phrase is thereby output from the voice output unit 117, so the terminal 100 can respond to the phrase specified from the utterance of the user 700.

When it is determined in step S101 that no matching phrase is stored, the CPU 151 determines in step S103 whether a phrase approximating the specified phrase is stored. When it is determined in step S103 that an approximate phrase is stored, the CPU 151 outputs, in step S104, the response phrase stored for that approximate phrase. Thus, even when the phrase specified from the utterance of the user 700 is not stored, the terminal 100 can respond as it would to a phrase approximating it. As a result, even when no response phrase has been prepared for the exact phrase specified from the voice, the terminal 100 can respond to the user 700 and compensate for the shortage of response phrases.

When it is determined in step S103 that no approximate phrase is stored, the CPU 151 determines in step S105 whether the specified phrase is a learning start phrase for starting the learning process. Learning start phrases are, for example, "Henji oboete" (remember a reply) and "Kotoba oboete" (remember a phrase).

When the phrase is not determined to be a learning start phrase in step S105, the CPU 151 stores, in step S106, the phrase obtained as the current speech recognition result, as is, in the speech recognition result DB 1122. This allows the terminal 100 to accumulate a history of speech recognition results. Speech recognition results are also accumulated as history when a matching phrase or an approximate phrase is stored.

In step S107, the CPU 151 determines whether the current speech recognition result matches the previous one, that is, whether the same phrase has been specified twice in a row. When it is determined in step S107 that the result does not match the previous one, the CPU 151, in step S109, outputs "I couldn't hear you well." as the response phrase and drives the head so that the terminal tilts its head. This prompts the user to speak again.

When it is determined in step S107 that the current speech recognition result matches the previous one, so that the learning condition is satisfied, the CPU 151 moves control to step S108 and executes the learning process. In the learning process triggered by the same phrase being specified twice in a row, the CPU 151 carries out the utterances and responses illustrated at M08 to M12 and S11 to S13 in FIG. 2, and thereby stores, in the learning result DB 1123, a response phrase corresponding to the phrase obtained from the current speech recognition result. Because the learning process is executed only when an unknown phrase is specified at a predetermined frequency, inappropriate questioning of the user and inappropriate learning can be prevented. Thereafter, when the phrase specified from the voice is a phrase added by the learning process, the terminal 100 can output the corresponding response phrase.

Returning to step S105, when the phrase is determined to be a learning start phrase, the CPU 151 moves control to step S108 and executes the learning process. In the learning process triggered by a learning start phrase, an exchange such as the following takes place.
Response of terminal 100: OK, first tell me the phrase to remember.
Utterance of user 700: It's "○△×□".
Response of terminal 100: "○△×□", right? What should I reply when someone says "○△×□"? Tell me the reply.
Utterance of user 700: "△△×※○□" will do.
Response of terminal 100: "△△×※○□", right? OK, I've got it.

Through this learning process, the terminal 100 stores, in the learning result DB 1123, the response phrase (△△×※○□) corresponding to the phrase (○△×□) obtained from the speech recognition result. Thereafter, when the phrase added by the learning process (○△×□) is specified, the corresponding response phrase (△△×※○□) can be output.
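The branching of steps S101 to S109 described above can be summarized in a short sketch. It is a simplified reading of the flowchart under stated assumptions: the function name, the return convention, and the injected `is_approximate` predicate are hypothetical, the lookup tables are merged into one mapping, and the learning dialogue itself (step S108) is only signalled, not carried out.

```python
LEARNING_START_PHRASES = {"ヘンジオボエテ", "コトバオボエテ"}


def respond(phrase, responses, history, is_approximate):
    """Decide what to do with a recognized phrase.

    responses: mapping of stored phrase -> response phrase (assumed merged
        view of the response phrase DB and the learning result DB).
    history: list standing in for the speech recognition result DB.
    is_approximate: callable (phrase, stored_phrase) -> bool.
    Returns (step label, output text).
    """
    # S101 / S102: exact match.
    if phrase in responses:
        history.append(phrase)
        return ("S102", responses[phrase])
    # S103 / S104: approximate match.
    for stored in responses:
        if is_approximate(phrase, stored):
            history.append(phrase)
            return ("S104", responses[stored])
    # S105: learning start phrase goes straight to the learning process.
    if phrase in LEARNING_START_PHRASES:
        return ("S108", "start learning")
    # S106: record the unknown phrase in the history.
    previous = history[-1] if history else None
    history.append(phrase)
    # S107: the same unknown phrase twice in a row triggers learning.
    if previous == phrase:
        return ("S108", "start learning")
    # S109: otherwise ask the user to repeat.
    return ("S109", "I couldn't hear you well.")
```

Passing the approximation test in as a predicate keeps this sketch independent of how approximation is judged, so any comparison rule can be substituted without touching the branching.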

[Embodiment 2]
Embodiment 1 described an example in which the terminal 100 can execute the voice-input response process on its own. The configuration is not limited to this, and the voice-input response process may instead be made executable through communication with the server device 200.

For example, the server device 200 may include the storage unit 112, the utterance content determination unit 1111, and the learning function unit 1113 shown in FIG. 5. The functional configuration of the terminal 100 and the server device 200 in this case is illustrated below. FIG. 8 is a functional block diagram for describing the functional configuration of the terminal 100 and the server device 200.

As shown in FIG. 8, the server device 200 includes a control unit 211, a storage unit 212, and a communication processing unit 213. The storage unit 212 holds a response phrase DB 2121, a speech recognition result DB 2122, and a learning result DB 2123. These correspond to the response phrase DB 1121, the speech recognition result DB 1122, and the learning result DB 1123 of Embodiment 1, respectively.

The control unit 211 includes, for example, an utterance content determination unit 2111, an approximation determination unit 2112, and a learning function unit 2113. These correspond to the utterance content determination unit 1111, the approximation determination unit 1112, and the learning function unit 1113 of Embodiment 1, respectively.

The communication processing unit 213 is used for communication with the terminal 100 via the network 600. The communication processing unit 213 includes a transmission unit 2131 for transmitting data to the terminal 100 and a reception unit 2132 for receiving data from the terminal 100.

Next, an outline of the voice-input response process is described. The terminal 100 specifies a phrase based on the voice uttered by the user 700 and transmits phrase data identifying that phrase to the server device 200 via the transmission unit 1181.

On receiving the phrase data, the server device 200 has the utterance content determination unit 2111 determine a response phrase based on the phrase specified from the phrase data. The utterance content determination unit 2111 determines whether a response phrase corresponding to the specified phrase is stored in the storage unit 212 and, if so, transmits response data identifying that response phrase to the terminal 100 via the transmission unit 2131. The terminal 100 can thereby output the response phrase corresponding to the specified phrase.

Even when no response phrase corresponding to the specified phrase is stored, if a response phrase is stored for a phrase that the approximation determination unit 2112 has determined to approximate it, the utterance content determination unit 2111 transmits response data identifying that response phrase to the terminal 100 via the transmission unit 2131. The terminal 100 can thereby output the response phrase corresponding to a phrase that approximates the specified phrase.

Furthermore, even when the utterance content determination unit 2111 determines that no response phrase is stored for the specified phrase or for any approximate phrase, if the specified phrase has been recognized at a predetermined frequency (for example, twice in a row), the learning function unit 2113 performs the learning process for learning a response phrase corresponding to that phrase. Specifically, the learning function unit 2113 executes the processing for carrying out the utterances and responses illustrated at M08 to M12 and S11 to S13 in FIG. 2.

The utterance content determination unit 2111 also executes the processing for carrying out the utterances and responses described in Embodiment 1 for the learning start phrase when the specified phrase is a learning start phrase.

In this case, the information in the learning result DB 2123 may be stored so that it can be identified per terminal (for example, per identification number that identifies a terminal), or it may be stored so that it can be shared among all terminals.

As another example in which the voice-input response process is made executable through communication with the server device 200, the server device 200 may include not only the storage unit 112, the utterance content determination unit 1111, and the learning function unit 1113 shown in FIG. 5 but also the speech recognition unit 1110. In this case, the terminal 100 transmits voice data identifying the voice uttered by the user 700 to the server device 200 via the transmission unit 1181. On receiving the voice data, the server device 200 may have its speech recognition unit perform speech recognition on the voice data to specify a phrase, and then execute processing based on that phrase.
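The server-side lookup of Embodiment 2, including the choice between per-terminal and shared learning results, might look like the sketch below. The class name, the message fields `terminal_id`, `phrase`, `found`, and `response`, and the `share_learning` flag are all hypothetical; the disclosure does not specify a message format, and network transport and approximation handling are left out.

```python
class PhraseServer:
    """Hypothetical stand-in for the server device 200 of FIG. 8."""

    def __init__(self, responses, share_learning=False):
        self.responses = dict(responses)      # response phrase DB 2121
        self.share_learning = share_learning  # shared vs. per-terminal learning
        self.learned = {}                     # learning result DB 2123, by scope

    def _scope(self, terminal_id):
        # One common scope when learning is shared, else one per terminal.
        return "*" if self.share_learning else terminal_id

    def learn(self, terminal_id, phrase, response):
        """Store a learned pair in the scope that applies to this terminal."""
        self.learned.setdefault(self._scope(terminal_id), {})[phrase] = response

    def handle(self, request):
        """Turn phrase data from a terminal into response data."""
        phrase = request["phrase"]
        learned = self.learned.get(self._scope(request["terminal_id"]), {})
        if phrase in self.responses:
            return {"found": True, "response": self.responses[phrase]}
        if phrase in learned:
            return {"found": True, "response": learned[phrase]}
        return {"found": False, "response": None}
```

With `share_learning=False`, a pair learned from one terminal is invisible to requests from another; with `share_learning=True`, every terminal sees it.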

[Embodiment 3]
In Embodiments 1 and 2, phrases that differ only in the presence or absence of a dakuten or the like were given as examples of approximate phrases. Instead of or in addition to these, a partial phrase contained in a regular phrase may be treated as an approximate phrase. For example, the phrases approximating "Shinchō" (シンチョー) may include "Shincho" (シンチョ) and "Nchō" (ンチョー) instead of or in addition to "Jinchō" (ジンチョー) and the like. Likewise, the phrases approximating "Taijū" (タイジュー) may include "Taiju" (タイジュ) and "Ijū" (イジュー) instead of or in addition to "Daijū" (ダイジュー) and the like.

Embodiments 1 and 2 also described an example in which an approximation determination unit is provided and determines whether a phrase is an approximate phrase. Alternatively, without providing an approximation determination unit, the response phrase DB may be configured so that a response phrase is stored for each approximate phrase itself, as shown in FIG. 9.

FIG. 9 is a diagram for describing the schematic configuration of a response phrase DB in which response phrases are stored for both regular phrases and approximate phrases. For example, response phrases are stored for regular phrases such as "Shinchō" and "Taijū", response phrases are stored for "Jinchō", "Shincho", "Nchō", and the like as phrases approximating "Shinchō", and response phrases are stored for "Daijū", "Taiju", "Ijū", and the like as phrases approximating "Taijū".

When the response phrase DB is configured in this way, the utterance content determination unit can, simply by determining whether the phrase specified by speech recognition is stored in the response phrase DB, extract not only response phrases corresponding to regular phrases but also response phrases corresponding to approximate phrases, even without an approximation determination unit.
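A response phrase DB of the kind shown in FIG. 9 could be filled ahead of time by generating partial phrases from each regular phrase. The sketch below is one assumed way to do that; it only produces the variants obtained by dropping the first or the last character, which matches the examples in the text (シンチョ, ンチョー, タイジュ, イジュー), and the function names are hypothetical.

```python
def partial_variants(phrase):
    """Return the partial phrases formed by dropping the first or last character."""
    variants = {phrase[:-1], phrase[1:]}
    # Discard degenerate results for very short phrases.
    variants.discard("")
    variants.discard(phrase)
    return variants


def build_expanded_db(regular):
    """Copy `regular` and register each partial variant with the response
    that applies to an approximate hit on its regular phrase."""
    expanded = dict(regular)
    for phrase, response in regular.items():
        for variant in partial_variants(phrase):
            # Never overwrite an entry that is itself a regular phrase.
            expanded.setdefault(variant, response)
    return expanded
```

After expansion, a plain dictionary lookup on the recognized phrase covers both exact and partial hits, so no separate approximation step is needed at query time, at the cost of a larger table.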

The response phrase for phrases that approximate a regular phrase may be a common (shared) response phrase stored regardless of which approximate phrase was specified. Specifically, "Do you mean ..., by any chance?" may be stored as the common response phrase for approximate hits, and the output response phrase may be formed by inserting the regular phrase itself into the "..." part and then appending the response phrase that corresponds to the regular phrase. For example, "Do you mean ..., by any chance?" is defined for phrases that approximate "Shinchō", "Taijū", and so on, and when a phrase approximating "Shinchō" is specified, "Do you mean Shinchō, by any chance? My height is about 19 cm." may be output as the response phrase. This reduces the storage capacity needed to hold response data for approximate phrases.
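The shared response for approximate hits amounts to a template that is filled in at output time. A minimal sketch follows, assuming an English rendering of the template and a hypothetical function name.

```python
# Shared template for approximate hits; the English wording is an assumed
# rendering of the stored phrase.
APPROX_TEMPLATE = "Do you mean {phrase}, by any chance? {answer}"


def approximate_response(regular_phrase, regular_responses):
    """Insert the regular phrase into the template and append the response
    stored for that regular phrase."""
    return APPROX_TEMPLATE.format(
        phrase=regular_phrase,
        answer=regular_responses[regular_phrase],
    )
```

Only one template string is stored for all approximate phrases, which is the storage saving the paragraph above refers to.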

[Embodiment 4]
In Embodiments 1 to 3, the learning process is executed when an unknown phrase is specified twice in succession. The trigger is not limited to this, however, as long as the learning process is executed when an unknown phrase is specified at a predetermined frequency. For example, the learning process may be executed when the same unknown phrase is specified three times within the ten most recent entries in the voice recognition result DB. The determination may also be based on a time period rather than only on a count, for example whether the interval between the first and second occurrences is within one minute.
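The two alternative triggers mentioned above can be sketched as follows. The thresholds (ten entries, three occurrences, one minute) come from the text; the class name `RecognitionHistory`, its methods, and the choice to compare the two most recent occurrences for the interval check are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Tuple

HISTORY_SIZE = 10       # "ten most recent entries"
REQUIRED_COUNT = 3      # "three times"
MAX_INTERVAL_S = 60.0   # "within one minute"


class RecognitionHistory:
    """Hypothetical stand-in for the voice recognition result DB."""

    def __init__(self) -> None:
        # Each entry is (phrase, is_unknown, timestamp in seconds).
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._entries: Deque[Tuple[str, bool, float]] = deque(maxlen=HISTORY_SIZE)

    def record(self, phrase: str, is_unknown: bool, timestamp: float) -> None:
        self._entries.append((phrase, is_unknown, timestamp))

    def _unknown_times(self, phrase: str) -> List[float]:
        return [t for p, unknown, t in self._entries if unknown and p == phrase]

    def count_trigger(self, phrase: str) -> bool:
        """True if the same unknown phrase appears often enough in the window."""
        return len(self._unknown_times(phrase)) >= REQUIRED_COUNT

    def interval_trigger(self, phrase: str) -> bool:
        """True if the two latest occurrences are close together in time."""
        times = self._unknown_times(phrase)
        return len(times) >= 2 and times[-1] - times[-2] <= MAX_INTERVAL_S
```

A caller would record every recognition result and start the learning process when whichever trigger it has adopted returns True for the phrase just recorded.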

[Others]
In Embodiments 1 to 4, the response phrase is determined by either the terminal 100 or the server device 200. The configuration is not limited to this: the terminal 100 may determine a response phrase while the server device 200 also determines one. In this case, the terminal 100 also obtains from the server device 200 a response phrase for the user's voice, and, of that response phrase and the one it determined itself, may select and output the one with the higher importance of information (response level).
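The selection between the locally determined and the server-determined response phrase can be sketched as follows. The text only states that the phrase with the higher response level is chosen; the integer scale, the `Candidate` type, the handling of a missing candidate, and the tie-breaking rule are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    phrase: str
    response_level: int  # assumption: a larger value means higher importance


def select_response(local: Optional[Candidate],
                    remote: Optional[Candidate]) -> Optional[Candidate]:
    """Pick the candidate with the higher response level.

    Assumptions not stated in the text: a missing candidate loses to a
    present one, and a tie is resolved in favour of the local candidate.
    """
    if local is None:
        return remote
    if remote is None:
        return local
    return remote if remote.response_level > local.response_level else local
```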

In Embodiments 1 to 4, when an approximate phrase is specified, the response phrase corresponding to that approximate phrase is output. The configuration is not limited to this: when an approximate phrase is specified at a predetermined frequency (for example, twice in succession), the specified phrase may be deemed to be a regular phrase, and a response phrase for the specified phrase may be learned.

In Embodiments 1 to 4, the process of outputting a response phrase and the learning process are given as examples of the response process corresponding to a phrase specified from voice. The response process is not limited to these, as long as it is a process associated with the phrase in advance. For example, it may be a process of driving the terminal 100 in a predetermined manner or a process of capturing an image with the camera 164.

[Summary]
The main processes among those described above, and the advantages obtained from them, are described below.

(1) When the phrase specified from the voice of the user 700 is one of the phrases stored in the response phrase DB or the learning result DB, the terminal 100 executes a process of outputting the response phrase corresponding to that phrase. When the phrase is an unknown phrase that is not stored, and that unknown phrase has been specified at a predetermined frequency, the terminal 100 executes a learning process so that a response phrase corresponding to the unknown phrase can be output thereafter. The learning process is therefore not executed as soon as an unknown phrase is specified; it is executed once the predetermined frequency is reached, on the presumption that the user is uttering the unknown phrase intentionally. As a result, the terminal can respond to a wider range of user utterances in the future while avoiding inappropriate questions back to the user and inappropriate learning.

(2) The terminal 100 has a response phrase DB that stores response phrases corresponding to a plurality of anticipated phrases and to approximate phrases, and a learning result DB in which response phrases corresponding to phrases are updated and stored by the learning process. This keeps the time from the user's utterance to the response as short as possible.

(3) The predetermined frequency that triggers the learning process is the frequency reached when an unknown phrase that is not stored is specified twice in succession. Because no special utterance or operation is needed to start the learning process, the barrier to learning is lowered. As a result, learning occurs more often.

(4) When an unknown phrase that is not stored is specified, a response such as "I couldn't hear you well." is output, as shown at M06 in FIG. 2. This prompts the user to speak again.

(5) The learning process includes a process of prompting the user to utter the response phrase corresponding to the unknown phrase that triggered the learning (M08 in FIG. 2), and a process of storing the phrase specified from the user's utterance, as it is, as the response phrase corresponding to the unknown phrase (S13 in FIG. 2). Any phrase can therefore be stored as a response phrase.

(6) When the specified phrase is a learning start phrase, a response phrase corresponding to a phrase can be learned thereafter. This allows learning to be carried out actively, based on the user's intention.

(7) When the phrase specified from the voice of the user 700 matches a regular phrase stored in the response phrase DB or the learning result DB, the terminal 100 executes a process of outputting the response phrase corresponding to that regular phrase. When the phrase approximates a regular phrase stored in the response phrase DB or the learning result DB, the terminal 100 executes a process of outputting the response phrase provided for the approximate case. The terminal can therefore respond even when no response phrase has been prepared for the exact phrase specified from the voice of the user 700. As a result, a shortage of response phrases can be compensated for.

(8) In the embodiments that have an approximation determination unit, that unit determines whether the phrase specified from the voice of the user 700 approximates any of the regular phrases. When the phrase does approximate one of the regular phrases, the utterance content determination unit outputs the response phrase stored for the approximate case. Preparing response phrases both for a match with a regular phrase and for an approximation of one makes it possible to respond to a wide range of user utterances.

(9) In the embodiment that has no approximation determination unit, a partial phrase contained in a regular phrase is deemed to be a phrase that approximates that regular phrase. As shown in the response phrase DB of FIG. 9, a response phrase corresponding to the regular phrase and response phrases corresponding to the partial phrases contained in it are prepared. This makes it possible to respond to a wide range of user utterances while reducing the processing load.

(10) Among the regular phrases, for example, the response phrase for a phrase approximating "shinchō" (シンチョー) and the response phrase for a phrase approximating "taijū" (タイジュー) both contain a common phrase such as "Maybe".

When the specified phrase approximates "shinchō" (シンチョー), the common "Maybe" and the response phrase corresponding to "shinchō" are combined to output, for example, "Maybe you mean shinchō? I'm about 19 cm tall." Compared with preparing a different response phrase for each approximate phrase, this reduces the storage capacity needed for response phrases.

(11) The communication system includes the server device 200 and the terminal 100, which can communicate with the server device 200. The terminal 100 in Embodiments 2 and 3 transmits voice information corresponding to the user's voice (for example, phrase data specified from the voice recognition result, or voice data), and then executes a process of outputting a response phrase based on the response information (response data) transmitted from the server device 200.

The server device 200, for its part, executes a process of outputting the response information corresponding to a phrase when the phrase specified from the voice information from the terminal 100 is one of the phrases stored in the response phrase DB or the learning result DB. When the phrase is an unknown phrase that is not stored, and that unknown phrase has been specified at a predetermined frequency, the server device 200 executes a learning process so that response information corresponding to the unknown phrase can be output thereafter. The learning process is therefore not executed as soon as an unknown phrase is specified; it is executed once the predetermined frequency is reached, on the presumption that the user is uttering the unknown phrase intentionally. As a result, the system can respond to a wider range of user utterances in the future while avoiding inappropriate questions back to the user and inappropriate learning.

In addition, when the phrase specified from the voice information from the terminal 100 matches a phrase stored in the response phrase DB or the learning result DB, the server device 200 executes a process of outputting the response information corresponding to that phrase. When the phrase approximates a phrase stored in the response phrase DB or the learning result DB, the server device 200 executes a process of outputting the response information corresponding to the approximate phrase. A response can therefore be made even when no response information has been prepared for the exact phrase specified from the user's voice. As a result, a shortage of response information can be compensated for.
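The flow summarized in items (1), (3), (4), and (5) can be sketched as a small state machine. This is an illustration only, not the implementation in the embodiments: the class name `ResponseController`, the wording of the prompt for M08, the acknowledgement returned after storing, and the assumption that "twice in succession" means the same unknown phrase twice are all hypothetical. The same logic could run on the terminal 100 or on the server device 200.

```python
from typing import Dict, Optional

REPROMPT = "I couldn't hear you well."           # response at M06 in FIG. 2
ASK_FOR_RESPONSE = "How should I answer that?"   # hypothetical wording for M08
ACK = "OK, I'll remember that."                  # hypothetical acknowledgement


class ResponseController:
    def __init__(self, response_db: Dict[str, str]) -> None:
        self.response_db = dict(response_db)   # predetermined phrases
        self.learned: Dict[str, str] = {}      # learning result DB
        self._last_unknown: Optional[str] = None
        self._learning_target: Optional[str] = None

    def handle(self, phrase: str) -> str:
        """Return the phrase to output for one recognized user phrase."""
        if self._learning_target is not None:
            # Store the user's utterance as-is as the response (S13).
            self.learned[self._learning_target] = phrase
            self._learning_target = None
            return ACK
        for db in (self.learned, self.response_db):
            if phrase in db:
                self._last_unknown = None
                return db[phrase]
        if self._last_unknown == phrase:
            # Same unknown phrase twice in a row: start the learning process.
            self._learning_target = phrase
            self._last_unknown = None
            return ASK_FOR_RESPONSE
        # First occurrence of an unknown phrase: prompt the user to repeat.
        self._last_unknown = phrase
        return REPROMPT
```

In this sketch a known phrase between two unknown ones resets the counter, so the learning process starts only when the unknown phrase is repeated immediately.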

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

1 communication system, 100 communication terminal, 111, 211 control unit, 112, 212 storage unit, 115 drive unit, 116 voice input unit, 117 voice output unit, 118, 213 communication processing unit, 119 display unit, 158 GPS receiver, 162 microphone, 163 speaker, 164 camera, 165 drive device, 200 server device, 500 base station, 600 network, 700 user, 1110 voice recognition unit, 1111, 2111 utterance content determination unit, 1112, 2112 approximation determination unit, 1113, 2113 learning function unit, 1114 drive control unit, 1115 display control unit, 1121, 2121 response phrase DB, 1122, 2122 voice recognition result DB, 1123, 2123 learning result DB, 1181, 2131 transmission unit, 1182, 2132 reception unit.

Claims

1. A response control device comprising:
a voice receiving unit that receives voice input;
a response process execution unit that, when a phrase specified from the voice received by the voice receiving unit is any of a plurality of predetermined types of phrases, executes a response process corresponding to the phrase; and
a learning process execution unit that, when the phrase specified from the voice received by the voice receiving unit is not any of the plurality of types of phrases and the phrase has been specified at a predetermined frequency, executes a learning process for enabling a response process corresponding to the phrase to be specified thereafter.

2. The response control device according to claim 1, further comprising a storage unit that stores information from which a response process corresponding to each of the plurality of types of phrases can be specified, wherein
the response process execution unit executes, based on the information stored in the storage unit, a response process corresponding to the phrase specified from the voice received by the voice receiving unit, and
the learning process execution unit updates, by the learning process, the information stored in the storage unit so that a corresponding response process can be specified from the phrase that has reached the predetermined frequency.

3. The response control device according to claim 1, wherein the predetermined frequency is a frequency that is reached when voice from which a phrase that is not any of the plurality of types of phrases is specified is received a predetermined number of times in succession.

4. The response control device according to claim 1, further comprising an utterance prompting unit that, when the phrase specified from the voice received by the voice receiving unit is not any of the plurality of types of phrases, prompts a user of the response control device to utter voice.

5. The response control device according to any one of the above claims, wherein
the response process that the learning process enables to be specified is a process of outputting, as a response phrase, a phrase specified from the voice received by the voice receiving unit, and
the learning process includes:
a process of prompting the user of the response control device to utter voice corresponding to the response phrase; and
a process of setting the phrase specified from the voice received by the voice receiving unit as the response phrase corresponding to the phrase that triggered the learning process.

6. The response control device according to claim 1, wherein, when the phrase specified from the voice received by the voice receiving unit is a predetermined learning phrase, the learning process execution unit executes a process for enabling a response process corresponding to a phrase subsequently specified from the voice received by the voice receiving unit to be specified.

7. A control program for causing a computer to function as the response control device according to claim 1, the control program causing the computer to function as each of the units.

8. An information processing method comprising:
accepting voice input;
executing, when a phrase specified from the accepted voice is any of a plurality of predetermined types of phrases, a response process corresponding to the phrase; and
executing, when the phrase specified from the accepted voice is not any of the plurality of types of phrases and the phrase has been specified at a predetermined frequency, a learning process for enabling a response process corresponding to the phrase to be specified thereafter.

9. A communication system comprising a server and a response control device capable of communicating with the server, wherein
the response control device includes:
a voice receiving unit that receives voice input;
a communication unit that transmits voice information corresponding to the voice received by the voice receiving unit and receives response information from the server; and
a response process execution unit that executes a response process based on the received response information, and
the server includes:
a storage unit that stores response information corresponding to each of a plurality of predetermined types of phrases;
a response information transmission unit that, when a phrase specified from the voice information from the response control device is any of the plurality of types of phrases, transmits response information corresponding to the phrase; and
a learning process execution unit that, when the phrase specified from the voice information from the response control device is not any of the plurality of types of phrases and the phrase has been specified at a predetermined frequency, executes a learning process for enabling a response process corresponding to the phrase to be specified thereafter.