JP6698423B2

JP6698423B2 - Response control device, control program, information processing method, and communication system

Info

Publication number: JP6698423B2
Application number: JP2016099496A
Authority: JP
Inventors: 田上 文俊; 小柳津 拓也
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2016-05-18
Filing date: 2016-05-18
Publication date: 2020-05-27
Anticipated expiration: 2036-05-18
Also published as: JP2017207610A

Description

The present invention relates to a response control device, a control program, an information processing method, and a communication system that includes a server and a response control device.

Voice recognition devices for conducting spoken dialogue with a user are conventionally known. To respond to a wide range of user questions, such a device must store an enormous number of question contents and response contents in advance. User questions, however, vary widely, and it has been impossible to prepare question and response contents that anticipate all of them.

For this reason, Japanese Unexamined Patent Application Publication No. 2009-251019 (Patent Document 1), for example, discloses a voice recognition device that determines the response content for a recognition result based on the reliability of the recognition result for the input voice and the degree of influence of the task corresponding to that recognition result.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2009-251019

However, conventional voice recognition devices do not define response content for unanticipated recognition results. Consequently, when misrecognition or the like produces an unanticipated recognition result, the device cannot respond appropriately, and it does nothing to make up for the shortage of response content.

The present disclosure has been made in view of the above problem, and an object in one aspect is to provide a response control device, a control program, an information processing method, and a communication system capable of compensating for a shortage of response content.

According to one aspect, a response control device includes voice reception means that receives voice input, and response process execution means that, when a phrase identified from the voice received by the voice reception means is any one of a plurality of predetermined phrases, executes a phrase response process corresponding to that phrase. When the phrase identified from the voice received by the voice reception means approximates any one of the plurality of phrases, the response process execution means executes an approximation response process corresponding to the approximated phrase.

According to another aspect, a control program causes a computer to function as the response control device, that is, as each of the means described above.

According to yet another aspect, an information processing method includes: a step of receiving voice input; a step of executing, when a phrase identified from the received voice is any one of a plurality of predetermined phrases, a phrase response process corresponding to that phrase; and a step of executing, when the phrase identified from the received voice approximates any one of the plurality of phrases, an approximation response process corresponding to the approximated phrase.

According to yet another aspect, a communication system includes a server and a response control device capable of communicating with the server. The response control device includes voice reception means that receives voice input, communication means that transmits voice information corresponding to the voice received by the voice reception means and receives response information from the server, and response process execution means that executes a response process based on the received response information. The server includes: storage means that stores phrase response information and approximation response information as the response information corresponding to each of a plurality of predetermined phrases; response information transmission means that, when a phrase identified from the voice information from the response control device is any one of the plurality of phrases, transmits the phrase response information corresponding to that phrase as the response information; and approximation response information transmission means that, when the phrase identified from the voice information from the response control device approximates any one of the plurality of phrases, transmits the approximation response information corresponding to the approximated phrase as the response information.

According to one aspect, a shortage of response content can be compensated for.

A diagram for explaining the schematic configuration of the communication system.
A diagram showing an example of a conversation between the user and the terminal.
A diagram showing an example of the hardware configuration of the terminal.
A diagram showing an example of the hardware configuration of the server device.
A functional block diagram for explaining the functional configuration of the terminal.
A diagram for explaining the schematic configuration of the response phrase DB, the voice recognition result DB, and the learning result DB stored in the storage unit.
A flowchart for explaining the flow of the voice-input response process.
A functional block diagram for explaining the functional configuration of the terminal and the server device.
A diagram for explaining the schematic configuration of a response phrase DB in which response phrases are stored for regular phrases and for approximate phrases.

Embodiments are described below with reference to the drawings. In the following description, identical parts are given identical reference numerals. Their names and functions are also the same, so detailed descriptions of them are not repeated.

[Embodiment 1]
<A. System configuration>
FIG. 1 is a diagram for explaining the schematic configuration of the communication system according to the present embodiment. Referring to FIG. 1, the communication system 1 includes a mobile terminal 100 (hereinafter also referred to as the terminal 100) and a server device 200. The terminal 100 is an example of the response control device and performs a process of outputting a response phrase in reply to the user's voice (a voice-input response process). The following description takes as an example a terminal that can automatically move the movable parts of its housing by executing a program (a so-called robot-type terminal).

Specifically, the terminal 100 has hands, feet, a head, a torso, and the like. The terminal 100 is typically configured as an autonomous mobile body capable of walking. The head can rotate relative to the torso within a predetermined angle, and a camera is built into the head. The terminal 100 is not limited to a humanoid robot such as the one described above.

The terminal 100 is carried by the user 700 and is therefore used in various places. The terminal 100 communicates with the server device 200 via the base station 500 and the network 600.

<B. Outline of processing>
An overview of the processing in the communication system 1 follows. The terminal 100 performs voice recognition on the voice uttered by the user 700 and identifies a phrase. A phrase here means, for example, a clause, a word, or a group of words. Voice recognition means converting input voice data into the corresponding phrase. The terminal 100 stores in advance response phrases corresponding to a plurality of anticipated phrases (also called regular phrases). When the terminal 100 has stored a response phrase corresponding to the phrase identified from the voice, it outputs that response phrase.

The terminal 100 may also misrecognize part of the voice because of the user's pronunciation, ambient noise, and the like. To prepare for such misrecognition, the terminal 100 also stores in advance response phrases for phrases that approximate the anticipated regular phrases. An approximate phrase is a phrase that differs from a regular phrase only in, for example, the presence or absence of a dakuten (voiced sound mark, 「゛」), a sokuon (small tsu, 「ッ」), or a long vowel mark (「ー」). Therefore, even when the terminal 100 has not stored a response phrase for the identified phrase, it outputs the response phrase for a phrase that approximates the identified phrase if such a response phrase is stored.
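The approximation criterion described above can be illustrated with a minimal Python sketch. It assumes that recognition results are katakana strings, as in the examples of this description, and that two phrases are treated as approximate when they become identical after every dakuten, sokuon, and long vowel mark is dropped. The function names `normalize` and `is_approximate` are hypothetical and do not appear in the disclosure.

```python
import unicodedata

DAKUTEN = "\u3099"   # combining voiced sound mark produced by NFD decomposition
SOKUON = "ッ"        # small tsu
CHOONPU = "ー"       # long vowel mark


def normalize(phrase: str) -> str:
    """Drop dakuten, sokuon, and long vowel marks from a katakana phrase."""
    # NFD splits a voiced kana such as "ダ" into "タ" plus a combining dakuten.
    decomposed = unicodedata.normalize("NFD", phrase)
    kept = [c for c in decomposed if c not in (DAKUTEN, SOKUON, CHOONPU)]
    # NFC recomposes anything that was decomposed but not removed.
    return unicodedata.normalize("NFC", "".join(kept))


def is_approximate(recognized: str, regular: str) -> bool:
    """True when the phrases differ only in the three marks handled above."""
    return recognized != regular and normalize(recognized) == normalize(regular)
```

Under these assumptions, `is_approximate("ダイジュー", "タイジュー")` holds because both phrases normalize to the same string, whereas an exact match is deliberately not reported as approximate.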

Furthermore, when a learning condition is satisfied, the terminal 100 performs a learning process for learning a response phrase. The learning condition is satisfied, for example, when a phrase that is not stored is identified with a predetermined frequency (for example, twice in succession) or when a predetermined learning start phrase is identified. An overview follows.

FIG. 2 is a diagram showing an example of a conversation between the user 700 and the terminal 100. The speech balloons in FIG. 2 (steps M01 to M12) represent voice uttered by the user 700 or voice output from the terminal 100. The boxes in FIG. 2 (steps S01 to S13) outline the processing executed by the terminal 100.

First, the case where the terminal 100 has stored a response phrase corresponding to the phrase identified from the voice is described.
Suppose that, as shown in step M01, the user 700 says "What is your height?" to the terminal 100. The terminal 100 performs voice recognition on the message from the user 700, then extracts and outputs the response phrase corresponding to the recognition result. The terminal 100 also stores the phrase identified as the recognition result as history.

In FIG. 2, as shown in step S01, the terminal 100 recognizes the voice from the user 700 as "シンチョーハ？" ("Shinchō wa?") and, based on that phrase, recognizes that it is being asked about "シンチョー" (height).
In step S02, the terminal 100 extracts the response phrase that matches "シンチョー", the subject of the question. In this example, "My height is about 19 cm." is stored as the response phrase that matches "シンチョー". The terminal 100 therefore outputs the response phrase "My height is about 19 cm.", as shown in step M02.

Next, the case is described where the terminal 100 has not stored a response phrase corresponding to the phrase identified from the voice but has stored a response phrase corresponding to a phrase that approximates it (also written as phrase (approximate)).
Suppose that, as shown in step M03, the user 700 says "What is your weight?" to the terminal 100. Suppose further that, as shown in step S03, the terminal 100 misrecognizes this as "ダイジューハ？" ("Daijū wa?"), with a dakuten on the first character.

However, no phrase matching the meaningless string "ダイジュー" was anticipated at the design stage, so none is stored. In such a case, the terminal 100 determines whether a phrase that approximates the phrase identified from the voice is stored and, if so, extracts and outputs the response phrase corresponding to that approximate phrase. Specifically, the terminal 100 determines whether it has stored a phrase that differs from the phrase identified from the recognition result only in the presence or absence of a dakuten, a sokuon, a long vowel mark, or the like.

In the example of FIG. 2, suppose that "タイジュー" ("taijū", weight), which is "ダイジュー" with the dakuten removed from its first character "ダ", is stored. In this case, the terminal 100 determines that "ダイジュー" approximates "タイジュー" and, as shown in step M04, extracts the response phrase that matches the approximate phrase "タイジュー". In this example, "Do you mean my weight, by any chance? My weight is about 300 g." is stored as the response phrase that matches the approximate phrase "タイジュー". The terminal 100 therefore outputs the response phrase "Do you mean my weight, by any chance? My weight is about 300 g.", as shown in step M04.
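The exact-then-approximate lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the table `RESPONSES`, the index `APPROX_INDEX`, and the functions `normalize` and `respond` are hypothetical names, the approximation key simply drops dakuten, sokuon, and long vowel marks, and the approximate response is composed from a label and the regular answer rather than stored as a separate phrase.

```python
import unicodedata


def normalize(phrase):
    """Assumed approximation key: strip dakuten, sokuon (ッ), and long vowel mark (ー)."""
    nfd = unicodedata.normalize("NFD", phrase)
    return unicodedata.normalize("NFC", "".join(c for c in nfd if c not in "\u3099ッー"))


# Hypothetical response table: regular phrase -> (label, response phrase).
RESPONSES = {
    "シンチョー": ("height", "My height is about 19 cm."),
    "タイジュー": ("weight", "My weight is about 300 g."),
}

# Index from the normalized form back to the regular phrase.
APPROX_INDEX = {normalize(key): key for key in RESPONSES}


def respond(recognized):
    """Return a response phrase, or None when nothing matches."""
    if recognized in RESPONSES:                       # exact match with a regular phrase
        return RESPONSES[recognized][1]
    regular = APPROX_INDEX.get(normalize(recognized))
    if regular is not None:                           # approximate match
        label, answer = RESPONSES[regular]
        return f"Do you mean my {label}, by any chance? {answer}"
    return None                                       # unknown phrase
```

Keying the index on the normalized form keeps the approximate lookup a single dictionary access; an implementation could equally store each approximate phrase and its response explicitly.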

Next, the case is described where neither a response phrase corresponding to the phrase identified from the voice nor a response phrase corresponding to an approximate phrase is stored.
Suppose that, as shown in step M05, the user 700 says "How big are your feet?" to the terminal 100. Suppose further that, as shown in step S05, the terminal 100 correctly recognizes the voice from the user 700 as "アシノオオキサハ？" ("Ashi no ōkisa wa?").

However, if being asked "アシノオオキサハ？" was not anticipated at the design stage, neither the phrase "アシノオオキサハ？" nor any phrase that approximates it is stored.

In this case, the terminal 100 determines in step S06 that there is no response phrase corresponding to the identified phrase and, in step S07, stores the recognition result "アシノオオキサハ？" itself as history. The terminal 100 then outputs the response phrase used when an unknown phrase has been identified. This response phrase is one that prompts the user to speak again; for example, the phrase "I couldn't hear you well." is defined, as shown in step M06. Along with outputting this response phrase, the terminal 100 drives its head so as to strike a pose of tilting its head to one side.

If, as shown in step M07, the user 700 again says "How big are your feet?" after an unknown phrase has been identified, the terminal 100 determines, as shown in steps S08 and S09, that no response phrase is stored, just as before. The terminal 100 then determines whether the current recognition result matches the most recent (previous) recognition result stored in step S07.

As shown in step S10, when the terminal 100 determines that the current recognition result matches the previous one, it is highly probable that the voice was not misrecognized and that the user 700 is intentionally saying "How big are your feet?". The terminal 100 therefore performs the learning process described below.
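The learning trigger described above (the same unknown phrase identified twice in succession) can be sketched in Python as follows. The class name `UnknownPhraseHistory`, its methods, and the configurable `required_repeats` parameter are assumptions made for illustration; the description only gives "twice in succession" as an example of the predetermined frequency.

```python
class UnknownPhraseHistory:
    """Tracks consecutive occurrences of the same unknown recognition result."""

    def __init__(self, required_repeats=2):
        self.required_repeats = required_repeats
        self.last = None    # most recent unknown recognition result
        self.count = 0      # how many times in a row it has been seen

    def record(self, phrase):
        """Store an unknown phrase; return True when learning should start."""
        if phrase == self.last:
            self.count += 1
        else:
            self.last, self.count = phrase, 1
        return self.count >= self.required_repeats

    def reset(self):
        """Forget the history, for example after a known phrase is recognized."""
        self.last, self.count = None, 0
```

With the default setting, the first `record("アシノオオキサハ")` returns `False` and the second consecutive call returns `True`, which corresponds to the transition from step S07 to step S10.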

First, as shown in step M08, the terminal 100 outputs a response phrase such as 'What should I answer when asked "アシノオオキサハ？"?', based on the identified unknown phrase. Because the terminal 100 echoes the phrase back in this way, it avoids asking the user something the user cannot make sense of. Suppose that, in reply to this question, the user 700 says "It's 5 cm.", as shown in step M09.

As shown in step S11, the terminal 100 recognizes the voice from the user 700 as "ゴセンチメートルダヨ" ("Go senchimētoru da yo", "It's five centimeters") and, based on that result, outputs a response phrase such as 'Should I answer "ゴセンチメートルダヨ"?', as shown in step M10. Suppose that, in reply to this question, the user 700 says "OK", as shown in step M11. When the terminal 100 recognizes that voice as "オーケー" ("OK"), as shown in step S12, it stores the phrase "ゴセンチメートルダヨ" as the response phrase for "アシノオオキサハ？", as shown in step S13, and then outputs a response phrase such as "Got it.", as shown in step M12.
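The confirmation dialogue in steps M08 to M12 can be sketched as a small state machine in Python. The class `LearningSession` and its methods are hypothetical names. The description does not say what happens when the user does not answer "オーケー"; in this sketch that branch discards the candidate and asks again, which is purely an assumption.

```python
class LearningSession:
    """Learns a response phrase for one unknown phrase through a short dialogue."""

    def __init__(self, unknown_phrase, store):
        self.unknown = unknown_phrase
        self.store = store          # dict: phrase -> learned response phrase
        self.candidate = None       # proposed response awaiting confirmation
        self.done = False

    def start(self):
        # Step M08: echo the unknown phrase back to the user.
        return f'What should I answer when asked "{self.unknown}"?'

    def on_recognized(self, phrase):
        if self.candidate is None:
            # Steps S11 and M10: keep the proposed answer and ask for confirmation.
            self.candidate = phrase
            return f'Should I answer "{phrase}"?'
        if phrase == "オーケー":
            # Steps S12, S13, and M12: store the learned response phrase.
            self.store[self.unknown] = self.candidate
            self.done = True
            return "Got it."
        # Assumption (not in the description): no confirmation, so ask again.
        self.candidate = None
        return self.start()
```

A session created for "アシノオオキサハ" that receives "ゴセンチメートルダヨ" followed by "オーケー" leaves that pair in `store`, which is the state described for step S13.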

As a result of this learning process, whenever the terminal 100 subsequently identifies "アシノオオキサハ？" as a voice recognition result, it can output the stored response phrase "ゴセンチメートルダヨ". In addition, when the voice recognition result approximates "アシノオオキサ" (for example, "アジノオオキサ"), the terminal 100 may output "Do you mean アシノオオキサ, by any chance? アシノオオキサ is ゴセンチメートルダヨ." as the response phrase.

As described above, when the terminal 100 has stored a response phrase corresponding to the phrase identified from the voice uttered by the user 700, it outputs that response phrase (steps S01, S02, M02). Even when no response phrase corresponding to the identified phrase is stored, the terminal 100 outputs the response phrase corresponding to an approximate phrase if one is stored (steps S03, S04, M04). Furthermore, when no response phrase corresponding to an approximate phrase is stored either and the identified phrase has been recognized with a predetermined frequency (for example, twice in succession), the terminal 100 performs the learning process for learning a response phrase corresponding to that phrase (steps S05 to S13, M06 to M12).
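The three cases summarized above can be sketched as a single dispatch function in Python. The function `handle_phrase`, its return convention, and the injected `find_approximate` callable are assumptions made for illustration; clearing the remembered unknown phrase after a successful response is one reading of "twice in succession" and is not stated in the description.

```python
from typing import Callable, Dict, Optional, Tuple


def handle_phrase(
    recognized: str,
    responses: Dict[str, str],
    find_approximate: Callable[[str], Optional[str]],
    last_unknown: Optional[str],
) -> Tuple[str, str, Optional[str]]:
    """Return (action, text to output, unknown phrase to remember)."""
    if recognized in responses:
        # Case 1: a response phrase is stored for the identified phrase.
        return "respond", responses[recognized], None
    regular = find_approximate(recognized)
    if regular is not None:
        # Case 2: a response phrase is stored for an approximate phrase.
        return "respond_approx", responses[regular], None
    if recognized == last_unknown:
        # Case 3: the same unknown phrase twice in succession starts learning.
        return "learn", f'What should I answer when asked "{recognized}"?', None
    # Otherwise, remember the unknown phrase and prompt the user to repeat.
    return "ask_again", "I couldn't hear you well.", recognized
```

The caller passes the third element of each result back in as `last_unknown` on the next utterance, so an unknown phrase repeated immediately moves from `"ask_again"` to `"learn"`.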

<C. Hardware configuration>
FIG. 3 is a diagram showing an example of the hardware configuration of the terminal 100. Referring to FIG. 3, the terminal 100 includes, as its main components, a CPU (Central Processing Unit) 151 that executes programs, a ROM (Read-Only Memory) 152 that stores data in a nonvolatile manner, a RAM (Random Access Memory) 153 that stores, in a volatile manner, data generated by the CPU 151 executing a program or data input via an input device, a flash memory 154 that stores data in a nonvolatile manner, an LED (Light Emitting Diode) 155, operation keys 156, a switch 157, a GPS (Global Positioning System) receiver 158, a communication IF (Interface) 159, a power supply circuit 160, a touch screen 161, a microphone 162, a speaker 163, a camera 164, a drive device 165, and antennas 1581 and 1591. These components are connected to one another by a data bus.

The touch screen 161 consists of a display 1611 and a touch panel 1612. The antenna 1581 is the antenna for the GPS receiver 158. The antenna 1591 is the antenna for the communication IF 159.

The LED 155 comprises various indicator lamps that show the operating state of the terminal 100. For example, the LED 155 indicates whether the main power supply of the terminal 100 is on or off and whether the flash memory 154 is being read or written.

The operation keys 156 are keys (operation buttons) with which the user of the terminal 100 turns the main power supply on or off, among other operations. The switch 157 comprises a main power switch for switching whether power is supplied to the power supply circuit 160, and various other push-button switches.

The GPS receiver 158 acquires position information on the current position of the terminal 100 based on radio waves from four or more GPS satellites. The position information acquired by the GPS receiver 158 is transmitted to the server device 200 via the communication IF 159. The timing at which the terminal 100 starts acquiring position information is described later.

The communication IF 159 transmits data to the server device 200 and receives data transmitted from the server device 200.

The power supply circuit 160 steps down the voltage of commercial power received through an outlet and supplies power to each part of the terminal 100.

The touch screen 161 is a device for displaying various data and accepting input. The display 1611 includes a screen for displaying images.

The microphone 162 collects sound around the terminal 100. For example, the microphone 162 picks up the voice uttered by the user 700.

The speaker 163 outputs the voice corresponding to a response phrase. In one aspect, the speaker 163 produces speech for communicating with the user and others.

The camera 164 is an imaging device for capturing images of subjects around the terminal 100. The image data captured by the camera 164 is transmitted to the server device 200 via the communication IF 159.

The drive device 165 is a drive mechanism for driving the hands, feet, and head of the terminal 100. The terminal 100 walks when the drive device 165 drives its feet. The orientation of the camera 164 changes when the drive device 165 rotates the head relative to the torso. The terminal 100 can also strike a pose of tilting its head to one side by having the drive device 165 change the angle of the head.

The processing in the terminal 100 (for example, the voice-input response process) is realized by the hardware and by software (a control program) executed by the CPU 151. Such software may be stored in the flash memory 154 in advance. The software may also be stored on another storage medium and distributed as a program product, or it may be provided as a downloadable program product by an information provider connected to the Internet. Such software is read from the storage medium by a reading device, or downloaded via the communication IF 159 or the like, and then temporarily stored in the flash memory 154. The CPU 151 reads the software from the flash memory 154, stores it in the RAM 153 in the form of an executable program, and executes that program.

The components of the terminal 100 shown in FIG. 3 are commonplace. The essential part of the present disclosure can therefore be said to be the software stored in the RAM 153, the flash memory 154, or a storage medium, or the software downloadable via a network. Because the operation of each piece of hardware in the terminal 100 is well known, a detailed description is not given.

The recording medium is not limited to a DVD (Digital Versatile Disc)-RAM and may be any medium that carries the program in a fixed manner, such as a DVD-ROM, a CD (Compact Disc)-ROM, an FD (Flexible Disc), a hard disk, a magnetic tape, a cassette tape, an optical disc, or a semiconductor memory such as an EEPROM (Electrically Erasable Programmable ROM) or a flash ROM. The recording medium is a non-transitory medium from which a computer can read the program and the like. The term "program" here covers not only programs directly executable by a CPU but also programs in source form, compressed programs, encrypted programs, and the like.

FIG. 4 is a diagram showing an example of the hardware configuration of the server device 200. Referring to FIG. 4, the server device 200 includes, as its main components, a CPU 251 that executes programs, a ROM 252 that stores data in a nonvolatile manner, a RAM 253 that stores, in a volatile manner, data generated by the CPU 251 executing a program or data input via an input device, an HDD (Hard Disc Drive) 254 that stores data in a nonvolatile manner, an LED 255, a switch 256, a communication IF (Interface) 257, a power supply circuit 258, a display 259, and operation keys 260. These components are connected to one another by a data bus.

The power supply circuit 258 steps down the voltage of commercial power received through an outlet and supplies power to each part of the server device 200. The switch 256 comprises a main power switch for switching whether power is supplied to the power supply circuit 258, and various other push-button switches. The display 259 is a device for displaying various data.

通信ＩＦ２５７は、端末１００に対するデータの送信処理および端末１００から送信されたデータの受信処理を行なう。 The communication IF 257 performs a data transmission process for the terminal 100 and a data reception process for the data transmitted from the terminal 100.

ＬＥＤ２５５は、サーバ装置２００の動作状態を表す各種の表示ランプである。たとえば、ＬＥＤ２５５は、サーバ装置２００の主電源のオンまたはオフ状態、およびＨＤＤ２５４への読み出しまたは書き込み状態等を表す。操作キー２６０は、サーバ装置２００のユーザがサーバ装置２００へデータを入力するために用いるキー（キーボード）である。 The LED 255 comprises various indicator lamps that show the operating state of the server device 200. For example, the LED 255 indicates whether the main power supply of the server device 200 is on or off, and the read or write state of the HDD 254. The operation keys 260 are keys (a keyboard) that a user of the server device 200 uses to input data to the server device 200.

サーバ装置２００における処理は、各ハードウェアおよびＣＰＵ２５１により実行されるソフトウェアによって実現される。このようなソフトウェアは、ＨＤＤ２５４に予め記憶されている場合がある。また、ソフトウェアは、その他の記憶媒体に格納されて、プログラムプロダクトとして流通している場合もある。あるいは、ソフトウェアは、いわゆるインターネットに接続されている情報提供事業者によってダウンロード可能なプログラムプロダクトとして提供される場合もある。このようなソフトウェアは、読取装置によりその記憶媒体から読み取られて、あるいは、通信ＩＦ２５７等を介してダウンロードされた後、ＨＤＤ２５４に一旦格納される。そのソフトウェアは、ＣＰＵ２５１によってＨＤＤ２５４から読み出され、ＲＡＭ２５３に実行可能なプログラムの形式で格納される。ＣＰＵ２５１は、そのプログラムを実行する。 The processing in the server device 200 is realized by each hardware and software executed by the CPU 251. Such software may be stored in the HDD 254 in advance. The software may be stored in another storage medium and distributed as a program product. Alternatively, the software may be provided as a program product that can be downloaded by an information provider connected to the so-called Internet. Such software is temporarily stored in the HDD 254 after being read from the storage medium by the reading device or downloaded via the communication IF 257 or the like. The software is read from the HDD 254 by the CPU 251 and stored in the RAM 253 in the form of an executable program. The CPU 251 executes the program.

同図に示されるサーバ装置２００を構成する各構成要素は、一般的なものである。したがって、本開示の本質的な部分は、ＲＡＭ２５３、ＨＤＤ２５４、記憶媒体に格納されたソフトウェア、あるいはネットワークを介してダウンロード可能なソフトウェアであるともいえる。なお、サーバ装置２００の各ハードウェアの動作は周知であるので、詳細な説明は繰り返さない。 The components of the server device 200 shown in the figure are general-purpose ones. It can therefore be said that the essential part of the present disclosure is the software stored in the RAM 253, the HDD 254, or a storage medium, or software that can be downloaded via a network. Since the operation of each piece of hardware of the server device 200 is well known, the detailed description will not be repeated.

なお、記録媒体としては、ＤＶＤ−ＲＡＭに限られず、ＤＶＤ-ＲＯＭ、ＣＤ−ＲＯＭ、ＦＤ、ハードディスク、磁気テープ、カセットテープ、光ディスク、ＥＥＰＲＯＭ、フラッシュＲＯＭなどの半導体メモリ等の固定的にプログラムを担持する媒体でもよい。また、記録媒体は、当該プログラム等をコンピュータが読取可能な一時的でない媒体である。また、ここでいうプログラムとは、ＣＰＵにより直接実行可能なプログラムだけでなく、ソースプログラム形式のプログラム、圧縮処理されたプログラム、暗号化されたプログラム等を含む。 The recording medium is not limited to a DVD-RAM, and may be any medium that carries a program in a fixed manner, such as a DVD-ROM, a CD-ROM, an FD, a hard disk, a magnetic tape, a cassette tape, an optical disc, or a semiconductor memory such as an EEPROM or a flash ROM. The recording medium is a non-transitory, computer-readable medium storing the program and the like. The term "program" as used here covers not only a program directly executable by the CPU but also a program in source form, a compressed program, an encrypted program, and the like.

＜Ｄ．機能的構成＞ <D. Functional configuration>
図５は、端末１００の機能的構成を説明するための機能ブロック図である。図５を参照して、端末１００は、制御部１１１と、記憶部１１２と、駆動部１１５と、音声入力部１１６と、音声出力部１１７と、通信処理部１１８とを備えている。なお、端末１００には、位置情報取得部１１３および撮像部１１４なども備えており、さらにその他の機能的構成を備えるものであってもよい。 FIG. 5 is a functional block diagram for explaining the functional configuration of the terminal 100. Referring to FIG. 5, terminal 100 includes a control unit 111, a storage unit 112, a drive unit 115, a voice input unit 116, a voice output unit 117, and a communication processing unit 118. The terminal 100 also includes the position information acquisition unit 113, the image pickup unit 114, and the like, and may further include other functional configurations.

音声入力部１１６は、端末１００の周囲の音を集め、集められた音声を音声データとして制御部１１１に送る。音声入力部１１６は、たとえばマイク１６２により構成されている。音声出力部１１７は、応答フレーズに対応する音声を出力する。音声出力部１１７は、たとえばスピーカ１６３により構成されている。 The voice input unit 116 collects sounds around the terminal 100 and sends the collected sound to the control unit 111 as voice data. The voice input unit 116 includes, for example, a microphone 162. The voice output unit 117 outputs a voice corresponding to the response phrase. The voice output unit 117 includes, for example, a speaker 163.

記憶部１１２は、各種の制御プログラムを記憶するとともに、応答フレーズＤＢ（Data Base）１１２１と、音声認識結果ＤＢ１１２２と、学習結果ＤＢ１１２３とを有している。記憶部１１２は、たとえばＲＡＭ１５３などにより構成されている。 The storage unit 112 stores various control programs and has a response phrase DB (Data Base) 1121, a voice recognition result DB 1122, and a learning result DB 1123. The storage unit 112 is composed of, for example, the RAM 153.

応答フレーズＤＢ１１２１は、設計段階から想定されている複数種類のフレーズ（正規のフレーズ）に対応する応答フレーズ、および複数種類のフレーズ（正規のフレーズ）各々と近似する場合の応答フレーズなどを記憶する。音声認識結果ＤＢ１１２２は、ユーザから発せられた音声に基づく音声認識の結果を記憶する。学習結果ＤＢ１１２３は、学習処理により新たに追加されたフレーズに対応する応答フレーズを記憶する。以下に具体例を説明する。 The response phrase DB 1121 stores a response phrase corresponding to a plurality of types of phrases (regular phrases) assumed from the design stage, a response phrase in the case of approximating each of the plurality of types of phrases (regular phrases), and the like. The voice recognition result DB 1122 stores the result of voice recognition based on the voice uttered by the user. The learning result DB 1123 stores the response phrase corresponding to the phrase newly added by the learning process. A specific example will be described below.

図６は、記憶部１１２に記憶されている応答フレーズＤＢ１１２１、音声認識結果ＤＢ１１２２、および学習結果ＤＢ１１２３の概略構成を説明するための図である。図６（ａ）を参照して、応答フレーズＤＢ１１２１は、フレーズと、当該フレーズに対応する応答フレーズとを含む。 FIG. 6 is a diagram for explaining a schematic configuration of the response phrase DB 1121, the voice recognition result DB 1122, and the learning result DB 1123 stored in the storage unit 112. Referring to FIG. 6A, the response phrase DB 1121 includes a phrase and a response phrase corresponding to the phrase.

フレーズとしては、たとえば、「シンチョー」「タイジュー」など複数種類のフレーズ（フレーズ（合致））、複数種類のフレーズ各々に近似するフレーズ（フレーズ（近似））、および学習開始フレーズである「ヘンジオボエテ」などが記憶されている。また、それぞれのフレーズに対しては、応答フレーズが記憶されている。たとえば、「シンチョー（合致）」に対しては、図２のＭ０２で示したとおり、「身長はだいたい１９ｃｍだよ。」というメッセージが記憶されている。また、「タイジュー（近似）」に対しては、図２のＭ０４で示したとおり、「ひょっとして体重のこと？体重はだいたい３００ｇだよ。」というメッセージが記憶されている。また、学習開始フレーズに対しては、「オーケー、まずは覚える言葉を教えてね。」というメッセージが記憶されている。その他、不明なフレーズや学習処理中のフレーズに対しては、たとえば図２のステップＭ０６、Ｍ０８、Ｍ１０、Ｍ１２などに示すような応答フレーズが記憶されている。 The stored phrases include, for example, a plurality of types of phrases such as "Shincho" and "Taiju" (phrase (match)), phrases approximating each of those phrases (phrase (approximate)), and the learning start phrase "Henji oboete". A response phrase is stored for each phrase. For example, for "Shincho (match)", the message "My height is about 19 cm." is stored, as shown at M02 in FIG. 2. For "Taiju (approximate)", the message "Do you mean my weight, perhaps? My weight is about 300 g." is stored, as shown at M04 in FIG. 2. For the learning start phrase, the message "OK, first tell me the words to remember." is stored. In addition, for unknown phrases and phrases being learned, response phrases such as those shown at steps M06, M08, M10, and M12 in FIG. 2 are stored.

図６（ｂ）を参照して、音声認識結果ＤＢ１１２２は、音声認識の結果により特定されたフレーズを含む。図６（ｂ）の例では、図２のステップＳ０１、Ｓ０３、Ｓ０５、Ｓ０８、Ｓ１１、Ｓ１２における音声認識の結果により特定されたフレーズが記憶される。 Referring to FIG. 6B, the voice recognition result DB 1122 includes a phrase specified by the result of voice recognition. In the example of FIG. 6B, the phrase specified by the result of the voice recognition in steps S01, S03, S05, S08, S11, and S12 of FIG. 2 is stored.

図６（ｃ）を参照して、学習結果ＤＢ１１２３は、学習により追加されたフレーズと、当該フレーズに対応する応答フレーズとを含む。図６（ｃ）の例では、図２のステップＳ１３により、追加フレーズ「アシノオオキサハ？」に対して応答フレーズ「ゴセンチメートルダヨ。」が記憶されている。 Referring to FIG. 6C, the learning result DB 1123 includes a phrase added by learning and a response phrase corresponding to that phrase. In the example of FIG. 6C, as a result of step S13 in FIG. 2, the response phrase "Go senchimetoru da yo." ("It's five centimeters.") is stored for the added phrase "Ashi no okisa wa?" ("How big are your feet?").
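
As an illustrative sketch only, the three databases of FIG. 6 could be modeled as follows in Python. The class name `PhraseStores`, its field names, and the `lookup` helper are assumptions introduced here for illustration; the specification does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PhraseStores:
    """Hypothetical in-memory model of the three DBs in storage unit 112."""

    # Response phrase DB 1121: phrase assumed at design time -> response phrase.
    response_phrases: Dict[str, str] = field(default_factory=dict)
    # Voice recognition result DB 1122: recognized phrases, oldest first.
    recognition_history: List[str] = field(default_factory=list)
    # Learning result DB 1123: phrase added by learning -> taught response phrase.
    learned_phrases: Dict[str, str] = field(default_factory=dict)

    def lookup(self, phrase: str) -> Optional[str]:
        """Return the response stored for an exact match, or None if absent."""
        if phrase in self.response_phrases:
            return self.response_phrases[phrase]
        return self.learned_phrases.get(phrase)


# Entries taken from the examples of FIG. 6(a) and FIG. 6(c).
stores = PhraseStores(response_phrases={"シンチョー": "身長はだいたい１９ｃｍだよ。"})
stores.learned_phrases["アシノオオキサハ？"] = "ゴセンチメートルダヨ。"
```

Keeping the learned entries in a separate mapping mirrors the description, in which the response phrase DB 1121 is updated from the server device 200 while the learning result DB 1123 is filled by the learning process.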

応答フレーズＤＢ１１２１の記憶情報は、サーバ装置２００から定期的に送信される更新用データに基づきアップデートされる。これにより、フレーズおよび応答フレーズを更新することができる。更新用データは、サーバ装置２００を管理する管理者などにより入力されたフレーズおよび応答フレーズを特定するためのデータである。 The stored information in the response phrase DB 1121 is updated based on the update data periodically transmitted from the server device 200. Thereby, the phrase and the response phrase can be updated. The update data is data for specifying a phrase and a response phrase input by an administrator who manages the server device 200.

また、学習結果ＤＢ１１２３の記憶情報は、サーバ装置２００に送信可能である。サーバ装置２００は、端末１００からの学習結果ＤＢ１１２３の記憶情報を含む更新用データを他の端末に送信する。これにより、端末１００において学習させた内容を、他の端末にも反映させることができる。 Further, the storage information of the learning result DB 1123 can be transmitted to the server device 200. The server device 200 transmits the update data including the storage information of the learning result DB 1123 from the terminal 100 to another terminal. Thereby, the content learned in the terminal 100 can be reflected in other terminals.

図５に戻り、制御部１１１は、端末１００の全体の動作を制御する。制御部１１１は、音声認識部１１１０と、発話内容決定部１１１１と、近似判定部１１１２と、学習機能部１１１３と、駆動制御部１１１４と、表示制御部１１１５とを有する。制御部１１１は、たとえばＣＰＵ１５１などにより構成されている。 Returning to FIG. 5, the control unit 111 controls the overall operation of the terminal 100. The control unit 111 includes a voice recognition unit 1110, a speech content determination unit 1111, an approximation determination unit 1112, a learning function unit 1113, a drive control unit 1114, and a display control unit 1115. The control unit 111 is composed of, for example, the CPU 151 and the like.

音声認識部１１１０は、音声入力部１１６により入力された音声データに基づいて、フレーズを特定するための音声認識を行なう機能を有している。 The voice recognition unit 1110 has a function of performing voice recognition for identifying a phrase based on the voice data input by the voice input unit 116.

発話内容決定部１１１１は、音声出力部１１７から出力する応答フレーズを決定する機能を有している。具体的に、発話内容決定部１１１１は、音声認識部１１１０により特定されたフレーズに対応する応答フレーズが応答フレーズＤＢ１１２１あるいは学習結果ＤＢ１１２３に記憶されているか否かを判定し、記憶されているときには当該応答フレーズに決定する。 The utterance content determination unit 1111 has a function of determining the response phrase to be output from the voice output unit 117. Specifically, the utterance content determination unit 1111 determines whether a response phrase corresponding to the phrase specified by the voice recognition unit 1110 is stored in the response phrase DB 1121 or the learning result DB 1123, and when one is stored, selects that response phrase.

近似判定部１１１２は、音声認識部１１１０により特定されたフレーズと近似する正規のフレーズが応答フレーズＤＢ１１２１あるいは学習結果ＤＢ１１２３に記憶されているか否かを判定する。具体的に、近似判定部１１１２は、音声認識部１１１０により特定されたフレーズと、濁点・促音・長音符などの有無の点のみにおいて相違しているフレーズが記憶されているか否かを判定する。 The approximation determination unit 1112 determines whether a regular phrase approximating the phrase specified by the voice recognition unit 1110 is stored in the response phrase DB 1121 or the learning result DB 1123. Specifically, the approximation determination unit 1112 determines whether a stored phrase differs from the phrase specified by the voice recognition unit 1110 only in the presence or absence of a dakuten (voiced sound mark), a sokuon (geminate consonant), a long vowel mark, or the like.

発話内容決定部１１１１は、特定されたフレーズに対応する応答フレーズが記憶されていないときであっても、近似判定部１１１２により当該フレーズと近似するフレーズが記憶されていると判定されたときには、当該近似するフレーズに対応する応答フレーズに決定する。 Even when no response phrase corresponding to the specified phrase is stored, if the approximation determination unit 1112 determines that a phrase approximating the specified phrase is stored, the utterance content determination unit 1111 selects the response phrase corresponding to that approximate phrase.
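
One possible way to test whether two katakana phrases differ only in a dakuten, a sokuon, or a long vowel mark is sketched below. The specification does not state how the approximation determination unit 1112 is implemented; the use of Unicode NFD decomposition and the function names `skeleton` and `find_approximate` are assumptions made for this sketch, and marks covered only by "or the like" (for example a handakuten) are deliberately left out.

```python
import unicodedata
from typing import Iterable, Optional

DAKUTEN = "\u3099"   # combining voiced sound mark exposed by NFD decomposition
SOKUON = "ッ"         # small tsu (geminate consonant)
LONG_VOWEL = "ー"     # long vowel mark


def skeleton(phrase: str) -> str:
    """Strip dakuten, sokuon, and long vowel marks from a katakana phrase."""
    decomposed = unicodedata.normalize("NFD", phrase)  # e.g. ジ -> シ + U+3099
    return "".join(
        ch for ch in decomposed if ch not in (DAKUTEN, SOKUON, LONG_VOWEL)
    )


def find_approximate(phrase: str, regular_phrases: Iterable[str]) -> Optional[str]:
    """Return the first regular phrase that differs from `phrase` only in the
    stripped marks, or None when no such phrase exists."""
    key = skeleton(phrase)
    for regular in regular_phrases:
        if regular != phrase and skeleton(regular) == key:
            return regular
    return None
```

With the regular phrases "シンチョー" and "タイジュー", this sketch treats "ジンチョー" as approximating "シンチョー" and "ダイジュー" as approximating "タイジュー", while a truncated form such as "ンチョー" is not matched.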

学習機能部１１１３は、学習条件が成立したときに学習処理を実行する機能を有している。学習機能部１１１３は、たとえば、応答フレーズＤＢ１１２１および学習結果ＤＢ１１２３に記憶されていないフレーズが２回連続して認識されることなどにより学習条件が成立したと判定したときに、当該フレーズに対応する応答フレーズを学習結果ＤＢ１１２３に記憶する。 The learning function unit 1113 has a function of executing a learning process when a learning condition is satisfied. When the learning function unit 1113 determines that the learning condition is satisfied, for example because a phrase stored in neither the response phrase DB 1121 nor the learning result DB 1123 has been recognized twice in succession, it stores a response phrase corresponding to that phrase in the learning result DB 1123.

駆動制御部１１１４は、端末１００の駆動部１１５を駆動させる機能を有する。これにより、端末１００は、可動部を動かすことが可能となる。表示制御部１１１５は、端末１００の表示部１１９に各種の情報を表示させる機能を有する。 The drive control unit 1114 has a function of driving the drive unit 115 of the terminal 100. This allows the terminal 100 to move its movable parts. The display control unit 1115 has a function of displaying various information on the display unit 119 of the terminal 100.

通信処理部１１８は、ネットワーク６００を介したサーバ装置２００との通信に用いられる。通信処理部１１８は、データをサーバ装置２００に送信するための送信部１１８１と、データをサーバ装置２００から受信するための受信部１１８２とを有する。 The communication processing unit 118 is used for communication with the server device 200 via the network 600. The communication processing unit 118 has a transmitting unit 1181 for transmitting data to the server device 200 and a receiving unit 1182 for receiving data from the server device 200.

＜Ｅ．処理の詳細＞ <E. Processing details>
図７は、端末１００のＣＰＵ１５１が実行する音声入力時応答処理の流れを説明するためのフローチャートである。ＣＰＵ１５１は、ユーザ７００から音声が発せられて、音声入力部１１６から音声データが入力されたときに音声入力時応答処理を実行する。ＣＰＵ１５１の音声認識部１１１０は、入力された音声データに基づいて音声認識し、フレーズを特定する。 FIG. 7 is a flowchart for explaining the flow of the voice input response process executed by the CPU 151 of the terminal 100. The CPU 151 executes the voice input response process when the user 700 utters a voice and voice data is input from the voice input unit 116. The voice recognition unit 1110 of the CPU 151 performs voice recognition based on the input voice data and specifies a phrase.

図７を参照して、ステップＳ１００においては、特定されたフレーズを探す処理が行なわれる。具体的には、特定されたフレーズあるいは近似するフレーズが応答フレーズＤＢ１１２１および学習結果ＤＢ１１２３に記憶されているか否かを判定する。 Referring to FIG. 7, in step S100, a process of searching for the specified phrase is performed. Specifically, it is determined whether the specified phrase or a similar phrase is stored in the response phrase DB 1121 and the learning result DB 1123.

ステップＳ１０１においては、ＣＰＵ１５１は、特定されたフレーズそのものと合致するフレーズが記憶されているか否かを判定する。ステップＳ１０１において合致するフレーズが記憶されていると判定されたときには、ＣＰＵ１５１は、ステップＳ１０２において当該フレーズに対応して記憶されている応答フレーズを出力する。これにより、音声出力部１１７から応答フレーズを出力させることができる。これにより、端末１００は、ユーザ７００からの発話から特定されるフレーズに対する応答を行なうことができる。 In step S101, the CPU 151 determines whether a phrase that matches the specified phrase itself is stored. When it is determined in step S101 that the matching phrase is stored, the CPU 151 outputs the response phrase stored corresponding to the phrase in step S102. As a result, the voice output unit 117 can output the response phrase. Thereby, the terminal 100 can make a response to the phrase specified from the utterance from the user 700.

一方、ステップＳ１０１において合致するフレーズが記憶されていないと判定されたときには、ステップＳ１０３において、ＣＰＵ１５１は、特定されたフレーズと近似するフレーズが記憶されているか否かを判定する。ステップＳ１０３において近似するフレーズが記憶されていると判定されたときには、ステップＳ１０４において、ＣＰＵ１５１は、当該近似するフレーズに対応して記憶されている応答フレーズを出力する。これにより、ユーザ７００からの発話から特定されるフレーズが記憶されていない場合であっても、端末１００は、当該フレーズと近似するフレーズに対する応答を行なうことができる。その結果、音声から特定されるフレーズそのものに対応する応答フレーズが準備されていない場合であっても、端末１００は、ユーザ７００に応答することができ、応答フレーズの不足を補うことができる。 On the other hand, when it is determined in step S101 that no matching phrase is stored, the CPU 151 determines in step S103 whether a phrase approximating the specified phrase is stored. When it is determined in step S103 that an approximate phrase is stored, the CPU 151 outputs, in step S104, the response phrase stored for that approximate phrase. As a result, even when the phrase specified from the utterance of the user 700 is not stored, the terminal 100 can respond as it would to the phrase that approximates it. Consequently, even when no response phrase has been prepared for the exact phrase specified from the voice, the terminal 100 can respond to the user 700 and compensate for the shortage of response phrases.

ステップＳ１０３において近似するフレーズが記憶されていないと判定されたときには、ステップＳ１０５において、ＣＰＵ１５１は、特定されたフレーズが学習処理を開始するための学習開始フレーズであるか否かを判定する。学習開始フレーズとは、たとえば、「返事覚えて（ヘンジオボエテ）」、「言葉覚えて（コトバオボエテ）」などである。 When it is determined in step S103 that no approximate phrase is stored, the CPU 151 determines in step S105 whether the specified phrase is a learning start phrase for starting the learning process. The learning start phrase is, for example, "Henji oboete" ("Remember a reply") or "Kotoba oboete" ("Remember a word").

ステップＳ１０５において学習開始フレーズであると判定されなかったときには、ステップＳ１０６において、ＣＰＵ１５１は、今回の音声認識の結果そのもののフレーズを音声認識結果ＤＢ１１２２に記憶する。これにより、端末１００は、音声認識の結果の履歴を蓄積することができる。なお、合致するフレーズあるいは近似するフレーズが記憶されているときにも、音声認識の結果は履歴として蓄積される。 When it is not determined in step S105 that the phrase is the learning start phrase, in step S106, the CPU 151 stores the phrase of the result of the current voice recognition in the voice recognition result DB 1122. Thereby, the terminal 100 can accumulate the history of the result of the voice recognition. The result of voice recognition is stored as a history even when a matching phrase or a similar phrase is stored.

ステップＳ１０７においては、ＣＰＵ１５１は、今回の音声認識の結果が前回の音声認識の結果と合致するか否かを判定する。つまり、２回連続で同じフレーズが特定されたか否かが判定される。ステップＳ１０７において、前回の音声認識の結果と合致しないと判定されたときには、ステップＳ１０９において、ＣＰＵ１５１は、「よく聞こえなかったよ。」を応答フレーズとして出力するとともに、首を傾げるポーズをとるように頭部を駆動させる。これにより、ユーザに再度の発話を促すことができる。 In step S107, the CPU 151 determines whether the current voice recognition result matches the previous voice recognition result. That is, it determines whether the same phrase has been specified twice in succession. When it is determined in step S107 that the result does not match the previous one, in step S109 the CPU 151 outputs "I couldn't hear you well." as the response phrase and drives the head so that the terminal strikes a head-tilting pose. This prompts the user to speak again.

一方、ステップＳ１０７において今回の音声認識の結果が前回の音声認識の結果と合致すると判定されて学習条件が成立したときには、ＣＰＵ１５１は、制御をステップＳ１０８へ移行して、学習処理を実行する。２回連続で同じフレーズが特定されることにより実行される学習処理では、ＣＰＵ１５１は、図２のステップＭ０８〜Ｍ１２、Ｓ１１〜Ｓ１３に例示する発話・応答を行なうことにより、今回の音声認識の結果に基づくフレーズに対応する応答フレーズを学習結果ＤＢ１１２３に記憶する。このように、不明なフレーズが所定頻度で特定されたときに学習処理が実行されるため、不適切な問い返しおよび学習が行なわれてしまうことを防止できる。以降においては、音声から特定されるフレーズが学習処理で追加したフレーズとなったときに、端末１００は、対応する応答フレーズを出力することができる。 On the other hand, when it is determined in step S107 that the current voice recognition result matches the previous one and the learning condition is thus satisfied, the CPU 151 moves control to step S108 and executes the learning process. In the learning process executed because the same phrase was specified twice in succession, the CPU 151 carries out the utterances and responses illustrated at steps M08 to M12 and S11 to S13 in FIG. 2, and thereby stores, in the learning result DB 1123, the response phrase corresponding to the phrase obtained from the current voice recognition result. Because the learning process is executed only when an unknown phrase is specified at a predetermined frequency, inappropriate question-backs and learning can be prevented. Thereafter, when the phrase specified from the voice is a phrase added by the learning process, the terminal 100 can output the corresponding response phrase.

ステップＳ１０５に戻り、学習開始フレーズであると判定されたときには、ＣＰＵ１５１は、制御をステップＳ１０８へ移行して、学習処理を実行する。学習開始フレーズとなることにより実行される学習処理では、たとえば、以下のような発話・応答が行なわれる。 Returning to step S105, when it is determined that the phrase is the learning start phrase, the CPU 151 moves control to step S108 and executes the learning process. In the learning process executed in response to the learning start phrase, for example, the following utterances and responses take place.
端末１００の応答内容：オッケー、まずは覚える言葉を教えてね。 Response of the terminal 100: Okay, first tell me the words to remember.
ユーザ７００の発話内容：「○△×□」だよ。 Utterance of the user 700: It's "○△×□".
端末１００の応答内容：「○△×□」だね？「○△×□」って言われたら、なんて返事したらいい？返事する言葉を教えてね。 Response of the terminal 100: "○△×□", right? What should I reply when someone says "○△×□"? Tell me the words to reply with.
ユーザ７００の発話内容：「△△×※○□」でいいよ。 Utterance of the user 700: "△△×※○□" will do.
端末１００の応答内容：「△△×※○□」だね。オッケー、覚えたよ。 Response of the terminal 100: "△△×※○□", got it. Okay, I've memorized it.

このような学習処理が行なわれることにより、端末１００は、音声認識の結果に基づくフレーズ（○△×□）に対応する応答フレーズ（△△×※○□）を学習結果ＤＢ１１２３に記憶する。これにより、以降において学習処理で追加したフレーズ（○△×□）となったときに、対応する応答フレーズ（△△×※○□）を出力することができる。 Through this learning process, the terminal 100 stores, in the learning result DB 1123, the response phrase (△△×※○□) corresponding to the phrase (○△×□) obtained from the voice recognition result. Thereafter, when the recognized phrase is the phrase (○△×□) added by the learning process, the terminal 100 can output the corresponding response phrase (△△×※○□).
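
The branching of FIG. 7 (steps S101 to S109) can be summarized in a short Python sketch. This is a simplified illustration under stated assumptions rather than the implementation of the CPU 151: the function name `handle_phrase`, the action labels, and the `find_similar` callback standing in for the approximation determination are all hypothetical, `responses` is assumed to hold regular phrases only, and the learning dialogue itself is not modeled.

```python
from typing import Callable, Dict, List, Optional, Tuple

LEARNING_START_PHRASES = {"ヘンジオボエテ", "コトバオボエテ"}
RETRY_RESPONSE = "よく聞こえなかったよ。"


def handle_phrase(
    phrase: str,
    responses: Dict[str, str],
    history: List[str],
    find_similar: Callable[[str], Optional[str]],
) -> Tuple[str, Optional[str]]:
    """Return (action, response_text) for one recognized phrase."""
    previous = history[-1] if history else None
    if phrase in responses:                  # S101 yes -> S102
        history.append(phrase)               # matches are also kept as history
        return ("respond", responses[phrase])
    similar = find_similar(phrase)           # S103
    if similar is not None:                  # S103 yes -> S104
        history.append(phrase)
        return ("respond", responses[similar])
    if phrase in LEARNING_START_PHRASES:     # S105 yes -> S108
        return ("learn", None)
    history.append(phrase)                   # S106: store the raw result
    if phrase == previous:                   # S107 yes -> S108
        return ("learn", None)
    return ("retry", RETRY_RESPONSE)         # S109: ask the user to repeat
```

Under these assumptions, an unknown phrase yields "retry" the first time it is heard and "learn" when it is heard again immediately afterwards, which corresponds to the twice-in-succession learning condition.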

［実施の形態２］ [Second Embodiment]
上記実施の形態１においては、端末１００単独で音声入力時応答処理を実行可能な例について説明したが、これに限らず、サーバ装置２００と通信することにより音声入力時応答処理が実行可能となるようにしてもよい。 In the first embodiment, an example was described in which the terminal 100 can execute the voice input response process on its own. However, the present invention is not limited to this, and the voice input response process may instead be made executable through communication with the server device 200.

たとえば、図５に示した記憶部１１２、発話内容決定部１１１１、学習機能部１１１３を、サーバ装置２００が備えるようにしてもよい。この場合における端末１００およびサーバ装置２００の機能的構成を例示する。図８は、端末１００およびサーバ装置２００の機能的構成を説明するための機能ブロック図である。 For example, the server device 200 may include the storage unit 112, the utterance content determination unit 1111, and the learning function unit 1113 shown in FIG. 5. The functional configurations of the terminal 100 and the server device 200 in this case are illustrated below. FIG. 8 is a functional block diagram for explaining the functional configurations of the terminal 100 and the server device 200.

図８に示すように、サーバ装置２００は、制御部２１１と、記憶部２１２と、通信処理部２１３とを備えている。記憶部２１２は、応答フレーズＤＢ２１２１、音声認識結果ＤＢ２１２２、および学習結果ＤＢ２１２３を有している。応答フレーズＤＢ２１２１、音声認識結果ＤＢ２１２２、および学習結果ＤＢ２１２３は、各々、実施の形態１における応答フレーズＤＢ１１２１、音声認識結果ＤＢ１１２２、および学習結果ＤＢ１１２３に相当する。 As shown in FIG. 8, the server device 200 includes a control unit 211, a storage unit 212, and a communication processing unit 213. The storage unit 212 has a response phrase DB 2121, a voice recognition result DB 2122, and a learning result DB 2123. Response phrase DB2121, voice recognition result DB2122, and learning result DB2123 correspond to response phrase DB1121, voice recognition result DB1122, and learning result DB1123, respectively, in the first embodiment.

制御部２１１は、たとえば、発話内容決定部２１１１、近似判定部２１１２、および学習機能部２１１３を有する。発話内容決定部２１１１、近似判定部２１１２、および学習機能部２１１３は、各々、実施の形態１における発話内容決定部１１１１、近似判定部１１１２、および学習機能部１１１３に相当する。 The control unit 211 has, for example, an utterance content determination unit 2111, an approximation determination unit 2112, and a learning function unit 2113. The utterance content determination unit 2111, the approximation determination unit 2112, and the learning function unit 2113 correspond to the utterance content determination unit 1111, the approximation determination unit 1112, and the learning function unit 1113 in the first embodiment, respectively.

通信処理部２１３は、ネットワーク６００を介した端末１００との通信に用いられる。通信処理部２１３は、データを端末１００に送信するための送信部２１３１と、データを端末１００から受信するための受信部２１３２とを有する。 The communication processing unit 213 is used for communication with the terminal 100 via the network 600. The communication processing unit 213 includes a transmitting unit 2131 for transmitting data to the terminal 100 and a receiving unit 2132 for receiving data from the terminal 100.

次に、音声入力時応答処理の概要について説明する。端末１００は、ユーザ７００から発せられる音声に基づいてフレーズを特定し、当該フレーズを特定可能なフレーズデータを送信部１１８１を介してサーバ装置２００へ送信する。 Next, an outline of the voice input response process will be described. The terminal 100 specifies a phrase based on the voice uttered by the user 700, and transmits phrase data capable of specifying the phrase to the server device 200 via the transmission unit 1181.

サーバ装置２００は、フレーズデータを受信すると、当該フレーズデータから特定されるフレーズに基づいて発話内容決定部２１１１により応答フレーズを決定する。発話内容決定部２１１１は、特定されるフレーズに対応する応答フレーズが記憶部２１２に記憶されているか否かを判定し、当該フレーズに対応する応答フレーズが記憶されているときには、当該応答フレーズを特定可能な応答データを送信部２１３１を介して端末１００へ送信する。これにより、端末１００は、特定されたフレーズに対応する応答フレーズを出力することができる。 Upon receiving the phrase data, the server device 200 has the utterance content determination unit 2111 determine a response phrase based on the phrase specified from the phrase data. The utterance content determination unit 2111 determines whether a response phrase corresponding to the specified phrase is stored in the storage unit 212, and when one is stored, transmits response data identifying that response phrase to the terminal 100 via the transmission unit 2131. The terminal 100 can thereby output the response phrase corresponding to the specified phrase.

また、発話内容決定部２１１１は、特定したフレーズに対応する応答フレーズが記憶されていないときであっても、近似判定部２１１２により当該フレーズと近似すると判定されたフレーズに対応する応答フレーズが記憶されているときには、当該応答フレーズを特定可能な応答データを送信部２１３１を介して端末１００へ送信する。これにより、端末１００は、特定されたフレーズと近似するフレーズに対応する応答フレーズを出力することができる。 Even when no response phrase corresponding to the specified phrase is stored, if a response phrase is stored for a phrase that the approximation determination unit 2112 has determined to approximate the specified phrase, the utterance content determination unit 2111 transmits response data identifying that response phrase to the terminal 100 via the transmission unit 2131. The terminal 100 can thereby output the response phrase corresponding to the phrase approximating the specified phrase.

さらに、発話内容決定部２１１１により、特定したフレーズおよび近似するフレーズに対応する応答フレーズが記憶されていないと判定されたときであっても、当該特定したフレーズが所定頻度で認識（たとえば、２回連続して認識）されたときには、学習機能部２１１３は、当該フレーズに対応する応答フレーズを学習するための学習処理を行なう。具体的には、学習機能部２１１３は、図２のステップＭ０８〜Ｍ１２、Ｓ１１〜Ｓ１３に例示する発話・応答を行なうための処理を実行する。 Further, even when the utterance content determination unit 2111 determines that the response phrase corresponding to the specified phrase and the similar phrase is not stored, the specified phrase is recognized at a predetermined frequency (for example, twice. When continuously recognized), the learning function unit 2113 performs a learning process for learning a response phrase corresponding to the phrase. Specifically, the learning function unit 2113 executes the processing for performing the utterance/response illustrated in steps M08 to M12 and S11 to S13 of FIG.

また、発話内容決定部２１１１は、特定したフレーズが学習開始フレーズであったときにも実施の形態１で説明した学習開始フレーズ判定時の発話・応答を行なうための処理を実行する。 Further, the utterance content determination unit 2111 also executes the processing for performing the utterance/response at the time of determining the learning start phrase described in the first embodiment even when the identified phrase is the learning start phrase.

この場合、学習結果ＤＢ２１２３の記憶情報は、端末毎（たとえば、端末を識別可能な識別番号毎）に特定可能に記憶されているものであってもよく、すべての端末間で共有可能となるように記憶されているものであってもよい。 In this case, the storage information of the learning result DB 2123 may be stored so that it can be specified for each terminal (for example, for each identification number that can identify the terminal), and can be shared among all the terminals. May be stored in.
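
The two storage options described for the learning result DB 2123, per terminal or shared among all terminals, could be sketched as follows. The class `LearningResultStore`, the `per_terminal` flag, and the `SHARED_KEY` sentinel are hypothetical names chosen for this illustration, and the terminal identifiers "T-001" and "T-002" are made-up examples.

```python
from typing import Dict, Optional

SHARED_KEY = "*"  # hypothetical key for the single table shared by all terminals


class LearningResultStore:
    """Hypothetical server-side store of learned phrase/response pairs."""

    def __init__(self, per_terminal: bool) -> None:
        self.per_terminal = per_terminal
        self._tables: Dict[str, Dict[str, str]] = {}

    def _key(self, terminal_id: str) -> str:
        # Per-terminal mode keys each table by the terminal's identifier.
        return terminal_id if self.per_terminal else SHARED_KEY

    def learn(self, terminal_id: str, phrase: str, response: str) -> None:
        self._tables.setdefault(self._key(terminal_id), {})[phrase] = response

    def lookup(self, terminal_id: str, phrase: str) -> Optional[str]:
        return self._tables.get(self._key(terminal_id), {}).get(phrase)


private = LearningResultStore(per_terminal=True)
private.learn("T-001", "アシノオオキサハ？", "ゴセンチメートルダヨ。")
shared = LearningResultStore(per_terminal=False)
shared.learn("T-001", "アシノオオキサハ？", "ゴセンチメートルダヨ。")
```

In the per-terminal store a phrase taught on "T-001" is invisible to "T-002", whereas in the shared store every terminal sees it.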

なお、サーバ装置２００と通信することにより音声入力時応答処理が実行可能となる例として、図５に示した記憶部１１２、発話内容決定部１１１１、学習機能部１１１３のみならず、音声認識部１１１０についても、サーバ装置２００が備えるようにしてもよい。この場合、端末１００は、ユーザ７００から発せられる音声を特定可能な音声データを送信部１１８１を介してサーバ装置２００へ送信する。サーバ装置２００は、音声データを受信すると、当該音声データに基づいて音声認識部により音声認識してフレーズを特定し、当該フレーズに基づく処理を実行するようにしてもよい。 As another example in which the voice input response process is made executable through communication with the server device 200, the server device 200 may include not only the storage unit 112, the utterance content determination unit 1111, and the learning function unit 1113 shown in FIG. 5 but also the voice recognition unit 1110. In this case, the terminal 100 transmits voice data identifying the voice uttered by the user 700 to the server device 200 via the transmission unit 1181. Upon receiving the voice data, the server device 200 may have its voice recognition unit perform voice recognition on the voice data to specify a phrase, and then execute processing based on that phrase.

［実施の形態３］ [Third Embodiment]
上記実施の形態１および２においては、近似するフレーズとして、濁点などの有無の点のみにおいて相違しているフレーズを例示したが、これに替えてあるいは加えて、正規のフレーズに含まれる一部のフレーズを近似するフレーズとしてもよい。たとえば、「シンチョー」に近似するフレーズとしては、「ジンチョー」などに替えてあるいは加えて、「シンチョ」や「ンチョー」などを含めてもよい。また、「タイジュー」に近似するフレーズとしては、「ダイジュー」などに替えてあるいは加えて、「タイジュ」や「イジュー」などを含めてもよい。 In the first and second embodiments above, phrases that differ only in the presence or absence of a dakuten or the like were given as examples of approximate phrases. Instead of or in addition to these, a part of a regular phrase may be treated as an approximate phrase. For example, phrases approximating "Shincho" may include, instead of or in addition to "Jincho" and the like, "Shincho" without the final long vowel (シンチョ) and "Ncho" (ンチョー). Likewise, phrases approximating "Taiju" may include, instead of or in addition to "Daiju" and the like, "Taiju" without the final long vowel (タイジュ) and "Iju" (イジュー).

また、上記実施の形態１および２においては、近似判定部を備え、当該近似判定部により近似するフレーズであるか否かを判定する例について説明したが、近似判定部を備えることなく、図９に示すように、近似するフレーズそのものに対して応答フレーズが記憶されるように応答フレーズＤＢを構成してもよい。 In addition, in the above-described first and second embodiments, an example has been described in which the approximate determination unit is provided and it is determined by the approximate determination unit whether or not the phrase is approximate. However, without the approximate determination unit, FIG. As shown in, the response phrase DB may be configured such that the response phrase is stored for the approximate phrase itself.

図９は、フレーズとして正規のフレーズと、近似するフレーズとに対応して応答フレーズが記憶されている応答フレーズＤＢの概略構成を説明するための図である。たとえば、正規のフレーズである「シンチョー」や「タイジュー」などに対応する応答フレーズが記憶されるとともに、「シンチョー」に近似するフレーズとして「ジンチョー」「シンチョ」「ンチョー」などに対応する応答フレーズが記憶されるとともに、「タイジュー」に近似するフレーズとして「ダイジュー」「タイジュ」「イジュー」などに対応する応答フレーズが記憶されている。 FIG. 9 is a diagram for explaining the schematic configuration of a response phrase DB in which response phrases are stored for both regular phrases and approximate phrases. For example, response phrases are stored for regular phrases such as "Shincho" and "Taiju", for "Jincho", シンチョ, ンチョー, and the like as phrases approximating "Shincho", and for "Daiju", タイジュ, イジュー, and the like as phrases approximating "Taiju".

このように応答フレーズＤＢが構成されている場合、発話内容決定部は、音声認識の結果により特定されたフレーズが応答フレーズＤＢに記憶されているか否かを判定することにより、近似判定部を備えずとも、正規のフレーズに対応する応答フレーズのみならず、近似するフレーズに対応する応答フレーズを抽出することができる。 When the response phrase DB is configured in this way, the utterance content determination unit only has to determine whether the phrase specified by the voice recognition result is stored in the response phrase DB. Even without an approximation determination unit, it can thereby extract not only the response phrase corresponding to a regular phrase but also the response phrase corresponding to an approximate phrase.

また、正規のフレーズと近似するフレーズに対応する応答フレーズは、近似するフレーズにかかわらず、共通（兼用）の応答フレーズを記憶するものであってもよい。具体的に、近似する場合における共通の応答フレーズとして、「ひょっとして…のこと？」を応答フレーズとして記憶し応答フレーズとしては、「…」の部分に正規のフレーズそのものを挿入し、かつ正規のフレーズに対応する応答フレーズをその後に付加するものであってもよい。たとえば、「シンチョー」や「タイジュー」などに近似するフレーズに対応して「ひょっとして…のこと？」が定められており、「シンチョー」に近似するフレーズが特定されたときには、応答フレーズとして「ひょっとしてシンチョーのこと？身長はだいたい１９ｃｍだよ。」を出力するようにしてもよい。これにより、近似するフレーズに対応する応答データを記憶するための記憶容量を低減できる。 The response phrase for phrases approximating a regular phrase may also be a common (shared) response phrase that is stored regardless of which approximate phrase was specified. Specifically, "Do you perhaps mean ...?" may be stored as the common response phrase for approximate matches, and the output response phrase may be formed by inserting the regular phrase itself into the "..." part and appending the response phrase corresponding to the regular phrase. For example, "Do you perhaps mean ...?" is defined for phrases approximating "Shincho", "Taiju", and so on, and when a phrase approximating "Shincho" is specified, "Do you perhaps mean Shincho? My height is about 19 cm." may be output as the response phrase. This reduces the storage capacity needed to store response data for approximate phrases.
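
A minimal Python sketch of the shared response phrase is given below. It combines a lookup table of approximate phrases with the common template; the mapping `VARIANT_TO_REGULAR` from each approximate phrase to its regular phrase is an assumption added so that the template knows which regular phrase to insert, and the regular response for "タイジュー" is inferred from the approximate message quoted for FIG. 2 rather than stated directly.

```python
from typing import Dict, Optional

# Regular phrases and their response phrases (the second entry is inferred).
RESPONSES: Dict[str, str] = {
    "シンチョー": "身長はだいたい１９ｃｍだよ。",
    "タイジュー": "体重はだいたい３００ｇだよ。",
}

# Hypothetical table: approximate phrase -> regular phrase it stands for.
VARIANT_TO_REGULAR: Dict[str, str] = {
    "ジンチョー": "シンチョー", "シンチョ": "シンチョー", "ンチョー": "シンチョー",
    "ダイジュー": "タイジュー", "タイジュ": "タイジュー", "イジュー": "タイジュー",
}

# Common response phrase; the regular phrase replaces the placeholder.
SHARED_TEMPLATE = "ひょっとして{regular}のこと？"


def respond(phrase: str) -> Optional[str]:
    """Return the response for a regular or approximate phrase, else None."""
    if phrase in RESPONSES:
        return RESPONSES[phrase]
    regular = VARIANT_TO_REGULAR.get(phrase)
    if regular is None:
        return None
    # Shared prefix with the regular phrase inserted, then the regular response.
    return SHARED_TEMPLATE.format(regular=regular) + RESPONSES[regular]
```

Only one template string is stored for all approximate phrases, which is the storage saving the paragraph above refers to.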

［実施の形態４］ [Fourth Embodiment]
上記実施の形態１〜３における学習処理は、不明なフレーズを２回連続して特定したときに実行する例について説明したが、不明なフレーズが所定頻度で特定されることにより実行されるものであればこれに限るものではない。学習処理は、たとえば、音声認識結果ＤＢにおける直近１０回の履歴のうちで、不明な同一フレーズが３回特定されることにより実行されるようにしてもよい。また、回数だけでなく、１回目と２回目の間隔が１分以内といった期間での判定としてもよい。 In the first to third embodiments above, an example was described in which the learning process is executed when an unknown phrase is specified twice in succession. However, the learning process is not limited to this as long as it is executed when an unknown phrase is specified at a predetermined frequency. For example, the learning process may be executed when the same unknown phrase is specified three times within the latest 10 entries of the history in the voice recognition result DB. Besides the count, the condition may also be based on a time period, for example that the interval between the first and second occurrences is within one minute.
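
The two alternative learning conditions mentioned above could be expressed as follows. This is a sketch under assumptions: the function names, the default values mirroring the examples in the text (3 occurrences within the latest 10 entries, 60 seconds), and the use of plain second-based timestamps are all choices made for illustration.

```python
from typing import List


def learn_by_count(
    history: List[str], phrase: str, window: int = 10, threshold: int = 3
) -> bool:
    """True when `phrase` occurs at least `threshold` times in the latest
    `window` entries of the recognition history (window must be positive)."""
    return history[-window:].count(phrase) >= threshold


def learn_by_interval(
    first_time: float, second_time: float, max_seconds: float = 60.0
) -> bool:
    """True when the second occurrence follows the first within `max_seconds`."""
    return 0.0 <= second_time - first_time <= max_seconds
```

For instance, with the history `["A", "X", "B", "X", "C", "X"]` the count-based condition holds for "X" but not for "A".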

[Other]
Embodiments 1 to 4 described examples in which the response phrase is determined by either the terminal 100 or the server device 200. However, the present invention is not limited to this: the terminal 100 may determine a response phrase while the server device 200 also determines one. In that case, the terminal 100 also acquires from the server device 200 a response phrase for the user's voice. From that response phrase and the one it determined itself, the terminal 100 selects and outputs the response phrase with the higher information importance (response level).
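The selection between the terminal's own response phrase and the one obtained from the server can be sketched as below. The `Candidate` type, the integer `level` field, and the tie-breaking rule (the terminal's own candidate wins a tie) are assumptions made for illustration; the source only states that the phrase with the higher response level is chosen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    phrase: str  # response phrase to output
    level: int   # response level (importance of the information)


def select_response(local: Optional[Candidate],
                    remote: Optional[Candidate]) -> Optional[Candidate]:
    """Pick the candidate with the higher response level.

    `local` is the terminal's own candidate and `remote` is the server's.
    Either may be None if that side produced nothing. On a tie the local
    candidate is kept (an assumption, not stated in the source).
    """
    candidates = [c for c in (local, remote) if c is not None]
    if not candidates:
        return None
    # max() returns the first maximal element, so `local` wins ties.
    return max(candidates, key=lambda c: c.level)
```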

Embodiments 1 to 4 described examples in which, when an approximate phrase is identified, the response phrase corresponding to that approximate phrase is output. However, the present invention is not limited to this. When an approximate phrase is identified at a predetermined frequency (for example, twice in succession), the identified phrase may be deemed to be a regular phrase, and a response phrase for the identified phrase may be learned.

In Embodiments 1 to 4, the process of outputting a response phrase and the learning process were given as examples of the response process corresponding to a phrase identified from voice. However, any process associated with the phrase in advance may be used, such as a process of driving the terminal 100 in a predetermined manner or a process of capturing an image with the camera 164.

[Summary]
The main processes described above, and the advantages they provide, are described below.

(1) When the phrase identified from the voice of the user 700 is one of the phrases stored in the response phrase DB or the learning result DB, the terminal 100 executes a process of outputting the response phrase corresponding to that phrase. When the phrase is an unknown phrase that is not stored, and that unknown phrase has been identified at a predetermined frequency, the terminal 100 executes a learning process so that a response phrase corresponding to the unknown phrase can be output thereafter. Thus, the learning process is not executed immediately when an unknown phrase is identified. It is executed once the predetermined frequency is reached, on the presumption that the user is uttering the unknown phrase intentionally. As a result, the terminal can respond to a wider range of user utterances in the future while avoiding inappropriate asking back and learning.

(2) The terminal 100 has a response phrase DB that stores response phrases corresponding to a plurality of types of expected phrases and to approximate phrases, and a learning result DB in which response phrases corresponding to phrases are updated and stored by the learning process. This minimizes the time from when the user utters a voice until the response is given.

(3) The predetermined frequency that triggers the learning process is the frequency reached when an unknown phrase that is not stored is identified twice in succession. Because no special voice command or operation is needed to start the learning process, the hurdle to learning is lowered. As a result, learning takes place more often.

(4) When an unknown phrase that is not stored is identified, a response such as "I couldn't hear you well." is output, as indicated by M06 in FIG. 2. This prompts the user to speak again.

(5) The learning process includes a process of prompting the user to utter a response phrase for the unknown phrase that triggered the learning (M08 in FIG. 2), and a process of storing the phrase identified from the user's utterance, as is, as the response phrase corresponding to the unknown phrase (S13 in FIG. 2). This allows any phrase to be stored as a response phrase.

(6) When the identified phrase is the learning start phrase, a response phrase corresponding to a phrase can be learned thereafter. This allows learning to be carried out actively at the user's own initiative.

(7) When the phrase identified from the voice of the user 700 matches a regular phrase stored in the response phrase DB or the learning result DB, the terminal 100 executes a process of outputting the response phrase corresponding to that regular phrase. When the phrase approximates a regular phrase stored in the response phrase DB or the learning result DB, the terminal 100 executes a process of outputting the response phrase prepared for the approximate case. Thus, a response is possible even when no response phrase has been prepared for the exact phrase identified from the voice of the user 700. As a result, a shortage of response phrases can be compensated for.

(8) In the embodiments that have the approximation determination unit, that unit determines whether the phrase identified from the voice of the user 700 approximates any of the regular phrases. When the phrase identified from the voice of the user 700 approximates any of the regular phrases, the utterance content determination unit outputs the response phrase stored for the approximate case. Because response phrases are prepared both for the case of matching a regular phrase and for the case of approximating one, the device can respond to a wide range of user utterances.

(9) In the embodiments that do not have the approximation determination unit, a partial phrase contained in a regular phrase is deemed to be a phrase approximating that regular phrase. Then, as shown in the response phrase DB of FIG. 9, a response phrase corresponding to the regular phrase and a response phrase corresponding to the partial phrase contained in that regular phrase are prepared. This allows the device to respond to a wide range of user utterances while reducing the processing load.

(10) Among the regular phrases, for example, the response phrase for the case of approximating "Shincho" and the response phrase for the case of approximating "Taiju" contain a common phrase such as "Could it be".

When the phrase approximates "Shincho", the common "Could it be" and the response phrase corresponding to "Shincho" are used to output, for example, "Could it be that you mean Shincho? My height is about 19 cm." Compared with preparing a different response phrase for each approximate phrase, this reduces the storage capacity needed for response phrases.

(11) The communication system includes the server device 200 and the terminal 100, which can communicate with the server device 200. The terminal 100 in Embodiments 2 and 3 transmits voice information corresponding to the voice from the user (for example, phrase data identified from the voice recognition result, or voice data), and then executes a process of outputting a response phrase based on the response information (response data) transmitted from the server device 200.

Meanwhile, when the phrase identified from the voice information from the terminal 100 is one of the phrases stored in the response phrase DB or the learning result DB, the server device 200 executes a process of outputting the response information corresponding to that phrase. When the phrase is an unknown phrase that is not stored, and that unknown phrase has been identified at a predetermined frequency, the server device 200 executes a learning process so that response information corresponding to the unknown phrase can be output thereafter. Thus, the learning process is not executed immediately when an unknown phrase is identified. It is executed once the predetermined frequency is reached, on the presumption that the user is uttering the unknown phrase intentionally. As a result, the system can respond to a wider range of user utterances in the future while avoiding inappropriate asking back and learning.

In addition, when the phrase identified from the voice information from the terminal 100 matches a phrase stored in the response phrase DB or the learning result DB, the server device 200 executes a process of outputting the response information corresponding to that phrase. When the phrase approximates a phrase stored in the response phrase DB or the learning result DB, the server device 200 executes a process of outputting the response information corresponding to the approximated phrase. Thus, a response is possible even when no response information has been prepared for the exact phrase identified from the user's voice. As a result, a shortage of response information can be compensated for.
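The behaviour summarized in items (1), (4), (5), (7) and (9) can be combined into one small dispatch sketch. Everything named here is hypothetical: the class `ResponseController`, the fragment "Shin", the unknown phrase "Ohayo", and the placeholder learning prompt (the wording of M08 is not given in this section). What the device says right after storing a learned response is also not covered here, so the sketch returns an empty string at that point.

```python
class ResponseController:
    RETRY = "I couldn't hear you well."      # response M06 in the source
    LEARN_PROMPT = "<learning prompt M08>"   # placeholder; wording assumed unknown

    def __init__(self, regular, partial):
        self.regular = dict(regular)   # response phrase DB: regular phrase -> response
        self.partial = dict(partial)   # partial phrase -> response for approximation
        self.learned = {}              # learning result DB
        self.last_unknown = None       # previous unknown phrase, if any
        self.learning_target = None    # unknown phrase currently being learned

    def handle(self, phrase):
        # Learning step: store the utterance itself as the response phrase.
        if self.learning_target is not None:
            self.learned[self.learning_target] = phrase
            self.learning_target = None
            self.last_unknown = None
            return ""
        # Exact match against the response phrase DB or the learning result DB.
        for db in (self.regular, self.learned):
            if phrase in db:
                self.last_unknown = None
                return db[phrase]
        # Not a regular phrase, but it contains a stored partial phrase.
        for fragment, response in self.partial.items():
            if fragment in phrase:
                self.last_unknown = None
                return response
        # Unknown phrase: learn only when it is heard twice in succession.
        if phrase == self.last_unknown:
            self.learning_target = phrase
            return self.LEARN_PROMPT
        self.last_unknown = phrase
        return self.RETRY


controller = ResponseController(
    {"Shincho": "My height is about 19 cm."},
    {"Shin": "Could it be that you mean Shincho? My height is about 19 cm."},
)
for utterance in ["Shincho", "Shinchu", "Ohayo", "Ohayo", "Good morning", "Ohayo"]:
    print(repr(utterance), "->", repr(controller.handle(utterance)))
```

The partial-phrase lookup stands in for the variant without an approximation determination unit. A variant with such a unit would replace the substring test with a similarity check, whose metric this section does not specify.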

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined not by the above description but by the claims, and is intended to include all modifications within the meaning and scope equivalent to the claims.

1: communication system; 100: communication terminal; 111, 211: control unit; 112, 212: storage unit; 115: drive unit; 116: voice input unit; 117: voice output unit; 118, 213: communication processing unit; 119: display unit; 158: GPS receiver; 162: microphone; 163: speaker; 164: camera; 165: drive device; 200: server device; 500: base station; 600: network; 700: user; 1110: voice recognition unit; 1111, 2111: utterance content determination unit; 1112, 2112: approximation determination unit; 1113, 2113: learning function unit; 1114: drive control unit; 1115: display control unit; 1121, 2121: response phrase DB; 1122, 2122: voice recognition result DB; 1123, 2123: learning result DB; 1181, 2131: transmission unit; 1182, 2132: reception unit.

Claims

A response control device comprising:
voice receiving means for receiving voice input; and
response process executing means for executing, when a phrase identified from the voice received by the voice receiving means is one of a plurality of predetermined types of phrases, a phrase response process corresponding to that phrase, wherein
the response process executing means executes, when the phrase identified from the voice received by the voice receiving means approximates any one of the plurality of types of phrases, an approximation response process corresponding to the approximated phrase,
the response control device further comprises storage means for storing information capable of specifying the phrase response process and the approximation response process corresponding to each of the plurality of types of phrases,
the response process executing means executes, based on the information stored in the storage means, the phrase response process or the approximation response process corresponding to the phrase identified from the voice received by the voice receiving means,
the response control device further comprises approximation determination means for determining whether the phrase identified from the voice received by the voice receiving means approximates any one of the plurality of types of phrases,
the storage means stores the approximation response process in association with phrases that approximate each of the plurality of types of phrases, and
the response process executing means executes, when the approximation determination means determines that the phrase approximates any one of the plurality of types of phrases, the approximation response process stored in the storage means in correspondence with the approximated phrase.

A response control device comprising:
voice receiving means for receiving voice input; and
response process executing means for executing, when a phrase identified from the voice received by the voice receiving means is one of a plurality of predetermined types of phrases, a phrase response process corresponding to that phrase, wherein
the response process executing means executes, when the phrase identified from the voice received by the voice receiving means approximates any one of the plurality of types of phrases, an approximation response process corresponding to the approximated phrase,
the response control device further comprises storage means for storing information capable of specifying the phrase response process and the approximation response process corresponding to each of the plurality of types of phrases,
the response process executing means executes, based on the information stored in the storage means, the phrase response process or the approximation response process corresponding to the phrase identified from the voice received by the voice receiving means,
the storage means stores the approximation response process in association with a partial phrase included in each of the plurality of types of phrases, and
the response process executing means executes, when the phrase identified from the voice received by the voice receiving means is none of the plurality of types of phrases and includes the partial phrase, the approximation response process stored in the storage means in correspondence with that partial phrase.

The response control device according to claim 1, wherein
the plurality of types of phrases include a first phrase and a second phrase, and
the approximation response process corresponding to the first phrase and the approximation response process corresponding to the second phrase include a common process.

The response control device according to any one of claims 1 to 3, wherein the response process executing means executes, when the phrase identified from the voice received by the voice receiving means approximates any one of the plurality of types of phrases, not only the approximation response process corresponding to the approximated phrase but also the phrase response process corresponding to that phrase.

A control program for causing a computer to function as the response control device according to any one of claims 1 to 4, the control program causing the computer to function as each of the means described above.

An information processing method comprising:
a step of receiving voice input;
a step of executing, when a phrase identified from the received voice is one of a plurality of predetermined types of phrases, a phrase response process corresponding to that phrase; and
a step of executing, when the phrase identified from the received voice approximates any one of the plurality of types of phrases, an approximation response process corresponding to the approximated phrase, wherein
the step of executing the phrase response process includes executing the phrase response process based on information stored in storage means that stores information capable of specifying the phrase response process and the approximation response process corresponding to each of the plurality of types of phrases, and
the step of executing the approximation response process includes:
determining whether the phrase identified from the voice received in the step of receiving voice input approximates any one of the plurality of types of phrases; and
executing, when the phrase is determined to approximate any one of the plurality of types of phrases, the approximation response process stored in the storage means in correspondence with the approximated phrase.

An information processing method comprising:
a step of receiving voice input;
a step of executing, when a phrase identified from the received voice is one of a plurality of predetermined types of phrases, a phrase response process corresponding to that phrase; and
a step of executing, when the phrase identified from the received voice approximates any one of the plurality of types of phrases, an approximation response process corresponding to the approximated phrase, wherein
the step of executing the phrase response process includes executing the phrase response process based on information stored in storage means that stores information capable of specifying the phrase response process corresponding to each of the plurality of types of phrases and the approximation response process corresponding to a partial phrase included in each of the plurality of types of phrases, and
the step of executing the approximation response process includes executing, when the phrase identified from the voice received in the step of receiving voice input is none of the plurality of types of phrases and includes the partial phrase, the approximation response process stored in the storage means in correspondence with that partial phrase.

A communication system comprising a server and a response control device capable of communicating with the server, wherein
the response control device includes:
voice receiving means for receiving voice input;
communication means for transmitting voice information corresponding to the voice received by the voice receiving means and for receiving response information from the server; and
response process executing means for executing a response process based on the received response information,
the server includes:
storage means for storing phrase response information and approximation response information as response information corresponding to each of a plurality of predetermined types of phrases;
response information transmitting means for transmitting, when a phrase identified from the voice information from the response control device is one of the plurality of types of phrases, the phrase response information corresponding to that phrase as the response information; and
approximation response information transmitting means for transmitting, when the phrase identified from the voice information from the response control device approximates any one of the plurality of types of phrases, the approximation response information corresponding to the approximated phrase as the response information,
the storage means stores the approximation response information in association with phrases that approximate each of the plurality of types of phrases,
the server further includes approximation determination means for determining whether the phrase identified from the voice information from the response control device approximates any one of the plurality of types of phrases, and
the response process executing means transmits, when the approximation determination means determines that the phrase approximates any one of the plurality of types of phrases, the approximation response information stored in the storage means in correspondence with the approximated phrase as the response information.

A communication system comprising a server and a response control device capable of communicating with the server, wherein
the response control device includes:
voice receiving means for receiving voice input;
communication means for transmitting voice information corresponding to the voice received by the voice receiving means and for receiving response information from the server; and
response process executing means for executing a response process based on the received response information,
the server includes:
storage means for storing phrase response information and approximation response information as response information corresponding to each of a plurality of predetermined types of phrases;
response information transmitting means for transmitting, when a phrase identified from the voice information from the response control device is one of the plurality of types of phrases, the phrase response information corresponding to that phrase as the response information; and
approximation response information transmitting means for transmitting, when the phrase identified from the voice information from the response control device approximates any one of the plurality of types of phrases, the approximation response information corresponding to the approximated phrase as the response information,
the storage means stores the approximation response process in association with a partial phrase included in each of the plurality of types of phrases, and
the approximation response information transmitting means transmits, when the phrase identified from the voice information from the response control device is none of the plurality of types of phrases and includes the partial phrase, the approximation response information stored in the storage means in correspondence with that partial phrase as the response information.