JP2020085942A

JP2020085942A - Information processing apparatus, information processing method, and program

Info

Publication number: JP2020085942A
Application number: JP2018214995A
Authority: JP
Inventors: 彰浩各務; Akihiro Kagami
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2018-11-15
Filing date: 2018-11-15
Publication date: 2020-06-04

Abstract

PROBLEM TO BE SOLVED: To allow a user to efficiently practice speaking in order to improve the recognition rate of instructions given by voice input. SOLUTION: An information processing apparatus outputs, via a voice output unit (15), content that a user is requested to speak, compares the user's voice acquired via a voice information acquisition unit (11) with the content that the user was requested to speak, and switches the advice given to the user according to the result of the comparison. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing device, an information processing method, and a program.

In recent years, technology that recognizes speech uttered by a user and responds automatically has come into general use. Because an inappropriate manner of speaking on the user's part often leads to misrecognition, information processing devices that advise the user on how to speak have been provided (see, for example, Patent Document 1).

Patent Document 1: JP 2008-122483 A (published May 29, 2008)

The conventional technique described above is configured to determine, when the user gives an instruction to cancel an input, whether or not to guide the user on how to speak. However, to use a voice-recognition-based automatic response system efficiently, the user needs to speak in a manner that is easily recognized correctly, and acquiring such a manner of speaking has required skill.

One aspect of the present invention has been made in view of the above circumstances, and its object is to provide a technique that enables a user to efficiently practice a manner of speaking that improves the recognition rate of instructions given by voice input.

In order to solve the above problem, an information processing device according to one aspect of the present invention includes a voice information acquisition unit, an output unit, and a control unit, wherein the control unit outputs, via the output unit, content that a user is requested to speak, compares the user's voice acquired via the voice information acquisition unit with the content that the user was requested to speak, and switches the advice given to the user according to the result of the comparison.

According to one aspect of the present invention, a user can efficiently practice a manner of speaking that improves the recognition rate of instructions given by voice input.

FIG. 1 is a block diagram showing the main configuration of an information processing device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram schematically showing the outline configuration of an information processing system.
FIG. 3 is a diagram showing the classification of misrecognition patterns.
FIG. 4 is a diagram showing examples of misrecognition patterns.
FIG. 5 is a diagram showing examples of advice.
FIG. 6 is a flowchart showing the flow of processing of the information processing device.
FIG. 7 is a flowchart showing the flow of processing of the information processing device.
FIG. 8 is a block diagram showing the main configuration of an information processing device according to Embodiment 2 of the present invention.
FIG. 9 is a block diagram illustrating the configuration of a computer that can be used as the information processing device.

[Embodiment 1]
Embodiment 1 of the present invention is described in detail below. FIG. 1 is a block diagram showing a schematic configuration of an information processing device 10 according to Embodiment 1 of the present invention. FIG. 2 is a diagram schematically showing an overview of the overall configuration of the information processing device 10 according to the present embodiment.

As shown in FIGS. 1 and 2, the information processing device 10 is a device that can hold a spoken conversation with a user by acquiring the user's uttered voice as voice information and outputting, by voice, a reply corresponding to the acquired voice information. The information processing device 10 also supports voice operation: in response to the user's voice instructions, it operates functions that the information processing device 10 itself can provide, such as information retrieval and music playback, as well as various functions of peripheral devices such as a television, an air conditioner, and lighting.

(Configuration of Information Processing Device 10)
The information processing device 10 includes a control unit 20, a voice information acquisition unit 11, a voice output unit (output unit) 15, and a storage unit 30. The information processing device 10 is, for example, a smartphone or a robot-type portable terminal device. When it is a robot-type portable terminal device, the information processing device 10 further includes one or more drive units 35 that drive parts of the robot such as the limbs, torso, head, light emitting unit, and vibrator.

The control unit 20 is an arithmetic device having a function of comprehensively controlling each unit of the information processing device 10. The control unit 20 controls each component of the information processing device 10 by, for example, having one or more processors (such as a CPU) execute a program stored in one or more memories (such as RAM or ROM).

The voice information acquisition unit 11 acquires voice information based on the voice uttered by the user. The voice information acquisition unit 11 may include a sound collecting device such as a microphone that collects ambient sound, and may acquire voice information generated from the collected sound. The voice information acquisition unit 11 provides the acquired voice information to the control unit 20. The voice information acquisition unit 11 may provide the collected sound to the control unit 20 as voice information without modification. Alternatively, the voice information acquisition unit 11 may provide the control unit 20 with data obtained by converting the collected sound into sound waves as the voice information.

The voice output unit 15 includes a speaker that converts an audio signal into physical vibrations in the range of sound waves audible to the human ear and outputs them as sound. The voice output unit 15 is controlled by the control unit 20 and has a function of outputting, as voice, a reply to the user corresponding to the voice information acquired by the voice information acquisition unit 11.

The storage unit 30 is storage that holds various data used by the control unit 20. The storage unit 30 is realized, for example, by any one of rewritable nonvolatile memories such as EPROM, EEPROM (registered trademark), HDD, and flash memory, or by a combination of one or more of them.

As shown in FIG. 2, a user who wants to practice a manner of speaking suited to voice instructions for voice operation says to the information processing device 10, for example, “I want to practice voice input.” The information processing device 10 then outputs by voice, as practice 1, a practice phrase such as “Turn on the TV.”

The user repeats the practice phrase output by the information processing device 10, “Turn on the TV,” imitating the voice output by the information processing device 10. The information processing device 10 then outputs a spoken response to the user's manner of speaking. If the user was able to say the practice phrase in a manner suited to voice instructions for voice operation, the information processing device 10 outputs a spoken response such as “Perfect. Keep speaking just like that.” If the user was not able to do so, the information processing device 10 outputs a spoken response that gives advice according to the user's manner of speaking, as described in detail later.

In this way, in response to the user's request “I want to practice voice input,” the control unit 20 of the information processing device 10 determines content that the user is requested to speak, the content relating to a voice operation.

(Configuration of Control Unit 20)
The control unit 20 includes a voice recognition unit 21, an utterance determination unit 22, an answer determination unit 23, and a voice synthesis unit 24. In an information processing device 10 that is a robot-type portable terminal device, the control unit 20 further includes a drive control unit 25 that controls the drive units 35.

The voice recognition unit 21 performs voice recognition on the voice information that is based on the user's uttered voice and was acquired by the voice information acquisition unit 11. The voice recognition unit 21 converts the voice information into text by, for example, voice recognition using the waveform data of the voice information.

The voice recognition unit 21 also performs voice recognition on the voice information based on the user's uttered voice and detects that the user has instructed the start of voice input practice.

The utterance determination unit 22 identifies the content of the user's utterance by text matching between the information on the user's utterance converted into text by the voice recognition unit 21 and a static or dynamic text dictionary stored in advance in the storage unit 30. The utterance determination unit 22 identifies the content of the user's utterance using a conventionally known technique such as edit distance.
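
As a rough illustration of the text matching described above, the following Python sketch identifies an utterance by edit distance against a dictionary. The names `edit_distance` and `identify_utterance`, the list-based dictionary, and the `max_distance` threshold are assumptions made for this example; the specification states only that a known technique such as edit distance is used.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def identify_utterance(recognized: str,
                       dictionary: Sequence[str],
                       max_distance: int = 2) -> Optional[str]:
    """Return the dictionary entry closest to the recognized text,
    or None when nothing is within max_distance (hypothetical threshold)."""
    best = min(dictionary,
               key=lambda entry: edit_distance(recognized, entry),
               default=None)
    if best is None or edit_distance(recognized, best) > max_distance:
        return None
    return best
```

A small threshold keeps slightly misrecognized instructions matchable while rejecting unrelated speech; the appropriate value would depend on the dictionary actually used.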

According to the identified content of the user's utterance, the utterance determination unit 22 determines the wording to be uttered to the user by referring to utterance examples stored in advance in the storage unit 30. When the voice recognition unit 21 detects that the user has instructed the start of voice input practice, the utterance determination unit 22 determines the content that the user is requested to speak for the practice. The utterance determination unit 22 may determine the practice phrase by referring to a collection of example phrases for voice input practice stored in advance in the storage unit 30. The utterance determination unit 22 may also be able to determine the practice phrase by referring to the history of the user's instructions by voice operation. For example, the utterance determination unit 22 may determine the practice phrase by referring to the history of misrecognition in the user's past instructions by voice operation.
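
One possible way to select a practice phrase from the misrecognition history is sketched below in Python. The function name `choose_practice_phrase` and the policy of picking the most frequently misrecognized phrase are assumptions for illustration; the specification says only that the history may be referred to.

```python
from collections import Counter
from typing import Optional, Sequence


def choose_practice_phrase(example_phrases: Sequence[str],
                           misrecognized_history: Sequence[str] = ()) -> Optional[str]:
    """Pick a phrase for the user to practice.

    Assumed policy: prefer the phrase misrecognized most often in the past;
    otherwise fall back to the first entry of the stored example collection.
    """
    if misrecognized_history:
        phrase, _count = Counter(misrecognized_history).most_common(1)[0]
        return phrase
    return example_phrases[0] if example_phrases else None
```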

In addition, as described in detail later, the utterance determination unit 22 switches the content of the advice uttered to the user according to the determination result of the answer determination unit 23, by referring to the advice examples 32 stored in the storage unit 30.

In this way, the control unit 20 of the information processing device 10 can execute a voice operation according to the recognition result of the voice recognition unit 21. The control unit 20 can also use the function of the utterance determination unit 22 to determine, according to the recognition result of the voice recognition unit 21, the content to be uttered to the user, and can thereby communicate with the user by voice.

The answer determination unit 23 compares the content that the user is requested to speak, as determined by the utterance determination unit 22 and output, with the answer voice input by the user. The answer determination unit 23 compares the user's answer voice converted into text by the voice recognition unit 21 with the content that the user was requested to speak, and classifies the misrecognition pattern of the user's answer voice according to the differences between them.

The answer determination unit 23 refers to the misrecognition patterns stored in advance in the storage unit 30, compares the content that the user was requested to speak with the user's answer voice, and classifies the misrecognition pattern of the user's answer voice.

FIG. 3 is a diagram showing an example of a classification table of misrecognition patterns. As shown in FIG. 3, the answer determination unit 23 compares the text based on the user's answer voice with the content that the user was requested to speak, and classifies the result, for example, as follows.
・If the answer voice fully matches the requested content, it is classified as category A.
・If the answer voice is missing one character at the end of the requested content, it is classified as category B.
・If the answer voice is missing one character at the beginning of the requested content, it is classified as category C.
・If the answer voice is missing one character in the middle of the requested content, it is classified as category D.
・If one character of the answer voice is garbled relative to the requested content, it is classified as category E.
・If the answer voice is missing two characters of the requested content, it is classified as category F.
・If two characters of the answer voice are garbled relative to the requested content, it is classified as category G.
・If the answer voice is missing three or more characters, or has three or more garbled characters, relative to the requested content, it is classified as category H.
・If the answer voice does not match the requested content at all, it is classified as category I.
・If the answer voice cannot be identified, it is classified as category J.
FIG. 4 is a diagram showing, for the requested content “テレビをつけて” (“Turn on the TV”), the results of recognition of the user's answer voice by the voice recognition unit 21 and the results of determination by the answer determination unit 23. When the voice recognition result for the user's answer voice is “テレビをつけて”, the answer determination unit 23 determines category A, because the answer voice fully matches the requested content.

When the voice recognition result for the user's answer voice is “テレビをつけ”, the answer determination unit 23 determines category B, because the answer voice is missing one character at the end of the requested content. When the result is “レビをつけて”, the answer determination unit 23 determines category C, because the answer voice is missing one character at the beginning of the requested content. When the result is “テレビつけて”, the answer determination unit 23 determines category D, because the answer voice is missing one character in the middle of the requested content. When the result is “テレビにつけて”, the answer determination unit 23 determines category E, because one character of the answer voice is garbled relative to the requested content. When the result is “テレつけて” or “手をつけて”, the answer determination unit 23 determines category F, because the answer voice is missing two characters of the requested content.
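
The classification into categories A to J can be sketched in Python as follows. This is only an illustrative approximation: the function name `classify_answer`, the use of `difflib.SequenceMatcher` to align the two texts, the treatment of extra characters as garbled characters, and the assignment of mixed cases (both missing and garbled characters) to category H are all assumptions not stated in the specification. The sketch compares raw characters, so a result such as “手をつけて”, which the description above places in category F, falls into category H here.

```python
from difflib import SequenceMatcher
from typing import Optional


def classify_answer(requested: str, recognized: Optional[str]) -> str:
    """Return a category letter (A to J) for a recognized answer text."""
    if not recognized:
        return "J"  # assumption: an empty result means the voice was not identified
    if recognized == requested:
        return "A"

    matcher = SequenceMatcher(None, requested, recognized, autojunk=False)
    if sum(block.size for block in matcher.get_matching_blocks()) == 0:
        return "I"  # no character in common with the requested content

    missing = garbled = 0
    missing_at = None  # (start, end) of the last gap in the requested text
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        garbled += j2 - j1  # substituted or extra characters (assumption)
        gap = max(0, (i2 - i1) - (j2 - j1))
        if gap:
            missing += gap
            missing_at = (i1, i2)

    if garbled == 0:
        if missing == 1:
            start, end = missing_at
            if end == len(requested):
                return "B"  # one character missing at the end
            if start == 0:
                return "C"  # one character missing at the beginning
            return "D"      # one character missing in the middle
        if missing == 2:
            return "F"
    elif missing == 0:
        if garbled == 1:
            return "E"
        if garbled == 2:
            return "G"
    return "H"  # three or more differences, or a mixed case (assumption)
```

For example, `classify_answer("テレビをつけて", "テレビをつけ")` yields category B, and `classify_answer("テレビをつけて", "テレつけて")` yields category F.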

According to the category determined by the answer determination unit 23, the utterance determination unit 22 determines the content of the advice to be uttered to the user by referring to the advice examples 32, a list of advice for the user stored in the storage unit 30. FIG. 5 is a diagram showing the advice examples 32. As shown in FIG. 5, the utterance determination unit 22 selects, from the advice examples 32 stored in the storage unit 30, the wording of the advice for the user that corresponds to the category determined by the answer determination unit 23.

If the determination result of the answer determination unit 23 is category A, the utterance determination unit 22 refers to the advice examples 32 and switches the advice for the user to, for example, “Perfect. Keep speaking just like that.” If the determination result is category B, the utterance determination unit 22 refers to the advice examples 32 and switches the advice to, for example, “So close. Speak up clearly so that the end of the phrase does not trail off.” If the determination result is category C, the utterance determination unit 22 refers to the advice examples 32 and switches the advice to, for example, “So close. Start strongly so that the beginning of the phrase is not too quiet.” If the determination result is category D, the utterance determination unit 22 refers to the advice examples 32 and switches the advice to, for example, “So close. Part of it was not heard correctly. Speech that is too loud or too quiet may not be picked up.” If the determination result is category I, the utterance determination unit 22 refers to the advice examples 32 and switches the advice to, for example, “I could not hear you correctly. Move a little away from the microphone or clear your throat, then try again.” If the determination result is category J, the utterance determination unit 22 refers to the advice examples 32 and switches the advice to, for example, “I cannot hear your voice. Please move a little closer to the microphone and speak.”
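
The switching of advice by category amounts to a table lookup, which might look like the following Python sketch. The names `ADVICE_EXAMPLES`, `FALLBACK_ADVICE`, and `select_advice` are hypothetical. Only the categories whose wording is quoted above are filled in; the wording for categories E to H is not given in this description, so the fallback message is a placeholder assumption rather than text from the advice examples 32.

```python
# Advice wording quoted in the description (English rendering).
ADVICE_EXAMPLES = {
    "A": "Perfect. Keep speaking just like that.",
    "B": "So close. Speak up clearly so that the end of the phrase does not trail off.",
    "C": "So close. Start strongly so that the beginning of the phrase is not too quiet.",
    "D": ("So close. Part of it was not heard correctly. "
          "Speech that is too loud or too quiet may not be picked up."),
    "I": ("I could not hear you correctly. Move a little away from the "
          "microphone or clear your throat, then try again."),
    "J": "I cannot hear your voice. Please move a little closer to the microphone and speak.",
}

# Placeholder for categories whose wording is not quoted above (assumption).
FALLBACK_ADVICE = "Part of it was not heard correctly. Please try again."


def select_advice(category: str) -> str:
    """Return the advice wording for a category determined for the answer."""
    return ADVICE_EXAMPLES.get(category, FALLBACK_ADVICE)
```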

If the determination result of the answer determination unit 23 is category A, the utterance determination unit 22 may switch the advice for the user to “Perfect. Keep speaking just like that.” and also determine the next practice phrase that the user is requested to speak. If the determination result is not category A, the utterance determination unit 22 may switch the wording of the advice as appropriate and select the same practice phrase again as the content that the user is requested to speak, thereby prompting the user to speak again.
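
The choice between moving on to the next practice phrase and repeating the current one could be expressed as in the Python sketch below. The function name `next_phrase_index` and the convention of returning `None` once the phrase list is exhausted are assumptions; the description does not say what happens after the last phrase.

```python
from typing import Optional, Sequence


def next_phrase_index(phrases: Sequence[str], current: int, category: str) -> Optional[int]:
    """Return the index of the phrase to request next.

    Category A advances to the next phrase; any other category repeats
    the current phrase. None signals that no phrase is left (assumption).
    """
    if category != "A":
        return current
    following = current + 1
    return following if following < len(phrases) else None
```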

The voice synthesis unit 24 converts the text of the utterance wording determined by the utterance determination unit 22 into voice. The uttered voice produced by the voice synthesis unit 24 is output via the voice output unit 15.

The drive control unit 25 may move each drive unit 35 when the voice produced by the voice synthesis unit 24 is output, so that the user can enjoy practicing how to speak. The drive control unit 25 may drive each drive unit 35 when the information processing device 10, which is a robot-type portable terminal device, presents the user with advice on a manner of speaking suited to voice operation.

As described above, the control unit 20 of the information processing device 10 outputs, via the voice output unit 15, the content that the user is requested to speak, compares the user's voice acquired via the voice information acquisition unit 11 with the content that the user was requested to speak, and switches the advice for the user according to the result of the comparison. The user can practice voice input repeatedly while following the advice received from the information processing device 10. The user can thus practice efficiently in order to improve the recognition rate of voice input.

[Processing Flow of Information Processing Device 10]
FIG. 6 is a diagram showing the flow of processing of the information processing device 10 from receiving the user's instruction to start voice input practice until outputting the content that the user is requested to speak.

(Step S1)
The control unit 20 determines whether or not voice information based on the user's utterance has been acquired via the voice information acquisition unit 11. The control unit 20 continues to monitor whether voice information has been acquired until it determines that voice information has been acquired. When the control unit 20 determines that voice information has been acquired (YES in step S1), the process proceeds to step S2.

(Step S2)
The control unit 20 extracts text from the acquired voice information by the function of the voice recognition unit 21.

(Step S3)
The control unit 20 determines, by the function of the voice recognition unit 21, whether or not the acquired voice information is the user's instruction to start voice input practice. When the control unit 20 determines that the acquired voice information is the user's instruction to start voice input practice (YES in step S3), the process proceeds to step S4. When the control unit 20 determines that the acquired voice information is not such an instruction (NO in step S3), the process returns to step S1 and the control unit 20 continues to monitor whether voice information has been acquired.

(Step S4)
The control unit 20 determines a voice input practice phrase by the function of the utterance determination unit 22, and generates, by the function of the voice synthesis unit 24, the voice of the content that the user is requested to speak.

(Step S5)
The control unit 20 outputs, via the voice output unit 15, the voice of the content that the user is requested to speak.

The user practices voice input by listening to the voice of the content that the information processing device 10 requests the user to speak and uttering an answer voice that imitates it.
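
Steps S1 to S5 can be summarized as a loop, sketched here in Python. The function `run_until_practice_request` and all of its parameters are hypothetical stand-ins for the units described above (voice information acquisition unit 11, voice recognition unit 21, utterance determination unit 22, voice synthesis unit 24, and voice output unit 15); they are passed in as callables so that the sketch stays independent of any particular recognition or synthesis engine.

```python
from typing import Any, Callable, Iterable, Optional


def run_until_practice_request(
    audio_source: Iterable[Any],                 # S1: stream of acquired voice information
    recognize: Callable[[Any], str],             # S2: voice information to text
    is_practice_request: Callable[[str], bool],  # S3: start-of-practice check
    choose_phrase: Callable[[], str],            # S4: decide the practice phrase
    synthesize: Callable[[str], Any],            # S4: text to voice
    play: Callable[[Any], None],                 # S5: output the voice
) -> Optional[str]:
    """Wait for a practice request, then speak a practice phrase.

    Returns the requested phrase, or None if the audio source ends first.
    """
    for audio in audio_source:             # S1: keep monitoring for voice information
        text = recognize(audio)            # S2
        if not is_practice_request(text):  # S3: NO, go back to S1
            continue
        phrase = choose_phrase()           # S4
        play(synthesize(phrase))           # S4 and S5
        return phrase
    return None
```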

FIG. 7 is a diagram showing the flow of processing of the information processing device 10 from the user's input of an answer voice in response to the voice of the content that the user was requested to speak until the output, by voice, of advice on the answer voice.

(Step S11)
The control unit 20 determines whether or not an answer voice uttered by the user has been input via the voice information acquisition unit 11. The control unit 20 continues to monitor whether an answer voice has been input until it determines that an answer voice has been input. When the control unit 20 determines that an answer voice has been input (YES in step S11), the process proceeds to step S12.

(Step S12)
The control unit 20 extracts text from the input answer voice by the function of the voice recognition unit 21.

(Step S13)
The control unit 20 evaluates the input answer voice by the function of the answer determination unit 23.

(Step S14)
The control unit 20 determines, by the function of the utterance determination unit 22, advice corresponding to the user's answer voice according to the determination result of the answer determination unit 23.

(Step S15)
The control unit 20 synthesizes voice for the advice determined by the utterance determination unit 22, by the function of the voice synthesis unit 24.

(Step S16)
The control unit 20 outputs, via the voice output unit 15, the advice voice synthesized by the voice synthesis unit 24.

As a result, the user can receive appropriate advice on the answer voice uttered for voice input practice and, by speaking in accordance with the advice, can efficiently practice a manner of speaking that improves the recognition rate of instructions given by voice input.
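
Steps S12 to S16 form a simple pipeline once an answer voice has been input in step S11. The Python sketch below shows one way to wire it together; `respond_to_answer` and its callable parameters are hypothetical stand-ins for the voice recognition unit 21, the answer determination unit 23, the utterance determination unit 22, the voice synthesis unit 24, and the voice output unit 15.

```python
from typing import Any, Callable, Tuple


def respond_to_answer(
    answer_audio: Any,
    requested_phrase: str,
    recognize: Callable[[Any], str],      # S12: answer voice to text
    classify: Callable[[str, str], str],  # S13: (requested, recognized) to category
    select_advice: Callable[[str], str],  # S14: category to advice wording
    synthesize: Callable[[str], Any],     # S15: advice text to voice
    play: Callable[[Any], None],          # S16: output the advice voice
) -> Tuple[str, str]:
    """Process one answer voice and speak the matching advice.

    Returns the determined category and the advice wording.
    """
    text = recognize(answer_audio)               # S12
    category = classify(requested_phrase, text)  # S13
    advice = select_advice(category)             # S14
    play(synthesize(advice))                     # S15 and S16
    return category, advice
```

Returning the category as well as the advice lets the caller decide whether to repeat the same phrase or move on to the next one.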

[Embodiment 2]
Embodiment 2 of the present invention is described below. For convenience of description, members having the same functions as those described in Embodiment 1 are given the same reference numerals, and their description is not repeated. FIG. 8 is a block diagram showing the main configuration of an information processing device according to Embodiment 2. As shown in FIG. 8, the information processing device may include a terminal device 150 and a server 110, and may be configured so that the information processing described in Embodiment 1 is executed by the terminal device 150 and the server 110 in cooperation. The terminal device 150 is a terminal such as a smartphone or a robot-type portable terminal device. The terminal device 150 communicates with the server 110 and transmits, to the server 110, voice information based on the user's voice input to the terminal device 150. The terminal device 150 also receives from the server 110 the response to the user's voice input as processed by the server 110, and outputs the response by voice.

[Configuration of terminal device 150]
The terminal device 150 includes a communication unit 151, a control unit 160, a microphone 152, and a speaker 153. When the terminal device 150 is a robot-type mobile terminal device, it also includes the drive unit 35.

The communication unit 151 is configured to be able to communicate with other devices including the server 110, and includes a wireless communication circuit such as a Wi-Fi (registered trademark) circuit.

The control unit 160 is an arithmetic device having a function of integrally controlling each unit of the terminal device 150. The control unit 160 controls each component of the terminal device 150 by, for example, having one or more processors (such as a CPU) execute programs stored in one or more memories (such as a RAM and a ROM).

The microphone 152 is a sound collecting device that collects ambient sound.

The speaker 153 converts an audio signal into physical vibration in a sound wave range that the human ear can perceive, and outputs that vibration.

[Configuration of control unit 160]
The control unit 160 includes a voice information acquisition unit 161, a voice output unit 162, and a drive control unit 25.

The voice information acquisition unit 161 AD-converts the voice collected by the microphone 152 and transmits the resulting digitized voice information to the server 110 via the communication unit 151.

The voice output unit 162 DA-converts the voice information received from the server 110 via the communication unit 151 and outputs the result via the speaker 153.
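As a rough illustration of the AD and DA conversion performed by the voice information acquisition unit 161 and the voice output unit 162, the sketch below quantizes sample values into bytes and restores them. The specification does not state a sample format; little-endian signed 16-bit PCM, the normalized range [-1.0, 1.0], and the function names `ad_convert` and `da_convert` are all assumptions made for this example.

```python
import struct


def ad_convert(samples):
    """Quantize floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.

    Assumption: 16-bit PCM is used only for illustration.
    """
    # Clamp out-of-range values before scaling so they cannot overflow int16.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


def da_convert(pcm):
    """Restore normalized float samples from 16-bit PCM bytes."""
    count = len(pcm) // 2  # two bytes per sample; a trailing odd byte is ignored
    values = struct.unpack("<%dh" % count, pcm[:count * 2])
    return [v / 32767 for v in values]


pcm = ad_convert([0.0, 0.5, -1.0, 2.0])  # 2.0 is clamped to full scale
restored = da_convert(pcm)
```

The byte string `pcm` corresponds to the digitized voice information that would be sent through the communication unit 151, and `restored` to the signal handed to the speaker 153.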

[Configuration of server 110]
The server 110 includes a server communication unit 111, a server control unit 120, and a storage unit 30.

The server communication unit 111 is configured to be able to communicate with other devices including the terminal device 150, and includes a wireless communication circuit such as a Wi-Fi (registered trademark) circuit.

The server control unit 120 is an arithmetic device having a function of integrally controlling each unit of the server 110. The server control unit 120 controls each component of the server 110 by, for example, having one or more processors (such as a CPU) execute programs stored in one or more memories (such as a RAM and a ROM).

The server control unit 120 includes a voice recognition unit 21, an utterance determination unit 22, an answer determination unit 23, and a voice synthesis unit 24. The server control unit 120 acquires voice information based on the user's utterance from the terminal device 150 via the server communication unit 111. The server control unit 120 determines the content that the user is requested to speak and provides the voice of that content to the terminal device 150. The server control unit 120 evaluates the user's answer voice, switches the advice to the user according to the determination result, and provides the advice for the user to the terminal device 150.

As described above, the information processing device may be configured so that the terminal device 150 and the server 110 cooperate to provide a function that lets the user efficiently practice a way of speaking that improves the recognition rate of instructions given by voice input.
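The division of roles between the terminal device 150 and the server 110 can be sketched as below. This is a simplified in-process illustration under stated assumptions: the classes `Server` and `Terminal` are hypothetical, the network link is replaced by a direct function call, and recognition and synthesis are trivial placeholders that treat the voice information as UTF-8 text.

```python
from typing import Callable


class Server:
    """Stands in for server 110; every processing step is a placeholder."""

    def __init__(self, requested: str):
        self.requested = requested  # content the user was asked to speak

    def handle(self, voice_info: bytes) -> bytes:
        text = voice_info.decode("utf-8")                  # placeholder for unit 21
        matched = text.lower() == self.requested.lower()   # placeholder for unit 23
        advice = ("Good, that was recognized." if matched
                  else "Please try again more clearly.")   # placeholder for unit 22
        return advice.encode("utf-8")                      # placeholder for unit 24


class Terminal:
    """Stands in for terminal device 150; only forwards and outputs."""

    def __init__(self, send: Callable[[bytes], bytes]):
        self.send = send   # assumption: a direct call replaces the network link
        self.spoken = []   # what the speaker 153 would have output

    def on_voice(self, voice_info: bytes) -> None:
        response = self.send(voice_info)  # transmit to the server, receive the response
        self.spoken.append(response.decode("utf-8"))


server = Server("what is the weather today")
terminal = Terminal(send=server.handle)
terminal.on_voice(b"What is the weather today")
terminal.on_voice(b"what is the feather")
```

Because the terminal depends only on a `send` callable, the same terminal code could be pointed at a real network transport without changing its logic.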

[Embodiment 3]
The second embodiment describes an example using a single server 110, but each function of the server 110 may instead be realized by a separate server. When a plurality of servers is used, the servers may be managed by the same business operator or by different business operators.

[Embodiment 4]
Each block of the information processing device 10 may be realized by a logic circuit (hardware) formed on an integrated circuit (IC chip) or the like, or may be realized by software. In the latter case, the information processing device 10 can be configured using a computer (electronic computer) as shown in FIG. 9.

FIG. 9 is a block diagram illustrating the configuration of a computer 910 that can be used as the information processing device 10. The computer 910 includes an arithmetic device 912, a main storage device 913, an auxiliary storage device 914, an input/output interface 915, and a communication interface 916, which are connected to one another via a bus 911. The arithmetic device 912, the main storage device 913, and the auxiliary storage device 914 may be, for example, a processor (such as a CPU: Central Processing Unit), a RAM (random access memory), and a hard disk drive, respectively. The input/output interface 915 is connected to an input device 920 with which the user inputs various information to the computer 910 and an output device 930 with which the computer 910 outputs various information to the user. The input device 920 and the output device 930 may be built into the computer 910 or may be connected to it externally. For example, the input device 920 may be a keyboard, a mouse, a touch sensor, or the like, and the output device 930 may be a display, a printer, a speaker, or the like. A device having the functions of both the input device 920 and the output device 930, such as a touch panel in which a touch sensor and a display are integrated, may also be used. The communication interface 916 is an interface through which the computer 910 communicates with external devices.

The auxiliary storage device 914 stores various programs for operating the computer 910 as the information processing device 10. The arithmetic device 912 loads a program stored in the auxiliary storage device 914 onto the main storage device 913 and executes the instructions included in the program, thereby causing the computer 910 to function as each unit of the information processing device 10. The recording medium provided in the auxiliary storage device 914 for recording information such as programs may be any computer-readable "non-transitory tangible medium", for example a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit. If the computer can execute the program recorded on the recording medium without loading it onto the main storage device 913, the main storage device 913 may be omitted. Each of the above devices (the arithmetic device 912, the main storage device 913, the auxiliary storage device 914, the input/output interface 915, the communication interface 916, the input device 920, and the output device 930) may be provided singly or in plurality.

The program may also be acquired from outside the computer 910, in which case it may be acquired via any transmission medium (a communication network, a broadcast wave, or the like). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the program is embodied by electronic transmission.

[Summary]
An information processing device (10) according to aspect 1 of the present invention includes a voice information acquisition unit (11), a voice output unit (15), and a control unit (20), wherein the control unit (20) outputs, via the voice output unit (15), content requesting a user to speak, compares the user's voice acquired via the voice information acquisition unit (11) with the content that the user was requested to speak, and switches the advice to the user according to the result of the comparison.

According to the above configuration, the user can practice speaking in accordance with the requested utterance content. Because the control unit 20 compares the content it requested with the user's voice and switches the advice accordingly, it can give the user appropriate advice on a way of speaking that improves the recognition rate of instructions given by voice input. The user can also repeatedly practice the requested content in response to the advice, and can thus efficiently practice a way of speaking that improves the recognition rate of instructions given by voice input.
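One way to picture the comparison and the switching of advice is shown below. The specification does not say how the comparison is scored; the use of `difflib.SequenceMatcher` from the Python standard library, the thresholds 0.9 and 0.6, the advice wording, and the function names are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(requested: str, recognized: str) -> float:
    """Return a 0.0 to 1.0 similarity between the requested and recognized text."""
    return SequenceMatcher(None, requested.lower(), recognized.lower()).ratio()


def select_advice(requested: str, recognized: str,
                  good: float = 0.9, fair: float = 0.6) -> str:
    """Switch the advice according to the comparison result.

    The thresholds are hypothetical and would need tuning in practice.
    """
    score = similarity(requested, recognized)
    if score >= good:
        return "That was recognized correctly. Keep speaking this way."
    if score >= fair:
        return "Almost. Please pronounce each word to the end."
    return "That was hard to recognize. Please speak slowly and clearly."


full = select_advice("turn on the tv", "Turn on the TV")
partial = select_advice("turn on the tv", "turn on the")
missed = select_advice("turn on the tv", "")
```

Here a truncated utterance such as "turn on the" falls into the middle band, so the advice differs from both the fully recognized and the unrecognized cases.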

In an information processing device (10) according to aspect 2 of the present invention, in aspect 1 above, the content requesting the user to speak may be content related to a voice operation, and the control unit (20) may determine the content requesting the utterance according to the user's request.

According to the above configuration, the user can efficiently practice how to speak content related to a voice operation, and the recognition rate of instructions given by voice input for that voice operation can be improved.

In an information processing device (10) according to aspect 3 of the present invention, in aspect 1 or 2 above, the voice information acquisition unit (11) may include a microphone (152) that collects voice.

According to the above configuration, the user's voice can be collected accurately.

In an information processing device (10) according to aspect 4 of the present invention, in any one of aspects 1 to 3 above, the voice output unit (15) may include a speaker that outputs voice.

According to the above configuration, the content requesting the user to speak and the advice to the user can be output as sound in a sound wave range that is easy for the user to hear.

An information processing device (10) according to aspect 5 of the present invention, in any one of aspects 1 to 4 above, may further include one or more drive units (35), and the control unit (20) may drive the drive unit (35) when presenting the advice to the user.

According to the above configuration, in a robot-type information processing device 10, each of a plurality of drive units (35), such as limbs, a torso, a head, a light emitting unit, and a vibrator, can be driven so that the user can enjoy practicing a way of speaking that improves the recognition rate of instructions given by voice input.

The information processing device 10 according to each aspect of the present invention may be realized by a computer. In this case, a control program of the information processing device 10 that realizes the information processing device 10 on a computer by causing the computer to operate as each unit (software element) of the information processing device 10, and a computer-readable recording medium on which that program is recorded, also fall within the scope of the present invention.

The present invention is not limited to the embodiments described above, and various modifications can be made within the scope of the claims. Embodiments obtained by appropriately combining the technical means disclosed in different embodiments are also included in the technical scope of the present invention. Furthermore, new technical features can be formed by combining the technical means disclosed in the respective embodiments.

10 information processing device
11, 161 voice information acquisition unit
15, 162 voice output unit
20, 160 control unit
21 voice recognition unit
22 utterance determination unit
23 answer determination unit
24 voice synthesis unit
25 drive control unit
35 drive unit
152 microphone
153 speaker

Claims

An information processing apparatus comprising a voice information acquisition unit, an output unit, and a control unit,
wherein the control unit:
outputs, via the output unit, content requesting a user to speak;
compares the user's voice acquired via the voice information acquisition unit with the content that the user was requested to speak; and
switches advice to the user according to a result of the comparison.

The information processing apparatus according to claim 1, wherein the content requesting the user to speak is content relating to a voice operation, and
the control unit determines the content requesting the utterance according to a request of the user.

The information processing apparatus according to claim 1, wherein the voice information acquisition unit includes a microphone that collects voice.

The information processing apparatus according to claim 1, wherein the output unit includes a speaker that outputs a sound.

The information processing apparatus according to any one of claims 1 to 4, further comprising one or more drive units,
wherein the control unit drives the drive unit when presenting the advice to the user.

An information processing method comprising:
a step of outputting content requesting a user to speak;
a step of comparing the acquired voice of the user with the content that the user was requested to speak; and
a step of switching advice to the user according to a result of the comparison.

A program for causing a computer to function as the information processing apparatus according to claim 1, wherein the computer functions as the voice information acquisition unit and the control unit.