JP2017167600A

JP2017167600A - Terminal device

Info

Publication number: JP2017167600A
Application number: JP2016049342A
Authority: JP
Inventors: Hosei Matsuoka (松岡 保静)
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2016-03-14
Filing date: 2016-03-14
Publication date: 2017-09-21

Abstract

PROBLEM TO BE SOLVED: To facilitate the selection of functions associated with voice recognition processing. SOLUTION: A terminal device 10 includes: a detection unit 15 configured to detect an object approaching the terminal device and measure the period of time during which the object stays in proximity (proximity time); an input unit 16 for inputting voice; an execution unit 17 configured to execute functions associated with voice recognition processing on the voice input by the input unit 16; and a determination unit 18 configured to determine, based on the proximity time measured by the detection unit 15, the function associated with the voice recognition processing to be executed by the execution unit 17. SELECTED DRAWING: Figure 1

Description

The present invention relates to a terminal device.

Patent Document 1 discloses an electronic device including a proximity sensor and voice recognition means. In this device, when the proximity sensor detects that an object is close to the device, the device can be operated by voice recognition processing.

Japanese Translation of PCT International Application Publication No. 2003-501959

In terminal devices such as smartphones, various functions such as music playback and dialogue with the user are available through user operations that use the results of voice recognition processing. The result of the voice recognition processing can itself be obtained using a specific one of various recognition methods. It is therefore necessary to select a function that acquires the result of voice recognition processing by a specific recognition method, or a function that uses the result of voice recognition processing (hereinafter also referred to as a "function related to voice recognition processing").

For example, following the technique of Patent Document 1, the voice recognition processing of the terminal could be started using a proximity sensor, after which the user could select a function related to voice recognition processing by operating the terminal by voice. However, if a voice operation of the terminal is required for every selection, the user may find it bothersome.

The present invention has been made in view of the above problem, and an object thereof is to provide a terminal device with which a function related to voice recognition processing can be selected easily.

A terminal device according to one aspect of the present invention includes: detection means for detecting the proximity of an object to the terminal device and measuring the proximity time of the detected object; input means for inputting voice; execution means for executing a function related to voice recognition processing on the voice input by the input means; and determination means for determining, based on the proximity time measured by the detection means, the function related to voice recognition processing to be executed by the execution means.

According to the above terminal device, a function related to voice recognition processing is executed on the input voice. The function to be executed is determined based on the proximity time of the object. The user can therefore easily select a function related to voice recognition simply by keeping an object close to the terminal device for a predetermined time, for example by holding a hand over (bringing a hand close to) the terminal device.

The determination means may determine, as the function related to voice recognition processing, a function of acquiring the result of voice recognition processing by a recognition method corresponding to the proximity time. This makes it easy to select which recognition method is used to obtain the result of the voice recognition processing.

The determination means may determine, as the function related to voice recognition processing, a function that uses the result of the voice recognition processing. This makes it easy to select a function that uses the result of the voice recognition processing.

According to the present invention, a function related to voice recognition processing can be selected easily.

FIG. 1 is a diagram showing a schematic configuration of the terminal device.
FIG. 2 is a diagram showing a hardware configuration of the terminal device.
FIG. 3 is a flowchart showing an example of processing executed in the terminal device.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the description of the drawings, the same elements are denoted by the same reference numerals, and redundant description is omitted.

FIG. 1 is a diagram showing a schematic configuration of a terminal device according to the embodiment. The terminal device 10 is configured to be able to execute various functions available to the user. The terminal device 10 can also be called a voice operation device, because at least some of the functions it can execute involve operations using voice recognition processing. When the user of the terminal device 10 uses such a function, the terminal device 10 selects the functions related to voice recognition processing (a function of acquiring the result of voice recognition processing by a specific recognition method determined according to the proximity time described later, and a function that uses the result of the voice recognition processing) according to the principle described later, which uses a proximity sensor.

First, functions (applications) that use the result of voice recognition processing are described as functions related to voice recognition processing. Such applications can be used through user operations based on voice recognition processing. The user can use an application without performing operations that involve physical contact with the terminal device 10 (operations of a touch panel, buttons, and the like). The type of application is not particularly limited. In the present embodiment, an application that conducts a dialogue with the user (dialogue app) and applications that play back various content and provide it to the user (content providing apps) are described as examples. Content providing apps include an application that plays music (music app), an application that plays video (video app), an application that can present traffic information and interpret and is useful when traveling (travel app), an application that can present product information and handle purchase procedures and is useful when shopping (shopping app), and an application that presents weather information (weather app).

Next, a function of acquiring the result of voice recognition processing by a specific recognition method (recognition model), referred to as the acquisition function, is described as a function related to voice recognition processing. This function can also be described as a function of using different recognition models for voice recognition processing depending on the application being used (the music app, travel app, shopping app, and so on described above). A recognition model is a method that defines which language model (a model indicating how readily morphemes follow one another), dialogue model, and the like are used for voice recognition processing. When the music app is used, a recognition model (music recognition model) that uses a language model or the like designed to accurately recognize speech relating to song titles, singers' names, and so on is preferably used. When the travel app is used, a recognition model (travel recognition model) that uses a language model or the like designed to accurately recognize speech relating to accommodation facilities, tourist facilities, means of transportation, and so on is preferably used. When the shopping app is used, a recognition model (shopping recognition model) that uses a language model or the like designed to accurately recognize speech relating to products, stores, and so on is preferably used. When the dialogue app is used, a recognition model (dialogue recognition model) that uses a dialogue model suited to the user's attributes (nationality, age, sex, etc.) is preferably used. The dialogue model may be constructed by combining a language model and an acoustic model (a model indicating the correspondence between speech features, such as mel-frequency cepstral coefficients, and phonemes, that is, individual vowels and consonants) suited to the user's attributes.

An example of a usage scene for the functions related to voice recognition processing described above is now given. In this example, the music app described above is used with the result of voice recognition processing by a recognition model suited to the music app (the music recognition model). As a premise, as shown in FIG. 1, the terminal device 10 can communicate with a voice recognition server 20 and a content server 30 provided outside the terminal device 10.

First, the user of the terminal device 10 launches in advance, on the terminal device 10, an application (proximity app) for determining a function related to voice recognition processing using the proximity sensor 11 described later. For example, the proximity app is launched in response to an instruction given by a user operation using the touch panel, buttons, or the like of the terminal device 10. Alternatively, the proximity app may run constantly while the terminal device 10 is operating, regardless of whether such a user operation has been performed. By using the proximity app, the user can select and use a function related to voice recognition processing as described next.

The user of the terminal device 10 brings a hand close to the terminal device 10 (for example, holds a hand over it) for a predetermined time. As an example, the predetermined time is 0.5 seconds; in response, the terminal device 10 starts executing voice recognition processing using the music recognition model (wakes up the voice recognition function). The terminal device 10 also starts executing the music app (launches the music app). The user then uses the music app by voice operation. For example, the user utters speech to specify the type of music the user wishes to play. The speech for specifying a song may indicate the song title, the singer's name, or the like.

The terminal device 10 inputs (accepts) the voice uttered by the user and transmits the input voice to the voice recognition server 20. At that time, the terminal device 10 requests the voice recognition server 20 to use the music recognition model as the recognition model for the voice recognition processing. The voice recognition server 20 executes voice recognition processing on the received voice using the music recognition model and obtains the result of that processing (a song title or the like). The voice recognition server 20 transmits the voice recognition result to the terminal device 10.

The terminal device 10 executes the music app using the received (acquired) result of the voice recognition processing. The terminal device 10 requests the content server 30 to deliver various content. In this example, the terminal device 10 requests from the content server 30 the information (song data) needed to play the song specified by the result of the voice recognition processing. In response to the request from the terminal device 10, the content server 30 transmits (delivers) the song data to the terminal device 10.

The terminal device 10 plays the song using the received song data. If the user subsequently utters speech instructing that playback of the song be stopped, voice recognition processing is executed again on that speech. In response to the result of the voice recognition processing, that is, the instruction to stop playback, the terminal device 10 stops playing the song.

In this way, the user can select, as the function of acquiring the result of voice recognition processing by a specific recognition method (recognition model) (the acquisition function), the function of acquiring the result of voice recognition processing by the music recognition model, and can select and use the music app as the function (application) that uses voice recognition processing.

The terminal device 10 is described in detail below. As shown in FIG. 1, the terminal device 10 includes a proximity sensor 11, a microphone 12, a speaker 13, and a display 14. The terminal device 10 also includes, as functional blocks, a detection unit 15, an input unit 16, an execution unit 17, and a determination unit 18.

The proximity sensor 11 is used to detect the proximity of an object to the terminal device 10 (that is, the proximity of an object to the terminal device itself). The type of object is not particularly limited, but is assumed to be a part of the user's body such as a hand or finger. Accordingly, any of various known proximity sensors capable of detecting the proximity of a human hand, finger, or the like may be used as the proximity sensor 11. The position of the proximity sensor 11 is not particularly limited, provided that it allows detection of whether an object is close to the terminal device 10. For example, the proximity sensor 11 detects the proximity of an object to the terminal device 10 when the distance between the object and the terminal device 10 is shorter than a distance on the order of several millimeters to several centimeters.

The detection result of the proximity sensor 11 is sent to the detection unit 15 described later. The detection result may be information indicating whether an object is present close to the terminal device 10. The detection result is sent to the detection unit 15, for example, in real time or at a predetermined cycle while the proximity sensor 11 is operating.

The proximity sensor 11 does not operate (is OFF) while the proximity app described above is not running, and starts operating (turns ON) when the proximity app is launched. The proximity app is executed by the execution unit 17 described later.

The microphone 12, the speaker 13, and the display 14 are used to exchange information between the terminal device 10 and the user.

The microphone 12 detects sounds and voices generated around the terminal device 10. For example, the voice uttered by the user is detected by the microphone 12 and sent to the input unit 16 described later.

The speaker 13 is used to output sounds and voices. When the music app is used, songs are output from the speaker 13. When the dialogue app is used, speech for conducting a dialogue with the user is output from the speaker 13.

The display 14 is used to output images (including video). When the video app is used, video is output (displayed) on the display 14.

The types of the microphone 12, the speaker 13, and the display 14 are not particularly limited, and various known devices can be used. A touch panel may be used for the display 14.

The detection unit 15 is the part (detection means) that detects the proximity of an object to the terminal device 10 and measures the proximity time of the detected object. The detection unit 15 detects the proximity of an object to the terminal device 10 by receiving the detection result of the proximity sensor 11 described above. If the detection result received from the proximity sensor 11 is a state indicating that a nearby object is present (detection state), the detection unit 15 detects the proximity of an object to the terminal device 10. If the detection result is a state indicating that no nearby object is present (non-detection state), the detection unit 15 does not detect the proximity of an object to the terminal device 10.

The detection unit 15 measures the proximity time of an object, for example, as follows. The detection unit 15 measures, as the proximity time of the object, the time from when the object comes into proximity with the terminal device 10 until the object is no longer in proximity with the terminal device 10 (the proximity is released). When the detection result of the proximity sensor 11 is sent in real time or at a predetermined cycle as described above, the detection unit 15 can measure, as the proximity time, the time difference between the moment the detection result switches from the non-detection state to the detection state and the moment it switches from the detection state back to the non-detection state. For example, if the detection unit 15 has a timer function, the proximity time may be measured using that timer function. The measurement of the proximity time is described again later with reference to FIG. 3.
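The measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the class name `ProximityTimer`, the method `on_sample`, and the use of explicit timestamps in seconds are all hypothetical. The sketch treats the sensor output as a stream of boolean samples and reports the time between the switch into the detection state and the following switch back to the non-detection state.

```python
from typing import Optional


class ProximityTimer:
    """Hypothetical helper that measures how long an object stays near the device."""

    def __init__(self) -> None:
        self._detected = False                # last known sensor state
        self._start: Optional[float] = None   # time of the switch into the detection state

    def on_sample(self, detected: bool, timestamp: float) -> Optional[float]:
        """Feed one sensor sample; return the proximity time when proximity is released."""
        result = None
        if detected and not self._detected:
            # Non-detection -> detection: the object has come close.
            self._start = timestamp
        elif not detected and self._detected and self._start is not None:
            # Detection -> non-detection: the proximity has been released.
            result = timestamp - self._start
            self._start = None
        self._detected = detected
        return result


# Example stream: the object is near from t = 1.0 s until t = 1.5 s.
timer = ProximityTimer()
samples = [(False, 0.0), (True, 1.0), (True, 1.25), (False, 1.5)]
measured = [timer.on_sample(d, t) for d, t in samples]
```

Only the last sample yields a value, because the proximity time is known only once the object has moved away.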

The input unit 16 is the part (input means) that inputs the voice detected by the microphone 12. The input unit 16 starts inputting voice as the target of voice recognition processing in response to the detection unit 15 detecting the proximity of an object (in reaction to the detection result of the proximity sensor 11) or in response to the proximity time being measured. The input unit 16 may also execute, for example, filtering that removes noise from the input voice to improve the accuracy of voice recognition processing, and storage processing that temporarily stores the input voice information for use in voice recognition processing. The voice input to the input unit 16 is sent to the execution unit 17 described next and becomes the target of voice recognition processing.

The execution unit 17 is the part (execution means) that executes a function related to voice recognition processing on the voice input to the input unit 16. The function related to voice recognition processing that the execution unit 17 executes is determined by the determination unit 18.

The determination unit 18 is the part (determination means) that determines, based on the proximity time measured by the detection unit 15, the function related to voice recognition processing to be executed by the execution unit 17. Specifically, based on the proximity time, the determination unit 18 determines a function of acquiring the result of voice recognition processing by a specific recognition model such as the music recognition model, travel recognition model, or shopping recognition model (acquisition function), and a function (application) that uses the result of voice recognition processing, such as the music app, travel app, or shopping app. The determination result of the determination unit 18 is sent to the execution unit 17.

The determination of a function based on the proximity time is performed, for example, as follows. Information (a table) that associates a plurality of different proximity times with the respective functions is created in advance and stored in a storage unit (not shown) in the terminal device 10. Specifically, the table associates a plurality of different proximity times with recognition models and applications. By referring to such a table, the determination unit 18 can determine the functions related to voice recognition processing (the acquisition function and the application) that correspond to the proximity time.

As an example, when the proximity time is 0.5 seconds, the determination unit 18 may determine the acquisition function to be a function of acquiring the result of voice recognition processing by the music recognition model, the shopping recognition model, or the like described above. At the same time, the determination unit 18 may determine the function related to voice recognition processing (application) to be the music app, the shopping app, or the like. When the proximity time is 1 second, the determination unit 18 may determine the recognition model to be the dialogue recognition model described above. At the same time, the determination unit 18 may determine the application to be the dialogue app.
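The table lookup described above might look like the following sketch. The table contents follow the 0.5 second and 1 second example in the text, but everything else is an assumption made for illustration: the names `FunctionEntry`, `FUNCTION_TABLE`, and `determine_function` are hypothetical, and the rule of picking the nearest entry within a tolerance is not stated in the source, which does not say how a measured time is matched to a table entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FunctionEntry:
    proximity_s: float       # nominal proximity time for this entry, in seconds
    recognition_model: str   # model used to obtain the recognition result
    application: str         # application that uses the recognition result


# Hypothetical table contents, following the 0.5 s / 1 s example in the text.
FUNCTION_TABLE = (
    FunctionEntry(0.5, "music", "music_app"),
    FunctionEntry(1.0, "dialogue", "dialogue_app"),
)


def determine_function(proximity_s: float, tolerance_s: float = 0.2) -> Optional[FunctionEntry]:
    """Return the entry with the closest nominal time, if it is within the tolerance.

    The nearest-match rule and the tolerance are assumptions of this sketch.
    """
    best = min(FUNCTION_TABLE, key=lambda e: abs(e.proximity_s - proximity_s))
    if abs(best.proximity_s - proximity_s) <= tolerance_s:
        return best
    return None
```

Keeping the recognition model and the application in one entry mirrors the text, where both are determined together from a single proximity time.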

The combinations of proximity time and the determined function related to voice recognition processing (acquisition function or application) are not limited to the above example. In the example described later with reference to the flowchart of FIG. 3, when the proximity time is less than 0.5 seconds, the acquisition function is determined to be a function of acquiring the result of voice recognition processing by the travel recognition model; otherwise, it is determined to be a function of acquiring the result of voice recognition processing by the shopping recognition model.
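The threshold rule of this second example can be written as a single comparison. The sketch below is illustrative only; the function name `select_recognition_model` and the string identifiers for the models are hypothetical.

```python
THRESHOLD_S = 0.5  # threshold from the example in the text, in seconds


def select_recognition_model(proximity_s: float) -> str:
    """Pick a recognition model name from the proximity time using one threshold."""
    # Less than 0.5 s selects the travel model; anything else selects the shopping model.
    return "travel" if proximity_s < THRESHOLD_S else "shopping"
```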

Returning to the execution unit 17, the execution unit 17 executes the function related to voice recognition processing determined by the determination unit 18, in accordance with the determination result sent from the determination unit 18. As a specific example, the case in which the music app is used is described below.

To allow the music app to be used by user operation, the execution unit 17 causes the input unit 16 to start inputting voice from the microphone 12 as the target of voice recognition processing. When the voice input by the input unit 16 is sent to the execution unit 17, the execution unit 17 transmits this voice to the voice recognition server 20. The execution unit 17 also transmits to the voice recognition server 20 information (recognition model designation information) that designates the specific recognition model (the music recognition model in this example) of the acquisition function determined by the determination unit 18.

The voice recognition server 20 executes voice recognition processing on the voice received from the execution unit 17, using the recognition model designated by the recognition model designation information received from the execution unit 17 (the music recognition model in this example). For example, the voice recognition server 20 includes a plurality of voice recognition engines, each corresponding to one of the various recognition models that may be designated by the recognition model designation information. By using a different voice recognition engine according to the recognition model designation information, the voice recognition server 20 can execute voice recognition processing with the designated recognition model (the music recognition model in this example). The result of the voice recognition processing is the same as that obtained by known voice recognition processing. For example, character string data (text data) corresponding to the voice received from the execution unit 17, and information representing the meaning of the character string (such as the part of speech of each word it contains), are obtained as the result of the voice recognition processing. The voice recognition server 20 transmits the result of the voice recognition processing to the execution unit 17.
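The engine selection described above can be sketched as a lookup keyed by the recognition model designation information. This is only an illustration under assumptions: the model identifiers, the `ENGINES` registry, and the `recognize` function are hypothetical, and the engines are trivial stand-ins that decode bytes as text rather than real speech recognizers.

```python
from typing import Callable, Dict

# Each "engine" here is a stand-in that maps audio bytes to recognized text.
Engine = Callable[[bytes], str]


def _music_engine(audio: bytes) -> str:
    return "music:" + audio.decode("utf-8")


def _travel_engine(audio: bytes) -> str:
    return "travel:" + audio.decode("utf-8")


# Hypothetical registry: one engine per recognition model that may be designated.
ENGINES: Dict[str, Engine] = {"music": _music_engine, "travel": _travel_engine}


def recognize(audio: bytes, model_id: str) -> dict:
    """Dispatch the audio to the engine named by the designation information."""
    try:
        engine = ENGINES[model_id]
    except KeyError:
        raise ValueError(f"unknown recognition model: {model_id!r}") from None
    return {"model": model_id, "text": engine(audio)}
```

Rejecting an unknown identifier explicitly is a design choice of this sketch; the source does not describe what the server does when the designated model is unavailable.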

The execution unit 17 operates the music app using the result of voice recognition processing by the specific recognition model (the music recognition model in this example) received (acquired) from the voice recognition server 20. The results of the voice recognition processing here are various commands for, in the music app, specifying a particular song, instructing playback of that song, or instructing that playback of the song be stopped. When the result of the voice recognition processing is the specification of a particular song and an instruction to play it, the execution unit 17 transmits a request for delivery of the song data to the content server 30.

The content server 30 acquires the music data in response to the request received from the execution unit 17. For example, the content server 30 includes a storage unit (not shown) that stores the various kinds of content information that may be requested by the execution unit 17. In that case, the content server 30 acquires the content information corresponding to the request from the storage unit. In this example, the content server 30 acquires the requested music data from among the various music data stored in the storage unit. The content server 30 transmits the acquired music data to the execution unit 17.

The execution unit 17 plays back the song using the music data received from the content server 30. Specifically, the execution unit 17 causes the speaker 13 to output the song.

The above is an example of the processing executed when the music app is used. When a travel app, a shopping app, a dialogue app, or the like is used, the execution unit 17 can likewise execute that app by cooperating with the input unit 16, the voice recognition server 20, and the speaker 13. By also cooperating with the display 14, the execution unit 17 can, for example, present images to the user.

FIG. 2 shows the hardware configuration of the terminal device 10. As shown in FIG. 2, the terminal device 10 is composed of hardware such as one or more CPUs (Central Processing Units) 101, a RAM (Random Access Memory) 102 and a ROM (Read Only Memory) 103 serving as main storage devices, an operation module (operation unit) 104, a communication module (communication unit) 105, and the proximity sensor 11, microphone 12, speaker 13, and display 14 described above. When the communication module 105 performs wireless communication, an antenna 107 is also added. The functions of the terminal device 10 described above with reference to FIG. 1 are realized by these components operating under the control of a program or the like.

FIG. 3 is a flowchart showing an example of the processing executed in the terminal device 10. In this example, it is assumed that no object is in proximity to the terminal device 10 when the processing starts.

In step S1, the proximity sensor of the terminal device 10 is turned on. This processing is executed by the execution unit 17, for example, in response to an instruction to start the proximity app given by a user operation on a touch panel, a button, or the like. Once the proximity sensor 11 is on, its detection result is sent to the detection unit 15. As noted above, no object is in proximity to the terminal device 10 when the processing of this flowchart starts, so the detection result of the proximity sensor 11 sent to the detection unit 15 at this point indicates the non-proximity state.

In step S2, it is determined whether a proximate object is present. The detection unit 15 makes this determination based on the detection result of the proximity sensor 11. Specifically, when the detection result of the proximity sensor 11 switches from the non-proximity state to the proximity state, the detection unit 15 determines that a proximate object is present (that detection of the proximity of an object has started). If a proximate object is present (step S2: YES), the processing proceeds to step S3. Otherwise (step S2: NO), the processing of step S2 is executed again. In that case, the processing of step S2 may be executed again after waiting for a predetermined waiting time.

In step S3, measurement of the proximity time is started. Specifically, the detection unit 15 starts measuring the proximity time, taking as the start point the timing at which detection of proximity started in step S2.

In step S4, it is determined whether the proximate object is no longer present. The detection unit 15 makes this determination based on the detection result of the proximity sensor 11. If the detection result still indicates the proximity state, the detection unit 15 determines that a proximate object is present. If the detection result has changed to the non-proximity state, the detection unit 15 determines that no proximate object is present. If no proximate object is present (detection no longer continues) (step S4: YES), the processing proceeds to step S6. Otherwise (step S4: NO), the processing proceeds to step S5.

In step S5, measurement of the proximity time continues for a predetermined period. During this period, the detection unit 15 continues the measurement described above. The length of the predetermined period may be set, for example, to the smallest unit (resolution) of the proximity time to be measured. After the processing of step S5 is completed, the processing returns to step S4.

In step S6, measurement of the proximity time ends. Specifically, the detection unit 15 ends the measurement, taking as the end point the timing at which the proximate object disappeared (detection of proximity no longer continued) in step S4. The time from the start of measurement in step S3 to the end of measurement in step S6 is thus measured as the proximity time.
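Steps S2 to S6 amount to a polling loop around the proximity sensor. The following is a minimal sketch under stated assumptions: `is_near` is a hypothetical callable that returns the current sensor state, the polling interval stands in for the "predetermined period" of step S5, and the injectable `clock` and `sleep` parameters exist only to make the sketch testable. None of these names appear in the source.

```python
import time
from typing import Callable

def measure_proximity_time(
    is_near: Callable[[], bool],
    poll_interval: float = 0.05,  # assumed resolution of the measurement
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return how long an object stayed in proximity, in seconds."""
    # S2: repeat until the sensor reports the proximity state.
    while not is_near():
        sleep(poll_interval)
    # S3: start measuring when proximity is first detected.
    start = clock()
    # S4/S5: keep measuring while the object remains in proximity.
    while is_near():
        sleep(poll_interval)
    # S6: the object is gone; the elapsed time is the proximity time.
    return clock() - start
```

Because the sensor is sampled once per interval, the measured value is accurate only to within roughly one polling interval, which matches the description of the predetermined period as the resolution of the proximity time.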

In step S7, it is determined whether the proximity time is less than a predetermined threshold. In this example, the threshold is 0.5 seconds. Specifically, the determination unit 18 determines whether the measured proximity time is less than the threshold. If the proximity time is less than the threshold (step S7: YES), the processing proceeds to step S8. Otherwise (step S7: NO), the processing proceeds to step S9.

In step S8, the recognition method (recognition model) of the voice recognition processing is determined to be the one for travel (that is, the travel recognition model). Specifically, the determination unit 18 determines the function of acquiring the result of voice recognition processing by a specific recognition method (recognition model) (the acquisition function) to be the function of acquiring the result of voice recognition processing by the travel recognition model. The determination unit 18 also determines the function that uses the result of the voice recognition processing (the application) to be the travel app. After the processing of step S8 is completed, the processing proceeds to step S10.

In step S9, the recognition method (recognition model) of the voice recognition processing is determined to be the one for shopping (that is, the shopping recognition model). Specifically, the determination unit 18 determines the acquisition function to be the function of acquiring the result of voice recognition processing by the shopping recognition model. The determination unit 18 also determines the application to be the shopping app. After the processing of step S9 is completed, the processing proceeds to step S10.
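The branch in steps S7 to S9 can be expressed as a single comparison. The sketch below uses the 0.5 second threshold given in this example; the `Selection` type and the string identifiers for the recognition models and apps are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple

THRESHOLD_SECONDS = 0.5  # threshold used in this example of the embodiment

class Selection(NamedTuple):
    recognition_model: str  # which acquisition function to use
    application: str        # which application uses the result

def decide_function(proximity_time: float,
                    threshold: float = THRESHOLD_SECONDS) -> Selection:
    """S7: compare the proximity time with the threshold."""
    if proximity_time < threshold:
        # S8: short proximity selects the travel model and travel app.
        return Selection("travel", "travel_app")
    # S9: otherwise select the shopping model and shopping app.
    return Selection("shopping", "shopping_app")
```

A proximity time exactly equal to the threshold falls on the S9 side, since step S7 tests for "less than".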

In step S10, voice recognition is started. Specifically, the execution unit 17 causes the input unit 16 to start inputting voice from the microphone 12 as the target of voice recognition processing. The execution unit 17 then starts executing the function related to voice recognition processing that was determined by the determination unit 18 in step S8 or S9. As described above, the execution unit 17 cooperates with the voice recognition server 20, the content server 30, the microphone 12, the speaker 13, the display 14, and the like to execute voice recognition processing with the determined recognition model (the travel recognition model or the shopping recognition model in this example) and acquire its result. The execution unit 17 also uses that result to execute the determined application (the travel app or the shopping app in this example).

According to the terminal device 10 described above, the function related to voice recognition processing to be executed is determined based on the proximity time of an object (steps S7 to S9), and the determined function is executed on the input voice (step S10). The user of the terminal device 10 can therefore easily select a function related to voice recognition simply by bringing an object close to the terminal device 10 for a given time, for example by holding a hand over the terminal device 10.

As a function related to voice recognition processing, the function of acquiring the result of voice recognition processing by a recognition method corresponding to the proximity time (a specific recognition model determined according to the proximity time), that is, the acquisition function, is determined (steps S8 and S9). Specifically, the acquisition function is determined to be either the function of acquiring the result of voice recognition processing by the travel recognition model (step S8) or the function of acquiring the result of voice recognition processing by the shopping recognition model (step S9). This allows the user of the terminal device 10 to easily select which recognition model is used to obtain the result of voice recognition processing.

In addition, as a function related to voice recognition processing, a function that uses the result of voice recognition processing (an application), such as the travel app or the shopping app, is determined (steps S8 and S10). This allows the user of the terminal device 10 to easily select which application to use.

One embodiment of the present invention has been described above, but the present invention is not limited to that embodiment.

In the above embodiment, an example was described in which the determination unit 18 determines, based on the proximity time, both the function of acquiring the result of voice recognition processing by a specific recognition method (recognition model) (the acquisition function) and the function that uses the result of voice recognition processing (the application). However, only one of the acquisition function and the application may be determined based on the proximity time, with the other fixed to a predetermined recognition model or application.

In the above embodiment, two kinds of functions related to voice recognition processing are determined based on the proximity time, but three or more kinds of functions may be determined in this way. As one example, when the proximity time is less than a first threshold, the function of acquiring the result of voice recognition processing by a specific recognition model (the acquisition function) may be determined to be a first acquisition function (for example, the function of acquiring the result of voice recognition processing by the travel recognition model); when the proximity time is equal to or greater than the first threshold and less than a second threshold, it may be determined to be a second acquisition function (for example, the function of acquiring the result of voice recognition processing by the shopping recognition model); and otherwise (when the proximity time is equal to or greater than the second threshold), it may be determined to be a third acquisition function (for example, the function of acquiring the result of voice recognition processing by the music recognition model or the dialogue recognition model). The function that uses the result of voice recognition processing (the travel app, shopping app, dialogue app, music app, and so on) can be determined in the same way.
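The multi-threshold variation can be sketched with a sorted list of thresholds and a binary search. The concrete threshold values (0.5 and 1.5 seconds) and the function identifiers below are assumptions chosen for illustration; the source does not give values for this variation.

```python
from bisect import bisect_right
from typing import Sequence

def select_acquisition_function(
    proximity_time: float,
    thresholds: Sequence[float] = (0.5, 1.5),  # assumed, must be ascending
    functions: Sequence[str] = ("travel", "shopping", "music"),  # assumed
) -> str:
    """Pick the function whose time band contains the proximity time."""
    if len(functions) != len(thresholds) + 1:
        raise ValueError("need exactly one more function than thresholds")
    # bisect_right counts the thresholds that are <= proximity_time, so a
    # value equal to a threshold falls into the upper band, matching the
    # "equal to or greater than" wording of the text.
    return functions[bisect_right(thresholds, proximity_time)]
```

With N thresholds this selects among N + 1 functions, so the same helper covers the two-function embodiment (one threshold) as well.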

In the above embodiment, the function related to voice recognition processing is determined by a threshold test, namely whether the proximity time is less than a threshold, but other criteria may be used. As one example, the function related to voice recognition processing may be determined based on whether the proximity time falls within a set range.

In the above embodiment, the terminal device 10 uses a single voice recognition server 20 that includes a plurality of voice recognition engines, and the recognition method (recognition model) of the voice recognition processing is switched by changing the voice recognition engine according to the proximity time. Alternatively, the recognition model may be switched by having the terminal device 10 select, according to the proximity time, one of a plurality of voice recognition servers that each include a different voice recognition engine. Similarly, a plurality of content servers whose storage units store different kinds of content information may be used selectively according to the proximity time.

In the above embodiment, when a function related to voice recognition processing is used, the voice recognition processing and the processing for acquiring content are executed by servers provided outside the terminal device 10, namely the voice recognition server 20 and the content server 30. However, at least part of the processing executed by the voice recognition server 20 and the content server 30 may be made executable within the terminal device 10. In that case, the execution unit 17 of the terminal device 10 may be configured to be able to execute at least some of the functions of the voice recognition server 20 and the content server 30.

In the terminal device 10, as described above with reference to FIG. 3, input of voice by the input unit 16 does not start until the function related to voice recognition processing has been determined according to the proximity time (that is, before step S10 in FIG. 3). Until then, the processing performed by the input unit 16 for voice recognition, such as the filtering processing that removes noise from the input voice and the storage processing that temporarily stores the input voice information, is not executed (the operation of the input unit 16 is off), so power consumption can be reduced accordingly.

Some apps (such as the music app and the dialogue app) may not require image output on the display 14. While such an app is in use, control may be performed so that the display 14 of the terminal device 10 shows nothing (is turned off). This control may be performed by the execution unit 17. Using an app while the display 14 is off in this way reduces the power consumption of the terminal device 10 by the amount the display 14 would consume, compared with using the app while the display 14 is on.

When content such as music or video is already being played back on the terminal device 10 and the detection unit 15 detects the proximity of an object and obtains the proximity time, playback of that content may be interrupted. Specifically, if a proximate object is detected in step S2 of FIG. 3 while content is being played back (step S2: YES), and the processing then proceeds through the subsequent steps until step S8 or step S9 is completed, playback of the content may be interrupted at that point before the processing of step S10 is executed. This allows the user to newly select and use a function related to voice recognition processing in preference to the function already being executed (content playback in this example).
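The ordering described here, interrupting playback after step S8 or S9 and before step S10, can be sketched as follows. The `Player` class and `start_voice_function` helper are hypothetical names used only to illustrate the ordering; the source does not describe how playback control is implemented.

```python
from typing import Callable

class Player:
    """Hypothetical minimal content player that tracks its playing state."""

    def __init__(self) -> None:
        self.playing = False

    def play(self) -> None:
        self.playing = True

    def pause(self) -> None:
        self.playing = False

def start_voice_function(player: Player,
                         start_recognition: Callable[[], None]) -> bool:
    """Interrupt playback if needed, then start step S10.

    Returns True if playback was interrupted.
    """
    interrupted = player.playing
    if interrupted:
        # The function has been decided (S8 or S9), so stop the content
        # before voice input begins.
        player.pause()
    start_recognition()  # S10: start voice input and recognition
    return interrupted
```

Returning whether playback was interrupted would let a caller resume the content afterwards, although the source does not say whether playback resumes.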

The terminal device described above can be used even in situations where the user cannot operate the terminal device by touching it, for example because the user's hands are wet. More preferably, the terminal device 10 is configured so that a function related to voice processing can be selected and executed even while the screen of the display 14 is locked (a state in which data input and screen output are restricted). This allows the user to select and use a function related to voice recognition processing without performing an operation to release the screen lock (such as touching the terminal device to enter an unlock code).

Description of reference signs: 10: terminal device, 11: proximity sensor, 12: microphone, 13: speaker, 14: display, 15: detection unit, 16: input unit, 17: execution unit, 18: determination unit, 20: voice recognition server, 30: content server.

Claims

A terminal device comprising:
detection means for detecting the proximity of an object to the terminal device and measuring the proximity time of the detected object;
input means for inputting voice;
execution means for executing a function related to voice recognition processing on the voice input by the input means; and
determination means for determining, based on the proximity time measured by the detection means, the function related to voice recognition processing to be executed by the execution means.

The terminal device according to claim 1, wherein the determination means determines, as the function related to voice recognition processing, a function of acquiring a result of voice recognition processing by a recognition method corresponding to the proximity time.

The terminal device according to claim 1, wherein the determination means determines, as the function related to voice recognition processing, a function that uses a result of voice recognition processing.