JP2012215673A

JP2012215673A - Speech processing device and speech processing method

Info

Publication number: JP2012215673A
Application number: JP2011080026A
Authority: JP
Inventors: Chikashi Sugiura; 千加志杉浦; Koji Fujimura; 浩司藤村; Akinori Kawamura; 聡典河村; Takashi Sudo; 隆須藤
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2011-03-31
Filing date: 2011-03-31
Publication date: 2012-11-08

Abstract

PROBLEM TO BE SOLVED: To reduce an operation burden. SOLUTION: A speech processing device includes speech conversion means, utterance determination means, processing determination means, and execution means. The speech conversion means converts input speech into character string information indicating the content of the speech. The utterance determination means analyzes the input speech to determine the style of the user's utterance at the time the speech was uttered. The processing determination means determines the processing to be executed on the basis of the utterance style. The execution means uses the character string information to execute the processing determined by the processing determination means.

Description

Embodiments of the present invention relate to a speech processing device and a speech processing method.

Computers, including mobile devices carried by users, have tended toward ever higher performance. For a user to use a computer comfortably, the interface is important. In recent years, techniques have therefore been proposed in which various sensors are built into a computer and the detection results of these sensors are used as user operations and the like.

For example, a technique has been proposed in which a computer applies speech recognition processing to a user's speech and performs processing in accordance with the resulting voice command.

Japanese National Patent Publication No. 11-506845

However, users speak in many different styles. The prior art includes techniques that, for example, improve speech recognition by restricting the utterance style, but it does not consider using differences in utterance style as a user interface.

The present invention has been made in view of the above, and an object thereof is to provide a speech processing device and a speech processing method that vary the processing to be executed based on differences in the style of a user's utterance.

The speech processing device according to the embodiment includes speech conversion means, utterance determination means, processing determination means, and execution means. The speech conversion means converts input speech into character string information indicating the content uttered in the speech. The utterance determination means analyzes the input speech and determines the style of the user's utterance at the time the speech was uttered. The processing determination means determines the processing to be executed based on the utterance style. The execution means executes the processing determined by the processing determination means, using the character string information.

FIG. 1 is a diagram schematically illustrating the appearance of an information processing apparatus according to the embodiment.
FIG. 2 is a block diagram illustrating an example of the configuration of the information processing apparatus according to the embodiment.
FIG. 3 is a diagram illustrating the functional configuration realized by a program executed on the information processing apparatus according to the embodiment.
FIG. 4 is a diagram illustrating an example of the processing performed by the information processing apparatus according to the embodiment when the recognized utterance style has normal intonation.
FIG. 5 is a diagram illustrating an example of the processing performed by the information processing apparatus according to the embodiment when the recognized utterance style has no intonation.
FIG. 6 is a flowchart illustrating the overall processing procedure in the information processing apparatus according to the embodiment.

As an embodiment, an example in which the speech processing device is applied to an information processing apparatus will be described. FIG. 1 is a diagram schematically illustrating the appearance of the information processing apparatus according to the present embodiment. The information processing apparatus 100 has a display screen and is realized as, for example, a slate terminal (tablet terminal), an electronic book reader, or a digital photo frame.

The information processing apparatus 100 includes a thin box-shaped housing B, and a display unit 111 is disposed on the upper surface of the housing B. The display unit 111 includes a touch sensor 112 for detecting the position on the display screen touched by the user.

The information processing apparatus 100 also includes, on the upper surface of the housing B, a microphone 113 for picking up sound from the external environment. Analog sound input from the microphone 113 is converted into an audio signal by internal processing. In addition, various button switches 114 are arranged on the upper surface of the housing B. Various operations can be performed by pressing these button switches 114.

FIG. 2 is a block diagram illustrating an example of the configuration of the information processing apparatus 100 according to the embodiment. As shown in FIG. 2, in addition to the display unit 111, the touch sensor 112, the microphone 113, and the button switches 114 described above, the information processing apparatus 100 includes a CPU 210 (Central Processing Unit), a ROM 211 (Read Only Memory), a RAM 212 (Random Access Memory), a storage unit 213, a timing unit 214, a gyro sensor 215, and a communication I/F 216.

The CPU 210 centrally controls the operation of the information processing apparatus 100. Specifically, the CPU 210 reads various programs stored in the ROM 211 and the storage unit 213, loads them into the work area of the RAM 212, and executes them sequentially, thereby outputting control signals to each unit of the information processing apparatus 100 connected via a bus line. The ROM 211 stores various programs and setting data. The RAM 212 provides the work area for the CPU 210. The storage unit 213 is a large-capacity storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), and stores, in a readable and writable manner, application programs executed by the CPU 210 and various data such as text, moving images, still images, and audio. The timing unit 214 has an RTC (Real Time Clock) function and a function for synchronizing the time over a network, and performs time synchronization and timekeeping. The time synchronized and kept by the timing unit 214 is reported to the CPU 210.

The gyro sensor 215 (gyroscope) is, for example, a vibration-type angular velocity sensor using MEMS technology. It detects the attitude of the information processing apparatus 100 along the three X, Y, and Z axes and outputs the detection result to the CPU 210. The communication I/F 216 is an interface that performs wired or wireless communication according to a predetermined communication protocol under the control of the CPU 210. For example, the communication I/F 216 performs wireless LAN communication via a router or the like under the control of the CPU 210.

The functional configuration realized when the CPU 210 sequentially executes the speech processing program will now be described. FIG. 3 is a diagram illustrating the functional configuration realized by a program executed on the information processing apparatus 100. As shown in FIG. 3, the information processing apparatus 100 includes a speech recognition unit 301, an utterance style determination unit 302, an operation target extraction unit 303, a process determination unit 304, a process execution unit 305, an input unit 306, and an output unit 307. These functional components use an operation target keyword list storage unit 308 and a correspondence list storage unit 309, both of which are provided in the storage unit 213 of FIG. 2.

The speech recognition unit 301 converts the speech input to the microphone 113 into text data (character string information) that expresses, in natural language, the content uttered in the speech.

The utterance style determination unit 302 applies signal processing analysis to the speech input to the microphone 113 and determines the style of the user's utterance at the time the speech was uttered. Using a predetermined sound as a reference, the utterance style determination unit 302 determines which of the following the input speech is: a voice without intonation, a voice with strong intonation, a high voice, a low voice, a hoarse voice, or a whisper. One utterance style or more than one may be determined.
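The source does not say how the style is computed, so the following is only a minimal sketch of one way such a judgment could be made from a per-frame pitch track. The function name `classify_style`, the label strings, the reference pitch, and all thresholds are assumptions made for illustration, and treating an utterance with no voiced frames as a whisper is a simplification that is not taken from the source.

```python
import math
from statistics import mean, pstdev


def hz_to_semitones(f0_hz, ref_hz):
    """Express a pitch in semitones relative to the reference sound."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def classify_style(f0_track_hz, ref_hz=150.0, flat_sd=1.0, strong_sd=4.0,
                   high_offset=5.0, low_offset=-5.0):
    """Return a set of style labels for one utterance (hypothetical thresholds).

    f0_track_hz holds one pitch estimate per frame in Hz;
    0 or None marks an unvoiced frame.
    """
    voiced = [f for f in f0_track_hz if f]
    if not voiced:
        # Assumption: no voiced frames at all is treated as a whisper.
        return {"whisper"}

    semitones = [hz_to_semitones(f, ref_hz) for f in voiced]
    labels = set()

    # Spread of the pitch contour decides flat / normal / strong intonation.
    spread = pstdev(semitones)
    if spread < flat_sd:
        labels.add("flat")
    elif spread > strong_sd:
        labels.add("strong_intonation")
    else:
        labels.add("normal_intonation")

    # Average pitch relative to the reference decides high / low voice.
    centre = mean(semitones)
    if centre > high_offset:
        labels.add("high")
    elif centre < low_offset:
        labels.add("low")
    return labels
```

Returning a set reflects the statement that more than one style may be determined for a single utterance, for example a flat voice that is also high.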

The operation target keyword list storage unit 308 stores keywords that the process execution unit 305 uses as processing targets. The keywords may be, for example, the titles of songs in the music data stored in the storage unit 213, or the names of performers.

The operation target extraction unit 303 extracts, from the text data converted by the speech recognition unit 301, keywords (identification information) that are the targets of the processing executed by the process execution unit 305. Such keywords may be, for example, search keywords such as the title of a song in the music data or the name of a performer. They may also be file names or metadata stored in the information processing apparatus 100. Keywords such as these search keywords, file names, or metadata are determined to be the targets of processing by the process execution unit 305. When a keyword is extracted from the text data converted by the speech recognition unit 301, it need not be identical to a stored file name or the like; it is sufficient that it is similar to at least a predetermined degree.
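As an illustration of matching that tolerates small differences between the recognized text and the stored keywords, here is a minimal sketch using Python's standard `difflib`. The function name `extract_keywords`, the whitespace tokenization, and the 0.8 similarity threshold are assumptions for this example; the source specifies neither the similarity measure nor the threshold.

```python
from difflib import SequenceMatcher


def extract_keywords(text, keyword_list, threshold=0.8):
    """Return the stored keywords that approximately occur in the text.

    A keyword matches when some run of consecutive tokens in the text has a
    similarity ratio of at least `threshold` (a hypothetical value).
    """
    tokens = text.replace(",", " ").split()
    found = []
    for keyword in keyword_list:
        width = len(keyword.split())
        # Slide a window of the keyword's token length across the text.
        for start in range(len(tokens) - width + 1):
            candidate = " ".join(tokens[start:start + width])
            ratio = SequenceMatcher(None, candidate.lower(),
                                    keyword.lower()).ratio()
            if ratio >= threshold:
                found.append(keyword)
                break
    return found
```

With this sketch, a slightly misrecognized title such as "yesterdy" still maps to the stored keyword "Yesterday", while unrelated stored keywords are not returned.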

The operation target extraction unit 303 may also extract, as a keyword to be processed, the portion of the converted text data that was not determined to be a processing target, or that portion with its leading and trailing particles removed.

The correspondence list storage unit 309 stores user utterance styles in association with the processing to be executed. What is associated with the processing to be executed is not limited to the user's utterance style alone. For example, the correspondence list storage unit 309 may associate a search keyword extracted by the operation target extraction unit 303 with the processing to be executed. As another example, it may associate a search keyword, a user utterance style, and the processing to be executed with one another. The correspondences stored in the correspondence list storage unit 309 may be registered in advance when the information processing apparatus 100 is shipped, or may be registered freely by the user afterwards.

The process determination unit 304 determines, based on the utterance style determined by the utterance style determination unit 302, the processing that uses the identification information extracted from the converted text data. The process determination unit 304 according to the present embodiment makes this determination based on the correspondences stored in the correspondence list storage unit 309. For example, the process determination unit 304 determines the processing that is associated with the user's utterance style in the correspondence list storage unit 309 to be the processing executed by the process execution unit 305.

Rather than using only the utterance style as the criterion, the process determination unit 304 may also determine the processing that is associated, in the correspondence list storage unit 309, with the keyword extracted by the operation target extraction unit 303 to be the processing to be executed.

Furthermore, the process determination unit 304 according to the present embodiment may determine the processing that is associated, in the correspondence list storage unit 309, with both the utterance style determined by the utterance style determination unit 302 and the keyword extracted by the operation target extraction unit 303 to be the processing executed by the process execution unit 305.
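The three kinds of association (style only, keyword only, style plus keyword) can be pictured as one lookup table. The sketch below assumes a dictionary keyed by `(style, keyword)` pairs in which `None` means "any"; the table contents, the process names, and the lookup order (combined entry first, then style, then keyword) are assumptions chosen for illustration, not details given in the source.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical correspondence list: (utterance style, keyword) -> process.
CORRESPONDENCE: Dict[Tuple[Optional[str], Optional[str]], str] = {
    ("flat", "play"): "music_player",
    ("flat", None): "music_player",
    ("normal_intonation", None): "web_search",
    ("whisper", None): "video_site_search",
}


def determine_process(style: str, keywords: List[str],
                      table: Dict[Tuple[Optional[str], Optional[str]], str]
                      = CORRESPONDENCE) -> Optional[str]:
    """Pick the process to execute; return None when nothing is registered."""
    # 1. Entries registered for this style together with a keyword.
    for keyword in keywords:
        if (style, keyword) in table:
            return table[(style, keyword)]
    # 2. Entries registered for the style alone.
    if (style, None) in table:
        return table[(style, None)]
    # 3. Entries registered for a keyword alone.
    for keyword in keywords:
        if (None, keyword) in table:
            return table[(None, keyword)]
    return None
```

Under this assumed ordering, `determine_process("flat", ["play"])` selects the music player, while the same keyword spoken with normal intonation falls through to the style-only entry and selects Web search.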

As the determined processing, the process determination unit 304 according to the present embodiment may identify the application that performs the processing. In this case, the process execution unit 305 controls the launch of the identified application and requests the application to execute the processing.

The process execution unit 305 executes the processing determined by the process determination unit 304, using the keyword extracted by the operation target extraction unit 303. The information used as the processing target is not limited to the keyword extracted by the operation target extraction unit 303; it may be, for example, information based on the text data converted by the speech recognition unit 301.

The process execution unit 305 according to the present embodiment launches the application identified by the process determination unit 304, requests the application to execute the processing, and passes the keyword to the application.

The process execution unit 305 according to the present embodiment may instead merely display, on the display unit 111, the processing determined by the process determination unit 304 and the keyword extracted by the operation target extraction unit 303. It may further use this display to prompt the user for permission to execute the processing.

The input unit 306 outputs input information received from the communication I/F 216, the gyro sensor 215, and the like to the process execution unit 305. The output unit 307 may display the information output from the process execution unit 305 on the display unit 111, or may transmit it to an external device via the communication I/F 216.

The processing performed by the information processing apparatus 100 configured as described above will now be described. Suppose, for example, that the user utters "○○×× (singer), Kyoku 1 (song title)". In this case, the information processing apparatus 100 takes in the utterance through the microphone 113 and performs processing based on the utterance. One example of such processing is music playback.

Information processing apparatuses that automatically execute a database search using text obtained by speech recognition have existed for some time. That is, there are techniques in which, when the user utters several words, the content of the utterance is recognized and converted into text data, and a search is performed using that text data as the search keywords. In such conventional techniques, the processing performed was fixed to Web search, so there was no flexibility in the processing. In the example above, the apparatus would merely display the results of a search using "○○×× (singer), Kyoku 1 (song title)" as the search keywords.

The information processing apparatus 100 according to the present embodiment therefore recognizes the utterance style when generating text by speech recognition, and switches the processing based on the recognized utterance style.

FIG. 4 is a diagram illustrating an example of the processing performed by the information processing apparatus 100 when the recognized utterance style has normal intonation. As shown in FIG. 4, suppose that, based on the speech input through the microphone 113 of the information processing apparatus 100, the speech recognition unit 301 produces the text data "○○×× (singer) Kyoku 1 (song title)" and the utterance style determination unit 302 determines from the speech that the utterance style has normal intonation (rising and falling, as indicated by the arrows). In this case, the process determination unit 304 determines Web search, which is associated with normal intonation in the correspondence list storage unit 309, to be the processing to be executed. The process execution unit 305 then launches a Web browser. The Web browser displays a search screen 401 in which "○○×× (singer)" and "Kyoku 1 (song title)", extracted by the operation target extraction unit 303, are set as the search keywords. After that, when the touch sensor 112 of the information processing apparatus 100 accepts a press of the start button 402, the Web browser starts the search.

FIG. 5 is a diagram illustrating an example of the processing performed by the information processing apparatus 100 when the recognized utterance style has no intonation. As shown in FIG. 5, suppose that, based on the speech input through the microphone 113 of the information processing apparatus 100, the speech recognition unit 301 produces the text data "○○×× (singer) Kyoku 2 (song title)" and the utterance style determination unit 302 determines from the speech that the utterance style has no intonation (flat, as indicated by the arrow). In this case, the process determination unit 304 determines playback of a song by a music application (music player), which is associated with the absence of intonation in the correspondence list storage unit 309, to be the processing to be executed. The process execution unit 305 then launches the music application. On the display screen 501 of the music application, the music data based on "○○×× (singer)" and "Kyoku 2 (song title)", extracted by the operation target extraction unit 303, is displayed in a selected state. The music application then automatically plays the selected music data.

As another example, suppose that, based on the speech input through the microphone 113 of the information processing apparatus 100, the speech recognition unit 301 produces the text data "○○×× (singer) Kyoku 3 (song title)" and the utterance style determination unit 302 determines from the speech that it is a whisper. Whether the speech is a whisper can be determined using, for example, NAM (Non-Audible Murmur) technology.

The process determination unit 304 then determines search and playback on a video posting site, which is associated with a whisper in the correspondence list storage unit 309, to be the processing to be executed. The process execution unit 305 launches the Web browser and, when doing so, sets the URL of the video posting site as the connection destination. The process execution unit 305 then passes "○○×× (singer)" and "Kyoku 3 (song title)", extracted by the operation target extraction unit 303, as search keywords to the video posting site displayed in the Web browser. In this way, the content of an utterance spoken in a whisper can be used as search keywords for search and playback on the video posting site.

The information processing apparatus 100 according to the present embodiment is not limited to varying the processing according to the utterance style; the process execution unit 305 may also vary the processing according to the keyword extracted by the operation target extraction unit 303. For example, when "うえーーー" (the Japanese word "ue", meaning "up", with its final vowel prolonged) is recognized as a keyword, the process execution unit 305 starts scrolling up with "ue" as the trigger, and keeps scrolling up for as long as the final "e" continues to be prolonged.

As another example, when "したーーー" (the Japanese word "shita", meaning "down", with its final vowel prolonged) is recognized, the process execution unit 305 starts scrolling down with "shita" as the trigger, and keeps scrolling down for as long as the final "ta" continues to be prolonged. Detection of a prolonged utterance is assumed to be possible using an HMM (hidden Markov model), and its description is omitted. These processes may be combined with the utterance styles described above.

Command input by utterance style in this way can be applied to any language. In English, for example, when "upperrrrrrr" is recognized, the process execution unit 305 starts scrolling up with "up" as the trigger, and keeps scrolling up for as long as the final "r" continues to be prolonged. Scrolling down works in the same way: the downward scroll continues while the final "r" of "lowerrrr" is prolonged.
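The source leaves prolongation detection to an HMM and omits the details. As a rough illustration of the idea only, the sketch below assumes that some recognizer already supplies one sound label per audio frame, and counts how many scroll steps a prolonged final sound would produce. The function name `scroll_steps`, the `"sil"` silence label, and the `min_tail_frames` threshold are all hypothetical, and the sketch works on a finished utterance rather than scrolling live while the sound continues.

```python
from itertools import groupby


def scroll_steps(frame_labels, trigger, min_tail_frames=3):
    """Count scroll steps for one utterance given per-frame sound labels.

    frame_labels: one label per frame, e.g. ["u", "p", "e", "r", "r", "r"].
    trigger: the leading sounds that start the scroll, e.g. ("u", "p").
    A step is produced for every frame of the final sound from the
    min_tail_frames-th frame onward (a hypothetical rule).
    """
    # Run-length encode the frames, ignoring silence.
    runs = [(label, len(list(group)))
            for label, group in groupby(frame_labels) if label != "sil"]
    symbols = [label for label, _ in runs]

    # The utterance must begin with the trigger sounds.
    if symbols[:len(trigger)] != list(trigger):
        return 0

    tail_length = runs[-1][1]
    return max(0, tail_length - min_tail_frames + 1)
```

For instance, an "upper" whose final "r" lasts five frames yields three scroll steps under these assumed settings, while an "upper" spoken without prolongation yields none.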

Next, the overall processing in the information processing apparatus 100 according to the present embodiment will be described. FIG. 6 is a flowchart showing the procedure of the above-described processing in the information processing apparatus 100 according to the present embodiment. It is assumed that, when the processing shown in FIG. 6 is performed, the information processing apparatus 100 has already completed its preparations for speech recognition.

First, the microphone 113 of the information processing apparatus 100 takes in the user's utterance as an audio signal (step S601). Next, the speech recognition unit 301 performs speech recognition on the input audio signal and generates text data containing the recognition result (hereinafter, recognition result text data) (step S602).

Meanwhile, the utterance style determination unit 302 determines the user's utterance style from the audio signal (step S603).

The operation target extraction unit 303 then extracts keywords and the like to be processed from the recognition result text data generated in step S602 (step S604).

The process determination unit 304 identifies the processing to be executed from the utterance style (step S605). In doing so, it identifies the application that performs the processing. If the application to be launched is a Web browser, the connection destination URL may also be identified.

After that, the process execution unit 305 launches the application corresponding to the identified processing (step S606). The process execution unit 305 then executes the processing on the launched application using the extracted keywords (step S607), and displays the processing result on the display unit 111 (step S608).

Through the processing described above, processing is performed in accordance with the utterance style, which reduces the operation burden on the user when executing processing. The procedure is not limited to the one described above, and the order of the steps may be changed. For example, step S603 may be executed before step S602, or steps S602 and S603 may be executed simultaneously.
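The flow of steps S602 to S608, including the option of running S602 and S603 at the same time, can be sketched as follows. This is an illustrative outline under stated assumptions, not the apparatus's actual implementation: `run_pipeline` and its callable parameters are hypothetical stand-ins for units 301 to 305, and a thread pool is just one possible way to run the two analyses concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(audio, recognize, judge_style, extract, determine, execute):
    """Run the flow on an already captured audio signal (S601 done upstream).

    Each callable is a hypothetical stand-in for one functional unit.
    """
    # S602 and S603 only depend on the audio, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(recognize, audio)     # S602: speech -> text
        style_future = pool.submit(judge_style, audio)  # S603: utterance style
        text = text_future.result()
        style = style_future.result()

    keywords = extract(text)               # S604: keywords from the text
    process = determine(style, keywords)   # S605: style -> process
    return execute(process, keywords)      # S606-S608: launch, run, display
```

A usage example with trivial stand-ins:

```python
result = run_pipeline(
    b"pcm-bytes",
    recognize=lambda audio: "artist song",
    judge_style=lambda audio: "flat",
    extract=lambda text: text.split(),
    determine=lambda style, kws: "music_player" if style == "flat" else "web_search",
    execute=lambda process, kws: (process, kws),
)
```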

The information processing apparatus 100 according to the present embodiment is not limited to the processing described above, and various forms are conceivable. For example, when the utterance style is a predetermined style, processing such as saving the uttered content as a text memo may be performed.

The information processing apparatus 100 may also switch processing based on the utterance style after an application has been launched. For example, when the text of a blog post is entered by voice, the size, color, or font of the entered characters could be changed according to the utterance style.

Any processing may be executed by the information processing apparatus 100 according to the user's utterance style. Examples include Web search, music playback, music search, browsing favorites, blog writing, reading e-mail, viewing a video posting site, reading an electronic book, application search, launching the camera, taking a photograph, and making a telephone call over an Internet connection.

(Modification of the first embodiment)
A keyword extracted from the text data of the speech recognition result may include a voice command. In this modification, the operation target extraction unit 303 of the information processing apparatus 100 therefore extracts a voice command from the text data. For example, to play the song "Song A" by voice command, the user utters "Play Song A" in a voice without inflection. In this case, it is assumed that the keyword "play" and a voice without inflection are associated with the music application (music player) in the correspondence list storage unit 309. If a song named "Song A" exists in the storage unit 213, Song A can then be played by the music player.

On the other hand, utterances with inflection are associated in advance with Web search in the correspondence list storage unit 309. As a result, when "Play Song A" is uttered with inflection, the information processing apparatus 100 according to the present embodiment gives priority to Web search even though the utterance contains a voice command.
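The priority rule described above can be sketched as a lookup in a correspondence list. This is a minimal sketch under stated assumptions: the style labels "flat" and "inflected", the process names, and the table layout are hypothetical and are not the actual contents of the correspondence list storage unit 309.

```python
from typing import Dict, Optional, Tuple

# Hypothetical correspondence list: (utterance style, command keyword) -> process.
# A keyword of None means the style alone decides the process.
CORRESPONDENCE_LIST: Dict[Tuple[str, Optional[str]], str] = {
    ("flat", "play"): "music_player",
    ("inflected", None): "web_search",
}

# Hypothetical set of words treated as voice commands.
COMMAND_KEYWORDS = ("play",)


def determine_process(style: str, text: str) -> Optional[str]:
    """Return the process for an utterance, or None if nothing matches."""
    # A style-only entry takes priority, so an inflected utterance goes to
    # Web search even when the text contains a voice command.
    if (style, None) in CORRESPONDENCE_LIST:
        return CORRESPONDENCE_LIST[(style, None)]
    words = text.lower().split()
    for keyword in COMMAND_KEYWORDS:
        if keyword in words and (style, keyword) in CORRESPONDENCE_LIST:
            return CORRESPONDENCE_LIST[(style, keyword)]
    return None
```

With this table, `determine_process("flat", "Play Song A")` selects the music player, while the same words spoken with inflection select Web search.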

The utterance styles determined by the utterance style determination unit 302 according to the present embodiment are not limited to the voices described above, and other utterance styles may be used as determination criteria. For example, an utterance in which the voice pitch keeps rising, or one in which the voice pitch keeps falling, may be used as a criterion.
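One simple way to detect a pitch that keeps rising or keeps falling is to inspect the frame-to-frame change of an already extracted pitch contour. The sketch below is an assumption made for illustration, not the determination method of the utterance style determination unit 302; the function name, the labels, and the `tolerance_hz` parameter are hypothetical.

```python
from typing import List


def classify_pitch_trend(pitches_hz: List[float], tolerance_hz: float = 1.0) -> str:
    """Classify a per-frame pitch contour as 'rising', 'falling', or 'other'.

    tolerance_hz is an assumed allowance for small frame-to-frame jitter.
    """
    if len(pitches_hz) < 2:
        return "other"
    steps = [b - a for a, b in zip(pitches_hz, pitches_hz[1:])]
    net_change = pitches_hz[-1] - pitches_hz[0]
    # 'Keeps rising': no step drops by more than the tolerance, and the
    # contour ends clearly higher than it started.
    if all(s >= -tolerance_hz for s in steps) and net_change > tolerance_hz:
        return "rising"
    # 'Keeps falling': the mirror-image condition.
    if all(s <= tolerance_hz for s in steps) and net_change < -tolerance_hz:
        return "falling"
    return "other"
```

The resulting label could then serve as one more utterance style key when looking up the process to execute.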

In this way, the intent of a voice command is determined not only from the content of the utterance but also from the manner of utterance, so the command intended by the user can be executed more appropriately.

In this modification, the utterance style is combined with a voice command based on the user's utterance. In a conventional information processing apparatus, the process to be executed was selected by tracing a hierarchy from the top (for example, [Menu] -> [Player] -> [Playback selection] -> [Playback]). In this modification, the process to be executed can be identified from the combination of the voice command and the utterance style, so the accuracy of identifying the process to be executed can be improved compared with the prior art.

In the present embodiment and the modification, the user can specify the process to be executed by the information processing apparatus 100 simply by changing the utterance style. This reduces the likelihood that the information processing apparatus 100 performs a process the user did not intend, and makes voice command processing more convenient.

Furthermore, when the user places the information processing apparatus 100 in a wet location (for example, a kitchen) and specifies a process for the apparatus to execute, the process can be switched by the utterance style alone, which improves operability.

The speech processing program executed by the information processing apparatus of the present embodiment and the modification is provided as a file in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk).

The speech processing program executed by the information processing apparatus of the present embodiment may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The program may likewise be provided or distributed via a network such as the Internet.

The present invention is not limited to the above embodiment as it is; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be removed from the full set shown in the embodiment. Constituent elements from different embodiments may also be combined as appropriate.

Description of Symbols
100: information processing apparatus, 111: display unit, 112: touch sensor, 113: button switch, 113: microphone, 114: button switch, 211: ROM, 212: RAM, 213: storage unit, 214: timekeeping unit, 215: gyro sensor, 216: communication I/F, 301: speech recognition unit, 302: utterance style determination unit, 303: operation target extraction unit, 304: process determination unit, 305: execution processing unit, 305: process execution unit, 306: input unit, 307: output unit, 308: operation target keyword list storage unit, 309: correspondence list storage unit

The speech processing apparatus according to the embodiment includes voice conversion means, utterance determination means, process determination means, and execution means. The voice conversion means converts an input voice into character string information indicating the content uttered in the voice. The utterance determination means analyzes the input voice and determines the user's utterance style when the voice was uttered. The process determination means determines a process to be executed based on the utterance style. The execution means executes the process that is determined by the process determination means and recognized from the character string information, according to the period during which the voice, before conversion into the character string information, continues.
The speech processing apparatus according to the embodiment includes voice conversion means and execution means. The voice conversion means converts an input voice into character string information indicating the content uttered in the voice. The execution means executes the process recognized from the character string information according to the period during which the voice, before conversion into the character string information, continues.

Claims

Voice conversion means for converting the input voice into character string information indicating the content uttered by the voice;
Utterance determination means for analyzing the input voice and determining the user's utterance style when the voice is uttered;
Processing determination means for determining processing to be executed based on the utterance style;
Execution means for executing the processing determined by the processing determination means using the character string information;
A speech processing apparatus comprising:

The apparatus further comprises storage means for storing the user's utterance style and the process to be executed in association with each other.
The speech processing apparatus according to claim 1.

Target extraction means for extracting, from the character string information converted by the voice conversion means, identification information that is a target of the process executed by the execution means;
The execution means uses the identification information extracted by the target extraction means for the process.
The speech processing apparatus according to claim 1 or 2.

The process determination means determines an application that executes the process,
The execution means starts the application determined by the process determination means and passes the character string information to the application,
The speech processing apparatus according to any one of claims 1 to 3.

The utterance determination means determines any one of a voice without inflection, a voice with intense inflection, a high voice, a low voice, a faint voice, and a whisper as a user's utterance style.
The speech processing apparatus according to claim 1.

A speech processing method executed by a speech processing apparatus,
A voice conversion step in which the voice conversion means converts the input voice into character string information indicating the content uttered by the voice;
An utterance determination step in which utterance determination means analyzes the input voice and determines the user's utterance style when the voice is uttered;
A process determining step for determining a process to be executed based on the utterance style;
An execution step in which execution means executes the process determined in the process determination step using the character string information;
A speech processing method comprising: