JP2015184487A

JP2015184487A - Voice processor and voice processing method

Info

Publication number: JP2015184487A
Application number: JP2014060862A
Authority: JP
Inventors: Kentaro Torii (鳥居 健太郎); Satoshi Aida (相田 聡)
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2014-03-24
Filing date: 2014-03-24
Publication date: 2015-10-22
Anticipated expiration: 2034-03-24
Also published as: JP5802784B2

Abstract

PROBLEM TO BE SOLVED: To acquire the text of the utterance content without requiring the user to repeat the utterance from the start when voice recognition is performed during the user's utterance and fails partway through. SOLUTION: According to an embodiment of the present invention, a voice processor includes an acquisition part, a transmission part, a storage part, a reception part, and a control unit. The acquisition part successively acquires voice data representing the content a user has uttered. The transmission part transmits to a voice recognition system a voice recognition request for the voice data acquired by the acquisition part. The storage part stores the voice data acquired by the acquisition part. The reception part receives from the voice recognition system a voice recognition response including either text obtained by converting the voice data through voice recognition or information indicating that voice recognition of the voice data failed. The control unit identifies, on the basis of the voice recognition response, the voice data for which voice recognition failed, and performs control so that a voice recognition request for data including that voice data is transmitted to the voice recognition system on the basis of the voice data stored in the storage part.

Description

本発明の実施形態は、音声処理装置および音声処理方法に関する。 Embodiments described herein relate generally to a voice processing apparatus and a voice processing method.

In home medical care and nursing care, multiple medical and nursing staff members are involved in caring for patients and care recipients and in assisting with their daily lives. In doing so, these staff members observe and diagnose the condition of the patient or care recipient. No single staff member observes a patient continuously; rather, staff members from various occupations visit and observe the patient at different dates, times, and intervals. For this reason, so that the staff members can share information about a patient, observation results are registered in an electronic medical record system, a nursing and care recording system, or an SNS.

As a system for sharing observation results about patients, an information sharing system using voice messages (hereinafter referred to as a voice tweet system) is known. In a voice tweet system, each staff member speaks a patient's observation results into the microphone of a portable terminal such as a smartphone, and a voice tweet registration application installed on the terminal records the speech, thereby generating a voice message. Each staff member sends the generated voice message to a server and registers it so that it is shared among the staff. At this time, the voice message is converted into text by voice recognition, and the text is attached to the voice message together with tags such as the ID of the patient who is the subject of the utterance, the staff ID of the speaker, the utterance time, the utterance place, and keywords extracted from the voice message. A voice message with such text and tags attached is called a voice tweet. Each staff member can browse or listen to the voice tweets stored on the server from a portable terminal or a personal computer.

音声つぶやきシステムの場合に、職員は、音声認識したテキストをサーバに送信する前に、テキストの内容が発話した音声に一致しているか事前に確認したい場合がある。また、一般的に、発話した内容をテキスト化してユーザ端末に保存する場合も、発話した内容が正しく音声認識されているか、確認したい場合がある。この際、音声認識したテキストを出来るだけ速く確認できるようにしつつ、ユーザ端末の低消費電力・低コストを図ることが望まれる。 In the case of a voice tweet system, the staff member may want to confirm in advance whether the content of the text matches the spoken voice before sending the voice-recognized text to the server. In general, when the uttered content is converted into text and stored in the user terminal, it may be desired to confirm whether the uttered content is correctly recognized. At this time, it is desired to reduce the power consumption and cost of the user terminal while making it possible to confirm the speech-recognized text as quickly as possible.

ここで、音声メッセージをテキストに変換する音声認識は、端末内部で行うことや、外部の音声認識システムを利用する方法がある。また、発話を録音した音声ファイルを音声認識するバッチ音声認識や、発話中に音声認識をするリアルタイム音声認識がある。上記したような音声認識を出来るだけ速く確認しつつ、ユーザ端末の低消費電力・低コストを図る観点から、外部の音声認識システムを利用したリアルタイム音声認識を用いることが考えられる。 Here, voice recognition for converting a voice message into text can be performed inside the terminal or using an external voice recognition system. In addition, there are batch speech recognition for recognizing sound files in which utterances are recorded, and real-time speech recognition for performing speech recognition during utterances. It is conceivable to use real-time speech recognition using an external speech recognition system from the viewpoint of reducing power consumption and cost of the user terminal while confirming speech recognition as described above as quickly as possible.

However, with real-time speech recognition, speech recognition is likely to fail if communication is interrupted in the middle of an utterance, or if the speech recognition system is recognizing the speech of many users and its resources are strained. When speech recognition fails, the user has to repeat the utterance from the beginning, which places a heavy burden on the user.

Japanese Patent No. 5414865

An object of the embodiments of the present invention is to make it possible, when speech recognition is performed during a user's utterance, to acquire the text of the uttered content without the user having to repeat the utterance, even if speech recognition fails partway through.

本発明の実施形態として音声処理装置は、取得部、送信部、記憶部、受信部、および制御部を備える。 As an embodiment of the present invention, a speech processing apparatus includes an acquisition unit, a transmission unit, a storage unit, a reception unit, and a control unit.

前記取得部は、ユーザが発話した内容を表す音声データを順次取得する。 The acquisition unit sequentially acquires audio data representing the content uttered by the user.

前記送信部は、前記取得部により取得された音声データの音声認識依頼を、音声認識システムに送信する。 The transmission unit transmits a voice recognition request for the voice data acquired by the acquisition unit to a voice recognition system.

The storage unit stores the audio data acquired by the acquisition unit.

前記受信部は、前記音声認識システムから、前記音声データを音声認識により変換したテキストまたは前記音声データの音声認識の失敗を示す情報、を含む音声認識応答を受信する。 The receiving unit receives, from the voice recognition system, a voice recognition response including text converted from the voice data by voice recognition or information indicating a voice recognition failure of the voice data.

The control unit identifies, based on the voice recognition response, the voice data for which voice recognition failed, and performs control so that a voice recognition request for data including that voice data is transmitted to the voice recognition system based on the voice data stored in the storage unit.

FIG. 1 is a functional block diagram of a voice processing apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the voice processing apparatus of FIG. 1.
FIG. 3 is a flowchart continuing from FIG. 2.
FIG. 4 is a diagram showing an example of a recognition result table.
FIG. 5 is a flowchart of the control of re-speech recognition.
FIG. 6 is a flowchart of the operation according to a third embodiment.
FIG. 7 is an overall configuration diagram of a system according to a fifth embodiment.
FIG. 8 is an overall configuration diagram of a system according to a sixth embodiment.

以下、図面を参照しながら、本発明の実施形態について説明する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.

(First embodiment)
FIG. 1 is a functional block diagram of a voice processing apparatus according to an embodiment of the present invention. A voice processing apparatus 101 is connected to a voice recognition system 201 via a network 301.

音声処理装置１０１は、録音部（取得部）１１、ファイル記憶部１２、送信部１３、受信部１４、認識結果記憶部１５、制御部１６、表示部１７、入力部１８を備える。 The voice processing device 101 includes a recording unit (acquisition unit) 11, a file storage unit 12, a transmission unit 13, a reception unit 14, a recognition result storage unit 15, a control unit 16, a display unit 17, and an input unit 18.

The voice processing apparatus 101 can be implemented on a user terminal such as a smartphone, a mobile terminal, a tablet, or a PC. The functions of the processing units included in the voice processing apparatus 101 can be realized using the CPU, memory, auxiliary storage device, communication device, and input/output interface that a user terminal generally has. It is assumed that the user terminal on which the voice processing apparatus 101 is mounted either is equipped with a microphone that collects the voice uttered by the user and converts it into an electric signal, or allows such a microphone to be attached by external connection. The microphone can also be incorporated into the voice processing apparatus 101 itself.

図１の各処理部の動作は、一例として、ＣＰＵ上で稼働するオペレーティングシステム（ＯＳ）と、ＯＳ上で稼働するアプリケーションにより達成される動作として実現できる。 The operation of each processing unit in FIG. 1 can be realized as an operation achieved by an operating system (OS) running on a CPU and an application running on the OS, for example.

入力部１８は、ユーザが各種指示を入力する入力インタフェースである。例えば、タッチパネル、入力ボタン、マウス、キーボードなどがある。入力部１８で入力された情報は、制御部１６へ送られる。 The input unit 18 is an input interface through which a user inputs various instructions. For example, there are a touch panel, an input button, a mouse, a keyboard, and the like. Information input through the input unit 18 is sent to the control unit 16.

表示部１７は、外部から入力される画像信号に基づき、画像を表示する出力インタフェースである。表示部１７は、例えば、液晶パネル、有機ＥＬパネル、電子インクパネルなどがある。 The display unit 17 is an output interface that displays an image based on an image signal input from the outside. Examples of the display unit 17 include a liquid crystal panel, an organic EL panel, and an electronic ink panel.

As described above, the user terminal on which the voice processing apparatus is mounted either has a built-in microphone or allows a microphone to be connected externally. The microphone collects the voice uttered by the user and converts it into an analog electrical signal. The electrical signal is then converted into digital audio data of a predetermined format and input to the recording unit 11. The conversion into digital audio data is performed, for example, by the CPU of the user terminal, inside the microphone, or by another processing circuit.

The recording unit 11 includes an acquisition unit (not shown) that acquires the input digital audio data in fixed-size pieces, starting from the beginning, each piece being treated as one item of audio data. The recording unit 11 sequentially stores the audio data acquired by the acquisition unit in a memory (not shown). When storing an item in the memory, the recording unit 11 places it immediately after the previously acquired audio data. As a result, the audio data acquired from the start to the end of the recording are concatenated in time series in the memory. The recording unit 11 creates an audio file in a predetermined format by adding a file header and the like to the concatenated audio data. Any file format may be used. Examples include the WAVE (RIFF Waveform Audio Format) format and the PCM (Pulse Code Modulation) format. A lossy compression format such as mp3 may also be used. Alternatively, the audio file may be losslessly compressed in a predetermined format, and the compressed audio file may be stored instead. The recording unit 11 stores the audio file created in the memory in the file storage unit 12. If the memory and the file storage unit 12 are the same medium, this storing operation can be omitted.
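The file creation step described above can be sketched with Python's standard `wave` module. This is only an illustration under stated assumptions: the sample rate, sample width, channel count, and the function name `build_wav` are hypothetical and are not specified in the description.

```python
import io
import wave


def build_wav(chunks, sample_rate=16000, sample_width=2, channels=1):
    """Concatenate PCM chunks in acquisition order and wrap them in a WAV container.

    The audio parameters are assumptions chosen for illustration only.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        # The wave module writes the RIFF/WAVE header around the joined data.
        wav.writeframes(b"".join(chunks))
    return buf.getvalue()


# Two fixed-size chunks of 160 16-bit samples each (dummy data).
chunks = [b"\x00\x01" * 160, b"\x02\x03" * 160]
wav_bytes = build_wav(chunks)
```

Because the chunks are joined in the order they were acquired, the resulting file holds the whole utterance from start to end, which is what later makes re-recognition possible without asking the user to speak again.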

In addition, the recording unit 11 assigns an identifier to each item of audio data extracted at the fixed size as described above. The acquisition time of the audio data may be used as the identifier. The time may be obtained from a system clock (not shown). Instead of the acquisition time, another type of identifier, such as a sequentially increasing sequence number, may be assigned. Information (a header or the like) required by the protocol of the voice recognition service provided by the voice recognition system 201 may also be added to the audio data. The identifier may be included in a predetermined field of this header. The recording unit 11 sends the audio data with its identifier to the transmission unit 13. While the user is speaking, audio data is therefore sent to the transmission unit 13 in parallel with the user's utterance.
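As a minimal sketch of the fixed-size extraction and identifier assignment described above, the following uses a sequentially increasing sequence number as the identifier. The names `AudioChunk` and `split_into_chunks`, and the chunk size, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class AudioChunk:
    seq: int      # identifier: a sequentially increasing sequence number
    data: bytes   # one fixed-size piece of the digital audio data


def split_into_chunks(pcm: bytes, chunk_size: int) -> Iterator[AudioChunk]:
    """Yield fixed-size chunks from the start of the data; the last may be shorter."""
    for seq, start in enumerate(range(0, len(pcm), chunk_size)):
        yield AudioChunk(seq=seq, data=pcm[start:start + chunk_size])


chunks = list(split_into_chunks(b"abcdefghij", 4))
```

An acquisition timestamp could be used in place of `seq`; a sequence number is chosen here only because it makes the ordering explicit.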

ファイル記憶部１２は、録音部１１により生成された音声ファイルを内部に記憶する。ファイル記憶部１２は、例えばＯＳに搭載されているファイルシステムによって管理されている。ファイル記憶部１２は、ハードディスク、ＳＳＤ、メモリなど、任意の記憶媒体で構成できる。 The file storage unit 12 stores therein the audio file generated by the recording unit 11. The file storage unit 12 is managed by, for example, a file system installed in the OS. The file storage unit 12 can be composed of an arbitrary storage medium such as a hard disk, SSD, or memory.

The transmission unit 13 communicates with the voice recognition system 201 based on a predetermined communication protocol. Any communication protocol may be used; for example, the transmission unit 13 performs TCP/IP-based (or UDP/IP-based) protocol processing. As part of the TCP/IP-based protocol processing, HTTP, which runs above TCP/IP, can also be used. The transmission unit 13 also handles the protocol for communicating with the network 301. Examples include a protocol conforming to a wireless LAN standard, a protocol for a cellular system such as 3G, and the Ethernet protocol.

送信部１３は、録音部１１から入力される一定サイズごとの音声データとその識別子から、当該音声データの音声認識を要求する音声認識依頼を生成する。音声認識依頼は、音声データとその識別子を含む。送信部１３は、生成した音声認識依頼のデータをパケット化し、音声認識システム２０１に、ネットワーク３０１を介して送信する。ユーザが発話中であれば、音声認識依頼が発話中に音声認識システム２０１に送信されることになる。 The transmission unit 13 generates a speech recognition request for requesting speech recognition of the speech data from the speech data of a certain size input from the recording unit 11 and its identifier. The voice recognition request includes voice data and its identifier. The transmission unit 13 packetizes the generated voice recognition request data and transmits it to the voice recognition system 201 via the network 301. If the user is speaking, a voice recognition request is transmitted to the voice recognition system 201 while speaking.

音声認識システム２０１は、音声処理装置１０１から音声認識依頼を受信し、音声認識依頼から音声データと識別子を抽出する。音声認識システム２０１は、当該音声データを音声認識によりテキストに変換する。このテキストとは、任意の文字列のことである。音声認識システム２０１は、音声認識に成功した場合は、生成したテキストと、抽出した識別子とを含む音声認識結果を生成する。音声認識結果に、音声認識が成功したことを表す成功情報を含めても良い。 The speech recognition system 201 receives a speech recognition request from the speech processing apparatus 101, and extracts speech data and an identifier from the speech recognition request. The voice recognition system 201 converts the voice data into text by voice recognition. This text is an arbitrary character string. If the speech recognition system 201 succeeds in speech recognition, the speech recognition system 201 generates a speech recognition result including the generated text and the extracted identifier. The speech recognition result may include success information indicating that the speech recognition is successful.

On the other hand, when speech recognition fails, the speech recognition system 201 generates a speech recognition result including an identifier and failure information indicating that speech recognition has failed. Speech recognition may fail when the speech recognition system 201 is processing a large amount of speech data and has no spare resources, or when a system fault makes speech recognition itself impossible. It may also fail when the speech data itself is abnormal (for example, when it contains a value that cannot be processed).

音声認識システム２０１は、音声認識に成功した場合、および失敗した場合のいずれの場合も、音声認識結果を生成し、音声認識結果を含む音声認識応答を、音声処理装置１０１に送信する。 The speech recognition system 201 generates a speech recognition result and transmits a speech recognition response including the speech recognition result to the speech processing apparatus 101 in both cases where the speech recognition succeeds and fails.

The receiving unit 14 communicates with the voice recognition system 201 based on a predetermined communication protocol. The receiving unit 14 also handles the wireless or wired communication protocol used with the network 301. These communication protocols are the same as those of the transmission unit 13. The receiving unit 14 receives a voice recognition response from the voice recognition system 201 via the network 301 and extracts the voice recognition result from it. If voice recognition succeeded, the voice recognition result includes the text converted from the voice data and the identifier. It may further include success information indicating that voice recognition succeeded. If voice recognition of the voice data failed, the voice recognition result instead includes the identifier and failure information indicating that voice recognition failed.
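The request and response contents described above can be illustrated as follows. The description does not define a wire format, so the JSON layout, the field names (`id`, `audio`, `status`, `text`), and the helper names are all assumptions made for this sketch.

```python
import base64
import json
from dataclasses import dataclass
from typing import Optional


def make_request(chunk_id: int, audio: bytes) -> str:
    """Build a recognition request carrying the audio data and its identifier."""
    return json.dumps({
        "id": chunk_id,
        "audio": base64.b64encode(audio).decode("ascii"),
    })


@dataclass
class RecognitionResult:
    chunk_id: int
    ok: bool
    text: Optional[str] = None


def parse_response(payload: str) -> RecognitionResult:
    """Extract the result: text plus identifier on success, failure plus identifier otherwise."""
    body = json.loads(payload)
    if body.get("status") == "failure":
        return RecognitionResult(chunk_id=body["id"], ok=False)
    return RecognitionResult(chunk_id=body["id"], ok=True, text=body["text"])


request = make_request(7, b"\x00\x01")
success = parse_response('{"id": 7, "status": "success", "text": "hello"}')
failure = parse_response('{"id": 8, "status": "failure"}')
```

The identifier is echoed in both outcomes, which is what allows the device to match each response to the chunk it sent.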

制御部１６は、録音部１１の開始および終了を含む動作を制御する。例えば、入力部１８からのユーザ指示により録音部１１を起動して、音声録音を開始する。また、入力部１８からのユーザ指示により、録音部１１を停止することで、音声録音を終了する。 The control unit 16 controls operations including the start and end of the recording unit 11. For example, the recording unit 11 is activated by a user instruction from the input unit 18 to start voice recording. Further, the voice recording is ended by stopping the recording unit 11 in accordance with a user instruction from the input unit 18.

When each item of voice data is transmitted, the control unit 16 also stores the identifier added to it as a list in the recognition result storage unit 15 or another memory, thereby keeping track of the transmitted voice data. Likewise, it stores each voice recognition result acquired by the receiving unit 14 as a list in the recognition result storage unit 15 or another memory, thereby keeping track of the received voice recognition results. By comparing the identifiers added to the transmitted voice data with the identifiers in the received voice recognition results, the control unit 16 can determine for which of the transmitted voice data a voice recognition result has been received.
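The bookkeeping described above amounts to comparing two sets of identifiers. A minimal sketch, with `ChunkTracker` and its method names being hypothetical:

```python
class ChunkTracker:
    """Tracks which transmitted chunks have a recognition result (hypothetical helper)."""

    def __init__(self):
        self.sent_ids = []   # identifiers in transmission order
        self.results = {}    # identifier -> text, or None for a reported failure

    def mark_sent(self, chunk_id):
        self.sent_ids.append(chunk_id)

    def record_result(self, chunk_id, text):
        self.results[chunk_id] = text

    def pending(self):
        """Identifiers that were sent but have no result yet."""
        return [i for i in self.sent_ids if i not in self.results]

    def failed(self):
        """Identifiers whose result reported a recognition failure."""
        return [i for i in self.sent_ids
                if i in self.results and self.results[i] is None]


tracker = ChunkTracker()
for chunk_id in range(4):
    tracker.mark_sent(chunk_id)
tracker.record_result(0, "text for chunk 0")
tracker.record_result(2, None)  # the service reported a failure for chunk 2
```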

Based on the voice recognition results, the control unit 16 determines whether voice recognition of the user's current utterance has succeeded. Specifically, if voice recognition failed for at least H items (H is an integer of 1 or more) of the voice data transmitted to the voice recognition system 201, the control unit 16 determines that voice recognition of the current utterance has failed. If the number of failed items is less than H, it determines that recognition has succeeded. For example, when H = 1, voice recognition of the current utterance is determined to have succeeded only if voice recognition succeeded for all of the transmitted voice data, and is determined to have failed if voice recognition failed for even one item. The following description assumes H = 1, but the present embodiment is not limited to this.

制御部１６は、今回の発話に関し、成功の決定をした場合は、各音声データから変換されたテキストを時系列に結合して発話テキストを生成し、発話テキストを画面に表示する。 When the control unit 16 determines success for the current utterance, the control unit 16 generates the utterance text by combining the text converted from each voice data in time series, and displays the utterance text on the screen.

For example, consider a case where the user says, "I visited Mr. Yamada. The complexion is good. However, I will prescribe an antipyretic just in case. That is all." Suppose that a plurality of items of voice data (four here) are acquired sequentially from this utterance, a voice recognition request containing each item is transmitted to the voice recognition system 201, and four voice recognition results are returned. Suppose the first voice recognition result contains the text "I visited Mr. Yamada.", the second contains "The complexion is good.", the third contains "However, I will prescribe an antipyretic just in case.", and the fourth contains "That is all." In this case, voice recognition succeeded for all of the voice data, so success is determined, and the utterance text obtained by combining these texts, "I visited Mr. Yamada. The complexion is good. However, I will prescribe an antipyretic just in case. That is all.", is displayed on the screen.
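The success decision with threshold H and the time-series combination of the texts can be sketched together as follows. The function name `build_utterance_text`, the tuple layout, and the `sep` parameter are assumptions; a space separator is used here only because the example texts are in English.

```python
def build_utterance_text(results, h=1, sep=" "):
    """Return the combined utterance text, or None if the utterance is judged failed.

    results: list of (identifier, text) pairs, where text is None for a failed chunk.
    h: the utterance is judged failed when at least h chunks failed.
    """
    failures = sum(1 for _, text in results if text is None)
    if failures >= h:
        return None
    # Combine the texts in time-series order, i.e. by increasing identifier.
    ordered = sorted(results, key=lambda pair: pair[0])
    return sep.join(text for _, text in ordered if text is not None)


# Results may arrive out of order; identifiers restore the original order.
results = [
    (2, "However, I will prescribe an antipyretic just in case."),
    (0, "I visited Mr. Yamada."),
    (1, "The complexion is good."),
    (3, "That is all."),
]
utterance = build_utterance_text(results)
with_failure = build_utterance_text(results[:3] + [(3, None)])
```

With the default `h=1`, a single failed chunk makes the whole utterance count as failed, matching the case assumed in the description.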

On the other hand, when failure is determined, a message indicating that voice recognition failed is displayed on the screen. At that time, a message notifying the user that the content just uttered does not need to be uttered again may be displayed on the application screen. Alternatively, a message notifying the user that voice recognition will be (automatically) re-requested from the voice recognition system 201 may be displayed on the screen. This is possible because the voice data from the start to the end of the utterance is saved in the audio file, so that voice recognition can later be requested from the voice recognition system 201 using this file. As an example of the message displayed when voice recognition fails, the screen may indicate that real-time voice recognition failed but that voice recognition will be attempted again using the audio file in which the voice from the start to the end of the utterance was recorded. This lets the user understand that the content already uttered does not need to be uttered again. The message notifying the user that the utterance does not need to be repeated may be displayed at any point after acquisition of the voice data is complete, such as immediately after acquisition of the voice data is complete, immediately after the audio file is generated, or immediately after transmission of the voice recognition requests is complete.

The recognition result storage unit 15 stores correspondence information for managing the correspondence among a user's utterance, the voice recognition result for the utterance, and the audio file. The correspondence information can be held in any format, such as a table. Here, the correspondence information is assumed to be in table form, and this table is called a recognition result table. The recognition result storage unit 15 can be any storage medium, such as a hard disk, an SSD, or a memory. It may be the same device as the file storage unit 12 or a different device.

図４に、認識結果テーブルの例を示す。認識結果テーブルは、例えばテキストファイル、またはデータベースとして保持されることができる。認識結果テーブルは、「発話日時」、「認識結果」、「音声ファイル」の列を有する。ユーザの発話ごと（録音の１単位ごと）に、制御部１６により１つのエントリーが追加される。 FIG. 4 shows an example of the recognition result table. The recognition result table can be held as a text file or a database, for example. The recognition result table has columns of “utterance date / time”, “recognition result”, and “voice file”. One entry is added by the control unit 16 for each user utterance (one unit of recording).

「発話日時」列は、ユーザの発話日時を特定する情報を保持する。例えば、アプリケーションの画面上の「登録ボタン」を選択（タッチ、クリックなど）した日時、もしくは発話を開始した日時を格納する。発話を開始した日時は、最初に取得される音声データの先頭のタイミングの日時である。発話日時により、ユーザの発話が識別される。なお、発話ごとに、発話の識別情報（発話ＩＤ）を発番する場合は、その発話ＩＤを保持する列を設けても良い。 The “speech date / time” column holds information for specifying the user's utterance date / time. For example, the date and time when the “registration button” on the application screen is selected (touch, click, etc.) or the date and time when the utterance is started is stored. The date and time when the utterance is started is the date and time at the beginning of the voice data acquired first. The user's utterance is identified by the utterance date and time. When utterance identification information (utterance ID) is issued for each utterance, a column for holding the utterance ID may be provided.

「認識結果」列は、ユーザの発話に対して、各音声データから変換されたテキストを結合した発話テキスト、または、音声認識が未完了であることを示す情報を保持する。音声認識が未完了であることを示す情報の例として、音声認識に失敗したことを示す情報がある。図４では＜失敗＞という情報がこれに相当する。 The “recognition result” column holds speech text obtained by combining text converted from each speech data with respect to the user speech, or information indicating that speech recognition is not completed. As an example of information indicating that the speech recognition is incomplete, there is information indicating that the speech recognition has failed. In FIG. 4, the information <failure> corresponds to this.

「音声ファイル」列は、ファイル記憶部１２に記憶されている音声ファイルへのパスを保持する。パスとは、音声ファイルの格納場所を特定する情報である。この情報は、ファイル記憶部１２を管理するファイルシステムから取得できる。 The “voice file” column holds a path to the voice file stored in the file storage unit 12. The path is information for specifying the storage location of the audio file. This information can be acquired from a file system that manages the file storage unit 12.

制御部１６は、上述のように、今回の発話に関し、音声認識の成功または失敗の決定をしたら、認識結果テーブルに、エントリーを追加する。具体的に、成功または失敗のいずれを決定した場合も、「発話日時」列に、発話日時を特定する情報を格納し、「音声ファイル」列には、今回のユーザの発話に関する音声ファイルへのパスを格納する。「認識結果」列には、成功の場合は、発話テキストを格納し、失敗の場合は、音声認識の未完了または失敗を表す情報（ここでは、＜失敗＞）を格納する。 As described above, the control unit 16 adds an entry to the recognition result table after determining the success or failure of the speech recognition for the current utterance. Specifically, regardless of success or failure, information for specifying the utterance date and time is stored in the “utterance date and time” column, and the “voice file” column stores information for the voice file related to the user's utterance this time. Stores the path. In the “recognition result” column, the speech text is stored in the case of success, and information (in this case, <failure>) indicating incomplete or failed speech recognition is stored in the case of failure.
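The entry that is added per utterance can be sketched as a small record with the three columns of the recognition result table. The class and function names, the sample dates, and the file paths below are made up for illustration; only the column meanings and the `<failure>` marker come from the description.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

FAILED = "<failure>"  # marker stored in the "recognition result" column on failure


@dataclass
class Entry:
    uttered_at: datetime   # "utterance date/time" column
    result: str            # "recognition result" column: utterance text or FAILED
    audio_path: str        # "voice file" column: path to the stored audio file


def add_entry(table: List[Entry], uttered_at: datetime,
              utterance_text: Optional[str], audio_path: str) -> None:
    """Add one entry per utterance, whether recognition succeeded or failed."""
    result = utterance_text if utterance_text is not None else FAILED
    table.append(Entry(uttered_at, result, audio_path))


table: List[Entry] = []
add_entry(table, datetime(2014, 3, 24, 10, 0), "I visited Mr. Yamada.", "/audio/0001.wav")
add_entry(table, datetime(2014, 3, 24, 10, 5), None, "/audio/0002.wav")
```

The audio path is recorded in both cases, so a failed entry always points to the recording needed for a later retry.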

In addition, at fixed time intervals, the control unit 16 checks the recognition result table for utterances whose voice recognition failed. For example, it identifies entries whose "recognition result" column is "<failure>". The control unit 16 reads the audio file corresponding to each identified utterance from the file storage unit 12, using the file path held in the "voice file" column. If the audio file is stored compressed in a data format that the voice recognition system 201 does not support, the control unit 16 decodes the read audio file into a data format that the voice recognition system 201 supports.

The control unit 16 extracts the entire voice data contained in the audio file and transmits a voice recognition request for the entire voice data to the voice recognition system 201. Instead of sending the entire voice data at once, the control unit 16 may divide it into fixed-size pieces and transmit them in order from the beginning of the file. In this case as well, an identifier is assigned to each piece of voice data before transmission, so that each piece can be matched with the voice recognition result acquired by the receiving unit 14. If the voice recognition system 201 accepts audio files directly, the audio file itself can be transmitted without extracting the voice data from it.

The receiving unit 14 receives a speech recognition response containing the speech recognition result from the speech recognition system 201. When the entire audio data was sent at once and speech recognition succeeded, the text in the speech recognition result (the utterance text) is written to the "recognition result" column of the corresponding entry, overwriting <failure>. When the audio data was divided before transmission, the utterance text formed by joining the texts for the individual pieces is stored in the "recognition result" column only if speech recognition succeeded for every piece. If speech recognition failed for even one piece, information indicating failure (such as <failure>) is stored.
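The divide-and-recombine rule described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the apparatus's actual implementation: `recognize` is a hypothetical stand-in for a request to the speech recognition system 201 that returns the recognized text or `None` on failure, and using each piece's position in the file as its identifier is also an assumption.

```python
from typing import Callable, Optional

FAILURE = "<failure>"  # marker stored in the "recognition result" column


def split_audio(data: bytes, chunk_size: int) -> list:
    """Divide audio data into fixed-size pieces, numbered from the start of the file."""
    return [
        (index, data[offset:offset + chunk_size])
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]


def rerecognize(
    data: bytes,
    chunk_size: int,
    recognize: Callable[[int, bytes], Optional[str]],
) -> str:
    """Return the joined utterance text, or FAILURE if any piece fails."""
    results = {}
    for ident, chunk in split_audio(data, chunk_size):
        # The identifier lets each response be matched to its request.
        results[ident] = recognize(ident, chunk)
    if any(text is None for text in results.values()):
        return FAILURE
    # Join the per-piece texts in order from the beginning of the file.
    return "".join(results[ident] for ident in sorted(results))
```

Returning the failure marker as soon as any piece fails mirrors the all-or-nothing rule of this embodiment; partial results are not stored here.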

When re-recognition succeeds, the control unit 16 may notify the user by outputting a message to that effect on the display unit 17. The recognized utterance text may be displayed on the display unit 17 on the same screen as the message or on a separate screen, or it may be displayed when the user so instructs. When the message is output, a notification sound may be played from the speaker or the vibrator may be activated.

Conversely, when re-recognition fails, a message to that effect may be output on the display unit 17. A notification sound may be played or the vibrator activated at this time as well. The pattern and volume of the notification sound, and the pattern and strength of the vibration, may differ between success and failure.

Besides the processing described above, the control unit 16 can perform various other kinds of control. For example, it displays screens on the display unit 17 that prompt the user for various inputs, and operates according to the instructions entered.

The control unit 16 may also display a list of utterance dates and times on the display unit 17 based on the recognition result table, and display the utterance text corresponding to the date and time the user selects. It may also correct the text in response to the user's editing instructions for the displayed utterance text. This lets the user fix any part of the utterance text that differs from what was actually said. The control unit 16 may further accept an audio file playback instruction from the user and play the specified audio file through the speaker.

The control unit 16 may also transmit the utterance text of each utterance stored in the recognition result table, or the audio file indicated by the file path, to a separately provided server. This server collects utterance texts or audio files from multiple user terminals (speech processing apparatuses) and manages them classified by attributes such as user and utterance date and time. In response to a query from a user terminal, the server may send back the utterance texts or audio files having the requested attributes.

FIGS. 2 and 3 show flowcharts of the operation of the speech processing apparatus 101.

(Step S101) The control unit 16 displays the application screen on the display unit 17. When the user touches the "register button" on the application screen via the input unit 18, the control unit 16 detects this and displays a message prompting the user to speak ("Please speak") on the screen. At the same time, the control unit 16 starts the recording unit 11. The user is now able to input speech.

At this time, the control unit 16 may display an indication on the application screen that speech input is in progress, for example a bar or a waveform showing the volume of the user's speech.

The user begins speaking into the microphone of the user terminal. From the start of the utterance, digital data obtained by digitizing the audio signal in a predetermined format is input to the recording unit 11. Each time a fixed amount of digital data has accumulated, counting from the beginning, the recording unit 11 cuts out that fixed-size portion as a piece of audio data.
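The cutting of the incoming digital data into fixed-size pieces might look like the sketch below. It assumes the digitized input arrives as an iterable of byte blocks of arbitrary size; the function name and the handling of the final, shorter remainder are illustrative assumptions, since the text does not specify them.

```python
from typing import Iterable, Iterator


def fixed_size_chunks(stream: Iterable[bytes], chunk_size: int) -> Iterator[bytes]:
    """Yield fixed-size pieces of audio data as the digitized input arrives."""
    buffer = bytearray()
    for block in stream:
        buffer.extend(block)
        # Cut out one piece each time a full chunk_size has accumulated.
        while len(buffer) >= chunk_size:
            yield bytes(buffer[:chunk_size])
            del buffer[:chunk_size]
    if buffer:
        # Remainder at the end of the utterance (emitting it as a short
        # final piece is an assumption).
        yield bytes(buffer)
```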

(Step S102) Each time the recording unit 11 acquires a fixed-size piece of audio data, it appends the piece in memory, joining the pieces in order of utterance time. Once all pieces have been joined, it generates an audio file by setting the file header and other information.
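As one possible illustration of Step S102, the sketch below joins the pieces and lets Python's standard `wave` module write the file header. The WAV container, mono 16-bit samples, and the 16 kHz sample rate are all assumptions made for the example; the text only says that the data is in a "predetermined format".

```python
import io
import wave


def build_audio_file(chunks: list, sample_rate: int = 16000) -> bytes:
    """Join audio data pieces in utterance order and add a WAV header."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)            # mono (assumed)
        wav.setsampwidth(2)            # 16-bit samples (assumed)
        wav.setframerate(sample_rate)  # sample rate (assumed)
        wav.writeframes(b"".join(chunks))
    return buffer.getvalue()
```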

(Step S103) Each time the recording unit 11 acquires a fixed-size piece of audio data, it also assigns the piece an identifier. The transmission unit 13 transmits a speech recognition request containing the audio data and the identifier to the speech recognition system 201. The control unit 16 stores the identifiers attached to the audio data in memory and manages them as a list.

The receiving unit 14 receives a speech recognition response containing the speech recognition result from the speech recognition system 201 via the network 301. The control unit 16 stores the speech recognition results in the recognition result storage unit 15 and manages them as a list. On success, a speech recognition result contains the identifier and the text into which the audio data was converted; on failure, it contains the identifier and failure information indicating that speech recognition failed. On success, the result may additionally contain information indicating success.

(Step S104) The control unit 16 periodically checks the recognition result storage unit 15 to see whether the speech recognition results corresponding to the requests sent from the transmission unit 13 have been obtained. It does so by checking whether a speech recognition result bearing the same identifier as the one attached to the audio data exists in the recognition result storage unit 15.
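The identifier bookkeeping of Steps S103 and S104 could be sketched as below. The class and method names are hypothetical, and representing a failed result as `None` is an assumption made for brevity.

```python
from typing import Optional


class RequestTracker:
    """Tracks identifiers of sent audio data and the results received for them."""

    def __init__(self) -> None:
        self.sent = []       # identifier list kept in memory (Step S103)
        self.results = {}    # identifier -> text, or None for a failure

    def on_sent(self, ident: int) -> None:
        self.sent.append(ident)

    def on_response(self, ident: int, text: Optional[str]) -> None:
        self.results[ident] = text

    def pending(self) -> list:
        """Identifiers whose speech recognition result has not arrived (Step S104)."""
        return [ident for ident in self.sent if ident not in self.results]
```

An empty `pending()` list after the utterance ends corresponds to the state in which all results have been received.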

(Step S105) If a speech recognition result exists in the recognition result storage unit 15, the control unit 16 examines its content.

(Step S106) If the speech recognition result contains failure information, the control unit 16 determines that speech recognition of the current utterance has failed. That is, if speech recognition fails for even one of the pieces of audio data sent to the speech recognition system 201, speech recognition for the current utterance is determined to have failed. To record this determination, the control unit 16 keeps a predetermined flag in a predetermined area of the recognition result storage unit 15 or in memory, and sets the flag to a value indicating failure (Step S107). The flag is assumed to be initialized to a value indicating success.

(Step S108) After recording the failure in Step S107, the control unit 16 determines whether the user's utterance has ended. It makes the same determination when the speech recognition result examined in Step S105 contains text (or information indicating success), and when it is determined in Step S105 that no speech recognition result exists. When the user finishes speaking, the user presses the "end utterance button" on the application screen, and the control unit 16 determines that the utterance has ended by detecting this press. Having determined that the utterance has ended, the control unit 16 checks the processing state of the recording unit 11. If the recording unit 11 has finished processing, the control unit 16 stops it; if it is still running, the control unit 16 waits until processing completes. Recording then ends.

Instead of having the user end recording explicitly, the control unit 16 may end recording automatically by detecting a gap in the utterance. A gap is, for example, a period in which the volume stays at or below a threshold for at least a fixed length of time. If the user's utterance has not yet ended, the process returns to Step S101 and audio data acquisition continues.
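The gap detection mentioned above can be illustrated with a small sketch. It assumes the volume is sampled once per fixed-length frame, so that "a fixed length of time" becomes a number of consecutive frames; the function and parameter names are hypothetical.

```python
def utterance_ended(volumes: list, threshold: float, min_frames: int) -> bool:
    """Return True if the latest `min_frames` volume readings are all at or below `threshold`."""
    if min_frames <= 0 or len(volumes) < min_frames:
        return False
    return all(volume <= threshold for volume in volumes[-min_frames:])
```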

(Step S109) Having determined that the user's utterance has ended, the control unit 16 checks whether any untransmitted audio data remains. For example, if an untransmitted speech recognition request remains in the transmission buffer of the transmission unit 13 (including the case where no acknowledgment (ACK) has been returned from the speech recognition system 201), it judges that untransmitted audio data remains. It makes the same judgment when the utterance has ended but the recording unit 11 is still operating. If untransmitted audio data remains, the process returns to Step S103 and transmission continues.

(Step S110) If no untransmitted audio data remains, that is, if all audio data has been sent, the control unit 16 determines whether the speech recognition results for all of the audio data have been received. If any result has not yet been received, the process returns to Step S104.

(Step S111) If the speech recognition results for all of the audio data have been received, the control unit 16 checks the predetermined flag described above. At this point the user's utterance is complete, all audio data has been sent, and the speech recognition results for all of it have been received.

(Step S112) If the predetermined flag is set to the value indicating failure, speech recognition failed for at least one piece of audio data among all those from the start to the end of the user's utterance. In this case, a speech recognition failure message is displayed on the screen. As described above, a message informing the user that there is no need to repeat the utterance may also be displayed on the application screen.

(Step S113) If, on the other hand, the predetermined flag is not set to the value indicating failure, speech recognition succeeded for all audio data from the start to the end of the user's utterance. In this case, the control unit 16 generates the utterance text by joining the texts contained in the speech recognition results in chronological order, and displays the generated utterance text on the screen.

(Step S114) The control unit 16 adds an entry for the current utterance to the recognition result table in the recognition result storage unit 15. That is, it stores the current utterance date and time in the "utterance date and time" column and the path to the audio file for the current utterance in the "audio file" column. In the "recognition result" column, it stores the utterance text if speech recognition of the current utterance succeeded, or information indicating failure (here, <failure>) if it failed.
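The determination in Steps S106 to S114 can be summarized in a short sketch. The `Entry` type, its field names, and the convention that a failed piece is represented as `None` are assumptions introduced for illustration; only the three columns and the <failure> marker come from the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

FAILURE = "<failure>"


@dataclass
class Entry:
    uttered_at: datetime      # "utterance date and time" column
    audio_file: str           # "audio file" column (path)
    recognition_result: str   # "recognition result" column


def finalize_utterance(results: list, uttered_at: datetime, audio_file: str) -> Entry:
    """Build the table entry from per-piece results given in utterance order."""
    failed = False                       # the flag is initialized to "success"
    for text in results:
        if text is None:                 # failure information received
            failed = True                # Step S107
    if failed:
        result = FAILURE                 # Step S112 path
    else:
        result = "".join(results)        # Step S113: join texts chronologically
    return Entry(uttered_at, audio_file, result)
```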

Depending on the state of the network 301, it may take time to transmit the speech recognition requests, or for all speech recognition responses to arrive from the speech recognition system 201. Therefore, if transmission of all speech recognition requests has not completed, or not all speech recognition responses have arrived, within a fixed time from the date and time of the utterance, the current speech recognition may be determined to have failed and handled in the same way as a failure in the flows of FIGS. 2 and 3. Although the utterance date and time is used here as the starting point of the fixed time, any date and time may be used as the starting point.

FIG. 5 is a flowchart of the control of re-recognition (batch speech recognition) for utterances whose speech recognition failed.

(S201) At fixed time intervals, the control unit 16 checks the recognition result table held in the recognition result storage unit 15 for utterances whose speech recognition failed.

(S202) The control unit 16 reads the audio file corresponding to each identified utterance from the file storage unit 12, following the file path held in the "audio file" column.

(S203) The control unit 16 extracts the audio data contained in the audio file it has read and transmits a speech recognition request for that audio data to the speech recognition system 201. As described above, the audio data may be sent all at once, or it may be divided into pieces of a fixed size and sent in order from the beginning of the file.

(S204) The receiving unit 14 receives a speech recognition response containing the speech recognition result from the speech recognition system 201. When the entire audio data was sent at once and speech recognition succeeded, the text in the speech recognition result is written as the utterance text to the "recognition result" column of the entry. When the audio data was divided before transmission, then, as in the processing shown in FIGS. 2 and 3, the texts for the individual pieces are joined into the utterance text and written to the "recognition result" column only if speech recognition succeeded for every piece. If speech recognition failed for even one piece, information indicating failure (such as <failure>) is stored.

(S205) When re-recognition succeeds, the control unit 16 may notify the user by outputting a message to that effect on the display unit 17. The recognized utterance text may be displayed on the display unit 17 on the same screen as the message or on a separate screen, or it may be displayed when the user so instructs. When re-recognition fails, a message to that effect may be output on the display unit 17.

From the time the audio file is read in Step S202, through the start of transmission of the speech recognition request, until the speech recognition result comes back, information indicating that speech recognition is in progress may be stored in the "recognition result" column of the recognition result table. For example, the text "speech recognition in progress" may be stored. During the check in Step S201, the control unit 16 does not read the audio file for any entry whose "recognition result" column contains this information. This prevents duplicate re-recognition requests for the same audio file.
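The duplicate-prevention rule above might be sketched as follows. Table entries are modeled as plain dictionaries and the in-progress marker string is an illustrative choice; both are assumptions, not details given in the text.

```python
FAILURE = "<failure>"
IN_PROGRESS = "<recognizing>"  # hypothetical "speech recognition in progress" marker


def select_for_retry(table: list) -> list:
    """Pick failed entries for re-recognition and mark them as in progress (S201-S202)."""
    picked = []
    for entry in table:
        # Entries already marked IN_PROGRESS are skipped, so the same
        # audio file is never requested twice at the same time.
        if entry["recognition_result"] == FAILURE:
            entry["recognition_result"] = IN_PROGRESS
            picked.append(entry)
    return picked
```

Calling the function again before any result returns yields an empty list, which is the behavior the paragraph above describes.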

This flow need not be executed while the flow of FIGS. 2 and 3 is running, that is, while the user is speaking and speech recognition processing based on that speech is in progress. This lowers the load on the speech processing apparatus 101 and the speech recognition system 201 and raises the likelihood that real-time speech recognition succeeds.

Various modifications of the recognition result table described above are possible. For example, when a single terminal is shared by multiple users who are switched by login ID and password to the application, a column holding the user ID may be added to the recognition result table.

As another example, the "recognition result" column may be removed from the recognition result table, leaving a table with only the "utterance date and time" and "audio file" columns. An entry is added to this table only when speech recognition of an utterance fails. When it succeeds, the utterance text is presented to the user on the screen and, after a predetermined procedure, erased from the speech processing apparatus. Examples of the predetermined procedure include the user touching a "confirm button" or the like to indicate that the text has been checked, and transmitting the utterance text to another device (for example, the server described above).

In the present embodiment, the speech processing apparatus 101 communicates with the speech recognition system 201 via the network 301, but the speech recognition system may instead be built into the speech processing apparatus 101. In that case, the speech recognition system may be connected to the same bus as the CPU, or to a separate bus via a chipset or the like. Alternatively, the functions of the speech recognition system may be implemented as a program executed by the CPU.

As described above, according to the present embodiment, while the user is speaking, speech recognition is requested from the speech recognition system in parallel with the utterance, and the content of the utterance is stored as an audio file. If speech recognition fails, it is requested again based on this audio file. Speech recognition can therefore be requested without making the user speak again, which reduces the burden on the user. In addition, when speech recognition fails, notifying the user that there is no need to repeat the utterance lets the user know that speaking again is unnecessary. The user can thus proceed with subsequent work with peace of mind even without confirming on the spot that speech recognition succeeded.

(Second Embodiment)
In the present embodiment, the message presented to the user is controlled according to the user's utterance status, the arrival status of the speech recognition results for the audio data sent to the speech recognition system 201, and the content of those results.

The case described here controls the display of three messages: one indicating that speech recognition is in progress, one indicating that speech recognition succeeded, and one indicating that speech recognition failed.

The message indicating that speech recognition is in progress is displayed when the speech recognition result for at least one piece of audio data has not yet been received and no speech recognition result containing failure information has been received. Specifically, it is displayed when both of the following conditions hold.
(Condition 1) The speech recognition result for some transmitted piece of audio data has not yet come back.
(Condition 2) No failure has come back for any transmitted piece of audio data.
A concrete example of the message is the text "Speech recognition in progress".

The message indicating that speech recognition succeeded is displayed when speech recognition has succeeded for all audio data sent to the speech recognition system 201. Specifically, it is displayed when Conditions 3 to 5 below all hold.
(Condition 3) The utterance has ended.
(Condition 4) No untransmitted audio data remains.
(Condition 5) Speech recognition succeeded for all transmitted audio data.
A concrete example of the message is the text "Speech recognition succeeded".

The message indicating that speech recognition failed is displayed when speech recognition has failed for at least one piece of the audio data sent to the speech recognition system 201. Specifically, it is displayed when Conditions 3 and 4 above and Condition 6 below all hold.
(Condition 6) Speech recognition failed for some transmitted piece of audio data.
A concrete example of the message is the text "Speech recognition failed".

No two of the three messages above are ever displayed at the same time. Conditions 1 and 5 cannot hold simultaneously, so "Speech recognition in progress" and "Speech recognition succeeded" are never displayed together. Conditions 2 and 6 cannot hold simultaneously, so "Speech recognition in progress" and "Speech recognition failed" are never displayed together. Likewise, Conditions 5 and 6 cannot hold simultaneously, so "Speech recognition succeeded" and "Speech recognition failed" are never displayed together.

As a result, at most one of "Speech recognition in progress", "Speech recognition succeeded", and "Speech recognition failed" is displayed at any time. Consequently, "Speech recognition failed" is not displayed during the utterance, and the user can be allowed to speak to the end. If "Speech recognition failed" is instead to be displayed when Condition 6 holds during the utterance, a message may also be displayed urging the user to continue speaking to the end even though speech recognition failed partway through.
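Conditions 1 to 6 can be expressed compactly as a selection function. The sketch below is only an illustration: the function name, the per-piece status strings, and the English message texts are assumptions, and treating an empty status list as "not all succeeded" is a design choice made here so that nothing is reported before any audio data has been sent.

```python
from typing import Optional

PENDING, OK, FAIL = "pending", "ok", "fail"  # status of each transmitted piece


def choose_message(utterance_ended: bool, unsent_remaining: bool, statuses: list) -> Optional[str]:
    """Pick the message to display, or None if no message applies."""
    any_pending = PENDING in statuses                           # Condition 1
    no_failure = FAIL not in statuses                           # Condition 2
    ended = utterance_ended                                     # Condition 3
    all_sent = not unsent_remaining                             # Condition 4
    all_ok = bool(statuses) and all(s == OK for s in statuses)  # Condition 5
    any_failure = FAIL in statuses                              # Condition 6

    if any_pending and no_failure:
        return "Speech recognition in progress"
    if ended and all_sent and all_ok:
        return "Speech recognition succeeded"
    if ended and all_sent and any_failure:
        return "Speech recognition failed"
    return None
```

Because the three branches test mutually exclusive combinations, at most one message is returned, and a failure that arrives while the user is still speaking produces no message at all.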

(Third Embodiment)
In the first embodiment, the full text of the utterance is displayed when speech recognition of the utterance succeeds, and a message indicating failure is displayed when speech recognition fails for some of the audio data.

In the present embodiment, for audio data whose speech recognition succeeded, the recognized text is displayed; for audio data whose speech recognition failed, text indicating that speech recognition is incomplete is displayed. An example of such text is text indicating that speech recognition failed (for example, <failure>). These texts are displayed in order of utterance time. This lets the user quickly check at least part of what was said.

For example, suppose the user says: "I visited Mr. Yamada. The complexion is good. However, I will prescribe an antipyretic just in case. That is all." Suppose that four pieces of audio data are acquired in sequence from this utterance, a speech recognition request containing each is sent to the speech recognition system 201, and four speech recognition results are returned. The first result contains the text "I visited Mr. Yamada."; the second contains failure information; the third contains the text "However, I will prescribe an antipyretic just in case."; and the fourth contains failure information.

In this case, the text for each piece of audio data whose speech recognition succeeded and text indicating failure (for example, <failure>) for each piece whose speech recognition failed are joined together and displayed. The displayed text thus becomes "I visited Mr. Yamada.<failure>However, I will prescribe an antipyretic just in case.<failure>". This joined text is stored in the "recognition result" column of the recognition result table.

To make clear how each speech recognition result corresponds to the displayed text as a whole, the texts for the individual results may be joined with a suitable symbol (for example, #). The displayed text then becomes "I visited Mr. Yamada.#<failure>#However, I will prescribe an antipyretic just in case.#<failure>#". By looking at the displayed text, the user can see that four pieces of audio data were acquired and that speech recognition failed for the second and fourth.
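Building the displayed text from per-piece results could look like the sketch below. The function name and the use of `None` for a failed piece are assumptions; ending every segment with the separator follows the example text above.

```python
FAILURE = "<failure>"


def render_results(texts: list, sep: str = "#") -> str:
    """Join per-piece results in utterance order, ending each with `sep`."""
    return "".join((FAILURE if text is None else text) + sep for text in texts)
```

For instance, `render_results(["A.", None, "B.", None])` gives `"A.#<failure>#B.#<failure>#"`, which has the same shape as the example in the text.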

In the present embodiment, as in the first embodiment, speech recognition of the utterance is judged to have failed if speech recognition fails for some of the audio data. To record this judgment, one column may be added to the recognition result table, separate from the "recognition result" column, to hold information indicating failure (for example, <failure>). On success, the value of this column may be null, or information indicating success (for example, <success>) may be stored. Whether a failure occurred can then be determined by whether this column contains "<failure>". Alternatively, without adding such a column, a failure may be detected by searching the text in the "recognition result" column for the string <failure>.

再音声認識を依頼する場合、第１の実施形態と同様に、音声認識が失敗の発話に対応する音声ファイルから、発話の音声データを取得し、音声認識システム２０１に音声認識を依頼する。取得した音声データ全体を一括して依頼してもよいし、あるいは、一定サイズごとに分割して依頼してもよい。 When requesting re-speech recognition, as in the first embodiment, speech data of an utterance is acquired from a speech file corresponding to an utterance for which speech recognition has failed, and the speech recognition system 201 is requested to perform speech recognition. The entire acquired audio data may be requested in a lump, or may be requested divided by a certain size.

または、音声認識に失敗した音声データのみについて、再音声認識を依頼してもよい。具体的に、音声ファイルから取得した音声データを一定サイズごとに分割し、このうちの何番目が音声認識に失敗したかを、上記の記号（例えば＃）に基づき特定する。 Alternatively, re-recognition may be requested only for the voice data whose recognition failed. Specifically, the voice data acquired from the voice file is divided into pieces of a fixed size, and the positions of the pieces that failed voice recognition are identified based on the above symbol (for example, #).

上述した例「山田さんを訪問しました。＃＜失敗＞＃しかし念のため解熱薬を処方しておきます。＃＜失敗＞＃」では、２番目と４番目の音声データが音声認識に失敗したと判断できる。よって、２番目と４番目の音声データのみ、再音声認識依頼を行うことになる。 In the above example, “I visited Mr. Yamada. #<failure>#But I will prescribe an antipyretic just in case. #<failure>#”, it can be determined that voice recognition failed for the second and fourth pieces of voice data. Therefore, re-recognition is requested only for the second and fourth pieces of voice data.

仮に２番目の音声データの再音声認識が成功し、４番目の音声データの再音声認識が失敗した場合、表示されるテキスト（「認識結果」列に格納されるテキスト）は、「山田さんを訪問しました。＃顔色はよいです。＃しかし念のため解熱薬を処方しておきます。＃＜失敗＞＃」となる。この場合も、再音声認識に失敗した音声データが存在するため、発話に対する音声認識結果は、失敗と判断される。 If re-recognition of the second piece of voice data succeeds and re-recognition of the fourth piece fails, the displayed text (the text stored in the “recognition result” column) becomes “I visited Mr. Yamada. #The complexion is good. #But I will prescribe an antipyretic just in case. #<failure>#”. In this case too, because voice data whose re-recognition failed still exists, the voice recognition result for the utterance is judged to be a failure.

この後、さらに再音声認識を行う場合は、４番目の音声データのみ、再音声認識を依頼する。この音声データの音声認識が成功した場合は、表示されるテキストは、「山田さんを訪問しました。＃顔色はよいです。＃しかし念のため解熱薬を処方しておきます。＃以上です。」のようになる。 If re-recognition is performed again after this, it is requested only for the fourth piece of voice data. If voice recognition of this piece succeeds, the displayed text looks like “I visited Mr. Yamada. #The complexion is good. #But I will prescribe an antipyretic just in case. #That is all.”
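
The bookkeeping described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the ASCII separator `#`, the marker `<failure>`, and the function names `split_results`, `failed_positions`, and `merge_retry_results` are introduced here for illustration only, and the sketch assumes that recognized text never contains the separator itself.

```python
SEP = "#"            # assumed joining symbol (the text gives "#" as an example)
FAIL = "<failure>"   # assumed failure marker


def split_results(combined: str) -> list[str]:
    """Split the stored text into per-piece results, ignoring one trailing separator."""
    if combined.endswith(SEP):
        combined = combined[: -len(SEP)]
    return combined.split(SEP)


def failed_positions(combined: str) -> list[int]:
    """Return the 1-based positions of the pieces whose recognition failed."""
    return [i for i, part in enumerate(split_results(combined), start=1) if part == FAIL]


def merge_retry_results(combined: str, retried: dict[int, str]) -> str:
    """Replace the results at the given 1-based positions with re-recognition results."""
    parts = split_results(combined)
    for position, text in retried.items():
        parts[position - 1] = text
    return SEP.join(parts)
```

A caller would look up `failed_positions` on the stored "recognition result" text, request re-recognition for those positions only, and write the output of `merge_retry_results` back to the table. The utterance as a whole counts as recognized once `failed_positions` returns an empty list.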

図６に、音声認識に失敗した音声データのみ、再音声認識の依頼を行う場合の動作のフローを示す。図５のフローチャートのＳ２０２とＳ２０３の間にステップＳ２０６が追加されている。ステップＳ２０２で音声認識に失敗した発話の音声ファイルを取得した後、ステップＳ２０６では、音声認識に失敗した音声データのみを音声ファイルから切り出す。ステップＳ２０３では、切り出した音声データのみについて、音声認識を音声認識システムに依頼する。他のステップは図５と同様であるため、説明を省略する。 FIG. 6 shows the operation flow when re-recognition is requested only for the voice data whose recognition failed. Step S206 is added between S202 and S203 of the flowchart of FIG. 5. After the voice file of the utterance whose recognition failed is acquired in step S202, only the voice data whose recognition failed is cut out from the voice file in step S206. In step S203, the voice recognition system is requested to recognize only the cut-out voice data. The other steps are the same as in FIG. 5, so their description is omitted.
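
Step S206 can be pictured with the short Python sketch below. It assumes, purely for illustration, that the voice file has already been decoded into a raw byte string without a header and that the pieces were originally produced by splitting at a fixed byte size; the function name `cut_failed_pieces` and its parameters are hypothetical.

```python
def cut_failed_pieces(audio: bytes, piece_size: int, failed: list[int]) -> dict[int, bytes]:
    """Cut out only the fixed-size pieces at the given 1-based positions.

    `audio` is assumed to be headerless voice data; the last piece may be
    shorter than `piece_size`.
    """
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    pieces = {}
    for position in failed:
        start = (position - 1) * piece_size
        pieces[position] = audio[start : start + piece_size]
    return pieces
```

Keeping the original position as the dictionary key lets the caller put each re-recognition result back in the right place in the combined text.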

本実施形態のように、一部の音声データについてのみ、音声認識を行う際、音声認識システム２０１は、その前の成功した部分のテキストも参照したほうが、音声認識精度がよくなる可能性がある。そこで、その一部の音声データの１つもしくは複数前までの音声データを変換したテキストを、認識結果テーブルから読み出して、当該一部の音声データとともに、送信してもよい。 As in the present embodiment, when performing speech recognition only for a part of speech data, the speech recognition system 201 may improve speech recognition accuracy if it also refers to the text of the previous successful portion. Therefore, text obtained by converting one or more previous voice data of the partial voice data may be read from the recognition result table and transmitted together with the partial voice data.

または、当該一部の音声データの１つまたは複数前までの音声データを送信して、当該一部の音声データの音声認識精度を向上させてもよい。このように、本実施形態は、音声認識に失敗した音声データのみを送る場合、当該音声データとその１つまたは複数前までの音声データを送る場合、上述したような音声データすべてを送る場合のいずれも含む。つまり、音声認識に失敗した音声データを含むデータである限り、送信する音声データは任意である。当該音声認識に失敗した音声データの次の音声データを含めることも当然に可能である。 Alternatively, the voice data preceding the partial voice data, going back one or more pieces, may be transmitted as well in order to improve the recognition accuracy for the partial voice data. Thus, the present embodiment covers all of the following cases: sending only the voice data whose recognition failed, sending that voice data together with the one or more pieces preceding it, and sending all of the voice data as described above. In other words, the voice data to be transmitted is arbitrary as long as it includes the voice data whose recognition failed. It is of course also possible to include the voice data that follows the voice data whose recognition failed.
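
The choices described in the last two paragraphs (failed piece only, failed piece plus preceding pieces, optionally the following piece, or preceding text as context) can be sketched as below. The function names, the `before`/`after` parameters, and the `<failure>` marker are assumptions made for this illustration; the embodiment does not prescribe any particular interface.

```python
def positions_to_send(failed: int, total: int, before: int = 0, after: int = 0) -> list[int]:
    """Return the 1-based positions of the pieces to transmit for one failed piece."""
    if not 1 <= failed <= total:
        raise ValueError("failed position out of range")
    first = max(1, failed - before)
    last = min(total, failed + after)
    return list(range(first, last + 1))


def preceding_texts(results: list[str], failed: int, before: int,
                    fail_marker: str = "<failure>") -> list[str]:
    """Return already-recognized texts of up to `before` pieces preceding the failed one."""
    first = max(1, failed - before)
    return [text for text in results[first - 1 : failed - 1] if text != fail_marker]
```

With `before=0` and `after=0` only the failed piece is selected, and with large values every piece is selected, which corresponds to sending the whole utterance again.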

また、上述した説明では、各音声データの音声認識の成功または失敗が確定した後で、表示部１７の画面へのテキスト表示を行ったが、音声認識結果が取得されるごとに、順次、画面に表示してもよい。これにより、音声認識から結果表示までの時間をより短時間にすることができ、リアルタイム性を高めた表示が可能となる。 In the above description, the text is displayed on the screen of the display unit 17 after the success or failure of voice recognition has been determined for every piece of voice data; however, each voice recognition result may instead be displayed on the screen as soon as it is acquired. This shortens the time from voice recognition to result display and enables a more real-time display.

（第４の実施形態） (Fourth embodiment)
第１の実施形態では、音声認識に失敗した場合、録音した音声ファイルに基づき、音声認識システムに再音声認識を依頼する。しかしながら、一定回数、音声認識を再依頼しても、音声認識に成功しない場合もあり得る。また、最初に音声認識を依頼して失敗した後、定期的に再音声認識を行っているにもかかわらず、長い間、音声認識に成功しない場合もあり得る。このような場合は、音声データそのものに問題がある可能性があると考えられるため、音声認識をこれ以上、行わないようにしてもよい。以下、本実施形態について詳細に説明する。 In the first embodiment, when voice recognition fails, the voice recognition system is requested to perform re-voice recognition based on the recorded voice file. However, even if the voice recognition is requested again a certain number of times, the voice recognition may not succeed. In addition, there may be a case where the speech recognition is not successful for a long time even though the re-speech recognition is periodically performed after the first request for the speech recognition is unsuccessful. In such a case, it is considered that there is a possibility that the voice data itself has a problem, so that voice recognition may not be performed any more. Hereinafter, this embodiment will be described in detail.

制御部１６は、音声認識に失敗した発話について、音声認識の再依頼を行った回数が一定値に達したかを判断する。一定値に達した発話については、音声認識をこれ以上行わないように制御する。音声認識の依頼を行った回数を記憶するため、一例として、図４の認識結果テーブルに、「再音声認識回数」という列を別途設けてもよい。この列には、音声認識結果が＜失敗＞の発話について、音声ファイルによる音声認識の依頼を行った回数を格納する。 The control unit 16 determines whether or not the number of re-requests for speech recognition has reached a certain value for an utterance that has failed speech recognition. For utterances that reach a certain value, control is performed so that speech recognition is no longer performed. In order to store the number of requests for voice recognition, for example, a column of “number of times of re-recognition” may be separately provided in the recognition result table of FIG. This column stores the number of requests for speech recognition using speech files for utterances whose speech recognition result is <failure>.

制御部１６は、この回数が一定値に達した発話については、再音声認識を依頼しないように制御する。この場合、制御部１６は、この発話に関する音声認識はもはや行わないことをユーザに通知するメッセージを、表示部１７に表示してもよい。例えば、「規定の回数を超えて音声認識に失敗した。再度の音声認識は行わない。」のようなメッセージを表示してもよい。 The control unit 16 performs control so that re-speech recognition is not requested for an utterance in which the number of times reaches a certain value. In this case, the control unit 16 may display on the display unit 17 a message notifying the user that voice recognition regarding this utterance is no longer performed. For example, a message such as “speech recognition failed over the specified number of times. Voice recognition is not performed again” may be displayed.

同様に、制御部１６は、最初の発話日時から一定時間内に音声認識に成功しない場合は、音声認識の依頼を、これ以上行わないように制御する。例えば、図４の認識結果テーブルにおいて、認識結果が＜失敗＞である発話の中で、「発話日時」の値と、現在日時の間隔が、一定時間（例えば３時間）以上のものを検出する。検出した発話については、これ以上、音声認識を依頼しないように制御する。その場合、この発話に関する音声認識はもはや行わないことをユーザに通知するメッセージを、表示部１７に表示してもよい。例えば、「発話時から規定の時間を超えても音声認識に成功しない。再度の音声認識は行わない」といったメッセージを表示してもよい。ここでは、発話日時を一定時間の起点にしたが、任意の日時を起点にしてもよい。 Similarly, when voice recognition does not succeed within a certain time from the date and time of the original utterance, the control unit 16 performs control so that no further voice recognition requests are made. For example, among the utterances in the recognition result table of FIG. 4 whose recognition result is <failure>, it detects those for which the interval between the “utterance date and time” value and the current date and time is a certain time (for example, 3 hours) or more. For each detected utterance, control is performed so that voice recognition is no longer requested. In that case, a message notifying the user that voice recognition of this utterance will no longer be performed may be displayed on the display unit 17. For example, a message such as “Voice recognition has not succeeded within the specified time since the utterance. Voice recognition will not be attempted again.” may be displayed. Here the utterance date and time is used as the starting point of the time period, but any date and time may be used as the starting point.
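
The two stop conditions of this embodiment (a cap on the number of re-requests and a cap on the time elapsed since the utterance) can be combined in a single check, sketched below in Python. The 3-hour limit follows the example in the text; the retry cap of 5, the `Utterance` record, and the name `should_retry` are assumptions made only for this illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

FAIL = "<failure>"             # assumed failure marker in the "recognition result" column
MAX_RETRIES = 5                # assumed cap; the text only says "a certain value"
MAX_AGE = timedelta(hours=3)   # the text gives 3 hours as an example


@dataclass
class Utterance:
    spoken_at: datetime    # "utterance date and time" column
    result: str            # "recognition result" column
    retry_count: int = 0   # "number of re-recognitions" column


def should_retry(utterance: Utterance, now: datetime) -> bool:
    """Return True only if re-recognition should still be requested."""
    if utterance.result != FAIL:
        return False                       # already recognized
    if utterance.retry_count >= MAX_RETRIES:
        return False                       # re-request count reached the cap
    if now - utterance.spoken_at >= MAX_AGE:
        return False                       # too long since the utterance
    return True
```

When `should_retry` returns False for a failed utterance, the caller would show the corresponding notification message instead of sending another request.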

以上、本実施形態によれば、音声認識に成功する可能性が低いと考えられる発話については、これ以上音声認識を行わないようにすることで、音声処理装置１０１の処理負荷を低減できる。また、ユーザは音声認識の成功を待つ時間を短くできるため、ユーザの負荷を低減できる。また、ユーザが発話内容を忘れない内に、再度の発話を行うことが可能となる。 As described above, according to the present embodiment, it is possible to reduce the processing load of the voice processing apparatus 101 by preventing voice recognition from being performed any more for an utterance that is considered to be unlikely to succeed in voice recognition. In addition, since the user can shorten the time for waiting for the voice recognition to be successful, the load on the user can be reduced. In addition, the user can speak again without forgetting the utterance content.

（第５の実施形態） (Fifth embodiment)
第１〜第４の実施形態では、音声処理装置が音声認識システムに音声認識の依頼を行ったが、本実施形態では、音声処理装置と音声認識システムとの間に配置した管理サーバが代理で行う。これにより、音声処理装置の負荷を下げるとともに、音声処理装置の記憶領域を節約する。 In the first to fourth embodiments, the voice processing device requests voice recognition from the voice recognition system. In the present embodiment, a management server placed between the voice processing device and the voice recognition system makes the requests on its behalf. This lowers the load on the voice processing device and saves its storage area.

図７は、本実施形態に係るシステムの全体構成図である。 FIG. 7 is an overall configuration diagram of a system according to the present embodiment.

音声処理装置４０１と、音声ファイル管理サーバ５０１（以下、管理サーバ５０１と呼ぶ）と、音声認識システム２０１が示される。これらは互いにネットワークを介して接続されている。ネットワークは、無線、有線またはこれらのハイブリッドのネットワークである。音声処理装置４０１と管理サーバ５０１間のネットワークと、管理サーバ５０１と音声認識システム２０１間のネットワークは、互いに異なるネットワークであっても、同じネットワークであってもよい。 A voice processing device 401, a voice file management server 501 (hereinafter referred to as a management server 501), and a voice recognition system 201 are shown. These are connected to each other via a network. The network is a wireless network, a wired network, or a hybrid network thereof. The network between the voice processing device 401 and the management server 501 and the network between the management server 501 and the voice recognition system 201 may be different networks or the same network.

音声認識システム２０１は、第１〜第４の実施形態に係る音声認識システム２０１と同様であるため説明を省略する。 Since the voice recognition system 201 is the same as the voice recognition system 201 according to the first to fourth embodiments, the description thereof is omitted.

音声処理装置４０１の機能ブロック図は、図１と同じものを用いることができる。ただし、各ブロックの動作は一部、変更または拡張されている。以下では、第１の実施形態との差分を中心に説明を行う。 The functional block diagram of the audio processing device 401 can be the same as that shown in FIG. However, the operation of each block is partially changed or expanded. Below, it demonstrates focusing on the difference with 1st Embodiment.

録音部１１は、第１の実施形態と同様、一定サイズごとに取得した音声データを結合して音声ファイルを生成し、ファイル記憶部１２に記憶する。ただし、各音声データは送信部１３へは送らない。つまり、各音声データの音声認識依頼は、音声認識システム２０１へ送信しない。 As in the first embodiment, the recording unit 11 combines the audio data acquired for each fixed size to generate an audio file, and stores the audio file in the file storage unit 12. However, each audio data is not sent to the transmission unit 13. That is, the voice recognition request for each voice data is not transmitted to the voice recognition system 201.

制御部１６は、ファイル記憶部１２から音声ファイルを読み出し、送信部１３へ渡す。送信部１３は、音声ファイルを管理サーバ５０１に送信する。制御部１６は、音声ファイルの送信が成功した時点で、音声ファイルをファイル記憶部１２から削除してもよい。これにより、データ記憶領域を節約できる。制御部１６は、音声ファイルの送信後、または音声ファイルの作成後、ユーザに同じ内容の発話を再度行う必要はないことを通知するメッセージを、表示部１７に出力してもよい。 The control unit 16 reads the audio file from the file storage unit 12 and passes it to the transmission unit 13. The transmission unit 13 transmits the audio file to the management server 501. The control unit 16 may delete the audio file from the file storage unit 12 once the audio file has been transmitted successfully. This saves data storage area. After the audio file has been transmitted, or after it has been created, the control unit 16 may output to the display unit 17 a message notifying the user that the same content does not need to be uttered again.

管理サーバ５０１は、音声処理装置４０１から音声ファイルを受信する。管理サーバ５０１は、音声処理装置４０１と同様に、サーバ側のファイル記憶装置および認識結果記憶装置を備える。管理サーバ５０１は、受信した音声ファイルを、サーバ側ファイル記憶装置に格納する。管理サーバ５０１は、ファイル記憶装置内の音声ファイルに基づき、音声認識システム２０１に、音声ファイルに含まれる音声データについて、音声認識依頼を送信する。管理サーバ５０１は、音声認識システム２０１から、音声認識結果を含む音声認識応答を受信する。音声認識結果は、音声認識に成功した場合は、音声データを音声認識により変換したテキストを含み、音声認識に失敗した場合は、音声認識に失敗したことを示す情報を含む。音声認識に成功した場合に、成功を示す情報が音声認識結果に追加で含まれても良い。管理サーバ５０１は、受信した音声認識応答に含まれる音声認識結果に基づき、第１の実施形態と同様、サーバ側認識結果記憶部１５に、図４に示した形式の認識結果テーブルを生成してもよい。 The management server 501 receives the audio file from the voice processing device 401. Like the voice processing device 401, the management server 501 includes a server-side file storage device and a server-side recognition result storage device. The management server 501 stores the received audio file in the server-side file storage device. Based on the audio file in the file storage device, the management server 501 transmits to the voice recognition system 201 a voice recognition request for the voice data contained in the audio file. The management server 501 receives from the voice recognition system 201 a voice recognition response including a voice recognition result. When voice recognition succeeds, the voice recognition result includes the text obtained by converting the voice data through voice recognition; when voice recognition fails, it includes information indicating that voice recognition failed. When voice recognition succeeds, information indicating success may additionally be included in the voice recognition result. Based on the voice recognition result included in the received voice recognition response, the management server 501 may, as in the first embodiment, generate a recognition result table in the format shown in FIG. 4 in the server-side recognition result storage unit 15.

音声認識の依頼は、第１の実施形態と同様に、音声ファイル内の音声データ全体に対して一括して行っても良いし、音声データ全体を、一定サイズごとに分割して行っても良い。また、音声認識システム２０１が音声ファイルそのものに対応している場合は、音声ファイルそのものを送信してもよい。音声ファイルが、音声認識システム２０１が対応しない形式に圧縮されている場合は、音声認識システム２０１が対応可能な形式に復号するものとする。 As in the first embodiment, voice recognition may be requested for the entire voice data in the audio file at once, or the entire voice data may be divided into pieces of a fixed size and requested piece by piece. If the voice recognition system 201 accepts audio files as they are, the audio file itself may be transmitted. If the audio file is compressed in a format that the voice recognition system 201 does not support, it is decoded into a format that the voice recognition system 201 can handle.

管理サーバ５０１は、第１〜第４実施形態の音声処理装置１０１の動作と同様に、音声認識に失敗した発話について、再音声認識依頼の制御を行う。管理サーバ５０１は、一定の時間間隔で、認識結果記憶部１５に保持されたテーブルに基づき、音声認識に失敗した発話をチェックする。管理サーバ５０１は、音声認識に失敗した発話に対応する音声ファイルを取り出し、音声ファイル内の音声データに対する音声認識依頼を、音声認識システム２０１に再度送信する。音声認識システム２０１から、音声認識結果を含む音声認識応答を受信する。音声認識応答に含まれる音声認識結果の内容に応じて、認識結果テーブルを更新する。 Similar to the operation of the speech processing apparatus 101 of the first to fourth embodiments, the management server 501 controls a re-speech recognition request for an utterance that has failed speech recognition. The management server 501 checks utterances that have failed in voice recognition based on a table held in the recognition result storage unit 15 at regular time intervals. The management server 501 extracts a voice file corresponding to an utterance in which voice recognition has failed, and transmits a voice recognition request for voice data in the voice file to the voice recognition system 201 again. A voice recognition response including a voice recognition result is received from the voice recognition system 201. The recognition result table is updated according to the content of the speech recognition result included in the speech recognition response.
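
One pass of the periodic check performed by the management server can be sketched as follows. The table is modeled as a list of dictionaries, and `load_audio` and `recognize` are hypothetical stand-ins for reading the stored voice file and for the exchange with the voice recognition system 201; the key names and the `<failure>` marker are likewise assumptions, not taken from the embodiment.

```python
from typing import Callable, Optional

FAIL = "<failure>"  # assumed failure marker in the "recognition result" column


def retry_failed_utterances(
    table: list[dict],
    load_audio: Callable[[str], bytes],
    recognize: Callable[[bytes], Optional[str]],
) -> int:
    """Re-request recognition for every failed entry and update the table in place.

    `recognize` is assumed to return the recognized text, or None on failure.
    Returns the number of entries that became successful in this pass.
    """
    succeeded = 0
    for entry in table:
        if entry["result"] != FAIL:
            continue                      # only failed utterances are retried
        audio = load_audio(entry["voice_file"])
        text = recognize(audio)
        if text is not None:
            entry["result"] = text        # overwrite the failure marker
            succeeded += 1
    return succeeded
```

A scheduler on the server would call this function at a fixed interval and, for each entry that changed, send the success notification described in the following paragraphs.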

管理サーバ５０１は、音声ファイルの音声認識が成功した場合、音声処理装置４０１に音声認識に成功した旨のメッセージを通知してもよい。管理サーバ５０１は、予め音声処理装置４０１のユーザとメールアドレスの対応表を管理している。管理サーバ５０１は、この対応表に基づき、該当する音声処理装置４０１のユーザのメールアドレスを特定し、特定したアドレス宛に、成功のメッセージを送信する。成功のメッセージには、音声認識されたテキスト（発話テキスト）を追加してもよいし、発話テキストを、添付ファイルとして送信してもよい。あるいは、成功のメッセージは、メールではなく、アプリケーションの画面にプッシュ表示する形で、送信してもよい。この場合、発話テキストも同時にアプリケーションの画面に表示してもよい。発話テキストは、アプリケーション画面上の成功のメッセージを確認したユーザの端末から、送信要求を受けて送信してもよい。 When the voice recognition of the voice file is successful, the management server 501 may notify the voice processing device 401 of a message indicating that the voice recognition is successful. The management server 501 manages a correspondence table between users of the voice processing device 401 and mail addresses in advance. The management server 501 identifies the mail address of the user of the corresponding voice processing device 401 based on the correspondence table, and transmits a success message to the identified address. A speech-recognized text (utterance text) may be added to the success message, or the speech text may be transmitted as an attached file. Alternatively, the success message may be sent in the form of being pushed on the screen of the application instead of mail. In this case, the utterance text may be simultaneously displayed on the screen of the application. The utterance text may be transmitted in response to a transmission request from the terminal of the user who confirmed the success message on the application screen.

管理サーバ５０１は、初回またはそれ以降の音声認識に失敗した場合は、失敗した旨のメッセージを音声処理装置４０１に送信してもよい。この際、再度、音声認識の依頼を自発的に行う旨のメッセージを送信してもよい。第４の実施形態と同様、一定回数、音声認識に失敗した場合や、一定時間内に音声認識が成功しなかった場合は、音声認識が不可能である旨のメッセージを送信してもよい。 The management server 501 may transmit a message indicating failure to the voice processing device 401 when the first or subsequent voice recognition fails. At this time, a message indicating that a request for speech recognition is voluntarily performed may be transmitted again. As in the fourth embodiment, a message indicating that speech recognition is impossible may be transmitted when speech recognition fails a certain number of times or when speech recognition is not successful within a certain time.

音声処理装置４０１の受信部１４は、管理サーバ５０１から成功のメッセージと発話テキストを受信した場合、成功のメッセージを表示部１７に表示し、また発話テキストを、成功のメッセージと同じ、または別の画面で、表示部１７に表示する。受信した発話テキストは、認識結果記憶部に格納する。この際、図４の認識結果テーブルに準じた形式で、発話日時とともに格納してもよい。ファイル記憶部１２から音声ファイルを消去する構成の場合は、「音声ファイル」列は、削除してもよい。 When receiving the success message and the utterance text from the management server 501, the reception unit 14 of the speech processing device 401 displays the success message on the display unit 17, and the utterance text is the same as or different from the success message. The image is displayed on the display unit 17 on the screen. The received utterance text is stored in the recognition result storage unit. At this time, it may be stored together with the utterance date and time in a format according to the recognition result table of FIG. In the case where the audio file is deleted from the file storage unit 12, the “audio file” column may be deleted.

受信部１４が、管理サーバ５０１から失敗のメッセージを受信した場合は、失敗のメッセージを表示部１７に表示する。この場合、制御部１６は、失敗を示す情報を「認識結果」列に記録してもよい。また、受信部１４は、音声認識の依頼を自発的に行う旨のメッセージ、または音声認識が不可能の旨のメッセージを受信した場合は、当該メッセージを表示部１７に表示する。 When the reception unit 14 receives a failure message from the management server 501, the reception unit 14 displays the failure message on the display unit 17. In this case, the control unit 16 may record information indicating failure in the “recognition result” column. In addition, when receiving a message indicating that a request for voice recognition is voluntarily performed or a message indicating that voice recognition is not possible, the receiving unit 14 displays the message on the display unit 17.

音声処理装置４０１の制御部１６は、定期的に管理サーバ５０１に、音声ファイルの音声認識に成功したか否かを、問い合わせてもよい。問い合わせ対象となる発話は、まだ成功のメッセージを受信してない発話である。失敗のメッセージを受信した後、成功のメッセージをまだ受信していない発話も、これに相当する。この場合、管理サーバ５０１は、当該音声ファイルの音声認識に成功していれば、その旨のメッセージを音声処理装置４０１に送信してもよい。当該音声ファイルの音声認識に、まだ成功していなければ、再音声認識中の旨のメッセージを、音声処理装置４０１に送信してもよい。 The control unit 16 of the voice processing device 401 may periodically inquire the management server 501 as to whether or not voice recognition of the voice file has succeeded. The utterance to be inquired is an utterance that has not yet received a success message. An utterance that has not received a success message after receiving a failure message also corresponds to this. In this case, the management server 501 may send a message to that effect to the voice processing device 401 if the voice recognition of the voice file is successful. If voice recognition of the voice file has not been successful, a message indicating that re-voice recognition is in progress may be sent to the voice processing device 401.

以上に述べた管理サーバ５０１の動作は、一例であり、第１〜第４の実施形態における音声処理装置４０１の各種動作を組み合わせることも可能である。例えば第３の実施形態で示したように、発話の一部の音声データのみ音声認識が成功した場合は、成功した部分のテキストと、失敗を示すテキストを結合したテキストを、音声処理装置４０１に送信することも可能である。また、再音声認識依頼の制御の際も、失敗した音声データのみを切り出して、音声認識システム２０１に、再音声認識を依頼してもよい。 The operation of the management server 501 described above is an example, and various operations of the voice processing device 401 in the first to fourth embodiments can be combined. For example, as shown in the third embodiment, when the speech recognition is successful only for a part of the speech data of the utterance, the text obtained by combining the text of the successful part and the text indicating the failure is sent to the speech processing device 401. It is also possible to transmit. Also, when controlling the re-speech recognition request, only the failed speech data may be cut out and the re-speech recognition may be requested to the speech recognition system 201.

（第６の実施形態） (Sixth embodiment)
第５の実施形態では、管理サーバから音声認識システムへ音声認識の依頼および再依頼を行ったが、本実施形態では、音声認識の依頼は音声処理装置が行い、音声認識の再依頼は管理サーバが行う形態を示す。 In the fifth embodiment, the management server makes both the voice recognition requests and the re-requests to the voice recognition system. The present embodiment describes a configuration in which the voice processing device makes the voice recognition request and the management server makes the re-requests.

図８は、本実施形態に係るシステムの全体構成図である。音声処理装置６０１と、音声ファイル管理サーバ７０１（以下、管理サーバ７０１と呼ぶ）と、音声認識システム２０１が示される。これらは互いにネットワークを介して接続されている。ネットワークは、無線、有線またはこれらのハイブリッドのネットワークである。音声処理装置６０１と音声認識システム２０１間のネットワークと、音声処理装置６０１と管理サーバ７０１間のネットワークと、管理サーバ７０１と音声認識システム２０１間のネットワークとは、互いに異なるネットワークであっても、同じネットワークであってもよい。 FIG. 8 is an overall configuration diagram of a system according to the present embodiment. A voice processing device 601, a voice file management server 701 (hereinafter referred to as the management server 701), and a voice recognition system 201 are shown. These are connected to one another via a network. The network is a wireless network, a wired network, or a hybrid of the two. The network between the voice processing device 601 and the voice recognition system 201, the network between the voice processing device 601 and the management server 701, and the network between the management server 701 and the voice recognition system 201 may be different networks or the same network.

音声認識システム２０１は、第１〜第５の実施形態に係る音声認識システム２０１と同様である。また音声処理装置６０１の機能ブロック図は、図１と同じものを用いることができる。ただし、各ブロックの動作は一部、変更または拡張されている。以下では、第１および第５の実施形態との差分を中心に説明を行う。 The voice recognition system 201 is the same as the voice recognition system 201 according to the first to fifth embodiments. Further, the functional block diagram of the voice processing device 601 can be the same as that shown in FIG. However, the operation of each block is partially changed or expanded. Below, it demonstrates focusing on the difference with 1st and 5th embodiment.

音声処理装置６０１は、第１の実施形態と同様に、図２および図３のステップＳ１０１〜Ｓ１１３の処理を行う。すなわち、録音部１１で音声データが取得されるごとに、当該音声データを、送信部１３を介して音声認識システム２０１に送信し、音声認識システム２０１から音声認識結果を取得する。音声認識システム２０１に送信したすべての音声データに対する音声認識結果を取得したら、今回の発話に対する音声認識が成功または失敗したかを判断する。成功の場合は、各音声認識結果に含まれるテキストを結合した発話テキストを画面に表示し、失敗の場合は、失敗のメッセージ等を画面に表示する。第１の実施形態では、この後、認識結果テーブル（図４参照）にエントリーを追加したが、本実施形態では、この代わりに、管理サーバ７０１に、発話日時と、音声認識結果と、音声ファイルとを一組の情報として送信する。つまり、図４の１エントリーに相当する情報（ただしファイルパスではなく、音声ファイル本体）を管理サーバ７０１に送信する。音声処理装置６０１は管理サーバ７０１への上記情報の送信に成功したら、音声処理装置６０１内の音声ファイルを削除してもよい。なお、音声処理装置６０１は、発話日時と認識結果のエントリーを認識結果テーブルに追加してもよい。 The audio processing device 601 performs the processing of steps S101 to S113 in FIGS. 2 and 3 as in the first embodiment. That is, every time voice data is acquired by the recording unit 11, the voice data is transmitted to the voice recognition system 201 via the transmission unit 13, and a voice recognition result is acquired from the voice recognition system 201. When the voice recognition results for all the voice data transmitted to the voice recognition system 201 are acquired, it is determined whether the voice recognition for the current utterance has succeeded or failed. In the case of success, the utterance text obtained by combining the texts included in each speech recognition result is displayed on the screen, and in the case of failure, a failure message or the like is displayed on the screen. In the first embodiment, after that, an entry is added to the recognition result table (see FIG. 4). However, in this embodiment, instead of this, the utterance date and time, the voice recognition result, and the voice file are added to the management server 701. As a set of information. That is, information corresponding to one entry in FIG. 4 (however, not a file path but an audio file body) is transmitted to the management server 701. 
If the voice processing apparatus 601 successfully transmits the information to the management server 701, the voice processing apparatus 601 may delete the voice file in the voice processing apparatus 601. Note that the voice processing device 601 may add the utterance date and the entry of the recognition result to the recognition result table.

管理サーバ７０１は、上記情報（発話日時と、音声認識結果と、音声ファイル）に含まれる音声ファイルを、サーバ側のファイル記憶部に格納し、格納した位置を特定するファイルパスを取得する。また、管理サーバ７０１は、サーバ側の認識結果記憶部内に保持する認識結果テーブルに、上記情報に含まれる発話日時および音声認識結果と、取得したファイルパスとを含むエントリーを追加する。 The management server 701 stores a voice file included in the above information (speech date and time, voice recognition result, and voice file) in a file storage unit on the server side, and acquires a file path for specifying the stored position. In addition, the management server 701 adds an entry including the utterance date and time and the voice recognition result included in the information and the acquired file path to the recognition result table held in the recognition result storage unit on the server side.
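
The server-side registration described above (store the received voice file, obtain its path, add a table entry) might look like the following Python sketch. The directory layout, the file naming scheme, the dictionary keys, and the function name `register_utterance` are all assumptions made for illustration.

```python
from pathlib import Path


def register_utterance(storage_dir: Path, table: list[dict],
                       spoken_at: str, result: str, audio: bytes) -> dict:
    """Store the received voice file and add the matching recognition result entry.

    The entry holds the path of the stored file rather than the file body,
    mirroring the "voice file" column of the recognition result table.
    """
    storage_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a sequence number based on the table size.
    path = storage_dir / f"utterance_{len(table) + 1:06d}.wav"
    path.write_bytes(audio)
    entry = {"spoken_at": spoken_at, "result": result, "voice_file": str(path)}
    table.append(entry)
    return entry
```

Because the entry stores only the path, the periodic re-recognition check can later reload the voice data for any entry whose result is still a failure.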

管理サーバ７０１は、認識結果テーブルの「認識結果」列に基づき、音声認識に失敗した発話について、再音声認識依頼の制御を行う。すなわち、管理サーバ７０１は、一定の時間間隔で、認識結果記憶部に保持されたテーブルに基づき、音声認識に失敗した発話をチェックする。管理サーバ７０１は、音声認識に失敗した発話に対応する音声ファイルを取り出し、音声ファイル内の音声データに対する音声認識依頼を、音声認識システム２０１に再度送信する。音声認識システム２０１から、音声認識結果を含む音声認識応答を受信する。音声認識応答に含まれる音声認識結果の内容に応じて、認識結果テーブルを更新する。 The management server 701 controls a re-speech recognition request for an utterance that has failed speech recognition based on the “recognition result” column of the recognition result table. That is, the management server 701 checks utterances that have failed in voice recognition based on a table held in the recognition result storage unit at regular time intervals. The management server 701 extracts a voice file corresponding to the utterance in which voice recognition has failed, and transmits a voice recognition request for the voice data in the voice file to the voice recognition system 201 again. A voice recognition response including a voice recognition result is received from the voice recognition system 201. The recognition result table is updated according to the content of the speech recognition result included in the speech recognition response.

管理サーバ７０１は、音声ファイルの音声認識が成功した場合、音声処理装置６０１に音声認識に成功した旨のメッセージを通知してもよい。成功のメッセージには、音声認識されたテキスト（発話テキスト）を追加してもよいし、発話テキストを、添付ファイルとして送信してもよい。あるいは、成功のメッセージは、メールではなく、アプリケーションの画面にプッシュ表示する形で、送信してもよい。この場合、発話テキストも同時にアプリケーションの画面に表示してもよい。発話テキストは、アプリケーション画面上の成功のメッセージを確認したユーザの端末から、送信要求を受けて送信してもよい。 When the voice recognition of the voice file is successful, the management server 701 may notify the voice processing device 601 of a message indicating that the voice recognition is successful. A speech-recognized text (utterance text) may be added to the success message, or the speech text may be transmitted as an attached file. Alternatively, the success message may be sent in the form of being pushed on the screen of the application instead of mail. In this case, the utterance text may be simultaneously displayed on the screen of the application. The utterance text may be transmitted in response to a transmission request from the terminal of the user who confirmed the success message on the application screen.

管理サーバ７０１は、音声認識が失敗した場合は、失敗した旨のメッセージを音声処理装置６０１に送信してもよい。この際、再度、音声認識の依頼を自発的に行う旨のメッセージを送信してもよい。第４の実施形態と同様、一定回数、音声認識に失敗した場合や、一定時間内に音声認識が成功しなかった場合は、音声認識が不可能である旨のメッセージを送信してもよい。 When the voice recognition fails, the management server 701 may transmit a message indicating the failure to the voice processing device 601. At this time, a message indicating that a request for speech recognition is voluntarily performed may be transmitted again. As in the fourth embodiment, a message indicating that speech recognition is impossible may be transmitted when speech recognition fails a certain number of times or when speech recognition is not successful within a certain time.

音声処理装置６０１の受信部１４は、管理サーバ７０１から成功のメッセージと発話テキストを受信した場合、成功のメッセージを表示部１７に表示し、また発話テキストを、成功のメッセージと同じ、または別の画面で、表示部１７に表示する。受信した発話テキストは、認識結果記憶部１５に格納する。この際、図４の認識結果テーブルに準じた形式で、発話日時とともに発話テキストを格納してもよい。ファイル記憶部１２から音声ファイルを消去する構成の場合は、「音声ファイル」列は、削除してもよい。 When receiving the success message and the utterance text from the management server 701, the reception unit 14 of the voice processing device 601 displays the success message on the display unit 17, and the utterance text is the same as or different from the success message. The image is displayed on the display unit 17 on the screen. The received utterance text is stored in the recognition result storage unit 15. At this time, the utterance text may be stored together with the utterance date and time in a format according to the recognition result table of FIG. In the case where the audio file is deleted from the file storage unit 12, the “audio file” column may be deleted.

受信部１４が、管理サーバ７０１から失敗のメッセージを受信した場合は、失敗のメッセージを表示部１７に表示する。この場合、制御部１６は、失敗を示す情報を「認識結果」列に記録してもよい。また、受信部１４は、音声認識の再依頼を自発的に行う旨のメッセージ、または音声認識が不可能の旨のメッセージを受信した場合は、当該メッセージを表示部１７に表示する。 When the reception unit 14 receives a failure message from the management server 701, the reception unit 14 displays the failure message on the display unit 17. In this case, the control unit 16 may record information indicating failure in the “recognition result” column. In addition, when receiving a message indicating that a re-request for speech recognition is voluntarily performed or a message indicating that speech recognition is not possible, the receiving unit 14 displays the message on the display unit 17.

また、音声処理装置６０１の制御部１６は、定期的に管理サーバ７０１に、音声ファイルの音声認識に成功したか否かを、問い合わせてもよい。問い合わせ対象となる発話は、まだ成功のメッセージを受信してない発話である。音声認識の失敗のメッセージを受信した後、成功のメッセージをまだ受信していない発話も、これに相当する。この場合、管理サーバ７０１は、当該音声ファイルの音声認識に成功していれば、その旨のメッセージを音声処理装置６０１に送信してもよい。当該音声ファイルの音声認識に、まだ成功していなければ、音声認識中の旨のメッセージを、音声処理装置６０１に送信してもよい。 Further, the control unit 16 of the voice processing device 601 may periodically inquire the management server 701 as to whether or not voice recognition of the voice file has succeeded. The utterance to be inquired is an utterance that has not yet received a success message. An utterance that has not yet received a success message after receiving a voice recognition failure message also corresponds to this. In this case, the management server 701 may send a message to that effect to the voice processing device 601 if the voice recognition of the voice file is successful. If the voice recognition of the voice file is not yet successful, a message indicating that voice recognition is in progress may be transmitted to the voice processing device 601.

As described above, according to this embodiment, the initial voice recognition request is made by the voice processing device, while re-requests for voice recognition are made by the management server. This shortens the time from the voice recognition request to the display of the result, and at the same time eliminates or reduces the processing load caused by re-recognition requests.

The voice processing devices and management servers of the first to sixth embodiments can also be realized by using, for example, a general-purpose computer as the basic hardware. That is, the processing of the blocks included in the voice processing device and the management server can be realized by causing a processor mounted in the computer to execute a program. In this case, the voice processing device and the management server may be realized by installing the program on the computer in advance, or by storing the program on a storage medium such as a CD-ROM, or distributing it over a network, and installing it on the computer as appropriate. The storage means included in the voice processing device and the management server can be realized by using, as appropriate, a memory, hard disk, or SSD built into or externally attached to the computer, or a storage medium such as a CD-R, CD-RW, DVD-RAM, or DVD-R.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

11: recording unit
12: file storage unit
13: transmission unit
14: reception unit
15: recognition result storage unit
16: control unit
17: display unit
18: input unit
101, 401: voice processing device
301: network
201: voice recognition system

Claims

A voice processing device comprising:
an acquisition unit that sequentially acquires voice data representing content uttered by a user;
a transmission unit that transmits, to a voice recognition system, a voice recognition request for the voice data acquired by the acquisition unit;
a storage unit that stores the voice data acquired by the acquisition unit;
a reception unit that receives, from the voice recognition system, a voice recognition response including either text obtained by converting the voice data through voice recognition or information indicating that voice recognition of the voice data failed; and
a control unit that identifies, based on the voice recognition response, the voice data for which voice recognition failed, and performs control so that a voice recognition request for data including that voice data is transmitted to the voice recognition system based on the voice data stored in the storage unit.
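For illustration only, the interplay of acquisition, storage, and re-request can be sketched as below. The function `recognize_with_resend` and the `Recognizer` callback type are hypothetical; the sketch assumes a recognizer that returns text on success and `None` on failure, and it makes a single re-request per failed piece of voice data, which is a simplification.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical recognizer: returns text, or None when recognition fails.
Recognizer = Callable[[bytes], Optional[str]]


def recognize_with_resend(
    chunks: List[bytes], recognize: Recognizer
) -> Dict[int, Optional[str]]:
    """Recognize sequentially acquired voice data, re-requesting failures from storage."""
    stored: Dict[int, bytes] = {}          # stands in for the storage unit
    texts: Dict[int, Optional[str]] = {}

    for index, chunk in enumerate(chunks):  # sequential acquisition
        stored[index] = chunk               # keep a copy for a possible re-request
        texts[index] = recognize(chunk)     # initial voice recognition request

    # Identify the voice data whose recognition failed, then re-request it
    # from the stored copy, so the user does not have to speak again.
    failed = [index for index, text in texts.items() if text is None]
    for index in failed:
        texts[index] = recognize(stored[index])

    return texts
```

The point of the stored copy is that a failure partway through an utterance only costs one more request for the affected data, not a repeat of the whole utterance.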

The voice processing device according to claim 1, wherein the control unit determines success or failure of voice recognition for the user's utterance based on the voice recognition response and, if failure is determined, performs control so as to display a message notifying the user that a re-request for voice recognition is being made to the voice recognition system.

The voice processing device according to claim 1 or 2, wherein the control unit performs control so as to display, after the acquisition unit completes acquisition of all voice data for the user's utterance, a message notifying the user that it is not necessary to make the same utterance again.

The voice processing device according to any one of claims 1 to 3, wherein, when the voice recognition response for at least one piece of voice data has not been received and no voice recognition response including information indicating the failure has been received, the control unit performs control so as to display a message notifying the user that voice recognition is in progress.

The voice processing device according to any one of claims 1 to 4, wherein the control unit performs control so that the text included in the voice recognition response, for voice data whose voice recognition has succeeded, and text indicating that voice recognition has not been completed, for voice data whose voice recognition has not yet succeeded, are arranged and displayed in the order in which the acquisition unit acquired the voice data.
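As a rough illustration of this ordered display, the sketch below joins recognized text and a placeholder in acquisition order. The `render` function and the `PENDING` placeholder string are hypothetical and chosen only for the example.

```python
from typing import Dict, List, Optional

PENDING = "[recognizing...]"  # hypothetical placeholder for not-yet-recognized voice data


def render(order: List[int], texts: Dict[int, Optional[str]]) -> str:
    """Arrange recognized text and placeholders in the order the voice data was acquired."""
    parts = []
    for index in order:
        text = texts.get(index)
        parts.append(text if text is not None else PENDING)
    return " ".join(parts)
```

For example, `render([0, 1, 2], {0: "good", 2: "morning"})` yields `"good [recognizing...] morning"`, so the user can see which part of the utterance is still outstanding.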

The speech processing apparatus according to any one of claims 1 to 5, wherein the transmission unit transmits the speech recognition request to the speech recognition system via a network.

The voice processing device according to any one of claims 1 to 6, wherein
the acquisition unit generates a voice file by combining the voice data it has acquired in the order of acquisition,
the storage unit stores the voice file,
the control unit generates correspondence information that associates the date and time of the user's utterance, the voice recognition result for the user's utterance, and the location at which the voice file is stored in the storage unit, and
the voice recognition result is, when voice recognition of all of the voice data has succeeded, utterance text in which the text corresponding to each piece of voice data is arranged in the order of acquisition of the voice data, and is, when voice recognition of at least one piece of voice data has not succeeded, information indicating that voice recognition has not yet been completed.

The voice processing device according to any one of claims 1 to 7, wherein the control unit performs control so as to display a message notifying the user that voice recognition for the user's utterance will no longer be performed, if voice recognition does not succeed even after the voice recognition request for data including the voice data for which voice recognition failed has been made a predetermined number of times, or if re-recognition of data including the voice data for which voice recognition failed does not succeed within a predetermined period from a predetermined point in time.
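The two stopping conditions, a maximum number of requests and a time limit, can be sketched as follows. The function `retry_until_limit`, its parameters, and the injectable `now` clock are assumptions made for the example; the claim does not prescribe specific values or a particular clock.

```python
import time
from typing import Callable, Optional


def retry_until_limit(
    recognize: Callable[[bytes], Optional[str]],
    data: bytes,
    max_attempts: int,
    deadline: float,
    now: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Re-request recognition until success, the attempt limit, or the deadline.

    Returns the recognized text, or None when either limit is reached, in which
    case the caller would display a message that recognition is abandoned.
    """
    attempts = 0
    while attempts < max_attempts and now() < deadline:
        attempts += 1
        text = recognize(data)
        if text is not None:
            return text
    return None
```

Passing the clock as a parameter is only a convenience for testing the time limit without waiting; `time.monotonic` is used by default so that wall-clock adjustments do not affect the deadline.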

A voice processing device comprising:
an acquisition unit that sequentially acquires voice data representing content uttered by a user and generates a voice file by combining the acquired voice data in the order of acquisition;
a transmission unit that transmits the voice file to a management server that communicates with a voice recognition system via a network;
a reception unit that receives, from the management server, text obtained by converting the voice data included in the voice file through voice recognition; and
a control unit that performs control so as to display, after the voice file is generated or after transmission of the voice file to the management server is completed, a message notifying the user that it is not necessary to make the same utterance again.

A voice processing method comprising:
an acquisition step of sequentially acquiring voice data representing content uttered by a user;
a transmission step of transmitting, to a voice recognition system, a voice recognition request for the voice data acquired in the acquisition step;
a storage step of storing the voice data acquired in the acquisition step in a storage device;
a reception step of receiving, from the voice recognition system, a voice recognition response including either text obtained by converting the voice data through voice recognition or information indicating that voice recognition of the voice data failed; and
a control step of identifying, based on the voice recognition response, the voice data for which voice recognition failed, and performing control so that a voice recognition request for data including that voice data is transmitted to the voice recognition system based on the voice data stored in the storage device.