JP7323210B2

JP7323210B2 - VOICE RECOGNITION DISPLAY DEVICE, VOICE RECOGNITION DISPLAY METHOD AND PROGRAM

Info

Publication number: JP7323210B2
Application number: JP2021084660A
Authority: JP
Inventors: 和基小島
Original assignee: NEC Platforms Ltd
Current assignee: NEC Platforms Ltd
Priority date: 2021-05-19
Filing date: 2021-05-19
Publication date: 2023-08-08
Anticipated expiration: 2041-05-19
Also published as: JP2022178110A

Description

The present invention relates to a speech recognition display device, a speech recognition display method, and a program.

Patent Document 1 discloses a speech recognition system that divides a voice file at predetermined time intervals before converting it to text, in order to shorten the text conversion time. Patent Document 1 also describes that, in the voice data used to generate the voice file, the voice data is segmented when the silent time between utterances is equal to or greater than a predetermined threshold, and the speaker is identified from the voice data immediately preceding the silent time.

Patent Document 2 discloses an information processing system that divides voice data, performs speech recognition on it, and displays the resulting text data as text on a screen. In Patent Document 2, the output timing of the divided voice data corresponding to a piece of text data is delayed relative to the output timing of that text data by the period required for the speech recognition result of the voice data to be finalized. This allows an operator to correct the output text data manually while listening to the voice data that corresponds to it.

Patent Document 1: JP 2020-60735 A
Patent Document 2: JP 2019-185005 A

In recent years, speech recognition has begun to be introduced into emergency call systems such as 119 calls to the fire department and 110 calls to the police. In such emergency call systems, where decisions must be made within seconds, a dispatcher other than the call taker listens to (monitors) the call between the caller and the call taker and issues dispatch instructions for emergency vehicles such as fire engines and ambulances. If the content of the call between the caller and the call taker is converted to text in real time, the dispatcher can grasp the content of the report immediately and issue the dispatch instructions.

A real-time speech recognition system is required to keep the time spent on text conversion short so that real-time performance is not impaired. There are two ways to divide the voice data to be converted in order to shorten the text conversion time: dividing it at a predetermined time, and dividing it at silent intervals in the voice data.

In the method of dividing the voice data at a predetermined time, the voice data is divided at a fixed time determined in advance. In this case, the voice data may be cut in the middle of a voiced portion (a sentence or a word), so that the sentence or word is not converted to text correctly. Even if the texts recognized from the individual fragments are joined together, the result may not make sense as a single sentence.

When silent intervals in the input voice data are detected in order to divide the voice data into meaningful units (sentence units), a threshold is set on the voice level that is regarded as silence. Silence or speech is then determined according to whether the voice level of the input voice data exceeds the threshold, and the voice data is divided at the silent intervals found in this way.

However, the voice level that should be regarded as silence depends on the environmental sound (noise and the like) around the point where the voice is collected, and the environmental sound fluctuates in real time, so it is difficult to determine the optimal level. For example, if the level regarded as silence is set lower than the optimal value and noise exceeding that level is mixed into the voice, an interval that is actually silent is judged to contain speech, and the voice data can no longer be divided at the correct position.
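The threshold problem described above can be illustrated with a minimal sketch. It assumes a simple RMS level per frame; the function names, sample values, and thresholds are hypothetical and are not taken from the patent.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def classify_frames(frames, threshold):
    """Label a frame 'speech' if its RMS level exceeds the threshold, else 'silence'."""
    return ["speech" if rms(f) > threshold else "silence" for f in frames]

# Hypothetical frames: an utterance, a pause containing only background noise, an utterance.
speech = [0.5, -0.4, 0.6, -0.5]
noise = [0.05, -0.04, 0.06, -0.05]
frames = [speech, noise, speech]

# A threshold above the noise level finds the pause between the utterances.
labels_suitable = classify_frames(frames, threshold=0.1)

# A threshold below the noise level classifies the pause as speech,
# so no silent interval is found and the data cannot be divided there.
labels_too_low = classify_frames(frames, threshold=0.01)
```

With the lower threshold the pause disappears from the labels, which is the failure mode the paragraph describes.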

In addition, in a speech recognition system that divides voice at silent intervals, the voice data preceding a silent interval is converted to text only after that silent interval has been detected. Consequently, when a sentence is long, no text is produced until the sentence ends, and the second-by-second real-time performance required of an emergency call system cannot be ensured.

In view of the problems described above, an object of the present disclosure is to provide a speech recognition display device, a speech recognition method, and a program that can improve speech recognition accuracy while ensuring real-time performance.

A speech recognition display device according to one aspect of the present invention includes: a voice acquisition unit that acquires voice data; a voice division unit that divides the voice data along the time axis at a predetermined division time to generate divided voice data; a voice buffer that stores the divided voice data in order from the head each time the division time elapses; a speech recognition unit that converts the divided voice data into divided voice text data; a text determination unit that determines whether the divided voice text data is empty; a buffer control unit that, when the divided voice text data is not empty, generates combined voice data by combining the divided voice data stored in the voice buffer in order from the head; and a text display unit that, each time the division time elapses, displays combined voice text data obtained by speech recognition of the combined voice data.

A speech recognition method according to one aspect of the present invention includes: acquiring voice data; dividing the voice data along the time axis at a predetermined division time to generate divided voice data; storing the divided voice data in a voice buffer in order from the head each time the division time elapses; converting the divided voice data into divided voice text data; determining whether the divided voice text data is empty; when the divided voice text data is not empty, generating combined voice data by combining the divided voice data stored in the voice buffer in order from the head; and, each time the division time elapses, displaying combined voice text data obtained by speech recognition of the combined voice data.

A program according to one aspect of the present invention causes a computer to execute: a process of acquiring voice data; a process of dividing the voice data along the time axis at a predetermined division time to generate divided voice data; a process of storing the divided voice data in a voice buffer in order from the head each time the division time elapses; a process of converting the divided voice data into divided voice text data; a process of determining whether the divided voice text data is empty; a process of generating, when the divided voice text data is not empty, combined voice data by combining the divided voice data stored in the voice buffer in order from the head; and a process of displaying, each time the division time elapses, combined voice text data obtained by speech recognition of the combined voice data.

According to the present invention, it is possible to provide a speech recognition display device, a speech recognition method, and a program that can improve speech recognition accuracy while ensuring real-time performance.

A block diagram showing the schematic configuration of the speech recognition display device according to the embodiment.
A block diagram showing the configuration of the speech recognition display device according to Embodiment 1.
An example of call content and a call voice waveform on the IP telephone.
A flow diagram explaining the speech recognition method according to Embodiment 1.
Diagrams showing the divided voice data stored in the voice buffer after times Td1, Td2, Td6, and Td7 have elapsed.
Diagrams showing the text displayed on the text display unit after times Td1, Td2, Td6, and Td7 have elapsed.
A diagram showing combined voice data in which the divided voice data of indexes 0 and 1 are combined.
A block diagram showing the configuration of the speech recognition display device according to Embodiment 2.
A flow diagram explaining the speech recognition method according to Embodiment 2.
Diagrams showing the combined voice text data stored in the text buffer after times Td1, Td2, Td6, and Td7 have elapsed.
Diagrams showing the text displayed on the text display unit after times Td1, Td2, Td6, and Td7 have elapsed.
A diagram showing an example of call content and a call voice waveform recognized by the speech recognition display device according to Embodiment 2.
A block diagram showing the configuration of the speech recognition display device according to Embodiment 3.
A block diagram showing the configuration of the speech recognition display device according to Embodiment 4.
Diagrams showing the divided voice data stored in voice buffer X and in voice buffer Y after three division times have elapsed.
Diagrams showing the divided voice text data stored in text buffer X and in text buffer Y after the first, second, and third division times have elapsed.
A diagram showing the state of the merged text buffer before merging.
Diagrams showing the merged text data stored in the merged text buffer at the first to fifth merges.
A diagram showing the state of the text display unit before display.
Diagrams showing the text displayed on the text display unit at the first to fifth displays.
A block diagram showing the configuration of the speech recognition display device according to Embodiment 5.
A diagram showing an example of utterance content and an utterance voice waveform collected by the microphone of FIG. 21.
Diagrams showing the divided voice data stored in the voice buffer after times Td3 and Td7 have elapsed.
Diagrams showing the combined voice text data stored in the text buffer after each of times Td1 to Td7 has elapsed.
A diagram showing the initial display state of the text display unit.
Diagrams showing the text displayed on the text display unit after each of times Td1 to Td7 has elapsed.

Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings. The present invention is not, however, limited to the following embodiments. For clarity of explanation, the following description and the drawings are simplified as appropriate. In the drawings, the same components are denoted by the same reference numerals, and repeated description is omitted as appropriate.

FIG. 1 is a block diagram showing the schematic configuration of a speech recognition display device 1 according to the embodiment. The speech recognition display device 1 has a function of acquiring voice data and a function of displaying text data generated from that voice data. As shown in FIG. 1, the speech recognition display device 1 includes a voice acquisition unit 10, a time-segmented voice division unit 11, a voice buffer 12, a speech recognition unit 13, a text display unit 14, a text determination unit 21, and a buffer control unit 31.

The voice acquisition unit 10 acquires voice data and transmits it to the time-segmented voice division unit 11. The time-segmented voice division unit 11 divides the received voice data along the time axis at a predetermined division time to generate divided voice data. Each time the division time elapses, the divided voice data is stored in the voice buffer 12 in order from the head and is also input to the speech recognition unit 13. The speech recognition unit 13 converts the divided voice data into divided voice text data and transmits it to the text determination unit 21. The text determination unit 21 determines whether the divided voice text data is empty.

The determination result of the text determination unit 21 is input to the buffer control unit 31. When the divided voice text data is not empty, the buffer control unit 31 generates combined voice data by combining the divided voice data stored in the voice buffer 12 in order from the head. Each time the division time elapses, the text display unit 14 displays combined voice text data obtained by speech recognition of the combined voice data. This makes it possible to improve speech recognition accuracy while ensuring real-time performance.

Embodiment 1.
FIG. 2 is a block diagram showing the configuration of a speech recognition display device 1A according to Embodiment 1. In the example shown in FIG. 2, the speech recognition display device 1A performs speech recognition on the call voice of an IP telephone 40 in real time and displays the result as text. FIG. 2 shows a divided voice/text control unit 20, which includes the text determination unit 21 of FIG. 1, and a combined voice/text control unit 30, which includes the buffer control unit 31.

The speech recognition display device 1A includes the time-segmented voice division unit 11, the voice buffer 12, the speech recognition unit 13, the text display unit 14, a recognition DB 15, the divided voice/text control unit 20, and the combined voice/text control unit 30. The divided voice/text control unit 20 includes the text determination unit 21, a divided voice transmission unit 22, and a divided voice text reception unit 23. The combined voice/text control unit 30 includes the buffer control unit 31, a combined voice transmission unit 32, and a combined voice text reception unit 33.

The voice acquisition unit 10 acquires the digital voice signal output from the IP telephone 40 and outputs it to the time-segmented voice division unit 11 as voice data. The time-segmented voice division unit 11 divides the voice data received from the voice acquisition unit 10 at a preset division time to generate a plurality of pieces of divided voice data. Each time the division time elapses, the time-segmented voice division unit 11 stores the divided voice data in the voice buffer 12 in order from the head. The time-segmented voice division unit 11 also transmits the divided voice data to the divided voice transmission unit 22 of the divided voice/text control unit 20.

FIG. 3 shows an example of call content and a call voice waveform on the IP telephone 40. Here, the start time of the voice is 0 and the predetermined division time is Td. The voice data is divided into a plurality of pieces of divided voice data, one per division time Td. As shown in FIG. 3, the time at which the first division time Td has elapsed is Td1, and the times at which each subsequent division time Td has elapsed are Td2, Td3, Td4, Td5, Td6, and Td7 in order. The divided voice data for the periods 0 to Td1, Td1 to Td2, Td2 to Td3, Td3 to Td4, Td4 to Td5, Td5 to Td6, and Td6 to Td7 are referred to as divided voice data d1, d2, d3, d4, d5, d6, and d7, respectively.
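The division into d1 to d7 can be sketched as follows. This is a minimal illustration that assumes the voice data is available as a list of samples; the function name, the 8 kHz sample rate, and the 0.5-second value for Td are hypothetical, since the patent text here does not specify them.

```python
def split_by_division_time(samples, sample_rate_hz, td_seconds):
    """Split samples into consecutive pieces, each covering one division time Td."""
    chunk_len = round(sample_rate_hz * td_seconds)
    if chunk_len <= 0:
        raise ValueError("division time must cover at least one sample")
    # Slice at fixed offsets 0, Td, 2*Td, ... regardless of the audio content.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

# Hypothetical input: 3.5 seconds of audio at 8 kHz, divided every 0.5 seconds.
samples = list(range(28000))
divided = split_by_division_time(samples, sample_rate_hz=8000, td_seconds=0.5)
d1, d2 = divided[0], divided[1]  # periods 0 to Td1 and Td1 to Td2
```

Because the cut positions depend only on elapsed time, a piece may begin or end in the middle of a word, which is why the divided data is later recombined before the displayed text is produced.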

Each time the division time Td elapses, the time-segmented voice division unit 11 stores the divided voice data at the end of the queue (voice queue) of the voice buffer 12. The divided voice transmission unit 22 transmits the divided voice data received from the time-segmented voice division unit 11 to the speech recognition unit 13. The speech recognition unit 13 refers to the recognition DB 15, converts the divided voice data into divided voice text data, and transmits it to the divided voice text reception unit 23.

The recognition DB 15 stores the acoustic model, language model, dictionary, and the like that are used when speech recognition processing is executed. The speech recognition unit 13 identifies phonemes by applying an acoustic model, for example a hidden Markov model, to the time-series pattern of feature quantities obtained by acoustic analysis of the voice data. The speech recognition unit 13 then generates text data by using the dictionary and a language model such as an N-gram model to select, from the stored words, the most plausible word for the identified phonemes.
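As a rough illustration of the word selection step, the sketch below picks the candidate word with the highest bigram probability given the preceding word. It is a toy example under stated assumptions: the probability table, the candidate words, and the function name are invented for illustration and do not come from the patent or from any particular recognition engine.

```python
# Hypothetical bigram probabilities P(word | previous word); values are invented.
BIGRAM = {
    ("a", "passerby"): 0.6,
    ("a", "passenger"): 0.3,
    ("passerby", "collapsed"): 0.7,
    ("passerby", "relapsed"): 0.1,
}

def pick_word(previous, candidates, bigram=BIGRAM, floor=1e-6):
    """Return the candidate with the highest bigram probability after `previous`.

    Pairs missing from the table get a small floor probability.
    """
    return max(candidates, key=lambda w: bigram.get((previous, w), floor))

# Two acoustically similar candidates; the language model prefers the likelier one.
chosen = pick_word("passerby", ["relapsed", "collapsed"])
```

A real recognizer scores whole word sequences together with acoustic likelihoods; this sketch shows only how preceding context can change which word is selected.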

The divided voice text reception unit 23 transmits the divided voice text data received from the speech recognition unit 13 to the text determination unit 21. The text determination unit 21 determines whether the divided voice text data is empty and inputs the determination result to the buffer control unit 31.

When the determination result is "not empty", the buffer control unit 31 generates combined voice data by combining the divided voice data stored in the voice buffer 12 in order from the head, and transmits it to the combined voice transmission unit 32. When the determination result is "empty", the buffer control unit 31 deletes the divided voice data stored in the voice buffer 12, leaving the voice buffer 12 empty.

The combined voice transmission unit 32 transmits the combined voice data received from the buffer control unit 31 to the speech recognition unit 13. The speech recognition unit 13 refers to the recognition DB 15, converts the combined voice data into combined voice text data, and transmits it to the combined voice text reception unit 33. Although a single speech recognition unit 13 is configured here to perform speech recognition on both the divided voice data and the combined voice data, the two kinds of voice data may instead be processed by different speech recognition units.

The combined voice text reception unit 33 transmits the received combined voice text data to the text display unit 14. The text display unit 14 displays the received combined voice text data. For example, the text display unit 14 can display the combined voice text data, which is updated at every division time Td, one line at a time in order.
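The flow through the units described above can be sketched as follows. This is a minimal sketch, not the patented implementation: the class name, the `recognize` callback standing in for the speech recognition unit 13, and the toy recognizer are all hypothetical. It also assumes that nothing is displayed when the divided voice text data is empty and that each update is appended as a new line; the text above does not settle either point.

```python
from typing import Callable, List, Optional

class RecognitionPipeline:
    """Per-division-time flow: buffer, recognize, empty check, combine, display."""

    def __init__(self, recognize: Callable[[bytes], str]) -> None:
        self.recognize = recognize            # stand-in for speech recognition unit 13
        self.voice_buffer: List[bytes] = []   # stand-in for voice buffer 12 (queue)
        self.displayed: List[str] = []        # lines shown by text display unit 14

    def on_division_time(self, divided_voice: bytes) -> Optional[str]:
        # Store the new divided voice data at the end of the queue.
        self.voice_buffer.append(divided_voice)
        # Recognize the divided data on its own; the result is used for the empty check.
        divided_text = self.recognize(divided_voice)
        if divided_text == "":
            # Empty result: delete the buffered data (assumption: nothing is displayed).
            self.voice_buffer.clear()
            return None
        # Non-empty result: combine everything buffered, in order from the head.
        combined_voice = b"".join(self.voice_buffer)
        combined_text = self.recognize(combined_voice)
        self.displayed.append(combined_text)
        return combined_text

def toy_recognize(data: bytes) -> str:
    """Toy recognizer: treats bytes as ASCII text and blanks as silence."""
    return data.decode("ascii").strip()

pipeline = RecognitionPipeline(toy_recognize)
for piece in [b"a passerby ", b"has collapsed", b"   ", b"please hurry"]:
    pipeline.on_division_time(piece)
```

In this run the second update re-recognizes the first two pieces together, the blank piece clears the buffer, and the fourth piece starts a new combined utterance.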

The speech recognition method according to Embodiment 1 is now described with reference to FIG. 4, FIGS. 5A to 5D, FIGS. 6A to 6D, and FIG. 7. FIG. 4 is a flow diagram explaining the speech recognition method according to Embodiment 1. FIGS. 5A to 5D show the divided voice data stored in the voice buffer after each time has elapsed. FIGS. 6A to 6D show the text displayed on the text display unit after each time has elapsed. The text data generated by speech recognition of the divided voice data d1, d2, d3, d4, d5, d6, and d7 are referred to as divided voice text data TX1, TX2, TX3, TX4, TX5, TX6, and TX7, respectively.

First, the voice acquisition unit 10 acquires voice data from the digital voice signal of the IP telephone 40 (step S10). The voice acquisition unit 10 transmits the voice data to the time-segmented voice division unit 11 sequentially, starting from time 0. The time-segmented voice division unit 11 then divides the voice data along the time axis at the predetermined division time Td to generate divided voice data and, each time the division time elapses, stores the divided voice data at the end of the voice buffer 12 and transmits it to the speech recognition unit 13 (step S11).

Specifically, when the first division time Td has elapsed (time Td1), the time-segmented voice division unit 11 stores the divided voice data d1 for the period 0 to Td1 in the voice buffer 12. Because the voice buffer 12 is empty at this point, the divided voice data d1 is stored at index 0 of the voice queue of the voice buffer 12, as shown in FIG. 5A.

The speech recognition unit 13 then refers to the recognition DB 15 and converts the divided voice data d1 into divided voice text data TX1 (step S12). The content of the divided voice text data TX1 is "a passerby collapses". The divided voice text data TX1 is input to the text determination unit 21 via the divided voice text reception unit 23. The text determination unit 21 then determines whether the divided voice text data TX1 is empty (step S13).

テキスト判定部２１による判定結果は、バッファ制御部３１に入力される。上述の通り、分割音声テキストデータＴＸ１は、空ではない。分割音声テキストデータが空ではない場合（ステップＳ１３ＮＯ）、バッファ制御部３１は、音声バッファ１２に格納されている分割音声データを先頭から順に結合して結合音声データを生成し、音声認識部１３へ送信する（ステップＳ１４）。このとき、音声キューに格納されている分割音声データはインデックス０のみであるため、インデックス０の分割音声データが結合音声送信部３２を介して音声認識部１３へ入力される。 A determination result by the text determination unit 21 is input to the buffer control unit 31 . As described above, the segmented speech text data TX1 is not empty. If the divided speech text data is not empty (step S13 NO), the buffer control unit 31 combines the divided speech data stored in the speech buffer 12 in order from the beginning to generate combined speech data, and sends the data to the speech recognition unit 13. Send (step S14). At this time, since only index 0 is stored in the audio queue, the divided audio data with index 0 is input to the audio recognition section 13 via the combined audio transmission section 32 .

そして、音声認識部１３は、認識ＤＢ１５を参照し、結合音声データを結合音声テキストデータへ変換する。この結合音声テキストデータは、結合音声テキスト受信部３３を介してテキスト表示部１４へ入力される。テキスト表示部１４は、受信した結合音声テキストデータを表示する（ステップＳ１５）。このときの結合音声テキストデータの内容は、「通行人が倒れ」である。テキスト表示部１４は、受信した結合音声テキストデータを１行ずつ表示する。図６Ａに示すように、テキスト表示部１４は「通行人が倒れ」とのテキストを表示する。 Then, the speech recognition unit 13 refers to the recognition DB 15 and converts the combined voice data into combined voice text data. This combined speech text data is input to the text display section 14 via the combined speech text receiving section 33 . The text display unit 14 displays the received combined speech text data (step S15). The content of the combined speech text data at this time is "a passerby collapses". The text display unit 14 displays the received combined speech text data line by line. As shown in FIG. 6A, the text display unit 14 displays the text "passerby collapses".

ステップＳ１１へ戻り、次の区切時間Ｔｄが経過するとき（時刻Ｔｄ２）、時間区切音声分割部１１は、時間Ｔｄ１～Ｔｄ２の分割音声データｄ２を音声バッファ１２の音声キューの末尾へ格納する。このとき、図５Ｂに示すように、分割音声データｄ２は、音声キューのインデックス１に格納される。同時に、時間区切音声分割部１１は、時間Ｔｄ１～Ｔｄ２の分割音声データｄ２を、分割音声送信部２２へ送信する。 Returning to step S11, when the next delimitation time Td elapses (time Td2), the time delimitation audio division unit 11 stores the division audio data d2 of the times Td1 to Td2 at the end of the audio queue of the audio buffer 12. At this time, as shown in FIG. 5B, the divided audio data d2 is stored in the index 1 of the audio queue. At the same time, the time-segmented audio dividing unit 11 transmits the divided audio data d2 of the times Td1 to Td2 to the divided audio transmitting unit 22 .

音声認識部１３は、分割音声送信部２２から分割音声データｄ２を受信する。そして、音声認識部１３は、認識ＤＢ１５を参照し、分割音声データｄ２を分割音声テキストデータＴＸ２へ変換する（ステップＳ１２）。分割音声テキストデータＴＸ２の内容は、「ていて胸が苦しい」となる。分割音声テキストデータＴＸ２は、分割音声テキスト受信部２３を介して、テキスト判定部２１に入力される。そして、テキスト判定部２１が、分割音声テキストデータＴＸ２が空であるか否かを判定する（ステップＳ１３）。 The voice recognition unit 13 receives the divided voice data d2 from the divided voice transmission unit 22. Then, the speech recognition unit 13 refers to the recognition DB 15 and converts the divided speech data d2 into divided speech text data TX2 (step S12). The content of the divided voice text data TX2 is the fragment "...is lying down and has chest pain", which begins in the middle of a word. The divided speech text data TX2 is input to the text determination unit 21 via the divided speech text receiving section 23. Then, the text determination unit 21 determines whether or not the divided voice text data TX2 is empty (step S13).

テキスト判定部２１による判定結果は、バッファ制御部３１に入力される。上述の通り、分割音声テキストデータＴＸ２は、空ではない。分割音声テキストデータが空ではない場合（ステップＳ１３ＮＯ）、バッファ制御部３１は、音声バッファ１２に格納されている分割音声データを先頭から順に結合して結合音声データを生成し、音声認識部１３へ送信する（ステップＳ１４）。このとき、音声キューに格納されている分割音声データはインデックス０、１であるため、インデックス０、１の分割音声データが結合され、結合音声送信部３２を介して音声認識部１３へ入力される。このときの、音声認識部１３へ入力される結合音声データが図７に示される。 A determination result by the text determination unit 21 is input to the buffer control unit 31. As described above, the divided speech text data TX2 is not empty. If the divided speech text data is not empty (step S13 NO), the buffer control unit 31 combines the divided speech data stored in the speech buffer 12 in order from the beginning to generate combined speech data, and sends the data to the speech recognition unit 13 (step S14). At this time, since the divided audio data stored in the audio queue are those at indexes 0 and 1, the divided audio data at indexes 0 and 1 are combined and input to the speech recognition unit 13 via the combined audio transmission section 32. FIG. 7 shows the combined speech data input to the speech recognition unit 13 at this time.

そして、音声認識部１３は、認識ＤＢ１５を参照し、結合音声データを結合音声テキストデータへ変換する。この結合音声テキストデータは、結合音声テキスト受信部３３を介してテキスト表示部１４へ入力される。テキスト表示部１４は、受信した結合音声テキストデータを表示する（ステップＳ１５）。このときの結合音声テキストデータの内容は、「通行人が倒れていて胸が苦しい」である。図６Ｂに示すように、テキスト表示部１４は「通行人が倒れていて胸が苦しい」とのテキストを表示する。 Then, the speech recognition unit 13 refers to the recognition DB 15 and converts the combined voice data into combined voice text data. This combined speech text data is input to the text display section 14 via the combined speech text receiving section 33. The text display unit 14 displays the received combined speech text data (step S15). The content of the combined speech text data at this time is "a passerby is lying down and has chest pain". As shown in FIG. 6B, the text display unit 14 displays the text "a passerby is lying down and has chest pain".

同様に、時刻Ｔｄ６経過後まで（すなわち、時間Ｔｄ２～Ｔｄ３、Ｔｄ３～Ｔｄ４、Ｔｄ４～Ｔｄ５、Ｔｄ５～Ｔｄ６の分割音声データｄ３～ｄ６に対して）、ステップＳ１１～Ｓ１５が繰り返し実行される。図５Ｃには、時刻Ｔｄ６経過後に、音声バッファ１２に格納された分割音声データが示される。図５Ｃに示すように、音声バッファ１２には、インデックス０～５にそれぞれ分割音声データｄ１～ｄ６が格納されている。また、図６Ｃには、図５Ｃの分割音声データｄ１～ｄ６が先頭から順に結合された結合音声データが音声認識され、テキスト表示部１４に表示されたテキストが示されている。 Similarly, steps S11 to S15 are repeated until time Td6 has passed (that is, for divided audio data d3 to d6 at times Td2 to Td3, Td3 to Td4, Td4 to Td5, and Td5 to Td6). FIG. 5C shows the divided audio data stored in the audio buffer 12 after time Td6 has elapsed. As shown in FIG. 5C, the audio buffer 12 stores divided audio data d1 to d6 at indexes 0 to 5, respectively. FIG. 6C shows the text displayed on the text display section 14 after voice recognition of combined voice data obtained by combining the divided voice data d1 to d6 of FIG. 5C in order from the beginning.

次に、時刻Ｔｄ７経過後の動作について説明する。時間Ｔｄ６～Ｔｄ７では、分割音声データは無音であるものとする。ステップＳ１１において、次の区切時間Ｔｄが経過するとき（時刻Ｔｄ７）、時間区切音声分割部１１は、時間Ｔｄ６～Ｔｄ７の分割音声データｄ７を音声バッファ１２の音声キューの末尾へ格納する。 Next, the operation after time Td7 has elapsed will be described. It is assumed that the divided audio data is silent between times Td6 and Td7. In step S11, when the next interval time Td elapses (time Td7), the time-segmented audio dividing unit 11 stores the divided audio data d7 for times Td6 to Td7 at the end of the audio queue of the audio buffer 12.

図５Ｄに示すように、分割音声データｄ７は、音声キューのインデックス６に格納される。同時に、時間区切音声分割部１１は、時間Ｔｄ６～Ｔｄ７の分割音声データｄ７を、分割音声送信部２２へ送信する。 As shown in FIG. 5D, the divided audio data d7 is stored at index 6 of the audio queue. At the same time, the time-segmented audio dividing section 11 transmits the divided audio data d7 for times Td6 to Td7 to the divided audio transmitting section 22.

音声認識部１３は、分割音声送信部２２から分割音声データｄ７を受信する。そして、音声認識部１３は、認識ＤＢ１５を参照し、分割音声データｄ７を分割音声テキストデータＴＸ７へ変換する（ステップＳ１２）。上述の通り、分割音声データは無音であるため、分割音声テキストデータＴＸ７の内容は空となる。分割音声テキストデータＴＸ７は、分割音声テキスト受信部２３を介して、テキスト判定部２１に入力される。そして、テキスト判定部２１が、分割音声テキストデータＴＸ７が空であるか否かを判定する（ステップＳ１３）。 The voice recognition unit 13 receives the divided voice data d7 from the divided voice transmission unit 22. Then, the speech recognition unit 13 refers to the recognition DB 15 and converts the divided speech data d7 into the divided speech text data TX7 (step S12). As described above, since the divided voice data d7 is silent, the content of the divided voice text data TX7 is empty. The divided speech text data TX7 is input to the text determination unit 21 via the divided speech text receiving section 23. Then, the text determination unit 21 determines whether or not the divided voice text data TX7 is empty (step S13).

テキスト判定部２１による判定結果は、バッファ制御部３１に入力される。上述の通り、分割音声テキストデータＴＸ７は空であるため（ステップＳ１３ＹＥＳ）、バッファ制御部３１は、音声バッファ１２に格納されている分割音声データを削除して（ステップＳ１６）、音声キューを空にする。 A determination result by the text determination unit 21 is input to the buffer control unit 31. As described above, since the divided voice text data TX7 is empty (step S13 YES), the buffer control unit 31 deletes the divided voice data stored in the voice buffer 12 (step S16) and empties the voice queue.
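The loop over steps S11 to S16 can be summarized in a short sketch. It assumes a hypothetical `recognize` callable standing in for the speech recognition unit 13 and uses ASCII bytes as a stand-in for audio; `fake_recognize`, `on_interval_elapsed`, and the English sample phrases are illustrative only and do not appear in the source.

```python
from typing import Callable, List


def on_interval_elapsed(chunk: bytes, audio_queue: List[bytes],
                        recognize: Callable[[bytes], str],
                        displayed: List[str]) -> None:
    """Handle one divided audio chunk (hypothetical sketch of steps S11 to S16)."""
    audio_queue.append(chunk)              # S11: store at the end of the queue
    if recognize(chunk) == "":             # S12 and S13: is the chunk text empty?
        audio_queue.clear()                # S16: silence, so empty the queue
        return
    combined = b"".join(audio_queue)       # S14: combine from the head in order
    displayed.append(recognize(combined))  # S15: display the combined text


def fake_recognize(audio: bytes) -> str:
    """Toy recognizer: treats the bytes as text and silence as whitespace."""
    return audio.decode("ascii").strip()


audio_queue: List[bytes] = []
displayed: List[str] = []
for chunk in [b"a passerby coll", b"apsed and has ch", b"est pain", b"   "]:
    on_interval_elapsed(chunk, audio_queue, fake_recognize, displayed)
```

After the silent fourth chunk the queue is empty and nothing new is displayed, which matches the observation that FIGS. 6C and 6D show the same text.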

このときにテキスト表示部１４に表示されるテキストが図６Ｄに示される。時刻Ｔｄ６経過の時点ですべての音声データのテキスト化が完了しているため、図６Ｃと図６Ｄで表示されるテキストの内容の差分はない。このように、区切り時間Ｔｄが経過する毎に、テキスト表示部１４に徐々に結合されたテキストが１行ずつ表示されていく。 The text displayed on the text display section 14 at this time is shown in FIG. 6D. Since conversion of all the voice data to text has been completed at the time when time Td6 has elapsed, there is no difference between the contents of the texts displayed in FIGS. 6C and 6D. In this way, each time the delimiting time Td elapses, the gradually combined text is displayed line by line on the text display section 14 .

所定の固定時間によって音声データを強制的に分割する比較例１では、テキスト表示部に表示されるテキストは、固定時間毎の音声データをそれぞれ音声認識したテキストが単純に結合されたものとなる。固定時間毎の音声データをそれぞれ音声認識したテキストは、前後の単語との関係をもとにした音声の解析ができず、単語の途中等で区切られている可能性のある音声データのテキストであり、このようなテキストを単純に結合しただけでは、文章として理解できない内容となる可能性がある。 In Comparative Example 1, in which the audio data is forcibly divided by a predetermined fixed time, the text displayed on the text display section is simply combined text obtained by recognizing the audio data for each fixed time. The text obtained by speech recognition of speech data for each fixed time cannot be analyzed based on the relationship with the words before and after, and it is a text of speech data that may be separated in the middle of a word. There is a possibility that simply combining such texts may result in content that cannot be understood as sentences.

これに対し、実施の形態１によれば、区切時間の度に、それまでの分割音声データをすべて結合した結合音声データを再認識させることができる。このため、音声認識部１３にて、前後の単語の関係をもとにした解析が可能であり、単語の途中で区切られることなく、認識精度が向上したテキスト化が可能である。 On the other hand, according to Embodiment 1, it is possible to re-recognize the combined voice data obtained by combining all the divided voice data up to that time each time the division time is reached. Therefore, the speech recognition unit 13 can perform analysis based on the relationship between words before and after the words, so that words can be converted into text with improved recognition accuracy without being broken in the middle of the words.
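The difference between Comparative Example 1 and Embodiment 1 can be illustrated with a deliberately simple toy. The recognizer below is an assumption made purely for illustration: it only outputs words found in a small vocabulary, so a word cut in half by a chunk boundary is lost when chunks are recognized separately. Real recognizers behave differently in detail; the names `VOCAB` and `toy_recognize` are hypothetical.

```python
# Toy vocabulary; anything else (such as a word fragment) is dropped.
VOCAB = {"a", "passerby", "collapsed"}


def toy_recognize(audio: str) -> str:
    """Hypothetical recognizer that keeps only whole words it knows."""
    return " ".join(word for word in audio.split() if word in VOCAB)


# The chunk boundary falls in the middle of the word "collapsed".
chunks = ["a passerby coll", "apsed"]

# Comparative Example 1: recognize each chunk alone, then join the texts.
per_chunk = " ".join(t for t in map(toy_recognize, chunks) if t)

# Embodiment 1: join the audio first, then recognize the combined data.
combined = toy_recognize("".join(chunks))
```

Here `per_chunk` loses the split word while `combined` recovers it, which is the effect the paragraph above attributes to re-recognizing the combined voice data.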

また、無音区間によって音声分割を行う比較例２において、図４の音声データをリアルタイム認識する場合、時刻Ｔｄ５とＴｄ６の中間あたり（Ｔｄ５’とする）から無音区間が始まるため、時刻Ｔｄ５’までは分割音声データが生成されず、Ｔｄ５’以降に初めて分割音声データを音声認識してテキストが表示されることとなる。これに対し、実施の形態１によれば、区切時間Ｔｄの度に、それまでの音声データを結合しテキスト化することができる。これにより、Ｔｄ５’経過前に、区切時間Ｔｄ毎に徐々に文章が構築されていくようにテキストを表示することができ、比較例２よりもリアルタイム性が向上した音声認識が可能である。 Further, in Comparative Example 2, in which audio is divided at silent intervals, when the audio data of FIG. 4 is recognized in real time, the silent interval begins around the midpoint between times Td5 and Td6 (denoted Td5'). Divided voice data is therefore not generated until time Td5', and only after Td5' is the divided voice data recognized and the text displayed for the first time. In contrast, according to Embodiment 1, the speech data up to that point can be combined and converted into text at each interval time Td. As a result, the text can be displayed so that sentences are gradually built up at each interval time Td before Td5' elapses, and speech recognition with better real-time performance than Comparative Example 2 is possible.

実施の形態２． Embodiment 2.
図８は、実施の形態２に係る音声認識表示装置１Ｂの構成を示すブロック図である。実施の形態２では、実施の形態１と同様に、ＩＰ電話機４０での通話内容（通話音声波形）をリアルタイムで音声認識し、テキスト表示する。上述のように、実施の形態１では、区切時間Ｔｄ毎に更新される結合音声テキストデータを、１行ずつ順に表示している。実施の形態２では、テキスト表示部１４の表示をより見やすくするために、テキスト表示部１４に表示されるテキストが文章単位となるようにする。 FIG. 8 is a block diagram showing the configuration of a speech recognition display device 1B according to the second embodiment. In the second embodiment, as in the first embodiment, voice recognition is performed in real time on the call content (call voice waveform) at the IP telephone 40, and the text is displayed. As described above, in Embodiment 1, the combined speech text data updated at each interval time Td is displayed sequentially, line by line. In the second embodiment, in order to make the display of the text display section 14 easier to read, the text displayed on the text display section 14 is arranged in units of sentences.

実施の形態２において、実施の形態１と異なる点は、結合音声・テキスト制御部３０がテキストバッファ３４をさらに含み、バッファ制御部３１が音声バッファ１２を制御するとともに、テキストバッファ３４を制御する点である。以下、実施の形態１との差異について詳細に説明し、重複説明は適宜省略する。 Embodiment 2 differs from Embodiment 1 in that the combined voice/text control unit 30 further includes a text buffer 34, and the buffer control unit 31 controls the text buffer 34 as well as the voice buffer 12. In the following, differences from the first embodiment will be described in detail, and duplicate description will be omitted as appropriate.

結合音声テキスト受信部３３は、区切時間Ｔｄ毎に結合音声テキストデータを受信すると、音声認識部１３から受信した結合音声テキストデータをテキストバッファ３４の空きインデックスのうち最も番号の小さいインデックスに格納する。なお、「空きインデックス」とは、行末に改行コードが付与されていないインデックスである。すなわち、空きインデックスには、結合音声テキストデータが格納されていないか、又は、行末に改行コードが付与されず、１つの文章として確定していない結合音声テキストデータが格納されている。 Upon receiving the combined voice text data at each interval Td, the combined voice text reception unit 33 stores the combined voice text data received from the voice recognition unit 13 in the index with the smallest number among the empty indexes of the text buffer 34 . Note that the "empty index" is an index that does not have a linefeed code at the end of the line. In other words, the empty index stores either no combined voice text data, or stores combined voice text data that has not been determined as one sentence without a line feed code at the end of the line.

結合音声テキスト受信部３３は、当該インデックスにすでに結合音声テキストデータが存在する場合は、既存のデータを新たなデータで上書きする。すなわち、結合音声テキストデータは、区切時間が経過する毎に新たな結合音声テキストデータに更新される。 If the combined speech text data already exists in the index, the combined voice text reception unit 33 overwrites the existing data with new data. That is, the combined voice-text data is updated to new combined voice-text data each time the interval time elapses.

バッファ制御部３１は、テキスト判定部２１から受け取った判定結果が「空」である場合、テキストバッファ３４に保存されている結合音声テキストデータを１つの文章として確定する。バッファ制御部３１は、例えば、テキストバッファ３４に保存されている結合音声テキストデータの行末に改行コード［ＥＯＬ］を付与することで、結合音声テキストデータを１つの文章として確定する。 If the determination result received from the text determination unit 21 is "empty", the buffer control unit 31 determines the combined speech text data stored in the text buffer 34 as one sentence. The buffer control unit 31, for example, assigns a line feed code [EOL] to the end of the line of the combined voice text data stored in the text buffer 34, thereby determining the combined voice text data as one sentence.

この１つの文章として確定した結合音声テキストデータが格納されたインデックスが、使用インデックスとなる。この場合、次に結合音声テキスト受信部３３が結合音声テキストデータを受信すると、前回格納した使用インデックスの、１つ後ろの空きインデックスに該結合音声テキストデータが格納されることとなる。 The index in which the combined speech text data determined as one sentence is stored is the used index. In this case, next time the combined speech text data is received by the combined voice text receiving unit 33, the combined voice text data will be stored in the empty index immediately after the used index stored last time.
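The text buffer behavior described above (write to the lowest free index, overwrite while the sentence is still open, finalize with a line feed code) can be sketched as below. The class name `TextBuffer`, its method names, and the use of `"\n"` as the [EOL] marker are assumptions for illustration. The source does not say what happens if a silent chunk arrives while no sentence is open; this sketch simply does nothing in that case.

```python
from typing import List, Optional

EOL = "\n"  # stand-in for the line feed code [EOL]


class TextBuffer:
    """Hypothetical sketch of text buffer 34: one list entry per index."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def _lowest_free_index(self) -> Optional[int]:
        # A stored line is "free" while it has no EOL at its end.
        for i, line in enumerate(self.lines):
            if not line.endswith(EOL):
                return i
        return None  # every stored line is finalized

    def store(self, combined_text: str) -> None:
        """Overwrite the open line, or start a new one after the used indexes."""
        i = self._lowest_free_index()
        if i is None:
            self.lines.append(combined_text)
        else:
            self.lines[i] = combined_text

    def finalize(self) -> None:
        """Mark the open line as one confirmed sentence by appending EOL."""
        i = self._lowest_free_index()
        if i is not None:
            self.lines[i] += EOL


buf = TextBuffer()
buf.store("a passerby coll")
buf.store("a passerby collapsed")  # overwrites index 0
buf.finalize()                     # silent chunk detected: sentence confirmed
buf.store("no other injuries")     # goes to index 1
```

Because the display simply reads the lines from the head index, each confirmed sentence stays on its own line while the open line keeps being replaced by the latest combined text.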

テキスト表示部１４は、テキストバッファ３４から結合音声テキストデータを読み出し、先頭インデックスから順にテキストを表示する。例えば、テキスト表示部１４は、区切時間Ｔｄよりも短い読み出し時間Ｔｒ毎に、テキストバッファ３４に格納されている結合音声テキストデータを読み出して、テキストを表示することができる。 The text display unit 14 reads out the combined speech text data from the text buffer 34 and displays the text in order from the top index. For example, the text display unit 14 can read the combined voice text data stored in the text buffer 34 and display the text at each read time Tr shorter than the delimiter time Td.

ここで、図９及び図１０Ａ～１０Ｄ、１１Ａ～１１Ｄを参照して、実施の形態２に係る音声認識方法について説明する。図９は、実施の形態２に係る音声認識方法を説明するフロー図である。図１０Ａ～１０Ｄは、各時刻経過後にテキストバッファに保存される結合音声テキストデータを示す図である。図１１Ａ～１１Ｄは、各時刻経過後にテキスト表示部に表示されるテキストを示す図である。 Here, a speech recognition method according to Embodiment 2 will be described with reference to FIGS. 9 and 10A to 10D and 11A to 11D. FIG. 9 is a flowchart for explaining the speech recognition method according to the second embodiment. 10A-10D are diagrams showing the combined speech text data saved in the text buffer after each time. 11A to 11D are diagrams showing text displayed on the text display portion after each time has elapsed.

なお、図９において、図４と同一のステップには、同一の符号が付されている。ＩＰ電話機４０での通話内容及び通話音声波形、音声データの区切り方、音声バッファ１２の音声キューの内容の遷移については、実施の形態１と同様であるものとする（図３、図５Ａ～５Ｄ）。 In FIG. 9, the same steps as in FIG. 4 are given the same reference numerals. The call content and call voice waveform at the IP telephone 40, the method of dividing the voice data, and the transition of the contents of the voice queue in the voice buffer 12 are assumed to be the same as in the first embodiment (FIGS. 3 and 5A to 5D).

図９に示すように、最初の区切り時間Ｔｄの経過時（時刻Ｔｄ１）に、実施の形態１と同様に、ステップＳ１０～Ｓ１４の処理が実行される。そして、結合音声テキスト受信部３３は、音声認識部１３により変換された結合音声テキストデータをテキストバッファ３４に格納する（ステップＳ１７）。結合音声テキスト受信部３３は、テキストバッファ３４の、行末に改行コードが付与されていないインデックスのうち、最も番号の小さいインデックスに、「通行人が倒れ」との結合音声テキストデータを格納する。 As shown in FIG. 9, the processes of steps S10 to S14 are executed in the same manner as in the first embodiment when the first delimiting time Td has passed (time Td1). Then, the combined voice text receiving unit 33 stores the combined voice text data converted by the voice recognition unit 13 in the text buffer 34 (step S17). The combined voice text receiving unit 33 stores the combined voice text data of "Passerby falls down" in the index with the smallest number among the indexes in the text buffer 34 which do not have a linefeed code at the end of the line.

このときのテキストバッファ３４に格納されたテキストファイルの内容が、図１０Ａに示される。「通行人が倒れ」との結合音声テキストデータを書き込む際、テキストファイルは空である。このため、このテキストファイルの１行目が、結合音声テキストデータが格納されておらず、最も番号の小さい空きインデックスに相当する。図１０Ａに示すように、「通行人が倒れ」との結合音声テキストデータが、テキストファイルの１行目に書き込まれる。この時、テキストファイルの行末に改行コードが付与されていない行（１行目）が、テキストファイルの最終行となる。 The contents of the text file stored in the text buffer 34 at this time are shown in FIG. 10A. The text file is empty when writing the combined speech text data of "passerby falls". Therefore, the first line of this text file does not store the combined speech text data and corresponds to the lowest numbered free index. As shown in FIG. 10A, the combined speech-text data of "passerby collapses" is written to the first line of the text file. At this time, the line (first line) without a line feed code at the end of the text file becomes the last line of the text file.

そして、テキスト表示部１４は、テキストバッファ３４からテキストファイルを読み出して、結合音声テキストデータを表示する（ステップＳ１８）。例えば、テキスト表示部１４は、区切時間Ｔｄよりも短い読出時間Ｔｒでテキストファイルを読み出して、テキストを表示することができる。このとき、図１１Ａに示すように、テキスト表示部１４は「通行人が倒れ」とのテキストを表示する。 The text display unit 14 then reads the text file from the text buffer 34 and displays the combined speech text data (step S18). For example, the text display unit 14 can read the text file in a readout time Tr shorter than the delimiter time Td and display the text. At this time, as shown in FIG. 11A, the text display unit 14 displays the text "passerby collapses".

ステップＳ１１へ戻り、次の区切時間Ｔｄが経過すると（時刻Ｔｄ２）、ステップＳ１１～Ｓ１４が再度実行される。図１０Ａに示すように、テキストファイルの１行目には、「通行人が倒れ」との結合音声テキストデータが格納されているものの、改行コードは付与されていない。このため、テキストファイルの１行目は、行末に改行コードが付与されず、１つの文章として確定していない結合音声テキストデータが格納されている、最も番号の小さいインデックスに相当する。 Returning to step S11, when the next delimitation time Td has elapsed (time Td2), steps S11 to S14 are executed again. As shown in FIG. 10A, the first line of the text file stores the combined speech text data of "Passerby falls down", but no line feed code is added. For this reason, the first line of the text file corresponds to the lowest numbered index in which a line feed code is not added at the end of the line and combined speech text data which is not determined as one sentence is stored.

ステップＳ１７では、「通行人が倒れていて胸が苦しい」との結合音声テキストデータが、テキストファイルの１行目に上書きされる。図１０Ｂに示すように、「通行人が倒れていて胸が苦しい」との結合音声テキストデータが、テキストファイルの１行目に書き込まれる。そして、ステップＳ１８では、図１１Ｂに示すように、テキスト表示部１４は「通行人が倒れていて胸が苦しい」とのテキストを表示する。 In step S17, the first line of the text file is overwritten with the combined speech text data "The passerby is lying down and my chest hurts." As shown in FIG. 10B, the combined voice-text data "Passerby is lying down and my chest hurts" is written to the first line of the text file. Then, in step S18, as shown in FIG. 11B, the text display unit 14 displays the text "passerby is lying down and my chest hurts".

同様に、時刻Ｔｄ６経過後まで、ステップＳ１１～Ｓ１４、Ｓ１７、Ｓ１８が繰り返し実行される。図１０Ｃには、時刻Ｔｄ６経過後に、テキストバッファ３４に格納された結合音声テキストデータが示される。図１０Ｃに示すようにテキストバッファ３４には、テキストファイルの１行目に「通行人が倒れていて胸が苦しいと訴えていてえーとかかりつけの病院はないと言っています」との結合音声テキストデータが格納される。そして、図１１Ｃに示すように、テキスト表示部１４は同様のテキストを表示する。 Similarly, steps S11 to S14, S17, and S18 are repeatedly executed until time Td6 has elapsed. FIG. 10C shows the combined speech text data stored in the text buffer 34 after time Td6 has elapsed. As shown in FIG. 10C, the combined speech text data "A passerby is lying down and complaining of chest pain, and, um, says there is no regular hospital" is stored in the first line of the text file in the text buffer 34. Then, as shown in FIG. 11C, the text display section 14 displays the same text.

次に、時刻Ｔｄ７経過後の動作について説明する。時間Ｔｄ６～Ｔｄ７では、分割音声データは無音である。ステップＳ１３において、分割音声テキストデータＴＸ７は空であるため（ＹＥＳ）、ステップＳ１９へと進む。ステップＳ１９では、バッファ制御部３１は、音声バッファ１２に格納されている分割音声データを削除して、音声キューを空にするとともに、結合音声テキストデータの行末に改行コード［ＥＯＬ］を付与する。このときのテキストバッファ３４に格納されたテキストファイルを図１０Ｄに示す。 Next, the operation after time Td7 has elapsed will be described. Between times Td6 and Td7, the divided audio data is silent. In step S13, since the divided voice text data TX7 is empty (YES), the process proceeds to step S19. In step S19, the buffer control unit 31 deletes the divided audio data stored in the audio buffer 12, empties the audio queue, and adds a line feed code [EOL] to the end of the combined audio text data. The text file stored in the text buffer 34 at this time is shown in FIG. 10D.

また、このときのテキスト表示部１４が表示するテキストを、図１１Ｄに示す。時刻Ｔｄ６の経過の時点ですべての音声データのテキスト化が完了しているため、図１１Ｃと図１１Ｄで表示されるテキストの内容の差分はない。このように、実施の形態２では、区切り時間Ｔｄが経過する毎に、テキスト表示部１４に徐々に結合されたテキストが表示されていく。 Also, the text displayed by the text display unit 14 at this time is shown in FIG. 11D. Since conversion of all audio data to text has been completed by the time Td6 has passed, there is no difference between the contents of the texts displayed in FIGS. 11C and 11D. Thus, in the second embodiment, the texts that are gradually combined are displayed on the text display section 14 each time the delimiting time Td elapses.

このように、実施の形態２では、テキストファイルを使用し、文章の区切りまでは結合音声テキストデータを上書きして更新する。これにより、テキスト表示部１４に、音声認識テキストが文章単位で表示されるため、実施の形態１よりも見やすくなる。 Thus, in the second embodiment, a text file is used, and the combined speech text data is overwritten and updated up to the break of the sentence. As a result, the speech recognition text is displayed in units of sentences on the text display section 14, making it easier to see than in the first embodiment.

上述したように、無音区間によって音声分割を行う比較例２では、無音と判断する音声レベルによっては音声データが文章単位で分割されないケースが発生する。そこで、実施の形態２に記載の音声認識表示装置１Ｂを用いて、このような問題を改善する例について説明する。ここでは、図１２の発話内容及び音声波形を音声認識する。図１２では、１１９番通報をした通報者の発話内容「近くにコンビニが見えます。他に怪我人はいません。」とその音声波形が示されている。図１２において、縦軸は音声波形の音声レベル（振幅）、横軸は時間である。 As described above, in Comparative Example 2 in which voice is divided according to silent intervals, there are cases where voice data is not divided into sentences depending on the voice level determined to be silent. Therefore, an example of improving such a problem by using the speech recognition display device 1B described in the second embodiment will be described. Here, speech recognition is performed on the utterance content and speech waveform in FIG. FIG. 12 shows the speech waveform of the utterance content of the caller who called 119, "I can see a convenience store nearby. There are no other injured people." In FIG. 12, the vertical axis represents the audio level (amplitude) of the audio waveform, and the horizontal axis represents time.

上述した、無音区間によって音声分割を行う比較例２において、無音と判断する音声レベルを図１２の－Ａ１～Ａ１とする。音声の波形のすべてが－Ａ１～Ａ１内に収まれば、無音と判断され、当該時刻で音声データが分割される。通報者の発話内容「近くにコンビニが見えます。」と「他に怪我人はいません。」の間には、発話していない時間Ｔｎが存在するが、通報者の音声データに周囲のノイズ音が乗り、発話していない時間Ｔｎの音声レベルは－Ａ１～Ａ１に収まっていない。 In Comparative Example 2 described above, in which voice is divided at silent intervals, let the voice level range judged to be silent be -A1 to A1 in FIG. 12. If the entire audio waveform falls within -A1 to A1, it is judged to be silent, and the audio data is divided at that time. Between the caller's utterances "I can see a convenience store nearby." and "There are no other injured people.", there is a time Tn during which the caller is not speaking. However, ambient noise is superimposed on the caller's voice data, so the voice level during the non-speaking time Tn does not stay within -A1 to A1.

このため、比較例２では、発話していない時間Ｔｎを無音区間と判断できず、音声データを区切ることができない。したがって、比較例２では、この音声データ全体の「近くにコンビニが見えます。他に怪我人はいません。」を一度にテキスト化することとなる。 For this reason, in Comparative Example 2, it is not possible to determine the time Tn during which no speech is made as a silent section, and it is not possible to separate the voice data. Therefore, in Comparative Example 2, "I can see a convenience store nearby. There are no other injured people." in the entire voice data is converted into text at once.

そこで、音声認識表示装置１Ｂにおいて、区切時間ＴｄをＴｄ＜１／２Ｔｎと設定する。これにより、必ず音声キューに発話なし時間の音声データが格納されることとなる。音声認識部１３では、ノイズ音はテキスト化されないため、上述したバッファ制御部３１の動作により、発話なし区間の音声データが格納された時点で、結合音声テキストデータの行末に改行コード［ＥＯＬ］が付与される。すなわち、全体の音声データ「近くにコンビニが見えます。他に怪我人はいません。」は、「近くにコンビニが見えます。」と「他に怪我人はいません。」の２つの文章としてテキスト化されて、テキスト表示部１４に表示されることとなる。 Therefore, in the speech recognition display device 1B, the interval time Td is set so that Td < (1/2)Tn. As a result, voice data from the non-speaking time is always stored in the voice queue. Since the speech recognition unit 13 does not convert noise into text, the operation of the buffer control unit 31 described above adds a line feed code [EOL] to the end of the line of the combined speech text data at the point when the voice data of the non-speaking interval is stored. That is, the entire voice data "I can see a convenience store nearby. There are no other injured people." is converted into text as two sentences, "I can see a convenience store nearby." and "There are no other injured people.", and displayed on the text display section 14.
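The reason the condition Td < (1/2)Tn is sufficient can be checked numerically. The sketch below assumes that chunk boundaries fall at integer multiples of Td measured from time 0 and works in whole milliseconds to avoid floating-point issues; the function name and the sample values are hypothetical. A pause of length Tn contains a complete chunk whenever the first boundary at or after the pause start, plus one more Td, still falls inside the pause.

```python
def pause_contains_full_chunk(pause_start_ms: int, tn_ms: int, td_ms: int) -> bool:
    """True if some chunk [k*Td, (k+1)*Td] lies entirely inside the pause."""
    # First chunk boundary at or after the start of the pause (ceiling division).
    first_boundary = -(-pause_start_ms // td_ms) * td_ms
    return first_boundary + td_ms <= pause_start_ms + tn_ms


TN_MS = 1000

# Td just under Tn/2: every pause position contains a fully silent chunk.
always_ok = all(pause_contains_full_chunk(s, TN_MS, 499) for s in range(5000))

# Td above Tn/2: some pause positions contain no fully silent chunk.
sometimes_missed = not all(
    pause_contains_full_chunk(s, TN_MS, 600) for s in range(5000)
)
```

In the worst case the pause starts just after a boundary, so almost a full Td is wasted before the next chunk begins and a second Td is needed for that chunk to complete, which is where the factor of one half comes from. The guarantee still relies on the recognizer returning empty text for a noise-only chunk, as the paragraph above states.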

このように、実施の形態２によれば、周囲の環境音に左右されずに文章単位での音声分割及びテキスト化が可能となり、比較例２のように、無音と判断する音声レベルによっては文章単位で音声データが分割されないケースが発生するという問題を解決することができる。 As described above, according to the second embodiment, voice division and text conversion in units of sentences are possible without being affected by ambient environmental sounds. This solves the problem of Comparative Example 2 that, depending on the voice level judged to be silent, the voice data may not be divided into sentences.

実施の形態３． Embodiment 3.
実施の形態２では、分割音声データを分割音声送信部２２から、結合音声データを結合音声送信部３２から別々に音声認識部１３へ入力し、分割音声テキストデータを分割音声テキスト受信部２３で、結合音声テキストデータを結合音声テキスト受信部３３で別々に受信していた。この構成を簡素化するために、実施の形態３では、フラグを設定することで、分割音声データと結合音声データを送信する機能を１つにまとめるとともに、分割音声データと結合音声データを受信する機能を１つにまとめる。 In the second embodiment, the divided voice data is input to the voice recognition unit 13 from the divided voice transmission unit 22 and the combined voice data from the combined voice transmission unit 32, separately, and the divided voice text data is received by the divided voice text receiving unit 23 and the combined voice text data by the combined voice text receiving unit 33, also separately. To simplify this configuration, in the third embodiment a flag is set so that the functions of transmitting the divided voice data and the combined voice data are merged into one, and the receiving functions for the divided voice data and the combined voice data are likewise merged into one.

図１３は、実施の形態３に係る音声認識表示装置１Ｃの構成を示すブロック図である。図１３に示す例では、アナログ電話機４１での通話音声をリアルタイムで音声認識しテキスト表示するものとする。図１３に示すように、音声認識表示装置１Ｃは、時間区切音声分割部１１、音声バッファ１２、音声認識部１３、テキスト表示部１４、認識ＤＢ１５、テキスト判定部２１、バッファ制御部３１、テキストバッファ３４、共有メモリ５０、音声送信部５１、テキスト受信部５２を含む。 FIG. 13 is a block diagram showing the configuration of a speech recognition display device 1C according to the third embodiment. In the example shown in FIG. 13, it is assumed that the speech of the analog telephone 41 is recognized in real time and displayed as text. As shown in FIG. 13, the speech recognition display device 1C includes a time segmented speech division unit 11, a speech buffer 12, a speech recognition unit 13, a text display unit 14, a recognition DB 15, a text determination unit 21, a buffer control unit 31, a text buffer 34 , a shared memory 50 , a voice transmitter 51 and a text receiver 52 .

実施の形態３において、実施の形態２と異なる点は、音声送信部５１が分割音声データと結合音声データのいずれを送信しているかを示すフラグを設定可能であり、テキスト受信部５２は該フラグを参照して、分割音声データをテキスト判定部２１へ送信するか、結合音声データをテキストバッファ３４へ書き込むかを選択的に実行する点である。以下、実施の形態１との差異について詳細に説明し、重複説明は適宜省略する。 Embodiment 3 differs from Embodiment 2 in that the voice transmission unit 51 can set a flag indicating whether it is transmitting divided voice data or combined voice data, and the text reception unit 52 refers to the flag and selectively either transmits the divided voice data to the text determination unit 21 or writes the combined voice data to the text buffer 34. In the following, differences from the first embodiment will be described in detail, and duplicate description will be omitted as appropriate.

音声取得部１０は、アナログ電話機４１から出力されるアナログ音声信号を、該音声取得部１０が有するアナログ－デジタル変換部（Ａ－Ｄ変換部）１０Ａにてデジタル音声データへ変換し、時間区切音声分割部１１へ出力する。時間区切音声分割部１１は、音声データを予め設定された区切時間Ｔｄで区切って分割音声データを生成し、区切時間Ｔｄが経過する毎に該分割音声データを音声バッファ１２に先頭から順に格納する。また、時間区切音声分割部１１は、分割音声データを音声送信部５１に送信する。 The voice acquisition unit 10 converts the analog voice signal output from the analog telephone 41 into digital voice data with its own analog-to-digital conversion unit (A-D conversion unit) 10A, and outputs the data to the time-segmented audio dividing unit 11. The time-segmented audio dividing unit 11 generates divided audio data by dividing the audio data at the preset interval time Td, and stores the divided audio data in the audio buffer 12 in order from the beginning each time the interval time Td elapses. The time-segmented audio dividing unit 11 also transmits the divided audio data to the audio transmitting unit 51.

音声送信部５１は、分割音声データに加えて、バッファ制御部３１からの結合音声データを受信する。音声送信部５１は、時間区切音声分割部１１から受信した分割音声データ、及び、バッファ制御部３１から受信した結合音声データを音声認識部１３に送信する。このとき、音声送信部５１は、共有メモリ５０に、時間区切音声分割部１１から分割音声データを受信した場合は共有メモリ５０にＦＡＬＳＥフラグを設定し、バッファ制御部３１から結合音声データを受信した場合はＴＲＵＥフラグを設定する。 The audio transmission unit 51 receives the combined audio data from the buffer control unit 31 in addition to the divided audio data. The audio transmission unit 51 transmits the divided audio data received from the time-segmented audio dividing unit 11 and the combined audio data received from the buffer control unit 31 to the speech recognition unit 13. At this time, the audio transmission unit 51 sets the FALSE flag in the shared memory 50 when it has received divided audio data from the time-segmented audio dividing unit 11, and sets the TRUE flag when it has received combined audio data from the buffer control unit 31.

音声認識部１３は、認識ＤＢ１５を参照して、分割音声データを分割音声テキストデータに、結合音声データを結合音声テキストデータにそれぞれ変換して、テキスト受信部５２に送信する。共有メモリ５０は、音声送信部５１、テキスト受信部５２からアクセス可能である。テキスト受信部５２は、共有メモリ５０に設定されたフラグを参照し、フラグがＦＡＬＳＥの場合には、分割音声テキストデータをテキスト判定部２１に入力する。また、テキスト受信部５２は、フラグがＴＲＵＥの場合には、結合音声テキストデータをテキストバッファ３４の、行末に改行コードが付与されていない、空きインデックスのうち最も番号の小さいインデックスに書き込む。なお、この時すでに当該インデックスに結合音声テキストデータが存在する場合には、既存のデータを新たなデータで上書きする。 The speech recognition unit 13 refers to the recognition DB 15 , converts the divided speech data into divided speech text data, converts the combined speech data into combined voice text data, and transmits the data to the text reception unit 52 . The shared memory 50 can be accessed from the voice transmission section 51 and the text reception section 52 . The text receiving unit 52 refers to the flag set in the shared memory 50, and inputs the divided speech text data to the text determining unit 21 when the flag is FALSE. Also, when the flag is TRUE, the text receiving unit 52 writes the combined speech text data to the index with the smallest number among the free indexes in the text buffer 34 where no linefeed code is added at the end of the line. At this time, if the combined speech text data already exists in the index, the existing data is overwritten with the new data.

上述の通り、テキスト判定部２１は、分割音声テキストデータが「空」であるか否かを判定し、判定結果をバッファ制御部３１に入力する。バッファ制御部３１は、判定結果が「空でない」場合、音声バッファ１２に格納されている分割音声データを先頭から順に結合した結合音声データを生成し、音声送信部５１へ送信する。一方、判定結果が「空である」場合、バッファ制御部３１は、音声バッファ１２に格納されている分割音声データを削除して、音声キューを空にするとともに、結合音声テキストデータの行末に改行コード［ＥＯＬ］を付与する。テキスト表示部１４は、読出時間Ｔｒ（Ｔｒ＜Ｔｄ）でテキストバッファ３４のテキストファイルを読み出し、テキストを表示する。 As described above, the text determination unit 21 determines whether the divided voice text data is "empty" and inputs the determination result to the buffer control unit 31. If the determination result is "not empty", the buffer control unit 31 generates combined audio data by combining the divided audio data stored in the audio buffer 12 in order from the beginning, and transmits the combined audio data to the audio transmission unit 51. On the other hand, if the determination result is "empty", the buffer control unit 31 deletes the divided audio data stored in the audio buffer 12, empties the audio queue, and adds a line feed code [EOL] to the end of the line of the combined audio text data. The text display unit 14 reads the text file in the text buffer 34 at a read time Tr (Tr<Td) and displays the text.
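The flag-based routing of Embodiment 3 can be sketched as follows. All names here (`SharedFlag`, `Receiver`, `send`) are hypothetical, and the shared memory 50 is modeled as a plain in-process object. The sketch also assumes that each recognition result is received before the next audio is sent, so that the single flag always describes the result currently being received; the source does not spell out this ordering.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SharedFlag:
    """Stand-in for shared memory 50: TRUE means combined audio was sent."""
    is_combined: bool = False


@dataclass
class Receiver:
    """Stand-in for text receiving unit 52, routing by the shared flag."""
    shared: SharedFlag
    to_text_judge: List[str] = field(default_factory=list)   # toward unit 21
    to_text_buffer: List[str] = field(default_factory=list)  # toward buffer 34

    def receive(self, text: str) -> None:
        if self.shared.is_combined:
            self.to_text_buffer.append(text)
        else:
            self.to_text_judge.append(text)


def send(audio: bytes, is_combined: bool, shared: SharedFlag,
         recognize: Callable[[bytes], str], receiver: Receiver) -> None:
    """Stand-in for audio transmitting unit 51: set the flag, then transmit."""
    shared.is_combined = is_combined
    receiver.receive(recognize(audio))


flag = SharedFlag()
rx = Receiver(shared=flag)


def decode(audio: bytes) -> str:
    """Toy recognizer used only for this example."""
    return audio.decode("ascii")


send(b"chunk text", False, flag, decode, rx)    # divided audio: FALSE flag
send(b"combined text", True, flag, decode, rx)  # combined audio: TRUE flag
```

With one sender and one receiver sharing a flag, the two transmitting units and two receiving units of Embodiment 2 collapse into a single pair, at the cost of requiring strictly ordered request and response handling.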

The speech recognition method according to Embodiment 3 is described below. In this example, the audio acquisition unit 10 sequentially A/D-converts the analog audio input from the analog telephone 41, starting at time 0, and sequentially inputs the converted audio data to the time-segmented audio dividing unit 11. The waveform of the audio data, the way it is divided, the transitions of the audio queue in the audio buffer 12, the transitions of the text file in the text buffer 34, and the text shown on the text display unit 14 are assumed to be the same as in Embodiment 2 (FIG. 3, FIGS. 5A to 5D, FIGS. 10A to 10D, and FIGS. 11A to 11D).

When time Td1 is reached, the time-segmented audio dividing unit 11 stores the divided audio data d1, covering time 0 to Td1, at the tail of the audio queue in the audio buffer 12. Because the audio queue is empty at this point, the divided audio data d1 is stored at index 0 of the audio queue. The contents of the audio queue are then the same as in FIG. 5A. At the same time, the time-segmented audio dividing unit 11 transmits the divided audio data d1 to the audio transmission unit 51.

Because the audio transmission unit 51 has received the divided audio data d1 from the time-segmented audio dividing unit 11, it sets the flag in the shared memory 50 to FALSE. The audio transmission unit 51 also transmits the divided audio data d1 to the speech recognition unit 13. The speech recognition unit 13 refers to the recognition DB 15, converts the divided audio data d1 into divided speech text data TX1, and transmits it to the text receiving unit 52. Here, the divided speech text data TX1 is "通行人が倒れ" (the opening fragment of an utterance meaning roughly "a passerby has collapsed and is suffering chest pain").

The text receiving unit 52 refers to the flag stored in the shared memory 50 and, because the flag is FALSE, transmits the divided speech text data TX1, "通行人が倒れ", to the text determination unit 21. The text determination unit 21 determines whether the divided speech text data TX1 received from the text receiving unit 52 is empty and transmits the determination result to the buffer control unit 31. Because the divided speech text data TX1 is not empty, the buffer control unit 31 concatenates the divided audio data stored in the audio buffer 12 in order from the head.

At this point the only divided audio data stored in the audio buffer 12 is the data at index 0, so the data at index 0 is transmitted to the audio transmission unit 51 as the combined audio data. Because the audio transmission unit 51 has received combined audio data from the buffer control unit 31, it sets the flag in the shared memory 50 to TRUE. The audio transmission unit 51 also transmits the combined audio data to the speech recognition unit 13.

The speech recognition unit 13 refers to the recognition DB 15, converts the combined audio data into combined speech text data, and transmits it to the text receiving unit 52. Here, the combined speech text data is "通行人が倒れ".

The text receiving unit 52 refers to the flag stored in the shared memory 50 and, because the flag is TRUE, writes the combined speech text data "通行人が倒れ" to the text buffer 34. As described above, the combined speech text data is written to the first line of the text file, which has no line feed code at its end (FIG. 10A). The text display unit 14 then reads the text file stored in the text buffer 34 at the read interval Tr (Tr < Td) and displays the text (FIG. 11A).

When the next division time Td has elapsed (time Td2), the time-segmented audio dividing unit 11 stores the divided audio data d2, covering Td1 to Td2, at the tail of the audio queue in the audio buffer 12. As shown in FIG. 5B, the divided audio data d2 is stored at index 1 of the audio queue. At the same time, the time-segmented audio dividing unit 11 transmits the divided audio data d2 to the audio transmission unit 51.

Because the audio transmission unit 51 has received the divided audio data d2 from the time-segmented audio dividing unit 11, it sets the flag in the shared memory 50 to FALSE. The audio transmission unit 51 also transmits the divided audio data d2 to the speech recognition unit 13. The speech recognition unit 13 refers to the recognition DB 15, converts the divided audio data d2 into divided speech text data TX2, and transmits it to the text receiving unit 52. Here, the divided speech text data TX2 is "ていて胸が苦しい" (the remainder of the same utterance).

The text receiving unit 52 refers to the flag stored in the shared memory 50 and, because the flag is FALSE, transmits the divided speech text data TX2, "ていて胸が苦しい", to the text determination unit 21. The text determination unit 21 determines whether the divided speech text data TX2 received from the text receiving unit 52 is empty and transmits the determination result to the buffer control unit 31. Because the divided speech text data TX2 is not empty, the buffer control unit 31 concatenates the divided audio data stored in the audio buffer 12 in order from the head. The combined audio data at this point is the same as in FIG. 7.

At this point the divided audio data stored in the audio buffer 12 is the data at indexes 0 and 1, so the data at indexes 0 and 1 is concatenated and transmitted to the audio transmission unit 51 as the combined audio data. Because the audio transmission unit 51 has received combined audio data from the buffer control unit 31, it sets the flag in the shared memory 50 to TRUE. The audio transmission unit 51 also transmits the combined audio data to the speech recognition unit 13.

The speech recognition unit 13 refers to the recognition DB 15, converts the combined audio data into combined speech text data, and transmits it to the text receiving unit 52. Here, the combined speech text data is "通行人が倒れていて胸が苦しい" (roughly, "a passerby has collapsed and is suffering chest pain").

The text receiving unit 52 refers to the flag stored in the shared memory 50 and, because the flag is TRUE, writes the combined speech text data "通行人が倒れていて胸が苦しい" to the text buffer 34. As shown in FIG. 10A, the first line of the text file already holds the combined speech text data "通行人が倒れ", but no line feed code has been appended to it. The first line of the text file is therefore overwritten with the combined speech text data "通行人が倒れていて胸が苦しい" (FIG. 10B). The text display unit 14 then reads the text file stored in the text buffer 34 and displays the text (FIG. 11B).

Thereafter, the same processing is repeated until time Td6 has passed. FIG. 10C shows the combined speech text data stored in the text buffer 34 after time Td6, and FIG. 11C shows the text displayed on the text display unit 14 at that point.

Next, the operation after time Td7 is described. In the period from Td6 to Td7, the divided audio data is silent. When time Td7 is reached, the time-segmented audio dividing unit 11 stores the divided audio data d7, covering Td6 to Td7, at the tail of the audio queue in the audio buffer 12. As shown in FIG. 5D, the divided audio data d7 is stored at index 6 of the audio queue. At the same time, the time-segmented audio dividing unit 11 transmits the divided audio data d7 to the audio transmission unit 51.

Because the audio transmission unit 51 has received the divided audio data d7 from the time-segmented audio dividing unit 11, it sets the flag in the shared memory 50 to FALSE. The audio transmission unit 51 also transmits the divided audio data d7 to the speech recognition unit 13. The speech recognition unit 13 refers to the recognition DB 15, converts the divided audio data d7 into divided speech text data TX7, and transmits it to the text receiving unit 52. Here, the divided speech text data TX7 is empty.

The text receiving unit 52 refers to the flag stored in the shared memory 50 and, because the flag is FALSE, transmits the empty divided speech text data TX7 to the text determination unit 21. The text determination unit 21 determines whether the divided speech text data TX7 received from the text receiving unit 52 is empty and transmits the determination result to the buffer control unit 31. Because the divided speech text data TX7 is empty, the buffer control unit 31 deletes the divided audio data stored in the audio buffer 12 to empty the audio queue, and appends the line feed code [EOL] to the end of the line of combined speech text data.

The combined speech text data stored in the text buffer 34 at this point is the same as in FIG. 10D, and FIG. 11D shows the text displayed on the text display unit 14. Thus, in Embodiment 3 as in Embodiment 2, the text display unit 14 shows progressively longer combined text each time the division time Td elapses.

As described above, Embodiment 3, like Embodiment 2, allows the text display unit 14 to display the recognized text sentence by sentence. In addition, the function of transmitting divided audio data and combined audio data to the speech recognition unit 13, and the function of receiving divided speech text data and combined speech text data from it, can each be consolidated into a single component, which simplifies the configuration of the speech recognition display device.

Embodiment 4.
Embodiment 4 uses two of the speech recognition display devices 1B of Embodiment 2 and displays what two speakers say in chronological order. FIG. 14 shows the configuration of the speech recognition display device 1D according to Embodiment 4. In FIG. 14, components identical to those of Embodiment 2 carry the same reference numerals. To distinguish the speech recognition display devices 1B used by the two speakers X and Y, each element is suffixed with X or Y.

In the example of FIG. 14, the speech recognition display devices 1B used by the two speakers (speaker X and speaker Y) share a single speech recognition unit 13, but a separate speech recognition unit 13 may be provided for each. Likewise, the utterances of the two speakers are displayed together on a single text display unit 14, but separate text display units 14 may be provided for speaker X and speaker Y, each displaying the same content.

In Embodiment 4, the utterances of speakers X and Y, picked up by two microphones 42 (microphone X and microphone Y), are displayed in chronological order. As shown in FIG. 14, the speech recognition display device 1D according to Embodiment 4 includes two of the speech recognition display devices 1B described in Embodiment 2.

The speech recognition display device 1D further includes a text merging unit 60 and a merged text buffer 61. The text merging unit 60 merges text file X stored in text buffer X with text file Y stored in text buffer Y. The merged text buffer 61 stores the merged text data produced by the text merging unit 60. The text display unit 14 reads the merged text buffer 61 and displays the merged text data.

In Embodiment 4, unlike Embodiment 2, text files X and Y stored in text buffers X and Y are text files in CSV (character-separated values) format: the text data is divided into several fields (items), and a comma or tab is used as the delimiter between fields. Text files X and Y each have a TIME_FIELD (first field) and a TEXT_FIELD (second field).

The merged text data obtained by merging text files X and Y is also a CSV-format text file. The merged text data has a SPEAKER_FIELD (first field), a TIME_FIELD (second field), and a TEXT_FIELD (third field).
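As an illustration of the two layouts, the following sketch serializes and parses them with Python's standard `csv` module. The comma delimiter is one of the options the description names; the header row, the helper names `to_csv` and `from_csv`, and the sample values are assumptions made for the example.

```python
import csv
import io
from typing import Dict, List, Sequence

# Field names taken from the description; the header row itself is an assumption.
TEXT_FILE_FIELDS = ["TIME_FIELD", "TEXT_FIELD"]
MERGED_FIELDS = ["SPEAKER_FIELD", "TIME_FIELD", "TEXT_FIELD"]


def to_csv(fields: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Serialize a header row followed by data rows as comma-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(fields)
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text: str) -> List[Dict[str, str]]:
    """Parse CSV text back into one dict per data row, keyed by field name."""
    return list(csv.DictReader(io.StringIO(text)))


# Sample values are illustrative only.
text_file_x = to_csv(TEXT_FILE_FIELDS, [["13:00:15.000", "hello, this is X"]])
merged_file = to_csv(MERGED_FIELDS, [["speaker X", "13:00:15.000", "hello, this is X"]])
```

Using the `csv` module rather than splitting on commas matters here because recognized text can itself contain the delimiter; the writer quotes such fields and the reader restores them.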

FIGS. 15A and 15B show the divided audio data stored in audio buffers X and Y, respectively, after three division times have elapsed. FIGS. 16A and 16B, FIGS. 17A and 17B, and FIGS. 18A and 18B show the divided speech text data stored in text buffers X and Y after the first, second, and third division times, respectively. FIGS. 19A to 19F show the state of the merged text buffer from before merging through the fifth merge. FIGS. 20A to 20F show the display state of the text display unit from before display through the fifth display.

Each time the division time elapses, the time-segmented audio dividing units X and Y store the divided audio data at the tail of the queues (audio queues X and Y) in audio buffers X and Y, respectively. After three division times have elapsed, audio queues X and Y are as shown in FIGS. 15A and 15B. In Embodiment 4, when storing divided audio data in audio queues X and Y, the time-segmented audio dividing units X and Y also store the storage time of the divided audio data in association with the corresponding index of audio queues X and Y.

When the determination result received from text determination units X and Y is "not empty", buffer control units X and Y concatenate the divided audio data stored in audio queues X and Y from the head and transmit the result to combined audio transmission units X and Y, respectively. This operation is the same as in Embodiment 2. In addition, buffer control units X and Y write the storage time associated with the divided audio data at the head index of audio queues X and Y into the TIME_FIELD of the last line of text files X and Y (the line to which no line feed code [EOL] has been appended).
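The effect of this rule is that each sentence carries the timestamp of its first audio segment. A minimal sketch of that bookkeeping follows; the types `QueuedSegment` and `TextLine` and the function `stamp_open_line` are hypothetical names introduced for illustration, and the `closed` flag stands in for the presence of [EOL].

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class QueuedSegment:
    stored_at: datetime   # storage time tied to the queue index
    audio: bytes          # divided audio data


@dataclass
class TextLine:
    time: Optional[datetime] = None   # TIME_FIELD
    text: str = ""                    # TEXT_FIELD
    closed: bool = False              # True once [EOL] has been appended


def stamp_open_line(queue: List[QueuedSegment], lines: List[TextLine]) -> None:
    """Copy the head segment's storage time into the open line's TIME_FIELD."""
    if not queue:
        return
    if not lines or lines[-1].closed:
        lines.append(TextLine())      # start a new open line if none exists
    lines[-1].time = queue[0].stored_at


t0 = datetime(2021, 5, 19, 13, 0, 15)            # illustrative timestamps
t1 = datetime(2021, 5, 19, 13, 0, 15, 500000)
queue = [QueuedSegment(t0, b"seg1"), QueuedSegment(t1, b"seg2")]
lines: List[TextLine] = []
stamp_open_line(queue, lines)   # the open line is stamped with t0, not t1
```

Because the key stays fixed while the sentence grows, later stages can use it to find and overwrite the same row.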

When the determination result received from text determination units X and Y is "empty", buffer control units X and Y, as in Embodiment 2, delete the divided audio data stored in audio buffers X and Y and empty audio queues X and Y.

The operation of the speech recognition display device 1D shown in FIG. 14 is described below in time order. Referring to FIGS. 15A and 15B, when the first and second division times have elapsed in the time-segmented audio dividing units X and Y, the respective divided audio data is not empty. The determination results that buffer control units X and Y receive from text determination units X and Y are therefore both "not empty". Accordingly, after the first division time, text file X is as shown in FIG. 16A and text file Y is as shown in FIG. 16B. After the second division time, text file X is as shown in FIG. 17A and text file Y is as shown in FIG. 17B.

When the third division time then elapses in the time-segmented audio dividing units X and Y, the divided audio data is empty. The determination results that buffer control units X and Y receive from text determination units X and Y are therefore both "empty". Buffer control units X and Y then append the line feed code [EOL] to the end of the last line of text files X and Y, respectively. Accordingly, after the third division time, text file X is as shown in FIG. 18A and text file Y is as shown in FIG. 18B.

Each time the merge interval Tm (Tm < Td) elapses, the text merging unit 60 reads text file X and text file Y and updates the merged text file stored in the merged text buffer 61. The text merging unit 60 first reads text file X from text buffer X. Using the time in the TIME_FIELD of the last line of text file X (the line to which no line feed code [EOL] has been appended) as the key, it then searches the merged text file in the merged text buffer 61 for a row whose TIME_FIELD matches that time and whose SPEAKER_FIELD is "speaker X".

If such a row exists, its TEXT_FIELD is overwritten with the TEXT_FIELD content of text file X. If no such row exists, the content of text file X is written to the merged text file at the row position, determined from the TIME_FIELD of the last line of text file X, that keeps the TIME_FIELD values of the merged text file in ascending order (that is, from oldest to newest as the index number increases).

Specifically, in that row of the merged text file, "speaker X" is written to the SPEAKER_FIELD, the TIME_FIELD value of the last line of text file X is written to the TIME_FIELD, and the TEXT_FIELD content of the last line of text file X is written to the TEXT_FIELD. Text file Y is handled in the same way as text file X, and its content is written to the corresponding row of the merged text file.
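The search, overwrite, and ordered-insert steps can be sketched as below. This is an illustrative reading of the description rather than the embodiment's code: `MergedRow` and `merge_line` are hypothetical names, and times are kept as "HH:MM:SS.mmm" strings, which compare correctly as text only within a single day.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MergedRow:
    speaker: str   # SPEAKER_FIELD
    time: str      # TIME_FIELD, "HH:MM:SS.mmm"
    text: str      # TEXT_FIELD


def merge_line(merged: List[MergedRow], speaker: str, time: str, text: str) -> None:
    """Merge the open line of one speaker's text file into the merged file."""
    # Step 1: look for a row with the same speaker and time key and overwrite it.
    for row in merged:
        if row.speaker == speaker and row.time == time:
            row.text = text
            return
    # Step 2: no match, so insert where the times stay in ascending order.
    position = len(merged)
    for i, row in enumerate(merged):
        if row.time > time:
            position = i
            break
    merged.insert(position, MergedRow(speaker, time, text))


merged: List[MergedRow] = []
merge_line(merged, "speaker X", "13:00:15.000", "hello")
merge_line(merged, "speaker Y", "13:00:15.500", "yes")
merge_line(merged, "speaker X", "13:00:15.000", "hello there")  # overwrites X's row
```

A linear scan is enough for a short conversation log; for long sessions a binary search over the time column would be a natural substitute.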

As an example, the flow of operation of the text merging unit 60 is described with the division time Td set to 500 msec, the merge interval Tm set to 400 msec, and the first merge performed by the text merging unit 60 at 13:00:15.400.

First, before the text merging unit 60 starts operating (13:00:15.000, before merging), the merged text file stored in the merged text buffer 61 is empty, as in FIG. 19A. At the time of the first merge (13:00:15.400), after the merge interval of 400 msec has elapsed, text file X is as shown in FIG. 16A and text file Y is empty. The operation of the text merging unit 60 then yields the merged text file shown in FIG. 19B.

The merged text file subsequently becomes as shown in FIG. 19C at the second merge (13:00:15.800), FIG. 19D at the third merge (13:00:16.200), FIG. 19E at the fourth merge (13:00:16.600), and FIG. 19F at the fifth merge (13:00:17.000).

The text display unit 14 reads and displays the merged text file at a predetermined read interval Tr (Tr < Tm). As shown in FIG. 20A, in Embodiment 4 the text display unit 14 has two areas: a display area for speaker X (left side) and a display area for speaker Y (right side). The text display unit 14 reads the merged text file, excluding the FIELD row, one row at a time from the first row and displays the rows in that order on the text display means. Merged text data whose SPEAKER_FIELD is "speaker X" is displayed in the display area for speaker X (left side), and merged text data whose SPEAKER_FIELD is "speaker Y" is displayed in the display area for speaker Y (right side).
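The left and right placement can be pictured with a small sketch that turns merged rows into (left, right) cell pairs. The function name `layout_chat`, the tuple representation, and the speaker labels used as strings are assumptions for illustration; an actual display would render these cells in its own way.

```python
from typing import Iterable, List, Tuple


def layout_chat(rows: Iterable[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Map (speaker, time, text) rows, header excluded, to (left, right) cells."""
    cells: List[Tuple[str, str]] = []
    for speaker, _time, text in rows:
        if speaker == "speaker X":
            cells.append((text, ""))      # speaker X goes to the left area
        elif speaker == "speaker Y":
            cells.append(("", text))      # speaker Y goes to the right area
    return cells


display = layout_chat([
    ("speaker X", "13:00:15.000", "hello"),
    ("speaker Y", "13:00:15.500", "yes"),
])
```

Keeping the rows in file order and only choosing the column per row preserves the chronological, chat-like reading order.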

As above, the merge interval Tm is 400 msec, and the read interval Tr is set to, for example, 300 msec. The flow of operation of the text display unit 14 is described with the first display performed at 13:00:15.600.

First, before the text display unit 14 starts operating (13:00:15.000, before display), no text is displayed on the text display unit 14, as in FIG. 20A. At the time of the first display (13:00:15.600), after the read interval of 300 msec has elapsed, the merged text file is as shown in FIG. 19B, so the text displayed on the text display unit 14 is as shown in FIG. 20B.

At the time of the second display (13:00:15.900), the merged text file is as shown in FIG. 19C, so the text displayed on the text display unit 14 is as shown in FIG. 20C. At the third display (13:00:16.200), the merged text file is as shown in FIG. 19D, so the displayed text is as shown in FIG. 20D.

At the time of the fourth display (13:00:16.500), the merged text file is still as shown in FIG. 19D, so the text displayed on the text display unit 14 is unchanged from the third display (FIG. 20D) and is as shown in FIG. 20E. At the time of the fifth display (13:00:16.800), the merged text file is as shown in FIG. 19E, so the displayed text is as shown in FIG. 20F.
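The relationship between the merge schedule and the display schedule in this example can be checked with a short calculation. The sketch below uses the times stated in the description (merges every 400 msec from 13:00:15.400, displays every 300 msec from 13:00:15.600); the calendar date, the helper names, and the rule that a merge occurring at the same instant as a display is already visible are assumptions, the last chosen because the third display at 13:00:16.200 shows FIG. 19D.

```python
from bisect import bisect_right
from datetime import datetime, timedelta
from typing import List


def ticks(first: datetime, interval_ms: int, count: int) -> List[datetime]:
    """Return `count` instants starting at `first`, spaced `interval_ms` apart."""
    return [first + timedelta(milliseconds=interval_ms * i) for i in range(count)]


def merges_visible(merge_times: List[datetime], display_times: List[datetime]) -> List[int]:
    """For each display instant, count the merges completed at or before it."""
    return [bisect_right(merge_times, t) for t in display_times]


base = datetime(2021, 5, 19, 13, 0, 15)  # the date is arbitrary
merge_times = ticks(base + timedelta(milliseconds=400), 400, 5)    # Tm = 400 msec
display_times = ticks(base + timedelta(milliseconds=600), 300, 5)  # Tr = 300 msec

visible = merges_visible(merge_times, display_times)
merged_figures = ["19A", "19B", "19C", "19D", "19E", "19F"]  # state after n merges
shown = [merged_figures[n] for n in visible]
```

Under these assumptions `shown` lists the merged-file state seen at each of the five displays, and the repeated entry corresponds to the fourth display, where no new merge has occurred since the third.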

Thus, Embodiment 4 can display the conversation between two speakers in a chat-like form while maintaining the same real-time performance as Embodiment 2. Although an example using two microphones has been shown, it is also possible, for example, to treat the transmitted voice of a telephone as one set of audio data and the received voice as the other, and to display them in chat form. In an emergency call system, for instance, this allows the voices of the call taker and the caller to be converted to text in chat form in real time, so that other dispatchers can grasp the content of the call at once and promptly carry out the appropriate response (such as dispatching a fire engine or an ambulance).

Embodiment 5.
FIG. 21 is a block diagram showing the configuration of the speech recognition display device 1E according to Embodiment 5. As shown in FIG. 21, the speech recognition display device 1E includes an overall audio buffer 16 and an audio playback unit 17 in addition to the configuration of the speech recognition display device 1B of Embodiment 2.

In the example of FIG. 21, the speech recognition display device 1E converts the utterances picked up by the microphone 42 into text in real time and, each time one sentence has been fully converted, either plays back the audio corresponding to that text automatically or plays back the corresponding audio sentence by sentence in response to a user operation. The differences from Embodiment 2 are described in detail below, and redundant description is omitted where appropriate.

The audio acquisition unit 10 transmits the acquired audio data to the time-segmented audio dividing unit 11 and also stores all of the audio data in the overall audio buffer 16. In Embodiment 5, when storing divided audio data in the audio buffer 12, the time-segmented audio dividing unit 11 increments the division counter by 1 and stores the counter value in association with the corresponding index of the audio queue. The division counter counts the number of times the audio data has been divided, and its initial value is 0.

Also unlike Embodiment 2, the text file stored in the text buffer 34 is a CSV-format text file in which the text data is divided into several fields (items) and a comma or tab is used as the delimiter between fields. This text file has a COUNT_FIELD (first field) and a TEXT_FIELD (second field).

As in Embodiment 2, when the determination result input from the text determination unit 21 is "not empty", the buffer control unit 31 concatenates the divided audio data stored in the audio queue from the head and transmits the result to the combined audio transmission unit 32. At this time, the buffer control unit 31 writes the division counter value associated with the divided audio data at the head index of the audio queue into the COUNT_FIELD of the last line of the text file stored in the text buffer 34 (the line to which no line feed code [EOL] has been appended).

When the determination result input from the text determination unit 21 is "empty", the buffer control unit 31, as in Embodiment 2, deletes the divided audio data stored in the audio buffer 12 and empties the audio queue.

The text display unit 14 displays the text as described above and, each time the conversion of one sentence into text is completed, outputs to the audio reproduction unit 17, automatically or in response to a user operation, an audio playback instruction that includes the division count counter value read from the text file. The text display unit 14 may use a touch panel in which a display device and an input device are integrated, so that it can accept user operations. Upon receiving the audio playback instruction, the audio reproduction unit 17 reads the audio data stored in the entire audio buffer 16 and plays it back from the playback start position determined by the division count counter value.

The operation of the speech recognition display device 1E shown in FIG. 21 is described below along the time axis. FIG. 22 shows an example of the utterance content and the utterance waveform collected by the microphone 42 of FIG. 21. In FIG. 22, the start time of the audio is 0 and the predetermined division time is Td. As shown in FIG. 22, the time at which the first division time Td has elapsed is Td1, and each time a further division time Td elapses, the times are Td2, Td3, Td4, Td5, Td6, and Td7 in order. The time-segmented audio dividing unit 11 divides the audio data at times Td1 to Td7.

FIG. 23A shows the divided audio data stored in the audio buffer 12 after three division times Td have elapsed for the first sentence (after time Td3). FIG. 23B shows the divided audio data stored in the audio buffer 12 after three division times Td have elapsed for the second sentence (after time Td7).

FIGS. 24A to 24G show the combined speech text data stored in the text buffer after times Td1 to Td7 have elapsed, respectively. FIG. 25A shows the initial display state of the text display unit 14. FIGS. 25B to 25G show the text displayed on the text display unit 14 after times Td1 to Td7 have elapsed, respectively.

図２３Ａを参照すると、時刻Ｔｄ１における１回目及び時刻Ｔｄ２における２回目の音声分割では、それぞれの分割音声データは空ではないため、テキスト判定部２１による判定結果は「空でない」となる。このため、時刻Ｔｄ１経過後のテキストファイルは図２４Ａとなり、時刻Ｔｄ２経過後のテキストファイルは図２４Ｂとなる。 Referring to FIG. 23A, in the first audio division at time Td1 and the second audio division at time Td2, each divided audio data is not empty, so the determination result by the text determination unit 21 is "not empty". Therefore, the text file after time Td1 has passed is shown in FIG. 24A, and the text file after time Td2 has passed is shown in FIG. 24B.

その後、時刻Ｔｄ３における３回目の音声分割では分割音声データが空であるため、テキスト判定部２１による判定結果は「空」となる。このため、バッファ制御部３１は、テキストファイルの最終行（１行目）の末尾に改行コード［ＥＯＬ］を付与する。したがって、Ｔｄ３経過後のテキストファイルは図２４Ｃとなる。 After that, in the third audio division at time Td3, since the divided audio data is empty, the judgment result by the text judging section 21 is "empty". Therefore, the buffer control unit 31 adds a line feed code [EOL] to the end of the last line (first line) of the text file. Therefore, the text file after Td3 has passed is shown in FIG. 24C.

Next, referring to FIG. 23B, in the fourth, fifth, and sixth audio divisions at times Td4, Td5, and Td6, the respective divided audio data are not empty, so the determination result of the text determination unit 21 is "not empty". Therefore, the text file is as shown in FIG. 24D after time Td4, FIG. 24E after time Td5, and FIG. 24F after time Td6.

その後、時刻Ｔｄ７における７回目の音声分割では分割音声データは空であるため、テキスト判定部２１による判定結果は「空」となる。このため、バッファ制御部３１は、テキストファイルの最終行（２行目）の末尾に改行コード［ＥＯＬ］を付与する。したがって、Ｔｄ７経過後のテキストファイルは図２４Ｇとなる。 After that, in the seventh audio division at time Td7, since the divided audio data is empty, the judgment result by the text judging section 21 is "empty". Therefore, the buffer control unit 31 adds a line feed code [EOL] to the end of the last line (second line) of the text file. Therefore, the text file after Td7 has passed is shown in FIG. 24G.
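The seven-step walk-through can be reproduced with a small Python simulation. This is a sketch under simplifying assumptions: each divided audio chunk is represented by its recognized text (an empty string for silence), "recognizing the combined audio" is approximated by joining those strings, and each text-file line is a dict with hypothetical keys `count`, `text`, and `eol`. None of these names come from the source.

```python
def run_timeline(divided_texts):
    """Simulate the buffer control over successive divisions; return a snapshot per step."""
    audio_queue = []  # (division counter, stand-in for divided audio data)
    lines = []        # text file lines: {"count": int, "text": str, "eol": bool}
    snapshots = []
    for counter, piece in enumerate(divided_texts, start=1):
        audio_queue.append((counter, piece))
        if piece:  # determination result: "not empty"
            if not lines or lines[-1]["eol"]:
                # previous line is closed, so start a new last line
                lines.append({"count": None, "text": "", "eol": False})
            current = lines[-1]
            current["count"] = audio_queue[0][0]  # counter at the head index
            current["text"] = "".join(p for _, p in audio_queue)
        else:      # determination result: "empty"
            if lines and not lines[-1]["eol"]:
                lines[-1]["eol"] = True  # add [EOL] to the last line
            audio_queue.clear()          # delete divided audio, empty the queue
        snapshots.append([(l["count"], l["text"], l["eol"]) for l in lines])
    return snapshots


# Two sentences: speech at Td1-Td2, silence at Td3, speech at Td4-Td6, silence at Td7.
snapshots = run_timeline(["A", "B", "", "C", "D", "E", ""])
final_state = snapshots[-1]
```

In this run the first line ends with COUNT_FIELD 1 and the second with COUNT_FIELD 4, because the queue is emptied at Td3 and the fourth division becomes the new head.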

テキスト表示部１４は、所定の読出時間Ｔｒ（Ｔｒ＜Ｔｄ）でテキストファイルを読み出して表示する。また、テキスト表示部１４は、表示エリアの右側に、［再生］ボタンを表示可能である。例えば、テキスト表示部１４は、初期表示では［再生］ボタンを非表示とし、ある行に何らかのテキストが表示された場合、当該行の右側に［再生］ボタンを表示することができる。音声再生部１７は、ユーザの［再生］ボタンの押下に応じて、音声データを再生することができる。 The text display unit 14 reads and displays the text file at a predetermined read time Tr (Tr<Td). Also, the text display unit 14 can display a [playback] button on the right side of the display area. For example, the text display unit 14 can hide the [playback] button in the initial display, and display the [playback] button on the right side of the line when some text is displayed. The audio reproducing unit 17 can reproduce the audio data in response to the user's pressing of the [Playback] button.

At this time, each [Playback] button is associated with the division count counter value corresponding to the text displayed on that line. When the user presses a [Playback] button, the text display unit 14 sends the audio reproduction unit 17 an audio playback instruction that includes the division count counter value corresponding to the text of that line. In addition, when the text display unit 14 reads the text file and detects a line to which a line feed code [EOL] has been added, it may output an audio playback instruction to the audio reproduction unit 17 together with the COUNT_FIELD value (division count counter value) of that line.

Here, the flow of operation of the text display unit 14 is described. FIG. 25A shows the initial display state of the text display unit 14. When time Td1 has elapsed (first audio division), the text file stored in the text buffer 34 is as shown in FIG. 24A. As shown in FIG. 25B, the text display unit 14 displays text on the first line. Because this is the first time text has been displayed on that line, a [Playback] button is displayed to the right of the line.

時刻Ｔｄ２が経過すると（２回目の音声分割）、テキストバッファ３４に保存されるテキストファイルは図２４Ｂとなる。テキスト表示部１４では、図２５Ｃに示すように、１行目のテキストが更新される。このとき、当該行はすでにテキストが表示されていたため、［再生］ボタンは表示されたままである。 After time Td2 has passed (second audio division), the text file saved in the text buffer 34 becomes the one shown in FIG. 24B. In the text display section 14, the text on the first line is updated as shown in FIG. 25C. At this time, since text has already been displayed in the line, the [Playback] button remains displayed.

When time Td3 has elapsed (third audio division), the text file stored in the text buffer 34 is as shown in FIG. 24C. The only difference between FIG. 24B and FIG. 24C is the line feed code [EOL] added to the end of the first line, so the text on the first line of the text display unit 14 does not change, as shown in FIG. 25D.

When the text display unit 14 detects the line feed code [EOL] on the first line of the text file, it may automatically send an audio playback instruction to the audio reproduction unit 17 together with the division count value "1" written in the COUNT_FIELD of the first line of the text file.

時刻Ｔｄ４が経過すると（４回目の音声分割）、テキストバッファ３４に保存されるテキストファイルは図２４Ｄとなる。テキスト表示部１４では、図２５Ｅに示すように、２行目のテキストが表示される。このとき、はじめて当該行にテキストが表示されたため、当該行の右側に［再生］ボタンが表示される。 After time Td4 has passed (fourth audio division), the text file saved in the text buffer 34 becomes the one shown in FIG. 24D. In the text display section 14, the second line of text is displayed as shown in FIG. 25E. At this time, since the text is displayed on the line for the first time, a [Playback] button is displayed on the right side of the line.

時刻Ｔｄ５、Ｔｄ６が経過すると（５回目、６回目の音声分割）、テキストファイルはそれぞれ図２４Ｅ、図２４Ｆとなる。テキスト表示部１４では、図２５Ｅ、図２５Ｆに示すように、２行目のテキストが更新される。このとき、当該行はすでにテキストが表示されていたため、［再生］ボタンは表示されたままである。 When times Td5 and Td6 have passed (fifth and sixth audio divisions), the text files are shown in FIGS. 24E and 24F, respectively. In the text display section 14, the text on the second line is updated as shown in FIGS. 25E and 25F. At this time, since text has already been displayed in the line, the [Playback] button remains displayed.

When time Td7 has elapsed (seventh audio division), the text file stored in the text buffer 34 is as shown in FIG. 24G. The only difference between FIG. 24F and FIG. 24G is the line feed code [EOL] added to the end of the second line, so the text on the second line of the text display unit 14 does not change, as shown in FIG. 25H.

When the text display unit 14 detects the line feed code [EOL] on the second line of the text file, it may automatically send an audio playback instruction to the audio reproduction unit 17 together with the division count value "4" written in the COUNT_FIELD of the second line of the text file.

Upon receiving an audio playback instruction including the division count counter value from the text display unit 14, the audio reproduction unit 17 acquires the audio data from the entire audio buffer 16 and plays back the audio from the playback start position calculated by the following formula.
Playback start position (time) = (division count counter value - 1) × Td
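The formula can be checked with a short Python sketch. The value of Td, the sample rate, and the representation of the entire audio buffer as a list of samples are assumptions chosen for illustration; the function names are hypothetical.

```python
def playback_start_seconds(counter_value, td_seconds):
    """Playback start position (time) = (division count counter value - 1) x Td."""
    if counter_value < 1:
        raise ValueError("the division counter starts at 1 for the first chunk")
    return (counter_value - 1) * td_seconds


def samples_from(entire_buffer, counter_value, td_seconds, sample_rate):
    """Return the part of the entire audio buffer from the playback start position onward."""
    start = int(playback_start_seconds(counter_value, td_seconds) * sample_rate)
    return entire_buffer[start:]


first_sentence_start = playback_start_seconds(1, 2.0)   # counter value "1"
second_sentence_start = playback_start_seconds(4, 2.0)  # counter value "4"
tail = samples_from(list(range(10)), counter_value=4, td_seconds=1.0, sample_rate=2)
```

With the counter values from the walk-through, the first sentence starts at time 0 and the second starts after three full division times, which is where the fourth chunk begins.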

As a result, each time a line of text is displayed on the text display unit 14, the audio corresponding to the displayed text is played back automatically.

Once a [Playback] button is displayed on the text display unit 14, the audio corresponding to that text can be played back at any time. That is, without waiting for the end of the sentence (automatic playback), pressing the [Playback] button plays back the audio corresponding to the text as of the moment of pressing. Furthermore, even as the utterance content grows, pressing the [Playback] button for a given text makes it possible to go back and play the audio corresponding to that earlier text.

As described above, according to Embodiment 5, speech recognition is performed on the utterance content in real time, and each time text is displayed, the audio corresponding to each sentence can be played back automatically, one sentence at a time. The audio corresponding to a sentence can also be played back manually, either per sentence or partway through a sentence. This is effective, for example, when audio data acquired in an interview or the like is converted into text by speech recognition and an operator then compares the text with the audio and corrects the text by hand.

Each element shown in the drawings as a functional block that performs the various processes described above can be configured, in terms of hardware, by a CPU, a memory, and other circuits. The present invention can also realize any of the processes by causing a CPU (Central Processing Unit) to execute a computer program. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, software alone, or a combination of the two, and are not limited to any one of these.

プログラムは、コンピュータに読み込まれた場合に、実施形態で説明された１又はそれ以上の機能をコンピュータに行わせるための命令群（又はソフトウェアコード）を含む。プログラムは、非一時的なコンピュータ可読媒体又は実体のある記憶媒体に格納されてもよい。限定ではなく例として、コンピュータ可読媒体又は実体のある記憶媒体は、random-access memory（RAM）、read-only memory（ROM）、フラッシュメモリ、solid-state drive（SSD）又はその他のメモリ技術、CD-ROM、digital versatile disc（DVD）、Blu-ray（登録商標）ディスク又はその他の光ディスクストレージ、磁気カセット、磁気テープ、磁気ディスクストレージ又はその他の磁気ストレージデバイスを含む。プログラムは、一時的なコンピュータ可読媒体又は通信媒体上で送信されてもよい。限定ではなく例として、一時的なコンピュータ可読媒体又は通信媒体は、電気的、光学的、音響的、またはその他の形式の伝搬信号を含む。 A program includes instructions (or software code) that, when read into a computer, cause the computer to perform one or more of the functions described in the embodiments. The program may be stored in a non-transitory computer-readable medium or tangible storage medium. By way of example, and not limitation, computer readable media or tangible storage media may include random-access memory (RAM), read-only memory (ROM), flash memory, solid-state drives (SSD) or other memory technology, CDs -ROM, digital versatile disc (DVD), Blu-ray disc or other optical disc storage, magnetic cassette, magnetic tape, magnetic disc storage or other magnetic storage device. The program may be transmitted on a transitory computer-readable medium or communication medium. By way of example, and not limitation, transitory computer readable media or communication media include electrical, optical, acoustic, or other forms of propagated signals.

なお、本発明は上記実施の形態に限られたものではなく、趣旨を逸脱しない範囲で適宜変更することが可能である。 It should be noted that the present invention is not limited to the above embodiments, and can be modified as appropriate without departing from the scope of the invention.

REFERENCE SIGNS LIST
1 Speech recognition display device
10 Audio acquisition unit
11 Time-segmented audio dividing unit
12 Audio buffer
13 Speech recognition unit
14 Text display unit
15 Recognition DB
16 Entire audio buffer
17 Audio reproduction unit
20 Divided audio/text control unit
21 Text determination unit
22 Divided audio transmission unit
23 Divided audio text reception unit
30 Combined audio/text control unit
31 Buffer control unit
32 Combined audio transmission unit
33 Combined audio text reception unit
34 Text buffer
40 IP telephone
41 Analog telephone
42 Microphone
50 Shared memory
51 Audio transmission unit
52 Text reception unit
53 Audio combining unit
60 Text merging unit
61 Merged text buffer

Claims

an audio acquisition unit that acquires audio data;
an audio dividing unit that divides the audio data along the time axis at predetermined division times to generate divided audio data;
an audio buffer for storing the divided audio data in order from the beginning every time the delimitation time elapses;
a speech recognition unit that converts the divided speech data into divided speech text data;
a text determination unit that determines whether the divided voice text data is empty;
a buffer control unit for generating combined voice data by combining the divided voice data stored in the voice buffer in order from the beginning when the divided voice text data is not empty;
a text display unit for displaying combined speech text data resulting from voice recognition of the combined speech data each time the delimitation time elapses;
A speech recognition display device comprising:

The buffer control unit deletes the divided audio data stored in the audio buffer when the divided audio text data is empty.
The speech recognition display device according to claim 1.

further comprising a text buffer for storing the combined speech text data;
said combined speech text data is stored in the lowest numbered index among free indexes of said text buffer, updated each time said delimiting time elapses;
the text display unit reads the text buffer and displays the combined speech text data;
The speech recognition display device according to claim 1 or 2.

When the divided voice text data is empty, the buffer control unit adds a line feed code to the end of the line of the combined voice text data stored in the text buffer, and uses the index in which the combined voice text data is stored as an index,
The speech recognition display device according to claim 3.

if there is a silent section in the audio data, the interval time is less than 1/2 of the silent section;
The speech recognition display device according to claim 3 or 4.

an audio transmission unit that transmits the divided audio data and the combined audio data to the audio recognition unit and sets a flag indicating which of the divided audio data and the combined audio data is being transmitted;
a text receiving unit that receives the divided speech text data and the combined speech text data, refers to the flag, and transmits the divided speech text data to the text determination unit and the combined speech text data to the text buffer;
further comprising
The speech recognition display device according to claim 3 or 4.

a first speech recognition display device comprising the speech recognition display device configuration according to claim 3 or 4, for converting first speech data into first combined speech text data;
a second speech recognition and display device having the same configuration as the first speech recognition and display device, for converting second speech data different from the first speech data into second combined speech text data;
a text merging unit that merges the first combined voice-text data and the second combined voice-text data to generate merged text data;
with
the text display unit displays the merged text data;
Voice recognition display device.

an entire audio buffer storing all of the audio data;
an audio reproduction unit that reproduces the audio data stored in the entire audio buffer from a reproduction start position based on a division number counter value linked to the divided audio data;
further comprising
The speech recognition display device according to claim 3 or 4.

acquiring audio data;
dividing the audio data along the time axis at predetermined time intervals to generate divided audio data;
storing the divided audio data in an audio buffer in order from the beginning each time the delimitation time elapses;
converting the divided voice data into divided voice text data;
determining whether the divided speech text data is empty;
if the divided voice text data is not empty, generating combined voice data by combining the divided voice data stored in the voice buffer in order from the head;
displaying combined voice text data obtained by performing voice recognition of the combined voice data each time the delimitation time elapses;
Speech recognition display method.

a process of acquiring audio data;
a process of dividing the audio data along the time axis at predetermined division times to generate divided audio data;
a process of storing the divided audio data in an audio buffer in order from the top every time the delimitation time elapses;
a process of converting the divided speech data into divided speech text data;
a process of determining whether the divided voice text data is empty;
a process of generating combined voice data by combining the divided voice data stored in the voice buffer in order from the beginning when the divided voice text data is not empty;
a process of displaying combined voice text data obtained by performing voice recognition of the combined voice data each time the delimitation time elapses;
A program that causes a computer to execute the above processes.