JP2017203953A

JP2017203953A - Data processing device, data processing system, data processing method and data processing program

Info

Publication number: JP2017203953A
Application number: JP2016097117A
Authority: JP
Inventors: 昭年泉; Akitoshi Izumi; 亮太藤井; Ryota Fujii; 久裕田中; Hisahiro Tanaka
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2016-05-13
Filing date: 2016-05-13
Publication date: 2017-11-16
Anticipated expiration: 2036-05-13
Also published as: JP6731609B2

Abstract

PROBLEM TO BE SOLVED: To reduce the possibility of confidential information leaking to the outside when uttered speech data is transmitted to a speech recognition server, while still recognizing the speech and converting it into text with sufficient accuracy, and to reduce the load of the speech recognition and text conversion process performed by an internal information processing terminal and secure processing speed. SOLUTION: A data processing device 1 comprises: a speech data replacement unit 26 for replacing specific part speech data included in acquired speech data with replacement part speech data different from the specific part speech data, and outputting the result as speech data for conversion; a communication unit 30 for transmitting the speech data for conversion to a speech recognition server, and receiving text data converted from the speech data for conversion from the speech recognition server; and a text data reverse replacement unit 31 for extracting, from the text data received from the speech recognition server, post-replacement text data that corresponds to the replacement part speech data, replacing the post-replacement text data with pre-replacement text data that corresponds to the specific part speech data, and outputting the resulting text data. SELECTED DRAWING: Figure 2

Description

The present invention relates to a data processing device, a data processing system, a data processing method, and a data processing program for converting speech data into text.

In organizations such as companies, minutes recording what speakers said in a meeting are usually produced by assigning one or more note-takers, who manually transcribe what they hear. Because this approach is costly and lacks accuracy, systems that create minutes automatically using a speech recognition device have been proposed (see Patent Document 1). In that technology, speech recognition processing is performed on an information processing terminal owned by each meeting participant.

Patent Document 1: JP 11-272663 A

To recognize and transcribe entire spoken passages such as meeting minutes with sufficient accuracy, the high computing power of an external server or the like is required, together with the large body of accumulated training data stored there. Therefore, to recognize and transcribe meeting minutes with sufficient accuracy, the uttered speech data of the meeting participants must be transmitted to an external server for speech recognition and text conversion. However, when the utterances contain confidential information, transmitting the uttered speech data to an external speech recognition server may leak that confidential information to the outside.

An object of the present invention is to provide a data processing device, a data processing system, a data processing method, and a data processing program that reduce the possibility of confidential information leaking to the outside when uttered speech data is transmitted to a speech recognition server, that enable speech recognition and text conversion with sufficient accuracy, and that reduce the load of the speech recognition and text conversion processing performed by an internal information processing terminal while securing processing speed.

The data processing device of the present disclosure is a data processing device that outputs a recognition result for collected speech data, and includes:
a speech data replacement unit that replaces specific part speech data included in the collected speech data with replacement part speech data different from the specific part speech data, and outputs the result as conversion speech data;
a communication unit that transmits the conversion speech data to a speech recognition server and receives, from the speech recognition server, text data converted from the conversion speech data; and
a text data reverse replacement unit that extracts, from the text data received from the speech recognition server, post-replacement text data corresponding to the replacement part speech data, replaces the post-replacement text data with pre-replacement text data corresponding to the specific part speech data, and outputs the result as the recognition result for the collected speech data.

The data processing system of the present disclosure is a data processing system that outputs a recognition result for collected speech data, and includes:
a speech data replacement unit that replaces specific part speech data included in the collected speech data with replacement part speech data different from the specific part speech data, and outputs the result as conversion speech data;
a communication unit that transmits the conversion speech data to a speech recognition server and receives, from the speech recognition server, text data converted from the conversion speech data; and
a text data reverse replacement unit that extracts, from the text data received from the speech recognition server, post-replacement text data corresponding to the replacement part speech data, replaces the post-replacement text data with pre-replacement text data corresponding to the specific part speech data, and outputs the result as the recognition result for the collected speech data.

The data processing method of the present disclosure is a data processing method for outputting a recognition result for collected speech data, and includes:
a speech data replacement step of replacing specific part speech data included in the collected speech data with replacement part speech data different from the specific part speech data, and outputting the result as conversion speech data;
a communication step of transmitting the conversion speech data to a speech recognition server and receiving, from the speech recognition server, text data converted from the conversion speech data; and
a text data reverse replacement step of extracting, from the text data received from the speech recognition server, post-replacement text data corresponding to the replacement part speech data, replacing the post-replacement text data with pre-replacement text data corresponding to the specific part speech data, and outputting the result as the recognition result for the collected speech data.
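The three steps of the method can be sketched as follows. This is a minimal illustration under strong assumptions, not the disclosed implementation: the "audio" is modelled as a list of word tokens so that the example runs without signal processing, and the names `replace_specific_parts`, `reverse_replace`, `recognize_with_masking`, and `fake_server` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# ASSUMPTION: "audio" is a list of word tokens standing in for real speech data.
Audio = List[str]


def replace_specific_parts(
    audio: Audio, table: Dict[str, str]
) -> Tuple[Audio, List[Tuple[str, str]]]:
    """Replacement step: swap confidential tokens and record each swap."""
    converted: Audio = []
    history: List[Tuple[str, str]] = []  # (post-replacement, pre-replacement)
    for token in audio:
        if token in table:
            converted.append(table[token])
            history.append((table[token], token))
        else:
            converted.append(token)
    return converted, history


def reverse_replace(text: str, history: List[Tuple[str, str]]) -> str:
    """Reverse replacement step: restore the original wording in the text."""
    for post, pre in history:
        text = text.replace(post, pre, 1)
    return text


def recognize_with_masking(
    audio: Audio, table: Dict[str, str], send_to_server: Callable[[Audio], str]
) -> str:
    converted, history = replace_specific_parts(audio, table)
    text = send_to_server(converted)  # communication step: only masked data leaves
    return reverse_replace(text, history)


def fake_server(audio: Audio) -> str:
    """Hypothetical stand-in for the external speech recognition server."""
    return " ".join(audio)


result = recognize_with_masking(
    ["sales", "at", "Fukuoka", "branch"], {"Fukuoka": "Saudi Arabia"}, fake_server
)
print(result)
```

In this sketch the server only ever receives the substituted wording, while the caller still obtains text containing the original wording.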

The data processing program of the present disclosure is a data processing program executed in a data processing device that outputs a recognition result for collected speech data, and causes a computer of the data processing device to execute:
processing of replacing specific part speech data included in the collected speech data with replacement part speech data different from the specific part speech data, and outputting the result as conversion speech data;
processing of transmitting the conversion speech data to a speech recognition server and receiving, from the speech recognition server, text data converted from the conversion speech data; and
processing of extracting, from the text data received from the speech recognition server, post-replacement text data corresponding to the replacement part speech data, replacing the post-replacement text data with pre-replacement text data corresponding to the specific part speech data, and outputting the result as the recognition result for the collected speech data.

According to the present invention, the possibility of confidential information leaking to the outside when uttered speech data is transmitted to a speech recognition server can be reduced, and speech recognition and text conversion can be performed with sufficient accuracy. In addition, the present disclosure can reduce the load of the speech recognition and text conversion processing performed by the internal information processing terminal and can secure processing speed.

A diagram showing an example of a place where the speech processing system of the first embodiment is installed.
A block diagram showing the system configuration of the speech processing system of the first embodiment.
A flowchart explaining the operation procedure for speech data replacement in the speech processing system of the first embodiment.
A flowchart explaining an example of the operation procedure of the speech recognition server in the speech processing system of the first embodiment.
A flowchart showing the operation procedure for text data reverse replacement in the speech processing system of the first embodiment.
A diagram showing an example of the speech input processing performed by the speech input processing unit in the first embodiment.
A diagram showing an example of the utterance section detection processing performed by the utterance section detection unit in the first embodiment.
A diagram showing an example of the combination table in the first embodiment.
A diagram showing a processing example of the specific speech data detection processing in the first embodiment.
A diagram showing an example of the specific information utterance time table in the first embodiment.
A diagram showing an example of the replacement speech data table and a processing example of the replacement speech synthesis processing in the first embodiment.
A diagram showing a processing example of the speech data shift processing in the first embodiment.
A diagram showing a processing example of the speech data replacement processing in the first embodiment.
A diagram showing an example of the replacement history table in the first embodiment.
A diagram showing a processing example of the conversion speech data transmission processing, the conversion speech data reception processing, the speech data conversion processing, and the text data transmission processing in the first embodiment.
A diagram showing a processing example of the text data reverse replacement processing and the recognition result output processing in the first embodiment.
A diagram showing an example of the replacement speech data table and the replacement speech synthesis processing in the second embodiment.
A diagram showing a processing example of the speech data shift processing in the second embodiment.
A diagram showing a processing example of the speech data replacement processing in the second embodiment.
A diagram showing an example of the replacement speech data table and the replacement speech synthesis processing in the third embodiment.
A diagram showing a processing example of the speech data shift processing in the third embodiment.
A diagram showing a processing example of the speech data replacement processing in the third embodiment.

An embodiment that specifically illustrates the speech processing system according to the present invention (hereinafter, “the present embodiment”) is described below with reference to the drawings.

(First embodiment)
FIG. 1 is a diagram showing an example of a place where the speech processing system of the first embodiment is installed. FIG. 2 is a block diagram showing the system configuration of the speech processing system 1 of the first embodiment. FIG. 3 is a flowchart explaining the operation procedure for speech data replacement in the speech processing system 1 of the first embodiment. FIG. 4 is a flowchart explaining an example of the operation procedure of the speech recognition server 10 in the speech processing system 1 of the first embodiment. FIG. 5 is a flowchart showing the operation procedure for text data reverse replacement in the speech processing system 1.

The speech processing system 1 shown in FIGS. 1 to 5 picks up speech uttered by a speaker via a speech input processing unit 3 (an omnidirectional microphone, a directional microphone, a headset, or the like) installed at a place 2 where speech recognition is performed (for example, a conference room, a bank counter, or an office), and outputs the recognition result to a display unit 4. Of the two meeting participants, one is the speech recognition speaker 5, and the speech input processing unit 3 picks up the uttered speech 6 of the speech recognition speaker 5. The data processing device 7 replaces the uttered speech 6 with conversion speech data 8 and transmits it to the speech recognition server 10 via the network 9. The network 9 may be a wired network (for example, an intranet or the Internet) or a wireless network (for example, a wireless LAN (Local Area Network)). The speech recognition server 10 transmits text data 11 corresponding to the conversion speech data 8 to the data processing device 7. The data processing device 7 reverse-replaces the text data 11 received from the speech recognition server 10 into text data corresponding to the uttered speech 6, and outputs the reverse-replaced text data to the display unit 4 as the recognition result for the uttered speech 6. The data processing device 7 may also be provided with an operation unit 12 for user operations.

The speech input processing unit 3 picks up the uttered speech 6 of the speech recognition speaker 5 and outputs it to the utterance section detection unit 14 as collected speech data 13 (speech input processing S1).

The utterance section detection unit 14 removes the noise portions before and after the utterance section of the speech recognition speaker 5 from the input collected speech data 13, and outputs the resulting utterance section speech data 15 to the specific speech data detection unit 16 (utterance section detection processing S2).

Based on the input utterance section speech data 15 and the combination table 19 stored in the combination storage unit 18, the specific speech data detection unit 16 generates the uttered specific part speech data 20 and outputs it to the utterance time detection unit 17 (specific speech data detection processing S3). The utterance time detection unit 17 generates a specific information utterance time table 21, which lists the utterance start time and utterance end time of the specific part speech data 20, and outputs it to the replacement speech synthesis unit 22 (utterance time detection processing S4).

Based on the input specific information utterance time table 21, the replacement speech synthesis unit 22 generates a replacement speech data table 24 that includes replacement part speech data 23, and outputs it to the speech data shift unit 25 (replacement speech synthesis processing S5).

Based on the input utterance section speech data 15 and the replacement speech data table 24, the speech data shift unit 25 generates shifted utterance section speech data 27 and outputs it to the speech data replacement unit 26 (speech data shift processing S6). Because of the speech data shift processing S6, the conversion speech data 8 containing the replaced speech is generated in a form as natural as before the replacement, so the accuracy of speech recognition and text conversion can be maintained.

The speech data replacement unit 26 stores, in the replacement history storage unit 29, a replacement history table 28 generated from the input shifted utterance section speech data 27 and the replacement speech data table 24, and outputs the conversion speech data 8 to the communication unit 30 (speech data replacement processing S7). Providing the replacement history storage unit 29 simplifies the processing of creating the pre-replacement text data from the text data.
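The net effect of the shift and replacement steps on the sample sequence might look like the sketch below. It assumes, purely for illustration, that speech data is a list of integer samples, that the span to replace is given as sample indices, and that everything after the span simply moves by the difference in length. The function name `shift_and_replace` is hypothetical.

```python
from typing import List


def shift_and_replace(
    samples: List[int], start: int, end: int, replacement: List[int]
) -> List[int]:
    """Overwrite samples[start:end] with `replacement`.

    Samples after `end` move by len(replacement) - (end - start) positions,
    which is the role this sketch assigns to the shift step (an assumption).
    """
    if not 0 <= start <= end <= len(samples):
        raise ValueError("span out of range")
    return samples[:start] + replacement + samples[end:]


utterance = [1, 2, 3, 4, 5, 6]
converted = shift_and_replace(utterance, 2, 4, [9, 9, 9])
print(converted)
```

Here a two-sample span is replaced by three samples, so the trailing samples move one position later and the original list is left unmodified.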

The communication unit 30 transmits the input conversion speech data 8 to the speech recognition server 10 via the network 9 (conversion speech data transmission processing S8).

The speech recognition server 10 receives the conversion speech data 8 via the network 9 (conversion speech data reception processing S11), converts the received conversion speech data 8 into text data 11 (speech data conversion processing S12), and transmits the converted text data 11 to the network 9 (text data transmission processing S13). The communication unit 30 of the data processing device 7 outputs the text data 11 received via the network 9 to the text data reverse replacement unit 31 (text data reception processing S21).

Based on the input text data 11 and the replacement history table 28 stored in the replacement history storage unit 29, the text data reverse replacement unit 31 generates the recognition result that should originally have been obtained (text data reverse replacement processing S22) and outputs it to the display unit 4 (recognition result output processing S23). When the operation unit 12 is provided, the user can edit the combination table 19 stored in the combination storage unit 18 via the operation unit 12, and can thereby easily specify, as text, both the specific speech to be replaced and the words that replace it. The operation unit 12 is, for example, a mouse or a keyboard.

FIG. 6 shows an example of the speech input processing S1 performed by the speech input processing unit 3 in the first embodiment. The speech input processing unit 3 includes a sound-collecting device such as an omnidirectional microphone, a directional microphone, or a headset, and picks up ambient environmental noise and the uttered speech 6 of the speech recognition speaker 5. The speech input processing unit 3 converts the collected sound into an analog electrical signal, converts that analog signal into digital speech data (for example, by pulse code modulation (PCM)), and outputs it as the collected speech data 13.
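As a small illustration of the digitization step, the sketch below quantizes sample values to signed 16-bit PCM. The 16-bit depth, the normalized input range of -1.0 to 1.0, and the function name `to_pcm16` are assumptions made for this example; the text only states that PCM or a similar conversion is used.

```python
import math
from typing import List


def to_pcm16(samples: List[float]) -> List[int]:
    """Quantize samples in [-1.0, 1.0] to signed 16-bit integers."""
    pcm = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range input
        pcm.append(int(round(s * 32767)))
    return pcm


# One cycle of a sine wave sampled at 8 points stands in for the microphone signal.
analog = [math.sin(2 * math.pi * n / 8) for n in range(8)]
collected = to_pcm16(analog)
print(collected)
```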

FIG. 7 shows an example of the utterance section detection processing S2 performed by the utterance section detection unit 14 in the first embodiment. The collected speech data 13 output by the speech input processing unit 3 is input to the utterance section detection unit 14. The speech data before and after the period in which the speech recognition speaker 5 is speaking contains only environmental noise. Because the environmental noise portions do not need to be recognized, the utterance section detection unit 14 uses features of the collected speech data 13 (waveform amplitude, frequency band, and so on) to cut out the section in which the speech recognition speaker 5 spoke, and outputs it as the utterance section speech data 15.
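One simple way to cut out an utterance section from waveform amplitude is a frame-wise threshold, sketched below. The frame size, the threshold value, and the function name `detect_utterance` are assumptions for illustration; the text does not specify the detection algorithm beyond naming amplitude and frequency band as example features.

```python
from typing import List, Optional, Tuple


def detect_utterance(
    samples: List[int], frame: int = 4, threshold: float = 500.0
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices spanning the first to last loud frame."""
    loud = []
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        mean_abs = sum(abs(x) for x in chunk) / len(chunk)
        if mean_abs >= threshold:
            loud.append(i)
    if not loud:
        return None  # only noise was collected
    return loud[0], min(loud[-1] + frame, len(samples))


noise = [10, -12, 8, -9]
speech = [3000, -2800, 2500, -2600, 2900, -3100, 2700, -2400]
collected = noise + speech + noise

span = detect_utterance(collected)
utterance = collected[span[0]:span[1]] if span else []
print(span, utterance)
```

A practical detector would also need to tolerate short pauses inside an utterance, which this sketch handles only by spanning from the first to the last loud frame.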

FIG. 8 is a diagram showing an example of the combination table 19 in the first embodiment. The combination table 19 consists of a combination ID column, a pre-replacement text data column, and a post-replacement text data column. The combination ID column holds an identification number that uniquely identifies each combination. The pre-replacement text data column holds the short word or phrase that the specific speech data detection unit 16 looks for in order to detect the specific part speech data 20 in the utterance section speech data 15. The post-replacement text data column holds the information used to create the replacement speech data table 24, which the replacement speech synthesis unit 22 refers to when generating the replacement part speech data 23.

For every combination in the combination table 19, the pre-replacement and post-replacement text data must be written so that, when the speech data replacement unit 26 produces the conversion speech data 8 from the collected speech data 13, the context of the sentences in the conversion speech data 8 does not become unnatural. For example, a sentence in which “Fukuoka branch” is replaced with “Saudi Arabia branch” is natural, whereas one in which “Fukuoka branch” is replaced with “1000 branch” is unnatural. If the context of the conversion speech data 8 is unnatural, the unnatural portion may affect the entire text data 11 output by the speech recognition server 10.

For example, if the pre-replacement text data column of the combination table 19 contains “[Fukuoka] branch” and the post-replacement text data column contains “Saudi Arabia”, the entry means that specific part speech data 20 corresponding to the utterance “Fukuoka branch” is replaced with replacement part speech data 23 corresponding to the utterance “Saudi Arabia branch”. Likewise, if the pre-replacement text data column contains “sales were [*] yen” and the post-replacement text data column contains “random value (1, 10000)”, the entry means that specific part speech data 20 corresponding to utterances such as “sales were 100 million yen” or “sales were 50 million yen” is replaced with replacement part speech data 23 corresponding to an utterance such as “sales were 1000 yen”. Here, “[*]” denotes arbitrary utterance content, and “random value (1, 10000)” means that an integer between 1 and 10000 is selected at random.
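The bracket notation of the combination table can be illustrated with a text-level sketch. Note the assumption: the device described here detects the specific parts in speech data, whereas this example matches against a plain-text transcript as a stand-in. The tuple layout of `COMBINATIONS` and the functions `compile_pattern` and `find_replacements` are hypothetical.

```python
import random
import re
from typing import List, Tuple

# (combination ID, pre-replacement pattern, post-replacement rule).
# IDs 1 and 4 follow the examples in the text; the rule encoding is an assumption.
COMBINATIONS = [
    (1, "sales were [*] yen", ("random", 1, 10000)),
    (4, "[Fukuoka] branch", ("fixed", "Saudi Arabia")),
]


def compile_pattern(pattern: str) -> re.Pattern:
    """Turn 'prefix[target]suffix' into a regex whose group 1 is the target."""
    prefix, rest = pattern.split("[", 1)
    target, suffix = rest.split("]", 1)
    inner = ".+?" if target == "*" else re.escape(target)
    return re.compile(re.escape(prefix) + "(" + inner + ")" + re.escape(suffix))


def find_replacements(text: str, rng: random.Random) -> List[Tuple[int, str, str]]:
    """Return (combination ID, pre-replacement text, post-replacement text) rows."""
    found = []
    for combo_id, pattern, rule in COMBINATIONS:
        for match in compile_pattern(pattern).finditer(text):
            if rule[0] == "random":
                post = str(rng.randint(rule[1], rule[2]))  # integer in [1, 10000]
            else:
                post = rule[1]
            found.append((combo_id, match.group(1), post))
    return found


rows = find_replacements("Fukuoka branch sales were 100 million yen", random.Random(0))
print(rows)
```

Only the bracketed part of each pattern is captured, so the surrounding words (“branch”, “sales were”, “yen”) act as context and are not replaced.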

FIG. 9 is a diagram showing a processing example of the specific speech data detection processing S3 in the first embodiment. In the specific speech data detection processing S3, the utterance section speech data 15, which contains the specific part speech data 20, and the combination table 19 are input to the specific speech data detection unit 16 and the utterance time detection unit 17. The specific speech data detection unit 16 detects the specific part speech data 20 in the utterance section speech data 15. The utterance time detection unit 17 outputs the specific information utterance time table 21, which contains the calculated utterance start and end times together with the combination ID, pre-replacement text data, and post-replacement text data from the combination table 19.

FIG. 10 is a diagram showing an example of the specific information utterance time table 21 in the first embodiment. The specific information utterance time table 21 consists of a combination ID column, a pre-replacement text data column, a post-replacement text data column, an utterance start time column, and an utterance end time column.

The specific information utterance time table 21 shown in FIG. 10 was created from utterance section speech data 15 corresponding to the utterance “Results for fiscal 2015: the Fukuoka branch's sales came to 100 million yen.” Combination ID “4” in the specific information utterance time table 21 corresponds to combination ID “4” in the combination table 19 shown in FIG. 8. In the row for ID “4”, the “pre-replacement text data” column contains only “Fukuoka”, the part inside the brackets, rather than the full entry “[Fukuoka] branch” from the combination table 19, and the “post-replacement text data” column contains “Saudi Arabia”. The “utterance start time” and “utterance end time” columns of that row contain the calculated start time “2.0” and end time “2.3” of “Fukuoka”, respectively. The row for ID “4” is generated in this way.

As with combination ID “4”, combination ID “1” in the specific information utterance time table 21 shown in FIG. 10 corresponds to combination ID “1” in the combination table 19 shown in FIG. 8. In the row for ID “1”, the “pre-replacement text data” column contains only “100 million”, the content detected inside the brackets, rather than the full entry “sales were [*] yen” from the combination table 19 (here, * was detected as 100 million). The “post-replacement text data” column contains “1000”, the value chosen at random, rather than the entry “random value (1, 10000)” from the post-replacement text data column of the combination table 19. The “utterance start time” and “utterance end time” columns of that row contain the calculated start time “3.5” and end time “3.7” of “100 million”, respectively. The row for ID “1” is generated in this way.
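The two rows described above can be assembled as in the following sketch. The `TimetableRow` class, the `build_timetable` function, and the idea of looking up start and end times in a `timings` dictionary are assumptions for illustration; the values are those given in the text. Ordering the rows by start time is also an assumption, since the row order of FIG. 10 is not stated here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TimetableRow:
    combination_id: int
    pre_text: str   # pre-replacement text data (bracketed part only)
    post_text: str  # post-replacement text data (resolved value)
    start: float    # utterance start time
    end: float      # utterance end time


def build_timetable(
    matches: List[Tuple[int, str, str]],
    timings: Dict[str, Tuple[float, float]],
) -> List[TimetableRow]:
    """Attach utterance times to each detected match, ordered by start time."""
    rows = [
        TimetableRow(combo_id, pre, post, *timings[pre])
        for combo_id, pre, post in matches
    ]
    return sorted(rows, key=lambda row: row.start)


# Hypothetical detector output: where each detected phrase was spoken.
timings = {"Fukuoka": (2.0, 2.3), "100 million": (3.5, 3.7)}
matches = [(1, "100 million", "1000"), (4, "Fukuoka", "Saudi Arabia")]

table = build_timetable(matches, timings)
for row in table:
    print(row)
```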

FIG. 11 is a diagram showing an example of the replacement speech data table 24 and of the replacement speech synthesis processing S5 in the first embodiment. The replacement speech synthesizer 22 generates and outputs the replacement speech data table 24 based on the input specific information utterance time table 21. How the replacement speech data table 24 is generated is described later. The generated replacement speech data table 24 consists of a "replacement part ID" column, a "replacement part speech data content" column, an "utterance start time" column, an "utterance end time" column, a "replacement part speech data length" column, a "shift amount" column, and a "replacement part speech data" column. The "replacement part ID" column holds an identification number that uniquely identifies the replacement part speech data 23. The "replacement part speech data content" column holds the utterance content of the replacement part speech data 23. The "utterance start time" column holds the position at which replacement starts when the replacement part speech data 23 is substituted into the shifted utterance section speech data 27. The "utterance end time" column holds the position at which replacement ends when the replacement part speech data 23 is substituted into the shifted utterance section speech data 27. The "replacement part speech data length" column holds the utterance length of the replacement part speech data 23. The "shift amount" column holds a numerical value indicating how far the speech data is shifted in the speech data shift processing S6. The "replacement part speech data" column stores the replacement part speech data 23 itself.

Next, the generation of the replacement speech data table 24 is described using examples.
In FIG. 11, when the specific information utterance time table 21 is input to the replacement speech synthesizer 22, the replacement speech synthesizer 22 writes the items relating to the pre-replacement and post-replacement text data in the first row of the input table, that is, the row for combination ID "4", into the columns of the first row of the replacement speech data table 24, that is, the row for replacement part ID "1". "Saudi Arabia" in the "post-replacement text data" column of the combination ID "4" row of the specific information utterance time table 21 is written into the "replacement part speech data content" column of the replacement part ID "1" row of the replacement speech data table 24. "2.0" in the "utterance start time" column of the combination ID "4" row is written into the "utterance start time" column of the replacement part ID "1" row. "2.3" in the "utterance end time" column of the combination ID "4" row is written into the "utterance end time" column of the replacement part ID "1" row. The "replacement part speech data" column of the replacement part ID "1" row stores, as the replacement part speech data 23, sound data of a voice uttering "Saudi Arabia" (the entry in the "replacement part speech data content" column) at a natural speaking speed. The "replacement part speech data length" column of the replacement part ID "1" row is then filled in with "0.5", the length of that "Saudi Arabia" sound data. Finally, the "shift amount" column of the replacement part ID "1" row is filled in with the value "0.2" calculated from a predetermined formula, for example replacement part speech data length - (utterance end time - utterance start time).

As with the first row, the replacement speech synthesizer 22 writes the items relating to the pre-replacement and post-replacement text data in the second row of the input specific information utterance time table 21, that is, the row for combination ID "1", into the columns of the second row of the replacement speech data table 24, that is, the row for replacement part ID "2". "1000" in the "post-replacement text data" column of the combination ID "1" row of the specific information utterance time table 21 is written into the "replacement part speech data content" column of the replacement part ID "2" row of the replacement speech data table 24. However, the "utterance start time" column of the replacement part ID "2" row holds "3.7", obtained by adding the value "0.2" from the "shift amount" column of the replacement part ID "1" row to the "3.5" found in the "utterance start time" column of the combination ID "1" row of the specific information utterance time table 21. Likewise, the "utterance end time" column of the replacement part ID "2" row holds "3.9", obtained by adding the same shift amount "0.2" to the "3.7" found in the "utterance end time" column of the combination ID "1" row. The "replacement part speech data" column of the replacement part ID "2" row stores, as the replacement part speech data 23, sound data of a voice uttering "1000" (the entry in the "replacement part speech data content" column) at a natural speaking speed. The "replacement part speech data length" column of the replacement part ID "2" row is then filled in with "0.1", the length of that "1000" sound data. Finally, the "shift amount" column of the replacement part ID "2" row is filled in with the value "-0.1" calculated from the predetermined formula, for example replacement part speech data length - (utterance end time - utterance start time).
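The two rows described above follow one rule: each row's times are moved by the shift carried over from earlier rows, and the row's own shift amount is the synthesized length minus the length of the slot it replaces. The Python sketch below reproduces the numbers in the text under stated assumptions: the class and function names are hypothetical, the natural-speed lengths (0.5 and 0.1) are supplied as a lookup instead of coming from a real speech synthesizer, and shifts from all earlier rows are assumed to accumulate (the description only shows the two-row case).

```python
from dataclasses import dataclass


@dataclass
class TimetableRow:
    combination_id: int
    before_text: str
    after_text: str
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds


@dataclass
class ReplacementRow:
    replacement_id: int
    content: str
    start: float
    end: float
    length: float
    shift: float


def build_replacement_table(timetable, natural_lengths):
    """Build replacement-table rows from timetable rows (first-embodiment rule)."""
    rows = []
    carried = 0.0  # total shift introduced by earlier rows (assumed to accumulate)
    for replacement_id, row in enumerate(timetable, start=1):
        start = round(row.start + carried, 3)
        end = round(row.end + carried, 3)
        length = natural_lengths[row.after_text]
        # shift amount = replacement length - (utterance end - utterance start)
        shift = round(length - (end - start), 3)
        rows.append(ReplacementRow(replacement_id, row.after_text, start, end, length, shift))
        carried = round(carried + shift, 3)
    return rows


timetable = [
    TimetableRow(4, "Fukuoka", "Saudi Arabia", 2.0, 2.3),
    TimetableRow(1, "100 million", "1000", 3.5, 3.7),
]
table = build_replacement_table(timetable, {"Saudi Arabia": 0.5, "1000": 0.1})
for row in table:
    print(row)
```

Rounding to three decimals is used only to keep floating-point noise out of the printed values.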

FIG. 12 is a diagram showing an example of the speech data shift processing S6 in the first embodiment. The speech data shift unit 25 generates and outputs the shifted utterance section speech data 27 based on the input utterance section speech data 15 and the replacement speech data table 24. Because the first row of the input replacement speech data table 24 has "2.3" as its utterance end time and "0.2" in the shift amount column, the speech data shift unit 25 shifts all speech data from 2.3 seconds onward in the input utterance section speech data 15 by 0.2 seconds in the positive direction (that is, later in time). Further, because the second row of the replacement speech data table 24 has "3.9" as its utterance end time and "-0.1" in the shift amount column, the speech data shift unit 25 shifts all speech data from 3.9 seconds onward by 0.1 seconds in the negative direction (that is, earlier in time). The speech data shift unit 25 then outputs the speech data obtained when the speech data shift processing S6 has finished as the shifted utterance section speech data 27.
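The shift step can be pictured on a sample buffer. The sketch below is a minimal illustration, assuming the speech data is a Python list of samples at a hypothetical rate of 10 samples per second (chosen only so the numbers are easy to follow) and that the gap opened by a positive shift is filled with zeros until the replacement step overwrites it. The function name `shift_after` is not from the description. Because the times in each table row refer to the already shifted time axis, the rows are applied in order.

```python
def shift_after(samples, rate, end_time, shift):
    """Shift every sample at or after end_time by `shift` seconds."""
    cut = round(end_time * rate)
    n = round(abs(shift) * rate)
    if shift >= 0:
        # Positive shift: the tail moves later, opening a gap of n samples.
        return samples[:cut] + [0] * n + samples[cut:]
    # Negative shift: the tail moves earlier over the n samples before the cut.
    return samples[:cut - n] + samples[cut:]


RATE = 10                  # hypothetical samples per second
speech = list(range(60))   # 6.0 s of dummy samples; value == original index

step1 = shift_after(speech, RATE, 2.3, 0.2)   # row 1: end 2.3, shift +0.2
step2 = shift_after(step1, RATE, 3.9, -0.1)   # row 2: end 3.9, shift -0.1
print(len(speech), len(step1), len(step2))    # 60 62 61
```

The samples overwritten by the negative shift lie inside the span that the replacement step later fills, so no speech outside the specific parts is lost in this example.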

As a result, the conversion speech data 8, which contains the replaced speech data, is generated in a form as natural as the speech before replacement, so the accuracy of speech recognition and text conversion can be maintained.

FIG. 13 is a diagram showing an example of the speech data replacement processing S7 in the first embodiment. The speech data replacement unit 26 generates and outputs the replacement history table 28 and the conversion speech data 8 based on the input shifted utterance section speech data 27 and the replacement speech data table 24. Because the first row of the replacement speech data table 24 has "2.0" in the utterance start time column and "0.5" in the replacement part speech data length column, the speech data from 2.0 seconds to 2.5 seconds in the shifted utterance section speech data 27 is replaced with the replacement part speech data 23 stored in the replacement part speech data column of that first row. The same processing is performed for the second row of the replacement speech data table 24, and the resulting speech data is output as the conversion speech data 8.
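A minimal sketch of this overwrite, again assuming a list-of-samples representation at a hypothetical 10 samples per second; `replace_segment` is an illustrative name, and the synthesized "Saudi Arabia" audio is stood in for by five dummy samples (0.5 s).

```python
def replace_segment(samples, rate, start_time, replacement):
    """Overwrite samples from start_time with the replacement part speech data."""
    start = round(start_time * rate)
    end = start + len(replacement)   # start time + replacement part length
    return samples[:start] + list(replacement) + samples[end:]


RATE = 10                          # hypothetical samples per second
shifted = list(range(62))          # stand-in for the shifted speech data
saudi_arabia = [-1] * 5            # 0.5 s of dummy replacement audio

converted = replace_segment(shifted, RATE, 2.0, saudi_arabia)
print(converted[18:27])            # [18, 19, -1, -1, -1, -1, -1, 25, 26]
```

The overall length is unchanged, because the shift step has already made room for the longer replacement.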

FIG. 14 is a diagram showing an example of the replacement history table 28 in the first embodiment. The replacement history table 28 consists of a "before replacement" column and an "after replacement" column. For example, if the first row of the replacement history table 28 has "Fukuoka" in the "before replacement" column and "Saudi Arabia" in the "after replacement" column, and the second row has "100 million" in the "before replacement" column and "1000" in the "after replacement" column, this means that the conversion speech data 8 was generated by replacing the specific part speech data 20 corresponding to "Fukuoka" in the shifted utterance section speech data 27 with the replacement part speech data 23 corresponding to "Saudi Arabia", and by replacing the specific part speech data 20 corresponding to "100 million" in the shifted utterance section speech data 27 with the replacement part speech data 23 corresponding to "1000".

FIG. 15 is a diagram showing an example of the conversion speech data transmission processing S8, the conversion speech data reception processing S11, the speech data conversion processing S12, and the text data transmission processing S13 in the first embodiment. In FIG. 15, the communication unit 30 transmits the input conversion speech data 8 to the outside of the data processing device 7 (for example, to the speech recognition server 10). On receiving the conversion speech data 8, the speech recognition server 10 converts it into the corresponding text data 11 and outputs the text data 11 to the communication unit 30. For example, when conversion speech data 8 saying "For the fiscal 2015 results, sales at the Saudi Arabia branch came to 1000 yen" is input, the server outputs to the communication unit 30 the converted text data 11 "For the fiscal 2015 results, sales at the Saudi Arabia branch came to 1000 yen". The communication unit 30 transmits the input text data 11 to the data processing device 7. Examples of the transmission and reception path between the communication unit 30 and the speech recognition server 10 include Ethernet, USB, and RS-232 (serial communication).

FIG. 16 is a diagram showing an example of the text data reverse replacement processing S22 and the recognition result output processing S23 in the first embodiment. The text data reverse replacement unit 31 outputs the recognition result 32 that should originally have been obtained, based on the input text data 11 and the replacement history table 28. Suppose the input text data 11 is "For the fiscal 2015 results, sales at the Saudi Arabia branch came to 1000 yen", the first row of the replacement history table 28 has "Fukuoka" in the "before replacement" column and "Saudi Arabia" in the "after replacement" column, and the second row has "100 million" in the "before replacement" column and "1000" in the "after replacement" column. The text data reverse replacement unit 31 then reverse-replaces "Saudi Arabia" in the input text data 11 with "Fukuoka" and "1000" with "100 million", generates the recognition result 32 that should originally have been obtained, "For the fiscal 2015 results, sales at the Fukuoka branch came to 100 million yen", and outputs it to the display unit 4. The display unit 4 is, for example, a display device or a printer device.
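The reverse replacement can be sketched as a pass over the replacement history. The following Python fragment is an illustration under assumptions: `reverse_replace` is a hypothetical name, the history is held as (before, after) pairs, and plain substring replacement of every occurrence is used, which the description does not specify. An English rendering of the example sentence is used for readability.

```python
def reverse_replace(text, history):
    """Swap each post-replacement string in the text back to its original."""
    for before, after in history:
        text = text.replace(after, before)
    return text


history = [("Fukuoka", "Saudi Arabia"), ("100 million", "1000")]
received = "For the fiscal 2015 results, sales at the Saudi Arabia branch came to 1000 yen"

print(reverse_replace(received, history))
# For the fiscal 2015 results, sales at the Fukuoka branch came to 100 million yen
```

Plain substring replacement would also rewrite any unrelated occurrence of a stand-in word, so a practical implementation would probably need stand-ins that are unlikely to occur elsewhere in the utterance.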

Note that by editing the combination table 19 or creating a new one via the operation unit 12, the user can prepare a combination table 19 suited to the usage situation. This lets the user specify, as text, the specific part speech data 20 to be replaced and easily specify the wording that replaces it.

As described above, the first embodiment reduces the possibility of confidential information leaking to the outside when uttered speech data is transmitted to the speech recognition server, while still allowing speech recognition and text conversion with sufficient accuracy. It also reduces the load of the speech recognition and text conversion processing performed by the internal information processing terminal, so processing speed can be secured.

(Second Embodiment)
For the second embodiment, only the parts that differ from the first embodiment are described. In all other respects it is the same as the first embodiment.

FIG. 17 is a diagram showing an example of the replacement speech data table 24 and of the replacement speech synthesis processing S5 in the second embodiment. In FIG. 17, the second embodiment differs from the first embodiment in how the replacement part speech data length is calculated and in how the speech data of the replacement part speech data 23 is generated. The replacement speech synthesizer 22 generates and outputs the replacement speech data table 24 based on the input specific information utterance time table 21, but, unlike FIGS. 11 and 12 of the first embodiment, for each row of the replacement speech data table 24 it writes the value obtained by subtracting the utterance start time from the utterance end time as the replacement part speech data length. Specifically, for the row with replacement part ID "1", whose "replacement part speech data content" column holds "Saudi Arabia", the replacement speech synthesizer 22 subtracts the "utterance start time" value "2.0" from the "utterance end time" value "2.3" and stores the result "0.3" in the replacement part speech data length column. The replacement speech synthesizer 22 then generates, as the replacement part speech data 23, speech data corresponding to a voice that utters "Saudi Arabia" over "0.3" seconds, and stores it in the "replacement part speech data" column. The second row of the replacement speech data table 24 is processed in the same way as the first row.

In the second embodiment, the "shift amount" column of both the first and second rows is filled in with the value "0" calculated from the predetermined formula (replacement part speech data length - (utterance end time - utterance start time)), so no shift processing takes place. FIG. 18 is a diagram showing an example of the speech data shift processing S6 in the second embodiment, and FIG. 19 is a diagram showing an example of the speech data replacement processing S7 in the second embodiment. As shown in FIGS. 18 and 19, the replacement part speech data 23 for "Saudi Arabia" (replacing "Fukuoka") and for "1000" (replacing "100 million") are each generated with the same length as the original, without shift processing, so that the utterance start and end times of "Fukuoka" and "100 million" are left unchanged.
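The arithmetic of the second embodiment can be sketched as follows. The function name `fit_to_original_slot` and the notion of a "speed factor" (natural length divided by slot length) are illustrative assumptions; the description only states that the replacement is uttered over the original duration. The natural lengths 0.5 and 0.1 are the first-embodiment values.

```python
def fit_to_original_slot(start, end, natural_length):
    """Return (length, speed_factor, shift) for a replacement that must fill [start, end)."""
    slot = round(end - start, 3)
    length = slot                           # replacement length = end - start
    speed_factor = natural_length / slot    # >1 means spoken faster than natural
    shift = round(length - slot, 3)         # formula from the text; always 0 here
    return length, speed_factor, shift


print(fit_to_original_slot(2.0, 2.3, 0.5))  # "Saudi Arabia" squeezed into 0.3 s
print(fit_to_original_slot(3.5, 3.7, 0.1))  # "1000" stretched over 0.2 s
```

Under these assumptions "Saudi Arabia" would be spoken roughly 1.67 times faster than natural and "1000" at half speed, which is the price paid for leaving all other speech data untouched.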

As described above, according to the second embodiment, no processing is needed for the speech data other than the speech to be replaced, so the speech data replacement processing can be simplified.

(Third Embodiment)
For the third embodiment, only the parts that differ from the first or second embodiment are described. In all other respects it is the same as the first or second embodiment.

FIG. 20 is a diagram showing an example of the replacement speech data table 24 and of the replacement speech synthesis processing S5 in the third embodiment. Unlike the first and second embodiments, in which "Fukuoka" is replaced with "Saudi Arabia", FIG. 20 is an example in which "Fukuoka" is replaced with "Thailand". As in the first and second embodiments, the replacement speech synthesizer 22 generates and outputs the replacement speech data table 24 based on the input specific information utterance time table 21. Also as in the first and second embodiments, the replacement speech synthesizer 22 copies the value "2.0" from the "utterance start time" column and the value "2.3" from the "utterance end time" column for "Fukuoka" (the entry in the "pre-replacement text data" column of the specific information utterance time table 21) into the "utterance start time" and "utterance end time" columns of the replacement speech data table 24, respectively.

However, unlike FIGS. 17 and 18 of the second embodiment, the replacement speech synthesizer 22 of the third embodiment writes the entry "Thailand", found in the "post-replacement text data" column of the first row of the specific information utterance time table 21 (the row whose "combination ID" column holds "10"), into the "replacement part speech data content" column of the first row of the replacement speech data table 24 (the row whose "replacement part ID" column holds "1"). The replacement speech synthesizer 22 of the third embodiment then generates replacement part speech data 23 corresponding to a voice uttering the entry "Thailand" at a natural speaking speed, stores it in the "replacement part speech data" column, and writes its utterance length, "0.1", in the "replacement part speech data length" column.

FIG. 21 is a diagram showing an example of the speech data shift processing S6 in the third embodiment. The speech data shift unit 25 generates and outputs the shifted utterance section speech data 27 based on the input utterance section speech data 15 and the replacement speech data table 24. Because the first row of the replacement speech data table 24 has "2.3" as its utterance end time and "0.0" in the shift amount column, the speech data shift unit 25 shifts all speech data from 2.3 seconds onward in the input utterance section speech data 15 by 0.0 seconds in the positive direction; in other words, the speech data is not moved.

FIG. 22 is a diagram showing an example of the speech data replacement processing S7 in the third embodiment. The speech data replacement unit 26 generates and outputs the replacement history table 28 and the conversion speech data 8 based on the input shifted utterance section speech data 27 and the replacement speech data table 24. Because the first row of the replacement speech data table 24 has "2.3" in the "utterance end time" column and "0.1" in the "replacement part speech data length" column, the speech data from 2.0 seconds to 2.2 seconds in the shifted utterance section speech data 27 is replaced with speech data corresponding to silence, and the speech data from 2.2 seconds to 2.3 seconds is replaced with the replacement part speech data 23 "Thailand" stored in the "replacement part speech data" column of that first row.
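The third embodiment's handling of a replacement shorter than the original slot can be sketched as right-aligning the replacement and padding the front of the slot with silence. As before, the list-of-samples representation, the 10 samples per second rate, the use of 0 for silence, and the name `pad_and_replace` are assumptions for illustration only.

```python
def pad_and_replace(samples, rate, start_time, end_time, replacement):
    """Fill [start, end) with silence followed by the replacement, ending at end_time."""
    start = round(start_time * rate)
    end = round(end_time * rate)
    pad = (end - start) - len(replacement)
    if pad < 0:
        raise ValueError("replacement is longer than the original slot")
    return samples[:start] + [0] * pad + list(replacement) + samples[end:]


RATE = 10                 # hypothetical samples per second
speech = [9] * 60         # 6.0 s of dummy non-silent samples
thailand = [7]            # 0.1 s of dummy replacement audio

converted = pad_and_replace(speech, RATE, 2.0, 2.3, thailand)
print(converted[19:24])   # [9, 0, 0, 7, 9]
```

Because the slot boundaries are unchanged, the rest of the speech data keeps its original timing and the shift amount stays at 0.0.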

As described above, according to the third embodiment, no processing is needed for the speech data other than the speech to be replaced, so the speech data replacement processing can be simplified.

As described above, the present disclosure reduces the possibility of confidential information leaking to the outside when uttered speech data is transmitted to the speech recognition server, while still allowing speech recognition and text conversion with sufficient accuracy. The present disclosure also reduces the load of the speech recognition and text conversion processing performed by the internal information processing terminal, so processing speed can be secured.

The present disclosure is useful as a data processing device, data processing system, data processing method, and data processing program that, when converting speech data into text, reduce the possibility of confidential information leaking to the outside when uttered speech data is transmitted to a speech recognition server, perform speech recognition and text conversion with sufficient accuracy, reduce the load of the speech recognition and text conversion processing performed by an internal information processing terminal, and secure processing speed.

Description of Symbols
1 Speech processing system
2 Place where speech recognition is performed
3 Speech input processing unit
4 Display unit
5 Speech recognition speaker
6 Uttered speech
7 Data processing device
8 Conversion speech data
9 Network
10 Speech recognition server
11 Text data
12 Operation unit
13 Collected speech data
14 Utterance section detection unit
15 Utterance section speech data
16 Specific speech data detection unit
17 Utterance time detection unit
18 Combination storage unit
19 Combination table
20 Specific part speech data
21 Specific information utterance time table
22 Replacement speech synthesizer
23 Replacement part speech data
24 Replacement speech data table
25 Speech data shift unit
26 Speech data replacement unit
27 Shifted utterance section speech data
28 Replacement history table
29 Replacement history storage unit
30 Communication unit
31 Text data reverse replacement unit
32 Recognition result

Claims

A data processing device that outputs a recognition result of collected sound data,
An audio data replacement unit that replaces the specific unit audio data included in the collected audio data with replacement unit audio data different from the specific unit audio data and outputs the result as voice data for conversion;
A communication unit that transmits the voice data for conversion to a voice recognition server and receives text data converted from the voice data for conversion from the voice recognition server;
Of the text data input from the speech recognition server, extract post-replacement text data corresponding to the replacement unit speech data, and replace the post-replacement text data with pre-replacement text data corresponding to the specific unit speech data A text data reverse replacement unit that outputs as a recognition result of the collected sound data;
A data processing apparatus comprising:

A combination storage unit for storing a combination of the pre-replacement text data corresponding to the specific unit audio data and the post-replacement text data;
A replacement speech synthesizer that synthesizes the replacement unit speech data from the replaced text data and outputs the synthesized speech data to the speech data replacement unit;
Further comprising
The data processing apparatus according to claim 1.

An utterance time detection unit for detecting an utterance start time and an utterance end time of the specific unit audio data;
A voice data shift unit that shifts the utterance end time and the voice data after the utterance end time according to the utterance time of the replacement unit voice data;
Further comprising
The data processing apparatus according to claim 1 or 2.

The voice data shift unit sets the utterance start time of the replacement unit voice data as the utterance start time of the specific unit voice data.
The data processing apparatus according to claim 3.

The voice data shift unit calculates the utterance end time of the replacement unit voice data from the utterance start time and the utterance time of the replacement unit voice data, and shifts the utterance start time of the voice data following the replacement unit voice data to or after the calculated utterance end time,
The data processing apparatus according to claim 4.

The voice data shift unit, when the utterance time of the replacement unit voice data is equal to or shorter than the utterance time of the specific unit voice data, shifts the utterance start time of the replacement unit voice data so that the utterance end time of the replacement unit voice data coincides with the utterance end time of the specific unit voice data,
The data processing apparatus according to claim 3.

An operation unit for editing a combination of the pre-replacement text data corresponding to the specific unit audio data and the post-replacement text data;
The data processing apparatus according to claim 1.

The audio data shift unit generates the replacement unit audio data having the same length as the specific unit audio data.
The data processing apparatus according to claim 3.

The voice data replacement unit further includes a replacement history storage unit that stores a history of replacing the specific unit voice data with the replacement unit voice data.
The data processing apparatus according to claim 1.

A data processing system that outputs a recognition result of collected sound data,
An audio data replacement unit that replaces the specific unit audio data included in the collected audio data with replacement unit audio data different from the specific unit audio data and outputs the result as voice data for conversion;
A communication unit that transmits the voice data for conversion to a voice recognition server and receives text data converted from the voice data for conversion from the voice recognition server;
A text data reverse replacement unit that extracts, from the text data input from the speech recognition server, post-replacement text data corresponding to the replacement unit speech data, replaces the post-replacement text data with pre-replacement text data corresponding to the specific unit speech data, and outputs the result as a recognition result of the collected sound data;
A data processing system comprising:
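
The flow of the three units can be sketched end to end under heavy simplification. In this sketch, audio is stood in for by a list of text chunks, the speech recognition server is a caller-supplied function, and the names `COMBINATIONS`, `replace_specific_parts`, `reverse_replace` and `recognize` are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical combination table: pre-replacement text -> post-replacement text.
COMBINATIONS: Dict[str, str] = {"Project Falcon": "apple pie"}


def replace_specific_parts(chunks: List[str]) -> List[str]:
    """Stand-in for the voice data replacement unit: swap confidential chunks."""
    return [COMBINATIONS.get(chunk, chunk) for chunk in chunks]


def reverse_replace(text: str) -> str:
    """Stand-in for the text data reverse replacement unit."""
    for pre, post in COMBINATIONS.items():
        text = text.replace(post, pre)
    return text


def recognize(chunks: List[str], server: Callable[[List[str]], str]) -> str:
    """Replace, send to the (external) server, then restore the original text."""
    converted = replace_specific_parts(chunks)  # data for conversion
    text = server(converted)                    # role of the communication unit
    return reverse_replace(text)
```

A plain string replacement like this would also rewrite a genuine occurrence of the replacement phrase in the speech. A real system would need some way to tell the two apart, for example by tracking where replacements were made.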

A data processing method for outputting a recognition result of collected sound data,
A voice data replacement step of replacing the specific part voice data included in the collected voice data with replacement part voice data different from the specific part voice data and outputting the result as voice data for conversion;
A communication step of transmitting the voice data for conversion to a voice recognition server and receiving text data converted from the voice data for conversion from the voice recognition server;
A text data reverse replacement step of extracting, from the text data input from the speech recognition server, post-replacement text data corresponding to the replacement unit speech data, replacing the post-replacement text data with pre-replacement text data corresponding to the specific unit speech data, and outputting the result as a recognition result of the collected sound data;
A data processing method comprising:

A combination storage step of storing a combination of the pre-replacement text data corresponding to the specific speech data and the post-replacement text data;
A replacement speech synthesis step of synthesizing the replacement unit speech data from the post-replacement text data and outputting the synthesized speech data to the speech data replacement unit;
Further comprising
The data processing method according to claim 11.

An utterance time detection step of detecting an utterance start time and an utterance end time of the specific part voice data;
A voice data shift step of shifting the utterance end time and the voice data after the utterance end time according to the utterance time of the replacement unit voice data;
Further comprising
The data processing method according to claim 11 or 12.

In the voice data shift step, the utterance start time of the replacement unit voice data is set as the utterance start time of the specific unit voice data.
The data processing method according to claim 13.

In the voice data shift step, the utterance end time of the replacement unit voice data is calculated from the utterance start time and the utterance time of the replacement unit voice data, and the utterance start time of the voice data following the replacement unit voice data is shifted to after the calculated utterance end time,
The data processing method according to claim 14.

In the voice data shift step, when the utterance time of the replacement part voice data is equal to or shorter than the utterance time of the specific part voice data, the utterance start time of the replacement part voice data is shifted so that the utterance end time of the replacement part voice data coincides with the utterance end time of the specific part voice data,
The data processing method according to claim 13.

An operation step of editing a combination of the pre-replacement text data corresponding to the specific part audio data and the post-replacement text data;
The data processing method according to claim 11.

The voice data shift step generates the replacement unit voice data having the same length as the specific unit voice data.
The data processing method according to claim 13.

The voice data replacement step further includes a replacement history storage step of storing a history of replacing the specific unit voice data with the replacement unit voice data.
The data processing method according to claim 11.

A data processing program executed in a data processing device that outputs a recognition result of collected sound data,
The program causing a computer of the data processing device to execute:
A process of replacing the specific unit audio data included in the collected audio data with replacement unit audio data different from the specific unit audio data and outputting the result as voice data for conversion;
A process of transmitting the voice data for conversion to a voice recognition server and receiving text data converted from the voice data for conversion from the voice recognition server;
A process of extracting, from the text data input from the speech recognition server, post-replacement text data corresponding to the replacement unit speech data, replacing the post-replacement text data with pre-replacement text data corresponding to the specific unit speech data, and outputting the result as a recognition result of the collected sound data.