JP2020013169A

JP2020013169A - Terminal device, communication method, and communication program

Info

Publication number: JP2020013169A
Application number: JP2019196136A
Authority: JP
Inventors: Hiroshi Furuta (古田 宏); Hidekazu Hosono (細野 英一)
Original assignee: JVCKenwood Corp
Current assignee: JVCKenwood Corp
Priority date: 2019-10-29
Filing date: 2019-10-29
Publication date: 2020-01-23
Anticipated expiration: 2035-11-20
Also published as: JP6822540B2

Abstract

To provide a technique for acquiring the content of the voice actually output on the receiving party's side.

SOLUTION: An audio signal is transmitted to a terminal device on the receiving side. The result of voice recognition processing, which the receiving-side terminal device executes on the audio signal obtained by reproducing the received audio signal, is received from the receiving-side terminal device. The received result of the voice recognition processing is displayed on a display.

SELECTED DRAWING: Figure 2

Description

The present invention relates to communication technology, and more particularly to a terminal device, a communication method, and a communication program for communicating audio signals.

Voice recognition processing fails when ambient noise is mixed into the voice or when the voice is too quiet. Recognition may also keep failing even when the processing is repeated. In particular, if the operator does not know why recognition failed, the failure is likely to recur. To prevent this, the operator is notified of the reason for the recognition failure (see, for example, Patent Document 1).

Patent Document 1: JP 2000-112497 A

Voice recognition processing is generally performed on an audio signal, that is, on voice data. The voice that the listener actually hears, however, is affected by settings such as whether the equalizer is on or off, the volume level at which the speaker outputs the sound, and whether speech rate conversion is on or off. The voice heard by the listener may therefore differ from the audio signal. Moreover, even when listeners hear sound reproduced from the same audio signal, how it sounds may differ from listener to listener. It is therefore desirable that voice recognition processing be performed in accordance with the situation on the listener's side.

The present invention has been made in view of these circumstances, and an object thereof is to provide a technique for acquiring a voice recognition result obtained in accordance with the situation on the listener's side.

To solve the above problem, a terminal device according to one aspect of the present invention includes: a transmitting unit that transmits an audio signal to a terminal device on the receiving side; a receiving unit that receives, from the receiving-side terminal device, the result of voice recognition processing that the receiving-side terminal device executes on the audio signal obtained by reproducing the received audio signal; and a processing unit that displays the received result of the voice recognition processing on a display unit.

In a terminal device according to another aspect of the present invention, the result of the voice recognition processing reflects, for the audio signal reproduced in the receiving-side terminal device, how the user of the receiving-side terminal device hears.

In a terminal device according to another aspect of the present invention, the receiving unit receives a comparison result from the receiving-side terminal device, which (1) executes voice recognition processing that does not reflect how the user of the receiving-side terminal device hears, and (2) compares the result of that voice recognition processing with the result of voice recognition processing that does reflect how the user hears.

In a terminal device according to another aspect of the present invention, the receiving unit receives, from the receiving-side terminal device, a result obtained by reflecting at least one of the volume level and the speech rate in the voice recognition processing of the receiving-side terminal device.

Another aspect of the present invention is a communication method. This method is a communication method in a terminal device and includes the steps of: transmitting an audio signal to a terminal device on the receiving side; receiving, from the receiving-side terminal device, the result of voice recognition processing executed in the receiving-side terminal device on the audio signal obtained by reproducing the received audio signal; and acquiring the result of the voice recognition processing and displaying it on a display unit.

Another aspect of the present invention is a communication program to be executed by a computer. This communication program performs: processing of transmitting an audio signal to a terminal device on the receiving side; processing of receiving, from the receiving-side terminal device, the result of voice recognition processing executed in the receiving-side terminal device on the audio signal obtained by reproducing the received audio signal; and processing of acquiring the result of the voice recognition processing and displaying it on a display unit.

Any combination of the above components, and any conversion of the expression of the present invention among a method, an apparatus, a system, a recording medium, a computer program, and the like, are also effective as aspects of the present invention.

According to the present invention, a voice recognition result obtained in accordance with the situation on the listener's side can be acquired.

FIG. 1 is a diagram showing the configuration of a communication system according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing the configuration of the terminal device of FIG. 1.
FIGS. 3(a) and 3(b) are diagrams showing screens displayed on the display unit of FIG. 2.
FIG. 4 is a sequence diagram showing a communication procedure performed by the communication system of FIG. 1.
FIGS. 5(a) and 5(b) are diagrams showing screens displayed on the display unit according to Embodiment 1 of the present invention.
FIG. 6 is a sequence diagram showing a communication procedure performed by the communication system according to Embodiment 2 of the present invention.
FIG. 7 is a diagram showing the configuration of a terminal device according to Embodiment 3 of the present invention.
FIGS. 8(a) and 8(b) are diagrams showing screens displayed on the display unit of FIG. 7.
FIG. 9 is a flowchart showing a comparison procedure performed by the comparison unit of FIG. 7.
FIGS. 10(a) to 10(c) are diagrams showing screens displayed on the display unit according to Embodiment 4 of the present invention.
FIG. 11 is a sequence diagram showing a communication procedure performed by the communication system according to Embodiment 4 of the present invention.
FIG. 12 is a flowchart showing an identification procedure performed by the terminal device according to Embodiment 4 of the present invention.
FIG. 13 is a flowchart showing another identification procedure performed by the terminal device according to Embodiment 4 of the present invention.
FIG. 14 is a flowchart showing still another identification procedure performed by the terminal device according to Embodiment 4 of the present invention.

(Embodiment 1)
Before the present invention is described in detail, an overview is given. Embodiment 1 of the present invention relates to a terminal device that performs voice communication by PTT (Push to Talk). The terminal device has a button; transmission of speech starts when the user presses the button and ends when the user releases it. While the button is not pressed, the user only listens to messages. In such PTT communication, the sender's part is complete once the sender has spoken and transmitted, and the sender can learn how the message came across only from the listener's reaction. Even if the sender believes that the spoken content has reached the listener correctly, the voice may not have reached the listener as intended, for example because poor communication conditions introduced heavy noise or because the sender spoke too fast.

To address this, the receiving-side terminal device executes voice recognition processing to convert the received audio signal into text and transmits the resulting data (hereinafter, "text data") to the transmitting-side terminal device. The transmitting-side terminal device displays the text data, and the sender checks it to confirm whether the intended voice was output. However, how a listener hears varies from person to person, so the same voice may sound different to different listeners. The content of the text data may therefore differ from what the listener actually perceived.

To deal with this, the terminal device according to this embodiment, in particular the receiving-side terminal device, executes voice recognition processing using a voice recognition model of the user of that terminal device, that is, of the listener. The content of the text data generated in the terminal device therefore reflects how the listener hears.

FIG. 1 shows the configuration of a communication system 100 according to Embodiment 1 of the present invention. The communication system 100 includes a first terminal device 10a and a second terminal device 10b, collectively referred to as terminal devices 10; a first base station device 12a and a second base station device 12b, collectively referred to as base station devices 12; and a network 14. The communication system 100 corresponds to, for example, a business radio system.

The terminal device 10 is a device capable of communicating by business radio. Known techniques may be used for the business radio, so a description is omitted here. In this example, the first terminal device 10a corresponds to the transmitting side of voice communication by business radio, and the second terminal device 10b corresponds to the receiving side. The first terminal device 10a is thus used by the sender and the second terminal device 10b by the listener. The roles of the first terminal device 10a and the second terminal device 10b may be reversed, and the number of terminal devices 10 is not limited to two.

The base station device 12 is compatible with the business radio system; it is connected at one end to the terminal devices 10 by business radio and at the other end to another base station device 12. The first base station device 12a and the second base station device 12b are installed at different locations. In business radio, a plurality of terminal devices 10 can also form a group. The base station device 12 may assign an uplink channel and a downlink channel to a group. In that case, one terminal device 10 in the group transmits a signal on the uplink channel, and the other terminal devices 10 in the group receive the signal on the downlink channel.

The network 14 connects the first base station device 12a and the second base station device 12b. This connection allows the first terminal device 10a and the second terminal device 10b to communicate via the first base station device 12a, the network 14, and the second base station device 12b. The communication here is assumed to be voice communication by PTT.

FIG. 2 shows the configuration of the terminal device 10. The terminal device 10 includes a button 20, a microphone 22, an operation unit 24, a display unit 26, a processing unit 28, a communication unit 30, a reproduction unit 32, and a speaker 34. The processing unit 28 includes a speech transmitting unit 36 and a setting unit 38, and the communication unit 30 includes a transmitting unit 40 and a receiving unit 42. The terminal device 10 can serve as either the transmitting-side terminal device 10 or the receiving-side terminal device 10. For clarity, the description follows the order of processing: (1) transmitting side, (2) receiving side, and (3) transmitting side.

(1) Transmitting side
Processing in the transmitting-side terminal device 10 is described first. The button 20 corresponds to a PTT button and is pressed by the user to transmit voice by PTT. The button 20 is held down for as long as voice is being transmitted. Pressing the button 20 corresponds to accepting an instruction to transmit an audio signal. While it is held down, the button 20 continues to output the instruction to the speech transmitting unit 36. The microphone 22 picks up sound around the terminal device 10, converts the picked-up sound into an electric signal (hereinafter, "audio signal"), and outputs the audio signal to the speech transmitting unit 36.

While it is receiving the instruction from the button 20, the speech transmitting unit 36 takes in the audio signal from the microphone 22. It converts the audio signal from analog to digital and outputs the digitized audio signal (hereinafter also "audio signal") to the transmitting unit 40. When it is not receiving the instruction from the button 20, the speech transmitting unit 36 does not perform this processing. The transmitting unit 40 receives the audio signal from the speech transmitting unit 36 and transmits it to the receiving-side terminal device 10. To transmit the audio signal, the transmitting unit 40 performs encoding, modulation, frequency conversion, amplification, and the like.
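
The button-gated behavior of unit 36 described above can be illustrated with a minimal Python sketch. The class name `PttGate`, the per-sample interface, and the 16-bit quantization are assumptions made for illustration only; the source does not specify how the analog-to-digital conversion is performed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PttGate:
    """Hypothetical stand-in for unit 36: passes audio on only while PTT is held."""

    forwarded: List[int] = field(default_factory=list)

    def on_sample(self, button_pressed: bool, analog_sample: float) -> None:
        # No instruction from the button 20: the sample is ignored entirely.
        if not button_pressed:
            return
        # Assumed A/D step: clamp to [-1.0, 1.0] and scale to a 16-bit integer.
        clamped = max(-1.0, min(1.0, analog_sample))
        # The digitized sample would be handed to the transmitting unit 40.
        self.forwarded.append(int(round(clamped * 32767)))
```

Samples that arrive while the button is released never reach the list, which mirrors the statement that the processing is not performed without the instruction.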

(2) Receiving side
Processing in the receiving-side terminal device 10, which follows (1), is described next. The receiving unit 42 receives the audio signal from the transmitting-side terminal device 10 and performs amplification, frequency conversion, demodulation, decoding, and the like. The receiving unit 42 outputs the result of this processing (hereinafter also "audio signal") to the reproduction unit 32. The reproduction unit 32 receives the audio signal from the receiving unit 42 and reproduces it. Known techniques may be used to reproduce the audio signal, so a description is omitted here. The reproduction unit 32 outputs the reproduced audio signal to the speaker 34 and to the processing unit 28. The speaker 34 converts the audio signal, which is an electric signal, into sound and outputs the sound.

The processing unit 28 receives the audio signal from the reproduction unit 32. Meanwhile, a voice recognition model of the user of this terminal device 10, that is, of the specific listener, is set in the setting unit 38. The voice recognition model stores, for example, the waveform of the audio signal corresponding to the phoneme "a" (あ). Such a waveform is stored for each phoneme. In particular, the stored phonemes and waveforms are associated with each other in the way that this specific listener perceives them on hearing the sound, so the relationship between them can be regarded as the listener's voice recognition model.

The processing unit 28 executes voice recognition processing on the audio signal using the listener's voice recognition model set in the setting unit 38. Specifically, the processing unit 28 selects, from the voice recognition model, the waveform closest to the waveform of the input audio signal and identifies the sound corresponding to the selected waveform. The voice recognition processing converts the audio signal into text. In this way, the processing unit 28 executes, on the audio signal, voice recognition processing based on the voice recognition model of the user of this terminal device 10, that is, voice recognition processing that reflects how the user hears.
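
As a rough illustration of selecting the closest stored waveform, the following Python sketch maps each input frame to the phoneme whose stored waveform is nearest. The squared-difference distance, the fixed-length frames, and the toy three-sample waveforms are all assumptions; the source states only that the closest waveform is selected, not how closeness is measured.

```python
from typing import Dict, List, Sequence

Waveform = Sequence[float]


def squared_distance(a: Waveform, b: Waveform) -> float:
    """Sum of squared sample differences between two equal-length waveforms."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognize(frames: Sequence[Waveform], model: Dict[str, Waveform]) -> str:
    """Map each input frame to the phoneme whose stored waveform is closest."""
    text: List[str] = []
    for frame in frames:
        best = min(model, key=lambda phoneme: squared_distance(frame, model[phoneme]))
        text.append(best)
    return "".join(text)


# Hypothetical listener-specific model: one stored waveform per phoneme.
LISTENER_MODEL: Dict[str, Waveform] = {
    "a": [1.0, 0.0, 0.0],
    "n": [0.0, 1.0, 0.0],
    "ka": [0.0, 0.0, 1.0],
}
```

Because the stored waveforms belong to one listener, two listeners with different models could turn the same frames into different text, which is the effect the embodiment relies on.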

The listener sets his or her voice recognition model in the setting unit 38 by operating the operation unit 24. For example, the setting unit 38 stores waveforms of test audio signals in advance, and these are reproduced by the reproduction unit 32 and output from the speaker 34. While listening to the sound from the speaker 34, the listener uses the operation unit 24 to enter the sound that he or she perceived. The setting unit 38 sets the listener's voice recognition model on the basis of the correspondence between the waveforms of the test audio signals and the entered sounds.
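
One way to picture this setup step is the Python sketch below, which pairs each test waveform with the sound the listener entered and stores one waveform per entered sound. Averaging the waveforms that received the same label is an assumption made for illustration; the source says only that the model is set from the correspondence between test waveforms and entered sounds.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_listener_model(
    test_waveforms: Sequence[Sequence[float]],
    heard_labels: Sequence[str],
) -> Dict[str, List[float]]:
    """Build a per-listener model from test waveforms and the sounds the listener entered."""
    if len(test_waveforms) != len(heard_labels):
        raise ValueError("each test waveform needs exactly one entered label")

    # Group the test waveforms by the sound the listener reported hearing.
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for waveform, label in zip(test_waveforms, heard_labels):
        grouped[label].append(waveform)

    # Assumed reduction: store the sample-wise mean of each group.
    model: Dict[str, List[float]] = {}
    for label, waves in grouped.items():
        model[label] = [sum(samples) / len(waves) for samples in zip(*waves)]
    return model
```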

The transmitting unit 40 receives from the processing unit 28 the text data produced by the voice recognition processing, as the result of that processing. The transmitting unit 40 transmits the text data to the transmitting-side terminal device 10. The text data may be transmitted on its own or together with an audio signal.

(3) Transmitting side
Processing in the transmitting-side terminal device 10, which follows (2), is described next. The receiving unit 42 receives the text data from the receiving-side terminal device 10 and outputs it to the processing unit 28. The processing unit 28 receives the text data from the receiving unit 42 and displays it on the display unit 26. By checking the text data displayed on the display unit 26, the sender learns how the listener is hearing the speech. FIGS. 3(a) and 3(b) show screens displayed on the display unit 26. FIG. 3(a) shows a case in which the sender says "anzen" (アンゼン) and the listener also hears "anzen". In this case, what the sender said matches what the listener heard. FIG. 3(b), by contrast, shows a case in which the sender says "anzen" and the listener hears "kanzen" (カンゼン). In this case, what the sender said differs from what the listener heard.

In terms of hardware, this configuration can be implemented by the CPU, memory, and other LSIs of any computer; in terms of software, it is implemented by, for example, a program loaded into memory. The functional blocks depicted here are realized by cooperation between the two. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, by software alone, or by a combination of the two.

The operation of the communication system 100 configured as above is now described. FIG. 4 is a sequence diagram showing a communication procedure performed by the communication system 100. On receiving voice input (S10), the first terminal device 10a generates an audio signal (S12) and transmits it (S14). The second terminal device 10b reproduces the audio signal (S16) and outputs the reproduced audio signal from the speaker 34 (S18). The second terminal device 10b executes voice recognition processing with the user's voice recognition model (S20) and generates text data (S22). The second terminal device 10b transmits the text data (S24). The first terminal device 10a displays the text data (S26).
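
The round trip of steps S10 to S26 can be traced with a small Python sketch. It is a simplification under stated assumptions: audio is represented as a plain string, the radio link and base stations are replaced by a direct method call, and the `Terminal` class with its `recognize` callback is hypothetical.

```python
from typing import Callable, List


class Terminal:
    """Hypothetical terminal that can act as either sender or listener side."""

    def __init__(self, recognize: Callable[[str], str]) -> None:
        self.recognize = recognize       # listener-specific recognition (assumed callback)
        self.played: List[str] = []      # what was output from the speaker
        self.displayed: List[str] = []   # what was shown on the display unit

    def receive_audio(self, audio: str) -> str:
        # S16-S18: reproduce the audio signal and output it from the speaker.
        self.played.append(audio)
        # S20-S24: recognize with the listener's model and return the text data.
        return self.recognize(audio)

    def talk(self, peer: "Terminal", audio: str) -> None:
        # S10-S14: voice input, audio signal generation, transmission to the peer.
        text = peer.receive_audio(audio)
        # S26: display the text data returned by the receiving side.
        self.displayed.append(text)
```

With a listener whose recognition turns "anzen" into "kanzen", the sender's display shows "kanzen", which corresponds to the mismatch case of FIG. 3(b).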

According to this embodiment, the audio signal is processed on the basis of the voice recognition model of the user of the terminal device, so the received audio signal can be converted into text in a way that reflects how the user hears. Because the received audio signal is converted into text in this way, accurate information can be conveyed to the sender. Because voice recognition processing is executed on the audio signal using the user's voice recognition model, mishearing by the listener caused by the sender's pronunciation can be identified. For the same reason, mishearing by the listener that arises in the radio section can also be identified.

(Embodiment 2)
Embodiment 2 is described next. Like Embodiment 1, Embodiment 2 of the present invention relates to a terminal device that performs voice communication by PTT and converts an audio signal into text using the listener's voice recognition model. The voice recognition model in Embodiment 1 is built on the waveforms of audio signals as perceived by the listener. The voice recognition model in Embodiment 2, by contrast, is built on the speech rate and the volume level that the listener can recognize. The communication system and terminal device according to Embodiment 2 are of the same type as those in FIGS. 1 and 2. The description here focuses on the differences from the foregoing.

In (2) above, the processing unit 28 receives the audio signal from the reproduction unit 32 and converts it into text by executing voice recognition processing on it. Meanwhile, a voice recognition model of the user of this terminal device 10, that is, of the listener, is set in the setting unit 38. The voice recognition model stores, for example, at least one of a value of the speech rate that the listener can recognize and a value of the volume level that the listener can recognize.

The processing unit 28 derives the value of the speech rate of the audio signal over a fixed period by counting the number of characters in the text data. By comparing the derived speech rate with the speech rate stored in the setting unit 38, the processing unit 28 determines whether the audio signal is at or below the speech rate that the listener can recognize. If the derived speech rate is higher than the speech rate that the listener can recognize, the processing unit 28 converts the characters in the portion of the text data that exceeds that speech rate into masking characters. If the derived speech rate is at or below the speech rate that the listener can recognize, the processing unit 28 leaves the text data unconverted.
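
The speech-rate check can be sketched in Python as follows. The sketch assumes that the recognized text is already split into segments with known durations, that the rate is measured in characters per second, and that `*` is the masking character; none of these details are given in the source.

```python
from typing import Sequence, Tuple


def mask_fast_segments(
    segments: Sequence[Tuple[str, float]],
    max_chars_per_sec: float,
    mask: str = "*",
) -> str:
    """Mask every segment whose speech rate exceeds what the listener can recognize.

    Each segment is (recognized_text, duration_in_seconds); durations must be positive.
    """
    out = []
    for text, seconds in segments:
        rate = len(text) / seconds  # characters counted over the segment's period
        if rate > max_chars_per_sec:
            out.append(mask * len(text))  # too fast: replace each character
        else:
            out.append(text)              # at or below the limit: keep as is
    return "".join(out)
```

For example, with a limit of 8 characters per second, a 5-character segment spoken in one second is kept, while a 7-character segment spoken in half a second is fully masked.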

The processing unit 28 may also derive the value of the volume level of the audio signal over a fixed period. By comparing the derived volume level with the volume level stored in the setting unit 38, the processing unit 28 determines whether the audio signal is at or above the volume level that the listener can recognize. If the derived volume level is lower than the volume level that the listener can recognize, the processing unit 28 converts every character of the text data into a masking character. If the derived volume level is at or above the volume level that the listener can recognize, the processing unit 28 leaves the text data unconverted. In this way, at least one of the volume level and the speech rate is reflected in the voice recognition processing in the processing unit 28. The listener's voice recognition model is set in the setting unit 38 by operating the operation unit 24. What is set is at least one of the value of the speech rate that the listener can recognize and the value of the volume level that the listener can recognize.
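
A minimal Python sketch of the volume-level check follows. Using the root-mean-square of the samples as the volume level is an assumption, as is the `*` masking character; the source does not say how the level is derived.

```python
import math
from typing import Sequence


def rms_level(samples: Sequence[float]) -> float:
    """Root-mean-square level of a non-empty block of samples (assumed volume measure)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mask_if_quiet(text: str, samples: Sequence[float], min_level: float, mask: str = "*") -> str:
    """Mask all characters when the audio is quieter than the listener can recognize."""
    if rms_level(samples) < min_level:
        return mask * len(text)  # below the listener's threshold: mask everything
    return text                  # at or above the threshold: leave the text as is
```

Unlike the speech-rate case, the whole text is masked here, matching the description that every character is converted when the level is too low.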

The transmitting unit 40 receives the text data from the processing unit 28 and transmits it to the transmitting-side terminal device 10. As described above, when the speech rate is higher than the value that the listener can recognize, or when the volume level is lower than the value that the listener can recognize, at least some characters of the text data have been converted into masking characters. This is equivalent to the transmitting unit 40 transmitting the result of determining whether the speech rate is at or below what the listener can recognize, or the result of determining whether the volume level is at or above what the listener can recognize.

前述の（３）において、受信部４２は、受信側の端末装置１０からのテキストデータを
受信し、処理部２８は、テキストデータを表示部２６に表示する。図５（ａ）−（ｂ）は
、表示部２６に表示される画面を示す。図５（ａ）は、受信側の端末装置１０において再
生される音声信号の音声速度の値が、受話者が認識可能な音声速度の値よりも大きい場合
を示す。この場合、一部の文字が伏せ字によって示される。一方、図５（ｂ）は、受信側
の端末装置１０において再生される音声信号の音量レベルの値が、受話者が認識可能な音
量レベルの値よりも小さい場合を示す。この場合、すべての文字が伏せ字によって示され
る。 In (3) described above, the receiving unit 42 receives the text data from the terminal device 10 on the receiving side, and the processing unit 28 displays the text data on the display unit 26. FIGS. 5(a) and 5(b) show screens displayed on the display unit 26. FIG. 5(a) shows a case where the voice speed value of the audio signal reproduced in the terminal device 10 on the receiving side is greater than the voice speed value that the listener can recognize. In this case, some characters are shown as hidden characters. FIG. 5(b), on the other hand, shows a case where the volume level value of the audio signal reproduced in the terminal device 10 on the receiving side is smaller than the volume level value that the listener can recognize. In this case, all characters are shown as hidden characters.

以上の構成による通信システム１００の動作を説明する。図６は、本発明の実施例２に
係る通信システム１００による通信手順を示すシーケンス図である。第１端末装置１０ａ
は、音声を入力する（Ｓ５０）と、音声信号を生成する（Ｓ５２）。第１端末装置１０ａ
は、音声信号を送信する（Ｓ５４）。第２端末装置１０ｂは、音声信号を再生し（Ｓ５６
）、再生した音声信号をスピーカ３４から出力する（Ｓ５８）。第２端末装置１０ｂは、
音声認識処理を実行する（Ｓ６０）とともに、音声速度、音量レベルによる評価を実行す
る（Ｓ６２）ことによって、テキストデータを生成する（Ｓ６４）。第２端末装置１０ｂ
は、テキストデータを送信する（Ｓ６６）。第１端末装置１０ａは、テキストデータを表
示する（Ｓ６８）。 The operation of the communication system 100 having the above configuration will be described. FIG. 6 is a sequence diagram illustrating a communication procedure by the communication system 100 according to the second embodiment of the present invention. When voice is input to the first terminal device 10a (S50), the first terminal device 10a generates an audio signal (S52) and transmits the audio signal (S54). The second terminal device 10b reproduces the audio signal (S56) and outputs the reproduced audio signal from the speaker 34 (S58). The second terminal device 10b executes the voice recognition processing (S60) and performs the evaluation based on the voice speed and the volume level (S62), thereby generating text data (S64). The second terminal device 10b transmits the text data (S66). The first terminal device 10a displays the text data (S68).

本実施例によれば、音声信号に対して、ユーザが認識可能な音声速度以下であるかの判
定処理を実行するので、音声速度のために聞きづらいかを判定できる。また、音声速度の
ために聞きづらいことをテキスト化に反映できる。また、音声信号に対して、ユーザが認
識可能な音量レベル以上であるかの判定処理を実行するので、音量レベルのために聞きづ
らいかを判定できる。また、音量レベルのために聞きづらいことをテキスト化に反映でき
る。 According to the present embodiment, a process is executed to determine whether the audio signal is at or below the voice speed that the user can recognize, so it is possible to determine whether the audio is hard to hear because of the voice speed. The fact that the audio is hard to hear because of the voice speed can also be reflected in the text. Likewise, a process is executed to determine whether the audio signal is at or above the volume level that the user can recognize, so it is possible to determine whether the audio is hard to hear because of the volume level. The fact that the audio is hard to hear because of the volume level can also be reflected in the text.

（実施例３）
次に、実施例３を説明する。本発明の実施例３も、これまでと同様に、ＰＴＴによる音
声通信を実行する端末装置であって、かつ受話者の音声認識モデルを使用して音声信号を
テキスト化する端末装置に関する。実施例３では、音声信号をテキスト化するだけではな
く、受話者が音声を聞いている状況を推測可能な情報を送信側の端末装置に通知する。実
施例３に係る通信システムは、図１と同様のタイプである。ここでは、これまでとの差異
を中心に説明する。 (Example 3)
Next, a third embodiment will be described. Like the preceding embodiments, the third embodiment of the present invention relates to a terminal device that executes voice communication using PTT and converts an audio signal into text using the listener's voice recognition model. In the third embodiment, the audio signal is not only converted into text; information from which the situation in which the listener is hearing the audio can be inferred is also notified to the terminal device on the transmitting side. The communication system according to the third embodiment is of the same type as that in FIG. 1. Here, the description focuses on the differences from the preceding embodiments.

図７は、本発明の実施例３に係る端末装置１０の構成を示す。端末装置１０における処
理部２８は、図２と比較して、比較部４６をさらに含む。前述の（２）において、処理部
２８は、再生部３２からの音声信号を入力する。処理部２８は、実施例１と同様に、音声
信号に対して、本端末装置１０を使用するユーザの音声認識モデルにもとづく音声認識処
理、つまりユーザの聞こえ方を反映した音声認識処理を実行する。その結果、音声信号は
テキスト化（以下、テキスト化された音声信号を「第１テキスト」という）される。 FIG. 7 illustrates a configuration of the terminal device 10 according to the third embodiment of the present invention. Compared with FIG. 2, the processing unit 28 in the terminal device 10 further includes a comparing unit 46. In (2) described above, the processing unit 28 receives the audio signal from the reproduction unit 32. As in the first embodiment, the processing unit 28 executes, on the audio signal, voice recognition processing based on the voice recognition model of the user who uses the terminal device 10, that is, voice recognition processing that reflects how the user hears. As a result, the audio signal is converted into text (hereinafter, the audio signal converted into text is referred to as the "first text").

その際、処理部２８は、音声認識処理において認識不可能な音素が存在するかを判定し
てもよい。例えば、入力した音声信号の１音素の波形と、当該１音素の波形に最も近い波
形との相関値が予め定められた値よりも小さい場合に、当該１音素が認識不可能な音素と
判定される。処理部２８は、第１テキストにおいて、認識不可能な音素を伏せ字に変換す
る。なお、伏せ字ではなく、別の予め定められた文字に変換されてもよく、「認識不可能
な音素あり」とのメッセージが、第１テキストに追加されてもよい。 At this time, the processing unit 28 may determine whether an unrecognizable phoneme exists in the voice recognition processing. For example, when the correlation value between the waveform of one phoneme of the input audio signal and the waveform closest to it is smaller than a predetermined value, that phoneme is determined to be unrecognizable. The processing unit 28 converts an unrecognizable phoneme into a hidden character in the first text. Instead of a hidden character, the phoneme may be converted into another predetermined character, or a message stating "unrecognizable phoneme present" may be added to the first text.
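The correlation test for unrecognizable phonemes can be sketched as below. The source gives only the rule (best correlation below a predetermined value means unrecognizable); the normalized-correlation formula, the template dictionary, and the names `correlation`, `recognize_phoneme`, and `HIDDEN_CHAR` are assumptions made for illustration, and the waveforms are assumed to have equal length.

```python
import math

# Assumed placeholder glyph for a hidden character.
HIDDEN_CHAR = "●"


def correlation(a, b):
    """Normalized correlation of two equal-length waveforms (0.0 if silent)."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def recognize_phoneme(segment, templates, min_correlation):
    """Return the label of the closest template, or a hidden character.

    `templates` maps a phoneme label to its stored waveform. The phoneme is
    treated as unrecognizable when even the closest template correlates
    below `min_correlation`.
    """
    best_label, best_value = None, -1.0
    for label, waveform in templates.items():
        value = correlation(segment, waveform)
        if value > best_value:
            best_label, best_value = label, value
    if best_label is None or best_value < min_correlation:
        return HIDDEN_CHAR
    return best_label
```

Swapping the template dictionary is one way to picture the difference between a user-specific model and a standard model, though the source does not describe the models at that level of detail.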

一方、処理部２８は、本端末装置１０を使用するユーザに特定されない音声認識モデル
、つまり標準的な音声認識モデルも記憶する。標準的な音声認識モデルでは、例えば、音
「あ」に対応した音声信号の標準的な波形が記憶されている。処理部２８は、音声信号に
対して、標準的な音声認識モデルにもとづく音声認識処理、つまり、ユーザの聞こえ方を
未反映のままの音声認識処理も実行する。ここでも、音声信号はテキスト化（以下、テキ
スト化された音声信号を「第２テキスト」という）される。なお、処理部２８は、第１テ
キストの場合と同様に、第２テキストにおいても、認識不可能な音素を伏せ字等に変換し
てもよい。 On the other hand, the processing unit 28 also stores a voice recognition model that is not specific to the user of the terminal device 10, that is, a standard voice recognition model. In the standard voice recognition model, for example, a standard waveform of the audio signal corresponding to the sound "a" is stored. The processing unit 28 also executes, on the audio signal, voice recognition processing based on the standard voice recognition model, that is, voice recognition processing that does not reflect how the user hears. Here too, the audio signal is converted into text (hereinafter, the audio signal converted into text is referred to as the "second text"). As with the first text, the processing unit 28 may convert unrecognizable phonemes in the second text into hidden characters or the like.

比較部４６は、第１テキストと第２テキストとを入力する。比較部４６は、第１テキス
トと第２テキストとを比較する。ここでは、比較として、第１テキストと第２テキストと
が並べられる。比較部４６は、第１テキストと第２テキストとを並べたテキストデータを
送信部４０に出力する。送信部４０は、処理部２８からのテキストデータを入力する。送
信部４０は、比較結果であるテキストデータを送信側の端末装置１０に送信する。 The comparison unit 46 inputs the first text and the second text. The comparing unit 46 compares the first text with the second text. Here, as a comparison, the first text and the second text are arranged. The comparing unit 46 outputs to the transmitting unit 40 text data in which the first text and the second text are arranged. The transmission unit 40 inputs the text data from the processing unit 28. The transmitting unit 40 transmits the text data as the comparison result to the terminal device 10 on the transmitting side.

前述の（３）において、受信部４２は、受信側の端末装置１０からのテキストデータを
受信し、処理部２８は、テキストデータを表示部２６に表示する。図８（ａ）−（ｂ）は
、表示部２６に表示される画面を示す。画面の上側には、「受話者音声認識」の場合とし
て、第１テキストが示され、画面の下側には、「標準音声認識」の場合として、第２テキ
ストが示される。図８（ａ）では、第２テキストにおいて認識不可能な音素がないにもか
かわらず、第１テキストにおいて認識不可能な音素がある場合を示す。これは、標準的な
音声認識モデルによって、発話者が発した音声に対応した音声信号を音声認識処理した場
合、すべて認識されるが、受話者の音声認識モデルによって音声認識処理した場合、認識
不可能な音素が存在することに相当する。つまり、受話者の聞こえ方によって音声が認識
されていないといえる。 In (3) described above, the receiving unit 42 receives the text data from the terminal device 10 on the receiving side, and the processing unit 28 displays the text data on the display unit 26. FIGS. 8(a) and 8(b) show screens displayed on the display unit 26. The upper part of the screen shows the first text under "listener voice recognition", and the lower part shows the second text under "standard voice recognition". FIG. 8(a) shows a case where the first text contains an unrecognizable phoneme even though the second text contains none. This corresponds to the case where every phoneme is recognized when the audio signal corresponding to the voice uttered by the speaker is processed with the standard voice recognition model, but an unrecognizable phoneme exists when it is processed with the listener's voice recognition model. In other words, the voice is not being recognized because of how the listener hears it.

図８（ｂ）では、第１テキストと第２テキストとのいずれにおいても、認識不可能な音
素がある場合を示す。これは、標準的な音声認識モデルと受話者の音声認識モデルのいず
れによって、発話者が発した音声に対応した音声信号を音声認識処理した場合、認識不可
能な音素が存在することに相当する。この場合、例えば、第１端末装置１０ａと第１基地
局装置１２ａとの間の無線区間、あるいは第２端末装置１０ｂと第２基地局装置１２ｂと
の無線区間の品質が悪化していることが推定される。 FIG. 8(b) shows a case where both the first text and the second text contain an unrecognizable phoneme. This corresponds to the case where an unrecognizable phoneme exists whichever of the standard voice recognition model and the listener's voice recognition model is used to process the audio signal corresponding to the voice uttered by the speaker. In this case, it is presumed, for example, that the quality of the radio section between the first terminal device 10a and the first base station device 12a, or of the radio section between the second terminal device 10b and the second base station device 12b, has deteriorated.
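The reading of FIGS. 8(a) and 8(b) can be sketched as a small classification over the two texts. The function name `diagnose`, the returned labels, and the `HIDDEN_CHAR` glyph are hypothetical; the source itself only arranges the two texts side by side and leaves the interpretation to the sender.

```python
# Assumed placeholder glyph marking an unrecognizable phoneme.
HIDDEN_CHAR = "●"


def diagnose(first_text, second_text):
    """Classify where recognition failed from the two recognition results.

    first_text  -- result with the listener's voice recognition model
    second_text -- result with the standard voice recognition model
    """
    first_failed = HIDDEN_CHAR in first_text
    second_failed = HIDDEN_CHAR in second_text
    if first_failed and not second_failed:
        # FIG. 8(a): only the listener's model fails.
        return "listener-specific"
    if first_failed and second_failed:
        # FIG. 8(b): both fail, e.g. degraded radio section or the speech.
        return "speech-or-communication"
    if not first_failed and not second_failed:
        return "none"
    # Only the standard model fails; the source does not discuss this case.
    return "undetermined"
```

For example, `diagnose("●ンゼン", "アンゼン")` yields `"listener-specific"`.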

以上の構成による通信システム１００の動作を説明する。図９は、比較部４６による比
較手順を示すフローチャートである。比較部４６は、ユーザの音声認識モデルにもとづく
音声認識処理の結果を取得する（Ｓ８０）。一方、比較部４６は、標準的な音声認識モデ
ルにもとづく音声認識処理の結果を取得する（Ｓ８２）。比較部４６は、比較として両方
の音声認識処理の結果を並べる（Ｓ８４）。 The operation of the communication system 100 having the above configuration will be described. FIG. 9 is a flowchart illustrating a comparison procedure performed by the comparison unit 46. The comparing unit 46 acquires the result of the voice recognition processing based on the voice recognition model of the user (S80). On the other hand, the comparing unit 46 acquires the result of the voice recognition processing based on the standard voice recognition model (S82). The comparing unit 46 arranges the results of both voice recognition processes as a comparison (S84).

本実施例によれば、ユーザの音声認識モデルにもとづく音声認識処理の結果と、標準的
な音声認識モデルにもとづく音声認識処理の結果とを比較するので、どの段階で認識不可
能な音素が発生するかを特定できる。また、ユーザの音声認識モデルにもとづく音声認識
処理の結果に認識不可能な音素が存在し、標準的な音声認識モデルにもとづく音声認識処
理の結果に認識不可能な音素が存在しない場合、特定のユーザだけが聞き取れないことを
認識できる。また、ユーザの音声認識モデルにもとづく音声認識処理の結果と、標準的な
音声認識モデルにもとづく音声認識処理の結果とに認識不可能な音素が存在する場合、発
話あるいは通信の段階に原因があることを認識できる。 According to the present embodiment, the result of the voice recognition processing based on the user's voice recognition model is compared with the result of the voice recognition processing based on the standard voice recognition model, so it is possible to identify at which stage an unrecognizable phoneme arises. In addition, if an unrecognizable phoneme exists in the result based on the user's voice recognition model and none exists in the result based on the standard voice recognition model, it can be recognized that only the specific user cannot make out the audio. Also, if an unrecognizable phoneme exists in both results, it can be recognized that the cause lies at the speaking or communication stage.

（実施例４）
次に、実施例４を説明する。本発明の実施例４も、これまでと同様に、ＰＴＴによる音
声通信を実行する端末装置であって、かつ受信した音声信号をテキスト化する端末装置に
関する。受信側の端末装置において、イコライザのオン／オフ、スピーカから出力される
際の音量レベル、話速変換のオン／オフの設定がなされる場合がある。その際、そのよう
な設定に応じて処理された音声がスピーカから出力される。つまり、このような処理によ
って、実際にスピーカから出力される音声と、テキストデータの内容とが異なるおそれが
ある。 (Example 4)
Next, a fourth embodiment will be described. Like the preceding embodiments, the fourth embodiment of the present invention relates to a terminal device that executes voice communication using PTT and converts a received audio signal into text. In the terminal device on the receiving side, settings such as equalizer on/off, the volume level at output from the speaker, and speech speed conversion on/off may be made. In that case, audio processed according to those settings is output from the speaker. As a result of such processing, the audio actually output from the speaker may differ from the content of the text data.

これに対応するために、本実施例に係る端末装置、特に受信側の端末装置は、当該端末
装置になされた設定に応じて、音声認識処理を実行する。そのため、端末装置において生
成されるテキストデータの内容は、設定に応じた処理を反映している。実施例４に係る通
信システム、端末装置は、図１、図２と同様のタイプである。ここでは、これまでとの差
異を中心に説明する。 To address this, the terminal device according to the present embodiment, in particular the terminal device on the receiving side, executes voice recognition processing in accordance with the settings made on that terminal device. The content of the text data generated in the terminal device therefore reflects the processing performed according to the settings. The communication system and the terminal device according to the fourth embodiment are of the same type as those in FIGS. 1 and 2. Here, the description focuses on the differences from the preceding embodiments.

前述の（２）において、端末装置１０の設定部３８には、音声の出力に関して、さまざ
まな設定がなされる。この設定によって、受信部４２において受信した音声信号を再生す
る際に使用すべき設定値が登録される。設定部３８においてなされる設定のうちの１つは
、イコライザのオン／オフである。イコライザは、音声信号の周波数特性を変更する処理
である。イコライザをオンにした場合、音声信号の特定の周波数帯域（倍音成分や高調波
成分あるいはノイズ成分）を強調したり、減少したりすることが可能になる。また、設定
部３８においてなされる設定のうちの別の１つは、音声速度変換のオン／オフである。音
声速度変換は、音声の再生速度を高速にしたり、低速にしたりする処理である。さらに、
設定部３８においてなされる設定のうちのさらに別の１つは、音量レベルの調節である。
音量レベルは、スピーカ３４から音声を出力する際のボリュームである。これらの設定は
、操作部２４を操作することによってなされる。また、設定部３８には、これらの設定の
すべてがなされている必要はなく、少なくとも１つの設定がなされていればよい。 In (2) described above, various settings relating to audio output are made in the setting unit 38 of the terminal device 10. These settings register the setting values to be used when the audio signal received by the receiving unit 42 is reproduced. One of the settings made in the setting unit 38 is equalizer on/off. The equalizer is processing that changes the frequency characteristics of the audio signal. When the equalizer is turned on, a specific frequency band of the audio signal (overtone components, harmonic components, or noise components) can be emphasized or attenuated. Another of the settings made in the setting unit 38 is voice speed conversion on/off. Voice speed conversion is processing that speeds up or slows down audio reproduction. Yet another of the settings made in the setting unit 38 is adjustment of the volume level. The volume level is the volume at which audio is output from the speaker 34. These settings are made by operating the operation unit 24. Not all of these settings need to be made in the setting unit 38; it is sufficient that at least one is made.

再生部３２は、受信部４２からの音声信号を入力し、音声信号を再生する。その際、設
定部３８においてなされた設定値が反映される。例えば、設定部３８においてイコライザ
がオンにされている場合、再生部３２は、イコライザ処理を実行しながら音声信号を再生
する。一方、設定部３８においてイコライザがオフにされている場合、再生部３２は、イ
コライザ処理を実行せずに音声信号を再生する。 The reproducing unit 32 receives the audio signal from the receiving unit 42 and reproduces the audio signal. At that time, the set value set by the setting unit 38 is reflected. For example, when the equalizer is turned on in the setting unit 38, the reproducing unit 32 reproduces the audio signal while performing the equalizer processing. On the other hand, when the equalizer is turned off in the setting unit 38, the reproducing unit 32 reproduces the audio signal without performing the equalizer processing.

また、設定部３８において音声速度変換がオンにされている場合、再生部３２は、音声
速度を変換しながら音声信号を再生する。なお、音声速度は、２倍、３倍、１／２倍のよ
うに設定されている。一方、設定部３８において音声速度変換がオフにされている場合、
再生部３２は、音声速度を変換せずに音声信号を再生する。さらに、設定部３８において
設定されている音量レベルにおいて、再生部３２は、音声信号を再生する際の音量レベル
を調節する。なお、イコライザ処理、音声速度変換、音量レベルの調節、音声信号の再生
には公知の技術が使用されればよいので、ここでは説明を省略する。前述のごとく、設定
部３８においてこれらの設定のすべてがなされている必要はなく、少なくとも１つの設定
がなされていればよいので、再生部３２は、少なくとも１つの設定を使用すればよい。再
生部３２は、再生した音声信号をスピーカ３４と処理部２８に出力する。スピーカ３４は
、電気信号である音声信号を音声に変換し、音声を出力する。 When voice speed conversion is turned on in the setting unit 38, the reproduction unit 32 reproduces the audio signal while converting the voice speed. The voice speed is set to, for example, 2 times, 3 times, or 1/2 times. When voice speed conversion is turned off in the setting unit 38, the reproduction unit 32 reproduces the audio signal without converting the voice speed. Further, the reproduction unit 32 adjusts the volume level used when reproducing the audio signal to the volume level set in the setting unit 38. Known techniques may be used for the equalizer processing, the voice speed conversion, the adjustment of the volume level, and the reproduction of the audio signal, so their description is omitted here. As described above, not all of these settings need to be made in the setting unit 38, and it is sufficient that at least one is made, so the reproduction unit 32 may use at least one setting. The reproduction unit 32 outputs the reproduced audio signal to the speaker 34 and the processing unit 28. The speaker 34 converts the audio signal, which is an electric signal, into sound and outputs the sound.
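How the reproduction step might apply whichever settings are present can be sketched as follows. The source leaves the actual signal processing to known techniques, so everything here is a stand-in chosen for brevity: the equalizer is a first-order high-frequency emphasis, the speed conversion is naive index resampling (which also shifts pitch), and the names `PlaybackSettings` and `reproduce` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlaybackSettings:
    equalizer_on: bool = False
    speed_factor: Optional[float] = None  # None means speed conversion is off
    volume_gain: float = 1.0


def reproduce(samples: List[float], settings: PlaybackSettings) -> List[float]:
    """Apply only the settings that are enabled, then return the samples."""
    out = list(samples)
    if settings.equalizer_on and out:
        # Stand-in equalizer: emphasize high frequencies (first-order filter).
        out = [out[0]] + [out[i] - 0.5 * out[i - 1] for i in range(1, len(out))]
    if settings.speed_factor is not None and out:
        # Stand-in speed conversion: pick every speed_factor-th sample.
        count = int(len(out) / settings.speed_factor)
        last = len(out) - 1
        out = [out[min(int(i * settings.speed_factor), last)] for i in range(count)]
    # Volume adjustment is a plain gain.
    return [s * settings.volume_gain for s in out]
```

The point of the sketch is the data flow: the same processed samples would go both to the speaker and to the recognizer, so the text reflects what was actually output.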

処理部２８は、再生部３２からの音声信号を入力する。処理部２８は、前述の標準的な
音声認識モデルにもとづいて、音声信号に対して音声認識処理を実行する。音声認識処理
によって、音声信号がテキスト化される。さらに、処理部２８は、音声認識処理において
認識不可能な音素が存在する場合に、その理由（以下、「エラーの理由」という）を特定
してもよい。ここでは、エラーの理由として、（Ａ）音声認識処理において認識不可能な
音素が存在するか、（Ｂ）再生した音声信号での音声速度がしきい値より大きいか、（Ｃ
）再生した音声信号での音量レベルがしきい値より小さいかが特定される。なお、（Ｂ）
と（Ｃ）におけるしきい値は別の値でもよい。以下では、これらの処理を順に説明する。 The processing unit 28 receives the audio signal from the reproduction unit 32. The processing unit 28 executes voice recognition processing on the audio signal based on the standard voice recognition model described above. The voice recognition processing converts the audio signal into text. Further, when an unrecognizable phoneme exists in the voice recognition processing, the processing unit 28 may identify the reason (hereinafter referred to as the "reason for the error"). Here, the following are identified as the reason for the error: (A) whether an unrecognizable phoneme exists in the voice recognition processing, (B) whether the voice speed of the reproduced audio signal is greater than a threshold, and (C) whether the volume level of the reproduced audio signal is smaller than a threshold. The thresholds in (B) and (C) may be different values. These processes are described in order below.

（Ａ）音声認識処理において認識不可能な音素が存在するか
処理部２８は、音声認識処理において認識不可能な音素が存在するかを判定する。例え
ば、入力した音声信号の１音素の波形と、当該１音素の波形に最も近い波形との相関値が
予め定められた値よりも小さい場合に、当該１音素が認識不可能な音素と判定される。処
理部２８は、テキスト化したデータにおいて、認識不可能な音素を伏せ字に変換する。な
お、伏せ字ではなく、別の予め定められた文字に変換されてもよく、「認識不可能な音素
あり」とのメッセージが、テキスト化したデータに追加されてもよい。つまり、処理部２
８は、本端末装置１０の設定を反映させながら、再生部３２において再生した音声信号に
対して実行された音声認識処理において認識不可能な音素が存在するかを判定することに
よって、音声信号におけるエラーの理由を特定する。 (A) Whether an unrecognizable phoneme exists in the voice recognition processing
The processing unit 28 determines whether an unrecognizable phoneme exists in the voice recognition processing. For example, when the correlation value between the waveform of one phoneme of the input audio signal and the waveform closest to it is smaller than a predetermined value, that phoneme is determined to be unrecognizable. The processing unit 28 converts an unrecognizable phoneme into a hidden character in the text data. Instead of a hidden character, the phoneme may be converted into another predetermined character, or a message stating "unrecognizable phoneme present" may be added to the text data. In other words, the processing unit 28 identifies the reason for the error in the audio signal by determining whether an unrecognizable phoneme exists in the voice recognition processing executed on the audio signal that the reproduction unit 32 reproduced while reflecting the settings of the terminal device 10.

（Ｂ）再生した音声信号での音声速度がしきい値より大きいか
処理部２８は、実施例２と同様に、テキスト化したデータの文字数を数えることによっ
て、一定期間における音声信号の音声速度の値を導出する。処理部２８は、導出した音声
速度の値と、予め記憶したしきい値とを比較することによって、音声信号での音声速度が
しきい値より大きいかの判定処理を実行する。処理部２８は、音声速度の値がしきい値よ
りも大きければ、テキスト化したデータのうち、しきい値よりも大きい部分の文字を伏せ
字に変換する。さらに、処理部２８は、テキスト化したデータに、音声速度が速すぎるこ
とを示すためのメッセージを追加してもよい。なお、処理部２８は、音声速度の値がしき
い値以下であれば、テキスト化したデータに対する変換を実行しない。 (B) Whether the voice speed of the reproduced audio signal is greater than the threshold
As in the second embodiment, the processing unit 28 derives the voice speed value of the audio signal over a certain period by counting the number of characters in the text data. The processing unit 28 compares the derived voice speed value with a threshold stored in advance, and thereby determines whether the voice speed of the audio signal is greater than the threshold. If the voice speed value is greater than the threshold, the processing unit 28 converts into hidden characters the characters in the portion of the text data where the voice speed exceeds the threshold. Further, the processing unit 28 may add to the text data a message indicating that the voice speed is too high. If the voice speed value is equal to or less than the threshold, the processing unit 28 does not convert the text data.

（Ｃ）再生した音声信号での音量レベルがしきい値より小さいか
処理部２８は、実施例２と同様に、一定期間における音声信号の音量レベルの値を導出
する。処理部２８は、導出した音量レベルの値と、しきい値とを比較することによって、
音声信号での音量レベルがしきい値より小さいかの判定処理を実行する。処理部２８は、
音量レベルの値がしきい値よりも小さければ、テキスト化したデータの各文字を伏せ字に
変換する。さらに、処理部２８は、テキスト化したデータに、音量レベルが小さすぎるこ
とを示すためのメッセージを追加してもよい。なお、処理部２８は、音量レベルの値がし
きい値以上であれば、テキスト化したデータに対する変換を実行しない。 (C) Whether the volume level of the reproduced audio signal is smaller than the threshold
As in the second embodiment, the processing unit 28 derives the volume level value of the audio signal over a certain period. The processing unit 28 compares the derived volume level value with a threshold, and thereby determines whether the volume level of the audio signal is smaller than the threshold. If the volume level value is smaller than the threshold, the processing unit 28 converts each character of the text data into a hidden character. Further, the processing unit 28 may add to the text data a message indicating that the volume level is too low. If the volume level value is equal to or greater than the threshold, the processing unit 28 does not convert the text data.
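Checks (A) to (C) can be combined in one sketch that evaluates a single period of recognized text. Several points are assumptions rather than details from the source: voice speed is taken as characters per second, the "portion exceeding the threshold" is read as the characters beyond what the threshold allows in the period, and the names `evaluate_period` and `HIDDEN_CHAR` and the reason strings are hypothetical.

```python
# Assumed placeholder glyph for a hidden character.
HIDDEN_CHAR = "●"


def evaluate_period(text, duration_s, level, max_chars_per_s, min_level):
    """Return (text, reasons) for one period of recognized text.

    text            -- recognition result, possibly already containing
                       hidden characters for unrecognizable phonemes
    duration_s      -- length of the period in seconds
    level           -- derived volume level for the period
    max_chars_per_s -- voice speed threshold for check (B)
    min_level       -- volume level threshold for check (C)
    """
    reasons = []
    # (A) An unrecognizable phoneme was already marked by the recognizer.
    if HIDDEN_CHAR in text:
        reasons.append("unrecognizable phoneme")
    # (B) Voice speed above the threshold: hide the excess characters.
    if duration_s > 0 and len(text) / duration_s > max_chars_per_s:
        reasons.append("voice speed above threshold")
        allowed = int(max_chars_per_s * duration_s)
        text = text[:allowed] + HIDDEN_CHAR * (len(text) - allowed)
    # (C) Volume level below the threshold: hide every character.
    if level < min_level:
        reasons.append("volume level below threshold")
        text = HIDDEN_CHAR * len(text)
    return text, reasons
```

Check (A) runs before any masking so that hidden characters introduced by (B) or (C) are not misreported as unrecognizable phonemes.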

送信部４０は、処理部２８から、テキストデータを入力する。このテキストデータには
、処理部２８において特定したエラーの理由が含まれてもよい。送信部４０は、テキスト
データを送信側の端末装置１０に送信する。エラーの理由が含まれる場合、伏せ字によっ
て、認識不可能な音素の存在が示される。また、音声速度がしきい値より大きいことが示
されたり、音量レベルがしきい値より小さいことが示されたりする。なお、処理部２８が
エラーの理由を特定するための処理を実行しない場合、テキストデータには、エラーの理
由が含まれない。 The transmitting unit 40 inputs text data from the processing unit 28. The text data may include the reason for the error specified by the processing unit 28. The transmitting unit 40 transmits the text data to the terminal device 10 on the transmitting side. If the reason for the error is included, the hidden character indicates the presence of an unrecognizable phoneme. Also, it is indicated that the voice speed is higher than the threshold value or that the volume level is lower than the threshold value. If the processing unit 28 does not execute the process for specifying the reason for the error, the text data does not include the reason for the error.

前述の（３）において、受信部４２は、受信側の端末装置１０からのテキストデータを
受信する。受信部４２は、テキストデータを処理部２８に出力する。処理部２８は、受信
部４２からのテキストデータを入力し、テキストデータを表示部２６に表示する。テキス
トデータにエラーの理由が含まれない場合、表示部２６は、実施例１と同様にテキストデ
ータを表示する。送話者は、表示部２６に表示されたテキストデータを確認することによ
って、受話者がどのように聞き取っているかを認識する。 In the above (3), the receiving unit 42 receives text data from the terminal device 10 on the receiving side. The receiving unit 42 outputs the text data to the processing unit 28. The processing unit 28 receives the text data from the receiving unit 42 and displays the text data on the display unit 26. When the reason for the error is not included in the text data, the display unit 26 displays the text data as in the first embodiment. The sender recognizes how the receiver is listening by checking the text data displayed on the display unit 26.

一方、以下では、テキストデータにエラーの理由が含まれている場合を説明する。図１
０（ａ）−（ｃ）は、本発明の実施例４に係る表示部２６に表示される画面を示す。図１
０（ａ）は、送話者が「アンゼン」と話しているが、「ア」が、認識不可能な音素とされ
ている場合を示す。この場合、受話者は、例えば、「カンゼン」と聞き取っている可能性
がある。図１０（ｂ）は、音声速度の値がしきい値よりも大きい場合を示す。この場合、
一部の音素が伏せ字によって示されるとともに、メッセージが表示される。一方、図１０
（ｃ）は、音量レベルの値がしきい値よりも小さい場合を示す。この場合、すべての音素
が伏せ字によって示されるとともに、メッセージが表示される。 On the other hand, the case where the text data includes the reason for the error is described below. FIGS. 10(a) to 10(c) show screens displayed on the display unit 26 according to the fourth embodiment of the present invention. FIG. 10(a) shows a case where the sender says "anzen" but "a" is treated as an unrecognizable phoneme. In this case, the listener may have heard, for example, "kanzen". FIG. 10(b) shows a case where the voice speed value is greater than the threshold. In this case, some phonemes are shown as hidden characters and a message is displayed. FIG. 10(c), on the other hand, shows a case where the volume level value is smaller than the threshold. In this case, all phonemes are shown as hidden characters and a message is displayed.

以上の構成による通信システム１００の動作を説明する。図１１は、本発明の実施例４
に係る通信システム１００による通信手順を示すシーケンス図である。第１端末装置１０
ａは、音声を入力する（Ｓ１１０）と、音声信号を生成する（Ｓ１１２）。第１端末装置
１０ａは、音声信号を送信する（Ｓ１１４）。第２端末装置１０ｂは、音声信号を再生し
（Ｓ１１６）、再生した音声信号をスピーカ３４から出力する（Ｓ１１８）。第２端末装
置１０ｂは、音声認識処理を実行し（Ｓ１２０）、エラーの理由を特定する（Ｓ１２２）
。また、第２端末装置１０ｂは、テキストデータ、エラーの理由を生成する（Ｓ１２４）
。第２端末装置１０ｂは、テキストデータ、エラーの理由を送信する（Ｓ１２６）。第１
端末装置１０ａは、テキストデータ、エラーの理由を表示する（Ｓ１２８）。 The operation of the communication system 100 having the above configuration will be described. FIG. 11 is a sequence diagram illustrating a communication procedure by the communication system 100 according to the fourth embodiment of the present invention. When voice is input to the first terminal device 10a (S110), the first terminal device 10a generates an audio signal (S112) and transmits the audio signal (S114). The second terminal device 10b reproduces the audio signal (S116) and outputs the reproduced audio signal from the speaker 34 (S118). The second terminal device 10b executes voice recognition processing (S120) and identifies the reason for the error (S122). The second terminal device 10b then generates the text data and the reason for the error (S124), and transmits them (S126). The first terminal device 10a displays the text data and the reason for the error (S128).

図１２は、本発明の実施例４に係る端末装置１０による特定手順を示すフローチャート
である。設定部３８にイコライザ設定がなされている場合（Ｓ１５０のＹ）、再生部３２
は、音声信号に対してイコライザ処理を実行する（Ｓ１５２）。設定部３８にイコライザ
設定がなされていない場合（Ｓ１５０のＮ）、ステップ１５２はスキップされる。再生部
３２は、音声認識処理を実行する（Ｓ１５４）。認識不可能な音素があれば（Ｓ１５６の
Ｙ）、処理部２８は、エラーの理由を特定する（Ｓ１５８）。認識不可能な音素がなけれ
ば（Ｓ１５６のＮ）、ステップ１５８はスキップされる。 FIG. 12 is a flowchart illustrating a specifying procedure performed by the terminal device 10 according to the fourth embodiment of the present invention. If the equalizer setting is made in the setting unit 38 (Y in S150), the reproduction unit 32 executes equalizer processing on the audio signal (S152). If the equalizer setting is not made in the setting unit 38 (N in S150), step 152 is skipped. The reproduction unit 32 executes voice recognition processing (S154). If there is an unrecognizable phoneme (Y in S156), the processing unit 28 identifies the reason for the error (S158). If there is no unrecognizable phoneme (N in S156), step 158 is skipped.

図１３は、本発明の実施例４に係る端末装置１０による別の特定手順を示すフローチャ
ートである。設定部３８に音声速度変換設定がなされている場合（Ｓ２００のＹ）、再生
部３２は、音声信号を調節する（Ｓ２０２）。設定部３８に音声速度変換設定がなされて
いない場合（Ｓ２００のＮ）、ステップ２０２はスキップされる。再生部３２は、音声認
識処理を実行する（Ｓ２０４）。音声速度がしきい値よりも大きければ（Ｓ２０６のＹ）
、処理部２８は、エラーの理由を特定する（Ｓ２０８）。音声速度がしきい値よりも大き
くなければ（Ｓ２０６のＮ）、ステップ２０８はスキップされる。 FIG. 13 is a flowchart illustrating another specifying procedure performed by the terminal device 10 according to the fourth embodiment of the present invention. If the voice speed conversion setting is made in the setting unit 38 (Y in S200), the reproduction unit 32 adjusts the audio signal (S202). If the voice speed conversion setting is not made in the setting unit 38 (N in S200), step 202 is skipped. The reproduction unit 32 executes voice recognition processing (S204). If the voice speed is greater than the threshold (Y in S206), the processing unit 28 identifies the reason for the error (S208). If the voice speed is not greater than the threshold (N in S206), step 208 is skipped.

図１４は、本発明の実施例４に係る端末装置１０によるさらに別の特定手順を示すフロ
ーチャートである。再生部３２は、設定部３８における音量設定を取得する（Ｓ２５０）
。再生部３２は、音声信号を調節する（Ｓ２５２）。音量レベルがしきい値よりも小さけ
れば（Ｓ２５４のＹ）、処理部２８は、エラーの理由を特定する（Ｓ２５６）。音量レベ
ルがしきい値よりも小さくなければ（Ｓ２５４のＮ）、ステップ２５６はスキップされる
。 FIG. 14 is a flowchart illustrating yet another specifying procedure performed by the terminal device 10 according to the fourth embodiment of the present invention. The reproduction unit 32 acquires the volume setting from the setting unit 38 (S250). The reproduction unit 32 adjusts the audio signal (S252). If the volume level is smaller than the threshold (Y in S254), the processing unit 28 identifies the reason for the error (S256). If the volume level is not smaller than the threshold (N in S254), step 256 is skipped.

本実施例によれば、音声信号に対して、端末装置の設定を反映させながら音声認識処理
を実行するので、端末装置の設定を反映しながら、受信した音声信号をテキスト化できる
。端末装置の設定を反映させながら、再生した音声信号におけるエラーの理由を特定して
通知するので、端末装置において音声出力に関する設定がなされる場合であっても、音声
が聞こえにくい理由を知らせることができる。また、音声が聞こえにくい理由を知らせる
ので、当該理由を解消しながら音声信号を送信できる。また、端末装置の設定を反映させ
るので、実際の音声の聞こえ方に近くなるように音声認識処理を実行できる。 According to the present embodiment, the voice recognition processing is executed while reflecting the settings of the terminal device on the voice signal, so that the received voice signal can be converted to text while reflecting the settings of the terminal device. Since the cause of the error in the reproduced audio signal is specified and notified while reflecting the setting of the terminal device, it is possible to inform the reason why the sound is difficult to hear even when the setting regarding the audio output is made in the terminal device. it can. Further, since the user is notified of the reason why the sound is hard to hear, the sound signal can be transmitted while eliminating the reason. In addition, since the setting of the terminal device is reflected, it is possible to execute the voice recognition process so as to be closer to how the actual voice is heard.

また、音声認識処理において認識不可能な音素が存在するかを判定して通知するので、
送話者の話し方、通信環境が原因であることを知らせることができる。また、音声信号で
の音声速度がしきい値より大きいかを判定して通知するので、送話者の話し方が原因であ
ることを知らせることができる。また、音声信号での音量レベルがしきい値より小さいか
を判定して通知するので、送話者の話し方が原因であることを知らせることができる。 In addition, since it is determined whether an unrecognizable phoneme exists in the voice recognition processing and the result is notified, the sender can be informed that the cause is the sender's way of speaking or the communication environment. Since it is determined whether the voice speed of the audio signal is greater than the threshold and the result is notified, the sender can be informed that the cause is the sender's way of speaking. Likewise, since it is determined whether the volume level of the audio signal is smaller than the threshold and the result is notified, the sender can be informed that the cause is the sender's way of speaking.

(Example 5)
Next, Example 5 will be described. Example 5 corresponds to a combination of Example 4 and Example 3. The communication system and the terminal device according to Example 5 are of the same type as those in FIGS. 1 and 7. The description here focuses on the differences from the preceding examples.

In (2) described above, the reproduction unit 32 receives the audio signal from the receiving unit 42 and reproduces it. At that time, as in Example 4, the setting values made in the setting unit 38 are reflected. The processing unit 28 receives the audio signal from the reproduction unit 32 and executes voice recognition processing on it based on a standard voice recognition model. As a result, the audio signal is converted to text (hereinafter, this text is referred to as the "first text").

Meanwhile, the reproduction unit 32 receives the audio signal from the receiving unit 42 and reproduces it without using the setting values set in the setting unit 38. The processing unit 28 receives the audio signal from the reproduction unit 32 and executes voice recognition processing on it based on a standard voice recognition model. As a result, the audio signal is converted to text (hereinafter, this text is referred to as the "second text").

The comparison unit 46 receives the first text and the second text and compares them. Here, the comparison consists of placing the first text and the second text next to each other. The comparison unit 46 outputs text data in which the first text and the second text are arranged together to the transmitting unit 40. The transmitting unit 40 receives the text data from the processing unit 28 and transmits the text data, which is the comparison result, to the transmitting-side terminal device 10. The subsequent processing is the same as before, so its description is omitted here.
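
The two recognition passes and the comparison can be sketched as follows. This is a minimal sketch under stated assumptions: `reproduce` and `recognize` are hypothetical stand-ins for the reproduction unit 32 and the processing unit 28, the settings are modeled as a plain dictionary, and the labels in the output text are illustrative only.

```python
from typing import Callable, List, Optional

Signal = List[float]  # placeholder representation of audio samples (assumption)


def compare_recognition(
    received: Signal,
    settings: dict,
    reproduce: Callable[[Signal, Optional[dict]], Signal],
    recognize: Callable[[Signal], str],
) -> str:
    """Build the text data that places the first and second texts together."""
    # First text: reproduction reflects the setting values.
    first_text = recognize(reproduce(received, settings))
    # Second text: reproduction leaves the setting values unused.
    second_text = recognize(reproduce(received, None))
    return f"with settings: {first_text}\nwithout settings: {second_text}"
```

Passing the reproduction and recognition steps in as callables keeps the sketch independent of any particular speech recognition engine.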

According to this example, the result of the voice recognition processing with the terminal device's setting values in use is compared with the result obtained without them, so the stage at which an unrecognizable phoneme arises can be identified. If an unrecognizable phoneme exists in the result obtained with the setting values in use and none exists in the result obtained without them, it can be recognized that the voice cannot be heard because of the terminal device settings. If an unrecognizable phoneme exists in both results, it can be recognized that the cause lies at the utterance or communication stage.
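
The reasoning above amounts to a small decision rule, sketched here. The function name and the returned strings are assumptions; the case where only the second text contains an unrecognizable phoneme is not discussed in the text and is reported as undetermined.

```python
from typing import Optional


def locate_cause(first_has_unrecognizable: bool, second_has_unrecognizable: bool) -> Optional[str]:
    """Infer the stage at which an unrecognizable phoneme arises.

    first_*  : result with the terminal device's setting values in use
    second_* : result with the setting values unused
    """
    if first_has_unrecognizable and second_has_unrecognizable:
        return "utterance or communication"
    if first_has_unrecognizable:
        return "terminal device settings"
    if second_has_unrecognizable:
        return "undetermined"  # case not described in the text
    return None  # no unrecognizable phoneme in either result
```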

(Example 6)
Next, Example 6 will be described. Example 6 relates to also using information on the sound collected by the microphone when executing the voice recognition processing of Example 4. The communication system and the terminal device according to Example 6 are of the same type as those in FIGS. 1 and 2. The description here focuses on the differences from the preceding examples.

The microphone 22 collects sound around the terminal device 10, for example noise. The microphone 22 converts the collected noise into an electric signal (hereinafter referred to as a "noise signal") and outputs the noise signal to the processing unit 28. As in Example 4, the processing unit 28 executes voice recognition processing on the audio signal. In particular, when executing the processing (A) described above to identify the reason for an error, the processing unit 28 reflects information on the sound collected by the microphone 22 in the voice recognition processing. For example, the value to be compared with the correlation value is adjusted according to the magnitude of the noise signal. Specifically, the larger the noise signal, the smaller the value to be compared with the correlation value is made. The processing unit 28 then determines unrecognizable phonemes as before. The subsequent processing is the same as before, so its description is omitted here.
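
One way to picture the adjustment is a comparison value that shrinks as the noise signal grows. The text only states the direction of the adjustment, so the linear form, the `sensitivity` coefficient, and the `floor` value below are all assumptions made for illustration.

```python
def adjusted_threshold(
    base_threshold: float,
    noise_level: float,
    sensitivity: float = 0.5,
    floor: float = 0.0,
) -> float:
    """Return the value to compare with the correlation value.

    The larger the noise signal, the smaller the returned value.
    A linear decrease clipped at `floor` is an assumption, not the
    specification's formula.
    """
    if noise_level < 0:
        raise ValueError("noise_level must be non-negative")
    return max(floor, base_threshold - sensitivity * noise_level)
```

With no noise the base value is returned unchanged, and very large noise levels saturate at `floor` instead of going negative.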

According to this example, information on the sound around the terminal device collected by the microphone is also reflected, so the voice recognition processing can be executed in a way that approximates how the voice is actually heard. Because the voice recognition processing approximates how the voice is actually heard, the accuracy of the text conversion can be improved.

(Example 7)
Next, Example 7 will be described. Example 7 corresponds to a combination of Example 6 and Example 5. The communication system and the terminal device according to Example 7 are of the same type as those in FIGS. 1 and 7. The description here focuses on the differences from the preceding examples.

In (2) described above, the reproduction unit 32 receives the audio signal from the receiving unit 42 and reproduces it. At that time, as in Example 6, the setting values made in the setting unit 38 are reflected. The processing unit 28 receives the audio signal from the reproduction unit 32 and executes voice recognition processing on it based on a standard voice recognition model. Here, when executing the processing (A) described above to identify the reason for an error, the processing unit 28 reflects information on the sound collected by the microphone 22 in the voice recognition processing, as in Example 6. As a result, the audio signal is converted to text (hereinafter, this text is referred to as the "first text").

Meanwhile, the reproduction unit 32 receives the audio signal from the receiving unit 42 and reproduces it without using the setting values set in the setting unit 38. The processing unit 28 receives the audio signal from the reproduction unit 32 and executes voice recognition processing on it based on a standard voice recognition model. However, the processing unit 28 does not reflect information on the sound collected by the microphone 22 in the voice recognition processing. In other words, the voice recognition processing is executed with the sound information unused. As a result, the audio signal is converted to text (hereinafter, this text is referred to as the "second text").

According to this example, the result of the voice recognition processing obtained with both the terminal device's setting values and the collected sound information in use is compared with the result obtained with neither in use, so the stage at which an unrecognizable phoneme arises can be identified. If an unrecognizable phoneme exists in the result obtained with the setting values and the collected sound information in use and none exists in the result obtained with neither, it can be recognized that the voice cannot be heard because of the terminal device settings or the ambient noise. If an unrecognizable phoneme exists in both results, it can be recognized that the cause lies at the utterance or communication stage.

(Example 8)
Next, Example 8 will be described. Example 8 corresponds to Example 6 with the settings made in the receiving-side terminal device left unreflected when the voice recognition processing is executed. The settings made in the receiving-side terminal device are the equalizer on/off setting, the volume level for output from the speaker, and the speech rate conversion on/off setting. On the other hand, in Example 8, as in Example 6, information on the sound collected by the microphone is used when the voice recognition processing is executed. The communication system and the terminal device according to Example 8 are of the same type as those in FIGS. 1 and 2. The description here focuses on the differences from the preceding examples.

In (2) described above, the reproduction unit 32 receives the audio signal from the receiving unit 42 and reproduces it. At that time, unlike in Example 6, the setting values made in the setting unit 38 are not reflected. The microphone 22 collects sound around the terminal device 10, for example noise, converts the collected noise into an electric signal (hereinafter referred to as a "noise signal"), and outputs the noise signal to the processing unit 28. As in Example 6, the processing unit 28 executes voice recognition processing on the audio signal. In particular, when executing the processing (A) described above to identify the reason for an error, the processing unit 28 reflects information on the sound collected by the microphone 22 in the voice recognition processing. The subsequent processing is the same as before, so its description is omitted here.

According to this example, information on the sound around the terminal device collected by the microphone is reflected, so the voice recognition processing can be executed in a way that approximates how the voice is actually heard. Because the voice recognition processing approximates how the voice is actually heard, the accuracy of the text conversion can be improved.

(Example 9)
Next, Example 9 will be described. Example 9 corresponds to a combination of Example 8 and Example 7. The communication system and the terminal device according to Example 9 are of the same type as those in FIGS. 1 and 7. The description here focuses on the differences from the preceding examples.

In (2) described above, the processing unit 28 receives the audio signal from the reproduction unit 32. The processing unit 28 executes voice recognition processing on the audio signal based on a standard voice recognition model. Here, when executing the processing (A) described above to identify the reason for an error, the processing unit 28 reflects information on the sound collected by the microphone 22 in the voice recognition processing, as in Example 8. As a result, the audio signal is converted to text (hereinafter, this text is referred to as the "first text").

Meanwhile, the processing unit 28 executes voice recognition processing on the audio signal based on a standard voice recognition model. However, the processing unit 28 does not reflect information on the sound collected by the microphone 22 in the voice recognition processing. In other words, the voice recognition processing is executed with the sound information unused. As a result, the audio signal is converted to text (hereinafter, this text is referred to as the "second text").

According to this example, the result of the voice recognition processing obtained with the collected sound information in use is compared with the result obtained without it, so the stage at which an unrecognizable phoneme arises can be identified. If an unrecognizable phoneme exists in the result obtained with the collected sound information in use and none exists in the result obtained without it, it can be recognized that the voice cannot be heard because of the ambient noise. If an unrecognizable phoneme exists in both results, it can be recognized that the cause lies at the utterance or communication stage.
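
Examples 5, 7, and 9 differ only in which factors are reflected in the first text and left out of the second. The sketch below tabulates those combinations as read from the descriptions above and derives the factors that a first-text-only failure would point to. The names `RunConfig`, `COMPARISONS`, and `suspected_factors` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RunConfig:
    """Which inputs one voice recognition pass reflects."""
    use_settings: bool  # setting values from the setting unit 38
    use_noise: bool     # sound information collected by the microphone 22


# (first-text pass, second-text pass) for each comparing example.
COMPARISONS: Dict[int, Tuple[RunConfig, RunConfig]] = {
    5: (RunConfig(True, False), RunConfig(False, False)),
    7: (RunConfig(True, True), RunConfig(False, False)),
    9: (RunConfig(False, True), RunConfig(False, False)),
}


def suspected_factors(first: RunConfig, second: RunConfig) -> List[str]:
    """Factors present only in the first pass, i.e. the candidates when
    only the first text contains an unrecognizable phoneme."""
    factors = []
    if first.use_settings and not second.use_settings:
        factors.append("terminal device settings")
    if first.use_noise and not second.use_noise:
        factors.append("ambient noise")
    return factors
```

For instance, `suspected_factors(*COMPARISONS[7])` yields both the terminal device settings and the ambient noise, matching the conclusion stated for Example 7.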

The present invention has been described above based on the examples. These examples are illustrative, and those skilled in the art will understand that various modifications are possible in the combinations of their components and processing steps, and that such modifications also fall within the scope of the present invention.

In Examples 1 to 9, the communication system 100 uses business radio. However, this is not a limitation; for example, a wireless communication system other than business radio may be used. This increases the degree of freedom of the configuration.

10 terminal device, 12 base station device, 14 network, 20 button, 22 microphone, 24 operation unit, 26 display unit, 28 processing unit, 30 communication unit, 32 reproduction unit, 34 speaker, 36 speech transmission unit, 38 setting unit, 40 transmitting unit, 42 receiving unit, 46 comparison unit, 100 communication system.

Claims

A terminal device comprising:
a transmitting unit that transmits an audio signal to a receiving-side terminal device;
a receiving unit that receives, from the receiving-side terminal device, a result of voice recognition processing executed in the receiving-side terminal device on an audio signal obtained by reproducing the received audio signal; and
a processing unit that displays the received result of the voice recognition processing on a display unit.

The terminal device according to claim 1, wherein the result of the voice recognition processing is a result in which the way the user of the receiving-side terminal device hears is reflected in the audio signal reproduced by the receiving-side terminal device.

The terminal device according to claim 1, wherein the receiving unit receives, from the receiving-side terminal device, a comparison result obtained by comparing, in the receiving-side terminal device, (1) a result of voice recognition processing executed without reflecting the way the user of the receiving-side terminal device hears and (2) a result of voice recognition processing reflecting that way of hearing.

The terminal device according to any one of claims 1 to 3, wherein the receiving unit receives, from the receiving-side terminal device, a result in which at least one of a volume level and a speech speed is reflected in the voice recognition processing of the receiving-side terminal device.

A communication method comprising:
transmitting an audio signal to a receiving-side terminal device;
receiving, from the receiving-side terminal device, a result of voice recognition processing executed in the receiving-side terminal device on an audio signal reproduced from the received audio signal; and
acquiring the result of the voice recognition processing and displaying it on a display unit.

A communication program that causes a computer to execute:
a process of transmitting an audio signal to a receiving-side terminal device;
a process of receiving, from the receiving-side terminal device, a result of voice recognition processing executed in the receiving-side terminal device on an audio signal reproduced from the received audio signal; and
a process of acquiring the result of the voice recognition processing and displaying it on a display unit.