JP4127274B2

JP4127274B2 - Telephone speech recognition system

Info

Publication number: JP4127274B2
Application number: JP2005081007A
Authority: JP
Inventors: 洋平北本
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2005-03-22
Filing date: 2005-03-22
Publication date: 2008-07-30
Anticipated expiration: 2025-03-22
Also published as: JP2006270143A

Description

The present invention relates to a telephone speech recognition system, and more particularly to a telephone speech recognition system that performs speech recognition on telephone speech.

Telephone speech recognition systems that apply a speech recognition device to telephone speech are conventionally known (see, for example, Patent Document 1). In this conventional system, a plurality of telephone speech recognition and response devices are connected to a network as clients. Each device comprises a CPU board carrying various memories; an A/D conversion board that converts analog signals into digital signals; a telephone answering board that carries a network termination device and a digital switch and is provided with a telephone line connection terminal; a conductor that electrically connects these boards; and a speech synthesizer that generates a synthesized speech signal from the signal output by the CPU board and supplies it to the telephone answering board. Via the network, each client transmits recognition results to a server, receives commands from the server, and automatically performs the corresponding processing.

The conventional telephone speech recognition system configured in this way operates as follows. An analog signal arriving from the telephone line connected to the telephone line connection terminal of the telephone answering board is passed from the telephone answering board to the A/D conversion board, where it is converted into a digital signal. The digital signal is passed to the CPU board, which executes recognition processing. The recognition result is converted into synthesized speech by the speech synthesizer and then sent from the telephone answering board to the telephone line.

Patent Document 1: JP 2000-22830 A

However, this conventional telephone speech recognition system gives no consideration to the quality or bandwidth of the communication line. It therefore operates on the assumption that the analog signal arriving from the telephone line is undistorted, and ends up performing speech synthesis processing even on a distorted analog signal. As a result, the conventional system has the problem that, when the quality of the communication line is poor or bandwidth is insufficient, the voice is interrupted and a smooth conversation is not possible.

The present invention has been made in view of the above points. Its object is to provide a telephone speech recognition system that complements telephone speech by means of telephone speech recognition and speech synthesis, so that an uninterrupted conversation is possible even when the quality of the communication line is poor or bandwidth is insufficient.

To achieve the above object, the present invention provides a telephone speech recognition system in which a speech recognition signal, obtained by speech recognition of a transmitted voice signal at a transmitting-side telephone terminal, is received at a receiving-side telephone terminal and speech-synthesized to restore the transmitted voice signal. The transmitting-side telephone terminal has means for generating and transmitting integrated data consisting of character information, which is a digital signal for each single sound of the transmitted voice obtained by speech recognition of the transmitted voice signal, and voice information, which is an analog voice signal for each single sound of the transmitted voice signal. The receiving-side telephone terminal has separation/selection means that receives the integrated data, separates it into the character information and the voice information, selects a speech synthesis signal obtained by speech synthesis of the character information when the distortion amount of the separated voice information is equal to or greater than a set threshold, and selects the separated voice information when the distortion amount is less than the threshold; and voice output means that sounds the selected speech synthesis signal or voice information as the transmitted voice. With this invention, even if the voice information is distorted, for example because the quality of the communication line is poor or bandwidth is insufficient, the voice information can be complemented using the speech-synthesized voice signal.

Further, to achieve the above object, the present invention provides a telephone speech recognition system in which a speech recognition signal, obtained by speech recognition of a transmitted voice signal at a transmitting-side telephone terminal, is received at a receiving-side telephone terminal and speech-synthesized to restore the transmitted voice signal. The transmitting-side telephone terminal has voice input means that captures the transmitted voice as a transmitted voice signal; voice recognition means that obtains, from the transmitted voice signal by speech recognition processing, character information that is a digital signal for each single sound of the transmitted voice; integration means that takes the analog voice signal for each single sound of the transmitted voice signal output from the voice input means as voice information and generates integrated data in which it is integrated, in predetermined units, with the character information output from the voice recognition means; and transmission means that transmits the integrated data to the other party's telephone terminal.
The receiving-side telephone terminal has receiving means that receives the integrated data; separation means that separates the integrated data received by the receiving means into the voice information and the character information; distortion analysis means that measures the distortion amount of the separated voice information; voice synthesis means that performs speech synthesis processing on the separated character information and outputs a speech synthesis signal; complementing means that selects the speech synthesis signal when the distortion amount measured by the distortion analysis means is equal to or greater than a set threshold, and selects the voice information whose distortion amount was measured when the distortion amount is less than the threshold; and voice output means that sounds the speech synthesis signal or voice information output from the complementing means to the outside as voice.

With this invention, even if the voice information is distorted, for example because the quality of the communication line is poor or bandwidth is insufficient, the voice information can be complemented using the speech-synthesized voice signal.

Further, to achieve the above object, the present invention provides a telephone speech recognition system in which a speech recognition signal, obtained by speech recognition of a transmitted voice signal at a transmitting-side telephone terminal, is received at a receiving-side telephone terminal and speech-synthesized to restore the transmitted voice signal. The transmitting-side telephone terminal has means for generating and transmitting integrated data consisting of character information, which is a digital signal for each single sound of the transmitted voice obtained by speech recognition of the transmitted voice signal; voice information, which is an analog voice signal for each single sound of the transmitted voice signal; and voice quality information, which is a digital signal indicating a voice waveform of the sender's voice quality or of a preferred voice quality. The receiving-side telephone terminal has separation/selection means that receives the integrated data and separates it into the character information, the voice information, and the voice quality information, selects, when the distortion amount of the separated voice information is equal to or greater than a set threshold, a speech synthesis signal obtained by speech synthesis processing that renders the character information as synthesized speech approximating the voice quality indicated by the voice quality information, and selects the separated voice information when the distortion amount is less than the threshold; and voice output means that sounds the selected speech synthesis signal or voice information as the transmitted voice.

In this invention, integrated data consisting of the character information of the transmitted voice obtained by speech recognition of the transmitted voice signal, the voice information of the transmitted voice signal, and the voice quality information of the received voice is transmitted and received. Therefore, even if the voice information is distorted, for example because the quality of the communication line is poor or bandwidth is insufficient, the voice information can be complemented using the speech-synthesized voice signal, and in addition a received voice with a voice quality based on the voice quality information can be obtained.

Further, to achieve the above object, the present invention provides a telephone speech recognition system in which a speech recognition signal, obtained by speech recognition of a transmitted voice signal at a transmitting-side telephone terminal, is received at a receiving-side telephone terminal and speech-synthesized to restore the transmitted voice signal.
The transmitting-side telephone terminal has voice input means that captures the transmitted voice as a transmitted voice signal; voice recognition means that obtains, from the transmitted voice signal by speech recognition processing, character information that is a digital signal for each single sound of the transmitted voice; integration means that generates integrated data in which voice information, which is the analog voice signal for each single sound of the transmitted voice signal output from the voice input means, the character information output from the voice recognition means, and voice quality information, which is a digital signal indicating a voice waveform of the sender's voice quality or of a preferred voice quality, are integrated in predetermined units; and transmission means that transmits the integrated data to the other party's telephone terminal.
The receiving-side telephone terminal has receiving means that receives the integrated data; separation means that separates the integrated data received by the receiving means into the voice information, the character information, and the voice quality information; distortion analysis means that measures the distortion amount of the separated voice information; voice synthesis means that, based on the separated character information and voice quality information, outputs a speech synthesis signal of synthesized speech that renders the character information in a voice approximating the voice quality indicated by the voice quality information; complementing means that selects the speech synthesis signal when the distortion amount measured by the distortion analysis means is equal to or greater than a set threshold, and selects the voice information whose distortion amount was measured when the distortion amount is less than the threshold; and voice output means that sounds the speech synthesis signal or voice information output from the complementing means to the outside as voice.

According to the present invention, even if the voice information is distorted, for example because the quality of the communication line is poor or bandwidth is insufficient, it is complemented with a speech-synthesized voice signal, so an uninterrupted conversation is possible even under those conditions.

Further, according to the present invention, integrated data consisting of the character information of the transmitted voice obtained by speech recognition of the transmitted voice signal, the voice information of the transmitted voice signal, and the voice quality information of the received voice is transmitted and received. When the voice information is distorted, it is complemented using the speech-synthesized voice signal, and a received voice with a voice quality based on the voice quality information is obtained, so the complemented synthesized speech can be heard without sounding unnatural.

Next, the configuration of an embodiment of the present invention will be described in detail with reference to the drawings. FIG. 1 is a system configuration diagram of one embodiment of the telephone speech recognition system according to the present invention. In the figure, the telephone speech recognition system of this embodiment comprises a telephone 1 on the voice transmitting side, a telephone 5 on the voice receiving side, a public network 3 serving as the communication line, and exchanges 2 and 4 that connect telephone 1 and telephone 5 to the public network 3.

The telephone 1 on the voice transmitting side includes voice input means 11, voice recognition means 12, voice/character information integration means 13, and integrated data transmission means 14. The telephone 5 on the voice receiving side includes integrated data receiving means 51, voice/character information separation means 52, voice synthesis means 53, voice distortion analysis means 54, voice complementing means 55, and voice output means 56.

The integrated data 6 produced by the voice/character information integration means 13 consists of voice information 61 and character information 62, as shown in FIG. 2(A). The integrated data 6' received by the integrated data receiving means 51 via the public network 3 and the exchange 4 consists of voice information 61' and character information 62, as shown in FIG. 2(B).

The voice information 61 and 61' is, for example, the analog voice signal of a single sound, and the character information 62 is a digital voice signal obtained as the speech recognition result for that single sound. Because an analog voice signal may be distorted in transmission depending on the quality and bandwidth of the communication line, the voice information 61 on the transmitting side is denoted as voice information 61' on the receiving side. The character information 62, in contrast, is a digital voice signal, and the receiving side can normally correct errors during digital signal processing, so the receiving side is shown as recovering the same character information 62 as the transmitting side.

Each of these means operates roughly as follows. The voice input means 11 takes external voice into the telephone 1. The voice recognition means 12 converts the external voice captured by the voice input means 11 into character information 62. The voice/character information integration means 13 integrates the voice information 61, which is the external voice itself as captured by the voice input means 11, with the character information 62 produced by the voice recognition means 12 into a single piece of integrated data 6. The integrated data transmission means 14 transmits the integrated data 6 to the exchange 2.

In the telephone 5 on the voice receiving side, the integrated data receiving means 51 receives the integrated data 6' from the exchange 4. The voice/character information separation means 52 separates the integrated data 6' into voice information 61' and character information 62. The voice synthesis means 53 converts the character information 62, separated from the integrated data 6' by the voice/character information separation means 52, into synthesized speech. The voice distortion analysis means 54 measures the distortion of the voice information 61' separated from the integrated data 6' by the voice/character information separation means 52, and also outputs the voice information 61'. The voice complementing means 55 replaces voice information 61' in which the voice distortion analysis means 54 has detected distortion with the synthesized speech produced by the voice synthesis means 53. The voice output means 56 performs electro-acoustic conversion on the input voice signal and outputs it as voice (sound waves).

Next, the operation of this embodiment will be described in detail with reference to FIG. 1 and FIG. 2. The voice on the transmitting side (the transmitted voice) undergoes acoustic-electric conversion in the voice input means 11 of the telephone 1 and is captured as a transmitted voice signal. The captured transmitted voice signal is an analog voice signal. The voice recognition means 12 recognizes it sound by sound using a conventionally known method and converts it into character information, a digital signal that represents the transmitted voice as a character string.

The voice/character information integration means 13 integrates, sound by sound, the voice information representing the transmitted voice signal output from the voice input means 11 (denoted 61 in FIG. 2(A)) with the character information output from the voice recognition means 12 (denoted 62 in FIG. 2(A)) to create integrated data 6 in the format shown in FIG. 2(A). The integrated data transmission means 14 transmits the integrated data 6 output from the voice/character information integration means 13 to the public network 3 via the exchange 2, addressed to the telephone 5 on the receiving side.

The exchange 2 transmits the received integrated data 6 to the telephone 5 on the receiving side via the public network 3 and the exchange 4. The integrated data receiving means 51 of the telephone 5 receives the integrated data transmitted from the exchange 4 and obtains received integrated data 6' in the format shown in FIG. 2(B). The voice/character information separation means 52 separates the received integrated data 6' into the voice information 61' and the character information 62 shown in FIG. 2(B), outputs the character information 62 to the voice synthesis means 53, and outputs the voice information 61' to the voice distortion analysis means 54.
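The sound-by-sound integration on the transmitting side and the separation on the receiving side can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify a data layout, so the `IntegratedUnit` class, the `integrate` and `separate` functions, and the use of a list of floats to stand in for the analog voice signal are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IntegratedUnit:
    """One per-sound unit of integrated data 6 (hypothetical layout)."""
    voice: List[float]  # voice information 61: samples standing in for the analog signal
    character: str      # character information 62: recognized character for this sound


def integrate(voices: List[List[float]], characters: List[str]) -> List[IntegratedUnit]:
    """Pair each sound's voice information with its recognized character."""
    if len(voices) != len(characters):
        raise ValueError("each sound needs both voice and character information")
    return [IntegratedUnit(v, c) for v, c in zip(voices, characters)]


def separate(units: List[IntegratedUnit]) -> Tuple[List[List[float]], List[str]]:
    """Split received integrated data back into voice and character streams."""
    return [u.voice for u in units], [u.character for u in units]


# Two sounds, each with a short (made-up) sample sequence and its character.
units = integrate([[0.1, 0.2], [0.3, 0.1]], ["ア", "イ"])
voices, characters = separate(units)
print(characters)
```

Keeping the two kinds of information paired per sound is what lets the receiving side later swap in synthesized speech for exactly the sounds whose voice information arrived distorted.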

The voice synthesis means 53 performs known speech synthesis processing on the input character information 62 to create a synthesized analog voice signal. Meanwhile, the voice distortion analysis means 54 measures the distortion amount of the input voice information 61'. This distortion amount is measured, for example, on the basis of the level of the voice information 61', which is an analog voice signal. When the measured distortion amount of the voice information 61' is equal to or greater than a preset threshold, the voice complementing means 55 replaces the distorted voice information 61' with the synthesized voice signal from the voice synthesis means 53 and outputs it to the voice output means 56. When the measured distortion amount is less than the preset threshold, so that the voice information can be regarded as undistorted, the voice complementing means 55 outputs the received voice information 61', input via the voice distortion analysis means 54, to the voice output means 56.
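The threshold rule applied by the voice complementing means 55 can be sketched as below. The text says only that distortion is measured "for example" from the signal level, so the particular measure here (the fraction of samples whose magnitude exceeds an expected peak) is an assumption made for illustration, as are the names `measure_distortion`, `complement`, and `expected_peak`.

```python
from typing import List, Sequence


def measure_distortion(samples: Sequence[float], expected_peak: float = 1.0) -> float:
    """Hypothetical level-based distortion measure in the range 0.0 to 1.0.

    Returns the fraction of samples whose magnitude exceeds expected_peak.
    An empty unit (nothing received) is treated as fully distorted.
    """
    if not samples:
        return 1.0
    over = sum(1 for s in samples if abs(s) > expected_peak)
    return over / len(samples)


def complement(received: List[float], synthesized: List[float], threshold: float) -> List[float]:
    """Select the synthesized signal when distortion >= threshold, else the received voice."""
    if measure_distortion(received) >= threshold:
        return synthesized
    return received


clean = [0.1, -0.2, 0.3]
distorted = [2.0, -3.0, 0.1]
tts = [0.5, 0.5, 0.5]  # stand-in for the output of voice synthesis means 53

print(complement(clean, tts, threshold=0.5) is clean)    # received voice is kept
print(complement(distorted, tts, threshold=0.5) is tts)  # synthesized voice is used
```

Note that the comparison is "equal to or greater than", matching the text: a unit whose distortion sits exactly at the threshold is replaced by synthesized speech.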

The voice output means 56 performs electro-acoustic conversion on the signal input from the voice complementing means 55, which is either the voice information 61' (an analog voice signal) or the speech synthesis signal (also an analog voice signal), and sounds it to the outside as the received voice of the telephone 5. Thus, according to this embodiment, when the voice information 61', the analog voice signal in the telephone voice signal (integrated data 6') received by the telephone 5 on the receiving side, is not distorted, the sender's own transmitted voice from the telephone 1 is sounded, and only when the voice information 61' is distorted is the synthesized speech produced by the voice synthesis means 53 sounded. An uninterrupted conversation is therefore possible even when the quality of the communication line is poor or bandwidth is insufficient. In addition, because priority is given to sounding the sender's own transmitted voice, the output uses the sender's voice as far as possible rather than the flat sound of synthesized speech.

Next, the operation of this embodiment will be described in further detail using a specific example, with reference to FIG. 3. Suppose that the voice input to the voice input means 11 is, as shown in FIG. 3, the sounds "あ" (a), "い" (i), and "う" (u) (step S1 in FIG. 3). The transmitted voice "あ", "い", "う" undergoes acoustic-electric conversion in the voice input means 11 of the telephone 1 and is captured as voice information 61, an analog voice signal.

The captured voice information 61 ("あ", "い", "う") is converted by known speech recognition processing in the voice recognition means 12 into character information 62 ("ア", "イ", "ウ"), a digital signal for each single sound (step S2 in FIG. 3). The voice/character information integration means 13 integrates the voice information 61 ("あ", "い", "う") captured by the voice input means 11 with the character information 62 ("ア", "イ", "ウ") obtained by speech recognition in the voice recognition means 12 to create integrated data 6 ("あ" with "ア", "い" with "イ", "う" with "ウ") in the format shown in FIG. 2(A) (step S3 in FIG. 3).

The integrated data transmission means 14 transmits the integrated data 6 ("あ" with "ア", "い" with "イ", "う" with "ウ") to the exchange 2, addressed to the destination telephone 5. The exchange 2 transmits the received integrated data 6' ("あ" with "ア", "?" with "イ", "う" with "ウ") to the exchange 4 via the public network 3. Here it is assumed that, because the quality of the communication line is poor or bandwidth is insufficient, the voice information "い" in the integrated data 6 was distorted during transmission to the exchange 2 and the voice information "う" was distorted during transmission to the exchange 4. As a result, the integrated data receiving means 51 of the telephone 5 receives integrated data 6' ("あ" with "ア", "?" with "イ", "?" with "ウ") (step S4 in FIG. 3).

This received integrated data 6' ("あ" with "ア", "?" with "イ", "?" with "ウ") is separated by the voice/character information separation means 52 into voice information 61' ("あ", "?", "?") and character information 62 ("ア", "イ", "ウ"). The separated character information 62 ("ア", "イ", "ウ"), a digital signal, is supplied to the voice synthesis means 53 and converted by known speech synthesis processing into a synthesized analog voice signal (step S5 in FIG. 3).

Meanwhile, the separated voice information 61' ("あ", "?", "?"), an analog voice signal, is supplied to the voice distortion analysis means 54, which measures its distortion amount (step S6 in FIG. 3). In this example, the measured distortion amount of the first voice information, "あ", is less than the threshold, but the measured distortion amounts of the second and third voice information are equal to or greater than the threshold, so those sounds are judged to be distorted. The voice complementing means 55 therefore takes the undistorted first voice information "あ" from the voice distortion analysis means 54 and outputs it as is, and replaces the distorted second and third voice information with the voice signals of the synthesized speech "イ" and "ウ" input from the voice synthesis means 53 (step S7 in FIG. 3). The voice output means 56 performs electro-acoustic conversion on the voice signals ("あ", "イ", "ウ") output from the voice complementing means 55 and sounds them to the outside as voice (step S8 in FIG. 3).
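The worked example of steps S4 to S8 can be traced with a few lines of code. This is a sketch only: the sounds are represented symbolically as strings rather than as waveforms, and the distortion values and the threshold of 0.5 are made up for illustration, since the text gives no numeric values. The function name `complement_sequence` is likewise hypothetical.

```python
from typing import List

THRESHOLD = 0.5  # hypothetical threshold; the text does not give a value


def complement_sequence(voice_units: List[str],
                        distortions: List[float],
                        synthesized_units: List[str],
                        threshold: float = THRESHOLD) -> List[str]:
    """Per sound, keep the received voice unless its distortion is >= threshold."""
    return [
        synth if d >= threshold else voice
        for voice, d, synth in zip(voice_units, distortions, synthesized_units)
    ]


# Step S4/S5: received voice information 61' and synthesized speech from character information 62.
received_voice = ["あ", "?", "?"]  # "?" marks a sound that arrived distorted
synthesized = ["ア", "イ", "ウ"]   # synthesized speech for each recognized character

# Step S6: measured distortion amounts (illustrative numbers only).
distortions = [0.1, 0.9, 0.8]

# Step S7: complement the distorted sounds; step S8 would sound the result.
output = complement_sequence(received_voice, distortions, synthesized)
print(output)  # ['あ', 'イ', 'ウ']
```

The first sound is output in the sender's own voice, and only the second and third sounds fall back to synthesized speech, which matches the sequence ("あ", "イ", "ウ") given for step S8.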

Therefore, according to this embodiment, even if the voice is distorted, for example because the quality of the communication line is poor or bandwidth is insufficient, the voice is complemented using a speech-synthesized voice signal, so an uninterrupted conversation is possible.

Next, another embodiment of the present invention will be described in detail with reference to the drawings. The system configuration of this embodiment is substantially the same as that of FIG. 1. The embodiment is characterized in that the transmitting-side telephone 1 transmits integrated data 7 in the format shown in FIG. 4(A), and the receiving-side telephone 5 receives and processes integrated data 7' in the format shown in FIG. 4(B).

As shown in FIG. 4(A), the integrated data 7 contains, in addition to the voice information 61 and the character information 62 of the integrated data 6, voice quality information 71 indicating the voice quality of the received voice (for example, the voice quality of the sender). The voice quality information 71 is a digital signal relating to a voice waveform that indicates voice quality. The voice recognition means 12 shown in FIG. 1 generates it by a known method from the sender's analog transmission voice signal captured by the voice input means 11, together with the voice information 61, which is the voice recognition result for each single sound.

The voice/character information integrating means 13 of FIG. 1 integrates, for each single sound, the voice information representing the transmission voice signal output from the voice input means 11 (indicated by 61 in FIG. 4(A)) with the character information (indicated by 62 in FIG. 4(A)) and the voice quality information (indicated by 71 in FIG. 4(A)) output from the voice recognition means 12, thereby creating the integrated data 7 in the format shown in FIG. 4(A). The integrated data transmitting means 14 transmits the integrated data 7 output from the voice/character information integrating means 13 to the public network 3 via the exchange 2, addressed to the receiving-side telephone 5.
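The per-sound integration described above can be pictured with a short Python sketch. The record layout is an assumption made for illustration: the text specifies only that voice information 61, character information 62, and voice quality information 71 are combined for each single sound, not how the fields are encoded, so `IntegratedUnit`, `integrate`, and the sample byte values are hypothetical.

```python
from typing import List, NamedTuple


class IntegratedUnit(NamedTuple):
    """One single-sound entry of integrated data 7 (hypothetical layout)."""
    voice: bytes          # voice information 61 for this sound
    character: str        # character information 62 (recognition result)
    voice_quality: bytes  # voice quality information 71


def integrate(voices: List[bytes], characters: List[str],
              qualities: List[bytes]) -> List[IntegratedUnit]:
    """Combine the three streams sound by sound into integrated data."""
    if not (len(voices) == len(characters) == len(qualities)):
        raise ValueError("each single sound needs voice, character, and voice quality")
    return [IntegratedUnit(v, c, q)
            for v, c, q in zip(voices, characters, qualities)]


# Placeholder payloads; real voice and voice quality data would be signal samples.
data_7 = integrate(
    voices=[b"\x01", b"\x02", b"\x03"],
    characters=["ア", "イ", "ウ"],
    qualities=[b"q", b"q", b"q"],
)
print(len(data_7), data_7[1].character)
```

Keeping the three fields together per sound is what lets the receiving side swap in a synthesized sound for exactly the distorted unit.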

The exchange 2 transmits the received integrated data 7 to the receiving-side telephone 5 via the public network 3 and the exchange 4. The integrated data receiving means 51 of the telephone 5 receives the integrated data transmitted from the exchange 4 and obtains received integrated data 7' in the format shown in FIG. 4(B). The voice/character information separating means 52 separates the received integrated data 7' received by the integrated data receiving means 51 into its constituent voice information 61', character information 62, and voice quality information 71 shown in FIG. 4(B). It outputs the character information 62 and the voice quality information 71 to the voice synthesis means 53, and outputs the voice information 61' to the voice distortion analysis means 54.
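The routing performed by the separating means 52 can be sketched in Python as below. The tuple layout `(voice, character, voice_quality)`, the function name `separate`, and the placeholder string "q" for voice quality are assumptions for illustration only; the text does not define a concrete data representation.

```python
def separate(received):
    """Split received units (voice, character, voice_quality) into two streams.

    Returns the (character, voice_quality) pairs destined for voice synthesis
    and the voice information destined for distortion analysis.
    """
    to_synthesis = [(character, quality) for _, character, quality in received]
    to_distortion_analysis = [voice for voice, _, _ in received]
    return to_synthesis, to_distortion_analysis


# Received integrated data 7' with two sounds garbled in transit ("?").
received_7 = [("あ", "ア", "q"), ("?", "イ", "q"), ("?", "ウ", "q")]
to_synthesis, to_distortion_analysis = separate(received_7)
print(to_synthesis)
print(to_distortion_analysis)
```

Both output lists keep the original order, so the n-th synthesized sound can later replace the n-th voice unit if that unit is judged distorted.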

Based on the input character information 62 and voice quality information 71, the voice synthesis means 53 generates, by a known voice synthesis process, a synthesized sound of the characters indicated by the character information 62, and does so in such a way that the synthesized sound approximates the voice quality indicated by the voice quality information 71. Meanwhile, the voice distortion analysis means 54 of FIG. 1 measures the distortion amount of the input voice information 61'.

When the measured distortion amount of the voice information 61' is equal to or greater than a preset threshold, the voice complementing means 55 replaces the distorted voice information 61' with the synthesized voice signal from the voice synthesis means 53 and outputs it to the voice output means 56. When the measured distortion amount is less than the preset threshold, so that the voice can be regarded as undistorted, it outputs the received voice information 61' to the voice output means 56. The voice output means 56 performs electro-acoustic conversion on the voice information 61' or the synthesized voice signal input from the voice complementing means 55, both of which are analog voice signals, and emits it as the received voice of the telephone 5.

As described above, according to this embodiment, when the voice information 61', which is the analog voice signal in the telephone voice signal (integrated data 7') received by the receiving-side telephone 5, is not distorted, the sender's own transmitted voice from the transmitting-side telephone 1 is emitted. Only when the voice information 61' is distorted is the sound synthesized by the voice synthesis means 53 emitted. Consequently, the conversation can continue without interruption even when the quality of the communication line is poor or the bandwidth is insufficient. In addition, because the synthesized sound is produced with a voice quality close to the sender's voice, this embodiment has the further effect that, when the voice information 61' is distorted, the synthesized sound replaces the sender's voice without sounding unnatural to the receiver.

The present invention is not limited to the above embodiments. For example, the voice quality information 71 defines the voice quality of the received voice, and instead of the sender's voice quality as in the above embodiment, voice quality information preferred by the sender or the receiver may be set in advance so that the synthesized sound generated by the receiving-side telephone 5 has that preferred voice quality. The invention can also be applied not only to telephone lines but to uses such as mobile phones and wireless environments, and further to uses such as conversation in noisy environments. In other words, the invention is applicable not only to fixed telephones but to telephone terminals in general, such as mobile phones and Personal Handy-phone System (PHS) terminals.

FIG. 1 is a system configuration diagram of one embodiment of the present invention.
FIG. 2 is a diagram showing an example of the format of the integrated data transmitted and received in one embodiment of the present invention.
FIG. 3 is a system configuration diagram illustrating a specific example of one embodiment of the present invention.
FIG. 4 is a diagram showing an example of the format of the integrated data transmitted and received in another embodiment of the present invention.

Explanation of symbols

1 Transmitting-side telephone
2, 4 Exchange
3 Public network
5 Receiving-side telephone
6, 7 Transmitted integrated data
6', 7' Received integrated data
11 Voice input means
12 Voice recognition means
13 Voice/character information integrating means
14 Integrated data transmitting means
51 Integrated data receiving means
52 Voice/character information separating means
53 Voice synthesis means
54 Voice distortion analysis means
55 Voice complementing means
56 Voice output means
61, 61' Voice information
62 Character information
71 Voice quality information

Claims

In a telephone voice recognition system in which a receiving-side telephone terminal receives a voice recognition signal obtained by voice recognition of a transmission voice signal at a transmitting-side telephone terminal and restores the transmission voice signal by voice synthesis of the voice recognition signal,
The transmitting-side telephone terminal comprises means for generating and transmitting integrated data consisting of character information, which is a digital signal for each single sound of the transmitted voice obtained by performing voice recognition on the transmission voice signal, and voice information, which is an analog voice signal for each single sound of the transmission voice signal; and
The receiving-side telephone terminal comprises: separation/selection means for receiving the integrated data, separating it into the character information and the voice information, selecting a voice synthesis signal obtained by voice synthesis of the character information when the distortion amount of the separated voice information is equal to or greater than a set threshold, and selecting the separated voice information when the distortion amount is less than the threshold; and voice output means for emitting the selected voice synthesis signal or voice information as the transmitted voice.

In a telephone voice recognition system in which a receiving-side telephone terminal receives a voice recognition signal obtained by voice recognition of a transmission voice signal at a transmitting-side telephone terminal and restores the transmission voice signal by voice synthesis of the voice recognition signal,
The transmitting-side telephone terminal comprises:
Voice input means for capturing the transmitted voice as a transmission voice signal;
Voice recognition means for obtaining, from the transmission voice signal by voice recognition processing, character information which is a digital signal for each single sound of the transmitted voice;
Integration means for generating integrated data in which the analog voice signal for each single sound of the transmission voice signal output from the voice input means, taken as voice information, is integrated in predetermined units with the character information output from the voice recognition means; and
Transmitting means for transmitting the integrated data to the other party's telephone terminal;
The receiving-side telephone terminal comprises:
Receiving means for receiving the integrated data;
Separating means for separating the integrated data received by the receiving means into the voice information and the character information;
Distortion analysis means for measuring the distortion amount of the separated voice information;
Voice synthesis means for performing voice synthesis processing on the separated character information and outputting a voice synthesis signal;
Complementing means for selecting the voice synthesis signal when the distortion amount measured by the distortion analysis means is equal to or greater than a set threshold, and selecting the voice information whose distortion amount was measured when the distortion amount is less than the threshold; and
Voice output means for emitting, as sound, the voice synthesis signal or the voice information output from the complementing means.

In a telephone voice recognition system in which a receiving-side telephone terminal receives a voice recognition signal obtained by voice recognition of a transmission voice signal at a transmitting-side telephone terminal and restores the transmission voice signal by voice synthesis of the voice recognition signal,
The transmitting-side telephone terminal comprises means for generating and transmitting integrated data consisting of character information, which is a digital signal for each single sound of the transmitted voice obtained by voice recognition of the transmission voice signal; voice information, which is an analog voice signal for each single sound of the transmission voice signal; and voice quality information, which is a digital signal indicating a voice waveform of the sender's voice quality or of a preferred voice quality; and
The receiving-side telephone terminal comprises: separation/selection means for receiving the integrated data, separating it into the character information, the voice information, and the voice quality information, selecting, when the distortion amount of the separated voice information is equal to or greater than a set threshold, a voice synthesis signal obtained by voice synthesis processing that renders the character information as a synthesized sound approximating the voice quality indicated by the voice quality information, and selecting the separated voice information when the distortion amount is less than the threshold; and voice output means for emitting the selected voice synthesis signal or voice information as the transmitted voice.

In a telephone voice recognition system in which a receiving-side telephone terminal receives a voice recognition signal obtained by voice recognition of a transmission voice signal at a transmitting-side telephone terminal and restores the transmission voice signal by voice synthesis of the voice recognition signal,
The transmitting-side telephone terminal comprises:
Voice input means for capturing the transmitted voice as a transmission voice signal;
Voice recognition means for obtaining, from the transmission voice signal by voice recognition processing, character information which is a digital signal for each single sound of the transmitted voice;
Integration means for generating integrated data in which voice information, which is an analog voice signal for each single sound of the transmission voice signal output from the voice input means, the character information output from the voice recognition means, and voice quality information, which is a digital signal indicating a voice waveform of the sender's voice quality or of a preferred voice quality, are integrated in predetermined units; and
Transmitting means for transmitting the integrated data to the other party's telephone terminal;
The receiving-side telephone terminal comprises:
Receiving means for receiving the integrated data;
Separating means for separating the integrated data received by the receiving means into the voice information, the character information, and the voice quality information;
Distortion analysis means for measuring the distortion amount of the separated voice information;
Voice synthesis means for outputting, based on the separated character information and voice quality information, a voice synthesis signal of a synthesized sound in which the character information is rendered so as to approximate the voice quality indicated by the voice quality information;
Complementing means for selecting the voice synthesis signal when the distortion amount measured by the distortion analysis means is equal to or greater than a set threshold, and selecting the voice information whose distortion amount was measured when the distortion amount is less than the threshold; and
Voice output means for emitting, as sound, the voice synthesis signal or the voice information output from the complementing means.