JP2007336395A

JP2007336395A - Voice processor and voice communication system

Info

Publication number: JP2007336395A
Application number: JP2006168101A
Authority: JP
Inventors: Masato Suzuki; 真人鈴木
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-06-16
Filing date: 2006-06-16
Publication date: 2007-12-27

Abstract

PROBLEM TO BE SOLVED: To provide a voice processor and a voice communication system which enable a listener to easily catch keywords emphasized by a speaker, without registering keywords of various languages in a keyword dictionary beforehand. SOLUTION: To each region of an input voice signal where a change in acoustic feature is relatively large, the voice processor 42 applies emphasis processing for increasing the change in acoustic feature further, and outputs it. Accordingly, since each keyword being a part which the speaker wants to tell is emphasized further, even when a conversation is made between remote places without looking at a figure of each partner, even when a language used for the conversation is unfamiliar to them, or the like, the listener can surely comprehend contents which the speaker wants to tell. Moreover, keyword registration work is not required, and the voice processor and the voice communication system can be used in each country, since the voice processor and the voice communication system are so structured that a keyword dictionary can be dispensed with, and keywords can be extracted without any problems irrespective of language. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a voice processing device and a voice communication system used when holding a conversation by exchanging voice signals.

Conventionally, various voice communication systems have been devised in which a plurality of sites are connected via a network to hold a voice conference or chat (see Patent Document 1).

For example, Patent Document 1 discloses a system in which each speaker individually connects a personal computer, which serves as a voice communication device, to a network, and the speakers confer with one another in a virtual conference room. In this system, each speaker individually operates the voice communication device to adjust the sound quality, volume, and acoustics of the received voice signal before it is emitted, so that each speaker can take part in the conference with a sense of presence.

However, in a multipoint voice conference, voice volume and speaking speed differ from speaker to speaker, so a particular speaker may be hard to hear. For example, in the voice communication system of Patent Document 1, when a particular speaker's voice is too quiet, the sound quality, volume, and acoustics of the received voice signal can be adjusted to make it easier to hear. However, because the received voice signal cannot be adjusted individually for each speaker, the voices of the other speakers become louder along with that of the particular speaker. In addition, when a speaker talks too fast to be understood, the speech speed cannot be adjusted.

On the other hand, as a device for solving such problems, there has conventionally been a speech enhancement device that can emphasize important words (keywords) in input speech (see Patent Document 2).
Cited patent documents: JP-A-8-125761; JP-A-5-27792

In general, regardless of language, people habitually vary acoustic feature amounts such as speech speed, volume, and pitch for the part they want to convey in a conversation (the keyword), for example by speaking it slowly and loudly, so that the acoustic feature amount becomes larger or smaller than in other regions. For example, when a speaker says "I love apples.", the speaker tries to convey a keyword to the listener by making the acoustic feature amount of one of the keywords "I", "apples", or "love" larger or smaller. What the speaker wants to convey changes depending on which word has its acoustic feature amount raised or lowered. Therefore, a listener who can catch the keyword whose acoustic feature amount has changed can roughly grasp what the speaker wants to convey. However, in a voice conference where the other party's face cannot be seen, or when the conversation is in a language unfamiliar to the listener, the listener may miss that part even though the speaker has changed its acoustic feature amount, and so fail to grasp what the speaker wants to convey.

A conventional speech enhancement device emphasizes keywords, so the listener can catch them. However, the keywords must be registered in a keyword dictionary in advance, and words that are not registered in the dictionary are not emphasized. Moreover, in a conference or voice chat the keywords change with the flow of the conversation. Therefore, when a conventional speech enhancement device is applied to a conventional voice communication system and the keyword is a word that is not registered in the keyword dictionary, no emphasis processing is applied to that word, and the listener may miss the keyword and fail to grasp what the speaker wants to convey.

Furthermore, to make a conventional speech enhancement device support various languages, keywords in those languages must be registered in the keyword dictionary, which is laborious.

Accordingly, an object of the present invention is to provide a voice processing device and a voice communication system that allow a listener to easily catch the keywords emphasized by a speaker without registering keywords of various languages in a keyword dictionary in advance.

The present invention has the following configurations as means for solving the above problems.

(1) The voice processing device is characterized by comprising:
feature amount extraction means for analyzing an acoustic feature amount of an input voice signal and extracting a region in which the acoustic feature amount is relatively large or relatively small compared with other regions; and
emphasis means for applying, to the region of the voice signal extracted by the feature amount extraction means, emphasis processing that makes the feature amount still larger when it is relatively large and still smaller when it is relatively small, and outputting the result.

In general, so that the listener can easily understand the part the speaker wants to convey during a conversation (the keyword), a speaker utters the keyword, regardless of language, with acoustic feature amounts such as speech speed, volume, and pitch made relatively larger or smaller than in other regions. For example, the speaker says the keyword slowly and loudly. However, in a voice conference where the other party's face cannot be seen, or when the conversation is in a language unfamiliar to the listener, the listener may miss that part even though its acoustic feature amount has changed.

In this configuration, the voice processing device applies, to a region of the input voice signal in which the acoustic feature amount is relatively large or relatively small, emphasis processing that makes the acoustic feature amount still larger or still smaller, and outputs the result. Therefore, even when the conversation takes place between remote sites without the parties seeing each other, or when the listener is unfamiliar with the language used, the acoustic feature amount of the keyword the speaker wants to convey is emphasized by being made larger or smaller, so the listener can reliably understand what the speaker wants to convey.
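The rule in configuration (1) can be sketched on a sequence of per-region feature values. This is only an illustration under stated assumptions: the function name `emphasize`, the use of the mean as the reference for "relatively large or small", and the parameters `tolerance` and `factor` are hypothetical and are not specified in the source.

```python
from statistics import mean


def emphasize(features, tolerance=0.2, factor=1.5):
    """Push values that stand out from the average further away from it.

    features  -- one acoustic feature value per region (assumed layout)
    tolerance -- fraction of the mean treated as "not standing out" (assumption)
    factor    -- how strongly a deviation is exaggerated (assumption)
    """
    avg = mean(features)
    band = tolerance * abs(avg)
    result = []
    for value in features:
        deviation = value - avg
        if abs(deviation) > band:
            # Relatively large values grow, relatively small values shrink.
            result.append(avg + factor * deviation)
        else:
            # Regions close to the average are passed through unchanged.
            result.append(value)
    return result


if __name__ == "__main__":
    print(emphasize([10.0, 10.0, 16.0, 10.0, 4.0]))
```

With the sample input, the region at 16.0 is raised and the region at 4.0 is lowered, while the regions at the average are left as they are.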

(2) The feature amount extraction means comprises low-speed region extraction means for analyzing the speech speed of the voice signal and extracting a region whose speech speed is relatively low compared with other regions, and
the emphasis means applies, to the region extracted by the low-speed region extraction means, emphasis processing that lowers the speech speed further, and outputs the result.

In this configuration, the voice processing device applies, to a region of the input voice signal whose speech speed is relatively low (slow) compared with other regions, emphasis processing that lowers the speech speed further, and outputs the result. Because the speech speed of the keyword the speaker wants to convey is lowered further, the listener can reliably catch the keyword and easily grasp what the speaker wants to convey.
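As a rough illustration of lowering the speech speed in an extracted region, the sketch below simply repeats the frames inside that region. The source does not say how the speed is reduced; frame repetition is a deliberately naive stand-in chosen here, and `slow_down_region`, `start`, `end`, and `repeat` are hypothetical names.

```python
def slow_down_region(frames, start, end, repeat=2):
    """Stretch frames[start:end] in time by emitting each frame `repeat` times.

    frames -- list of short audio frames (any objects); assumed layout
    start, end -- half-open index range of the extracted low-speed region
    """
    stretched = []
    for index, frame in enumerate(frames):
        copies = repeat if start <= index < end else 1
        stretched.extend([frame] * copies)
    return stretched


if __name__ == "__main__":
    print(slow_down_region(["a", "b", "c", "d"], 1, 3))
```

A practical implementation would more likely use a time-scale modification method that preserves pitch and avoids audible discontinuities; that choice is outside what the source describes.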

(3) The feature amount extraction means comprises high-volume region extraction means for analyzing the volume of the voice signal and extracting a region whose volume is relatively high compared with other regions, and
the emphasis means applies, to the region extracted by the high-volume region extraction means, emphasis processing that raises the volume further, and outputs the result.

In this configuration, the voice processing device applies, to a region of the input voice signal whose volume is relatively high compared with other regions, emphasis processing that raises the volume further, and outputs the result. Because the volume of the keyword the speaker wants to convey is raised further, the listener can reliably catch the keyword and easily grasp what the speaker wants to convey.
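Raising the volume of an extracted region can be sketched as applying a gain to the samples in that region. The gain value, the clipping limit, and the name `boost_region` are assumptions made for this example; the source only states that the volume is increased further.

```python
def boost_region(samples, start, end, gain=1.5, limit=1.0):
    """Return a copy of `samples` with samples[start:end] amplified.

    samples -- floating-point samples, assumed to lie in [-limit, limit]
    gain    -- amplification applied inside the region (assumption)
    limit   -- clipping bound that keeps the output in range (assumption)
    """
    boosted = list(samples)
    for index in range(max(start, 0), min(end, len(boosted))):
        amplified = boosted[index] * gain
        # Clip so that the amplified region never exceeds the valid range.
        boosted[index] = max(-limit, min(limit, amplified))
    return boosted


if __name__ == "__main__":
    print(boost_region([0.1, 0.4, -0.8, 0.2], 1, 3, gain=2.0))
```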

(4) A voice communication system in which a server device and a plurality of voice communication devices are interconnected, characterized in that
each voice communication device comprises:
a microphone unit;
a speaker unit;
communication means for transmitting a voice signal picked up by the microphone unit to another voice communication device via the server device, and for receiving a voice signal transmitted by the server device; and
speaker control means for inputting the voice signal received by the communication means to the speaker unit,
and the server device comprises:
the voice processing device according to any one of claims 1 to 3; and
server communication means for receiving a voice signal that a voice communication device has transmitted for another voice communication device and inputting it to the voice processing device, and for transmitting the voice signal output by the voice processing device to the other voice communication device.

In this configuration, a voice communication device transmits the voice signal it has picked up to the server device; the server device applies, to regions of this voice signal where the change in the acoustic feature amount is relatively large, emphasis processing that makes the change still larger, and outputs the voice signal to another voice communication device. Because the server device performs the emphasis processing, each voice communication device can transmit and receive voice signals without any added load from emphasis processing.

(5) A voice communication system in which a plurality of voice communication devices are interconnected, characterized in that
each voice communication device comprises:
the voice processing device according to any one of claims 1 to 3;
a microphone unit;
a speaker unit;
communication means for transmitting a voice signal picked up by the microphone unit to another voice communication device, and for receiving a voice signal transmitted by another voice communication device and inputting it to the voice processing device; and
speaker control means for inputting the voice signal output by the voice processing device to the speaker unit.

In this configuration, the voice communication system applies emphasis processing to the voice signal sent from another voice communication device and emits the emphasized voice from the speaker unit. By using this voice communication system, the listener can therefore easily understand what the speaker wants to convey. Moreover, emphasis processing can be applied even to a voice signal sent from a voice communication device that has no voice emphasis means, so the listener can reliably catch the keywords and grasp what the speaker wants to convey.

(6) A voice communication system in which a plurality of voice communication devices are interconnected, characterized in that
each voice communication device comprises:
the voice processing device according to any one of claims 1 to 3;
a microphone unit that picks up voice and inputs a voice signal to the voice processing device;
a speaker unit;
communication means for transmitting the voice signal output by the voice processing device to another voice communication device, and for receiving a voice signal transmitted by another voice communication device; and
speaker control means for inputting the voice signal received by the communication means to the speaker unit.

In this configuration, the voice communication system applies, to regions of the voice signal picked up by the microphone unit where the change in the acoustic feature amount is relatively large, emphasis processing that makes the change still larger, and then transmits the voice signal to another voice communication device. By using this voice communication system, the listener can therefore easily understand what the speaker wants to convey.

Moreover, an emphasized voice signal can be transmitted even to a voice communication device that has no voice emphasis means, so the speaker can reliably get the intended content across to the listener.

The voice processing device of the present invention applies, to a region of the input voice signal in which the acoustic feature amount is relatively large or relatively small compared with other regions, emphasis processing that makes the acoustic feature amount still larger or still smaller, and outputs the result. Therefore, even when the conversation takes place between remote sites without the parties seeing each other, or when the listener is unfamiliar with the language used, the keyword the speaker wants to convey is emphasized further, so the listener can easily understand what the speaker wants to convey.

When the voice communication system of the present invention is configured with a server device and a plurality of voice communication devices that are interconnected, the server device includes the voice processing device and applies, to a region of the voice signal in which the feature amount is relatively large or relatively small compared with other regions, emphasis processing that makes the feature amount still larger or still smaller, and outputs the result. When the system is configured with a plurality of interconnected voice communication devices, each voice communication device includes the voice processing device, and the emphasis processing is performed either on the receiving side or on the transmitting side of the voice signal. By using this voice communication system, the listener can therefore easily understand what the speaker wants to convey.

In addition, the voice processing device and voice communication system of the present invention do not require a keyword dictionary and can extract keywords in any language without difficulty, so no keyword registration work is needed and they can be used in any country.

In the present invention, when a speaker's voice signal is input, the acoustic feature amounts of the voice signal (speech speed, volume, pitch, and so on) are analyzed, and a region in which an acoustic feature amount is relatively large or relatively small compared with the other parts of the voice signal is extracted. Emphasis processing is then applied that makes the feature amount still larger when it is relatively large and still smaller when it is relatively small, and the result is output.

Specifically, the voice processing device of the present invention detects changes in speech speed and volume as the acoustic feature amounts of the voice signal, extracts regions where the speech speed has dropped relatively or the volume has risen relatively, and lowers the speech speed of the voice signal in those regions further or raises its volume further. The feature amount of the keyword that the speaker varied in order to convey it is thereby emphasized further, so the listener can catch the keyword even when the speaker's face cannot be seen in a remote conference or when the conversation is in an unfamiliar language. Because the listener hears the keyword with the change in its feature amount emphasized, the listener can reliably grasp what the speaker wants to convey.

FIG. 1 is a conceptual diagram for explaining the processing performed by the voice processing device of the present invention. When the speaker says "I love apples." (「私はりんごが大好きです。」), the device extracts the region in which the speaker noticeably changed the intonation and spoke slowly and loudly, and emphasizes that region further. That is, when the intonation (acoustic feature amount) of the "I" (「私は」) portion is changed as in FIG. 1(A), the speech speed of this portion is relatively lower and its volume relatively higher than in the other portions. The voice processing device of the present invention further lowers the speech speed and further raises the volume of the word "I", which is the region whose acoustic feature amount has changed relative to the other regions. Likewise, when the intonation of the "apples" (「りんごが」) portion is changed as in FIG. 1(B), the speech speed of this portion is lowered further and its volume is raised further. When the intonation of the "dai" portion of "daisuki" ("love", 「大（好き）」) is changed as in FIG. 1(C), the speech speed of this portion is similarly lowered further and its volume is raised further.

FIG. 2 is a configuration diagram showing an outline of the voice communication system. Applying the voice processing device of the present invention to a voice communication system for holding a voice conference or voice chat among a plurality of sites gives the configurations described below. FIG. 2 shows the voice signal being transmitted in one direction, but the voice signal is naturally also transmitted in the opposite direction, so in the following description the reference numerals for transmission in the opposite direction are given in parentheses. FIG. 2 shows an example in which telephones are used as the voice communication devices that make up the voice communication system.

When a server device 13 and a plurality of telephones 11 and 12 are interconnected via a network 10, as in the voice communication system 1 shown in FIG. 2(A), the voice processing device of the present invention is preferably provided in the server device 13. In this configuration, the telephone 11 (12) transmits the voice signal input from the handset 11J (12J) to the server device 13 via the network 10. The voice processing device 42 built into the server device 13 performs the emphasis processing, and the server device 13 transmits the emphasized voice signal to the telephone 12 (11) via the network 10. On receiving the emphasized voice signal, the telephone 12 (11) emits it from the handset 12J (11J).

When a plurality of telephones 14 and 15 are interconnected via the network 10, as in the voice communication system 2 shown in FIG. 2(B), each of the telephones 14 and 15 is preferably provided with a voice processing device that applies the emphasis processing to the received voice signal. In this configuration, the telephone 14 (15) transmits the voice signal input from the handset 14J (15J) to the telephone 15 (14) via the network 10. On receiving the voice signal, the telephone 15 (14) performs the emphasis processing with its built-in voice processing device 42 and emits the voice from the handset 15J (14J).

When a plurality of telephones 16 and 17 are interconnected via the network 10, as in the voice communication system 3 shown in FIG. 2(C), each of the telephones 16 and 17 is preferably provided with a voice processing device that applies the emphasis processing to the voice signal to be transmitted. In this configuration, the telephone 16 (17) applies the emphasis processing, using its built-in voice processing device 42, to the voice signal input from the handset 16J (17J), and transmits the result to the telephone 17 (16) via the network 10. On receiving the voice signal, the telephone 17 (16) emits the voice from the handset 17J (16J).
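The three arrangements of FIG. 2 differ only in where the emphasis processing runs along the path of the voice signal. The sketch below models one direction of transmission; the function `deliver`, the `placement` labels, and the hop names are hypothetical, and the `emphasize` callable stands in for the voice processing device 42.

```python
def deliver(signal, placement, emphasize):
    """Send `signal` one way and apply `emphasize` at the chosen node.

    placement -- "server" (FIG. 2(A)), "receiver" (FIG. 2(B)),
                 or "sender" (FIG. 2(C)); labels are assumptions
    Returns the signal heard by the listener and the hops it passed through.
    """
    if placement not in ("server", "receiver", "sender"):
        raise ValueError(f"unknown placement: {placement}")

    hops = ["sender phone", "network"]
    if placement == "sender":
        # System 3: emphasized before it leaves the speaker's telephone.
        signal = emphasize(signal)
    if placement == "server":
        # System 1: relayed through the server, which does the processing.
        hops += ["server", "network"]
        signal = emphasize(signal)
    hops.append("receiver phone")
    if placement == "receiver":
        # System 2: emphasized after arrival at the listener's telephone.
        signal = emphasize(signal)
    return signal, hops


if __name__ == "__main__":
    double = lambda samples: [2 * s for s in samples]
    for where in ("server", "receiver", "sender"):
        print(where, deliver([1, 2], where, double))
```

In every case the listener hears the same emphasized signal; what changes is which device carries the processing load and whether a telephone without a voice processing device can still benefit.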

Next, a specific configuration of the voice communication system 1 will be described. FIG. 3 is a block diagram showing the configuration of the voice communication system. The voice communication system 1 has a configuration in which the telephone 11, the telephone 12, and the server device 13 are connected to the network 10. FIG. 2(A) and FIG. 3 show two telephones 11 and 12 connected to the network 10, but this is only to simplify the description; more telephones (voice communication devices) can be connected to the network 10 to exchange voice signals.

The telephone 11 and the telephone 12 have the same configuration, each including a communication unit 21, a D/A converter 22, a sound emission amplifier 23, a speaker unit SP1, a microphone unit MIC1, a sound collection amplifier 24, an A/D converter 25, an operation unit 26, a display unit 27, and a control unit 28.

The control unit 28 controls each unit of the telephone 11 (12).

The communication unit 21 converts the voice signal received from the server device 13 via the network 10 out of the data format (protocol) used on the network and outputs it to the D/A converter 22.

The D/A converter 22 converts the voice signal sent from the communication unit 21 into analog form and outputs it to the sound emission amplifier 23.

The sound emission amplifier 23 amplifies the voice signal and supplies it to the speaker unit SP1 (SP2).

The speaker unit SP1 (SP2) converts the voice signal supplied from the sound emission amplifier 23 into sound and emits it.

The microphone unit MIC1 (MIC2) picks up the voice uttered by the speaker, converts it into a voice signal, which is an electrical signal, and outputs this voice signal to the sound collection amplifier 24.

The sound collection amplifier 24 amplifies the voice signal and outputs it to the A/D converter 25.

The A/D converter 25 converts the analog voice signal input from the sound collection amplifier 24 into a digital voice signal and outputs it to the communication unit 21.

The communication unit 21 converts the voice signal output from the A/D converter 25 into the data format (protocol) used on the network and transmits it to the server device 13 via the network 10. Destination information is attached to the voice signal to indicate which of the voice communication devices connected to the network 10 the signal is to be sent to via the server device 13.
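The role of the destination information can be sketched as a small packet structure that the server uses to forward the processed signal. The source does not define a data format, so `VoicePacket`, its field names, the `relay` function, and the `outboxes` dictionary are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoicePacket:
    destination: str  # identifier of the target voice communication device
    samples: tuple    # digitized voice signal (assumed representation)


def relay(packet, process, outboxes):
    """Process a packet on the server and queue it for its destination.

    process  -- callable standing in for the voice processing device
    outboxes -- dict mapping a destination to its list of pending packets
    """
    processed = VoicePacket(
        destination=packet.destination,
        samples=tuple(process(list(packet.samples))),
    )
    outboxes.setdefault(packet.destination, []).append(processed)
    return processed


if __name__ == "__main__":
    queues = {}
    relay(VoicePacket("telephone-12", (1, 2, 3)), lambda s: [2 * x for x in s], queues)
    print(queues)
```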

The server device 13 includes a communication unit 41, a voice processing device 42, and a control unit 43. The voice processing device 42 in turn includes an intonation analysis unit 51, a feature extraction unit 52, an enhancement region determination unit 53, and an enhancement processing unit 54.

The intonation analysis unit 51 includes a speech speed analysis unit 511 and a volume analysis unit 512.

The feature extraction unit 52 includes a low speed region extraction unit 521 and a large volume region extraction unit 522.

The control unit 43 controls each unit of the server device 13.

The communication unit 41 receives, via the network 10, the input audio signal from the telephone 11 or the telephone 12, converts it out of the network data format (protocol), and outputs it to the speech speed analysis unit 511 and the volume analysis unit 512 of the intonation analysis unit 51.

The speech speed analysis unit 511 analyzes the speech speed of the audio signal sent from the communication unit 41 and, when the speech speed drops, outputs a signal to the low speed region extraction unit 521 of the feature extraction unit 52. Specifically, the speech speed analysis unit 511 segments the audio signal at the phoneme level, extracts the consonants or vowels, and analyzes the length and spacing of each sound. It then calculates the average length and spacing; if the length and spacing of each sound stay within a fixed range of the average and the speech speed is faster than a preset speech speed threshold, it judges the speech speed to be stable. If, on the other hand, the length or spacing of the sounds temporarily exceeds that range above the average and the speech speed falls below the threshold, the unit judges that the acoustic feature quantity has decreased and outputs a signal to that effect to the low speed region extraction unit 521.
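The judgment described above can be sketched as follows. This is only an illustration under stated assumptions: phoneme segmentation is taken as already done, the local speech speed is approximated as the reciprocal of each segment's duration, and the function name `slow_segments`, the `tolerance` fraction, and the `rate_threshold` value are hypothetical choices, not values given in the patent.

```python
from statistics import mean


def slow_segments(durations, tolerance=0.3, rate_threshold=5.0):
    """Return indices of phoneme-level segments judged to be slowed down.

    durations: length in seconds of each segment (must be positive).
    tolerance: allowed fractional deviation above the mean duration (assumed).
    rate_threshold: speech speed threshold in segments per second (assumed).
    """
    avg = mean(durations)
    upper = avg * (1.0 + tolerance)  # upper edge of the "fixed range" around the mean
    slow = []
    for i, d in enumerate(durations):
        rate = 1.0 / d  # local speech speed in segments per second
        # Flag only when the sound is longer than the range AND the speed is below threshold.
        if d > upper and rate < rate_threshold:
            slow.append(i)
    return slow
```

For example, with durations of 0.1 s for most segments and one segment of 0.4 s, only the long segment is flagged, since it both exceeds the range around the mean and falls below the assumed speed threshold.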

The volume analysis unit 512 analyzes the volume of the audio signal sent from the communication unit 41 and, when the volume increases, outputs a signal to the large volume region extraction unit 522 of the feature extraction unit 52. Specifically, the volume analysis unit 512 segments the audio signal at the phoneme level, extracts the consonants or vowels, and analyzes the volume level of each sound. It then calculates the average volume level; if the volume level of each sound stays within a fixed range of the average and the volume is lower than a preset volume threshold, it judges the volume to be stable. If, on the other hand, the volume level of the sounds temporarily exceeds that range above the average and the volume rises above the threshold, the unit judges that the acoustic feature quantity has increased and outputs a signal to that effect to the large volume region extraction unit 522.
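A comparable sketch for the volume judgment is shown below. The patent does not say how the volume level is measured; this example assumes an RMS level in decibels relative to full scale, and the names `rms_db` and `loud_segments` as well as the `margin_db` and `threshold_db` defaults are hypothetical.

```python
import math


def rms_db(samples):
    """RMS level of one segment in dB relative to full scale (samples in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-9))  # floor avoids log10(0) on pure silence


def loud_segments(segments, margin_db=6.0, threshold_db=-12.0):
    """Return indices of segments judged to be unusually loud.

    segments: list of sample lists, one per phoneme-level segment.
    margin_db: allowed deviation above the mean level (assumed value).
    threshold_db: absolute volume threshold (assumed value).
    """
    levels = [rms_db(seg) for seg in segments]
    avg = sum(levels) / len(levels)
    # Flag only when the level exceeds both the range around the mean and the threshold.
    return [i for i, lv in enumerate(levels)
            if lv > avg + margin_db and lv > threshold_db]
```

Requiring both conditions mirrors the text: a segment counts as loud only when it stands out from the speaker's own average and also exceeds the absolute threshold.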

While the signal from the speech speed analysis unit 511 is being received, the low speed region extraction unit 521 extracts the audio signal and outputs the extracted region to the enhancement region determination unit 53 as a low speech speed region.

While the signal from the volume analysis unit 512 is being received, the large volume region extraction unit 522 extracts the audio signal and outputs the extracted region to the enhancement region determination unit 53 as a large volume region.

The enhancement region determination unit 53 judges that any region of the audio signal in which the low speech speed region input from the low speed region extraction unit 521 overlaps the large volume region input from the large volume region extraction unit 522 is a portion the speaker is emphasizing, and outputs a control signal to the enhancement processing unit 54 so that enhancement processing is applied to that region.
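The overlap judgment amounts to intersecting two sets of time intervals. The sketch below assumes each region is represented as a `(start, end)` pair in seconds and that each input list is sorted and non-overlapping; this representation and the name `overlap_regions` are assumptions for illustration.

```python
def overlap_regions(slow, loud):
    """Intersect two sorted lists of non-overlapping (start, end) intervals in seconds."""
    result = []
    i = j = 0
    while i < len(slow) and j < len(loud):
        start = max(slow[i][0], loud[j][0])
        end = min(slow[i][1], loud[j][1])
        if start < end:  # the two intervals genuinely overlap
            result.append((start, end))
        # Advance whichever interval finishes first.
        if slow[i][1] < loud[j][1]:
            i += 1
        else:
            j += 1
    return result
```

For instance, low speed regions `[(1.0, 3.0), (5.0, 6.0)]` and a large volume region `[(2.0, 5.5)]` give the enhancement targets `[(2.0, 3.0), (5.0, 5.5)]`.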

The enhancement processing unit 54 performs enhancement processing on the audio signal based on the control signal output from the enhancement region determination unit 53. That is, for the region of the audio signal designated by the control signal, it lowers the speech speed by a fixed amount and raises the volume by a fixed amount. For example, the enhancement processing unit 54 multiplies the speech speed of the audio signal by 0.7 and the volume by 1.5. For regions for which the enhancement region determination unit 53 outputs no control signal, the enhancement processing unit 54 neither lowers the speech speed nor raises the volume. Because lowering the speech speed in this way would otherwise introduce a time lag between the voice communication devices, the enhancement processing unit 54 also performs a well-known process that cuts silent regions of the audio signal, shortening it by an amount corresponding to the time spent on the slowed-down speech. The enhancement processing unit 54 then outputs the processed audio signal to the communication unit 41.
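The arithmetic behind this step can be sketched as follows. The example covers only the gain, the extra playback time created by the slowdown, and the compensation by trimming silences; the time stretching itself is omitted. The function names and the `min_keep` pause length are assumptions, while the factors 0.7 and 1.5 are the example values from the text.

```python
def apply_gain(samples, gain=1.5):
    """Scale samples (floats in [-1, 1]) and clip so the result stays in range."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def extra_duration(region_seconds, speed_factor=0.7):
    """Extra playback time created by playing a region at speed_factor times its speed."""
    return region_seconds / speed_factor - region_seconds


def trim_silences(silences, seconds_to_recover, min_keep=0.1):
    """Shorten silent gaps until seconds_to_recover has been removed.

    silences: durations of the silent gaps in seconds.
    min_keep: shortest pause left in place (assumed value, not from the patent).
    Returns (new_durations, seconds_still_owed).
    """
    remaining = seconds_to_recover
    trimmed = []
    for gap in silences:
        cut = min(max(gap - min_keep, 0.0), remaining)
        trimmed.append(gap - cut)
        remaining -= cut
    return trimmed, remaining
```

A region of 0.7 s played at 0.7 times its original speed lasts 1.0 s, so roughly 0.3 s has to be recovered from nearby silent gaps to keep the two ends of the call in step.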

The communication unit 41 converts the audio signal output from the enhancement processing unit 54 of the voice processing device 42 into a data format (protocol) suited to the network and transmits it, via the network 10, to the designated voice communication device, that is, the telephone 12 or the telephone 11.

Next, the operation of the telephone 11, the telephone 12, and the server device 13 of the voice communication system 1 is described with reference to a flowchart. FIG. 4 is a flowchart for explaining the operation of the server device.

In the voice communication system 1, when speaker A and speaker B converse using the telephone 11 and the telephone 12, the following processing is performed.

First, when speaker A begins to speak using the handset 11J of the telephone 11 and voice input from the microphone unit MIC1 starts (s1), the microphone unit MIC1 converts the collected voice into an audio signal, the sound collection amplifier 24 amplifies this audio signal, and the A/D converter 25 digitizes it and outputs it to the communication unit 21 (s2). The communication unit 21 sends the digitized audio signal out to the network 10 so that it is delivered to the telephone 12 by way of the server device 13 (s3). Then, once sound collection has ended, and provided the communication unit 21 is not receiving an audio signal from the server device 13 (s4), the telephone 11 repeats the processing from step s1.

When the communication unit 41 receives the audio signal from the telephone 11 (s11), the server device 13 analyzes the speech speed and the volume in the intonation analysis unit 51 of the voice processing device 42 (s12). If the speech speed has temporarily dropped outside the fixed range around its average and the volume level of the sounds has temporarily risen above the fixed range around its average (s13), the low speed region extraction unit 521 and the large volume region extraction unit 522 extract the corresponding portions of the audio signal, and the enhancement region determination unit 53 judges the region where they overlap to be a portion the speaker is emphasizing and outputs a control signal to the enhancement processing unit 54 so that enhancement processing is applied to that region (s14). The enhancement processing unit 54 performs enhancement processing on the audio signal based on the control signal output from the enhancement region determination unit 53. That is, for the region of the audio signal designated by the control signal, it lowers the speech speed by a fixed amount, raises the volume by a fixed amount, and outputs the result to the communication unit 41 (s15).

The communication unit 41 converts the audio signal output from the enhancement processing unit 54 into a data format (protocol) suited to the network and transmits it, via the network 10, to the designated voice communication device, the telephone 12 (s16).

When there is no voice input (s21) and the communication unit 21 receives the audio signal from the server device 13 (s24), the telephone 12 operates as follows: the D/A converter 22 converts the audio signal sent from the communication unit 21 into analog form and outputs it to the sound emission amplifier 23, and the sound emission amplifier 23 amplifies the audio signal and outputs it to the speaker unit SP2 (s25). The speaker unit SP2 converts the audio signal amplified by the sound emission amplifier 23 into sound and emits it (s26). When sound emission ends, the telephone 12 repeats the processing from step s21.

When speaker B, having finished listening to speaker A, begins to speak using the handset 12J of the telephone 12 and voice input from the microphone unit MIC2 starts (s21), the microphone unit MIC2 converts the collected voice into an audio signal, the sound collection amplifier 24 amplifies this audio signal, and the A/D converter 25 digitizes it and outputs it to the communication unit 21 (s22). The communication unit 21 sends the digitized audio signal out to the network 10 so that it is delivered to the telephone 11 by way of the server device 13 (s23). Then, once sound collection has ended, and provided the communication unit 21 is not receiving an audio signal from the server device 13 (s24), the telephone 12 repeats the processing from step s21.

When the communication unit 41 receives the audio signal from the telephone 12 (s11), the server device 13 performs steps s12 to s16 described above in the voice processing device 42. That is, it applies enhancement processing to the audio signal and transmits it, via the network 10, to the designated voice communication device, the telephone 11.

When there is no voice input (s1) and the communication unit 21 receives the audio signal from the server device 13 (s4), the telephone 11 operates as follows: the D/A converter 22 converts the audio signal sent from the communication unit 21 into analog form and outputs it to the sound emission amplifier 23, and the sound emission amplifier 23 amplifies the audio signal and outputs it to the speaker unit SP1 (s5). The speaker unit SP1 converts the audio signal amplified by the sound emission amplifier 23 into sound and emits it (s6). When sound emission ends, the telephone 11 repeats the processing from step s1.

In this way, in the voice communication system 1, the server device 13 applies the enhancement processing to the audio signals output by the telephone 11 and the telephone 12, so each device can carry out its processing efficiently.

Next, a specific configuration of the voice communication system 2 is described. FIG. 5 is a block diagram showing a configuration of the voice communication system different from that of FIG. 3. The voice communication system 2 has a configuration in which a telephone 14 and a telephone 15 are connected to the network 10. FIG. 2(B) and FIG. 5 show two telephones 14 and 15 connected to the network 10, but this is only to simplify the description; further telephones (voice communication devices) can be connected to the network 10 to exchange audio signals. In the following description, components that are the same as in the voice communication system 1 are given the same reference numerals, and their detailed description is omitted.

The telephone 14 and the telephone 15 have the configuration of the telephone 11 with the voice processing device 42 added between the communication unit 21 and the D/A converter 22. When the telephone 14 (15) receives, at the communication unit 21, the audio signal transmitted from the telephone 15 (14), the voice processing device 42 analyzes the speech speed and volume of the audio signal as described above. Where the speech speed is relatively low and the volume is relatively high, it applies enhancement processing to that region, lowering the speech speed further and raising the volume further. The audio signal is then converted to analog form and amplified, and the sound is emitted from the speaker unit SP1 (SP2).

Accordingly, in the voice communication system 2, the listener can easily understand what the speaker wants to convey. Moreover, enhancement processing can be applied even to an audio signal sent from a voice communication device that has no voice processing device 42, so the listener can reliably catch the keywords and grasp what the speaker wants to convey.

Next, a specific configuration of the voice communication system 3 is described. FIG. 6 is a block diagram showing a configuration of the voice communication system different from those of FIGS. 3 and 5. The voice communication system 3 has a configuration in which a telephone 16 and a telephone 17 are connected to the network 10. FIG. 2(C) and FIG. 6 show two telephones 16 and 17 connected to the network 10, but this is only to simplify the description; further telephones (voice communication devices) can be connected to the network 10 to exchange audio signals. In the following description, components that are the same as in the voice communication system 1 are given the same reference numerals, and their detailed description is omitted.

The telephone 16 and the telephone 17 have the configuration of the telephone 11 with the voice processing device 42 added between the A/D converter 25 and the communication unit 21. In the telephone 16 (17), the voice processing device 42 applies enhancement processing to those regions of the audio signal, input from the microphone unit MIC1 (MIC2) and digitized by the A/D converter 25, in which the speech speed is relatively low and the volume is relatively high. The communication unit 21 then transmits the signal to the telephone 17 (16) via the network 10. On receiving the audio signal, the telephone 17 (16) converts it to analog form, amplifies it, and emits the sound from the speaker unit SP2 (SP1).

Accordingly, in the voice communication system 3, the listener can easily understand what the speaker wants to convey. Moreover, an audio signal that has already undergone enhancement processing can be sent to a voice communication device that has no voice processing device 42, so the speaker can reliably get the intended content across to the listener.

FIG. 7 is an external view of the audio conference apparatus. FIG. 8 is a functional block diagram of the audio conference apparatus. FIG. 9 is a configuration diagram showing a case where audio conference apparatuses are used as the voice communication devices of the voice communication system shown in FIG. 2(A).

The voice communication systems 1 to 3 shown in FIG. 2 use telephones as the voice communication devices, but the present invention is not limited to this, and other voice communication devices may be used. For example, in place of the telephones 11 and 12 shown in FIG. 2(A), audio conference apparatuses 61 and 62, each equipped with microphone arrays and a speaker array, can be used as the voice communication devices. As shown in FIG. 7, each of the audio conference apparatuses 61 and 62 has a microphone array MA1 consisting of microphone units MIC101 to MIC116 on the front side face 72M of a housing 72, and a microphone array MA2 consisting of microphone units MIC201 to MIC216 on the rear side face 72U of the housing 72. A speaker array SA1 consisting of speaker units SP301 to SP312 is provided on the lower face 72K of the housing 72.

As shown in FIG. 8, each of the audio conference apparatuses 61 and 62 includes an operation unit 74, a display unit 77, a control unit 110, an input/output connector 81, an input/output I/F (interface) 112, a sound emission directivity control unit 113, D/A converters 114 (114-1 to 114-12), sound emission amplifiers 115 (115-1 to 115-12), the speaker units SP301 to SP312 forming the speaker array SA1, the microphone units MIC101 to MIC116 and MIC201 to MIC216 forming the microphone arrays MA1 and MA2, sound collection amplifiers 116 (116-1 to 116-32), A/D converters 117 (117-1 to 117-32), a sound collection beam generation unit 118, a sound collection beam selection unit 119, and an echo cancellation unit 120.

The control unit 110 of each of the audio conference apparatuses 61 and 62 converts the audio signal from the partner apparatus, input through the input/output I/F 112, from network-format data into an ordinary audio signal and outputs it to the sound emission directivity control unit 113 via the echo cancellation unit 120. It also obtains the direction data attached to the input audio signal and performs sound emission control on the sound emission directivity control unit 113.

The sound emission directivity control unit 113 generates the emission audio signals for the speaker units SP301 to SP312 according to the content of the sound emission control. These emission audio signals are formed by applying signal control processing such as delay control and amplitude control to the input audio signal. The D/A converters 114 (114-1 to 114-12) convert the digital emission audio signals into analog form, the sound emission amplifiers 115 (115-1 to 115-12) amplify them and supply them to the speaker units SP301 to SP312, and the speaker units SP301 to SP312 convert them into sound and emit it. In this way, the voices of the conference participants at the partner apparatus connected over the network are emitted to the conference participants at the local apparatus.

The microphone units MIC101 to MIC116 and MIC201 to MIC216 pick up the surrounding sound, including the utterances of the conference participants at the local apparatus, convert it into electrical signals, and generate collected audio signals.

The sound collection beam generation unit 118 applies delay processing and the like to the signals collected by the microphone units MIC101 to MIC116 and MIC201 to MIC216 and generates sound collection beam audio signals MB1 to MB8, each having strong directivity in a given direction. The sound collection beam audio signals MB1 to MB8 are set so that each has strong directivity in a different direction.
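The patent says only that delay processing "and the like" is applied to form each beam. One common technique consistent with that description is delay-and-sum beamforming, sketched below with integer sample delays; the function name and the choice of delay-and-sum are assumptions for illustration, not the method confirmed by the source.

```python
def delay_and_sum(channels, delays):
    """Form one beam from several microphone channels.

    channels: list of equal-length sample lists, one per microphone.
    delays: non-negative integer delay in samples applied to each channel,
            chosen so that sound from the target direction lines up in time.
    """
    length = len(channels[0])
    beam = [0.0] * length
    for samples, delay in zip(channels, delays):
        for n in range(delay, length):
            beam[n] += samples[n - delay]  # shift the channel later by `delay` samples
    return [v / len(channels) for v in beam]  # average so aligned sound keeps unit gain
```

When the delays match the arrival-time differences for a given direction, sound from that direction adds coherently while sound from other directions is attenuated. Generating MB1 to MB8 would correspond to running this with eight different delay sets.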

As shown in FIG. 9, in the audio conference apparatus 61, MB1 is set to direction Dir11, MB2 to Dir12, MB3 to Dir13, MB4 to Dir14, MB5 to Dir15, MB6 to Dir16, MB7 to Dir17, and MB8 to Dir18. Likewise, in the audio conference apparatus 62, MB1 is set to direction Dir21, MB2 to Dir22, MB3 to Dir23, MB4 to Dir24, MB5 to Dir25, MB6 to Dir26, MB7 to Dir27, and MB8 to Dir28.

The sound collection beam selection unit 119 compares the signal strengths of the sound collection beam audio signals MB1 to MB8, selects the strongest one, and outputs it to the echo cancellation unit 120 as the sound collection beam audio signal MB. The sound collection beam selection unit 119 also detects the direction Dir corresponding to the selected sound collection beam audio signal MB and passes it to the control unit 110. The input/output I/F 112 converts the sound collection beam audio signal MB from the echo cancellation unit 120 into a network-format audio signal of a predetermined data length, attaches the direction data and sound collection time data obtained from the control unit 110, and outputs it to the network 10.
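Selecting the strongest beam can be sketched as a simple maximum over the candidate beams. The patent does not define how signal strength is measured; this example assumes mean power over the current block of samples, and the function name and dictionary representation are hypothetical.

```python
def select_beam(beams):
    """Pick the strongest beam.

    beams: dict mapping a direction label (e.g. "Dir11") to its list of samples.
    Returns (direction, samples) for the beam with the highest mean power.
    """
    def power(samples):
        return sum(s * s for s in samples) / len(samples)

    direction = max(beams, key=lambda d: power(beams[d]))
    return direction, beams[direction]
```

The returned direction label plays the role of the direction data that the control unit attaches to the outgoing audio signal.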

Thus, in the configuration shown in FIG. 9, the audio conference apparatus 61 (62) transmits the audio signal input from the microphone array MA1 or the microphone array MA2 to the server device 13 via the network 10. The voice processing device 42 built into the server device 13 performs the enhancement processing, and the enhanced audio signal is transmitted to the audio conference apparatus 62 (61) via the network 10. On receiving the enhanced audio signal, the audio conference apparatus 62 (61) emits the sound from the speaker array SA1 as a beam directed at each speaker.

Similarly, in the voice communication system 2 shown in FIG. 2(B) and FIG. 5, audio conference apparatuses 64 and 65 can be used as the voice communication devices in place of the telephones 14 and 15. In this case, however, as with the telephones 14 and 15, the audio conference apparatuses 64 and 65 must be provided with the voice processing device 42 so that enhancement processing is applied to the received audio signal. Therefore, as shown in FIG. 8, the voice processing device 42 is placed between the echo cancellation unit 120 and the sound emission directivity control unit 113 in the audio conference apparatuses 64 and 65.

In this configuration, the audio conference apparatus 64 (65) transmits the audio signal input from the microphone array MA1 or the microphone array MA2 to the audio conference apparatus 65 (64) via the network 10. On receiving the audio signal, the audio conference apparatus 65 (64) performs the enhancement processing in its built-in voice processing device 42 and emits the sound from the speaker array SA1 as a beam directed at each speaker.

Likewise, in the voice communication system 3 shown in FIG. 2(C) and FIG. 6, audio conference apparatuses 66 and 67 can be used as the voice communication devices in place of the telephones 16 and 17. In this case, however, as with the telephones 16 and 17, the audio conference apparatuses 66 and 67 must be provided with the voice processing device 42 so that enhancement processing is applied to the collected audio signal. Therefore, as shown in FIG. 8, the voice processing device 42 is placed between the sound collection beam selection unit 119 and the echo cancellation unit 120 in the audio conference apparatus 66 (67).

In this configuration, the audio conference apparatus 66 (67) applies enhancement processing, in its built-in voice processing device 42, to the audio signal collected by the microphone array MA1 or the microphone array MA2, and transmits it to the audio conference apparatus 67 (66) via the network 10. On receiving the audio signal, the audio conference apparatus 67 (66) emits the sound from the speaker array SA1 as a beam directed at each speaker.

As described above, in the voice communication system of the present invention, the voice processing device 42 is provided in the server device or in the voice communication devices (audio conference apparatuses), so the keyword portions the speaker wants to convey can be extracted and enhanced, and the listener can reliably grasp what the speaker wants to convey. In addition, the voice processing device and voice communication system of the present invention need no keyword dictionary and can extract keywords without difficulty in any language, so no keyword registration work is required and they can be used in any country.

The above description covers an example in which the voice processing device of the present invention is applied to a voice communication system, but the present invention is not limited to this; other configurations are possible as long as the system or device exchanges audio signals to carry on a conversation. For example, by connecting a small speaker unit and a small microphone unit to the voice processing device, applying enhancement processing to the audio signal collected by the microphone unit, and emitting the resulting sound from the speaker unit, the device can be used as a hearing aid.

In the above description, the voice processing device analyzes speech rate and volume as the acoustic feature amounts of the audio signal, but the present invention is not limited to this. For example, the device may analyze speech rate and pitch, or volume and pitch, and emphasize those feature amounts. It may also analyze all three feature amounts (speech rate, volume, and pitch) and emphasize all of them, or analyze all three and emphasize any one of them.
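The choice of which feature amounts to analyze and which to emphasize can be sketched as two separate steps. This is an illustrative sketch under assumptions, not the method of the embodiment: the per-frame feature values are taken as given (how speech rate, volume and pitch are measured is outside the sketch), and the function names `analyze` and `emphasize`, the median baseline, and the `margin` and `gain` parameters are hypothetical. The rule applied is the general one stated for the device: a feature amount that is relatively large is increased further, and one that is relatively small is decreased further.

```python
from statistics import median


def analyze(frames, features=("rate", "volume", "pitch"), margin=0.2):
    """Mark each frame per analyzed feature.

    frames: list of dicts mapping feature name to a numeric value.
    Returns {feature: [+1 | 0 | -1 per frame]}, where +1 means relatively
    large and -1 relatively small compared with the median (assumed baseline).
    """
    marks = {}
    for name in features:
        base = median(f[name] for f in frames)
        hi, lo = base * (1 + margin), base * (1 - margin)
        marks[name] = [1 if f[name] > hi else -1 if f[name] < lo else 0
                       for f in frames]
    return marks


def emphasize(frames, marks, targets, gain=2.0):
    """Return target feature values with the marked deviations exaggerated.

    Only the features named in `targets` are changed, so all three can be
    analyzed while any subset of them is emphasized.
    """
    out = [dict(f) for f in frames]  # copy; the input is left untouched
    for name in targets:
        for frame, mark in zip(out, marks[name]):
            if mark > 0:
                frame[name] *= gain  # relatively large: increase further
            elif mark < 0:
                frame[name] /= gain  # relatively small: decrease further
    return out
```

Calling `analyze` with all three features and then `emphasize` with `targets=("volume",)` corresponds to analyzing the three feature amounts and emphasizing one of them; passing `targets=("rate", "volume")` corresponds to the speech rate and volume combination described for the embodiment.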

Conceptual diagram for explaining the processing performed by the voice processing device of the present invention.
Configuration diagram showing an outline of the voice communication system.
Block diagram showing the configuration of the voice communication system.
Flowchart for explaining the operation of the server device.
Block diagram showing a configuration of the voice communication system different from that of FIG. 3.
Block diagram showing a configuration of the voice communication system different from those of FIGS. 3 and 5.
External view of the audio conference device.
Functional block diagram of the audio conference device.
Configuration diagram showing the case where audio conference devices are used as the voice communication devices of the voice communication system shown in FIG. 2(A).

Explanation of symbols

1, 2, 3: voice communication system
10: network
11J, 12J, 14J, 15J, 16J, 17J: receiver
11, 12, 14, 15, 16, 17: telephone (voice communication device)
13: server device
21: communication unit
22: D/A converter
23: sound emission amplifier
24: sound pickup amplifier
25: A/D converter
26: operation unit
27: display unit
28, 43, 110: control unit
41: communication unit
42: voice processing device
51: intonation analysis unit
52: feature extraction unit
53: emphasis region determination unit
54: emphasis processing unit
61, 62, 64, 65, 66, 67: audio conference device
72: housing
74: operation unit
77: display unit
81: input/output connector
112: input/output I/F
113: sound emission directivity control unit
114: D/A converter
115: sound emission amplifier
116: sound pickup amplifier
117: A/D converter
118: sound pickup beam generation unit
119: sound pickup beam selection unit
120: echo cancellation unit
511: speech rate analysis unit
512: volume analysis unit
521: low-speed region extraction unit
522: high-volume region extraction unit

Claims

A voice processing device comprising:
feature amount extraction means for analyzing an acoustic feature amount of an input audio signal and extracting a region in which the acoustic feature amount is relatively large or relatively small compared with other regions; and
emphasis means for applying, to the region of the audio signal extracted by the feature amount extraction means, emphasis processing that further increases the feature amount when it is relatively large and further decreases the feature amount when it is relatively small, and for outputting the result.

The voice processing device according to claim 1, wherein the feature amount extraction means includes low-speed region extraction means for analyzing the speech rate of the audio signal and extracting a region in which the speech rate is relatively low compared with other regions, and
the emphasis means applies, to the region extracted by the low-speed region extraction means, emphasis processing that further lowers the speech rate, and outputs the result.

The voice processing device according to claim 1, wherein the feature amount extraction means includes high-volume region extraction means for analyzing the volume of the audio signal and extracting a region in which the volume is relatively high compared with other regions, and
the emphasis means applies, to the region extracted by the high-volume region extraction means, emphasis processing that further raises the volume, and outputs the result.

A voice communication system in which a server device and a plurality of voice communication devices are connected to one another, wherein
each voice communication device comprises:
a microphone unit;
a speaker unit;
communication means for transmitting an audio signal picked up by the microphone unit to another voice communication device via the server device, and for receiving an audio signal transmitted by the server device; and
speaker control means for inputting the audio signal received by the communication means to the speaker unit, and
the server device comprises:
the voice processing device according to any one of claims 1 to 3; and
server communication means for receiving an audio signal transmitted from one voice communication device to another voice communication device, inputting the audio signal to the voice processing device, and transmitting the audio signal output by the voice processing device to the other voice communication device.

A voice communication system in which a plurality of voice communication devices are connected to one another, wherein
each voice communication device comprises:
the voice processing device according to any one of claims 1 to 3;
a microphone unit;
a speaker unit;
communication means for transmitting an audio signal picked up by the microphone unit to another voice communication device, receiving an audio signal transmitted by another voice communication device, and inputting the received audio signal to the voice processing device; and
speaker control means for inputting the audio signal output by the voice processing device to the speaker unit.

A voice communication system in which a plurality of voice communication devices are connected to one another, wherein
each voice communication device comprises:
the voice processing device according to any one of claims 1 to 3;
a microphone unit that picks up sound and inputs the resulting audio signal to the voice processing device;
a speaker unit;
communication means for transmitting the audio signal output by the voice processing device to another voice communication device, and for receiving an audio signal transmitted by another voice communication device; and
speaker control means for inputting the audio signal received by the communication means to the speaker unit.