JP2008048342A

JP2008048342A - Sound acquisition apparatus

Info

Publication number: JP2008048342A
Application number: JP2006224405A
Authority: JP
Inventors: 拓弥 ▲高▼橋; Takuya Takahashi
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-08-21
Filing date: 2006-08-21
Publication date: 2008-02-28

Abstract

PROBLEM TO BE SOLVED: To provide a sound acquisition apparatus that reliably converts the speaking speed of only the speech of a speaker located at an arbitrary position around the apparatus, without converting the speaking speed of background sound. SOLUTION: A speech signal processing part 4 imparts prescribed delays to the speech signals acquired by the microphones 2 and forms sound acquisition beams around the microphones 2. A controller 8 generates information indicating the area where the speaker is located (speaker position information) based on the area corresponding to the sound acquisition beam of the highest level, and outputs the information to a storage part 3 to be recorded. The sound acquisition beam corresponding to the speaker position information is output to a speaking speed conversion part 5 as a speaker speech signal, and the other sound acquisition beams are output to a mixer 6 as background speech signals. Thus, the speech can be emitted and recorded with the speaking speed of only the speaker's speech converted and the speaking speed of the background sound left unconverted. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a sound collection device that is used in conferences and the like and collects the speech of conference participants.

Conventionally, apparatuses have been proposed that make speech easier to follow by expanding an input audio signal along the time axis, thereby converting the speech speed. However, when the input audio signal is expanded, sounds other than the speaker's voice (for example, BGM) are expanded at the same time. The BGM is also expanded even when no speaker's voice is being input. If the listener is hearing the BGM at the same time as (in parallel with) the speaker's voice, expanding the BGM as well means that the atmosphere of the original music can no longer be perceived.

In view of this, an apparatus has been proposed that analyzes the input audio signal and performs speech speed conversion only when the signal is determined to be a speaker's voice (see, for example, Patent Document 1).

In addition, an apparatus has been proposed in which a plurality of microphones are installed, sound collected from a point equidistant from every microphone (and therefore in phase) is treated as speech, and all other collected sound is separated as background sound (see, for example, Patent Document 2).

There has also been proposed an apparatus configured to handle speech and background sound on a plurality of independent channels and to perform speech speed conversion only on the speech channel (see, for example, Patent Document 3).

Patent Document 1: JP 2000-152394 A
Patent Document 2: JP 2005-208173 A
Patent Document 3: JP 2004-244081 A

However, the apparatus of Patent Document 1 has the problem that background sound collected at the same time as the speech undergoes speech speed conversion together with the speech.

The apparatus of Patent Document 2 can process as speech only sound arriving from the point equidistant from every microphone, so when a speaker is located anywhere else, that speaker's voice cannot be speech-speed converted.

In the apparatus of Patent Document 3, speech and background sound must be recorded on separate channels, so the speaker has to speak into the microphone assigned to a specific channel.

An object of the present invention is to provide a sound collection device that, from the collected sound, accurately converts the speech speed of only the voice of a speaker located at an arbitrary position around the device and does not convert the speech speed of background sound.

The sound collection device of the present invention comprises: a microphone array in which a plurality of microphones are arranged; a sound collection control unit that forms sound collection beams toward a plurality of user directions and identifies the speaker direction by comparing the intensities of the sound collection beams; audio signal selection means that selects the sound collection beam of the speaker direction as a speech audio signal and selects a sound collection beam other than that of the speaker direction as a background audio signal; speech speed conversion means that converts the speech speed of the speech audio signal; and a mixer that mixes the speech audio signal converted by the speech speed conversion means with the background audio signal selected by the audio signal selection means.

In the present invention, a predetermined delay is applied to the audio signal collected by each microphone, and a plurality of sound collection beams, each with strong directivity in a specific direction, are formed. The speaker direction is identified by comparing the levels of these beams; for example, the direction of the beam with the highest level is taken as the speaker direction. The beam in the speaker direction is treated as the speaker audio signal, speech-speed converted, and then output to the mixer, while the beams in the other directions are output to the mixer as they are, without speech speed conversion.

Further, in the sound collection device of the present invention, when a sound collection beam at or above a predetermined level exists in a direction other than that of the beam selected as the speech audio signal, the audio signal selection means selects only the beam in that direction as the background audio signal.

In the present invention, when a high-level sound collection beam exists in a direction other than the direction in which the speaker was determined to be present, a source of background sound is assumed to exist in that direction, and the beam in that direction is output to the mixer as the background audio signal. The background sound can thus also be collected accurately.

Further, in the sound collection device of the present invention, the audio signal selection means inputs to the speech speed conversion means, as the speech audio signal, the difference signal between the sound collection beam selected as the speech audio signal and a sound collection beam in a direction adjacent to it.

In the present invention, the sound collection beam in the adjacent direction is subtracted from the beam selected as the speaker audio signal. This lowers the level of background sound contained in the selected beam, so that only the speaker's voice is speech-speed converted more accurately.

Further, the sound collection device of the present invention further comprises speech signal extraction means that extracts a speech audio signal from the plurality of sound collection beams formed by the sound collection control unit, and the sound collection control unit determines as the speaker direction the direction of the beam that has the highest level among the plurality of beams and from which the speech signal extraction means has extracted a speech audio signal.

In the present invention, a speech audio signal is extracted from each sound collection beam. For example, an audio feature of the beam is extracted and compared with a pre-stored audio feature of speech; if they match, the beam is presumed to contain speech. The sound collection control unit selects as the speaker audio signal the beam that has the highest level and contains an audio signal presumed to be speech, so that only the speaker's voice is speech-speed converted more accurately.

According to the present invention, the direction of the speaker is determined from the sound collection beams formed by the microphone array, speech speed conversion is applied only to the beam in the speaker's direction, and the beams in the other directions are output unchanged. Sound can thus be collected with only the speaker's voice accurately speech-speed converted and the background sound left unconverted.

A sound emission and collection device according to an embodiment of the present invention will be described with reference to the drawings. This device is used in conferences as a loudspeaker, a recorder, and the like. FIG. 1 is a block diagram showing the configuration of the sound emission and collection device. As shown in the figure, the device includes a speaker 1, a plurality of microphones 2A to 2M, a storage unit 3, an audio signal processing unit 4, a speech speed conversion unit 5, a mixer 6, a recording/playback unit 7, a controller 8, and an input/output I/F 9.

The microphones 2A to 2M are arranged at regular intervals in a straight line (or in a matrix or honeycomb pattern) to form a microphone array. Each microphone 2 is typically a dynamic microphone, but other types such as condenser microphones may be used. The number of microphones and their spacing are set as appropriate for the environment in which the device is installed, the required frequency band, and so on.

When a sound is produced somewhere around the microphones 2A to 2M, each microphone 2 picks it up and outputs an audio signal derived from the collected sound to the audio signal processing unit 4. In FIG. 1, front-end amplifiers and the A/D converters that convert analog audio signals into digital audio signals are omitted. The audio signals output from the microphones 2 are combined in the audio signal processing unit 4 and output to the speech speed conversion unit 5 or the mixer 6. The audio signal processing unit 4 selectively outputs the audio signals from the microphones 2 in accordance with instructions from the controller 8. Because sound reaches each microphone 2 after a propagation time that depends on the distance between that microphone and the sound source, the microphones 2 pick up the sound at different times.

Here, for example, if a sound wave arrives at all the microphones 2 from the front at the same time, the audio signals output from the microphones 2 reinforce one another when combined. When a sound wave arrives from any other direction, the audio signals output from the microphones 2 differ in phase and therefore weaken one another when combined. The sensitivity of the microphone array is thus narrowed into a beam, forming its main sensitivity (a sound collection beam) only toward the front.

The audio signal processing unit 4 can steer the sound collection beam obliquely by applying a predetermined delay time to the audio signal output from each microphone 2. To tilt the beam, the delays are set so that, starting from the microphone 2 at one end, the audio signal of each successive neighboring microphone 2 is output a fixed time later. For example, when the sound source is in front of one end of the microphone array, the sound wave reaches the end nearest the source first and the opposite end last; the audio signal processing unit 4 applies delay times to the audio signals of the microphones 2 so as to compensate for this difference in propagation time, and then combines them. The audio signals arriving from this direction are thereby reinforced by the combination. In this way, sequentially delaying the audio signals output from the microphones 2 in a row, from one end to the other, tilts the sound collection beam according to the delay time.
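The steering described above can be sketched in Python as a delay-and-sum beamformer. This is an illustrative sketch rather than the implementation of the embodiment: the function names `steering_delays` and `delay_and_sum`, the far-field (plane-wave) model, the 343 m/s speed of sound, and the rounding of delays to whole samples are all assumptions introduced here.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def steering_delays(num_mics, spacing_m, angle_deg):
    """Delays (s) that align a plane wave arriving from angle_deg.

    Hypothetical helper: the angle is measured from the array's front
    direction, and positive angles point toward the last microphone.
    """
    step = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    raw = [i * step for i in range(num_mics)]
    earliest = min(raw)
    # The microphone reached first gets the largest delay; shift so all are >= 0.
    return [d - earliest for d in raw]


def delay_and_sum(channels, delays_s, sample_rate):
    """Delay each channel by a whole number of samples, then average."""
    length = len(channels[0])
    out = [0.0] * length
    for samples, delay in zip(channels, delays_s):
        shift = round(delay * sample_rate)
        for n in range(shift, length):
            out[n] += samples[n - shift]
    return [v / len(channels) for v in out]


# Four microphones 3.43 cm apart, sampled at 10 kHz: a wave from the side
# (90 degrees) reaches each successive microphone one sample earlier.
channels = [[0.0] * 12 for _ in range(4)]
for mic, arrival in enumerate([8, 7, 6, 5]):
    channels[mic][arrival] = 1.0

steered = delay_and_sum(channels, steering_delays(4, 0.0343, 90.0), 10_000)
front = delay_and_sum(channels, steering_delays(4, 0.0343, 0.0), 10_000)
```

With the beam steered toward the source, the four impulses line up and add coherently in `steered`; with the beam left pointing to the front, they stay spread over four samples in `front` and the peak is correspondingly lower.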

A plurality of such sound collection beams can also be formed simultaneously. FIG. 2 is a block diagram showing the configuration of the main part of the audio signal processing unit 4 that is connected to the microphones 2. The microphones 2A to 2M are connected to digital filters 41A to 41M of the audio signal processing unit 4, respectively. The sound collected by the microphones 2A to 2M is input to the digital filters 41A to 41M as digital audio signals. FIG. 2 shows a detailed block diagram only for the digital filter 41A, but the other digital filters 41B to 41M have the same structure and operate in the same way.

The digital filter 41A includes a delay buffer 42A with multi-stage outputs. The delay of each stage of the delay buffer 42A is set according to the arrangement of the microphones 2 in the microphone array and the regions in front of the array (the regions in which a speaker is to be detected). In this example the delay buffer 42A has four output stages, and its output signals are input to FIR filters 431A to 434A.

The delay buffer 42A buffers, at its respective stages, versions of the audio signal output from the microphone 2A with different delay times applied, and outputs each delayed audio signal to the FIR filters 431A to 434A. The delayed audio signals output to the FIR filters 431A to 434A correspond to the respective regions in front of the microphone array. FIG. 3 shows an example of the sound source direction detection method. FIG. 3(A) shows the positional relationship between the sound sources and the microphones together with the delay with which sound from each source is picked up by each microphone, and FIGS. 3(B) and 3(C) illustrate how the delay correction amounts are derived from the delays of the collected audio signals.

As shown in the figure, this sound emission and collection device sets four partial regions 101 to 104 in front of the microphone array. Sound generated in the partial region 101 is picked up first by the nearest microphone 2A. It is then picked up by each microphone in turn, in order of distance from the partial region 101, and last by the farthest microphone (microphone 2L in the figure). Conversely, sound generated in the partial region 104 is picked up first by the nearest microphone 2L, then by each microphone in turn in order of distance from the partial region 104, and last by the farthest microphone 2A. In this way, sound generated in each region is picked up with a delay time that depends on its distance from each microphone.

For the partial region 101, the audio signals collected by the microphones 2A to 2L are delayed as shown in FIG. 3(B); that is, delay correction amounts are set so as to compensate for the delays shown in FIG. 3(A). For the partial region 104, the audio signals collected by the microphones 2A to 2L are delayed as shown in FIG. 3(C).

A delayed audio signal for forming the sound collection beam corresponding to the partial region 101 is generated in the delay buffer 42A and output to the FIR filter 431A. Likewise, a delayed audio signal for forming the beam corresponding to the partial region 102 is output to the FIR filter 432A, one for the partial region 103 is output to the FIR filter 433A, and one for the partial region 104 is output to the FIR filter 434A. The delay amounts of these delayed audio signals are set according to the distance between the microphone 2 and each region, as shown in FIG. 3. For example, the delayed audio signal for the partial region 101 has a large delay because the microphone 2A is close to the partial region 101, whereas the delayed audio signal for the partial region 104 has a small delay because the microphone 2A is farthest from the partial region 104.
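The relationship between distance and delay correction can be illustrated with a short Python sketch. It assumes that each partial region is represented by a single focus point and that coordinates are in metres; the helper name `focus_delays`, the coordinates, and the speed of sound are hypothetical values chosen for illustration, not figures from the embodiment.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value


def focus_delays(mic_positions, region_center):
    """Delay (s) per microphone so that sound from region_center lines up.

    The microphone nearest the region hears the sound first, so it
    receives the largest delay; the farthest microphone receives none.
    """
    distances = [math.dist(mic, region_center) for mic in mic_positions]
    farthest = max(distances)
    return [(farthest - d) / SPEED_OF_SOUND for d in distances]


# Three microphones on a line, and a region in front of the first one.
mics = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
delays_near_first = focus_delays(mics, (0.0, 1.0))
delays_near_last = focus_delays(mics, (2.0, 1.0))
```

For the region in front of the first microphone the delays decrease along the array, and for the region in front of the last microphone they increase, mirroring the contrast between the partial regions 101 and 104.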

In FIG. 2, the FIR filters 431A to 434A all have the same configuration; each filters the delayed audio signal input to it and outputs the result. The FIR filters 431A to 434A can set fine delay times that the delay buffer 42A cannot realize. That is, by setting the sampling period and the number of taps of the FIR filter to the desired values, the fractional part of a delay time can be realized when, for example, the delay buffer 42A provides the integer part of the delay in units of its sampling period.
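One common way to realize such a fractional delay is a windowed-sinc FIR filter. The text does not say how the FIR coefficients are designed, so the following Python sketch is an assumption throughout: the helper names `split_delay` and `fractional_delay_taps`, the Hann-shaped window, and the default of 21 taps are illustrative choices only.

```python
import math


def split_delay(delay_samples):
    """Split a delay into whole samples (delay buffer) and a fraction (FIR)."""
    whole = math.floor(delay_samples)
    return whole, delay_samples - whole


def fractional_delay_taps(fraction, num_taps=21):
    """Windowed-sinc taps that delay by (num_taps - 1) // 2 + fraction samples.

    The fixed (num_taps - 1) // 2 part is the filter's own latency and
    would have to be absorbed into the integer delay of the buffer.
    """
    center = (num_taps - 1) // 2
    taps = []
    for k in range(num_taps):
        t = k - center - fraction
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        # Hann-shaped window centred on the delayed peak (assumed design).
        window = 0.5 + 0.5 * math.cos(2.0 * math.pi * t / num_taps)
        taps.append(sinc * window)
    gain = sum(taps)
    return [h / gain for h in taps]  # normalise to unity gain at DC


whole, fraction = split_delay(3.25)
half_sample = fractional_delay_taps(0.5)
no_fraction = fractional_delay_taps(0.0)
```

With a fraction of zero the taps reduce to a single unit impulse at the centre, so the filter only contributes its fixed latency; with a fraction of 0.5 the two centre taps are equal, which places the peak halfway between two samples.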

The delayed audio signals output from the FIR filters 431A to 434A are amplified by the respective amplifiers 441A to 444A and input to adders 45A to 45D. The other digital filters 41B to 41M have the same configuration as the digital filter 41A, and each outputs delayed audio signals to the adders 45A to 45D according to its own preset delay conditions.

The adder 45A combines the delayed audio signals input from the digital filters 41A to 41M and generates the sound collection beam corresponding to the partial region 101 in FIG. 3. Similarly, the adder 45B combines the delayed audio signals input from the digital filters 41A to 41M to generate the beam corresponding to the partial region 102, the adder 45C generates the beam corresponding to the partial region 103, and the adder 45D generates the beam corresponding to the partial region 104.

The sound collection beams output from the adders 45A to 45D are passed to a band-pass filter (BPF) 46. The BPF 46 filters each beam and outputs the beam in a predetermined frequency band to a level determination unit 47. Because the frequency band in which a beam is formed depends on the width of the microphone array and the spacing of the microphones 2, the pass band of the BPF 46 is set to the frequency band of the sound that each beam is intended to collect. For example, if the sound to be collected is a speaker's voice, the pass band may be set to the frequency band of the human voice.

The level determination unit 47 outputs information indicating the level of each sound collection beam to the controller 8. The controller 8 compares the input beam levels and selects the beam with the highest level. A high beam level indicates that a sound source (speaker) is present in the region corresponding to that beam, so the region in which the speaker is located, among the four regions shown in FIG. 3, can be detected.

The controller 8 generates information indicating the region in which the speaker is located (hereinafter, speaker position information) based on the region corresponding to the sound collection beam with the highest level. When the level (absolute level) of the highest-level beam is below a predetermined threshold (for example, the level of typical speech), the controller 8 may judge that no speaker is present and generate no speaker position information.
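The selection rule can be sketched in a few lines of Python. The sketch assumes that beam levels are measured as RMS values and that region indices start at zero; the function names `rms_level` and `locate_speaker` and the threshold value are hypothetical.

```python
import math


def rms_level(samples):
    """Root-mean-square level of one beam (assumed level measure)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def locate_speaker(beam_levels, speech_threshold):
    """Index of the loudest beam, or None when it is below the threshold."""
    loudest = max(range(len(beam_levels)), key=lambda i: beam_levels[i])
    if beam_levels[loudest] < speech_threshold:
        return None  # no speaker: generate no speaker position information
    return loudest


speaker_region = locate_speaker([0.1, 0.7, 0.2, 0.05], speech_threshold=0.3)
silent_room = locate_speaker([0.1, 0.2, 0.1, 0.05], speech_threshold=0.3)
```

Here `speaker_region` is the index of the second beam, while `silent_room` is `None` because no beam reaches the assumed speech threshold.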

Based on the generated speaker position information, the controller 8 sets the signal selection unit 48 to select the sound collection beam corresponding to the speaker position information and output it to the speech speed conversion unit 5 as the speaker audio signal. The controller 8 also sets the signal selection unit 48 to select one of the beams corresponding to directions other than the region indicated by the speaker position information and output it to the mixer 6 as the background audio signal. The controller 8 may instead set the signal selection unit 48 to select several beams corresponding to directions other than the region indicated by the speaker position information, combine them, and output the result to the mixer 6. All of the beams corresponding to directions other than the region indicated by the speaker position information may of course be combined and output to the mixer 6.

Depending on the levels of the sound collection beams, the following two patterns are possible for the speaker audio signal and the background audio signal that are output.
(1) When the background sound is a point source
In this case, one of the beams corresponding to directions other than the region indicated by the speaker position information shows a high level. Accordingly, when the controller 8 compares the beam levels and detects a level at or above a predetermined value (but below the predetermined threshold mentioned above) in one of the beams corresponding to directions other than the region indicated by the speaker position information, it sets the signal selection unit 48 to output the beam in that direction as the background audio signal.
(2) When the background sound is non-localized
In this case, several of the beams corresponding to directions other than the region indicated by the speaker position information show a high level. Accordingly, when the controller 8 compares the beam levels and detects a level at or above the predetermined value (but below the predetermined threshold mentioned above) in at least a predetermined number of those beams (for example, a majority), it sets the signal selection unit 48 to output the highest-level beam among them as the background audio signal. Because the beam corresponding to the speaker position information then also contains a component of this background sound, the controller 8 sets the signal selection unit 48 to output, as the speaker audio signal, the difference between the beam corresponding to the speaker position information and an adjacent beam.
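The two patterns can be expressed as a small decision function. The Python sketch below is one possible reading of the rules, under several assumptions: the names `choose_background` and `difference_signal` are hypothetical, a "majority" is taken to mean more than half of the non-speaker beams, the adjacent beam is chosen arbitrarily as the lower-indexed neighbor when one exists, and a third "unclear" outcome is added for cases the text does not cover.

```python
def choose_background(levels, speaker, background_floor, speech_threshold):
    """Return (pattern, background_beam, beam_to_subtract_from_speaker)."""
    others = [i for i in range(len(levels)) if i != speaker]
    loud = [i for i in others
            if background_floor <= levels[i] < speech_threshold]
    if len(loud) == 1:
        # Pattern (1): a single loud direction, treated as a point source.
        return "point", loud[0], None
    if len(loud) * 2 > len(others):
        # Pattern (2): most directions are loud, so the sound is non-localized.
        background = max(loud, key=lambda i: levels[i])
        neighbor = speaker - 1 if speaker > 0 else speaker + 1
        return "diffuse", background, neighbor
    # Not covered by the two patterns: fall back to the loudest other beam.
    return "unclear", max(others, key=lambda i: levels[i]), None


def difference_signal(speaker_beam, adjacent_beam):
    """Subtract the adjacent beam from the speaker beam, sample by sample."""
    return [s - a for s, a in zip(speaker_beam, adjacent_beam)]


point_case = choose_background([0.9, 0.1, 0.5, 0.1], 0, 0.3, 0.8)
diffuse_case = choose_background([0.9, 0.4, 0.5, 0.45], 0, 0.3, 0.8)
cleaned = difference_signal([1.0, 0.5], [0.25, 0.5])
```

In the diffuse case the third element of the result names the beam to pass to `difference_signal` together with the speaker beam before speech speed conversion.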

In this way, the audio signal processing unit 4 can separate the speaker's voice from other sound and output them to the subsequent stages.

Although FIG. 2 has been described with an example in which four partial regions 101 to 104 are set in front of the microphone array and a sound collection beam is formed for each region, beams can be formed for a larger number of regions by increasing the number of output stages of the delay buffer 42 shown in FIG. 2 and providing as many FIR filters, amplifiers, and adders as there are output stages. Furthermore, by arranging two rows of microphone arrays back to back and connecting the audio signal processing unit shown in FIG. 2 to each row, sound collection beams can be formed toward the front of each array, and thus on both sides of the arrays (that is, over roughly 360 degrees).

The controller 8 may also extract audio features from each sound collection beam and distinguish speech from musical sound (which includes, for example, singing). The audio features typically represent the speaker's formants, pitch, and the like, and are extracted from the frequency spectrum (power spectrum) obtained by Fourier transforming the audio data and from the cepstrum obtained by taking the logarithm of the power spectrum and applying an inverse Fourier transform. Audio features of speech and of musical sound are recorded in the storage unit 3 in advance; a beam whose features match those of speech is selected as the speaker audio signal, and a beam whose features match those of musical sound is selected as the background audio signal. When several high-level beams exist, the audio features of each are analyzed and the one matching the features of speech is determined to be the speaker's beam.
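The cepstrum computation named above (Fourier transform, logarithm of the power spectrum, inverse Fourier transform) can be sketched with the Python standard library. This is a minimal illustration only: it uses a direct O(N^2) DFT for clarity where a real system would use an FFT, and the function names `dft` and `power_cepstrum` and the small constant `eps` that guards the logarithm are assumptions.

```python
import cmath
import math


def dft(values, inverse=False):
    """Direct discrete Fourier transform (slow, for illustration only)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def power_cepstrum(frame, eps=1e-12):
    """Inverse DFT of the log power spectrum of one audio frame."""
    spectrum = dft(frame)
    log_power = [math.log(abs(c) ** 2 + eps) for c in spectrum]
    return [c.real for c in dft(log_power, inverse=True)]


# A scaled impulse has a flat power spectrum, so only the zeroth
# cepstral coefficient is non-zero.
cepstrum = power_cepstrum([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

Low-order cepstral coefficients describe the spectral envelope (and hence formants), while a peak at a higher index reflects the pitch period, which is why the cepstrum is a convenient source for both kinds of feature.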

Alternatively, before the conference, the chairperson or another person may operate the sound emission and collection device and have each participant speak, so that the speaker position information is generated in advance and recorded in the storage unit 3. In this case, during the conference, the controller 8 sets the signal selection unit 48, based on the speaker position information stored in the storage unit 3, to select the sound collection beam corresponding to that information and output it to the speech speed conversion unit 5 as the speaker voice signal. The controller 8 also sets the signal selection unit 48 to select one of the sound collection beams corresponding to directions outside the area indicated by the stored speaker position information and output it to the mixer 6 as the background audio signal.
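The routing described above can be illustrated with a minimal sketch. The function name `route_beams`, the dictionary layout, and the rule of picking the loudest remaining beam as background are assumptions made for illustration; the text only requires that one of the beams outside the speaker's area be chosen.

```python
def route_beams(beam_levels, speaker_area):
    """Choose the speaker beam and one background beam.

    beam_levels: dict mapping an area id (e.g. 101..104) to a beam level.
    speaker_area: area id read from the stored speaker position information.
    Returns (speaker_area, background_area); background_area is None when
    no other beam exists.
    """
    if speaker_area not in beam_levels:
        raise ValueError("speaker area has no matching sound collection beam")
    # All beams pointing outside the speaker's area are background candidates.
    others = {area: level for area, level in beam_levels.items()
              if area != speaker_area}
    # Assumption: use the loudest candidate; the source allows any of them.
    background_area = max(others, key=others.get) if others else None
    return speaker_area, background_area
```

For example, with levels `{101: 0.2, 102: 0.9, 103: 0.5, 104: 0.1}` and a stored speaker area of 102, the beam for area 102 would go to the speech speed conversion unit and the beam for area 103 to the mixer.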

Next, the speech speed conversion unit 5 performs speech speed conversion on the input speaker voice signal in accordance with instructions from the controller 8. The conversion does not simply play the audio back at a lower speed; it is performed as follows. The audio signal is divided into one-period waveforms, a new periodic waveform is synthesized from the sections immediately before and after each periodic waveform, and the synthesized waveform is inserted between the original periodic waveforms. This increases the number of periodic waveforms in the signal and extends the signal along the time axis while preserving its pitch.

FIG. 4(A) is a flowchart showing the procedure of the expansion process, and FIG. 4(B) illustrates the expansion method. In FIG. 4(A), the number of samples in one period at the head of the input audio signal (sampling frequency × 1 / signal frequency) is detected first (S91). Two periodic waveforms, each consisting of one period of sample data, are then taken out. As shown in FIG. 4(B), the first periodic waveform A is multiplied by an attenuating gain coefficient to create a decaying wave, and the second periodic waveform B is multiplied by an increasing gain coefficient to create a rising wave (S92). Adding the two yields a periodic waveform whose shape is intermediate between A and B (S93). As shown in FIG. 5(A), this synthesized waveform is inserted between periodic waveform A and periodic waveform B and output (S94), which produces an acoustically natural time-axis expansion.
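Steps S91 to S94 can be sketched roughly as follows. This is an illustration only: the function names are hypothetical, the signal frequency is assumed to be known already, and linear gain ramps are assumed because the source does not specify the shape of the gain coefficients.

```python
def period_samples(sampling_hz, signal_hz):
    """S91: samples in one period = sampling frequency x 1 / signal frequency."""
    return round(sampling_hz / signal_hz)


def crossfade(a, b):
    """S92 and S93: fade A out, fade B in, and add the two waves.

    Linear gain ramps are an assumption made for this sketch.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("A and B must be non-empty and of equal length")
    blended = []
    for i in range(n):
        rising = i / n            # increasing gain applied to B
        falling = 1.0 - rising    # attenuating gain applied to A
        blended.append(a[i] * falling + b[i] * rising)
    return blended


def expand_pair(a, b):
    """S94: output A, then the blended waveform, then B."""
    return a + crossfade(a, b) + b
```

Two periods of input become three periods of output, and each output period keeps the original period length, which is why the pitch is preserved.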

When the audio data is to be compressed, the synthesized waveform of intermediate shape created in S93 is output in place of periodic waveforms A and B, as shown in FIG. 5(B). This compresses the audio data to 1/2 of its length along the time axis.

The conversion rate can be varied by specifying how often the speech speed conversion is applied. For example, as shown in FIG. 5(C), combining two periodic waveforms in every period and inserting the result between the original waveforms expands the audio data by a factor of 2 along the time axis. As shown in FIG. 5(D), combining two periodic waveforms once every two periods expands it by a factor of 3/2.
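The two cases above are consistent with a simple relation, given here as an illustrative generalization rather than something the source states. If one synthesized waveform is inserted for every k original periodic waveforms (k is a symbol introduced for this note), k periods become k + 1 periods, so the expansion ratio is:

```latex
r(k) = \frac{k + 1}{k}, \qquad r(1) = 2, \qquad r(2) = \frac{3}{2}
```

Only the cases k = 1 and k = 2 appear in the text; other values of k are an extrapolation.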

Speech speed conversion expands only the head portion of a voice section (for example, 700 msec) and outputs the remainder at normal speed, so that the signal is not expanded more than necessary. Alternatively, the head portion may be expanded and the remainder compressed. Voice sections and noise sections can be distinguished by the periodicity of the audio signal. For example, the audio signal is divided into segments of a predetermined length, and a correlation value is calculated by multiplying or subtracting corresponding sample data. As shown in FIG. 6, a segment is judged to be a noise section when this correlation value is below a predetermined threshold and a voice section when it is above the threshold. The correlation value is high for a highly periodic signal such as voice and low for a weakly periodic signal such as noise.
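One way to turn the periodicity test into code is sketched below. It uses the multiplication variant of the correlation. The normalization, the search over a range of lags, the default threshold of 0.5, and all function names are assumptions made for this sketch, not details taken from the source.

```python
import math


def normalized_correlation(x, y):
    """Correlation of two equal-length segments, scaled to the range [-1, 1]."""
    numerator = sum(a * b for a, b in zip(x, y))
    energy = sum(a * a for a in x) * sum(b * b for b in y)
    return numerator / math.sqrt(energy) if energy > 0 else 0.0


def periodicity(signal, min_lag, max_lag):
    """Highest correlation between the signal and a delayed copy of itself."""
    if not 0 < min_lag <= max_lag < len(signal):
        raise ValueError("lags must satisfy 0 < min_lag <= max_lag < len(signal)")
    return max(
        normalized_correlation(signal[:-lag], signal[lag:])
        for lag in range(min_lag, max_lag + 1)
    )


def is_voice_section(signal, min_lag, max_lag, threshold=0.5):
    """Above the threshold: voice section. Otherwise: noise section."""
    return periodicity(signal, min_lag, max_lag) > threshold
```

A sine wave with a period inside the searched lag range scores close to 1, while uncorrelated noise scores near 0, which mirrors the high and low correlation values described above.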

Although this embodiment describes converting the speech speed of the first 700 msec of a voice section, a longer or shorter section may be converted instead. The expansion rate may also be changed within the section being converted. For example, when the section length is 700 msec, the first 600 msec may be expanded by a factor of 2 and the following 100 msec by a factor of 3/2.
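As a worked check of this example, the 700 msec head portion would have the following duration after conversion (the symbol for the output duration is introduced here for illustration):

```latex
T_{\text{out}} = 600\,\text{ms} \times 2 + 100\,\text{ms} \times \tfrac{3}{2}
             = 1200\,\text{ms} + 150\,\text{ms} = 1350\,\text{ms}
```

The overall expansion of the head portion is therefore 1350 / 700, or roughly 1.93 times.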

The speaker voice signal converted by the speech speed conversion unit 5 in this way is input to the mixer 6, where it is mixed with the background audio signal input from the audio signal processing unit 4. The mixed audio signal is input to the recording/playback unit 7. The recording/playback unit 7 supplies the input audio signal to the speaker 1 and the input/output I/F 9, and also converts the audio signal into audio data (for example, compressed data such as MP3) and inputs it to the storage unit 3. The recording/playback unit 7 also reads audio data recorded in the storage unit 3 and supplies an audio signal based on that data to the speaker 1 and the input/output I/F 9.

The speaker 1 emits the audio signal input from the recording/playback unit 7. A cone-type speaker is generally used for the speaker 1, but other types, such as a horn-type speaker, may be used. The D/A converter that converts the digital audio signal into an analog audio signal, the amplifier that amplifies the signal, and similar components are omitted from FIG. 1.

The storage unit 3 records the audio data input from the recording/playback unit 7. As described above, it also records the speaker position information input from the controller 8.

As a result, of the sound collected by the sound emission and collection device, only the speaker's voice undergoes speech speed conversion, while the background sound is emitted or recorded as it is, without conversion.

The input/output I/F 9 supplies the audio signal to other equipment. It has an interface suited to the destination equipment. For example, it converts the audio signal into information suitable for network transmission and outputs it, through a network interface and a network, to another sound emission and collection device. The input/output I/F 9 also receives audio signals from other sound emission and collection devices connected over the network and inputs them to the recording/playback unit 7. The recording/playback unit 7 records both the sound collected by its own device and the sound received from other devices in the storage unit 3.

Although the above embodiment shows a single speaker 1 on the sound emitting side, a plurality of speakers 1 may be arranged in a straight line to form a speaker array. In this case, by successively delaying the audio signals supplied to the individual speakers, the audio beam can be given a focal point, and the sound image can be localized as if the sound were emitted from the position of the person speaking.
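The per-speaker delays can be sketched for a simple geometry. The sketch assumes a linear array lying on the x axis, a focal point given in metres, and a sound speed of 343 m/s; these values, the function name `focus_delays`, and the choice of equalizing arrival times at the focal point are assumptions for illustration and are not specified in the source.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at room temperature


def focus_delays(speaker_xs, focus):
    """Delay in seconds for each speaker so all wavefronts meet at `focus`.

    speaker_xs: x coordinates of the speakers on the array axis (y = 0).
    focus: (x, y) coordinates of the focal point.
    """
    if not speaker_xs:
        raise ValueError("at least one speaker position is required")
    fx, fy = focus
    distances = [math.hypot(fx - x, fy) for x in speaker_xs]
    farthest = max(distances)
    # The farthest speaker fires first (zero delay); nearer speakers wait
    # long enough for their sound to arrive at the same instant.
    return [(farthest - d) / SPEED_OF_SOUND_M_PER_S for d in distances]
```

For three speakers at x = -0.1, 0.0, and 0.1 m and a focal point 1 m in front of the centre, the outer speakers get zero delay and the centre speaker gets a small positive delay.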

When the collected audio signal is output to another device that has a speaker array, outputting the speaker position information along with it allows that device to localize the sound image as if the sound were emitted from the position of the person speaking.

When a plurality of the sound emission and collection devices of the above embodiment are connected over a network, the following application is possible. FIG. 7 shows an example in which a plurality of these devices are connected over a network to form an audio conference system. The system includes sound emission and collection devices 111A to 111C connected through a network 100. Because the devices 111A to 111C have the same configuration and functions as the sound emission and collection device described in the above embodiment, a detailed description of each is omitted.

The sound emission and collection devices 111A to 111C are placed at separate locations a to c: device 111A at location a, device 111B at location b, and device 111C at location c.

At location a, conference participants A and B are seated in directions Dir11 and Dir13, respectively, relative to the sound emission and collection device 111A. At location b, sound source A is located in direction Dir22 relative to device 111B. At location c, conference participants C and D are seated in directions Dir31 and Dir32, respectively, relative to device 111C. Directions Dir11 to Dir14, Dir21 to Dir24, and Dir31 to Dir34 each correspond to the four partial areas 101 to 104 of the above embodiment, and each device collects sound from these directions.

In this audio conference system, each sound emission and collection device transmits the sound it collects to all of the other devices. Each device also records the sound transmitted from the other devices together with the sound it collects itself.

When conference participant A or B speaks, the sound emission and collection device 111A applies speech speed conversion to the voice and then transmits it to the other devices. Likewise, when conference participant C or D speaks, device 111C applies speech speed conversion to the voice and then transmits it to the other devices.

The sound emission and collection device 111B, by contrast, outputs the musical sound produced by sound source A to the other devices without speech speed conversion. It does so even when the level of the musical sound is very high. For example, no conversion is performed even if the level exceeds the predetermined threshold described above (the level of typical speech). Specifically, in FIG. 1, when the controller 8 receives an instruction not to perform speech speed conversion from an operation unit (not shown) or the like, it sets the audio signal processing unit 4 to always output the collected sound to the mixer 6. This device therefore always outputs sound that has not undergone speech speed conversion. In this case, because the controller 8 outputs the sound collection beam with the highest level, it need not evaluate the absolute level of the beam (that is, whether it is at or above the level of typical speech).

The controller 8 may instead set the audio signal processing unit 4 to always output the collected sound to the speech speed conversion unit 5. In that case, the sound emission and collection device always outputs speed-converted sound.

In this way, even when one of the sound emission and collection devices in the audio conference system is used as a device dedicated to background sound output (a device that performs no speech speed conversion), the participants at each location can listen to background sound such as music at normal speed while hearing only the speakers' voices slowed down. In each audio conference device, the background sound is recorded at normal speed, and only the speakers' voices are recorded after speech speed conversion.

FIG. 1 is a block diagram showing the configuration of the sound emission and collection device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the main part of the audio signal processing unit.
FIG. 3 is a diagram showing the sound source detection areas.
FIG. 4 is a diagram showing the speech speed conversion process.
FIG. 5 is a diagram showing the speech speed conversion process when the expansion rate is changed.
FIG. 6 is a diagram showing an example of calculating the correlation value of input audio data.
FIG. 7 is a diagram showing an example in which a plurality of the sound emission and collection devices of the above embodiment are connected over a network to form an audio conference system.

Explanation of symbols

1-speaker
2-microphone
3-storage unit
4-audio signal processing unit
5-speech speed conversion unit
6-mixer
7-recording/playback unit
8-controller

Claims

A sound collecting device comprising:
a microphone array in which a plurality of microphones are arranged;
a sound collection control unit that forms sound collection beams in a plurality of user directions and identifies a speaker orientation by comparing the intensities of the sound collection beams;
voice signal selection means for selecting the sound collection beam in the speaker orientation as a speech voice signal and selecting a sound collection beam other than the one in the speaker orientation as a background voice signal;
speech speed conversion means for converting the speech speed of the speech voice signal; and
a mixer that mixes the speech voice signal converted by the speech speed conversion means with the background voice signal selected by the voice signal selection means.

The sound collecting device according to claim 1, wherein, when a sound collection beam of a predetermined level or higher exists in a direction other than that of the sound collection beam selected as the speech voice signal, the voice signal selection means selects only the sound collection beam in that direction as the background voice signal.

The sound collecting device according to claim 1, wherein the voice signal selection means inputs to the speech speed conversion means, as the speech voice signal, a difference signal between the sound collection beam selected as the speech voice signal and the sound collection beam in a direction adjacent to it.

The sound collecting device according to claim 2 or claim 3, further comprising speech signal extraction means for extracting speech signals from the plurality of sound collection beams formed by the sound collection control unit, wherein the sound collection control unit determines, as the speaker orientation, the direction of the sound collection beam having the highest level among the plurality of sound collection beams and the direction of the sound collection beam from which the speech signal extraction means has extracted a speech signal.