JP2005055667A

JP2005055667A - Audio processing device

Info

Publication number: JP2005055667A
Application number: JP2003286255A
Authority: JP
Inventors: Hideharu Fujiyama; 英春藤山; Akira Masuda; 彰増田; Yoshitaka Abe; 義孝阿部
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-08-04
Filing date: 2003-08-04
Publication date: 2005-03-03

Abstract

PROBLEM TO BE SOLVED: To perform speech recognition accurately even when several attendees speak simultaneously in a conference. SOLUTION: A two-way communication unit 2 receives voice commands through a plurality of directional microphones, each facing a speaker, and identifies the main speaker from the audio data. A speech recognition processing device 3 performs speech recognition on the audio data of the identified speaker to generate character string data. The speech recognition process is controlled by the microphone number, supplied from the two-way communication unit 2, that indicates the main speaker. COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a voice processing device that recognizes and processes voice commands uttered by, for example, a plurality of conference attendees.

Conventionally, voice processing devices with a function for recognizing human speech and converting it into character string data or the like (speech recognition) have assumed, as the voice input means, a microphone connected to a control device such as a telephone or a personal computer (hereinafter, PC). Voice processing devices equipped with such a microphone are applied to, for example, bank balance inquiries by telephone, applications for entering commands or documents into a PC, and voice-command operation of car navigation systems in vehicles.

However, these conventional voice processing devices are designed for speech recognition of a single individual. When they are used for group work by several people, for example in a conference, they erroneously detect and misrecognize the multiple voices and therefore cannot be used.
That is, when two or more participants speak at the same time, their utterances reach the voice processing device through the microphone already mixed together, so it has been impossible to identify the participant who is mainly speaking among the multiple speakers and obtain an accurate speech recognition result.

Furthermore, in an environment where several people use the device at the same time, a voice processing device with a speech recognition function also needs to take each person's operating authority into account.
That is, even if the conference participant who is mainly speaking can be identified from the voices of several people in group work, it is often desirable for the smooth progress of the conference that specific commands be accepted only from participants who have been granted the predetermined authority.

The problems to be solved are as follows. First, when voices uttered by a plurality of conference attendees are recognized and processed, the attendee who is mainly speaking cannot be identified among the multiple speakers, so speech recognition cannot be performed accurately. Second, when the attendee who is mainly speaking is identified among the multiple speakers and speech recognition is performed, the recognition processing cannot take account of the authority of each attendee.

To solve the above problems, a first aspect of the present invention is a voice processing device that inputs and processes voice signals from a plurality of microphones, comprising: microphone selection means for selecting one microphone based on the voice signals; and voice conversion means for converting the voice signal output from the microphone selected by the microphone selection means into character string data.

Preferably, the microphones have unidirectional characteristics.

Also preferably, the voice conversion means has a buffer that holds the voice signal and a character string data memory that stores character string data corresponding to voice signals. The voice conversion means takes the voice signal output from the microphone selected by the microphone selection means into the buffer and, at the timing when the selected microphone changes, collates the buffered voice signal against the character string data memory and converts it into the matching character string data.

A second aspect of the present invention is a voice processing device that inputs and processes voice signals from a plurality of microphones, comprising: microphone selection means for selecting one microphone based on the voice signals; voice conversion means for converting the voice signal output from the microphone selected by the microphone selection means into a command code; voiceprint generation means for generating voiceprint data based on the voice signal output from the microphone selected by the microphone selection means; and command code processing means for processing the command code based on the voiceprint data.

Preferably, the microphones have unidirectional characteristics.

Also preferably, the device further has an authority data memory in which the voiceprint data and command authority data, which indicates whether the command code is valid, are stored in association with each other. The command code processing means refers to the voiceprint data generated by the voiceprint generation means and to the authority data memory, and does not process the command code converted by the voice conversion means if that command code is not valid for the voiceprint data.

According to the first aspect of the present invention, the microphone selection means receives a voice signal from each of a plurality of microphones, selects one microphone based on the received voice signals, and outputs the voice signal from that microphone. The voice conversion means converts the voice signal received from the microphone selection means into character string data.

When the voice conversion means has a buffer that holds the voice signal and a character string data memory that stores character string data corresponding to voice signals, it takes the voice signal output from the microphone selected by the microphone selection means into the buffer and, at the timing when the microphone selected by the microphone selection means changes, collates the buffered voice signal against the character string data memory and converts it into the matching character string data.

According to the second aspect of the present invention, the microphone selection means receives a voice signal from each of a plurality of microphones, selects one microphone based on the received voice signals, and outputs the voice signal from that microphone. The voice conversion means converts the voice signal received from the microphone selection means into the corresponding command code. The voiceprint generation means generates voiceprint data based on the voice signal output from the microphone selection means. The command code processing means processes the command code converted by the voice conversion means based on the voiceprint data generated by the voiceprint generation means.

When the device further has an authority data memory in which the voiceprint data and command authority data, which indicates whether the command code is valid, are stored in association with each other, the command code processing means refers to the voiceprint data generated by the voiceprint generation means and to the authority data memory, and processes the command code converted by the voice conversion means only if that command code is valid for the voiceprint data.
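The authority check described above can be sketched as a simple lookup. This is only an illustration under stated assumptions: the voiceprint is reduced to a hashable identifier (real voiceprint matching is approximate), and the names `process_command`, `authority_memory`, and `execute` are hypothetical and do not appear in the specification.

```python
def process_command(command_code, voiceprint_id, authority_memory, execute):
    """Run `execute(command_code)` only if the speaker is authorized.

    authority_memory: dict mapping a voiceprint identifier to the set of
    command codes that are valid for that speaker (an assumed layout for
    the authority data memory).
    """
    allowed = authority_memory.get(voiceprint_id, set())
    if command_code not in allowed:
        return False  # command is not valid for this voiceprint: do not process
    execute(command_code)
    return True


# Example: only the chairperson may open files; everyone may page forward.
authority_memory = {
    "chair": {"OpenFile", "Next"},
    "guest": {"Next"},
}
executed = []
process_command("OpenFile", "guest", authority_memory, executed.append)
process_command("OpenFile", "chair", authority_memory, executed.append)
process_command("Next", "guest", authority_memory, executed.append)
```

Unknown voiceprints fall through to an empty permission set, so an unregistered speaker cannot trigger any command in this sketch.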

When a voice processing device recognizes and processes voice commands, the present invention identifies the main speaker before processing even if several commands overlap at the input. This has the advantage that speech recognition can be performed accurately in, for example, a conference with several participants.

The first to fourth embodiments described later all use the two-way communication unit 2 described below. For convenience, the configuration and operation of the two-way communication unit 2 are therefore explained first in detail with reference to FIGS. 1 to 4, and each embodiment is described afterward.

Two-way communication unit 2
FIG. 1 is a circuit block diagram of the two-way communication unit 2.
As shown in FIG. 1, the two-way communication unit 2 comprises an A/D converter block 21, a DSP (Digital Signal Processor) 22, a DSP 23, a CPU (Central Processing Unit) 24, a codec 25, a D/A converter block 26 (D/A converters 261 and 262), an A/D converter 263, and an amplifier block 27.

The two-way communication unit 2 receives voice from a plurality of microphones; in the example of FIG. 1, from six unidirectional microphones MC1 to MC6. A unidirectional microphone is strongly directional toward the front of the position where it is placed.

The CPU 24 performs overall control of the two-way communication unit 2.
The codec 25 encodes voice.
The DSP 22 performs various kinds of signal processing described in detail later, such as filter processing and microphone selection processing.
The DSP 23 functions as an echo canceller.
FIG. 1 shows A/D converters 211 to 213 as an example of the A/D converter block 21, A/D converter 263 as an example of an A/D converter, D/A converters 261 and 262 as an example of the D/A converter block 26, and amplifiers 271 and 272 as an example of the amplifier block 27.

The microphones are connected in pairs (MC1 and MC4, MC2 and MC5, MC3 and MC6) to the A/D converters 211 to 213, each of which converts two channels of analog signals into digital signals.
The sound signals collected by microphones MC1 to MC6 and converted by the A/D converters 211 to 213 are input to the DSP 22, where the various kinds of signal processing described later are performed.
As one result of the processing in the DSP 22, one of the microphones MC1 to MC6 is selected. The DSP 22 selects the microphone by exploiting the unidirectional characteristics of the microphones described above.

The processing result of the DSP 22 is output to the DSP 23, which performs echo cancellation. The processing result of the DSP 23 is converted into analog signals by the D/A converters 261 and 262. The output of the D/A converter 261 is encoded by the codec 25 as necessary and output via the amplifier 271.
The output of the D/A converter 262 is output as sound from the speaker 28 of the two-way communication unit 2 via the amplifier 272. Conference participants using the two-way communication unit 2 can thus hear, through the speaker 28, the voice of a person speaking in the same conference room.
The two-way communication unit 2 inputs the other party's voice to the DSP 23 via the A/D converter 263 and performs echo cancellation. The other party's voice is also applied to the speaker 28 via a path not shown and output as sound.

A non-directional microphone picks up all sound around it, so the S/N (signal-to-noise ratio) between the speaker's voice and the ambient noise is poor. To avoid this, the present embodiment collects sound with directional microphones and thereby improves the S/N against ambient noise.

Next, the processing performed by the DSP 22 is described.
The main processing performed by the DSP 22 is microphone selection and switching. When several conference participants using the two-way communication unit 2 speak at the same time, their voices mix and become hard for the other party to follow, so only the voice signal from the selected microphone is output as signal S271 in FIG. 1.
To perform this processing accurately, the various kinds of signal processing listed below are carried out.
(a) Band separation and level conversion of the microphone signals
(b) Determination of the start and end of speech
(c) Detection of the microphone in the speaker's direction
The sound signal collected by each microphone is analyzed to determine which microphone faces the speaker.
(d) Determination of the switching timing for the speaker-direction microphone, and selection switching to the microphone signal facing the detected speaker

Each of the signal processing steps above is described below.
(a) Band separation and level conversion of the microphone signals
The determination of the start and end of speech serves as one of the triggers for starting the microphone selection processing. For this purpose, each microphone signal is subjected to band-pass filter (hereinafter, BPF) processing and level conversion processing.
FIG. 2 shows the BPF processing and level conversion processing for only one channel (CH) of the six microphones MC1 to MC6.
The BPF and level conversion circuit has BPFs 221a to 221f (collectively, BPF block 221), which apply passbands of 100 to 600 Hz, 100 to 250 Hz, 250 to 600 Hz, 600 to 1500 Hz, 1500 to 4000 Hz, and 4000 to 7500 Hz, respectively, to the sound signal collected by the microphone, and level converters 222a to 222g (collectively, level conversion block 222), which level-convert the original collected signal and the band-passed signals.
Each level converter has a signal absolute value processing unit 223 and a peak hold processing unit 224. As illustrated in the waveform diagram, the signal absolute value processing unit 223 inverts the sign of a negative input signal (shown by the broken line) to convert it into a positive signal. The peak hold processing unit 224 then holds the absolute value output by the signal absolute value processing unit 223.
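The level conversion step (rectification followed by peak hold) can be sketched as follows. This is a minimal illustration, not the circuit of FIG. 2: the function name `level_convert` and the `decay` parameter are assumptions. The text does not say how the held peak is released, so a per-sample decay factor is used here; `decay=1.0` gives a pure hold.

```python
def level_convert(samples, decay=1.0):
    """Rectify each sample, then track the held peak of the rectified signal.

    decay is a hypothetical release factor in (0, 1]: the held value is
    multiplied by it on every sample, so 1.0 never releases the peak.
    """
    peak = 0.0
    held = []
    for x in samples:
        rectified = abs(x)                    # signal absolute value processing (unit 223)
        peak = max(rectified, peak * decay)   # peak hold processing (unit 224)
        held.append(peak)
    return held


# A negative sample is folded to positive and then held while it decays.
envelope = level_convert([0.5, -1.0, 0.25], decay=0.5)
```

In a full chain, one such converter would run on the original microphone signal and on each band-pass output.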

(b) Determination of the start and end of speech
The DSP 22 determines that speech has started when the sound pressure level data that has passed through the 100 to 600 Hz BPF and been level-converted by the microphone signal level conversion processing unit 222b illustrated in FIG. 2 rises to or above a predetermined value, and determines that speech has ended when that data stays at or below the predetermined value for a fixed time (for example, 0.5 seconds).
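The start and end rule above amounts to a threshold detector with a hold time. The sketch below assumes the level data arrives as one value per sample; `detect_speech`, `threshold`, and `hold_samples` are hypothetical names, and a level exactly equal to the threshold is treated as speech, which is one possible reading of the text.

```python
def detect_speech(levels, threshold, hold_samples):
    """Return (index, event) pairs, where event is "start" or "end".

    Speech starts when the level reaches the threshold. It ends once the
    level has stayed below the threshold for hold_samples consecutive
    samples (for example, 0.5 s multiplied by the level data rate).
    """
    events = []
    speaking = False
    below = 0  # consecutive samples below the threshold while speaking
    for i, level in enumerate(levels):
        if level >= threshold:
            below = 0
            if not speaking:
                speaking = True
                events.append((i, "start"))
        elif speaking:
            below += 1
            if below >= hold_samples:
                speaking = False
                below = 0
                events.append((i, "end"))
    return events


# The short dip at index 3 does not end the utterance; three quiet samples do.
events = detect_speech([0, 5, 5, 0, 5, 0, 0, 0, 5], threshold=3, hold_samples=3)
```

The hold time keeps brief pauses inside a command from being reported as the end of speech.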

(c) Detection of the microphone in the speaker's direction
The speaker's direction is detected using the characteristics of the unidirectional microphone illustrated in FIG. 3.
As illustrated in FIG. 3, the frequency and level characteristics of a unidirectional microphone change with the angle at which the speaker's voice arrives at the microphone. FIG. 3 shows the result of applying an FFT at fixed time intervals to the sound collected by each microphone, with a loudspeaker placed 1.5 meters from the two-way communication unit 2. The X axis represents frequency, the Y axis time, and the Z axis signal level. The lines drawn at specific frequencies on the XY plane represent the cutoff frequencies of the BPF processing described with reference to FIG. 2, and the level of each frequency band between adjacent lines is the data obtained through BPFs 221b to 221f in FIG. 2.

An appropriate weighting is applied to the output level of the BPF in each of these bands (for example, with 1 dBFS steps, 0 dBFS scores 0 and -3 dBFS scores 3). The step size of this weighting determines the resolution of the processing.
The weighting is executed at every sample clock, the weighted scores of each microphone are added up and averaged over a fixed number of samples, and the microphone signal with the smallest (or largest) total score is judged to be the microphone facing the speaker. Table 1 illustrates this result.
In the example of Table 1, MIC1 has the smallest total score, so the sound source is judged to lie in the direction of microphone 1. The result is held as the sound-source-direction microphone number.
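The scoring scheme can be sketched as below, assuming the band levels are already available in dBFS. The names `band_score` and `select_microphone`, the nested-list input layout, and the sample values are illustrative assumptions (the contents of Table 1 are not reproduced here). The sketch uses the "smallest total wins" variant, so louder bands earn lower scores.

```python
def band_score(level_dbfs, step_db=1.0):
    """Weight a band level: 0 dBFS scores 0, -3 dBFS scores 3 (1 dB steps)."""
    return round(-level_dbfs / step_db)


def select_microphone(frames):
    """Pick the microphone with the smallest average total score.

    frames[t][m] is the list of band levels (dBFS) of microphone m at
    sample t. Returns (1-based microphone number, per-microphone averages).
    """
    n_mics = len(frames[0])
    totals = [0.0] * n_mics
    for frame in frames:
        for mic, bands in enumerate(frame):
            totals[mic] += sum(band_score(level) for level in bands)
    averages = [total / len(frames) for total in totals]
    best = min(range(n_mics), key=averages.__getitem__)
    return best + 1, averages


# Two microphones, two bands each, two samples: microphone 1 is louder.
frames = [
    [[-3, -6], [-10, -12]],
    [[-3, -4], [-9, -15]],
]
mic_number, averages = select_microphone(frames)
```

A coarser `step_db` lowers the resolution of the comparison, which matches the remark that the weighting step sets the processing resolution.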

(d) Determination of the switching timing for the speaker-direction microphone, and selection switching to the microphone signal facing the detected speaker
When one speaker (for example, at microphone MC1) finishes speaking and another person starts speaking from a different direction (for example, at microphone MC2), the previous speaker is judged to have finished once a fixed time (for example, 0.5 seconds) has elapsed after the signal level of the previous speaker's microphone (MC1) fell to or below the predetermined value, as explained in (b) Determination of the start and end of speech.
Then, when the later speaker starts speaking and the signal level of microphone MC2 rises to or above the predetermined value, the microphone facing the later speaker is chosen as the sound collecting microphone and the microphone signal selection switching processing starts.

If, while the previous speaker (microphone MC1) is still speaking, a louder utterance arrives from a different direction (the later speaker, microphone MC2), the microphone switching determination starts once a fixed time (for example, 0.5 seconds) or more has elapsed since the later speaker started speaking (that is, since the signal level of microphone MC2 rose to or above the predetermined value).
The microphone switching determination is performed as follows.
If, before the previous speaker (microphone MC1) has finished, another speaker (microphone MC2) is speaking more loudly than the currently selected speaker, the sound pressure level from microphone MC2 is higher. In (c) Detection of the microphone in the speaker's direction, the scores of MC1 and MC2 in Table 1 are therefore reversed, the sound-source-direction microphone number changes from MC1 to MC2, and the microphone signal selection switching processing is performed at the same time.

As illustrated in FIG. 4, the microphone signal selection switching processing consists of six multipliers and a six-input adder. To select a microphone signal, the channel gain (CH Gain) of the multiplier connected to the desired microphone signal is set to 1 and the channel gains of the other multipliers are set to 0. The adder then sums [selected microphone signal × 1] and [other microphone signals × 0] and outputs the desired microphone selection signal.
When the microphone is switched, the channel gains before and after the switch (for example, CH1 Gain and CH2 Gain) are changed gradually, for example over 10 ms.
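The multiplier-and-adder selector with gradual gain change can be sketched as a crossfade. The linear ramp shape, the function name `mix_with_crossfade`, and 0-based channel indices are assumptions; the text only says the gains change gradually over about 10 ms, which at an assumed 16 kHz sample rate would be `fade_samples=160`.

```python
def mix_with_crossfade(channels, old_mic, new_mic, fade_samples):
    """Sum all channels with per-channel gains while switching microphones.

    channels: list of equal-length sample lists, one per microphone.
    The gain of old_mic ramps linearly from 1 to 0 and the gain of
    new_mic from 0 to 1 over fade_samples samples; all others stay at 0.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        ramp = min(1.0, (t + 1) / fade_samples)
        gains = [0.0] * len(channels)
        gains[old_mic] += 1.0 - ramp   # outgoing microphone fades out
        gains[new_mic] += ramp         # incoming microphone fades in
        # One multiplier per channel feeding a single adder.
        out.append(sum(g * ch[t] for g, ch in zip(gains, channels)))
    return out


channels = [[1.0] * 4, [2.0] * 4, [0.0] * 4]
mixed = mix_with_crossfade(channels, old_mic=0, new_mic=1, fade_samples=2)
```

Ramping the gains instead of switching them instantly avoids an abrupt step in the output signal at the moment of the microphone change.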

As detailed above, the two-way communication unit 2 uses the characteristics of directional microphones to collect the speaker's voice with a good S/N and can appropriately select one microphone signal from among the plurality of microphone signals. It supplies the selected microphone signal and the selected microphone information (a microphone number from 1 to 6) to the downstream device.

First embodiment
The voice processing device of the first embodiment is described below.
FIG. 5 is a block diagram of the voice processing device 1 of the first embodiment.
As shown in FIG. 5, the voice processing device 1 consists of the two-way communication unit 2 described above and a speech recognition processing device 3.
The microphone selection means of the present invention corresponds to the two-way communication unit 2 of the first embodiment.
The voice conversion means of the present invention corresponds to the speech recognition processing device 3 of the first embodiment.
The character string data memory of the present invention corresponds to the speech recognition memory 324 of the first embodiment.

In the first embodiment, the voice processing device 1 is used, for example, placed at the center of a round table in a conference room.
As described above, the two-way communication unit 2 has a plurality of microphones, for example six, facing the conference attendees. It receives the voice of each attendee, selects one microphone signal and outputs it to the speech recognition processing device 3, and also reports the selected microphone number. The voice of each attendee is assumed here to be a voice command lasting about 2 to 3 seconds, such as "OpenFile" or "Next".
The speech recognition processing device 3 applies speech recognition to the microphone signal selected by the two-way communication unit 2, following a predetermined procedure described later, and converts it into character string data.

In the two-way communication unit 2 of the voice processing device 1 shown in FIG. 5, the A/D converter block 21 consists not of the two-channel A/D converters described with reference to FIG. 1 but of one-channel A/D converters 211 to 216, one for each microphone. In addition, the two-way communication unit 2 of the voice processing device 1 shown in FIG. 5 does not use the speaker 28, so it needs neither the circuitry around the speaker nor the DSP 23 that performs echo cancellation, and these are omitted from FIG. 5.

In the voice processing device 1 shown in FIG. 5, when, for example, the two-way communication unit 2 and the speech recognition processing device 3 are built as a single unit, the microphone signal selected by the two-way communication unit 2 does not need to be supplied to the speech recognition processing device 3 as the analog signal S261 and is instead supplied as the digital signal S22 shown by the dotted line. The following description, however, assumes that it is supplied as an analog signal (signal S261).
As described above, the selected microphone information (a microphone number from 1 to 6) is supplied to the speech recognition processing device 3 as MC_SEL.

The speech recognition processing device 3 consists of an A/D converter 31 and a speech recognition processing unit 32. The speech recognition processing unit 32 has a CPU 321, a buffer 322, a speech recognition unit 323, and a speech recognition memory 324.
The A/D converter 31 receives the microphone signal (S261), an analog signal selected by the two-way communication unit 2, and converts it into a digital signal.
The speech recognition processing unit 32 receives the digitized microphone signal from the A/D converter 31, as well as the microphone signal selected by the two-way communication unit 2. Consequently, whenever the microphone is switched in the two-way communication unit 2, the speech recognition processing unit 32 is supplied with the correspondingly updated microphone signal and its microphone information MC_SEL (a microphone number from 1 to 6).

The CPU 321 controls the speech recognition processing unit 32 as a whole and, in particular, performs the input signal level monitoring and the switching control based on microphone information described later.
The buffer 322 comprises a separate buffer for each microphone number. It successively stores the microphone signal of the two-way communication unit 2, captured through the A/D converter 31, in the buffer for the microphone number indicated by the microphone switching signal MC_SEL.
The speech recognition unit 323 takes in the microphone signal held in the buffer 322 at the timing when the microphone switching signal MC_SEL changes, and performs speech recognition processing.
The speech recognition memory 324 stores, in advance, character string data corresponding to the voice commands to be input. The speech recognition unit 323 performs speech recognition on the input microphone signal, collates the result against the character string data stored in the speech recognition memory 324, and selects the matching entry.
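The collation against the speech recognition memory 324 can be pictured with the following minimal sketch. It assumes the recognizer yields a text hypothesis and that matching ignores case and spacing; the names COMMAND_MEMORY, normalize, and collate are hypothetical, and only "OpenFile" and "Next" are commands named in this description.

```python
# Hypothetical stand-in for the speech recognition memory 324.
# Only "OpenFile" and "Next" appear in the description; a real table would hold more.
COMMAND_MEMORY = ("OpenFile", "Next")


def normalize(text: str) -> str:
    """Lower-case and drop whitespace so that 'Open File' matches 'OpenFile'."""
    return "".join(text.split()).lower()


def collate(recognized: str, memory=COMMAND_MEMORY):
    """Return the stored command string that matches the recognized text, or None."""
    key = normalize(recognized)
    for command in memory:
        if normalize(command) == key:
            return command
    return None
```

An utterance that matches no stored entry yields None, which corresponds to no character string data being selected.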

FIGS. 6(A) to 6(D) are timing charts illustrating the control operation performed by the speech recognition processing unit 32.
FIG. 6(A) is a timing chart of the microphone switching signal MC_SEL supplied from the two-way communication unit 2. For example, "#4" indicates that microphone number 4 is currently selected.
FIG. 6(B) is a timing chart of the microphone signal S261 supplied from the two-way communication unit 2. The microphone signal S261 is the voice signal corresponding to the microphone number indicated by the microphone switching signal MC_SEL in FIG. 6(A); it is digitized by the A/D converter 31 and input to the speech recognition processing unit 32. As shown in FIG. 6(B), the microphone signal S261 carries the voice signals of commands such as "OpenFile" and "Next".
FIG. 6(C) is a timing chart showing the processing performed by the speech recognition processing unit 32 based on the information in FIGS. 6(A) and 6(B). It consists of the buffering of each piece of voice data and the speech recognition processing that follows the buffering.
FIG. 6(D) is a timing chart of the character string data S32 output successively as the result of the speech recognition processing shown in FIG. 6(C).

The details are described later with reference to flowcharts; here, the operation of the speech recognition processing unit 32 is outlined with reference to FIG. 6.
First, as shown in FIG. 6(A), the microphone number initially selected by the two-way communication unit 2 is #4, and the voice data S261 of microphone number 4, "OpenFile", is input to the speech recognition processing device 3 (FIG. 6(B)). The speech recognition processing unit 32 receives the digital voice signal of S261 through the A/D converter 31 and starts buffering as shown in FIG. 6(C); the voice data is held in the buffer of the buffer 322 that corresponds to microphone number #4.

Thereafter, when the selected microphone changes in the two-way communication unit 2 and the microphone number goes from #4 to #1, the microphone switching signal becomes MC_SEL = 1.
As shown in FIG. 6(B), the voice data of microphone number #1 corresponds to "Next". The speech recognition processing unit 32 ends the buffering for microphone number #4 and starts buffering for microphone number #1; in parallel, the speech recognition unit 323 performs speech recognition on the voice data of microphone number #4 held in the buffer.
In the speech recognition processing, the voice data of microphone number #4 is recognized and collated against the command group of character string data stored in the speech recognition memory 324. The matching entry is selected, and the character string data "OpenFile" is output as the signal S32, as shown in FIG. 6(D).

The same applies when the microphone number subsequently changes from #1 to #2.
The control operation outlined above is described further below with reference to flowcharts.
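The hand-off outlined above, in which a change of MC_SEL closes the buffer of the previous microphone and passes it to recognition, can be sketched as follows. This is a simplified illustration, not the patented implementation: text fragments stand in for audio samples, the recognizer is a hypothetical callable, and recognition runs sequentially rather than in parallel with the next buffering. The class and function names are assumptions.

```python
from collections import defaultdict


class SwitchTriggeredRecognizer:
    """Buffers samples per microphone and recognizes a buffer when MC_SEL changes."""

    def __init__(self, recognize):
        self.recognize = recognize        # hypothetical callable: list of samples -> str
        self.buffers = defaultdict(list)  # one buffer per microphone number
        self.current_mic = None
        self.outputs = []                 # stands in for the string output S32

    def on_sample(self, mc_sel, sample):
        # A change of MC_SEL ends buffering for the previous microphone.
        if self.current_mic is not None and mc_sel != self.current_mic:
            self.flush()
        self.current_mic = mc_sel
        self.buffers[mc_sel].append(sample)

    def flush(self):
        # Hand the finished buffer to the recognizer and record the result.
        data = self.buffers.pop(self.current_mic, [])
        if data:
            self.outputs.append((self.current_mic, self.recognize(data)))


def demo():
    # Toy input following FIG. 6: microphone #4, then #1, then #2.
    rec = SwitchTriggeredRecognizer(recognize="".join)
    for mc_sel, sample in [(4, "Open"), (4, "File"), (1, "Next"), (2, "...")]:
        rec.on_sample(mc_sel, sample)
    return rec.outputs
```

In this toy run the buffer for microphone #4 is recognized when MC_SEL changes to 1, and the buffer for microphone #1 when it changes to 2, mirroring the order of outputs in FIG. 6(D).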

The flowcharts of the control performed by the CPU 321 are described below with reference to FIGS. 7 to 10. FIG. 7 shows the main flow of the control performed by the CPU 321.
In FIG. 7, first, a T1 timer of, for example, 20 kHz is started, and control transfers to the T1 timer interrupt shown in FIG. 8 every 50 μs. If there is voice input at or above a certain level (step ST11), the process proceeds to step ST12. Needless to say, this threshold level can be set as appropriate for the application.
Because the microphone switching signal MC_SEL is supplied to the speech recognition processing device 3, when voice input at or above the certain level is detected in step ST11, the device knows the microphone number (1 to 6) of that voice. Accordingly, in step ST12, sampling of the input voice data is started and the voice data is stored in the buffer corresponding to that microphone number (1 to 6).
If there is no voice input at or above the certain level, nothing is done in step ST12.
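The interrupt interval of the T1 timer and its frequency are tied by the relation period = 1 / frequency. As a quick check of the figures involved, the following sketch (the helper name period_us is hypothetical) converts a timer frequency into a period in microseconds.

```python
def period_us(frequency_hz: float) -> float:
    """Timer period in microseconds for a given interrupt frequency in hertz."""
    return 1_000_000 / frequency_hz
```

With this relation, a 50 μs interval corresponds to 20 kHz, whereas a 2 kHz timer would fire only every 500 μs.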

FIG. 10 shows the interrupt flow executed when the microphone selection information changes during the main flow of FIG. 7. That is, it is the interrupt flow that occurs when, during the main flow (the normal control operation), the microphone number selected by the two-way communication unit 2 changes and this is notified through the microphone switching signal MC_SEL. In the example of FIG. 6, this is the case where MC_SEL changes from 4 to 1 while the voice data of microphone number 4 (MC_SEL = 4) was being sampled and stored in the buffer for microphone number 4.
In step ST40 of FIG. 10, if voice sampling was in progress, no further voice data is stored in the buffer.
In this case, the current speech input from microphone number 4 is regarded as finished, and sampling is ended (step ST41).
The voice data of microphone number 4 whose sampling has ended is then handed to the speech recognition unit 323, and speech recognition processing is performed (step ST42). In the example of FIG. 6, the speech recognition unit 323 recognizes the voice data of microphone number 4 as "OpenFile", and the character string data is output to the outside of the speech processing apparatus 1 as the signal S32.

As described above, the T1 timer is started in step ST10 of the main flow in FIG. 7, and the T1 timer interrupt flow shown in FIG. 8 is started, for example, every 50 μs (20 kHz). In the T1 timer interrupt, the presence of voice input, and of voice input at or above the certain level, is monitored every 50 μs and appropriate action is taken. First, step ST20 checks whether voice sampling is in progress.
If voice sampling is in progress, it is further checked whether there is voice input at or above the certain level (step ST21); if there is, the T2 timer described later is stopped. The T2 timer monitors the absence of speech: when no speech occurs for a certain period, it automatically moves processing on to the next phase, speech recognition.
When the utterance, that is, the voice input, is at or above the certain level, the utterance is considered to be continuing, and the T2 timer is reset in step ST22.
When voice sampling is in progress in step ST20 but there is no voice input at or above the certain level, the current utterance may have ended, so the T2 timer is started in order to monitor how long the silence lasts (step ST23).
Even when there is no voice input at or above the certain level in step ST21, the speaker may resume, so voice sampling continues (step ST24).

If voice sampling is not in progress in step ST20, step ST25 checks whether there is voice input at or above the certain level. This checks whether an utterance has started. If there is voice input at or above the certain level, an utterance is deemed to have started, and voice sampling into the buffer corresponding to the newly selected microphone is started (step ST26).
If there is no voice input at or above the certain level in step ST25, nothing is done and the device waits for the next valid utterance.

When the T2 timer (for example, a 2 Hz timer) started in step ST23 of FIG. 8 expires, that is, when voice sampling is in progress (step ST20) but the absence of voice input at or above the certain level has continued for a certain period, continuing the sampling is pointless, so control transfers to the T2 timer interrupt flow shown in FIG. 9.
That is, the voice sampling in progress is ended (step ST30) and processing moves to speech recognition (step ST31).
After the transition to speech recognition, the T2 timer is reset in step ST32 in preparation for the next utterance.
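The interplay of the T1 interrupt (steps ST20 to ST26) and the T2 timeout (steps ST30 to ST32) amounts to a small state machine. The sketch below is one possible reading under stated assumptions: the T2 timer is modeled as a count of consecutive quiet T1 ticks, the level is a plain number compared against a threshold, and the class and attribute names are hypothetical.

```python
class UtteranceMonitor:
    """Sketch of the T1 tick logic (ST20-ST26) and the T2 timeout (ST30-ST32)."""

    def __init__(self, threshold, silence_ticks):
        self.threshold = threshold
        self.silence_ticks = silence_ticks  # assumption: T2 period expressed in T1 ticks
        self.sampling = False
        self.silent = 0                     # quiet ticks counted so far (0 = T2 reset)
        self.buffer = []
        self.finished = []                  # utterances handed to speech recognition

    def on_t1_tick(self, level):
        loud = abs(level) >= self.threshold
        if not self.sampling:               # ST20: not sampling yet
            if loud:                        # ST25 -> ST26: an utterance starts
                self.sampling = True
                self.buffer = [level]
            return
        self.buffer.append(level)           # ST24: keep sampling, even during pauses
        if loud:
            self.silent = 0                 # ST22: utterance continues, reset T2
        else:
            self.silent += 1                # ST23: T2 is running
            if self.silent >= self.silence_ticks:
                self.on_t2_timeout()

    def on_t2_timeout(self):
        self.sampling = False               # ST30: end sampling
        self.finished.append(self.buffer)   # ST31: move on to speech recognition
        self.buffer = []
        self.silent = 0                     # ST32: reset T2 for the next utterance
```

A short pause inside an utterance leaves the buffer open, while a pause that reaches the T2 limit closes it and hands it to recognition.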

The first embodiment of the speech processing apparatus according to the present invention has been described above.
With the speech processing apparatus 1 of the first embodiment, even when several people speak at once and issue voice commands to the apparatus through the microphones facing the respective conference attendees, the two-way communication unit 2 of the speech processing apparatus 1 analyzes the sound pressure level of each voice band by band, identifies the main speaker, and hands that voice signal to the speech recognition processing device 3. Consequently, even when a plurality of voice commands are input simultaneously, the speech recognition processing device 3 can avoid misrecognition as far as possible and can appropriately judge and process the voice command of the main speaker.
The speech recognition processing device 3 buffers the voice command signal handed to it, performs speech recognition on the buffered voice signal, collates the result against the command character string data stored in the speech recognition memory, and selects and processes the matching character string data.
The speech recognition processing device 3 is also notified continuously of the microphone number selected by the two-way communication unit 2. When the selected microphone number changes, it stops buffering, performs speech recognition on the voice signal buffered up to that point, and starts buffering the voice command signal from the new microphone number, which improves the accuracy of speech recognition.

Second Embodiment
Next, a second embodiment is described with reference to FIG. 11.
FIG. 11 is a block diagram of the speech processing apparatus 10 according to the second embodiment.
In FIG. 11, the speech processing apparatus 10 comprises a two-way communication unit 2 and a speech recognition processing device 4.
In the second embodiment, the speech processing apparatus 10 is used, for example, placed at the center of a round table in a conference room.
As in the first embodiment, the two-way communication unit 2 is used. It has a plurality of microphones, for example six, facing the conference attendees, receives the voice of each attendee, identifies the main speaker by the method already described, selects one microphone signal, and supplies that voice signal to the downstream speech recognition processing device 4.
The speech recognition processing device 4 performs speech recognition and voiceprint recognition on the microphone signal selected by the two-way communication unit 2, thereby recognizing and identifying the speaker, and performs authentication and related processing on the speaker's command based on the role assigned to each conference attendee.
The speech recognition processing device 4 also allows the role of each conference attendee to be set interactively.

The two-way communication unit 2 is identical to the two-way communication unit 2 described in detail in the first embodiment. In the speech processing apparatus 1 of the first embodiment, the two-way communication unit 2 did not use the speaker 28, so the DSP 23 having the echo cancellation function was omitted. In the present embodiment the speaker 28 is used, so FIG. 11 shows a two-way communication unit 2 with the same configuration as the one described with reference to FIG. 1.

Accordingly, the speech processing apparatus 10 of the present embodiment differs from the speech processing apparatus 1 of the first embodiment in the speech recognition processing device 4, and the following description therefore concentrates on the speech recognition processing device 4.
The configuration of the speech recognition processing device 4 is described below.
The speech recognition processing device 4 has an A/D converter 41, a voice/voiceprint recognition unit 42, a control processing unit 43, a voice prompt storage unit 44, a voiceprint/role data storage unit 45, a speech synthesis processing unit 46, and a D/A converter 47.
The voice/voiceprint recognition unit 42 comprises a speech recognition unit 421 and a voiceprint recognition unit 422.

The microphone selection means of the present invention corresponds to the two-way communication unit 2 of the present embodiment.
The voice conversion means of the present invention corresponds to the speech recognition unit 421 of the present embodiment.
The voiceprint generation means of the present invention corresponds to the voiceprint recognition unit 422 of the present embodiment.
The instruction code processing means of the present invention corresponds to the control processing unit 43 of the present embodiment.
The authority data memory of the present invention corresponds to the voiceprint/role data storage unit 45 of the present embodiment.
The voice output means of the present invention corresponds to the speech synthesis processing unit 46 of the present embodiment.

The A/D converter 41 receives, as an analog signal, the microphone signal of the microphone number selected by the two-way communication unit 2 and converts it to digital form.
The voice/voiceprint recognition unit 42 takes in the voice data generated by the A/D converter 41 and performs speech recognition and voiceprint recognition. It outputs the character string data obtained by speech recognition to the control processing unit 43, and stores the voiceprint data obtained by voiceprint recognition in the voiceprint/role data storage unit 45. Speech recognition and voiceprint recognition are handled by the speech recognition unit 421 and the voiceprint recognition unit 422, respectively.
The control processing unit 43 incorporates a CPU and controls the speech recognition processing device 4 as a whole. In particular, it performs the voice command execution authority authentication processing described later.
In the voice command execution authority authentication processing, where, for example, the roles in a conference and the authority attached to each role are defined in advance, a command spoken by a conference participant is authenticated against the role assigned to that participant by the voiceprint data assignment processing and the authority attached to that role, and the command is processed only if it is permitted.

The voice prompt storage unit 44 stores voice prompt data, in units such as phonemes and syllables, used to generate the speech that the speech processing apparatus 10 outputs to the conference participants through the speaker 28.
The voiceprint/role data storage unit 45 stores the role table data and voiceprint table data described later. This allows the control processing unit 43 to identify the conference attendee who issued an input voice command, and to look up the role of each conference participant and the command authority attached to it.
The speech synthesis processing unit 46 is supplied, via the control processing unit 43, with the voice prompt data in units such as phonemes and syllables stored in the voice prompt storage unit 44, and synthesizes them into a voice signal with natural accent and intonation.
The D/A converter 47 converts the digital voice signal generated by the speech synthesis processing unit 46 into an analog signal and outputs it to the two-way communication unit 2 as the signal S47.
The voice signal generated by the speech synthesis processing unit 46 undergoes echo cancellation in the DSP 23 of the two-way communication unit 2 and is then conveyed to the conference participants through the speaker 28.

When the two-way communication unit 2 and the speech recognition processing device 4 are built as a single unit, the voice data need not be converted to an analog signal. In that case the D/A converter 261 and the A/D converter 263 in the two-way communication unit 2, and the A/D converter 41 and the D/A converter 47 in the speech recognition processing device 4, are unnecessary.

The operation of the speech processing apparatus 10 is described below, using the course of an actual conference as an example.
In the present embodiment, it is assumed that the commands that conference participants issue to the speech processing apparatus 10, for example the command group listed in the Command column of Table 1 below, are defined in advance in the voiceprint/role data storage unit 45 of the speech recognition processing device 4.

In Table 2, the entries in the Role column, such as "common" and "chairperson", are the roles assigned to the conference attendees. A group of commands permitted only to each role is defined. That is, each command is valid only for conference attendees who have been assigned the corresponding role.
Thus, in Table 2 for example, the commands "LightOn" and "LightOff" are valid only when issued by a conference attendee assigned the assistant role, and the commands "NextPresenter", "StartRecord", and "StopRecord" are valid only when issued by a conference attendee assigned the chairperson role. "Common" in the Role column means that authority is granted to all conference attendees; the command "StartMeeting" is treated as valid whichever attendee utters it.
The Contents column of Table 2 gives the meaning of each command. In the "Output externally?" column, "Yes" means that the command script for the command is output to the outside, and "No" means that it is not. For example, "LightOn" is a command that is output externally, so when it is issued its command script is output to the outside of the speech processing apparatus 10 and serves as a control command that causes an external device controlling the room lighting (not shown) to turn the lighting on.
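One way to picture the role table is as a list of rows, each carrying a role, a command, and the external-output flag. The sketch below uses only the roles and commands named in this description. Table 2 itself is not reproduced here, so every flag other than the one for "LightOn" is a placeholder assumption, and the identifiers CommandEntry, ROLE_TABLE, and commands_for are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandEntry:
    role: str              # "common", "chairperson", or "assistant"
    command: str
    output_external: bool  # the "Output externally?" column


# Only the "LightOn" flag is stated in the text; the other flags are placeholders.
ROLE_TABLE = [
    CommandEntry("common", "StartMeeting", False),
    CommandEntry("chairperson", "NextPresenter", False),
    CommandEntry("chairperson", "StartRecord", False),
    CommandEntry("chairperson", "StopRecord", False),
    CommandEntry("assistant", "LightOn", True),
    CommandEntry("assistant", "LightOff", True),
]


def commands_for(role: str) -> set:
    """Commands usable by a role: its own entries plus the 'common' entries."""
    return {e.command for e in ROLE_TABLE if e.role in (role, "common")}
```

Keeping the flag on the same row as the command means a single lookup answers both who may issue a command and where its result goes.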

Next, the operation of the speech processing apparatus 10 in the present embodiment is described.
In the present embodiment, commands such as "LightOn" and "NextPresenter" are spoken by, for example, a conference attendee and are input to the six microphones MIC1 to MIC6 of the two-way communication unit 2 in FIG. 11.
The two-way communication unit 2 processes the input voice command and supplies the analog signal S261 to the speech recognition processing device 4 through the D/A converter 261. As already described, even when a plurality of voice commands are input to the six microphones MIC1 to MIC6, the two-way communication unit 2 evaluates the sound pressure level in each band, selects the one microphone with the relatively highest voice level, and supplies the voice command input from that microphone to the speech recognition processing device 4.
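The selection of a single microphone from band-wise levels can be illustrated by a deliberately simplified sketch. It assumes each microphone reports one level per frequency band and that the levels are simply summed; the actual evaluation in the two-way communication unit 2 is described elsewhere in this document and may differ. The function name and input layout are hypothetical.

```python
def select_microphone(band_levels):
    """Pick the microphone number whose summed per-band level is highest.

    band_levels maps a microphone number to a list of levels, one per band.
    """
    return max(band_levels, key=lambda mic: sum(band_levels[mic]))
```

For example, given levels for microphones 1, 4, and 6, the function returns the number of the microphone with the largest total, which would then be reported as MC_SEL.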

The analog signal S261 of the voice command input to the speech recognition processing device 4 is converted into a digital signal by the A/D converter 41 of the speech recognition processing device 4 and taken into the voice/voiceprint recognition unit 42.
The voice/voiceprint recognition unit 42 performs speech recognition on the digitized voice command signal, converts it into command data, and outputs it to the control processing unit 43. The voice/voiceprint recognition unit 42 also performs voiceprint recognition on the input voice command signal and outputs the resulting voiceprint data to the control processing unit 43.

The control processing unit 43 first collates the voiceprint data received from the voice/voiceprint recognition unit 42 with the data stored in the voiceprint/role data storage unit 45.
In the present embodiment, the voiceprint data of the conference attendees is stored in advance in the voiceprint/role data storage unit 45 as voiceprint table data, in a form associated with the roles. Therefore, by collating the input voiceprint data with the voiceprint data stored in the voiceprint/role data storage unit 45, the control processing unit 43 can identify the conference attendee who issued the command.
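This description does not specify how voiceprints are compared. Purely as an illustration, the sketch below assumes each voiceprint is a small feature vector and picks the nearest stored entry by Euclidean distance, rejecting matches beyond a threshold. The vectors, the threshold, the second table entry, and the names VOICEPRINT_TABLE and identify are all hypothetical.

```python
import math

# Hypothetical voiceprint table: speaker -> (feature vector, role).
# Only "A is an assistant" follows the example in this description.
VOICEPRINT_TABLE = {
    "A": ((0.9, 0.1, 0.3), "assistant"),
    "B": ((0.2, 0.8, 0.5), "chairperson"),
}


def identify(voiceprint, table=VOICEPRINT_TABLE, max_distance=0.5):
    """Return (speaker, role) for the closest stored voiceprint, or None if too far."""
    best = min(table, key=lambda name: math.dist(voiceprint, table[name][0]))
    features, role = table[best]
    if math.dist(voiceprint, features) > max_distance:
        return None
    return best, role
```

Returning None for an unknown voice lets the caller ignore commands from speakers who are not registered.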

The voiceprint/role data storage unit 45 further stores role table data holding information such as that in Table 2: the voiceprint data of the conference attendees, the role of each attendee, and the group of commands that each role is authorized to issue.
Therefore, by referring to the voiceprint/role data storage unit 45, the control processing unit 43 learns not only which conference attendee issued the command but also the group of commands permitted to that attendee in advance, and can thus determine whether the command is valid. If the command is valid for the identified attendee, the command data produced by speech recognition in the voice/voiceprint recognition unit 42 is processed; if it is not valid, that command data is ignored.
The role table data stored in the voiceprint/role data storage unit 45 also contains, for each command, the flag from Table 2 indicating whether the command is output externally ("Yes" or "No"). The control processing unit 43 refers to this flag. When the command is valid and the flag is "Yes", the command data is output to the outside; when the command is valid but the flag is "No", the command data is not output and is processed internally.
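The decision just described has three outcomes: ignore the command, process it internally, or output it externally. A minimal sketch follows; the sets reuse the commands named in this description, the external set contains only "LightOn" because that is the only flag the text states, and the names PERMITTED, COMMON, EXTERNAL, and handle are hypothetical.

```python
# Commands permitted per role, as named in this description.
PERMITTED = {
    "assistant": {"LightOn", "LightOff"},
    "chairperson": {"NextPresenter", "StartRecord", "StopRecord"},
}
COMMON = {"StartMeeting"}   # valid for every attendee
EXTERNAL = {"LightOn"}      # assumption: only this flag is stated in the text


def handle(role: str, command: str) -> str:
    """Return 'ignored', 'internal', or 'external' for a recognized command."""
    if command not in PERMITTED.get(role, set()) | COMMON:
        return "ignored"    # outside the speaker's authority
    return "external" if command in EXTERNAL else "internal"
```

Checking authority before consulting the output flag ensures that an unauthorized command never reaches an external device.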

The above processing is explained further using a concrete example.
Suppose, for example, that six people, Mr. A through Mr. F, are holding a conference and that Mr. A is the assistant. Mr. A's voiceprint data has been acquired beforehand and stored in the voiceprint/role data storage unit 45.
When Mr. A utters the command "NextPresenter" (voice), the command is input to the speech processing apparatus 10 through the microphone facing Mr. A and taken into the voice/voiceprint recognition unit 42. The data converted into "NextPresenter" (command data) and the voiceprint data obtained by voiceprint recognition are then supplied to the control processing unit 43.
The control processing unit 43 collates the supplied voiceprint data with the voiceprint table data stored in the voiceprint/role data storage unit 45 and finds that the voice command came from Mr. A and that Mr. A is the assistant. Referring to the role table data, it finds that the command "NextPresenter" is not included in the assistant's authority, judges it to be an unauthorized command, and does not process the "NextPresenter" command data generated by speech recognition in the voice/voiceprint recognition unit 42.

When several of Mr. A to Mr. F issue commands at overlapping times, the two-way call unit 2, by virtue of its characteristics, analyzes the sound pressure level of each voice in each frequency band, identifies one speaker, and passes that speaker's voice signal to the voice recognition processing device 4. Consequently, even if a plurality of voice commands are input simultaneously, the voice recognition processing device 4 can avoid malfunction as far as possible and can judge and process the commands appropriately.
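The idea of choosing one speaker from per-band levels can be illustrated with a deliberately simplified sketch. It assumes that each microphone's signal has already been reduced to a list of linear power values, one per frequency band, and it simply picks the microphone with the largest total. The function name `select_main_microphone` and this selection criterion are assumptions made for illustration; the actual criterion used by the two-way call unit 2 is the one described for that unit elsewhere in this specification.

```python
def select_main_microphone(band_levels):
    """Pick the microphone whose summed per-band power is largest.

    band_levels maps a microphone number to a list of linear power
    values, one per frequency band (an assumed input format).
    Returns the microphone number, or None if there is no input.
    """
    if not band_levels:
        return None
    # max() keeps the first microphone encountered when totals are tied.
    return max(band_levels, key=lambda mic: sum(band_levels[mic]))
```

Only the voice signal of the microphone returned here would then be handed to the recognition stage, so overlapping commands from the other microphones are not recognized at the same time.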

The configuration and operation of the voice processing apparatus 10 have been described above using specific examples.
The present embodiment is not limited to the above description, and various modifications are possible.
For example, although the role table data and the voiceprint table data have been described as being stored in advance in the voiceprint / role data storage unit 45, they can be added, changed, or deleted as appropriate through the control processing unit 43. Likewise, voice prompt data can be added to or deleted from the voice prompt storage unit 44.

As described above, the voice processing apparatus 10 receives a voice command and performs voice recognition and voiceprint recognition on it. It identifies the speaker of the voice command and the speaker's role from the preset voiceprint table data, and it finds the group of commands permitted for that role from the role table data. This makes it possible, for example, to process a voice command according to the authority of the speaker who issued it.
Furthermore, because the voice processing apparatus 10 has the two-way call unit 2, erroneous detection and malfunction can be avoided as far as possible even when a plurality of voice commands are input, as described above, and the commands can be judged and processed appropriately.

Third Embodiment
Next, a third embodiment will be described.
The voice processing apparatus 10 of the second embodiment can be used as it is as the voice processing apparatus 10 of the third embodiment.
In the third embodiment, the voiceprint table data in the voiceprint / role data storage unit 45 of the voice recognition processing device 4 described in the second embodiment is generated through, for example, an interactive exchange between the meeting attendees and the voice processing apparatus 10.
Accordingly, the voice prompt storage unit 44, the voice synthesis processing unit 46, the D / A converter 47, and the like, which are not used in the second embodiment, are used in the present embodiment.

As already described, the voice prompt storage unit 44 stores voice prompt data in units such as phonemes and syllables, which is used to generate the voice that the voice processing apparatus 10 utters to the meeting participants through the speaker 28. This voice prompt data is supplied from the voice prompt storage unit 44 to the voice synthesis processing unit 46 via the control processing unit 43. The voice synthesis processing unit 46 synthesizes the voice prompt data and generates a voice signal with natural accent and intonation.

In contrast to the control processing unit 43 of the second embodiment, the control processing unit 43 of the present embodiment is characterized in that it performs voiceprint data allocation processing.
Voiceprint data allocation processing is, for example, processing that associates voiceprint data, obtained from the voice data of a meeting participant's utterance, with role data for that participant in the meeting. In the voice recognition processing device 4, the role of the meeting participant who is the source of a voice signal is set automatically from the voice signal supplied from the two-way call unit 2 as the signal S261.
The operation of the voice processing apparatus 10 in this interactive exchange will be described below using an example of an actual exchange between the meeting attendees and the voice processing apparatus 10.

First, a command that triggers the start of the interactive exchange is set in advance and stored in the role table data of the voiceprint / role data storage unit 45. In Table 2 above, for example, the command "StartMeeting" serves this purpose. If the role that has authority for this command is "common", and the role "common" is set in the voiceprint table data for, for example, all meeting attendees, then the start of the interactive processing is permitted whichever attendee utters "StartMeeting" (voice).
When a meeting attendee utters a command that triggers such an interactive exchange, the voice recognition processing device 4 takes in the voice command as a command signal through the two-way call unit 2. The voice / voiceprint recognition unit 42 performs voice recognition and voiceprint recognition on the command signal and supplies command data and voiceprint data to the control processing unit 43.
Based on the input voiceprint data, the control processing unit 43 refers to the voiceprint / role data storage unit 45 to confirm the speaker's role and authority. Because a trigger command such as "StartMeeting" is permitted to everyone as described above, the "StartMeeting" command is executed.

When the "StartMeeting" command is executed, the control processing unit 43 controls the voice synthesis processing unit 46 so as to generate voice data such as "Who is the chairman?". To do so, voice prompt data such as phoneme data and syllable data corresponding to this voice data is taken out of the voice prompt storage unit 44 and synthesized into voice data.
The voice synthesis processing unit 46 performs voice synthesis based on the voice prompt data supplied via the control processing unit 43 and generates a digital signal corresponding to "Who is the chairman?". The synthesis is performed so that natural accent and intonation are added.
The digital signal corresponding to "Who is the chairman?" is subjected to echo cancellation processing by the DSP 23 of the two-way call unit 2, either directly as a digital signal or after first being converted into an analog signal by the D / A converter 47 and the D / A converter 263. The signal is then amplified by an amplifier circuit (not shown), and the voice "Who is the chairman?" is conveyed to the meeting attendees from the speaker 28.

Suppose that one of the meeting attendees, for example Mr. B, responds to this voice with "Then I will act as facilitator today." This utterance is processed in the same way as "StartMeeting", but in this case the voiceprint recognition processing of the voice / voiceprint recognition unit 42 yields the voiceprint data corresponding to the utterance "Then I will act as facilitator today."
The control processing unit 43 then associates Mr. B's voiceprint data with the role of chairman. That is, it writes to the voiceprint table data in the voiceprint / role data storage unit 45, makes the settings shown in Table 3 below, and associates Mr. B's voiceprint data with the chairman role.
From then on, among the commands uttered by Mr. B, those for which authority has been given in advance to the chairman are processed as valid.

Once the chairman has been set, the voice recognition processing device 4 sets the roles other than chairman. To do so, it creates a voice signal for asking a question such as "Who will assist today's meeting?" by synthesizing the voice prompt data of the voice prompt storage unit 44 in the voice synthesis processing unit 46, in the same way as for "Who is the chairman?" above.

By performing the processing described above, each person's role is associated with his or her voiceprint data in the voiceprint table data, one after another, through an interactive exchange between the voice processing apparatus 10 and the meeting attendees such as the following.

<Mr. A> "StartMeeting"
<Apparatus> "Who is the chairman?"
<Mr. B> "Then I will act as facilitator today."
→ The chairman role is automatically assigned to Mr. B's voiceprint data.
<Apparatus> "Who will assist today's meeting?"
<Mr. C> "I will take care of it."
→ The assistant role is automatically assigned to Mr. C's voiceprint data.
<Mr. B> "NextPresenter"
<Apparatus> "Who is going to give the explanation?"
<Mr. D> "I will explain."
→ The presenter role is automatically assigned to Mr. D's voiceprint data.

Through the interactive exchange between the voice processing apparatus 10 and the meeting attendees described above, voiceprint table data such as that shown in Table 3 below is generated in the voiceprint / role data storage unit 45.
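The way the voiceprint table data is built up during this exchange can be sketched as below. It is a simplified illustration under stated assumptions: the prompt list `PROMPTS`, the function `assign_roles`, and the callback that returns the voiceprint identifier of whoever answers are hypothetical, and the prompt for the presenter, which in the exchange above is triggered by the "NextPresenter" command, is omitted for brevity.

```python
# Hypothetical prompt list: (role to assign, question the apparatus asks).
PROMPTS = [
    ("chairman", "Who is the chairman?"),
    ("assistant", "Who will assist today's meeting?"),
]


def assign_roles(prompts, answer_voiceprint):
    """Build a voiceprint table (voiceprint ID -> role) from the answers.

    answer_voiceprint is a callable that plays a prompt and returns the
    voiceprint ID recognized from the reply, or None if nobody answers.
    """
    table = {}
    for role, prompt in prompts:
        voiceprint_id = answer_voiceprint(prompt)
        if voiceprint_id is not None:
            # Whoever answers the prompt is bound to the role it asks about.
            table[voiceprint_id] = role
    return table
```

For instance, if the replies to the two prompts are recognized as coming from voiceprints `vp-B` and `vp-C`, the resulting table binds `vp-B` to the chairman role and `vp-C` to the assistant role.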

Once the voiceprint table data has been created, the control processing unit 43 authenticates the authority for commands according to each role, by the same operation as in the second embodiment, as in the example below. The contents of the commands in the example are the same as those shown in Table 2.

<Mr. D> "OpenFileUriage"
→ The Uriage file is opened on the PC as explanatory material.
<Mr. C> "LightOff"
→ The room lighting is turned off.
<Mr. B> "StartRecord"
→ Recording of the minutes starts.
<Mr. B> "LightOn"
→ Mr. B has no authority to execute this command, so it is ignored.

In the above example, the voice processing apparatus 10 is connected to a PC set up for the meeting, and the "OpenFileUriage" command, for which external output is enabled, controls the PC so that the Uriage file is opened.
When several of the meeting attendees Mr. A to Mr. F issue commands at overlapping times, the two-way call unit 2, by virtue of its characteristics, analyzes the sound pressure level of each voice in each frequency band, identifies one speaker, and passes that speaker's voice signal to the voice recognition processing device 4. Consequently, even if a plurality of voice commands are input simultaneously, the voice recognition processing device 4 can avoid malfunction as far as possible and can judge and process the commands appropriately.

The configuration and operation of the voice processing apparatus 10 in the present embodiment have been described above using specific examples.
As described above, the voice processing apparatus 10 can set the role of a speaker in the course of an interactive exchange with the speaker.
The voice processing apparatus 10 also receives a voice command and performs voice recognition and voiceprint recognition on it. It identifies the speaker of the voice command and the speaker's role from the voiceprint table data set during the interactive exchange, and it finds the group of commands permitted for that role from the role table data. This makes it possible, for example, to process a voice command according to the authority of the speaker who issued it.
Furthermore, because the voice processing apparatus 10 has the two-way call unit 2, erroneous detection and malfunction can be avoided as far as possible even when a plurality of voice commands are input, as described above, and the commands can be judged and processed appropriately.

Fourth Embodiment
A voice processing apparatus according to a fourth embodiment will be described below.
The voice processing apparatus of the fourth embodiment has the same configuration as the voice processing apparatus 10 of the second and third embodiments.
In the fourth embodiment, the voice recognition processing in the voice / voiceprint recognition unit 42 of the voice processing apparatus 10 is made more sophisticated, so that roles can be granted and ended from natural conversation, without going through an interactive role setting process between the speaker and the voice processing apparatus 10.

In the example of Table 4, the voice / voiceprint recognition unit 42 recognizes keywords such as "start the meeting" in the utterance "Well then, I would like to start the meeting" and "explain" in the utterance "Then, let me explain." The roles corresponding to those keywords, chairman and presenter, are assigned to the respective speakers. That is, the voiceprint data acquired from each speaker is associated with a role in the meeting and stored in the voiceprint table data of the voiceprint / role data storage unit 45.
Likewise, the keyword "end the meeting" in the utterance "This ends the meeting" is recognized by voice recognition, and the grant of authority to, for example, the chairman is ended.
In this way, the present embodiment develops the voice processing apparatus 10 of the third embodiment so that roles can be granted and ended from natural conversation.
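The keyword-driven granting and ending of roles can be sketched as follows. The sketch assumes that voice recognition has already produced the utterance as text and that voiceprint recognition has produced an identifier for the speaker. The English keywords stand in for the Japanese keywords of Table 4, and the names `KEYWORD_ACTIONS` and `update_roles` are hypothetical.

```python
# Hypothetical keyword table: (keyword, action, role).
KEYWORD_ACTIONS = [
    ("start the meeting", "assign", "chairman"),
    ("end the meeting", "revoke", "chairman"),
    ("explain", "assign", "presenter"),
]


def update_roles(table, voiceprint_id, utterance):
    """Update the voiceprint table (voiceprint ID -> role) from one utterance."""
    text = utterance.lower()
    for keyword, action, role in KEYWORD_ACTIONS:
        if keyword in text:
            if action == "assign":
                # Bind the speaker's voiceprint to the role for this keyword.
                table[voiceprint_id] = role
            elif table.get(voiceprint_id) == role:
                # End the role only if this speaker currently holds it.
                del table[voiceprint_id]
            break
    return table
```

Simple substring matching is used here only to keep the example short. A practical recognizer would need to cope with paraphrases and with keywords that appear in ordinary discussion without any intention of claiming a role.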

A block diagram of the two-way call unit 2.
A block diagram of the DSP 22 of the two-way call unit 2.
A diagram showing FFT results for the directional microphones of the two-way call unit 2.
A diagram showing the output of the selected microphone signal of the two-way call unit 2.
A block diagram of the voice processing apparatus 1.
A timing chart for explaining the operation of the voice processing apparatus 1.
A flowchart for explaining the operation of the voice processing apparatus 1.
A flowchart for explaining the operation of the voice processing apparatus 1.
A flowchart for explaining the operation of the voice processing apparatus 1.
A flowchart for explaining the operation of the voice processing apparatus 1.
A block diagram of the voice processing apparatus 10.

Explanation of symbols

1, 10 ... voice processing apparatus, 2 ... two-way call unit, 21 ... A / D converter block, 22 ... first digital signal processor (DSP), 23 ... second digital signal processor (DSP), 24 ... CPU, 25 ... codec, 26 ... D / A converter block, 27 ... amplifier block, 28 ... speaker, 3 ... voice recognition processing device, 31 ... A / D converter, 32 ... voice recognition processing unit, 321 ... CPU, 322 ... buffer, 323 ... voice recognition unit, 324 ... voice recognition memory, 4 ... voice recognition processing device, 41 ... A / D converter, 42 ... voice / voiceprint recognition unit, 421 ... voice recognition unit, 422 ... voiceprint recognition unit, 43 ... control processing unit, 44 ... voice prompt storage unit, 45 ... voiceprint / role data storage unit, 46 ... voice synthesis processing unit, 47 ... D / A converter

Claims

A voice processing apparatus that receives and processes voice signals from a plurality of microphones, comprising: microphone selecting means for selecting one microphone based on the voice signals; and voice converting means for converting the voice signal output from the microphone selected by the microphone selecting means into character string data.

The voice processing apparatus according to claim 1, wherein each of the microphones has a unidirectional characteristic.

The voice processing apparatus according to claim 1, wherein the voice converting means comprises a buffer that holds the voice signal and a character string data memory that stores character string data corresponding to voice signals, takes the voice signal output from the microphone selected by the microphone selecting means into the buffer, and, at the timing when the selected microphone changes, collates the voice signal held in the buffer with the character string data memory and converts it into the matching character string data.