JP2005308950A

JP2005308950A - Speech processors and speech processing system

Info

Publication number: JP2005308950A
Application number: JP2004123981A
Authority: JP
Inventors: Akira Masuda; 彰増田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2004-04-20
Filing date: 2004-04-20
Publication date: 2005-11-04

Abstract

PROBLEM TO BE SOLVED: To provide a speech processor and a speech processing system that record the contents of calls among two or more calling devices for each speaker and allow the recorded contents to be shared among the calling devices.

SOLUTION: A speech processor PC_A receives from a calling device PH_A speech data in which the speech of two or more speakers, carried via a private branch exchange (PBX), is mixed. A voiceprint authentication part 20 sequentially performs voiceprint authentication of the input speech data, one voiceprint authentication unit at a time, and a CPU 10 organizes and stores the speech data in memory 40 for each speaker identified by the voiceprint authentication. Further, the data of each speaker is shared on the network via a communication interface 50.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a voice processing device and a voice processing system that recognize and process voice commands issued by, for example, a plurality of conference attendees.

Conventionally, in group work such as a meeting, the proceedings (voice data) are recorded on a recording medium such as a recorder, and after the meeting the minutes are written by playing back the recording medium and confirming who said what.
Various techniques are known for recording voice data on a recording medium.

However, confirming the speakers and writing up the proceedings after the meeting, as has been done conventionally, is not an appropriate method in terms of either work efficiency or the accuracy of speaker identification.
That is, recording the proceedings after the meeting rather than in real time is inefficient: it takes up time that could have been spent on other work after the meeting, and organizing the voice data by speaker is time-consuming and cumbersome.
In addition, it can be difficult to identify a speaker without error from audio played back from a recording medium.

The present invention has been made in view of these circumstances, and its object is to provide a voice processing device and a voice processing system that record the content of calls among a plurality of call devices for each speaker and allow the recorded call content to be shared among the call devices.

To achieve the above object, a first aspect of the present invention is a voice processing device that classifies input voice by speaker, comprising: first storage means for storing one or more items of voiceprint data in association with speaker data corresponding to each speaker; second storage means; voiceprint authentication means for extracting voiceprint data from the input voice at predetermined time intervals and collating it with the voiceprint data stored in the first storage means to identify speaker data; and control means for storing the input voice for each predetermined time interval in the second storage means, for each item of speaker data identified by the voiceprint authentication means.

To achieve the above object, a second aspect of the present invention is a voice processing system that classifies, by speaker, the voice exchanged among a plurality of call devices connected to a line network. Each call device is connected to a respective voice processing device comprising: first storage means for storing one or more items of voiceprint data in association with speaker data corresponding to each speaker; second storage means; voiceprint authentication means for extracting voiceprint data from the voice at predetermined time intervals and collating it with the voiceprint data stored in the first storage means to identify speaker data; and control means for storing the voice for each predetermined time interval in the second storage means, for each item of speaker data identified by the voiceprint authentication means. The voice processing devices are configured to be able to send and receive data to and from one another by communication.

Preferably, each call device includes voice recognition means for recognizing the voice exchanged among the plurality of call devices and converting it into a character string. When the character string obtained by the voice recognition means is a predetermined command, the control means executes the processing defined in advance for that command with respect to the voice processing device corresponding to the speaker data that the voiceprint authentication means identified from the voice.

Preferably, the predetermined command is any of the speaker data stored in the second storage means, and the processing defined in advance transmits the voice data corresponding to that speaker data, stored in the second storage means, to the voice processing device corresponding to the speaker of the voice as identified by the voiceprint authentication means.

Preferably, the permitted speakers who are allowed to execute the predetermined command are limited in advance. The control means executes the processing defined in advance for the predetermined command when the character string obtained by the voice recognition means is the predetermined command and the speaker data that the voiceprint authentication means identified from the voice corresponds to one of the permitted speakers.

According to the voice processing device of the first aspect of the present invention, the voiceprint authentication means extracts voiceprint data from the input voice at predetermined time intervals and collates it with the voiceprint data stored in the first storage means to identify speaker data, and the control means stores the input voice for each predetermined time interval in the second storage means for each item of speaker data so identified. The voice data is therefore organized by speaker automatically and accurately, without any special operation by the speakers.

According to the voice processing system of the second aspect of the present invention, in the voice processing device corresponding to each call device, the voiceprint authentication means extracts voiceprint data from the input voice at predetermined time intervals and collates it with the voiceprint data stored in the first storage means to identify speaker data, and the control means stores the input voice for each predetermined time interval in the second storage means for each item of speaker data so identified. Because input voice can also be sent to and received from the other voice processing devices, the voice data for each speaker can be shared among the speakers.

According to the present invention, the content of calls among a plurality of call devices is recorded for each speaker, and the recorded content can be shared among the call devices. Voice data is therefore organized by speaker automatically and accurately, without any special operation by the speakers, and the voice data for each speaker can be shared among the speakers.

First Embodiment
A first embodiment of the voice processing device and voice processing system according to the present invention is described below.
FIG. 1 illustrates an application example of the voice processing system according to this embodiment. The voice processing system shown in FIG. 1 comprises a plurality of call devices PH_A, PH_B, PH_C, PH_D, and PH_E, and voice processing devices PC_A, PC_B, PC_C, PC_D, and PC_E, each connected to the corresponding call device.

A digital line network is built among the call devices by a private branch exchange (PBX). It is therefore possible to hold a conference call among three or more people.
The voice processing devices PC_A, PC_B, PC_C, PC_D, and PC_E are each connected to a wireless or wired communication network and are configured to exchange data with one another.

FIG. 2 shows the configuration of each call device and each voice processing device, taking the call device PH_A and the voice processing device PC_A as an example.
The REC_OUT terminal of the call device PH_A and the LINE_IN terminal of the voice processing device PC_A are connected to each other, and digital voice data is sent from the call device PH_A to the voice processing device PC_A.

The voice processing device PC_A comprises a CPU 10 serving as the control means of the present invention, a voiceprint authentication unit (VPA) 20 serving as the voiceprint authentication means, a voiceprint register (REG) 30 serving as the first storage means, a memory (MEM) 40 serving as the second storage means, and a communication interface (I/F) 50.

The voiceprint authentication unit 20 performs voiceprint authentication on the voice data supplied from the CPU 10 at predetermined time intervals (for example, every 3 seconds) to identify the speaker. Known voiceprint authentication techniques can be applied to the voiceprint authentication unit 20.
From the viewpoint of speaker identification accuracy, the unit time of the voiceprint authentication process (for example, 3 seconds) is preferably as short as the processing capacity of the voiceprint authentication unit 20 and the CPU 10 allows.

Specifically, the voiceprint authentication unit 20 first generates the power spectrum (voiceprint data) of the voice data supplied from the CPU 10.
In the voiceprint register 30, speaker data (for example, a speaker ID) and the voiceprint data of that speaker are registered and recorded in association with each other.
The voiceprint authentication unit 20 compares the generated voiceprint data with the voiceprint data recorded in the voiceprint register 30, one entry after another. When, according to a predetermined evaluation criterion, the likelihood that the two items of voiceprint data match is at or above a predetermined level, it identifies the voice data supplied from the CPU 10 as the voice of the corresponding speaker.
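The matching step above can be sketched as follows. This is a minimal illustration, not the patent's method: the text leaves the "predetermined evaluation criterion" and the "predetermined level" unspecified, so the cosine similarity measure, the `threshold` value of 0.9, and the helper names `power_spectrum`, `identify_speaker`, and `tone` are all assumptions chosen for illustration. Synthetic sine tones stand in for real voices.

```python
import cmath
import math


def power_spectrum(samples):
    """Naive DFT power spectrum (lower half of the bins) of a real-valued frame."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def cosine_similarity(a, b):
    """Assumed evaluation criterion: cosine similarity of two spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(frame, register, threshold=0.9):
    """Compare the frame's spectrum with each registered voiceprint in order.

    Returns the first speaker ID whose similarity reaches the (assumed)
    threshold, or None when no registered voiceprint is close enough.
    """
    probe = power_spectrum(frame)
    for speaker_id, voiceprint in register.items():
        if cosine_similarity(probe, voiceprint) >= threshold:
            return speaker_id
    return None


def tone(freq_bin, n=64):
    """Hypothetical stand-in for a voice: a pure sine at one DFT bin."""
    return [math.sin(2 * math.pi * freq_bin * t / n) for t in range(n)]


# The register maps speaker IDs to enrolled voiceprints (power spectra).
register = {"ID_A": power_spectrum(tone(3)), "ID_B": power_spectrum(tone(9))}
print(identify_speaker(tone(9), register))
```

A production system would use a speaker model far richer than a raw power spectrum; the sketch only shows the control flow of "generate, compare in order, accept above a level".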

In general, differences between individuals' voiceprints are determined by differences in the volume and structure of the oral cavity and nostrils, which arise from facial shape, and by differences in the vocal cords, which arise from height and sex, so the accuracy of identification is high. For example, even if a caller's voice becomes hoarse or nasal because of a cold, the intensity and frequency of the voiceprint waveform do not change, and authentication accuracy is unaffected.

In the memory 40, voice data is organized and stored in data storage areas whose file names are the speaker data (for example, the speaker ID). For example, if Mr. A's speaker ID is ID_A, Mr. A's voice data is recorded in the data storage area with the file name ID_A.
The communication interface 50 consists of an interface circuit conforming to a predetermined communication standard, so that it can perform wireless or wired data communication with the other voice processing devices (PC_B, PC_C, and so on).

The CPU 10 controls the voice processing device PC_A as a whole.
This includes, for example, access control for the voiceprint register 30 and the memory 40, input processing for voice data received at the LINE_IN terminal, timing control for the voiceprint processing of the voiceprint authentication unit 20, and control of communication with other voice processing devices via the communication interface 50.
In particular, for each unit time (for example, 3 seconds) in which the voiceprint authentication unit 20 performs voiceprint processing, the CPU 10 stores the voice data received at the LINE_IN terminal, in order, in the data storage area of the memory 40 that corresponds to the speaker data identified by the voiceprint processing.
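The per-speaker storage described above can be pictured with a small sketch. The `SpeakerMemory` class, the `record_call` function, and the stubbed `identify` callback are hypothetical names introduced here for illustration; the patent does not describe the layout of the memory 40 beyond "a data storage area per speaker ID".

```python
from collections import defaultdict


class SpeakerMemory:
    """Hypothetical stand-in for memory 40: one storage area per speaker ID."""

    def __init__(self):
        self._areas = defaultdict(list)

    def store(self, speaker_id, unit):
        # Units are appended in arrival order, as the CPU stores them "in order".
        self._areas[speaker_id].append(unit)

    def read(self, speaker_id):
        return list(self._areas.get(speaker_id, []))


def record_call(units, identify, memory):
    """Store each authentication unit under the speaker that `identify` returns.

    Units for which no speaker is identified (None) are skipped in this sketch.
    """
    for unit in units:
        speaker_id = identify(unit)
        if speaker_id is not None:
            memory.store(speaker_id, unit)


# Usage with a stub: each unit carries its own label so identify() is trivial.
memory = SpeakerMemory()
units = [("ID_A", "a1"), ("ID_A", "a2"), ("ID_B", "b1"), (None, "x"), ("ID_A", "a3")]
record_call(units, lambda unit: unit[0], memory)
print(memory.read("ID_A"))
```

In a real device `identify` would be the voiceprint authentication step and each unit would be about 3 seconds of audio samples.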

Next, the processing performed by the voice processing device PC_A is described.
The description below covers the call device PH_A and the voice processing device PC_A in a situation where three people are holding a conference call: Mr. A using the call device PH_A, Mr. B using the call device PH_B, and Mr. C using the call device PH_C. The processing performed in the other call devices (PH_B, PH_C) and the other voice processing devices (PC_B, PC_C) is the same as described below.

Because the voices of Mr. A, Mr. B, and Mr. C during the call are mixed by the private branch exchange PBX shown in FIG. 1, the voice data sent from the REC_OUT terminal of the call device PH_A to the LINE_IN terminal of the voice processing device PC_A contains not only Mr. A's voice but the voices of all three parties mixed together.
The voice processing device PC_A processes the voice data received at the LINE_IN terminal at time intervals long enough for voiceprint authentication.

FIGS. 3(A) to 3(G) show how the voiceprint authentication unit 20 processes voice data under instruction from the CPU 10 when Mr. A, Mr. B, and Mr. C speak one after another.
In FIG. 3(A), an upward arrow marks the start of voiceprint authentication and a downward arrow marks its end. Here the interval from start to end is 3 seconds: the voiceprint authentication unit 20 performs voiceprint authentication and generates voiceprint data in units of 3 seconds.

FIG. 3(B) shows the actual timing of the utterances of Mr. A, Mr. B, and Mr. C. As the figure shows, depending on when an utterance starts, part of it (for example, its final portion) may fall short of the voiceprint authentication unit time.
For these utterances, as shown in FIG. 3(C), the voiceprint authentication unit 20 generates voiceprint data from the captured utterance data every 3 seconds (the voiceprint authentication unit) and stores it in a buffer (not shown). In the figure, for example, the portion of Mr. A's utterance data just before the end of the utterance is shorter than the 3 seconds required for voiceprint processing, so it cannot be processed.
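The division into authentication units can be sketched as below. This is a simplified illustration that assumes the utterance is already available as a list of samples aligned to the start of a unit; the function name `split_into_units` and the `sample_rate` parameter are hypothetical. Only the 3-second unit length comes from the text.

```python
def split_into_units(samples, sample_rate, unit_seconds=3):
    """Split samples into full authentication units plus a short remainder.

    Returns (units, remainder). The remainder is shorter than one unit and
    therefore cannot be voiceprint-processed (the "NA" part in FIG. 3).
    """
    unit_len = sample_rate * unit_seconds
    full_count = len(samples) // unit_len
    units = [samples[i * unit_len:(i + 1) * unit_len] for i in range(full_count)]
    remainder = samples[full_count * unit_len:]
    return units, remainder


# 8 seconds of audio at a toy rate of 10 samples/s: two full units, 2 s left over.
units, remainder = split_into_units(list(range(80)), sample_rate=10)
print(len(units), len(remainder))
```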

Next, as shown in FIG. 3(D), the voiceprint authentication unit 20 compares and collates the voiceprint data stored in the buffer, one voiceprint authentication unit (3 seconds of voiceprint data) at a time, with the pre-registered voiceprint data stored in the voiceprint register 30. That is, the voiceprint data generated every 3 seconds is collated in turn with the voiceprint data stored in the voiceprint register 30, and when the two match (or the likelihood of a match is at or above a predetermined level), the speaker is identified from the speaker data associated with the matching registered voiceprint data.
In the figure, the portion of Mr. A's utterance data just before the end of the utterance, which cannot be voiceprint-processed, is marked NA.

In FIG. 3(D), the timing of speaker identification is drawn as coinciding with the timing of storage in the buffer in FIG. 3(C). In practice, voiceprint authentication takes processing time, so speaker identification lags behind storage in the buffer.

FIG. 3(E) shows Mr. A's utterance data after voiceprint authentication has finished.
The CPU 10 stores the voice data received at the LINE_IN terminal, one voiceprint authentication unit at a time, in the data storage area of the memory 40 whose file name corresponds to the speaker data identified by the voiceprint authentication unit 20. For example, Mr. A's voice is stored in the file ID_A.
As the figure shows, the voice data corresponding to the final portion of Mr. A's utterance that cannot be voiceprint-authenticated (NA) may be presumed to be Mr. A's voice, for example by checking the continuity of the voice data, and stored in the file (ID_A) in the memory 40.
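One way to picture the handling of NA portions is the sketch below. The patent does not say how continuity is checked, so this example substitutes a deliberately crude rule as an assumption: a single NA unit that directly follows an identified unit is attributed to that speaker, and any further NA units are left unattributed. The function name `attribute_na_units` is hypothetical.

```python
def attribute_na_units(labels):
    """Attribute a trailing NA unit to the speaker of the unit just before it.

    `labels` holds one speaker ID per unit, with None for NA units. Only the
    first NA unit after an identified unit inherits the speaker; a second
    consecutive NA unit stays None, since it may belong to the next speaker.
    """
    result = []
    previous = None
    for label in labels:
        if label is None and previous is not None:
            result.append(previous)
            previous = None  # do not chain the attribution any further
        else:
            result.append(label)
            previous = label
    return result


print(attribute_na_units(["ID_A", "ID_A", None, None, "ID_B"]))
```

A real continuity check would presumably look at the audio itself (for example, the absence of a pause), which this sketch does not attempt.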

The same applies when the speaker changes and Mr. B and then Mr. C speak. That is, as shown in FIGS. 3(F) and 3(G), the CPU 10 stores the voice data one voiceprint authentication unit at a time; for example, the voice data of Mr. B and Mr. C is stored in the data storage areas of the memory 40 with the file names ID_B and ID_C, respectively.
As with Mr. A's voice data, portions that cannot be voiceprint-authenticated (NA) exist before and after each utterance.

As described above, the voice processing device according to this embodiment receives voice data in which the voices of a plurality of speakers, carried via the private branch exchange PBX, are mixed; the voiceprint authentication unit 20 performs voiceprint authentication on it one voiceprint authentication unit at a time; and the CPU 10 organizes and stores the received voice data in the memory 40 for each item of speaker data identified by the voiceprint authentication. The following effects are thereby obtained.

First, in a conference call among several people, what is said is organized by speaker in real time and stored in memory, so a user of the voice processing device who is preparing minutes does not need to identify the speakers by, for example, playing back a recording of the meeting.
Second, because the voice data is organized by speaker in memory, the minutes can be edited very easily.
Third, identifying speakers by playing back a recording of a conference call is cumbersome and can be difficult in terms of listening accuracy, whereas the voice processing device according to this embodiment identifies each person by voiceprint authentication and can therefore identify speakers with very high accuracy.
Finally, in the voice processing system according to this embodiment, each voice processing device has communication means, so the voice data for each speaker can be shared over the network.

Second Embodiment
A second embodiment of the voice processing device and voice processing system according to the present invention is described below.
The voice processing device according to this embodiment is characterized in that it recognizes a spoken command from outside and performs the processing defined in advance for the recognized command.
The system configuration of the voice processing system according to this embodiment is the same as that shown in FIG. 1. FIG. 4 shows the configuration of the voice processing device according to this embodiment, as used in the voice processing system of FIG. 1, taking the voice processing device PC_A as an example.

As shown in FIG. 4, the voice processing device PC_A according to this embodiment differs from the voice processing device PC_A of the first embodiment shown in FIG. 2 in that a voice recognition unit 60 is added.
The voice recognition unit 60 converts the voice data received at the LINE_IN terminal into character string data using voice recognition technology and supplies it to the CPU 10.

The voice recognition unit 60 comprises an acoustic model and a recognition dictionary.
The acoustic model describes the acoustic features of the basic sound units used for voice recognition, that is, the small units of human pronunciation (phonemes) such as consonants and vowels. These acoustic features are statistical acoustic feature information for phonemes, obtained from the voices of many speakers.
The recognition dictionary describes the character string data to be recognized.

Specifically, the voice recognition unit 60 performs the following processing.
(1) Calculation of acoustic features
The input voice data is subjected to spectral analysis, and acoustic feature quantities (acoustic features) are extracted.
(2) Output of the voice recognition result
The feature quantities of the input voice data are collated with the feature quantities described in the acoustic model, and the character string data closest to the input voice among the character string data to be recognized is output as the voice recognition result. In doing so, the unit searches the character string data described in the recognition dictionary for the entry whose acoustic features are closest to those of the input voice and outputs it as the voice recognition result.
To improve the recognition rate of the voice recognition unit 60, the words to be recognized may be registered in advance in the voice of a specific person, so that the registrant's voice is recognized with particularly high accuracy (speaker-dependent voice recognition).
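Step (2) above amounts to a nearest-match search over the recognition dictionary. The sketch below shows only that search, under heavy simplifying assumptions: each dictionary entry is reduced to a single fixed-length feature vector, Euclidean distance stands in for the unspecified closeness measure, and the entry "Suzuki" and all feature values are invented for illustration. Real recognizers compare sequences of phoneme-level features rather than one vector per word.

```python
import math


def recognize(features, dictionary):
    """Return the dictionary entry whose reference features are closest.

    `dictionary` maps each recognizable character string to a reference
    feature vector; closeness is Euclidean distance (an assumption).
    """
    return min(dictionary, key=lambda word: math.dist(features, dictionary[word]))


# Hypothetical two-dimensional features for two registered words.
dictionary = {"Yamada": [1.0, 0.0], "Suzuki": [0.0, 1.0]}
print(recognize([0.9, 0.2], dictionary))
```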

The CPU 10 of this embodiment holds a plurality of registered commands (character string data) together with the processing to be executed for each. When the character string data resulting from voice recognition by the voice recognition unit 60 is one of those commands, the CPU 10 executes the processing corresponding to that command.
As an example, the operation is described for the case where "Mr. B's name" is registered in the CPU 10 as a command and "transmission of Mr. B's voice data" is defined in advance as the processing corresponding to that command.

When voice data corresponding to "Mr. B's name" (for example, Yamada) is received at the LINE_IN terminal, the voice processing device PC_A first performs voice recognition with the voice recognition unit 60. As a result, the voice recognition unit 60 recognizes that the input voice data is "Mr. B's name" and sends "Mr. B's name" to the CPU 10 as character string data.
Assume that "Mr. B's name" is registered in the CPU 10 as one of the plurality of commands and that the corresponding processing is defined as "send the file ID_B (speaker data) in the memory 40 to the speaker of the input voice".

In that case, the CPU 10 instructs the voiceprint authentication unit 20 to perform voiceprint authentication on the input voice data. The voiceprint authentication unit 20 then identifies the speaker data as ID_B, and the CPU 10 transmits the voice data in the file ID_B in the memory 40 to the voice processing device PC_B via the communication interface 50.
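The command flow of this embodiment can be sketched as a lookup followed by a send. Everything named here is hypothetical: `handle_utterance`, the `commands` table mapping a recognized string to the speaker ID whose file is requested, the plain dictionary standing in for the memory 40, and the `send` callback standing in for the communication interface 50. The recognized text and the requester's ID are assumed to come from the voice recognition and voiceprint authentication steps.

```python
def handle_utterance(text, requester_id, commands, memory, send):
    """Send the requested speaker's stored voice data to the requester.

    Returns True if a registered command was executed, False otherwise.
    """
    target_id = commands.get(text)
    if target_id is None or requester_id is None:
        return False  # not a command, or the speaker could not be identified
    send(requester_id, memory.get(target_id, []))
    return True


# Usage: Mr. B (ID_B) says his own name, "Yamada", and receives the ID_B file.
commands = {"Yamada": "ID_B"}
memory = {"ID_A": ["a1"], "ID_B": ["b1", "b2"]}
sent = []
ok = handle_utterance("Yamada", "ID_B", commands, memory,
                      lambda dest, data: sent.append((dest, data)))
print(ok, sent)
```

Mapping the requester's speaker ID to a network address (for example, ID_B to PC_B) is left to the `send` callback in this sketch.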

As described above, the voice processing system according to this embodiment makes it possible to obtain, over the network and by a simple spoken command, the per-speaker voice data recorded in a conference call among several people.
Voice data stored in a remote voice processing device can therefore be obtained without complicated operation of a personal computer or the like, which is highly convenient.

Third Embodiment
A voice processing device and a voice processing system according to a third embodiment are described below.
In the voice processing system according to the second embodiment described above, a plurality of voice processing devices are connected to a network, and voice data stored in a remote voice processing device can be obtained over the network. However, allowing anyone connected to the network to access all of the voice data may pose a security problem.
In the voice processing system according to this embodiment, therefore, each voice processing device permits and executes a spoken command received via its call device only when the voice is that of a speaker registered in advance.

本実施形態に係る音声処理装置は、図４に示した第２の実施形態に係る音声処理装置と構成は同一であるが、ＣＰＵ１０で行われる処理が異なる。
以下、第２の実施形態の説明と同一の例、すなわち、コマンドとして「Ｂ氏の名前」がＣＰＵ１０に登録されており、そのコマンドに対応する処理内容として、「Ｂ氏の音声データの送信処理」が予め規定されている場合の本実施形態に係る音声処理装置の動作について述べる。 The voice processing device according to the present embodiment has the same configuration as the voice processing device according to the second embodiment shown in FIG. 4, but the processing performed by the CPU 10 is different.
The operation of the voice processing device according to the present embodiment is described below using the same example as in the second embodiment, that is, the case where "Mr. B's name" is registered in the CPU 10 as a command and "transmission of Mr. B's voice data" is defined in advance as the processing corresponding to that command.

その場合、ＣＰＵ１０は、声紋認証部２０に対し、入力した音声データに基づいて声紋認証を行うように指示し、その結果、声紋認証部２０は、発話者データがＩＤ＿Ｂであることを特定する。
ＣＰＵ１０は、予めコマンド処理を許可する発話者データのリストを備えており、声紋認証部２０により特定された発話者データがそのリストに含まれているか確認する。
その結果、声紋認証部２０により特定された発話者データがそのリストに含まれている場合には、第２の実施形態に係る音声処理装置同様、ＣＰＵ１０は、メモリ４０上のＩＤ＿Ｂのファイルにある音声データを、通信インタフェース５０を介して音声処理装置ＰＣ＿Ｂに送信する。
逆に、声紋認証部２０により特定された発話者データがそのリストに含まれていない場合には、ＣＰＵ１０は、何も処理を行わない。 In that case, the CPU 10 instructs the voiceprint authentication unit 20 to perform voiceprint authentication based on the input voice data, and as a result, the voiceprint authentication unit 20 specifies that the speaker data is ID_B.
The CPU 10 holds, in advance, a list of speaker data for which command processing is permitted, and checks whether the speaker data specified by the voiceprint authentication unit 20 is included in that list.
If the speaker data specified by the voiceprint authentication unit 20 is included in the list, the CPU 10 transmits the voice data in the ID_B file on the memory 40 to the voice processing device PC_B via the communication interface 50, in the same way as the voice processing device according to the second embodiment.
Conversely, if the speaker data specified by the voiceprint authentication unit 20 is not included in the list, the CPU 10 performs no processing.
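The permission check described above can be sketched as follows. This is only an illustrative sketch and not the implementation disclosed in the patent: the function name `handle_voice_command`, the dictionaries standing in for the command table, the permitted-speaker list, and the memory 40, and the `send` callback standing in for the communication interface 50 are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set


def handle_voice_command(
    command_text: str,                  # character string from voice recognition
    requester_id: str,                  # speaker data from voiceprint authentication
    commands: Dict[str, str],           # hypothetical table: command -> speaker data to send
    permitted: Set[str],                # hypothetical list of permitted speaker data
    memory: Dict[str, List[bytes]],     # stand-in for the per-speaker files on memory 40
    send: Callable[[str, List[bytes]], None],  # stand-in for communication interface 50
) -> bool:
    """Return True if the command was processed, False if nothing was done."""
    target_id = commands.get(command_text)
    if target_id is None:
        return False  # the recognized string is not a registered command
    if requester_id not in permitted:
        return False  # speaker is not on the list: perform no processing
    # Send the stored voice data to the device that corresponds to the requester.
    send(requester_id, memory.get(target_id, []))
    return True


# Example: Mr. B (ID_B) is permitted, an unregistered speaker (ID_X) is not.
sent: List[tuple] = []
memory = {"ID_B": [b"segment-1", b"segment-2"]}
commands = {"Mr. B": "ID_B"}
permitted = {"ID_B"}

handle_voice_command("Mr. B", "ID_B", commands, permitted, memory,
                     lambda dest, data: sent.append((dest, data)))
handle_voice_command("Mr. B", "ID_X", commands, permitted, memory,
                     lambda dest, data: sent.append((dest, data)))
```

In this sketch only the first call results in a transmission; the second call is silently ignored, which mirrors the behavior in which the CPU 10 performs no processing for a speaker who is not on the list.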

以上説明したように、本実施形態に係る音声処理システムによれば、複数人による電話会議などにおいて記録された発話者毎の音声データを、音声によるコマンドによりネットワークを介して取得する際に、声紋認証により特定された発話者が予め登録されている場合に限り、コマンドに対応する処理を行い、登録されていない場合には、コマンドに対応する処理を行わない。
したがって、ネットワーク上で音声データを共有する際にも、安全性を高めることができる。たとえば、電話会議における各発話者データに対するアクセスを、その会議の議題に関連のある人々に限定することが可能である。 As described above, with the voice processing system according to the present embodiment, when the voice data recorded for each speaker in a multi-party conference call or the like is acquired via the network by a voice command, the processing corresponding to the command is performed only if the speaker specified by voiceprint authentication is registered in advance; if the speaker is not registered, the processing is not performed.
Therefore, security can be improved even when voice data is shared over the network. For example, access to each speaker's data from a conference call can be limited to people relevant to the agenda of that conference.

なお、実施形態の内容は上述した内容に拘泥せず、様々な改変が可能である。
たとえば、コマンドの重要度（安全レベル）に応じて、許可する発話者データのリストを可変とするように構成することを容易に行うことができることは言うまでもない。 The embodiments are not limited to the contents described above, and various modifications are possible.
For example, it goes without saying that the system can easily be configured so that the list of permitted speaker data varies according to the importance (security level) of the command.
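One way such a modification might look is sketched below. The source only states that the permitted list can vary with the importance of the command; the integer security levels, the table names, and the function `is_command_allowed` are hypothetical and introduced purely for illustration.

```python
from typing import Dict, Set


def is_command_allowed(
    command: str,
    speaker_id: str,
    command_levels: Dict[str, int],        # hypothetical: command -> security level
    permitted_by_level: Dict[int, Set[str]],  # hypothetical: level -> permitted speaker data
) -> bool:
    """Check the speaker against the permitted list selected by the command's level."""
    level = command_levels.get(command)
    if level is None:
        return False  # unknown command: nothing is permitted
    return speaker_id in permitted_by_level.get(level, set())


# Assumed example tables: level 1 is low importance, level 2 is high importance.
command_levels = {"Mr. B": 1, "Mr. C": 2}
permitted_by_level = {
    1: {"ID_A", "ID_B", "ID_C"},  # wider list for less important commands
    2: {"ID_C"},                  # narrower list for more important commands
}

low = is_command_allowed("Mr. B", "ID_A", command_levels, permitted_by_level)
high = is_command_allowed("Mr. C", "ID_A", command_levels, permitted_by_level)
```

Here the same speaker ID_A is accepted for the low-level command but rejected for the high-level one, because each level selects a different permitted list.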

本発明に係る音声処理システムの一実施形態の構成である。 FIG. 1 shows the configuration of an embodiment of a voice processing system according to the present invention.
本発明に係る音声処理装置の一実施形態の構成である。 FIG. 2 shows the configuration of an embodiment of a voice processing device according to the present invention.
実施形態に係る音声処理装置の動作を説明するための図である。 FIG. 3 is a diagram for explaining the operation of the voice processing device according to the embodiment.
本発明に係る音声処理装置の一実施形態の構成である。 FIG. 4 shows the configuration of an embodiment of a voice processing device according to the present invention.

Explanation of symbols

１０…ＣＰＵ、２０…声紋認証部、３０…声紋レジスタ、４０…メモリ、５０…通信インタフェース、６０…音声認識部、ＰＢＸ…構内交換機、ＰＨ＿Ａ，ＰＨ＿Ｂ，ＰＨ＿Ｃ，ＰＨ＿Ｄ，ＰＨ＿Ｅ…通話装置、ＰＣ＿Ａ，ＰＣ＿Ｂ，ＰＣ＿Ｃ，ＰＣ＿Ｄ，ＰＣ＿Ｅ…音声処理装置。
10 ... CPU, 20 ... voiceprint authentication unit, 30 ... voiceprint register, 40 ... memory, 50 ... communication interface, 60 ... voice recognition unit, PBX ... private branch exchange, PH_A, PH_B, PH_C, PH_D, PH_E ... call devices, PC_A, PC_B, PC_C, PC_D, PC_E ... voice processing devices.

Claims

A voice processing device that classifies input voice by speaker, comprising:
first storage means for storing one or more pieces of voiceprint data in association with speaker data corresponding to each speaker;
second storage means;
voiceprint authentication means for extracting voiceprint data from the input voice at predetermined time intervals, comparing the extracted voiceprint data with the voiceprint data stored in the first storage means, and specifying speaker data; and
control means for storing the input voice for each predetermined time in the second storage means, separately for each piece of speaker data specified by the voiceprint authentication means.

A voice processing system that classifies, by speaker, voice exchanged among a plurality of call devices connected to a line network, wherein
each call device has, respectively:
first storage means for storing one or more pieces of voiceprint data in association with speaker data corresponding to each speaker;
second storage means;
voiceprint authentication means for extracting voiceprint data from the voice at predetermined time intervals, comparing the extracted voiceprint data with the voiceprint data stored in the first storage means, and specifying speaker data; and
control means for storing the voice for each predetermined time in the second storage means, separately for each piece of speaker data specified by the voiceprint authentication means,
and wherein each voice processing device is configured to be able to transmit and receive data to and from the others by communication.

The voice processing system according to claim 2, wherein
each call device includes voice recognition means for recognizing the voice exchanged among the plurality of call devices and converting it into a character string, and
when the character string obtained by the voice recognition means converting the voice is a predetermined command, the control means executes the processing specified in advance for the predetermined command with respect to the voice processing device corresponding to the speaker data specified by the voiceprint authentication means based on that voice.

The voice processing system according to claim 3, wherein
the predetermined command is any of the speaker data stored in the second storage means, and
the predetermined processing transmits the voice data corresponding to that speaker data, stored in the second storage means, to the voice processing device corresponding to the speaker of the voice specified by the voiceprint authentication means.

The voice processing system according to claim 3, wherein
the permitted speakers allowed to execute the predetermined command are limited in advance, and
the control means executes the processing specified in advance for the predetermined command when the character string obtained by the voice recognition means converting the voice is the predetermined command and the speaker data specified by the voiceprint authentication means based on that voice is speaker data corresponding to one of the permitted speakers.