JP2000206988A

JP2000206988A - Voice processing device

Info

Publication number: JP2000206988A
Application number: JP11005291A
Authority: JP
Inventors: Hideyuki Takahashi; 秀享高橋
Original assignee: Olympus Optical Co Ltd
Current assignee: Olympus Corp
Priority date: 1999-01-12
Filing date: 1999-01-12
Publication date: 2000-07-28

Abstract

PROBLEM TO BE SOLVED: To provide a voice processing device having a speaker recognition function that reliably identifies a non-registered speaker who appears in voice data. SOLUTION: Feature parameters of voice for registration are stored in a voice model recording section 13B as registered voice models for each speaker. A feature parameter extracting section 14 extracts feature parameters from the voice data recorded in a voice data recording section 13A. Then, the degree of similarity between the extracted feature parameters of the voice data and each registered voice model recorded in the section 13B is obtained by a degree of similarity computing section 16. Based on the computational result from the section 16, a speaker specifying section 17 discriminates whether the speaker is a registered speaker corresponding to a registered voice model or a non-registered speaker.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a voice processing device having a speaker recognition function.

[0002]

[Description of the Related Art] Conventionally, a digital voice recording/playback apparatus is known as a voice processing device. It converts a voice signal input from a microphone or the like into a digital signal and records it on a recording medium such as a semiconductor memory; at playback, it reads the digital signal from the recording medium, converts it into an analog signal, and outputs it.

[0003] For such voice processing devices, it is desirable to further improve operability and searchability when playing back recorded voice data, and various proposals have been made. For example, as disclosed in Japanese Patent Application Laid-Open No. 10-3300, a device is known in which index information designating the range to be played back is recorded in a semiconductor memory and, when an operation such as fast forward is performed, only the designated range is played back based on the recorded index.

[0004] With such a configuration, however, the portion to be played back must be designated from within the recorded range, so playback operability is markedly poor.

[0005] Meanwhile, with the recent development of speech processing technology, speech recognition and speaker recognition technologies have been put into practical use. As an application of these recognition technologies, a device has been conceived, for example as disclosed in Japanese Patent Application Laid-Open No. 8-153118, that uses a speaker recognition function to search for the voice data of a specific speaker and play back only the voice data of that speaker, thereby improving operability and searchability during playback.

[0006]

[Problems to Be Solved by the Invention] However, a device having such a speaker recognition function selects and plays back a designated speaker from among speakers registered in advance. Consequently, if, for example, unregistered speakers appear during a conversation and are recorded together with registered speakers, then when a specific speaker is designated for playback, the unregistered speaker whose characteristics are closest to those of the designated speaker may also be selected and played back. As a result, recorded content from someone other than the designated speaker is played back as well.

[0007] The present invention has been made in view of the above circumstances, and its object is to provide a voice processing device having a speaker recognition function capable of reliably identifying unregistered speakers appearing in voice data.

[0008]

[Means for Solving the Problems] The invention according to claim 1 comprises: voice input means for inputting voice; registered voice model recording means for recording, for each speaker, feature parameters of a voice for registration as a registered voice model; feature parameter extraction means for extracting feature parameters from the voice signal input from the voice input means; similarity calculation means for obtaining the similarity between the feature parameters, extracted by the feature parameter extraction means, of the voice signal input from the voice input means and the registered voice models recorded in the registered voice model recording means; and speaker determination means for determining, from the calculation result of the similarity calculation means, a registered speaker corresponding to a registered voice model or an unregistered speaker.

[0009] The invention according to claim 2 comprises: voice input means for inputting voice; registered voice model recording means for recording, for each speaker, feature parameters of a voice for registration as a registered voice model; voice data recording means for recording the voice data input from the voice input means; feature parameter extraction means for extracting feature parameters from the voice data recorded in the voice data recording means; similarity calculation means for obtaining the similarity between the feature parameters, extracted by the feature parameter extraction means, of the voice data recorded in the voice data recording means and the registered voice models recorded in the registered voice model recording means; and speaker determination means for determining, from the calculation result of the similarity calculation means, a registered speaker corresponding to a registered voice model or an unregistered speaker.

[0010] The invention according to claim 3 is characterized in that, in the invention according to claim 1 or 2, the speaker determination means obtains the maximum of the similarities to the registered voice models of the respective speakers computed by the similarity calculation means. If the maximum is equal to or greater than a predetermined threshold, the speaker of the current input voice is determined to be the registered speaker corresponding to that registered voice model; if the maximum is less than the predetermined threshold, the speaker of the current input voice is determined to be an unregistered speaker.
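The decision rule of claim 3 can be illustrated with a minimal Python sketch. The function name `determine_speaker`, the dictionary of per-speaker similarities, and the use of `None` to mark an unregistered speaker are assumptions made for illustration; the patent does not specify a data representation or a threshold value.

```python
from typing import Dict, Optional

# Hypothetical marker for "unregistered speaker" (not defined in the patent).
UNREGISTERED = None


def determine_speaker(similarities: Dict[str, float],
                      threshold: float) -> Optional[str]:
    """Return the registered speaker with the highest similarity if that
    similarity reaches the threshold; otherwise return UNREGISTERED."""
    if not similarities:
        # No registered voice models at all: nobody can be matched.
        return UNREGISTERED
    best_speaker = max(similarities, key=similarities.get)
    if similarities[best_speaker] >= threshold:
        return best_speaker
    return UNREGISTERED
```

For example, `determine_speaker({"A": 0.9, "B": 0.4}, 0.7)` returns `"A"`, while `determine_speaker({"A": 0.5, "B": 0.4}, 0.7)` returns `None` because even the closest registered model falls below the threshold.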

[0011] As a result, according to the present invention, even when unregistered speakers are mixed with registered speakers in voice data, the unregistered speakers appearing in the voice data can be identified separately from the registered speakers.

[0012]

[Embodiments of the Invention] An embodiment of the present invention will be described below with reference to the drawings.

[0013] FIG. 1 shows the schematic configuration of a digital voice recording/playback apparatus to which the present invention is applied. In the figure, reference numeral 1 denotes a microphone serving as voice input means. The microphone 1 is connected to an encoding/decoding unit 7 via a preamplifier 2 and, within a CODEC 3, a low-pass filter 3A and an A/D converter 3B. The encoding/decoding unit 7 encodes and decodes voice by, for example, the CELP (Code Excited Linear Predictive Coding) method and is implemented by a DSP (Digital Signal Processor). A speaker 5 is also connected to the encoding/decoding unit 7 via a D/A converter 3C and a low-pass filter 3D of the CODEC 3 and a power amplifier 6.

[0014] A system control unit 10 is connected to the CODEC 3. The system control unit 10 controls the operation of each unit and is implemented by, for example, an 8-bit CPU.

[0015] The encoding/decoding unit 7 and a flash memory 13 are connected to the system control unit 10 via a memory control unit 8 implemented by, for example, a 16-bit CPU. The CODEC 3, a display unit 9, an operation input unit 11, and a power supply 12 are also connected to the system control unit 10.

[0016] The memory control unit 8 controls the input and output of signals between the flash memory 13 and the encoding/decoding unit 7. The flash memory 13 has three areas: a voice data recording unit 13A, a voice model recording unit 13B, and a speaker code recording unit 13C. The encoded data output from the memory control unit 8 is recorded in the voice data recording unit 13A together with header information. The flash memory 13 may be built into the digital voice recording/playback apparatus or may be detachable. The operation input unit 11 consists of operation buttons or operation switches for recording, playback, speaker registration, and the like.

[0017] A feature parameter extraction unit 14 is connected to the encoding/decoding unit 7. The feature parameter extraction unit 14 extracts from the voice data feature parameters in a representation suited to speaker recognition, such as pitch and cepstrum.
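As one illustration of a feature parameter, the following Python sketch estimates the pitch of a single frame from its autocorrelation peak. The patent names pitch and cepstrum only as examples and does not say how they are computed, so the autocorrelation method, the function name `estimate_pitch_hz`, the 50 to 400 Hz search range, and the 8 kHz sampling rate are all assumptions.

```python
import math
from typing import List


def estimate_pitch_hz(frame: List[float], sample_rate: int,
                      min_hz: float = 50.0, max_hz: float = 400.0) -> float:
    """Estimate pitch as sample_rate / lag, where lag maximizes the
    (unnormalized) autocorrelation within the allowed pitch range."""
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)

    def autocorr(lag: int) -> float:
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(min_lag, max_lag + 1), key=autocorr)
    return sample_rate / best_lag


# Usage: a synthetic 200 Hz sine wave sampled at 8 kHz (assumed values).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
pitch = estimate_pitch_hz(frame, sample_rate)
```

The frame must be longer than `sample_rate / max_hz` samples, otherwise the lag range is empty and `max` raises `ValueError`.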

[0018] A voice model creation unit 15 is connected to the feature parameter extraction unit 14. The voice model creation unit 15 receives time-series data of the feature parameters from the feature parameter extraction unit 14, creates a standard pattern of the feature parameters as a registered voice model, and has this voice model recorded in the voice model recording unit 13B of the flash memory 13 via the memory control unit 8.

[0019] A similarity calculation unit 16 is also connected to the feature parameter extraction unit 14. The voice model recording unit 13B of the flash memory 13 is connected to the similarity calculation unit 16. The similarity calculation unit 16 calculates the similarity between the feature parameters and the voice model of each speaker recorded in the voice model recording unit 13B of the flash memory 13.
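The patent does not define the similarity measure. As a sketch under that caveat, the following Python code assumes each voice model and each set of feature parameters is a fixed-length vector and scores them with `1 / (1 + Euclidean distance)`, so that identical vectors score 1.0 and the score falls toward 0 as they diverge. The names `similarity` and `score_all` are hypothetical.

```python
import math
from typing import Dict, List


def similarity(features: List[float], model: List[float]) -> float:
    """Assumed similarity measure: 1 / (1 + Euclidean distance)."""
    if len(features) != len(model):
        raise ValueError("feature vector and model must have the same length")
    distance = math.sqrt(sum((f - m) ** 2 for f, m in zip(features, model)))
    return 1.0 / (1.0 + distance)


def score_all(features: List[float],
              models: Dict[str, List[float]]) -> Dict[str, float]:
    """Similarity of one feature vector to every registered voice model."""
    return {speaker: similarity(features, model)
            for speaker, model in models.items()}
```

Any measure where a larger value means a closer match would fit the description equally well; only the comparison against a threshold depends on its scale.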

[0020] A speaker identification unit 17 serving as speaker determination means is connected to the similarity calculation unit 16. The speaker identification unit 17 identifies a registered speaker or an unregistered speaker from the calculation result of the similarity calculation unit 16. It has the speaker code corresponding to the registered speaker, or an unregistered-speaker code representing an unregistered speaker, recorded in the speaker code recording unit 13C of the flash memory 13 via the memory control unit 8.

[0021] Next, the operation of the embodiment configured as described above will be described.

[0022] First, when the operator operates a recording button (not shown) of the operation input unit 11, the analog voice signal input from the microphone 1 is amplified by the preamplifier 2, has its unnecessary high-frequency components removed by the low-pass filter 3A, and is converted into a digital signal by the A/D converter 3B. The encoding/decoding unit 7 then encodes the digital signal from the A/D converter 3B, and the encoded data is recorded in a predetermined area of the voice data recording unit 13A of the flash memory 13 via the memory control unit 8. Header information is recorded in the voice data recording unit 13A together with the encoded data.

[0023] Next, when the operator operates a playback button (not shown) of the operation input unit 11, the encoded data is read from the voice data recording unit 13A of the flash memory 13 via the memory control unit 8, and the encoding/decoding unit 7 generates decoded data. The decoded data is converted into an analog signal by the D/A converter 3C, has the unnecessary high-frequency components of the analog signal removed by the low-pass filter 3D, is amplified by the power amplifier 6, and is output from the speaker 5 as a playback signal.

[0024] The above are the recording and playback operations, which are the basic operations of a digital voice recording/playback apparatus. Next, the speaker registration operation used for the speaker recognition processing that characterizes the present invention will be described. Speaker registration is performed, prior to the recording operation described above, for each speaker who is expected to be recorded.

[0025] First, speaker registration will be described.

[0026] In this case, when the operator operates a speaker registration button (not shown) of the operation input unit 11, the flowchart shown in FIG. 2 is executed.

[0027] In step 201, voice data for speaker registration is input to the feature parameter extraction unit 14. This voice data may be input by the speaker through the microphone 1, or it may be recorded in advance and the recorded data then input.

[0028] Next, in step 202, the feature parameter extraction unit 14 extracts from the input registration voice data feature parameters in a representation suited to speaker recognition, such as pitch and cepstrum.

[0029] The process then proceeds to step 203, where the time-series data of the feature parameters is input to the voice model creation unit 15 and a standard pattern of the feature parameters is created as a voice model. In step 204, this voice model is recorded in the voice model recording unit 13B of the flash memory 13 via the memory control unit 8.
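Steps 203 and 204 can be sketched in Python as follows. The patent does not say how the "standard pattern" is derived from the time series, so taking the per-dimension mean of the feature vectors is an assumption, as are the names `create_voice_model`, `register_speaker`, and the in-memory dictionary `voice_models` that stands in for the voice model recording unit 13B.

```python
from typing import Dict, List


def create_voice_model(feature_series: List[List[float]]) -> List[float]:
    """Step 203 (assumed form): average the time series of feature vectors
    into one standard pattern."""
    if not feature_series:
        raise ValueError("no feature vectors to build a model from")
    dims = len(feature_series[0])
    count = len(feature_series)
    return [sum(vec[d] for vec in feature_series) / count for d in range(dims)]


# Stand-in for the voice model recording unit 13B (hypothetical structure).
voice_models: Dict[str, List[float]] = {}


def register_speaker(speaker_code: str,
                     feature_series: List[List[float]]) -> None:
    """Step 204: store the created model under the speaker's code."""
    voice_models[speaker_code] = create_voice_model(feature_series)


# Usage: two feature vectors extracted from a registration utterance.
register_speaker("speaker_A", [[1.0, 2.0], [3.0, 4.0]])
```

After this call, `voice_models["speaker_A"]` holds the mean vector `[2.0, 3.0]`.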

[0030] Next, the processing for recognizing the speaker of the voice data recorded in the voice data recording unit 13A of the flash memory 13, using the voice models of the speakers stored in the voice model recording unit 13B in this way, will be described.

[0031] In this case, when speaker recognition processing is instructed, the flowchart shown in FIG. 3 is executed.

[0032] First, in step 301, the frame counter variable f for the voice data is set to 0. Next, in step 302, f is incremented by 1, and in step 303 it is determined whether the frame corresponding to f contains sound. This determination may use, for example, the method disclosed in Japanese Patent Application Laid-Open No. 9-281987, or, as disclosed in Japanese Patent Application Laid-Open No. 9-218694, sound/silence information recorded in advance in the header corresponding to each frame of voice data.

[0033] If it is determined in step 303 that no voice data is recorded in the current frame, that is, that the frame is silent, no processing is performed for that frame. The process returns to step 302, f is incremented by 1, and the operations from step 303 onward are performed for the next frame.

[0034] If, on the other hand, it is determined in step 303 that voice data is recorded in the current frame, then in step 304 the address information of the voice data recording unit 13A at which the voice data of the current frame is recorded is recorded in the speaker code recording unit 13C of the flash memory 13.

[0035] Next, in step 305, the voice data of the current frame is input to the feature parameter extraction unit 14 and predetermined feature parameters are extracted. In step 306, the similarity calculation unit 16 calculates the similarity between the feature parameters extracted in step 305 and the voice model of each speaker recorded in the voice model recording unit 13B of the flash memory 13. In step 307, the speaker identification unit 17 finds the maximum of the calculated similarities, and in step 308 it is determined whether this maximum similarity is equal to or greater than a predetermined threshold.

[0036] If the maximum is determined to be equal to or greater than the predetermined threshold, then in step 309 the speaker of the current frame's voice data is identified as the speaker corresponding to the voice model that has the maximum similarity among the voice models recorded in the voice model recording unit 13B. In step 310, the speaker code corresponding to that speaker is recorded in the speaker code recording unit 13C of the flash memory 13 in association with the address information of the current frame's voice data recorded in step 304. In step 311, it is determined whether the end of the voice file has been reached. If not, the process returns to step 302, f is incremented by 1, and the operations from step 303 onward are performed for the next frame; if the file has ended, the processing ends.

[0037] If, on the other hand, the maximum is determined in step 308 to be less than the predetermined threshold, the process proceeds to step 312, where the speaker of the current frame's voice data is identified as not being a registered speaker. In step 310, the unregistered-speaker code is recorded in the speaker code recording unit 13C of the flash memory 13 in association with the address information of the current frame's voice data recorded in step 304. In step 311, it is determined whether the end of the voice file has been reached. If not, the process returns to step 302, f is incremented by 1, and the operations from step 303 onward are performed for the next frame; if the file has ended, the processing ends.
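The FIG. 3 loop described in steps 301 to 312 can be sketched in Python as below. This is an illustration under several assumptions: each frame already carries its sound/silence flag and a precomputed feature vector (in the patent, step 305 extracts the features from the recorded voice data), the similarity is `1 / (1 + Euclidean distance)`, and the names `Frame`, `recognize_speakers`, and `UNREGISTERED_CODE` are hypothetical.

```python
import math
from typing import Dict, List, NamedTuple, Tuple


class Frame(NamedTuple):
    address: int           # address of the frame in voice data recording unit 13A
    has_sound: bool        # sound/silence decision used in step 303
    features: List[float]  # feature parameters (precomputed for this sketch)


UNREGISTERED_CODE = "UNREGISTERED"  # hypothetical unregistered-speaker code


def _similarity(features: List[float], model: List[float]) -> float:
    # Assumed measure; the patent does not define the similarity.
    distance = math.sqrt(sum((f - m) ** 2 for f, m in zip(features, model)))
    return 1.0 / (1.0 + distance)


def recognize_speakers(frames: List[Frame],
                       models: Dict[str, List[float]],
                       threshold: float) -> List[Tuple[int, str]]:
    """Return (address, speaker code) pairs, standing in for the contents
    of the speaker code recording unit 13C."""
    speaker_codes: List[Tuple[int, str]] = []
    for frame in frames:                       # steps 301, 302, 311
        if not frame.has_sound:                # step 303: skip silent frames
            continue
        scores = {code: _similarity(frame.features, model)
                  for code, model in models.items()}     # steps 305, 306
        best = max(scores, key=scores.get, default=None)  # step 307
        if best is not None and scores[best] >= threshold:  # step 308
            code = best                        # step 309: registered speaker
        else:
            code = UNREGISTERED_CODE           # step 312: unregistered speaker
        speaker_codes.append((frame.address, code))  # steps 304, 310
    return speaker_codes


# Usage with two registered models and four frames (illustrative values).
models = {"A": [0.0, 0.0], "B": [10.0, 10.0]}
frames = [
    Frame(0, True, [0.1, 0.0]),    # close to A
    Frame(1, False, []),           # silent, skipped
    Frame(2, True, [5.0, 5.0]),    # far from both models
    Frame(3, True, [10.0, 9.9]),   # close to B
]
result = recognize_speakers(frames, models, threshold=0.5)
```

With these values, the frame at address 2 is far from both models, so it is labeled with the unregistered-speaker code instead of being assigned to the nearest registered speaker.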

[0038] Accordingly, even when unregistered speakers are mixed with registered speakers in the voice data, the unregistered speakers appearing in the voice data can be identified separately from the registered speakers. In a conventional device, when unregistered speakers are recorded together with registered speakers and a specific speaker is designated for playback, the unregistered speaker whose characteristics are closest to those of the designated speaker is also selected, so that content recorded by someone other than the designated speaker is played back. In contrast, the present embodiment can accurately play back only the designated speaker when a specific speaker is designated.
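Playback of only the designated speaker can be pictured as a lookup over the recorded speaker codes. The following Python sketch assumes the speaker code recording unit 13C can be read as a list of (address, speaker code) pairs; that representation and the name `addresses_for_speaker` are assumptions for illustration.

```python
from typing import List, Tuple


def addresses_for_speaker(speaker_codes: List[Tuple[int, str]],
                          wanted: str) -> List[int]:
    """Return, in recorded order, the frame addresses labeled with the
    designated speaker's code."""
    return [address for address, code in speaker_codes if code == wanted]


# Usage: frames labeled "UNREGISTERED" are never returned for speaker "A".
records = [(0, "A"), (2, "UNREGISTERED"), (3, "B"), (4, "A")]
to_play = addresses_for_speaker(records, "A")
```

Here `to_play` contains only addresses 0 and 4, so the frame spoken by the unregistered speaker is excluded from playback.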

[0039] In addition, the speaker is recognized for voice data that has already been recorded in the voice data recording unit 13A of the flash memory 13. Since the amount of computation required for speaker recognition processing is generally enormous, this arrangement is highly advantageous for small, inexpensive devices.

[0040] Furthermore, because the speaker is recognized for the voice data recorded in the voice data recording unit 13A, there is freedom in when the speaker recognition processing is executed. For example, it can be run automatically immediately after recording ends, or in response to a speaker recognition operation performed by the operator on the operation input unit 11.

[0041] Furthermore, speaker recognition processing is performed only on sections that contain sound. Silent sections carry no speaker individuality, so similarity calculation is unnecessary for them; because only sections containing sound are subjected to speaker identification processing, accurate speaker recognition can be achieved.

[0042] In the embodiment described above, the speaker is recognized for the voice data recorded in the voice data recording unit 13A of the flash memory 13, but the speaker can of course also be recognized for voice data input directly from the microphone 1.

[0043]

[Effects of the Invention] As described above, according to the present invention, even when unregistered speakers are mixed with registered speakers in voice data, the unregistered speakers appearing in the voice data can be identified separately from the registered speakers. Therefore, when a specific speaker is designated, only the designated speaker can be played back accurately.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing the schematic configuration of an embodiment of the present invention.

FIG. 2 is a flowchart for explaining the operation of the embodiment.

FIG. 3 is a flowchart for explaining the operation of the embodiment.

[Explanation of Symbols]

1: microphone
2: preamplifier
3: CODEC
3A, 3D: low-pass filters
3B: A/D converter
3C: D/A converter
5: speaker
6: power amplifier
7: encoding/decoding unit
8: memory control unit
9: display unit
10: system control unit
11: operation input unit
12: power supply
13: flash memory
13A: voice data recording unit
13B: voice model recording unit
13C: speaker code recording unit
14: feature parameter extraction unit
15: voice model creation unit
16: similarity calculation unit
17: speaker identification unit

Claims

[Claims]

1. A voice processing device comprising: voice input means for inputting voice; registered voice model recording means for recording, for each speaker, feature parameters of a voice for registration as a registered voice model; feature parameter extraction means for extracting feature parameters from the voice signal input from the voice input means; similarity calculation means for obtaining the similarity between the feature parameters, extracted by the feature parameter extraction means, of the voice signal input from the voice input means and the registered voice models recorded in the registered voice model recording means; and speaker determination means for determining, from the calculation result of the similarity calculation means, a registered speaker corresponding to a registered voice model or an unregistered speaker.

2. A voice processing device comprising: voice input means for inputting voice; registered voice model recording means for recording, for each speaker, feature parameters of a voice for registration as a registered voice model; voice data recording means for recording the voice data input from the voice input means; feature parameter extraction means for extracting feature parameters from the voice data recorded in the voice data recording means; similarity calculation means for obtaining the similarity between the feature parameters, extracted by the feature parameter extraction means, of the voice data recorded in the voice data recording means and the registered voice models recorded in the registered voice model recording means; and speaker determination means for determining, from the calculation result of the similarity calculation means, a registered speaker corresponding to a registered voice model or an unregistered speaker.

3. The voice processing device according to claim 1 or 2, wherein the speaker determination means obtains the maximum of the similarities to the registered voice models of the respective speakers computed by the similarity calculation means, determines, if the maximum is equal to or greater than a predetermined threshold, that the speaker of the current input voice is the registered speaker corresponding to that registered voice model, and determines, if the maximum is less than the predetermined threshold, that the speaker of the current input voice is an unregistered speaker.