JP2005300958A

JP2005300958A - Talker check system

Info

Publication number: JP2005300958A
Application number: JP2004117669A
Authority: JP
Inventors: Yuzo Maruta; 裕三丸田; Jun Ishii; 純石井
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2004-04-13
Filing date: 2004-04-13
Publication date: 2005-10-27

Abstract

PROBLEM TO BE SOLVED: To suppress the effects of noise when checking talkers.

SOLUTION: The talker check system of this invention uses a talker check device 1 that picks up part of the input voice as a check section and compares it with the registered voice to determine whether the talker of the registered voice and the talker of the input voice are the same. The talker check device 1 has a voice power calculating means 107 that obtains the voice power S of the picked-up check section, a noise section voice power calculating means 108 that computes the voice power N of a part of the input voice preceding the check section, and a talker discriminating means 109, 110 that compares the check section with the registered voice to determine whether the talkers are the same, and that decides the talkers are not the same when the value SNR obtained by dividing S by N falls below a predetermined threshold TH.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a speaker verification device that determines whether the person who uttered a registered voice and the person who uttered a verification voice are the same, and in particular to a technique for improving the accuracy of this identity determination by taking into account the effect of noise on the verification process.

In mobile communication terminals such as mobile phones, speaker verification based on speech processing is attracting attention as a convenient way to lock and unlock a device without key operations. For automobiles, too, speaker verification that locks and unlocks by voice instead of by key is being studied as a way to add value to the product. Home appliances need a user interface that makes it easy to lock and unlock operation so that children cannot operate them by mistake, and speaker verification is regarded as a promising solution.

All of these speaker verification functions must be designed on the assumption that they will be used in noisy environments. A mobile phone is used in extremely noisy places such as train stations and streets, and in an automobile road noise must be taken into account. In a home appliance, the speaker verification function must be configured so that it does not react to household noise such as the sound of a television, air conditioner, or vacuum cleaner.

As a conventional speech recognition or speaker verification technique that takes the influence of noise into account, a technique is known that obtains the S/N ratio between noise and a test speech signal on which the noise is superimposed, and selects a preferable frequency characteristic from among a plurality of frequency characteristics based on this S/N ratio (Patent Document 1).

Patent Document 1: Japanese Patent Application Laid-Open No. 5-197387, "Speech Recognition Method"

When a speaker verification device is left in a noisy environment for a very long time, the registered data may happen to match the noise data even though the registered person has not spoken, and a false match may be output. More specifically, the speaker verification function calculates a distance value between a registered voice prepared in advance and the input voice uttered at verification time, and judges the verification successful when the distance value falls below a predetermined threshold. The problem is that a verification threshold that is optimal in a quiet environment is not necessarily optimal under noise.

The speaker verification device according to the present invention obtains the ratio S/N between the acoustic power S of the section detected and matched as speech and the acoustic power N of the sections immediately before and immediately after the matched section, and rejects the input speech when S/N is small, thereby preventing false matches caused by noise.

The speaker verification device according to the present invention selects a part of the input speech as a matching section and matches that section against a registered voice to determine whether the speaker of the registered voice and the speaker of the input speech are the same. The device comprises:
matching section acoustic power calculating means for calculating the acoustic power S of the selected matching section;
noise section acoustic power calculating means for calculating the acoustic power N of a part of the input speech preceding or following the matching section; and
speaker determination means for matching the matching section against the registered voice to determine whether the speaker of the registered voice and the speaker of the input speech are the same, and for determining that the speakers are not the same when the value SNR obtained by dividing S by N falls below a predetermined threshold TH.

Thus, the speaker verification device according to the present invention obtains the noise level from the acoustic power of the section preceding the matching section, and makes speaker verification fail when the ratio S/N between the acoustic power S of the matching section and the acoustic power N of the noise section is small, thereby preventing false matches.

Embodiments of the present invention will now be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of the speaker verification device according to Embodiment 1 of the present invention. In the figure, the speaker verification device 1 includes a voice input unit 101, an acoustic analysis unit 102, a voice section detection unit 103, a registration unit 104, a registered data storage unit 105, a continuous DP matching unit 106, a matching-section average acoustic power calculation unit 107, a pre-matching-section average acoustic power calculation unit 108, an SNR calculation unit 109, a determination unit 110, and a notification means 111.

In the speaker verification device 1, the voice input unit 101 converts the speech signal picked up by a microphone into a digital signal. The acoustic analysis unit 102 is a unit (circuit or element) that calculates, from the digitized speech data, acoustic features suitable for speaker verification (for example, well-known features such as the mel cepstrum and acoustic power). The voice section detection unit 103 detects the actual utterance section by detecting the rise and fall of the acoustic power from features such as the acoustic power. The registration unit 104 registers the acoustic features of the detected section in the registered data storage unit 105, described next. The registered data storage unit 105 is a storage element, circuit, or storage medium that stores the acoustic features.

The continuous DP matching unit 106 calculates, by the continuous DP method, the distance between the acoustic features of the input speech and the acoustic features of the registered data stored in the registered data storage unit 105. The matching-section average acoustic power calculation unit 107 calculates the average acoustic power within the matching section from the acoustic power of the matching section detected by the continuous DP matching unit 106. The pre-matching-section average acoustic power calculation unit 108 calculates the average acoustic power of a predetermined section immediately before the matching section. The SNR calculation unit 109 calculates the ratio between the average acoustic power within the matching section and the average acoustic power of the predetermined section immediately before it. The determination unit 110 decides whether verification succeeds by checking the results from the continuous DP matching unit 106 and the SNR calculation unit 109 against predetermined conditions.

The notification means 111 notifies the speaker, who is the user, of the determination result of the determination unit 110 visually, audibly, or by touch. For visual notification, the notification means 111 may be built from a device such as an LED (light-emitting diode), a lamp, or a display; for audible notification, a loudspeaker may be used. For tactile notification, the notification means 111 is implemented as a vibrator using an eccentric motor.

The matching-section average acoustic power calculation unit 107 is an example of the matching section acoustic power calculating means. The pre-matching-section average acoustic power calculation unit 108 is an example of the noise section acoustic power calculating means. The SNR calculation unit 109 and the determination unit 110 together are an example of the speaker determination means. Because units 107 and 108 simply calculate power, they need not be provided as independent units: if power is calculated in the course of processing by the continuous DP matching unit 106, that power may be regarded as the output of units 107 and 108.

Besides the configuration of FIG. 1, a control means with general-purpose control functions, such as a central processing unit (CPU), a computer, or a digital signal processor (DSP), may be provided to control each unit of FIG. 1. The configuration of FIG. 1 may also be replaced by a combination of such a control means and a computer program that causes it to execute the processing corresponding to each unit of FIG. 1. In the following, however, each unit of FIG. 1 is described as performing its own function.

FIG. 2 is a front view of the speaker verification device 1 configured as a mobile phone. In the figure, reference numeral 101a denotes the microphone of the voice input unit 101, and the liquid crystal display corresponds to the notification means 111. The notification means 111 may also be implemented with a loudspeaker or a vibrator.

Next, the operation of the speaker verification device according to Embodiment 1 is described. FIG. 2 is a flowchart of the process for registering a voice in this speaker verification device. The speaker utters the keyword to be registered, and the voice input unit 101 samples the input speech and converts it into digital data (step S01). The acoustic analysis unit 102 then calculates, from the digitized speech data, acoustic features suitable for speaker verification (for example, well-known features such as the mel cepstrum and acoustic power) (step S02). The voice section detection unit 103 detects the actual utterance section by detecting the rise and fall of the acoustic power from features such as the acoustic power (step S03). The registration unit 104 then registers the acoustic features of the detected section in the registered data storage unit 105 (step S04). To confirm that the utterance is stable, the speaker may be asked to utter the keyword two or more times, and a value such as the average of the data may be registered only when the distances between the utterances are small.
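As a rough illustration of the optional stability check at registration, the following Python sketch averages several utterances only when every pair of them is close. The function name `mean_template`, the parameter `max_distance`, and the simplifying assumption that all utterances have the same number of frames are illustrative and are not taken from the specification, which does not say how the mutual distance is computed.

```python
import math

def mean_template(utterances, max_distance):
    """Return the frame-wise mean of the utterances, or None if they are unstable.

    Each utterance is a list of feature vectors. ASSUMPTION: all utterances
    have the same number of frames, so frames can be compared index by index.
    """
    def distance(a, b):
        # Mean Euclidean distance between corresponding frames.
        return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)

    for p in range(len(utterances)):
        for q in range(p + 1, len(utterances)):
            if distance(utterances[p], utterances[q]) > max_distance:
                return None  # utterances differ too much; do not register
    n = len(utterances)
    return [
        [sum(values) / n for values in zip(*frames)]
        for frames in zip(*utterances)
    ]
```

A caller would register the returned template when it is not `None` and otherwise ask the speaker to repeat the keyword.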

Next, operation at verification time is described. FIG. 3 is a flowchart of the verification process in this speaker verification device. The voice input unit 101 captures speech continuously, samples it, and converts it into digital data (step S11). The acoustic analysis unit 102 then calculates, from the digitized speech data, acoustic features suitable for speaker verification (for example, well-known features such as the mel cepstrum and acoustic power) (step S12).

The continuous DP matching unit 106 then calculates, by the continuous DP method, the distance between the acoustic features of the input speech and those of the registered data stored in the registered data storage unit 105, and detects a matching section based on this distance value (step S13).

The continuous DP method is now described. Continuous DP performs DP matching continuously from the beginning of the input speech data. Specifically, using a slope constraint such as that shown in FIG. 4, it matches arbitrary sections of the input speech against the registered data. The distance between the acoustic feature of the input speech at the i-th time and the j-th acoustic feature of the registered data is called the local distance and is written d(i, j). Acoustic features are generally represented as vectors, so with K as the vector dimension, the acoustic feature of the input speech at the i-th time can be written (V(i)_1, V(i)_2, V(i)_3, ..., V(i)_K) and the acoustic feature of the registered data at the j-th time can be written (R(j)_1, R(j)_2, R(j)_3, ..., R(j)_K). When the Euclidean distance is used, the local distance is

$$ d(i, j) = \sqrt{\sum_{k=1}^{K} \left( V(i)_k - R(j)_k \right)^2} \tag{1} $$
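The Euclidean local distance described above can be sketched in Python as follows. The function name `local_distance` is illustrative; the inputs are one feature vector of the input speech and one of the registered data, both of dimension K.

```python
import math

def local_distance(v, r):
    """Euclidean distance d(i, j) between an input frame v and a registered frame r."""
    if len(v) != len(r):
        raise ValueError("feature vectors must have the same dimension K")
    return math.sqrt(sum((vk - rk) ** 2 for vk, rk in zip(v, r)))
```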

Next, the cumulative distance g(i, j) between the i-th time of the input speech and the j-th time of the registered data is defined by Equation 2.

[Equation 2: formula not reproduced in the source text]

At the same time, the path length (weight) c(i, j) is defined by Equation 3.

[Equation 3: formula not reproduced in the source text]

Equations 2 and 3 are evaluated at each time i for j from j = 0 to j = J, where J is the length of the registered data. The cumulative distance of Equation 2 depends on the preceding cumulative distances and the local distance; by selecting the minimum at each step, the (i, j) pairs through which the cumulative distance passes are determined sequentially. At a given time i_0, g(i_0, J) is the minimum cumulative distance when the i_0-th time of the input speech is aligned with the end of the registered pattern. The sequence of (i, j) pairs from the starting point j = 0 to j = J is determined at the same time; this is called the optimal path, and c(i_0, J) is its length. The cumulative distance normalized for variation in length is therefore defined as

$$ G(i_0) = \frac{g(i_0, J)}{c(i_0, J)} \tag{4} $$

and a match is declared at time i_0 when this value falls below a predetermined value at the i_0-th time. (Hereinafter, "cumulative distance value" refers to this normalized cumulative distance value.)

Through the above procedure, continuous DP matching simultaneously yields the optimal correspondence (optimal path) between the registered data and the input speech data received up to the current time i_0, the local distance at each point on the optimal path, and the cumulative distance value along the optimal path. The optimal path is determined, for example, as shown in FIG. 4. In the figure, M is the length of the matched utterance.

The continuous DP matching unit 106 takes as the matching section a section for which the cumulative distance G obtained by Equation 4 is at or below a predetermined value.
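To make the idea of continuous DP concrete, here is a minimal Python sketch. It is not the recurrence of the specification, whose slope constraint and weights are given in a figure and equations that are not reproduced in this text. The sketch assumes instead that each path step advances the input by one frame and the registered data by 0, 1, or 2 frames, with unit weight per step, so that the path length c equals the number of input frames M. The function name `continuous_dp` and the tuple layout are illustrative.

```python
import math

def continuous_dp(inputs, template, threshold):
    """Yield (i, G, start) each time the normalized cumulative distance G <= threshold.

    ASSUMED recurrence (not the patent's): from (i-1, j), (i-1, j-1) or
    (i-1, j-2) to (i, j), each step with weight 1. A path may start at any
    input frame (free start point), which is what makes the DP "continuous".
    """
    # prev[j] holds (g, c, start) for the previous input frame.
    prev = [(math.inf, 0, -1)] * len(template)
    for i, v in enumerate(inputs):
        cur = []
        for j, r in enumerate(template):
            d = math.dist(v, r)  # Euclidean local distance d(i, j)
            if j == 0:
                cur.append((d, 1, i))  # start a new path at input frame i
                continue
            # Pick the predecessor with the smallest cumulative distance.
            g, c, start = min(prev[max(0, j - 2):j + 1], key=lambda t: t[0])
            cur.append((g + d, c + 1, start))
        g, c, start = cur[-1]  # path aligned with the end of the template
        if math.isfinite(g) and g / c <= threshold:
            yield i, g / c, start
        prev = cur
```

For a reported match `(i, G, start)`, the matched length is `M = i - start + 1`, which is the quantity needed to locate the matching section and the section immediately before it.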

Next, the matching-section average acoustic power calculation unit 107 calculates the average acoustic power within the matching section, denoted S, from the acoustic power of the matching section detected by the continuous DP matching unit 106 (step S14). Specifically, when a match occurs at time i, the matching section is i - M + 1, i - M + 2, ..., i - 1, i, and the corresponding acoustic powers are P(i - M + 1), P(i - M + 2), ..., P(i), S is calculated as follows, where M is the length of the matched speech.

$$ S = \frac{1}{M} \sum_{t=i-M+1}^{i} P(t) \tag{5} $$

Next, the pre-matching-section average acoustic power calculation unit 108 calculates the average acoustic power, denoted N, from the acoustic power in a predetermined section immediately before the matching section (step S15). Specifically, with L as the length of the immediately preceding section, N is calculated as follows.

$$ N = \frac{1}{L} \sum_{t=i-M-L+1}^{i-M} P(t) \tag{6} $$
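The two averages described above, the mean power over the M frames of the matching section and the mean power over the L frames immediately before it, can be sketched in Python as follows. The function name `section_powers` and the use of a plain list `power` indexed by frame time are assumptions made for illustration.

```python
def section_powers(power, i, M, L):
    """Return (S, N) for a match that ends at frame i.

    S is the mean power of frames i-M+1 .. i (the matching section);
    N is the mean power of the L frames immediately before that section.
    """
    match_start = i - M + 1
    noise_start = match_start - L
    if noise_start < 0:
        raise ValueError("not enough history before the matching section")
    S = sum(power[match_start:i + 1]) / M
    N = sum(power[noise_start:match_start]) / L
    return S, N
```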

To compute these values, the matching-section average acoustic power calculation unit 107 and the pre-matching-section average acoustic power calculation unit 108 temporarily store the acoustic power at each time as it is output sequentially by the acoustic analysis unit 102; when a matching section is detected, they go back through the stored data to calculate each average acoustic power.

Next, the SNR calculation unit 109 calculates the ratio S/N of the values S and N obtained in steps S14 and S15 and takes this as the SNR (step S16). The determination unit 110 then compares the cumulative distance value G output by the continuous DP matching unit 106 for the matching section with a predetermined matching threshold TH_G, and compares the SNR with a predetermined threshold TH_SNR. If G < TH_G and SNR ≥ TH_SNR, it determines that the speaker of the registered voice and the speaker of the input speech are the same (step S17: identical). If either condition is not satisfied, it determines that the speakers are not the same (step S17: non-identical), and processing proceeds to step S18. Although the SNR has been described here as the ratio of S to N, the logarithm of that ratio, log(S/N), may be used as the SNR instead.
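The decision of step S17 can be sketched in Python as follows, using the rejection rule stated in the summary of the invention: the speakers are judged non-identical whenever the SNR falls below its threshold, so acceptance requires both a small distance and a sufficiently large SNR. The function name `is_same_speaker`, the `use_log` flag, and the choice of a base-10 logarithm are illustrative assumptions; the specification does not fix the base of the logarithm.

```python
import math

def is_same_speaker(G, S, N, th_g, th_snr, use_log=False):
    """Accept only if the distance is small AND the speech stands out from the noise."""
    if S <= 0 or N <= 0:
        raise ValueError("S and N must be positive acoustic powers")
    snr = S / N
    if use_log:
        snr = math.log10(snr)  # ASSUMPTION: base 10; th_snr must be on the same scale
    return G < th_g and snr >= th_snr
```

With this rule, a section of noise that happens to resemble the registered data (small G) is still rejected when its power is no larger than that of the surrounding noise.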

In step S18, the speaker is informed of the determination status via the notification means. FIG. 5 shows an example of the determination status displayed on the mobile phone's display by the notification means 111. In the figure, 111a is a character string indicating that the mobile phone is currently performing speaker verification, and 111b is a character string indicating the verification result. These are only one example of how the notification means 111 may be configured; any method may be used as long as the user can understand the determination status. With the notification means 111, a user whose verification has failed can recognize the failure and respond, for example by retrying or by moving to a quieter place. Processing then returns to step S11 to acquire the speech data for the next time.

As is clear from the above, in the speaker verification device according to Embodiment 1, the SNR takes a small value when the average acoustic power of the detected matching section is not large compared with the average acoustic power immediately before it, so step S17 evaluates to false (non-identical). This avoids false matches caused by the acoustic data of a noise section coincidentally matching the registered data in a noisy environment.

It goes without saying that, in the above configuration, the registration process may be performed by a device other than the speaker verification device 1, with only the registered voice transferred to the speaker verification device 1 via flash memory, a communication link, or the like.

Depending on the application of the speaker verification device 1, the notification means 111 may not be an essential component. For example, when the purpose is to lock and unlock an automobile, needlessly announcing that verification has failed would reveal to unspecified persons that the car can be unlocked by voice, which could compromise security. Where it is safer not to reveal that the speaker verification function is active, it is better to display nothing even when verification fails, and the notification means 111 may be omitted.

In Embodiment 1, the noise section acoustic power calculating means has been described as the pre-matching-section average acoustic power calculation unit 108, which calculates the acoustic power of the section preceding the matching section. It goes without saying, however, that the acoustic power of the section following the matching section (the section immediately after it) may be calculated instead. In that case, Equation 6 is changed to sum the power over the immediately following section, and N(i) is obtained by dividing this sum by the length L.
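The variant that measures noise after the matching section can be sketched in Python as follows. The function name `trailing_noise_power` and the convention of returning `None` while the following frames have not yet arrived are illustrative assumptions. One consequence of this variant is that the decision for a match ending at frame i cannot be made until L further frames have been received.

```python
def trailing_noise_power(power, i, L):
    """Mean power of the L frames immediately after a match that ends at frame i.

    Returns None if fewer than L frames have been received after frame i.
    """
    end = i + 1 + L
    if end > len(power):
        return None  # the following section is not complete yet
    return sum(power[i + 1:end]) / L
```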

Embodiment 2.
The speaker verification device of Embodiment 1 compares the calculated SNR with a fixed threshold and always rejects input speech on that basis. Alternatively, a different threshold may be set depending on the situation. The speaker verification device of Embodiment 2 has this feature.

FIG. 7 is a block diagram showing the configuration of the speaker verification device according to Embodiment 2 of the present invention. In the figure, the matching threshold setting unit 201 recalculates and resets the matching threshold. Components bearing the same reference numerals as in FIG. 1 are the same as in Embodiment 1, and their description is omitted.

Next, the operation of the speaker verification device according to Embodiment 2 (hereinafter, speaker verification device 1) is described. Operation at registration time is the same as in Embodiment 1 and is not described again. FIG. 8 is a flowchart of the operation of the speaker verification device 1 at verification time. Steps S11 to S13 in the figure are the same as in Embodiment 1. In step S24, the pre-matching-section average acoustic power calculation unit 108 calculates the average acoustic power N from the acoustic power in a predetermined section immediately before the matching section. Specifically, when a match occurs at time i, the matching section is i - M + 1, i - M + 2, ..., i - 1, i, the length of the immediately preceding section is L, and the acoustic power at time i is P(i), N is calculated by Equation 6.

The acoustic power at each time, output sequentially by the acoustic analysis unit 102, is temporarily stored by the pre-matching-section average acoustic power calculation unit 108. When a matching section is detected, the unit goes back through the stored data to calculate the average acoustic power.

Next, the matching threshold setting unit 201 determines the matching threshold TH_G from the value N calculated by the pre-matching-section average acoustic power calculation unit 108, according to a predetermined correspondence. Specifically, with TH_a, TH_N, TH_M, and w as predetermined values, it sets

$$ TH_G = \begin{cases} TH_a & (N < TH_N) \\ TH_a + (N - TH_N)\, w & (TH_N \le N \le TH_M) \\ TH_a + (TH_M - TH_N)\, w & (N > TH_M) \end{cases} \tag{7} $$

That is, the matching threshold is made small when the noise is low and is increased as the noise increases, subject to a lower limit (TH_a) and an upper limit (TH_a + (TH_M - TH_N) × w). Alternatively, TH_G may be calculated as the value of any increasing function of N. Because the matching threshold is relaxed as the noise level rises, an increase in the rate at which the genuine speaker is rejected can be prevented. N may also be the logarithmic acoustic power obtained by taking the logarithm.
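A threshold with the stated lower limit TH_a and upper limit TH_a + (TH_M - TH_N) × w can be sketched in Python as follows. The formula itself is not reproduced in this text, so the linear rise between TH_N and TH_M is an assumption inferred from those two limits; the function and parameter names are illustrative.

```python
def matching_threshold(N, th_a, th_n, th_m, w):
    """Noise-dependent matching threshold TH_G.

    ASSUMED shape: constant th_a for N below th_n, rising linearly with
    slope w up to N = th_m, and constant above that.
    """
    clamped = min(max(N, th_n), th_m)  # restrict N to [th_n, th_m]
    return th_a + (clamped - th_n) * w
```

The determination unit would then accept a match only when the cumulative distance G is below the value returned for the current noise power N.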

Next, the determination unit 110 compares the distance value G output by the continuous DP matching unit 106 with the verification threshold TH_G set by the verification threshold setting unit 201, and determines that the speakers are the same when G < TH_G (step S26). When G < TH_G does not hold, the speakers are regarded as not the same and the voice data for the next time is acquired (step S26).

With the above processing, an optimum threshold can be set even when the speaker verification device is used in an environment where the noise level fluctuates, so an increase in the rate at which the genuine speaker is rejected can be prevented.

In the above description, the threshold TH_G is calculated as an increasing function of the acoustic power N calculated by the pre-verification-section average acoustic power calculation unit 108; however, TH_G may instead be calculated as an increasing function of the SNR in place of the acoustic power N. Specifically, TH_G is set, for example, as given by Equation 8. In this case, the verification threshold is made small when the ratio of the acoustic power of the speech to that of the noise is large and is increased as the ratio becomes smaller, while the verification threshold has upper and lower limits. In this way, the verification threshold is relaxed when the power ratio between speech and noise becomes small, so an increase in the rate at which the genuine speaker is rejected can be prevented. The method of calculating the threshold is not limited to Equation 8, and any increasing function may be used. In addition, log(S/N), the logarithm of the speech-to-noise power ratio S/N, may be used as the value of the SNR.
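Equation 8 is not reproduced in this text. As a minimal sketch of the behavior described in the paragraph above, the following assumes a linear map from S/N to TH_G that is clamped at both ends: the threshold equals th_high at or below snr_low and falls to th_low at or above snr_high. The four parameter names, the function name, and the linear shape are all assumptions for illustration.

```python
def threshold_from_snr(snr, th_low, th_high, snr_low, snr_high):
    """Clamped linear verification threshold TH_G as a function of S/N.

    A small S/N (noisy input) yields the relaxed threshold th_high;
    a large S/N (clean input) yields the strict threshold th_low.
    """
    if snr_high <= snr_low:
        raise ValueError("snr_high must be greater than snr_low")
    snr_clamped = min(max(snr, snr_low), snr_high)
    frac = (snr_clamped - snr_low) / (snr_high - snr_low)
    return th_high - frac * (th_high - th_low)
```

The same function can be applied to log(S/N) instead of S/N by passing the logarithm and choosing snr_low and snr_high on the logarithmic scale.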

The above processing shows a configuration example in which the verification threshold TH_G is recalculated. Alternatively, the device may be configured as in Embodiment 1 and, instead of the verification threshold TH_G, the threshold TH_SNR that is compared with the SNR may be recalculated.

Needless to say, the speaker verification device according to Embodiment 2 of the present invention may also be provided with notification means, as in Embodiment 1.

Embodiment 3.
The speaker verification device according to Embodiment 2 of the present invention is configured to apply different thresholds depending on the acoustic power or the SNR. Alternatively, the device may be configured to apply different thresholds depending on the use situation (the content of the operation to be unlocked). The speaker verification device according to Embodiment 3 of the present invention has this feature.

FIG. 9 is a block diagram showing the configuration of the speaker verification device according to Embodiment 3 of the present invention. This figure differs from FIG. 1 in that the determination unit 110 is configured to refer to the registered data storage unit 105. In addition, the configuration of the registered data storage unit 105 differs in the speaker verification device according to Embodiment 3 (speaker verification device 1). FIG. 10 shows an example of the data stored in the registered data storage unit 105 when speaker verification device 1 is configured as a mobile phone. As shown in the figure, the registered data storage unit 105 of Embodiment 3 stores a threshold field and a registered voice data field for each command. The command entry indicates a command that the mobile phone can execute by voice. The registered voice data is the voice that the user registered in advance to invoke that command. For simplicity the registered voice data is written here in Roman letters, but because it is actually voice data, it is recorded not in this form but as features such as mel-cepstrum and acoustic power. The threshold field stores the TH_SNR to be compared with the SNR for that command. Note that if this threshold is small, verification succeeds in more situations even when some noise is present, whereas if it is large, the device becomes stricter about noise and verification is less likely to succeed.

For example, in FIG. 10 a threshold of 2.0 is set for receiving and making calls. On the other hand, suppose the mobile phone has a digital camera function and the shutter is released when the speaker says "Hai, chiizu" ("Say cheese"). In such a use, capturing the photo opportunity should take priority over noise resistance (the need to keep the rate of erroneous verification caused by noise low), so the threshold is set lower than for other commands and operations, here 1.5. In this way, usability can be improved while erroneous verification due to noise is prevented and security is maintained.

The operation of speaker verification device 1 is the same as in Embodiment 1, except that, during verification, the determination unit 110 reads from the registered data storage unit 105 the TH_SNR stored for the command whose registered voice the continuous DP matching unit 106 judged to be closest.
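The per-command lookup can be sketched as follows, assuming the DP distance G to each registered voice, the verification-section power S, and the noise power N are already available. The command labels, the RegisteredCommand type, and accept_command are hypothetical; the thresholds 2.0 and 1.5 follow the FIG. 10 example, and the distance check against TH_G is assumed to carry over from the earlier embodiments.

```python
from dataclasses import dataclass


@dataclass
class RegisteredCommand:
    name: str      # hypothetical command label
    th_snr: float  # per-command threshold TH_SNR compared with the SNR


def accept_command(distances, commands, s, n, th_g):
    """Return the accepted command name, or None when verification fails.

    distances maps each command name to its DP distance G.
    """
    if n <= 0:
        raise ValueError("noise power N must be positive")
    # The registered voice with the smallest distance is the candidate.
    best = min(distances, key=distances.get)
    if distances[best] >= th_g:
        return None  # not close enough to any registered voice
    if s / n < commands[best].th_snr:
        return None  # SNR below this command's TH_SNR: treated as non-identical
    return best


COMMANDS = {
    "answer_call": RegisteredCommand("answer_call", 2.0),
    "shutter": RegisteredCommand("shutter", 1.5),
}
```

With an SNR of 1.8 the shutter command (threshold 1.5) is accepted while the call command (threshold 2.0) is rejected, which illustrates how a lower threshold trades noise strictness for responsiveness.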

As is clear from the above, the speaker verification device of Embodiment 3 of the present invention applies different thresholds depending on the use situation, so usability can be improved while erroneous verification due to noise is prevented and security is maintained.

The TH_SNR that the registered data storage unit 105 stores for each command may be used as an initial value, and the threshold may then be reset as in Embodiment 2. Needless to say, the upper and lower limits of the threshold described in Embodiment 2 may also be stored in the registered data storage unit 105 for each command and referred to by the determination unit 110 during verification.

The present invention can be applied, for example, to a speaker verification device that decides whether use of a device may be started on the basis of the voice uttered by the operator.

FIG. 1 is a block diagram showing the configuration of Embodiment 1 of the present invention.
FIG. 2 is a front view of the speaker verification device of Embodiment 1 of the present invention configured as a mobile phone.
FIG. 3 is a flowchart showing the operation of the speaker verification device of Embodiment 1 of the present invention.
FIG. 4 is a flowchart showing the operation of the speaker verification device of Embodiment 1 of the present invention.
FIG. 5 is a diagram for explaining the operating principle of DP matching in the speaker verification device of Embodiment 1 of the present invention.
FIG. 6 is a diagram showing a configuration example of the notification means of Embodiment 1 of the present invention.
FIG. 7 is a block diagram showing the configuration of Embodiment 2 of the present invention.
FIG. 8 is a flowchart showing the operation of the speaker verification device of Embodiment 2 of the present invention.
FIG. 9 is a block diagram showing the configuration of Embodiment 3 of the present invention.
FIG. 10 is a diagram showing a configuration example of the data stored in the speaker verification device of Embodiment 3 of the present invention.

Explanation of symbols

105 registered data storage unit,
106 continuous DP matching unit,
107 in-verification-section average acoustic power calculation unit,
108 pre-verification-section average acoustic power calculation unit,
109 SNR calculation unit,
110 determination unit,
111 notification means,
201 verification threshold setting unit.

Claims

1. A speaker verification device that selects a part of input speech as a verification section and compares the verification section with registered speech to determine speaker identity between the registered speech and the input speech, the device comprising:
verification section acoustic power calculating means for calculating the acoustic power S of the selected verification section;
noise section acoustic power calculating means for calculating the acoustic power N of a part of the input speech preceding or following the verification section; and
speaker determination means for comparing the verification section with the registered speech to determine speaker identity between the registered speech and the input speech, and for determining that the speakers of these voices are not identical when the value SNR obtained by dividing S by N falls below a predetermined threshold TH.

2. The speaker verification device according to claim 1, wherein the speaker determination means changes TH based on N.

3. The speaker verification device according to claim 2, wherein the speaker determination means calculates TH as the value of an increasing function of N and compares TH with SNR.

4. The speaker verification device according to claim 1, wherein the speaker determination means changes TH based on SNR.

5. The speaker verification device according to claim 4, wherein the speaker determination means calculates TH as the value of an increasing function of SNR and compares TH with SNR.

6. The speaker verification device according to claim 1, wherein the speaker determination means sets TH to a different value according to the type of registered speech to be verified, and compares TH with SNR.

7. The speaker verification device according to claim 1, wherein the speaker determination means notifies the speaker of the determination status via notification means when SNR is lower than TH.