JP2001109494A

JP2001109494A - Voice identification device and voice identification method

Info

Publication number: JP2001109494A
Application number: JP28265299A
Authority: JP
Inventors: Kazuyoshi Fukushi; 和義福士
Original assignee: Secom Co Ltd
Current assignee: Secom Co Ltd
Priority date: 1999-10-04
Filing date: 1999-10-04
Publication date: 2001-04-20
Anticipated expiration: 2019-10-04
Also published as: JP4328423B2

Abstract

PROBLEM TO BE SOLVED: To identify whether input voice is reproduced voice or not. SOLUTION: It is identified whether input voice is reproduced voice or not based on the difference of phase information between real voice and reproduced voice obtained by recording and reproducing real voice. Phase information is preferably a phase difference between a basic wave and the higher harmonic or the phase difference between the higher harmonics. The phase difference is compared between input voice and registered voice.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a device that evaluates input voice, and more particularly to a device that identifies whether an input voice is live voice that has just been uttered or reproduced voice played back from a recording.

[0002]

[Description of the Related Art] Voice identification devices have many conceivable uses; here, a personal ID device using voice is described as one application example. In general, a voice personal ID device registers a user's voice in advance and automatically identifies whether voice input through a microphone is the same as the registered voice.

[0003] To describe the prior art concretely, voice with predetermined utterance content is first registered for each user. This is called the registered voice or verification voice. In practice, when the user inputs voice, spectral envelope information, for example, is extracted from the voice signal and stored as reference data for matching.

[0004] When input voice is matched, spectral envelope information is extracted from the voice signal input by the user by the same analysis as at registration, and is matched against the stored reference data. If the difference from the reference data is at or above a fixed threshold, the input voice is rejected as belonging to another person. If the difference is within the threshold, the input voice is judged to be the same as the voice of the registered speaker, and a predetermined process, such as releasing an electric lock fitted to a door, is executed.

[0005] However, if the voice uttered by the registered speaker is recorded, for example at registration time, from behind the speaker or by a hidden microphone installed in the device, and the recording is later played back through a loudspeaker or the like, voice whose spectral envelope information closely resembles the original can be input. Although the placement and orientation of the microphone and loudspeaker must be finely adjusted, the possibility that voice input in this way is treated as identical to the registered voice cannot be ruled out. This is called a recording ploy, and various countermeasures such as the following have been taken against it.

[0006] Conventionally, as a method of preventing such ploys, Japanese Patent Application Laid-Open No. H5-323990 describes a system that instructs the user to input different utterance content at each authentication. Japanese Patent Application Laid-Open No. H9-127974 describes a system that outputs a different acoustic signal each time so that it is superimposed on the input voice, and uses the signal obtained by removing the system's acoustic signal from the input voice.

[0007]

[Problems to Be Solved by the Invention] However, because the above conventional methods change the utterance content of the input voice or superimpose a specific acoustic signal on it, they have the problem that the utterance content or the superimposed acoustic signal must be changed at every utterance.

[0008] The present invention has been made in view of the above conventional problems, and its object is to realize highly reliable identification of input voice.

[0009] Another object of the present invention is to determine with high accuracy whether an input voice is live voice or reproduced voice.

[0010] Still another object of the present invention is to use the difference in properties between live voice and its reproduced voice for the evaluation of voice.

[0011]

[Means for Solving the Problems] (1) Description of the means. To achieve the above objects, the present invention is characterized by including: voice input means for inputting voice; phase information extraction means for extracting phase information from the voice input to the voice input means; and evaluation means for evaluating the input voice by comparing the phase information of a verification voice with the phase information of the input voice.

[0012] According to various experiments by the present inventor, a difference in signal waveform, specifically a difference in phase information, was observed between live voice and the voice obtained by recording it and playing it back (reproduced voice). This is presumed to be caused by the phase characteristics of the recording and reproducing systems (in particular the loudspeaker). The present invention uses this phenomenon to evaluate input voice, and preferably identifies whether the input voice is live voice or reproduced voice. According to the present invention, there is no need to change the utterance content or to superimpose another sound, so a simple and highly reliable input voice evaluation system can be realized.

[0013] The above phase information is preferably the phase difference between the fundamental wave and a harmonic (or the phase difference between harmonics). Other kinds of information can also be used as phase information, such as a phase ratio, a signal correlation value, or a phase change, as long as they indicate the difference in phase between the input voice and the reproduced voice. The phase information may also be compared by directly comparing the waveforms themselves.

[0014] Preferably, the evaluation means includes means for identifying, on the basis of the comparison of the two pieces of phase information, whether the input voice is reproduced voice. That is, the waveform difference between the input voice and the reproduced voice is extracted by comparing phase information as described above, and a recording ploy is detected from it.

[0015] Preferably, the phase information is information on at least one of the phase difference between the fundamental wave and a harmonic of the voice and the phase difference between harmonics. Phase is a relative quantity, and it is essentially the difference between two phases (the phase difference) that has physical meaning. It is therefore desirable to use, for example, the phase difference between the fundamental wave and a harmonic as the quantity to be compared. In that case, the evaluation can be made accurately by using the phase difference between the fundamental wave, which has high power, and a low-order harmonic, which has high power among the harmonics.

[0016] Preferably, the phase information extraction means includes: means for obtaining, as the phase information of the verification voice, a phase difference Am between the fundamental wave of the verification voice and its m-th harmonic and a phase difference An between the fundamental wave of the verification voice and its n-th harmonic; and means for obtaining, as the phase information of the input voice, a phase difference Bm between the fundamental wave of the input voice and its m-th harmonic and a phase difference Bn between the fundamental wave of the input voice and its n-th harmonic. The identification means identifies the input voice on the basis of the difference between the phase difference Am and the phase difference Bm and the difference between the phase difference An and the phase difference Bn.
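The comparison of Am, An with Bm, Bn can be illustrated with a minimal Python sketch. Several points are assumptions made only for this illustration and are not stated in the text: the frame is taken to span exactly three fundamental periods, the "phase difference between the fundamental wave and its k-th harmonic" is computed as the harmonic phase minus k times the fundamental phase (a quantity that does not depend on where the frame starts), and the two differences are combined by summing their absolute values. All function names are hypothetical.

```python
import cmath
import math


def wrap(angle):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def component_phase(frame, cycles):
    """Phase of the component that completes `cycles` cycles within the frame."""
    n = len(frame)
    acc = sum(x * cmath.exp(-2j * math.pi * cycles * i / n) for i, x in enumerate(frame))
    return cmath.phase(acc)


def relative_phase(frame, order, periods_in_frame=3):
    """Assumed definition: harmonic phase minus `order` times the fundamental phase."""
    p1 = component_phase(frame, periods_in_frame)
    pk = component_phase(frame, periods_in_frame * order)
    return wrap(pk - order * p1)


def phase_mismatch(ref_frame, in_frame, m=2, n=3):
    """|Am - Bm| + |An - Bn| (the way of combining the two is an assumption)."""
    d_m = wrap(relative_phase(ref_frame, m) - relative_phase(in_frame, m))
    d_n = wrap(relative_phase(ref_frame, n) - relative_phase(in_frame, n))
    return abs(d_m) + abs(d_n)


def make_frame(phases, shift=0, period=40, periods=3):
    """Synthetic voiced frame: fundamental plus 2nd and 3rd harmonics."""
    amps = (1.0, 0.5, 0.3)
    return [
        sum(a * math.cos(2 * math.pi * (k + 1) * (i + shift) / period + p)
            for k, (a, p) in enumerate(zip(amps, phases)))
        for i in range(period * periods)
    ]


registered = make_frame((0.2, 0.3, -0.4))
live = make_frame((0.2, 0.3, -0.4), shift=7)   # same waveform, cut at a different time
replayed = make_frame((0.2, 0.3, 0.6))         # 3rd harmonic phase altered by 1 rad

live_score = phase_mismatch(registered, live)
replay_score = phase_mismatch(registered, replayed)
```

In this synthetic case the time-shifted frame scores close to zero while the frame with an altered third-harmonic phase scores close to 1 rad, so a threshold between the two would separate them. Real voice would need the frame selection and stability checks described later in the document.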

[0017] As described above, obtaining a plurality of phase differences and comparing them with each other allows the input voice to be evaluated more accurately. Here, m and n are integers of 2 or more, and m and n are not equal.

[0018] Preferably, the phase information extraction means includes: preliminary analysis means for estimating the fundamental wave of the input voice; main analysis means for performing frequency analysis of the input voice on the basis of the estimated fundamental wave; and extraction means for extracting the phase information from the frequency analysis result of the main analysis means.

[0019] With this configuration, the fundamental wave is estimated first, and the subsequent main analysis can be executed on the basis of the period (frequency) of that fundamental wave. The processing conditions of the main analysis can therefore be optimized, and the analysis accuracy is improved as a result.

[0020] Preferably, the preliminary analysis means cuts the input voice into frames with a fixed time window having a fixed window width and performs frequency analysis, and the main analysis means sets a variable time window whose width is set variably on the basis of the estimated fundamental wave, cuts the input voice into frames with that variable time window, and performs frequency analysis.

[0021] The variable time window is desirably set to a length of, for example, about three wavelengths, taking into account the average frequency of male voices and of female voices. If the window is too short, not enough data can be sampled (cut out); if it is too long, the analysis is strongly affected by frequency shifts within the window and its accuracy may fall. Once the fundamental wave has been estimated by the preliminary analysis, an optimal variable time window can be set on that basis, so that the frequency analysis in the main analysis can be performed properly. The preliminary analysis and the main analysis execute an FFT operation or the like, but various other methods can also be used.

[0022] Preferably, the phase information is extracted from frames of the voice signal that satisfy a predetermined stability condition. If the voice signal is cut out while it is unstable, the phase information may not be extracted properly. The phase information is therefore extracted only after a stable state has been confirmed.

[0023] To achieve the above objects, the method according to the present invention is characterized by identifying whether an input voice is live voice or reproduced voice by using the fact that the phase information differs between live voice and the reproduced voice obtained by recording it and playing it back. This configuration is useful in security devices, and various other applications are conceivable.

[0024] To achieve the above objects, the voice identification device according to the present invention is characterized by including: determination means for determining whether the verification voice and the input voice belong to the same person; and identification means for identifying, from its phase information, whether an input voice determined to belong to the same person is reproduced voice.

[0025] (2) Explanation of the principle. FIG. 1 shows the signal waveforms of live voice and reproduced voice. (A1) to (D1) in the upper rows are the reproduced voice (that is, recorded voice), and (A2) to (D2) in the lower rows are the live voice. The reproduced voice here was obtained by recording once, and then playing back, voice in which the same speaker uttered the same content. For convenience of explanation, each voice signal is shown after passing through a low-pass filter. (A) to (D) in FIG. 1 show a voice signal over a certain period divided into four parts and arranged in order from the top; the horizontal axis is time and the vertical axis is amplitude.

[0026] Comparing the live voice with the reproduced voice shows that the repetition period of the waveform, which corresponds to the pitch of the voice (the fundamental frequency), is the same in both, whereas the peak positions and the shape of the waveform that forms the basic repeating unit differ slightly between the live voice and the reproduced voice.

[0027] As shown in FIGS. 2 and 3, the vowel portion of a voice signal generally consists of a superposition of a fundamental wave, which gives the fundamental frequency (the pitch of the voice), and harmonics at two times, three times, and so on, that frequency. The shape of the waveform is determined by the relative positions of the individual sine waves, which is called phase. For example, as is clear from a comparison of FIGS. 2 and 3, when three frequency components are combined, the combined waveform changes greatly if the phase of any one component differs, even though the frequencies are the same.
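The effect described for FIGS. 2 and 3 can be reproduced with a small Python sketch. The fundamental frequency of 200 Hz, the sampling rate of 8000 Hz, the equal amplitudes, and the 90 degree shift of the third harmonic are illustrative values chosen here, not values from the figures.

```python
import math


def synthesize(f0, phases, sample_rate=8000, n=400):
    """Sum a fundamental and its 2nd and 3rd harmonics with the given phases (radians)."""
    return [
        sum(math.sin(2 * math.pi * (k + 1) * f0 * t / sample_rate + p)
            for k, p in enumerate(phases))
        for t in range(n)
    ]


a = synthesize(200.0, [0.0, 0.0, 0.0])
b = synthesize(200.0, [0.0, 0.0, math.pi / 2])  # only the 3rd harmonic is shifted

peak_a, peak_b = max(a), max(b)
energy_a = sum(x * x for x in a)
energy_b = sum(x * x for x in b)
```

Both signals contain the same three frequencies at the same amplitudes, so their energy over a whole number of periods is equal, yet their peak values differ noticeably. A comparison based only on the magnitude spectrum would not see this difference, which is why the phase has to be examined.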

[0028] The waveform shapes of the reproduced voice and the live voice differ because the relative phase relationship described above is disturbed during recording and reproduction. It is known empirically that the reproducing system (in particular the loudspeaker) is the main contributor.

[0029] The present invention therefore uses the fact that the signal waveform (specifically, the phase) differs between reproduced voice and live voice, even when the same speaker utters the same content, to distinguish reproduced voice from live voice with high accuracy.

[0030]

[Embodiments of the Invention] The following describes a case in which the principle of the present invention is applied to a personal ID device using voice. The present invention is of course also applicable to devices other than personal ID devices.

[0031] FIG. 4 is a block diagram showing the overall configuration of the personal ID device according to this embodiment. The input unit 1 is the means for inputting voice to the device and consists of a microphone, an A/D converter, an amplifier, and so on. The feature extraction unit 2 applies a fast Fourier transform (FFT; Fast Fourier Transformation) or linear prediction (LPC; Linear Prediction Coding) analysis to the digitized voice signal from the input unit 1 and extracts frequency-domain features. The matching unit 3 is the means for comparing the voice currently being input with the voice uttered at registration, using a known technique such as DP matching. The storage unit 4 is a memory that stores the extracted features. The features extracted from the voice that the user uttered at registration are stored here and used as reference data for comparison with the voice input at matching time. The output unit 5 is, for example, a circuit that outputs a release signal to the electric lock when the matching unit 3 judges the input to be live voice of the same speaker. When the input is judged not to be from the same speaker, or is judged to be reproduced voice, the user is notified of the rejection by a buzzer or on a monitor screen as necessary, and the rejected voice is stored in the storage unit 4 as necessary. The feature extraction unit 2 and the matching unit 3 can be implemented in hardware, but can also be implemented substantially in software.

[0032] Next, the flow of processing at registration and at matching in the voice personal ID device is described with reference to FIGS. 5 and 6.

[0033] FIG. 5 is a flowchart of the processing at registration. When voice is input, the section containing the voice signal (the utterance section) is first cut out of the signal sequence digitized by the input unit 1, using information such as the amplitude of the waveform and the presence or absence of a fundamental frequency (S1). Next, the input voice is analyzed frame by frame, and parameters representing the spectral envelope information are extracted (S2). Spectral envelope information here means the rough shape of the distribution of the frequency components contained in the voice signal at a given instant; it can be obtained by calculating the FFT or the LPC cepstrum for each analysis frame. The extracted parameters are stored in the storage unit 4 (S3) and used as reference data at matching time. Up to this point the processing is the same as in conventional devices.

[0034] Next, the method of extracting phase information is described.

[0035] To obtain the phase more accurately, the number of fundamental-period waveforms contained in the analysis window used for FFT analysis is important. Empirically, the analysis is known to be accurate when about three are contained. At registration, therefore, a two-stage analysis consisting of a "preliminary analysis (S100)" and a "main analysis (S101)" is performed, as described below. When high identification accuracy is not required, only the latter, the main analysis, may be performed.

[0036] First, in the preliminary analysis, the fundamental period of the voice signal is estimated using a fixed-length window (for example, about 40 ms) (S4). Although accuracy is lower with this fixed-length analysis window, it can handle a wide range of voices, from a low male voice (about 70 Hz) to a high female voice (about 500 Hz). The analysis window is scanned continuously along the time axis, and the fundamental period is estimated at each position. The preliminary analysis preferably uses an FFT operation, but an autocorrelation operation may be used instead.
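As a rough illustration of the autocorrelation variant of the preliminary analysis, the following Python sketch estimates the fundamental period inside one fixed 40 ms window, searching only the 70 Hz to 500 Hz range mentioned above. The sampling rate of 8000 Hz, the use of an unnormalized autocorrelation sum, and the function name `estimate_period` are assumptions made for this sketch.

```python
import math


def estimate_period(frame, sample_rate, f_min=70.0, f_max=500.0):
    """Return the lag (in samples) with the largest autocorrelation in the pitch range."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation: longer lags use fewer products,
        # which favors the shortest matching period over its multiples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


sample_rate = 8000
window = round(0.040 * sample_rate)       # fixed 40 ms window -> 320 samples
w = 2 * math.pi * 200.0 / sample_rate     # synthetic voice with a 200 Hz fundamental
frame = [
    math.sin(w * i) + 0.5 * math.sin(2 * w * i + 0.7) + 0.3 * math.sin(3 * w * i + 1.9)
    for i in range(window)
]
period = estimate_period(frame, sample_rate)
f0 = sample_rate / period
```

For this synthetic input the estimate is a period of 40 samples, that is, 200 Hz. Sliding the same window along the signal gives one estimate per position, as the text describes.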

[0037] Next, in the main analysis, the analysis is performed again using an analysis window three times as long as the fundamental period obtained from the preliminary analysis. That is, the spectral envelope information described above was obtained with the preliminary analysis, an analysis using a fixed-length window that does not take the fundamental period into account. The main analysis instead sets a new window width that contains about three of the fundamental periods found by the preliminary analysis, performs frequency analysis, and extracts the phase information from the result (S5).
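The following Python sketch shows one way to cut a frame of three fundamental periods and analyze it. The observation that the harmonics then fall exactly on DFT bins 3, 6, 9 and so on, with nothing in between, is offered here as a plausible reason why such a window helps; the text itself only reports the benefit as empirical. The helper names and the synthetic signal are hypothetical.

```python
import cmath
import math


def cut_main_frame(signal, start, period_samples, periods=3):
    """Cut a frame spanning `periods` fundamental periods, beginning at `start`."""
    return signal[start:start + periods * period_samples]


def dft_bin(frame, k):
    """Complex DFT coefficient of `frame` at bin k (rectangular window)."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))


# Synthetic voiced sound: period of 40 samples, fundamental plus a 2nd harmonic.
period = 40
signal = [
    math.cos(2 * math.pi * i / period) + 0.5 * math.cos(4 * math.pi * i / period + 0.3)
    for i in range(400)
]

frame = cut_main_frame(signal, 25, period)
fundamental = dft_bin(frame, 3)   # three periods per frame -> fundamental at bin 3
second = dft_bin(frame, 6)        # second harmonic at bin 6
between = dft_bin(frame, 4)       # a bin that holds no harmonic
```

With the window matched to the period, the coefficient magnitudes are proportional to the component amplitudes and the in-between bin is essentially zero, so `cmath.phase(fundamental)` and `cmath.phase(second)` are not contaminated by leakage from neighboring components.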

[0038] The test frame is now described. In general, the vowel portions of voice are more stationary than the consonant portions, so voice parameters such as spectral information and the fundamental frequency can be extracted stably from them. To extract phase information with higher accuracy, however, portions in which the fundamental frequency is changing and portions in which the amplitude level of the harmonics is small must also be excluded from the analysis, even within vowel portions. For this reason, an analysis frame is used for extracting phase difference information (and is called a test frame) only when the following two conditions on amplitude and phase are both satisfied.

[0039] [Condition on amplitude] The ratio between the maximum and the minimum of the amplitude levels (A_K, A_2K, A_3K) of the fundamental wave, the second harmonic, and the third harmonic is within a predetermined range (for example, 20 dB), or alternatively the minimum is at or above a predetermined value.
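A minimal Python sketch of the amplitude condition follows. It assumes that the three levels are given as linear amplitudes, so the ratio is converted with 20·log10; if the levels were powers, 10·log10 would apply instead. The function name is hypothetical.

```python
import math


def amplitude_condition(a_k, a_2k, a_3k, max_range_db=20.0):
    """True if the max/min ratio of the three linear amplitudes is within the range."""
    levels = (a_k, a_2k, a_3k)
    if min(levels) <= 0.0:
        return False  # a missing component cannot yield a usable phase
    ratio_db = 20.0 * math.log10(max(levels) / min(levels))
    return ratio_db <= max_range_db


strong_harmonics = amplitude_condition(1.0, 0.5, 0.2)    # ratio 5, about 14 dB
weak_third = amplitude_condition(1.0, 0.5, 0.05)         # ratio 20, about 26 dB
```

The first frame would be kept and the second discarded, because its third harmonic is too weak for its phase to be measured reliably.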

[0040] FIGS. 7(A) and 8(A) show the power at each frequency after FFT analysis; the horizontal axis corresponds to the FFT point number (frequency) and the vertical axis shows the logarithmic amplitude. FIGS. 7(B) and 8(B) show the phase distribution obtained by applying unwrapping to the FFT result; the horizontal axis corresponds to the FFT point number and the vertical axis shows the phase. FIG. 7 is an example in which the fundamental frequency is stable, and FIG. 8 is an example in which the fundamental frequency is in transition. The FFT result is obtained as complex numbers, and the angle of each vector in the complex plane corresponds to the phase. The phase is inherently confined to the range from -π to +π and is therefore discontinuous, but here it is converted into continuous values by taking into account the periodicity of the phase after the linear phase component has been removed. That is, FIGS. 7(B) and 8(B) show the result of a known unwrapping process.
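The unwrapping step can be sketched in Python as follows. This is the generic technique of removing jumps of 2π between neighboring FFT points; it does not include the removal of the linear phase component mentioned in the text, and the function name is chosen here for illustration.

```python
import math


def unwrap(phases):
    """Remove 2*pi jumps so that neighboring values differ by less than pi."""
    if not phases:
        return []
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        delta -= 2 * math.pi * round(delta / (2 * math.pi))
        out.append(out[-1] + delta)
    return out


# A steadily increasing phase, as it would appear after wrapping into [-pi, pi).
true_phase = [0.5 * i for i in range(20)]
wrapped = [(t + math.pi) % (2 * math.pi) - math.pi for t in true_phase]
recovered = unwrap(wrapped)
```

Here `recovered` matches `true_phase` because successive points differ by less than π; where the real phase changes faster than that between adjacent points, unwrapping cannot recover it.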

[0041] In FIG. 7, the peaks are, in order from the left, the fundamental wave A_K (near point 55 [about 320 Hz]), the second harmonic A_2K (near point 110 [about 640 Hz]), and the third harmonic A_3K (near point 170 [about 980 Hz]). The subscript K denotes the FFT point number corresponding to the fundamental frequency (about 320 Hz in FIG. 8). FIGS. 7(A) and 8(A) show continuous spectra, but the components actually exist as a set of line spectra; they are observed as continuous spectra because the FFT analysis window has finite length.

[0042] Under the first condition above, a frame in which any of these levels is low is excluded from the test frames, because the error in extracting the phase difference would be large.

[0043] [Condition on phase] The values w_j (j = 1, 2, 3) given by the following equation are all within a fixed threshold (j = 1, 2, 3 correspond to the fundamental wave, the second harmonic, and the third harmonic, respectively).

[0044]

[Equation 1]

Here, P_l denotes the phase at the l-th FFT point, and K denotes the FFT point number corresponding to the fundamental frequency. P_2K and P_3K denote the phases of the second harmonic and the third harmonic, respectively. M is an integer determined by the analysis frame length N and is given by

[Equation 2]

where [x] denotes the largest integer not exceeding x, and L is the FFT window width, that is, the number of points.

[0045] In the example of FIG. 7 above, the fundamental wave, the second harmonic, and the third harmonic are stable, and the frame is suitable for extracting phase differences. In the example of FIG. 8, by contrast, none of the fundamental wave, the second harmonic, and the third harmonic is stable, and the frame is generally unsuitable for extracting phase differences.

[0046] Equation (1) above is a measure of the stability of the phase in the neighborhood (M points on either side) of the frequencies of the fundamental wave, the second harmonic, and the third harmonic in FIGS. 7(B) and 8(B). For an ideal signal, that is, a waveform synthesized from sine waves with no noise component, this value is extremely small. Therefore, when the value of w_j is large, the fundamental frequency can be judged to be unstable, that is, in transition.

[0047] Specifically, FIG. 7(B) shows that the phase values are quite stable near the frequencies corresponding to the fundamental frequency (near point 55), its second harmonic (near point 110), and its third harmonic (near point 170). In FIG. 8(B), on the other hand, the phase values are not stable except near the fundamental frequency (near point 50), which shows that the fundamental frequency is in transition.

[0048] In the main analysis (S101) of FIG. 5, stationary frames that satisfy the above conditions and from which phase information can be extracted stably are labeled as test frames (S6), and the phase difference information is calculated (S7). The calculated phase difference information is stored as reference data for matching (S8).

[0049] FIG. 6 is a flowchart of the processing at matching. The matching process consists broadly of two steps: spectral matching, which uses the frequency characteristics of the voice to identify the speaker and the phonemic content, and phase matching, which uses the waveform characteristics to identify whether the voice is recorded and reproduced voice.

[0050] In spectral matching, when a voice is input, the above parameters calculated for each analysis frame are subjected to DP matching, which establishes a correspondence between the analysis frames of the registered voice and those of the input voice while calculating the distance between the two (S3). The obtained distance is compared with a predetermined threshold (S4). If the distance is equal to or greater than the threshold, the two voices are judged to have been uttered by different speakers and the input is rejected (S5). This is the same as in the prior art.
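The DP matching step described above can be sketched as follows. This is a minimal illustration only: the text does not specify the local distance, the step pattern, or the normalization, so the Euclidean frame distance, the three-way step rule, and the division by path length are all assumptions, and `dp_match` is a hypothetical name.

```python
from math import dist, inf


def dp_match(registered, observed):
    """Align two sequences of feature vectors by dynamic programming.

    Returns (normalized_distance, path), where path is a list of
    (registered_index, observed_index) pairs in time order.
    Assumption: Euclidean local distance and a symmetric step pattern.
    """
    n, m = len(registered), len(observed)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(registered[i - 1], observed[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # both sequences advance
                cost[i - 1][j],      # registered advances (fast input speech)
                cost[i][j - 1],      # observed advances (drawn-out input speech)
            )

    # Backtrack from the end to recover the frame correspondence.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m] / len(path), path


if __name__ == "__main__":
    reg = [(0.0,), (1.0,), (2.0,)]
    obs = [(0.0,), (0.0,), (1.0,), (2.0,)]  # first frame drawn out
    print(dp_match(reg, obs))
```

The returned path is what the later phase matching stage would use to find the input frame that corresponds to each test frame of the registered voice.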

[0051] If the distance is equal to or less than the threshold, attention is further directed to the phase information, and it is determined whether the input voice is a reproduced voice (phase matching).

[0052] Incidentally, when a plurality of input voice frames correspond to the test frames of the registered voice, for example because the utterance was drawn out, the same number of frames as the number of test frames is selected from the corresponding input voice frames. Conversely, when a plurality of test frames correspond to the same input voice frame, for example because of rapid speech, only the frame located near the center of the consecutive test frames is used.
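The second case above (several test frames mapped to one input frame) can be illustrated with a short sketch. It assumes the frame correspondence is available as a list of `(registered_index, input_index)` pairs, as a DP alignment would produce; the function name and the choice of the lower-middle frame when the run has an even length are assumptions. The first case (one test frame mapped to several input frames) is not handled here, because the text does not say how the input frames are chosen.

```python
from collections import defaultdict


def keep_central_test_frame(path, test_frames):
    """For each input frame, keep only the central one of the test frames
    aligned to it.

    path        -- list of (registered_index, input_index) pairs
    test_frames -- set of registered-frame indices labeled as test frames
    Returns a list of (test_frame_index, input_index) pairs.
    """
    by_input = defaultdict(list)
    for reg_idx, in_idx in path:
        if reg_idx in test_frames:
            by_input[in_idx].append(reg_idx)

    pairs = []
    for in_idx in sorted(by_input):
        run = sorted(by_input[in_idx])
        central = run[(len(run) - 1) // 2]  # assumption: lower middle on ties
        pairs.append((central, in_idx))
    return pairs


if __name__ == "__main__":
    path = [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)]
    print(keep_central_test_frame(path, {1, 2, 3, 4}))
```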

[0053] Here, it may be determined again whether the corresponding input voice frame satisfies the conditions for a test frame. If it does not, the several frames before and after it may be examined for a frame that satisfies the conditions, and if such a frame exists, that frame may be used in place of the corresponding frame.
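The fallback described above can be sketched as a small search over neighboring frames. The search radius and the nearest-first order are assumptions (the text says only "several frames before and after"), and `is_stable` stands in for whatever per-frame stability check the device applies.

```python
def pick_usable_frame(index, is_stable, radius=2):
    """Return `index` if that frame satisfies the test-frame condition,
    otherwise the nearest neighboring frame within `radius` that does,
    or None if there is none.

    is_stable -- list of booleans, one per input frame (assumed input)
    radius    -- how many frames to look at on each side (assumed value)
    """
    if is_stable[index]:
        return index
    for offset in range(1, radius + 1):
        for candidate in (index - offset, index + offset):
            if 0 <= candidate < len(is_stable) and is_stable[candidate]:
                return candidate
    return None


if __name__ == "__main__":
    print(pick_usable_frame(2, [False, True, False, False, True]))
```

Returning `None` leaves the caller free to drop that test frame from the phase comparison, which is one possible handling and not something the text prescribes.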

[0054] Next, a method for obtaining the phase difference between the fundamental wave (period T) and the second harmonic using the test frames obtained above will be described (S6). Based on the result of an analysis such as an FFT, the phase of the fundamental wave and the phase of the second harmonic at the center of the analysis window are each obtained. The position within the analysis window must be the same for the fundamental wave and the second harmonic, and because the error increases toward both ends of the analysis window, it is preferable to use the center position.
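As a rough illustration of reading both phases at the same position in the window, the sketch below evaluates a single DFT bin and shifts its phase from the start of the window to the center. It assumes the analysis window spans a whole number of fundamental periods, so that each harmonic falls exactly on a DFT bin, and it expresses the phase as a cosine phase in radians; both are assumptions, and the function name is hypothetical.

```python
import cmath
import math


def bin_phase_at_center(frame, k):
    """Phase (radians, in (-pi, pi]) of DFT bin k, referred to the window center.

    The DFT phase is referenced to sample 0, so it is advanced by the
    phase that bin k accumulates over the first half of the window.
    """
    n = len(frame)
    coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
    phase_at_start = cmath.phase(coeff)
    phase_at_center = phase_at_start + 2 * math.pi * k * (n // 2) / n
    # Wrap back into (-pi, pi].
    return math.atan2(math.sin(phase_at_center), math.cos(phase_at_center))


if __name__ == "__main__":
    n = 64  # one fundamental period per window (assumed)
    frame = [math.cos(2 * math.pi * i / n + 0.3)
             + 0.5 * math.cos(2 * math.pi * 2 * i / n - 1.0)
             for i in range(n)]
    print(bin_phase_at_center(frame, 1))  # fundamental
    print(bin_phase_at_center(frame, 2))  # second harmonic
```

Because both phases are taken at the same sample position, their difference does not depend on where the window happens to start within the pitch period.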

[0055] FIG. 9 is a diagram for explaining the definition of the phase difference. The upper waveform shows the fundamental wave, and the lower waveform shows the second harmonic, whose phase is delayed by δ.

[0056] The phase (θ₁) of the fundamental wave is given by

[Equation 3]

where d₁ is the number of points from the center of the analysis window to the peak position of the waveform. Similarly, the phase (θ₂) of the second harmonic is given by

[Equation 4]

The phase difference u is defined as follows, as the quantity corresponding to δ in FIG. 9 normalized by the period of the second harmonic.

[0057]

[Equation 5]

The phase difference (v) between the fundamental wave and the third harmonic can be obtained in the same way.

[0058]

[Equation 6]

The difference in each of the values (u, v) is computed between each test frame of the registered voice and the corresponding frame of the input voice, and on that basis it is determined whether the input voice is a recorded voice or a live voice. The following D_i is used as the evaluation value for test frame i (S7).

[0059]

[Equation 7]

D_i is calculated for every test frame. Whether the input is a reproduced voice is determined using a condition such as the sum of these values being equal to or less than a predetermined threshold, or every individual evaluation value being equal to or less than a predetermined threshold (S8). In the above formula, the evaluation value is the sum of two differences between the input voice and the registered voice: the difference in the phase difference between the fundamental wave and the second harmonic (first difference), and the difference in the phase difference between the fundamental wave and the third harmonic (second difference). Either the first difference or the second difference alone may also be used as the evaluation value. However, depending on the reproduction system, one of the differences may not become very large, so it is desirable to take a plurality of differences into account as in the above formula.
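The evaluation and the two acceptance conditions mentioned above can be sketched as follows. Equation 7 is not reproduced in this text, so the exact form of D_i is an assumption: here it is taken as the sum of the absolute first and second differences, without any handling of phase wrap-around. The function names and the threshold values in the usage lines are hypothetical.

```python
def evaluation_values(registered, observed):
    """Per-frame evaluation values D_i.

    registered, observed -- lists of (u, v) pairs for corresponding frames,
    where u and v are the normalized phase differences for the second and
    third harmonics. Assumed form: D_i = |u_reg - u_in| + |v_reg - v_in|.
    """
    return [abs(ur - ui) + abs(vr - vi)
            for (ur, vr), (ui, vi) in zip(registered, observed)]


def passes_sum_rule(values, threshold):
    """Condition 1: the sum of all evaluation values is within the threshold."""
    return sum(values) <= threshold


def passes_per_frame_rule(values, threshold):
    """Condition 2: every individual evaluation value is within the threshold."""
    return all(v <= threshold for v in values)


if __name__ == "__main__":
    reg = [(0.10, 0.40), (0.20, 0.50)]
    live = [(0.12, 0.43), (0.18, 0.55)]
    played = [(0.45, 0.70), (0.60, 0.85)]
    print(passes_per_frame_rule(evaluation_values(reg, live), 0.4))
    print(passes_per_frame_rule(evaluation_values(reg, played), 0.4))
```

Using only the first or only the second difference, as the text permits, would amount to dropping one of the two terms in `evaluation_values`.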

[0060] If, as a result of the matching, the difference from the reference data is equal to or greater than the threshold, the input is rejected (S9). If it is within the threshold, the input is accepted as the same voice as that of the registered speaker, and, for example, an unlock signal for the electric lock of a door is output (S10). In this way, only a voice judged to be the same in both stages of matching is accepted. Various methods can be used for the determination in S9. For example, rejection may be decided when the number of times the matching result is equal to or greater than the threshold reaches a predetermined count, or various kinds of statistical processing may be applied to the matching results and the outcome evaluated.
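The two-stage decision, together with the count-based variant of S9 mentioned above, can be sketched as below. All names, the returned strings, and the parameter values in the usage lines are hypothetical; the sketch assumes the spectral distance and the per-frame evaluation values have already been computed.

```python
def verify(spectral_distance, phase_values,
           spectral_threshold, phase_threshold, max_exceed):
    """Two-stage verification.

    Stage 1 (spectral matching): reject if the distance is at or above
    the spectral threshold.
    Stage 2 (phase matching, count-based variant): reject if the number of
    frames whose evaluation value is at or above the phase threshold
    reaches max_exceed.
    """
    if spectral_distance >= spectral_threshold:
        return "reject: different speaker"
    exceed_count = sum(1 for value in phase_values if value >= phase_threshold)
    if exceed_count >= max_exceed:
        return "reject: reproduced voice"
    return "accept"


if __name__ == "__main__":
    print(verify(0.2, [0.10, 0.70, 0.15], 0.5, 0.4, 2))
    print(verify(0.2, [0.10, 0.70, 0.65], 0.5, 0.4, 2))
```

Counting exceedances rather than requiring every frame to pass tolerates an occasional poorly aligned frame, at the cost of one more parameter to tune.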

[0061] FIG. 10 shows an example of evaluation values obtained by analyzing the phase information of a male speaker's voice. The horizontal axis represents the analysis frame number, and the vertical axis represents the evaluation value. In the figure, all results of comparing live voices with each other fall in the region where the evaluation value is 0.20 or less, whereas all results of comparing a live voice with a recorded and reproduced voice fall in the region where the evaluation value is 0.60 or more. This shows that recorded and reproduced voice is clearly separated from live voice.

[0062] FIG. 11 shows an example of evaluation values obtained by analyzing the phase information of a female speaker's voice. Although the difference in evaluation values between recorded and reproduced voice and live voice is slightly smaller than for the male voice, the two are separated just as reliably.

[0063] In this embodiment, the phase differences between the fundamental wave and its second and third harmonics are used, but the invention is not limited to this. The phase difference between the second harmonic and the third harmonic may be used, or the fourth and higher harmonics may also be used.

[0064] As for the features used in spectral matching, features generally used in voice-based personal ID devices, such as the pattern of change of the fundamental frequency, may be used in addition to spectral envelope information. Furthermore, besides the method described above of establishing the correspondence between analysis frames by DP matching, the vowel portions of the voice may be extracted using an HMM (Hidden Markov Model) or the like, and, among the frames in those portions that satisfy the test-frame conditions, the top few frames that best meet the conditions may be used for spectral matching and phase matching.

[0065]

[Effects of the Invention]

As described in detail above, according to the present invention, highly reliable identification of input voice can be realized. Also according to the present invention, it is possible to determine with high accuracy whether the input voice is a live voice or a reproduced voice. Further according to the present invention, the difference in properties between a live voice and its reproduced voice can be used for evaluating voice.

[Brief description of the drawings]

FIG. 1 is a waveform diagram showing the signal waveforms of a reproduced voice after recording and of a live voice.

FIG. 2 is an explanatory diagram showing the synthesis of three signals.

FIG. 3 is an explanatory diagram showing the synthesis of three signals.

FIG. 4 is a block diagram showing the basic configuration of the voice identification device.

FIG. 5 is a flowchart showing the flow of processing at the time of voice registration.

FIG. 6 is a flowchart showing the flow of processing at the time of voice verification.

FIG. 7 is a diagram showing an FFT analysis result and an unwrap processing result.

FIG. 8 is a diagram showing an FFT analysis result and an unwrap processing result.

FIG. 9 is a diagram for explaining the definition of the phase difference.

FIG. 10 is a diagram showing an analysis result of phase information.

FIG. 11 is a diagram showing an analysis result of phase information.

[Explanation of symbols]

1: input unit, 2: feature extraction unit, 3: verification unit, 4: storage unit, 5: output unit.

Claims

[Claims]

1. A voice identification device comprising: voice input means for inputting a voice; phase information extraction means for extracting phase information from the input voice; and evaluation means for evaluating the input voice by comparing phase information of a verification voice with the phase information of the input voice.

2. The voice identification device according to claim 1, wherein the phase information is information on at least one of a phase difference between a fundamental wave of a voice and a harmonic thereof and a phase difference between harmonics.

3. The voice identification device according to claim 1, wherein the evaluation means includes identification means for identifying whether the input voice is a reproduced voice based on a comparison of the two pieces of phase information.

4. The voice identification device according to claim 3, wherein the phase information extraction means includes: means for obtaining, as the phase information of the verification voice, a phase difference Am between a fundamental wave of the verification voice and its m-th harmonic and a phase difference An between the fundamental wave of the verification voice and its n-th harmonic; and means for obtaining, as the phase information of the input voice, a phase difference Bm between a fundamental wave of the input voice and its m-th harmonic and a phase difference Bn between the fundamental wave of the input voice and its n-th harmonic, and wherein the identification means identifies the input voice based on a difference between the phase difference Am and the phase difference Bm and a difference between the phase difference An and the phase difference Bn.

5. The voice identification device according to claim 1, wherein the phase information extraction means includes: preliminary analysis means for estimating a fundamental wave of the input voice; main analysis means for performing frequency analysis of the input voice based on the estimated fundamental wave; and extraction means for extracting the phase information from a frequency analysis result of the main analysis means.

6. The voice identification device according to claim 5, wherein the preliminary analysis means performs frequency analysis by cutting out the input voice frame by frame with a fixed time window having a fixed window width, and the main analysis means sets a variable time window whose window width is variably set based on the estimated fundamental wave and performs frequency analysis by cutting out the input voice frame by frame with the variable time window.

7. The voice identification device according to claim 6, wherein the phase information is extracted from a frame of the voice signal that satisfies a predetermined stability condition.

8. A voice identification method characterized by identifying whether an input voice is a live voice or a reproduced voice by using the fact that phase information differs between a live voice and a reproduced voice obtained by recording and playing back that voice.

9. A voice identification device comprising: determination means for determining the speaker identity of a verification voice and an input voice; and identification means for identifying, based on phase information of the input voice for which the identity has been determined, whether that input voice is a reproduced voice.