JP6430318B2

JP6430318B2 - Unauthorized voice input determination device, method and program

Info

Publication number: JP6430318B2
Application number: JP2015077541A
Authority: JP
Inventors: 隆伸大庭; 太一浅見; 阪内　澄宇; 澄宇阪内
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-04-06
Filing date: 2015-04-06
Publication date: 2018-11-28
Anticipated expiration: 2035-04-06
Also published as: JP2016197200A

Description

The present invention relates to a technique for determining whether input speech is unauthorized.

Speaker recognition by voice (hereinafter simply referred to as speaker recognition) is broadly divided into verification and identification.

Speaker verification is used, for example, to confirm a person's identity. In speaker verification, the user first declares a user name to the system and then inputs speech. The system determines whether the input speech really belongs to the declared user.

Speaker identification, on the other hand, determines whose voice the input speech is. The pre-registered person whose voice is most similar is returned as the result.

In some cases, verification and identification are combined. In this case, it is first determined whether the speaker is a non-registered speaker (verification), and if the speaker is registered, it is further determined who specifically the speaker is (identification). These are collectively called speaker recognition.

Speaker recognition is either text-dependent or text-independent. In the text-dependent type, the user reads a predetermined sentence aloud when recognition is performed. In the text-independent type, the user may utter any words.

Speaker recognition requires speech to be registered in advance. One or more utterances are registered. For convenience, it is desirable that recognition works properly even when the registered utterances are short and few.

Speaker recognition is realized by calculating feature quantities from each input utterance and applying existing techniques such as outlier detection and classification algorithms. Speaker verification only needs a binary decision between the registered speaker and anyone else, so outlier detection or a binary classification algorithm can be used. Speaker identification is simply a multi-class classification problem. The techniques described in Non-Patent Documents 1 and 2, for example, are known as techniques actually used in speaker recognition. Non-Patent Document 2 also describes the feature quantities used for speaker recognition.
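The distinction between the binary verification decision and the multi-class identification decision can be illustrated with a minimal sketch. It assumes that each utterance has already been reduced to a fixed-length feature vector and that similarity is measured by cosine similarity; the function names, the template dictionary, and the threshold of 0.8 are hypothetical choices for illustration and are not taken from the Non-Patent Documents.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def verify(features, claimed_template, threshold=0.8):
    """Verification: binary decision, accept or reject the declared user."""
    return cosine(features, claimed_template) >= threshold


def identify(features, templates):
    """Identification: return the registered speaker with the closest template."""
    return max(templates, key=lambda name: cosine(features, templates[name]))


# Hypothetical registered templates and a probe utterance.
templates = {"alice": [1.0, 0.0, 0.2], "bob": [0.1, 1.0, 0.3]}
probe = [0.9, 0.1, 0.2]

accepted_as_alice = verify(probe, templates["alice"])
accepted_as_bob = verify(probe, templates["bob"])
closest_speaker = identify(probe, templates)
```

A real system would replace the cosine comparison with one of the classifiers or outlier detectors mentioned above; the sketch only shows how the two decision types differ.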

Next, an outline of vowel segment detection is given. Vowel segment detection takes speech as input and detects the segments judged to be vowels. Processing basically focuses on the formant components contained in the speech (see, for example, Non-Patent Document 3). Formants are peaks that appear in the low-frequency components when speech is frequency-analyzed by a Fourier transform or the like. Multiple peaks appear, and they have harmonicity. The peaks are generally called the first formant, the second formant, and so on, in order from the lowest frequency. The spacing, magnitude, and frequency values of these formants are collectively called the harmonic structure or formant structure. Because formants appear prominently in vowels but not in consonants, they are an important feature for vowel segment detection.

In practice, formants are also extracted from some consonants (voiced consonants). Vowel segments are therefore estimated by taking the structure into account; that is, processing determines whether a segment has a vowel-like harmonic structure.

Consequently, when a general signal, not limited to speech, is input to vowel segment detection, a signal segment whose harmonic structure resembles that of a vowel is likely to be judged a vowel. The sound of a musical instrument is one example. Conversely, a segment without such a structure is unlikely to be judged a vowel.

Non-Patent Document 1: Tetsuji Ogawa and Tomoko Matsui, "Machine Learning for Speaker Recognition", Journal of the Acoustical Society of Japan, Vol. 69, No. 7, pp. 349-356, 2013.
Non-Patent Document 2: Longbiao Wang, Masafumi Nishida, Satoru Tsuge, and Kanae Amino, "Robustness in Speaker Recognition", Journal of the Acoustical Society of Japan, Vol. 69, No. 7, pp. 357-364, 2013.
Non-Patent Document 3: Misaki Tsuji, Takayuki Arai, and Nao Hodoshima, "A Simple Automatic Detection Method for Vowel Segments of Speech: Aiming to Improve Speech Intelligibility in Reverberant Environments", Proceedings of the Autumn Meeting of the Acoustical Society of Japan, pp. 329-330, September 2010.

Speaker recognition is a technique that presupposes appropriate speech input. It is therefore useful for a speaker recognition system to have a function that determines whether the speech input is appropriate. This is especially important at registration, because recognition cannot work correctly if the registered speech signal is inappropriate.

The speech that speaker recognition technology presupposes is speech in which words are uttered, but it is difficult to define this strictly, or to present the point clearly to users so that they understand it. For example, speech consisting only of consonants might be said to utter words, but it is difficult to extract speaker characteristics properly from consonants alone, so a speaker recognition system at the current state of the art should treat it as unauthorized input. An example is "sss-s" (only the consonant "s" is voiced, without the vowel "u"). Even vowels, when extremely prolonged, are technically somewhat difficult to handle; an example is saying "aaaah" continuously for several seconds. Inputs dominated by sounds such as throat noises, tongue trills or clicks, blowing or inhaling breath, whistling, lip smacks, or coughs are also assumed to be unauthorized. These are difficult to remove with existing speech segment detection technology.

In text-dependent speaker recognition, the appropriateness of the input speech can be judged by, for example, using speech recognition to compare the prescribed text with the actual utterance content. In the text-independent case, however, how to do this is not obvious.

An object of the present invention is to provide an unauthorized speech input determination device, a speech signal processing device, a method, and a program that can determine whether input speech is unauthorized even in text-independent speech signal processing.

An unauthorized speech input determination device according to one aspect of the present invention includes a speech segment detection unit that generates a speech segment signal by extracting the voiced sections from an input speech signal, a vowel segment detection unit that detects vowel segments of the speech segment signal, and a vowel segment detection result analysis unit that determines, based on the length of the detected vowel segments, whether the speech signal is unauthorized speech. When the speech signal is determined not to be unauthorized speech, the vowel segment detection result analysis unit transmits the input speech signal or the speech segment signal to a signal processing unit that performs speaker recognition processing using the speech signal or the speech segment signal.

An unauthorized speech input determination device according to another aspect of the present invention includes a speech segment detection unit that generates a speech segment signal by extracting the voiced sections from an input speech signal, a vowel segment detection unit that detects vowel segments of the speech segment signal, and a vowel segment detection result analysis unit that determines, based on the length of the detected vowel segments, whether the speech signal is unauthorized speech. When the speech signal is determined not to be unauthorized speech, the vowel segment detection result analysis unit transmits the input speech signal or the speech segment signal to a signal processing unit that performs speaker registration processing using the speech signal or the speech segment signal.

A speech signal processing device according to one aspect of the present invention includes the above unauthorized speech input determination device and a signal processing unit that performs speaker recognition processing using the speech signal when the speech signal is determined not to be unauthorized speech.

A speech signal processing device according to another aspect of the present invention includes the above unauthorized speech input determination device and a signal processing unit that performs speaker registration processing using the speech signal when the speech signal is determined not to be unauthorized speech.

Even in text-independent speech signal processing, it is possible to determine whether input speech is unauthorized.

A block diagram for explaining an example of the unauthorized speech input determination device and the speech signal processing device.
A flowchart for explaining an example of the unauthorized speech input determination method and the speech signal processing method.

[Technical background]
One feature of the present invention is that vowel segment detection is applied to the input speech and the result is used to judge unauthorized speech input. The invention exploits the property of vowel segment detection that an input sound is unlikely to be judged a vowel unless its harmonic structure resembles that of a vowel.

Among unauthorized speech inputs in speaker recognition, consonant-only sounds, throat noises, tongue trills or clicks, blowing or inhaling breath, lip smacks, coughs, and the like have harmonic structures that differ from those of vowels. They are therefore unlikely to be judged vowels by vowel segment detection.

In fact, because speaker recognition on speech containing almost no vowels is technically difficult, treating a signal without vowel-like features as unauthorized input to speaker recognition is a reasonable option.

On the other hand, for unauthorized inputs such as extremely prolonged vowels or a melody whistled note by note, a long continuous section is judged to be a vowel. Ordinary words contain consonants between vowels, so a long continuous vowel section is rare. Unauthorized speech can therefore be distinguished from ordinary speech by the length of the detected vowel segments.

[Embodiment]
The speech signal processing device includes, for example, an unauthorized speech input determination device 1 and a signal processing unit 2. The unauthorized speech input determination device 1 includes, for example, a speech segment detection unit 11, a vowel segment detection unit 13, and a vowel segment detection result analysis unit 14.

<Speech segment detection unit 11>
The input speech signal is passed to the speech segment detection unit 11 of the unauthorized speech input determination device 1.

The speech segment detection unit 11 generates a speech segment signal by extracting the voiced sections from the input speech signal (step S11). In other words, the speech segment detection unit 11 removes the silent parts from the input speech signal. The generated speech segment signal is passed to the vowel segment detection unit 13.

An existing technique may be used to extract the voiced sections. For example, a section in which the magnitude of the speech signal is at or above a predetermined threshold is judged to be a voiced section, a section in which the magnitude is below the threshold is judged to be a silent section, and the speech segment signal is generated by concatenating only the sections judged to be voiced.
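The threshold-based extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the signal is a list of amplitude samples, "magnitude" is taken to be the mean absolute amplitude of a fixed-length frame, and the names `extract_voiced`, `frame_len`, and `threshold` and their default values are hypothetical rather than specified in the source.

```python
def extract_voiced(samples, frame_len=4, threshold=0.1):
    """Concatenate the frames whose mean absolute amplitude is >= threshold.

    Frames below the threshold are treated as silence and dropped.
    """
    voiced = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(s) for s in frame) / len(frame)
        if level >= threshold:
            voiced.extend(frame)
    return voiced


# Hypothetical input: silence, a short burst of sound, then silence again.
signal = [0.0, 0.01, 0.0, -0.01, 0.5, -0.4, 0.3, -0.2, 0.0, 0.0, 0.01, 0.0]
speech_segment_signal = extract_voiced(signal)
```

In this toy input only the middle frame survives, so the resulting speech segment signal contains just the four loud samples.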

<Vowel segment detection unit 13>
The vowel segment detection unit 13 detects the vowel segments of the speech segment signal (step S13). Vowel segment information, which is information about the detected vowel segments, is passed to the vowel segment detection result analysis unit 14.

The vowel segment information is, for example, the start time and duration of a vowel segment. When multiple vowel segments are detected in the speech segment signal, vowel segment information is generated for each of them and passed to the vowel segment detection result analysis unit 14. In the following, I is an integer of 1 or more denoting the number of detected vowel segments, and V(i) denotes the duration of the i-th detected vowel segment, where i = 1, ..., I.

<Vowel segment detection result analysis unit 14>
The vowel segment detection result analysis unit 14 determines, based on the length of the detected vowel segments, whether the speech signal is unauthorized speech (step S14).

One specific way to configure the vowel segment detection result analysis unit 14 is to compute the proportion of the signal occupied by vowel segments and judge unauthorized speech input by thresholding it. Writing L for the length of the speech segment signal, the proportion R occupied by vowel segments can be calculated as

$$R = \frac{1}{L} \sum_{i=1}^{I} V(i)$$

The vowel segment detection result analysis unit 14 judges the input speech signal to be unauthorized speech when R is smaller than a predetermined threshold TR, and otherwise judges that it is not unauthorized speech. R takes a value between 0 and 1 inclusive. The threshold TR is a value determined in advance by the system developer within the range R can take; that is, TR is set to a predetermined value between 0 and 1 inclusive. The larger TR is, the more speech is judged unauthorized.

This method is aimed at detecting unauthorized input that lacks a harmonic structure similar to that of vowels.
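The ratio test can be sketched as below, taking R as the total detected vowel duration divided by the length L of the speech segment signal. The function names, the default threshold of 0.3, and the handling of a zero-length signal (treated here as R = 0) are assumptions made for illustration; the source leaves the threshold to the system developer.

```python
def vowel_ratio(durations, total_length):
    """R: fraction of the speech segment signal occupied by vowel segments."""
    if total_length <= 0:
        return 0.0  # assumption: no speech at all counts as no vowels
    return sum(durations) / total_length


def is_unauthorized_by_ratio(durations, total_length, tr=0.3):
    """Judge the input unauthorized when R < TR."""
    return vowel_ratio(durations, total_length) < tr


# Hypothetical durations V(i) in seconds for a 2-second speech segment signal.
mostly_consonants = is_unauthorized_by_ratio([0.1], 2.0)    # R = 0.05
ordinary_speech = is_unauthorized_by_ratio([0.5, 0.5], 2.0)  # R = 0.5
```

An input with no detected vowel segments gives R = 0 and is judged unauthorized for any positive TR.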

Another specific way to configure the vowel segment detection result analysis unit 14 is to threshold the maximum normalized vowel segment length. The maximum normalized vowel segment length M is calculated, for example, as

$$M = \frac{1}{L} \max_{1 \le i \le I} V(i)$$

The vowel segment detection result analysis unit 14 judges the input speech signal to be unauthorized speech when M is larger than a predetermined threshold TM, and otherwise judges that it is not unauthorized speech. M takes a value between 0 and 1 inclusive. The threshold TM is a value determined in advance by the system developer within the range M can take. The smaller TM is, the more speech is judged unauthorized.

This method is aimed at detecting unauthorized input such as extremely prolonged vowels.
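A minimal sketch of this second test follows. It assumes that the normalization divides the longest vowel segment duration by the length L of the speech segment signal, which keeps M between 0 and 1; the function names and the default threshold of 0.6 are hypothetical.

```python
def max_normalized_vowel_length(durations, total_length):
    """M: longest single vowel segment as a fraction of the speech segment signal."""
    if total_length <= 0 or not durations:
        return 0.0  # assumption: no vowel segments means M = 0
    return max(durations) / total_length


def is_unauthorized_by_max(durations, total_length, tm=0.6):
    """Judge the input unauthorized when M > TM."""
    return max_normalized_vowel_length(durations, total_length) > tm


# Hypothetical durations V(i) in seconds.
prolonged_vowel = is_unauthorized_by_max([2.5], 3.0)             # M is about 0.83
ordinary_speech = is_unauthorized_by_max([0.2, 0.3, 0.25], 2.0)  # M = 0.15
```

Unlike the ratio test, this test looks only at the single longest vowel segment, so many short vowels separated by consonants do not trigger it.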

The two methods above may also be applied in two stages, and a speech signal judged unauthorized by either one may be judged unauthorized. That is, the vowel segment detection result analysis unit 14 may judge the input speech signal to be unauthorized speech when R < TR or M > TM, and judge that it is not unauthorized speech otherwise.
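The combined decision can be sketched in a single function. It assumes R is the total vowel duration divided by L and M is the longest vowel duration divided by L; the function name, the default thresholds, and the choice to treat a zero-length signal as unauthorized are illustrative assumptions.

```python
def is_unauthorized(durations, total_length, tr=0.3, tm=0.6):
    """Judge the input unauthorized when R < TR or M > TM."""
    if total_length <= 0:
        return True  # assumption: an empty speech segment signal is rejected
    r = sum(durations) / total_length                 # proportion of vowels
    m = max(durations, default=0.0) / total_length    # longest vowel, normalized
    return r < tr or m > tm


# Hypothetical cases, durations V(i) in seconds.
ordinary = is_unauthorized([0.3, 0.4, 0.3], 2.0)   # R = 0.5, M = 0.2
consonants_only = is_unauthorized([], 2.0)         # R = 0
long_vowel = is_unauthorized([2.8], 3.0)           # M is about 0.93
```

The two conditions catch different failure modes, so an input must have enough vowel content overall and no single overly long vowel to pass.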

Of course, unauthorized speech input may also be judged from the result of vowel segment detection by methods other than those described above.

When the input speech signal is judged to be unauthorized speech, the vowel segment detection result analysis unit 14 notifies the user or the like to that effect. When the input speech signal is judged not to be unauthorized speech, the unit passes the input speech signal or the speech segment signal to the signal processing unit 2.

<Signal processing unit 2>
When the input speech signal is judged not to be unauthorized speech, the signal processing unit 2 performs speaker recognition or speaker registration processing using that speech signal or the speech segment signal.

For speaker recognition or speaker registration, existing techniques may be used, such as those described in the background section or in Non-Patent Documents 1 and 2.

[Modification]
The unauthorized speech input determination device 1 may include a noise suppression/removal unit 12. In this case, the speech segment signal generated by the speech segment detection unit 11 is passed to the noise suppression/removal unit 12. The noise suppression/removal unit 12 suppresses or removes noise in the speech segment signal (step S12). The speech segment signal whose noise has been suppressed or removed is passed to the vowel segment detection unit 13. The vowel segment detection unit 13 then detects, in the same manner as described above, the vowel segments of the speech segment signal whose noise has been suppressed or removed by the noise suppression/removal unit 12.

An existing technique may be used for noise suppression or removal. Because vowel segment detection is applied after this noise suppression or removal, a noise suppression method capable of suppressing the kinds of noise that degrade the accuracy of vowel segment detection (for example, noise signals whose harmonic structure resembles that of vowels) may be used.

The processing described for the unauthorized speech input determination device, the speech signal processing device, and the methods need not be executed in time series in the order described; it may also be executed in parallel or individually, depending on the processing capability of the device executing it or as needed.

Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

[Program and recording medium]
When each process in the unauthorized speech input determination device and the speech signal processing device is realized by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes each process on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

DESCRIPTION OF SYMBOLS
1 Unauthorized speech input determination device
11 Speech segment detection unit
12 Noise suppression/removal unit
13 Vowel segment detection unit
14 Vowel segment detection result analysis unit
2 Signal processing unit

Claims

An unauthorized speech input determination device comprising:
a speech segment detection unit that generates a speech segment signal by extracting the voiced sections from an input speech signal;
a vowel segment detection unit that detects vowel segments of the speech segment signal; and
a vowel segment detection result analysis unit that determines, based on the length of the detected vowel segments, whether the speech signal is unauthorized speech,
wherein, when the speech signal is determined not to be unauthorized speech, the vowel segment detection result analysis unit transmits the input speech signal or the speech segment signal to a signal processing unit that performs speaker recognition processing using the speech signal or the speech segment signal.

An unauthorized speech input determination device comprising:
a speech segment detection unit that generates a speech segment signal by extracting the voiced sections from an input speech signal;
a vowel segment detection unit that detects vowel segments of the speech segment signal; and
a vowel segment detection result analysis unit that determines, based on the length of the detected vowel segments, whether the speech signal is unauthorized speech,
wherein, when the speech signal is determined not to be unauthorized speech, the vowel segment detection result analysis unit transmits the input speech signal or the speech segment signal to a signal processing unit that performs speaker registration processing using the speech signal or the speech segment signal.

The unauthorized speech input determination device according to claim 1 or 2,
further comprising a noise suppression/removal unit that suppresses or removes noise in the speech segment signal,
wherein the vowel segment detection unit detects vowel segments of the speech segment signal whose noise has been suppressed or removed by the noise suppression/removal unit.

An unauthorized speech input determination method comprising:
a speech segment detection step in which a speech segment detection unit generates a speech segment signal by extracting the voiced sections from an input speech signal;
a vowel segment detection step in which a vowel segment detection unit detects vowel segments of the speech segment signal; and
a vowel segment detection result analysis step in which a vowel segment detection result analysis unit determines, based on the length of the detected vowel segments, whether the speech signal is unauthorized speech,
wherein, in the vowel segment detection result analysis step, when the speech signal is determined not to be unauthorized speech, the input speech signal or the speech segment signal is transmitted to a signal processing unit that performs speaker recognition processing using the speech signal or the speech segment signal.

An unauthorized speech input determination method comprising:
a speech segment detection step in which a speech segment detection unit generates a speech segment signal by extracting the voiced sections from an input speech signal;
a vowel segment detection step in which a vowel segment detection unit detects vowel segments of the speech segment signal; and
a vowel segment detection result analysis step in which a vowel segment detection result analysis unit determines, based on the length of the detected vowel segments, whether the speech signal is unauthorized speech,
wherein, in the vowel segment detection result analysis step, when the speech signal is determined not to be unauthorized speech, the input speech signal or the speech segment signal is transmitted to a signal processing unit that performs speaker registration processing using the speech signal or the speech segment signal.

A program for causing a computer to function as each unit of the unauthorized speech input determination device according to any one of claims 1 to 3.