JP5342629B2

JP5342629B2 - Male and female voice identification method, male and female voice identification device, and program

Info

Publication number: JP5342629B2
Application number: JP2011223680A
Authority: JP
Inventors: 光昭磯貝; 哲小橋川; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-10-11
Filing date: 2011-10-11
Publication date: 2013-11-13
Anticipated expiration: 2031-10-11
Also published as: JP2013083796A

Abstract

PROBLEM TO BE SOLVED: To accurately identify the gender of the speaker of a speech signal even when the input speech signal is extremely short.

SOLUTION: A male/female voice identification method includes extracting a speech feature amount from an input speech signal, and identifying the gender of the speaker of the speech signal on the basis of likelihoods obtained by collating the speech feature amount with a male voice acoustic model and a female voice acoustic model. When the time length of the speech signal is less than a predetermined time length L, the speech signal is extended by repeating it until its time length is L or more, the speech feature amount is extracted from the extended speech signal, and the collation and identification are performed using a recognition grammar that corresponds to the repetition.

COPYRIGHT: (C)2013,JPO&INPIT

Description

The present invention relates to a male/female voice identification method, a male/female voice identification device, and a program for identifying the gender of the speaker of an input speech signal.

Male/female voice identification, which identifies the gender of a speaker from an input speech signal, is important not only for identifying gender itself but also, for example, for improving the accuracy of speech recognition.

Conventionally, to identify whether an input speech signal is a male voice or a female voice, a speech feature amount is extracted from the speech signal and collated with an acoustic model for male voices and an acoustic model for female voices, each created by statistical modeling such as a GMM (Gaussian Mixture Model), to obtain likelihoods. The male or female voice is then identified on the basis of those likelihoods.

Patent Document 1 describes collating the speech feature amount extracted from an input speech signal with an acoustic model for male voices and an acoustic model for female voices to obtain likelihoods, and male and female voices can be identified on the basis of these likelihoods.

FIG. 11 shows a configuration example of a male/female voice identification device that identifies the gender of the speaker of an input speech signal by the method described above. The device comprises a male/female voice identification processing unit 10, a male voice acoustic model 20, and a female voice acoustic model 30. The male voice acoustic model 20 includes a speech section model 21 and a non-speech section model 22; likewise, the female voice acoustic model 30 includes a speech section model 31 and a non-speech section model 32.

In this example, the male/female voice identification processing unit 10 includes a speech feature extraction unit 11, a recognition grammar setting unit 12, and an identification unit 13. The speech feature extraction unit 11 extracts the speech feature amount of the input speech signal (a digital speech signal obtained by A/D conversion). The recognition grammar setting unit 12 sets the recognition grammar used when the speech feature amount is collated with the male voice acoustic model 20 and the female voice acoustic model 30 to obtain likelihoods. The identification unit 13 uses that recognition grammar to collate the speech feature amount with the male voice acoustic model 20 and the female voice acoustic model 30, obtains the likelihoods, and identifies the gender of the speaker of the speech signal on the basis of them. The male/female voice identification processing unit 10 outputs the result of this identification.
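The likelihood comparison performed by the identification unit 13 can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual implementation: each acoustic model is reduced to a single diagonal-covariance GMM, the speech/non-speech section models and the recognition grammar are ignored, and the names `frame_log_likelihood` and `identify_gender` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One mixture component: (weight, means, variances), diagonal covariance.
Component = Tuple[float, Sequence[float], Sequence[float]]


def frame_log_likelihood(x: Sequence[float], gmm: List[Component]) -> float:
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    logs = []
    for weight, means, variances in gmm:
        ll = math.log(weight)
        for xi, m, v in zip(x, means, variances):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    peak = max(logs)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(l - peak) for l in logs))


def identify_gender(features: Sequence[Sequence[float]],
                    male_gmm: List[Component],
                    female_gmm: List[Component]) -> str:
    """Pick the model whose total log-likelihood over all frames is higher."""
    male = sum(frame_log_likelihood(x, male_gmm) for x in features)
    female = sum(frame_log_likelihood(x, female_gmm) for x in features)
    return "male" if male >= female else "female"
```

With toy one-dimensional models centred at 0.0 (male) and 3.0 (female), features near 0.0 are labelled "male" and features near 3.0 are labelled "female".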

Because male and female voices must be identified per utterance, the recognition grammar set by the recognition grammar setting unit 12 is generally recognition grammar (1) shown below. Recognition grammar (1) is written in an extended form of BNF notation.

・Recognition grammar (1)
$[p]=pause;
$[g]=garbage;
$START=$p $g $p;
Here, $[xxx]= declares a symbol; pause on the right-hand side is a symbol representing non-speech such as silence, and garbage is a symbol representing speech. $START is the start symbol representing the whole sentence. The symbol = denotes a definition, the symbol [ ] specifies the word notation, and the symbol ; marks the end of a definition.

Recognition grammar (1) assumes that sections appear in the order non-speech → speech → non-speech.

Patent Document 1: JP 2011-13543 A

However, when male/female voice identification is performed on a very short speech signal of, for example, 1 second or less, the following problem arises.

When speech feature amounts are extracted from an input speech signal, normalization such as CMN (Cepstrum Mean Normalization) or CVN (Cepstrum Variance Normalization) is generally applied to remove speaker-dependent bias in the features. However, because such normalization relies on statistical analysis of the speech feature amounts, statistically reliable results cannot be obtained unless a speech signal of sufficient length is input, and the normalization may therefore not be performed correctly.
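To make the dependence on signal length concrete, the sketch below applies mean and variance normalization over a whole utterance. It is an illustrative assumption rather than the normalization used in the source: features are plain lists of floats, the statistics are taken over all frames at once, and the function name `cmvn` is hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def cmvn(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Normalize each feature dimension to zero mean and unit variance."""
    dims = list(zip(*frames))  # one tuple of values per feature dimension
    means = [fmean(d) for d in dims]
    # Guard against zero variance, which occurs when very few frames exist.
    stds = [pstdev(d) or 1.0 for d in dims]
    return [[(x - m) / s for x, m, s in zip(f, means, stds)] for f in frames]
```

With only a handful of frames, the estimated mean and variance are dominated by whatever phonemes happen to be present, which is the instability the paragraph above describes.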

Consequently, when a very short utterance such as "yes" or "no" is input, the speaker-dependent bias that remains in the speech feature amounts extracted and normalized from that signal affects the likelihoods against the male and female acoustic models, and identification accuracy may be degraded.

In view of this problem, an object of the present invention is to provide a male/female voice identification method and a male/female voice identification device that can accurately identify the gender of the speaker of a speech signal even when the input speech signal is very short.

According to the invention of claim 1, in a male/female voice identification method that extracts a speech feature amount from an input speech signal and identifies the gender of the speaker of the speech signal on the basis of likelihoods obtained by collating the speech feature amount with a male voice acoustic model and a female voice acoustic model, when the time length of the speech signal is less than a predetermined time length L, the speech signal is extended by repeating it until its time length is L or more, the speech feature amount is extracted from the extended speech signal, and the collation and identification are performed using a recognition grammar corresponding to the repetition.

In the invention of claim 2, in the invention of claim 1, when the repetition is performed, a speech section of the speech signal is detected and only that speech section is repeated to extend the speech signal.

In the invention of claim 3, in the invention of claim 2, when the length of the detected speech section is less than a threshold T, the speech section is rejected and the identification is not performed.

In the invention of claim 4, in the invention of any one of claims 1 to 3, the time length L can be set externally.

In the invention of claim 5, in the invention of any one of claims 1 to 3, the time length L is calculated from the load of the computer that executes the male/female voice identification and from the required response time.

According to the invention of claim 6, a male/female voice identification device comprises: a speech length determination unit that determines whether the time length of an input speech signal is less than a predetermined time length L, outputs the speech signal to a speech extension unit when the time length is determined to be less than L, and outputs the speech signal to a male/female voice identification processing unit when the time length is determined to be L or more; the speech extension unit, which extends the speech signal input from the speech length determination unit by repeating it until its time length is L or more and outputs the extended speech signal to the male/female voice identification processing unit; and the male/female voice identification processing unit, which extracts the speech feature amount of the speech signal input from the speech length determination unit or from the speech extension unit, collates the speech feature amount with a male voice acoustic model and a female voice acoustic model using the recognition grammar corresponding to the speech signal from which the feature amount was extracted, and identifies and outputs the gender of the speaker of that speech signal on the basis of the resulting likelihoods.

According to the present invention, when the input speech signal is short, that is, shorter than the predetermined time length L, the speech signal is repeated and a recognition grammar corresponding to the repetition is used. This stabilizes the normalization of the speech feature amounts so that it can be performed correctly, which in turn makes it possible to accurately identify the gender of the speaker of the speech signal.

FIG. 1 is a block diagram showing the functional configuration of a male/female voice identification device that executes a first embodiment of the male/female voice identification method according to the present invention.
FIG. 2 is a flowchart showing the processing flow of the first embodiment of the male/female voice identification method according to the present invention.
FIG. 3 is a diagram showing an example of speech extension.
FIG. 4 is a diagram for explaining the effect of speech extension on the normalization of speech feature amounts.
FIG. 5 is a block diagram showing the functional configuration of a male/female voice identification device that executes a second embodiment of the male/female voice identification method according to the present invention.
FIG. 6 is a flowchart showing the processing flow of the second embodiment of the male/female voice identification method according to the present invention.
FIG. 7 is a diagram showing an example of speech extension.
FIG. 8 is a flowchart showing the processing flow of a third embodiment of the male/female voice identification method according to the present invention.
FIG. 9 is a block diagram showing the functional configuration of a male/female voice identification device that executes a fourth embodiment of the male/female voice identification method according to the present invention.
FIG. 10 is a block diagram showing the functional configuration of a male/female voice identification device that executes a fifth embodiment of the male/female voice identification method according to the present invention.
FIG. 11 is a block diagram showing the functional configuration of a male/female voice identification device that executes a conventional male/female voice identification method.

Embodiments of the present invention are described below by way of examples with reference to the drawings.

FIG. 1 shows the functional configuration of the male/female voice identification device of Embodiment 1, and FIG. 2 shows its processing flow.

In this example, the male/female voice identification device is the conventional device shown in FIG. 11 with a speech length determination unit 40 and a speech extension unit 50 added.

The speech signal whose speaker's gender is to be identified is input to the speech length determination unit 40 (step S1). The speech length determination unit 40 determines whether the time length of the input speech signal is less than the predetermined time length L (step S2). If it is less than L, the unit outputs the speech signal to the speech extension unit 50; if it is L or more, the unit outputs the speech signal to the male/female voice identification processing unit 10.

The speech extension unit 50 extends the speech signal input from the speech length determination unit 40 by repeating it until its time length is L or more (step S3), and outputs the extended speech signal to the male/female voice identification processing unit 10.

The time length L can be any value greater than 0. Its value may be determined, for example, by experimentally finding a value that effectively improves identification accuracy on a speech set from the task to which male/female voice identification is applied. Here, as an example, L = 2 seconds.

Specifically, the speech extension unit 50 extends the speech as follows. In this example, the speech extension unit 50 has a buffer 51, into which the input speech signal is copied one frame at a time, starting from its first frame. When the last frame of the input speech signal is reached, copying resumes from the first frame of the input speech signal. This is repeated until the total length of the frames held in the buffer 51 is L or more. The repeated copying may be stopped at the point when the length held in the buffer 51 becomes equal to (or exceeds) L (FIG. 3 shows an example of an extended speech signal produced in this way, together with the input speech signal). Alternatively, copying may end when the last frame of the input speech signal is reached after the length held in the buffer 51 has exceeded L.
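The frame-copying procedure above can be sketched as follows. This is a minimal sketch under assumptions: the signal is already split into frames, the time length L has been converted to a frame count `min_frames`, and the names `extend_by_repetition` and `finish_pass` are hypothetical. The `finish_pass` flag selects between the two stopping rules described in the text.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def extend_by_repetition(frames: Sequence[Frame], min_frames: int,
                         finish_pass: bool = False) -> List[Frame]:
    """Repeat `frames` cyclically until at least `min_frames` are buffered.

    finish_pass=False stops as soon as the buffer reaches min_frames.
    finish_pass=True keeps copying until the current pass over the input
    reaches its last frame.
    """
    if not frames:
        raise ValueError("input signal has no frames")
    if len(frames) >= min_frames:
        return list(frames)  # already long enough; no extension needed
    buffer: List[Frame] = []  # plays the role of buffer 51
    i = 0
    while len(buffer) < min_frames or (finish_pass and i % len(frames) != 0):
        buffer.append(frames[i % len(frames)])  # wrap around to the first frame
        i += 1
    return buffer
```

As an illustration, with 10 ms frames (an assumed frame duration, not stated in the source) and L = 2 seconds, `min_frames` would be 200.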

In general speech recognition, such speech extension is undesirable because the recognition result (the transcription of the speech) would differ from the input speech. In male/female voice identification, however, the utterance content (what is being said) does not need to be identified, so this kind of speech extension can be applied.

The male/female voice identification processing unit 10 receives speech signals from the speech length determination unit 40 and from the speech extension unit 50. The speech feature extraction unit 11 extracts the speech feature amounts of these speech signals (step S4). Because of the extension, a speech signal input from the speech extension unit 50 contains speech and non-speech alternately and may contain them several times over, so a recognition grammar that corresponds to this repetition must be used. The recognition grammar setting unit 12 sets the recognition grammar according to whether the speech signal was input from the speech length determination unit 40 or from the speech extension unit 50. When the speech signal is input from the speech length determination unit 40, recognition grammar (1) described above is used; when it is input from the speech extension unit 50, recognition grammar (2) shown below is used. Like recognition grammar (1), recognition grammar (2) is written in an extended form of BNF notation.

・Recognition grammar (2)
$[p]=pause;
$[g]=garbage;
$START=<$p|$g>;
Here, the symbol < > denotes one or more repetitions, and the symbol | denotes parallel connection (alternatives).

Recognition grammar (2) assumes that non-speech and speech appear alternately.
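The difference between the two grammars can be illustrated by treating a decoded utterance as a string of section labels, 'p' for pause and 'g' for garbage, and writing each grammar as a regular expression. This encoding is an assumption made purely for illustration; in an actual recognizer the grammar constrains decoding rather than checking a finished label string.

```python
import re

GRAMMAR_1 = re.compile(r"pgp")    # $START=$p $g $p;
GRAMMAR_2 = re.compile(r"[pg]+")  # $START=<$p|$g>;  one or more of p or g


def accepts(grammar: "re.Pattern[str]", labels: str) -> bool:
    """Return True if the whole label sequence is generated by the grammar."""
    return grammar.fullmatch(labels) is not None
```

A signal whose original form is pause, speech, pause and which is repeated twice yields the label sequence "pgppgp". Grammar (1) cannot generate it, whereas grammar (2) can, which is why the extended signal needs grammar (2).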

The identification unit 13 uses the recognition grammar set by the recognition grammar setting unit 12 to collate the speech feature amount with the male voice acoustic model 20 and the female voice acoustic model 30 and obtain the likelihoods (step S5), and identifies the gender of the speaker of the speech signal on the basis of those likelihoods (step S6). The male/female voice identification processing unit 10 outputs the result of this identification.

In this example, as described above, when the input speech signal is short, that is, shorter than the predetermined time length L, the speech signal is repeated to extend its time length. This increases the number of sections for which the speech feature average used in identifying the speaker's gender can be obtained.

FIG. 4(B) illustrates this, and FIG. 4(A) shows, for comparison, the conventional case in which no speech extension is performed.

When normalization such as sequential CMN is performed using the average of the speech feature amounts over a window of N seconds (the past N seconds), the conventional case in FIG. 4(A) has little data available for computing the average, so the effect of CMN is not fully obtained. In FIGS. 4(A) and 4(B), each double-headed arrow marks a window of N seconds, and the arrows drawn entirely in solid lines mark the sections for which the average over a full N-second window is available.

In FIG. 4(A), the average over a full N-second window is not available for the initial sections S_1 and S_2, and it is available only for the three sections S_3 to S_5. In contrast, in FIG. 4(B), where speech extension has been performed, the sections for which the average over a full N-second window is available increase to the seven sections S_3 to S_9. The effect of normalization by CMN or the like is therefore fully obtained, and the accuracy of male/female voice identification can be improved.

N is set to, for example, about 0.8 seconds. If N is too long, the average is taken over a wide span, which weakens the effect of normalization and lowers identification accuracy. For this reason it is not desirable, for example, to simply normalize using the speech feature amounts of the entire speech signal, and N is set to about 0.8 seconds as stated above.
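The effect described for FIG. 4 can be sketched with a running-window mean subtraction. This is an illustrative sketch, not the source's exact procedure: the window is expressed as a number of sections rather than seconds, and the name `running_cmn` is hypothetical. A window of three sections is assumed because it reproduces the counts given above (3 of 5 sections without extension, 7 of 9 with extension).

```python
from collections import deque
from typing import List, Sequence, Tuple


def running_cmn(frames: Sequence[Sequence[float]],
                window: int) -> Tuple[List[List[float]], List[bool]]:
    """Subtract the mean of the last `window` frames from each frame.

    Also returns, per frame, whether a full window of history was available.
    """
    recent: deque = deque(maxlen=window)  # holds at most `window` past frames
    normalized: List[List[float]] = []
    full_window: List[bool] = []
    for frame in frames:
        recent.append(frame)
        mean = [sum(col) / len(recent) for col in zip(*recent)]
        normalized.append([x - m for x, m in zip(frame, mean)])
        full_window.append(len(recent) == window)
    return normalized, full_window
```

Counting the `True` entries in the second return value gives the number of sections normalized with a complete window, which grows as the signal is extended.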

FIG. 5 shows the functional configuration of the male/female voice identification device of Embodiment 2, and FIG. 6 shows its processing flow.

In this example, a speech section detection unit 60 is added to the male/female voice identification device of Embodiment 1 shown in FIG. 1, and a speech section detection process (step S11) is performed before the speech extension process (step S3) in the processing flow of Embodiment 1 shown in FIG. 2.

A speech signal input in an ordinary environment contains sections that are not speech, such as noise or silence (non-speech sections). This is true even of very short speech signals, part of which consists of non-speech sections. When a relatively short speech signal is input, however, the non-speech sections may be about as long as the speech section, or the speech section may even be the shorter of the two.

Meanwhile, normalization of the speech feature amounts does not in itself distinguish between speech sections and non-speech sections. Noise and the like contained in the non-speech sections therefore affects the statistical analysis of the speech feature amounts, and the normalization may not be performed correctly as a result. Consequently, for a very short speech signal that contains noise or the like, normalization errors caused by the noise can affect the likelihoods against the male and female acoustic models and lower identification accuracy.

Embodiment 2 solves this problem. A speech signal that the speech length determination unit 40 has determined to be shorter than the predetermined time length L is input to the speech section detection unit 60, which detects the speech section of the input speech signal (step S11) and outputs only that speech section to the speech extension unit 50. The speech extension unit 50 extends only the input speech section by repeating it until its time length is L or more (step S3). FIG. 7 shows how the speech section is detected from the input speech signal and then repeated to generate the extended speech signal.
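A rough sketch of detecting a speech section and repeating only that section is given below. The detector here is a simple per-frame energy threshold, used as a stand-in assumption; the source relies on an existing speech section detection method instead. The names `speech_section` and `extend_speech_only`, the energy input, and the threshold are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def speech_section(energies: Sequence[float],
                   threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) spanning the first to last frame above threshold."""
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1] + 1  # end index is exclusive


def extend_speech_only(frames: Sequence[Frame], energies: Sequence[float],
                       threshold: float, min_frames: int) -> List[Frame]:
    """Repeat only the detected speech section until min_frames is reached."""
    section = speech_section(energies, threshold)
    if section is None:
        return []  # no speech found, nothing to extend
    start, end = section
    speech = list(frames[start:end])
    reps = max(1, -(-min_frames // len(speech)))  # ceiling division
    return (speech * reps)[:max(min_frames, len(speech))]
```

Because leading and trailing non-speech frames are dropped before repetition, the extended signal contains only speech frames.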

The speech section detection unit 60 can use an existing speech section detection method, for example the speech signal section estimation method described in Japanese Patent No. 4691079.

In this example, normalization is applied only to the speech section, which stabilizes the normalization and allows male and female voices to be identified more accurately.

FIG. 8 shows the processing flow of Embodiment 3. In FIG. 8, steps S12 and S13 are added to the processing flow of Embodiment 2 shown in FIG. 6.

When the speech section contained in the input speech signal is extremely short, it is highly likely that male and female voices cannot be identified with sufficient accuracy even if speech extension is performed. Such input may also be an accidental utterance or something other than a speech signal, so rejecting it is sometimes preferable.

Embodiment 3 performs this rejection. After the speech section is detected (step S11), it is determined whether the length of the speech section is less than a threshold T (step S12). If it is less than T, the speech section is rejected (step S13) and male/female voice identification is not performed. The threshold T is a value greater than 0 and may be determined, for example, by experimentally finding an appropriate value. Here, as an example, T = 0.2 seconds.
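Steps S12 and S13 amount to a guard in front of the identification step, which can be sketched as follows. The function name `identify_or_reject` and the use of `None` to signal rejection are assumptions for illustration; the default threshold follows the example value T = 0.2 seconds.

```python
from typing import Callable, Optional


def identify_or_reject(section_sec: float, identify: Callable[[], str],
                       threshold_sec: float = 0.2) -> Optional[str]:
    """Run identification only if the speech section is at least T long."""
    if section_sec < threshold_sec:
        return None  # step S13: reject, no identification is performed
    return identify()  # continue with extension and identification
```

A caller would pass the detected speech section length and a callable that performs the remaining steps, and treat a `None` result as a rejected input.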

FIG. 9 shows the functional configuration of the male/female voice identification device of Embodiment 4.

In this example, an extension length input unit 70 is added to the male/female voice identification device of Embodiment 1 shown in FIG. 1.

The longer the speech signal subjected to male/female voice identification, the longer the identification takes. When processing speed requirements are strict and the identification result must be output within a fixed time even at some cost in accuracy, it is convenient if the extension time length L can be replaced by an externally set value L' (L' < L).

To allow the extension time length to be set externally in this way, this example includes the extension length input unit 70, so that the time required for identification (the response time of identification) can be controlled according to the processing speed requirements.

FIG. 10 shows the functional configuration of the male/female voice identification device of Embodiment 5.

In this example, the extension length input unit 70 of the male/female voice identification device of Embodiment 4 shown in FIG. 9 is replaced by an extension length calculation unit 80.

For example, when male/female voice identification is applied to a system that performs speech recognition on an input speech signal and replies with synthesized speech of the same gender as the speaker, the gender of the input speech signal must be identified in roughly the same time as the speech recognition takes. The time that can be spent on processing, as required externally in this way, is called the required response time R.

The extension length calculation unit 80 calculates the extension time length each time from the required response time R and externally supplied computer load information (for example, load average information that can be obtained from the OS), and outputs it to the speech extension unit 50.

Here, letting W be the computer load and L'' be the time length to be calculated, L'' can be calculated, for example, by
L'' = A × (R / W)
where A is a constant that can be determined experimentally in advance so that the desired response time is obtained for given values of W and L''.

Suppose that, in a certain system configuration, an experiment shows that with a computer load W = 1.0 and a time length L'' = 2.0 seconds, a response can be returned within a response time of 0.5 seconds. The constant A is then
A = L'' × (W / R)
  = 2.0 × (1.0 / 0.5)
  = 4.0
The constant A can be obtained in this way.

If the constant A is set to a suitable value, for example 4.0, then for an input of computer load W = 1.5 and required response time R = 0.5, the time length L'' is obtained as
L'' = 4.0 × (0.5 / 1.5)
    ≈ 1.33 (seconds)
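The calibration and calculation above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names `calibrate_constant` and `extension_length` are hypothetical, and the guards against non-positive inputs are an added assumption (the text does not say how a load of zero is handled). On Unix-like systems the load W could, for instance, be taken from the first value returned by `os.getloadavg()`, but the sketch takes W as a plain argument.

```python
def calibrate_constant(l_ext_sec: float, load: float, response_sec: float) -> float:
    """Return A = L'' * (W / R) from one measured operating point."""
    if response_sec <= 0:
        # Assumption: a non-positive response time is treated as invalid input.
        raise ValueError("response time must be positive")
    return l_ext_sec * (load / response_sec)


def extension_length(a: float, required_response_sec: float, load: float) -> float:
    """Return L'' = A * (R / W) for the current load W."""
    if load <= 0:
        # Assumption: the text does not cover W = 0, so it is rejected here.
        raise ValueError("computer load must be positive")
    return a * (required_response_sec / load)


# Operating point from the text: W = 1.0, L'' = 2.0 s, response time 0.5 s.
A = calibrate_constant(l_ext_sec=2.0, load=1.0, response_sec=0.5)

# Input from the text: W = 1.5, R = 0.5 s.
L_ext = extension_length(A, required_response_sec=0.5, load=1.5)
print(f"A = {A}, L'' = {L_ext:.2f} s")
```

With the values from the text, this gives A = 4.0 and L'' of about 1.33 seconds. A higher load W yields a shorter extension length, which keeps the processing time within R.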

Using the time length L'' obtained in this way, speech extension is performed as in Embodiment 1. That is, the speech extension unit 50 has a buffer 51, into which the frames of the input speech signal are copied one at a time, starting from the first frame. When the last frame of the input speech signal is reached, copying resumes from the first frame. This is repeated until the total length of the frames held in the buffer 51 is equal to or greater than the time length L''; the copying stops as soon as that length equals (or exceeds) L''.
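The buffer-copy procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `extend_by_repetition`, the representation of frames as items of a sequence, and a fixed frame duration `frame_sec` are all hypothetical and do not come from the specification.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def extend_by_repetition(
    frames: Sequence[Frame], frame_sec: float, target_sec: float
) -> List[Frame]:
    """Copy frames cyclically into a buffer until it spans at least target_sec.

    Intended for inputs shorter than target_sec; a longer input would simply
    be cut off once the target length is reached.
    """
    if not frames:
        raise ValueError("input signal has no frames")
    if frame_sec <= 0:
        raise ValueError("frame duration must be positive")

    buffer: List[Frame] = []
    index = 0
    # Stop as soon as the buffered duration equals or exceeds the target.
    while len(buffer) * frame_sec < target_sec:
        # After the last input frame, wrap around to the first frame again.
        buffer.append(frames[index % len(frames)])
        index += 1
    return buffer


# Three 0.5 s frames extended to 2.0 s: the fourth copy wraps to the first frame.
extended = extend_by_repetition(["f0", "f1", "f2"], frame_sec=0.5, target_sec=2.0)
print(extended)
```

The loop stops partway through a repetition if the target is reached there, which matches the stopping condition in the text (equal to or exceeding L'').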

In this way, this example can perform male/female voice identification that meets the required response time while taking fluctuations in computer load into account.

Various embodiments have been described above. The present invention is characterized in that, when the input speech signal is very short, the speech signal is repeated to extend it. Techniques that extend data by combining part of the input data with the input data have been used in the past. For example, when part of transmitted data is missing, there is a technique that interpolates the missing part using the data that is not missing. There is also a technique that, when matching data against a template, copies the data at an edge of the data to the outside of that edge so that the edge itself can be matched. However, none of these techniques repeatedly copies the entire data in order to extend the data to an arbitrary length.

Techniques that perform recognition using acoustic models, such as speech recognition, also exist. However, speech recognition is a process that places importance on the content of the recognized utterance. If the input speech were repeated to extend its time length, the utterance itself would become a different one.

For this reason, the idea of combining recognition using an acoustic model with extension of the input speech by repetition did not previously exist.

In contrast, the male/female voice identification of the present invention does not need to recognize the content of the utterance. What is identified using the acoustic models is the gender of the speaker of the input speech signal, and identifying gender requires only the speech features extracted from the input speech signal. Performing male/female voice identification on data obtained by repeating and extending the input speech signal therefore makes it possible to improve identification accuracy.

The male/female voice identification device and method described above can be implemented by a computer and a program installed on the computer. The installed program is interpreted by the CPU of the computer and causes the computer to execute the male/female voice identification method described above.

DESCRIPTION OF SYMBOLS
10 Male/female voice identification processing unit
11 Speech feature extraction unit
12 Recognition grammar setting unit
13 Identification unit
20 Male-voice acoustic model
30 Female-voice acoustic model
40 Speech length determination unit
50 Speech extension unit
60 Speech section detection unit
70 Extension time length input unit
80 Extension time length calculation unit

Claims

A male/female voice identification method that extracts a speech feature from an input speech signal and identifies the gender of the speaker of the speech signal based on likelihoods obtained by matching the speech feature against a male-voice acoustic model and a female-voice acoustic model, wherein,
when the time length of the speech signal is less than a predetermined time length L, the speech signal is extended by repetition until its time length is equal to or greater than the time length L, and
the speech feature is extracted from the extended speech signal, and the matching and identification are performed using a recognition grammar corresponding to the repetition.

The male/female voice identification method according to claim 1,
wherein, when the repetition is performed, a speech section of the speech signal is detected and the speech signal is extended by repeating that speech section.

The male/female voice identification method according to claim 2,
wherein, when the length of the detected speech section is less than a threshold T, the speech section is rejected and the identification is not performed.

The male/female voice identification method according to any one of claims 1 to 3,
wherein the time length L can be set externally.

The male/female voice identification method according to any one of claims 1 to 3,
wherein the time length L is calculated from the load of the computer that performs the male/female voice identification and from a required response time.

A male/female voice identification device comprising:
a speech length determination unit that determines whether the time length of an input speech signal is less than a predetermined time length L, outputs the speech signal to a speech extension unit when the time length is determined to be less than the time length L, and outputs the speech signal to a male/female voice identification processing unit when the time length is determined to be equal to or greater than the time length L;
the speech extension unit, which extends the speech signal input from the speech length determination unit by repetition until its time length is equal to or greater than the time length L, and outputs the extended speech signal to the male/female voice identification processing unit; and
the male/female voice identification processing unit, which extracts a speech feature from the speech signal input from the speech length determination unit or from the speech extension unit, matches the speech feature against a male-voice acoustic model and a female-voice acoustic model using a recognition grammar corresponding to the speech signal from which the speech feature was extracted, and identifies and outputs the gender of the speaker of that speech signal based on the likelihoods obtained by the matching.

A program for causing a computer to execute the male/female voice identification method according to any one of claims 1 to 5.