JP6783001B2

JP6783001B2 - Speech feature extraction algorithm based on dynamic division of cepstrum coefficients of inverse discrete cosine transform

Info

Publication number: JP6783001B2
Application number: JP2019186806A
Authority: JP
Inventors: 毅左; 赫馬; 鉄山李; 培超賀; 君霞劉; 佳▲チー▼ 艾; 楊肖; 仁海于
Original assignee: 大連海事大学
Priority date: 2019-01-29
Filing date: 2019-10-10
Publication date: 2020-11-11
Anticipated expiration: 2039-10-10
Also published as: CN109767756A; CN109767756B; JP2020140193A

Description

The present invention relates to the technical field of speech feature extraction, in which unsupervised cluster analysis is applied to the extraction of speech features, and in particular to a speech feature extraction algorithm based on dynamic division of the cepstrum coefficients of the inverse discrete cosine transform.

Speaker recognition technology comprises two parts: feature extraction and modeling/recognition. Feature extraction is a key step in speaker recognition and directly affects the overall performance of the recognition system. Typically, framing and windowing the speech signal during preprocessing produce a large amount of high-dimensional data, so when speaker features are extracted the redundant information in the speech must be removed to reduce the dimensionality of the data. Conventional methods use triangular filters to convert the speech signal into speech feature vectors that meet the requirements of the feature parameters and match the perceptual characteristics of human hearing; to some extent they also enhance the speech signal and suppress non-speech signals.

Commonly used feature parameters are as follows. Linear prediction coefficients are feature parameters obtained by simulating the principle of human speech production and analyzing a cascade model of short vocal-tract tubes. Perceptual linear prediction coefficients are based on an auditory model and are applied to spectral analysis: the input speech signal is processed by a model of human hearing, and the result replaces the time-domain signal used in linear predictive coding (LPC), giving feature parameters equivalent to the all-pole LPC prediction polynomial. Tandem features and Bottleneck features are two kinds of features extracted by neural networks. Filter-bank (Fbank) features correspond to MFCC with the final step, the discrete cosine transform, removed, and therefore retain more of the original speech data than MFCC features. Linear prediction cepstrum coefficients are important feature parameters based on the vocal tract model; they discard the excitation information of the speech production process and represent the formant characteristics with a few dozen cepstrum coefficients. MFCC is the most common speech feature parameter. To extract it, the speech is first preprocessed by pre-emphasis, framing, windowing, and a fast Fourier transform; the energy spectrum is then filtered by a bank of Mel-scale triangular filters; the logarithmic energy output by each filter is calculated; the discrete cosine transform (DCT) is applied to obtain the Mel-scale cepstrum parameters; and finally the dynamic difference parameters are extracted to give the mel cepstrum coefficients.

Japanese Patent No. 3654831 discloses a method for extracting speech features based on wavelets with optimized bases. That method does not use cepstrum coefficients; instead it applies unsupervised learning to analyze unlabeled signal data. In 2012, S. Al-Rawahy et al. (Alrawahy S; Hossen A; Heute U. Text-independent speaker identification system based on the histogram of DCT-cepstrum coefficients. International Journal of Knowledge-based and Intelligent Engineering Systems, 2012, 16, 141-161) proposed, with reference to the MFCC feature extraction method, the histogram DCT cepstrum coefficient method, which divides the DCT cepstrum coefficients obtained from the preprocessed speech into equal frequency-domain intervals. We noticed that dividing the cepstrum coefficients into equal frequency-domain intervals overlooks the dynamic characteristics of the speech data. Building on the method of S. Al-Rawahy et al., the present invention therefore provides a new speech feature extraction algorithm, namely a method based on dynamic division of the cepstrum coefficients of the inverse discrete cosine transform. The invention uses unsupervised hierarchical clustering to cluster the speech data according to the similarity of its dynamic features, thereby extracting dynamic feature vectors that describe the speech characteristics more fully.

In previous work, speech recognition techniques that use MFCC as the speech feature vector and combine it with machine learning methods such as the Gaussian mixture model (GMM), the hidden Markov model (HMM), and the support vector machine (SVM) to match speaker patterns have been widely applied. MFCC is extracted as follows. First, the speech signal is preprocessed by pre-emphasis, framing, windowing, and a fast Fourier transform. Next, the energy spectrum is filtered by a bank of Mel-scale triangular filters and the logarithmic energy output by each filter is calculated. The logarithmic energies are then passed through the discrete cosine transform (DCT) to obtain the Mel-scale cepstrum parameters, after which the dynamic difference parameters are extracted to give the mel cepstrum coefficients.

In their 2012 study (Alrawahy S; Hossen A; Heute U. Text-independent speaker identification system based on the histogram of DCT-cepstrum coefficients. International Journal of Knowledge-based and Intelligent Engineering Systems, 2012, 16, 141-161), S. Al-Rawahy et al. identified a new feature, the DCT cepstrum, and proposed a speech feature extraction algorithm based on equal frequency-domain division of DCT cepstrum coefficients. The preprocessed speech signal is transformed to the frequency domain, which turns the time-domain convolution into a multiplication of spectra; taking the logarithm turns the product into a sum of components, from which the discrete cosine transform cepstrum coefficients (DCT cepstrum coefficients) are obtained. The DCT cepstrum coefficients record the periodicity of the frequency ranges in nonlinear increments: between 0 Hz and 600 Hz the feature intervals are 50 Hz wide, and between 600 Hz and 1000 Hz they are 100 Hz wide. The process can be regarded as counting the number of periods of each frequency range in a given speech signal. It is simpler and faster than MFCC feature extraction.
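
The fixed interval layout described above can be written out explicitly. The following is a minimal sketch of the bin edges only (50 Hz steps from 0 Hz to 600 Hz, then 100 Hz steps up to 1000 Hz); the function name `equal_frequency_edges` is hypothetical, and how individual coefficients are assigned to these intervals is not specified here.

```python
def equal_frequency_edges():
    """Bin edges in Hz for the fixed division: 50 Hz steps, then 100 Hz steps."""
    low = list(range(0, 600, 50))        # 0, 50, ..., 550
    high = list(range(600, 1001, 100))   # 600, 700, ..., 1000
    return low + high


edges = equal_frequency_edges()
# Consecutive edge pairs give the feature intervals.
bins = list(zip(edges[:-1], edges[1:]))
print(len(bins))          # 12 intervals of 50 Hz plus 4 intervals of 100 Hz
print(bins[0], bins[-1])
```

The point of the sketch is that these boundaries are fixed in advance and do not depend on the data, which is the property the dynamic division is meant to replace.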

Chinese Patent Application Publication No. 103354091 discloses a method and apparatus for extracting audio features based on frequency-domain transformation. The method segments the audio signal to generate at least two segmented frequency-domain signals; applies a frequency-domain transform to the audio features of each segmented frequency-domain signal to generate its transformed features; obtains the high-frequency components of those transformed features; and, from the high-frequency components of the transformed features of the at least two segmented frequency-domain signals, generates dynamic features that describe the melodic characteristics of the audio signal. Applying a frequency-domain transform to the audio features yields the high-frequency components of the transformed features, which makes it possible to extract dynamic features describing the melodic characteristics of the audio signal and improves the discriminability of the high-frequency components of the audio features. For speech signals, however, that method segments the audio by time, so it can only process audio of 1 s to 4 s and cannot analyze shorter audio. In addition, it discards low-dimensional elements when extracting the feature vector, which destroys the completeness of the speech features. The method of that patent document is therefore ill-suited to speech feature extraction.

The object of the present invention, made mainly in view of the inaccuracy of the division frequencies in speech feature extraction algorithms based on equal frequency-domain division of the cepstrum coefficients of the inverse discrete cosine transform, is to provide a speech feature extraction algorithm based on dynamic division of the cepstrum coefficients of the inverse discrete cosine transform.

The technical means of the present invention are as follows.

The speech feature extraction algorithm based on dynamic division of the cepstrum coefficients of the inverse discrete cosine transform includes the following steps.

Step 1: Preprocess the speech signal.
Pre-emphasis, framing, and windowing are applied to the speech signal in sequence.
Preprocessing removes the effects on signal quality of aliasing, higher-order harmonic distortion, high frequencies, and other factors introduced by the human vocal organs themselves and by the equipment that captures the speech signal. It ensures that the signal passed to subsequent processing is more uniform and smooth, provides good parameters for speech feature extraction, and improves the quality of the subsequent processing.

Step 2: Convert the preprocessed speech signal from the time domain to the frequency domain.
The preprocessed speech signal is transformed to the frequency domain, which turns the time-domain convolution into a multiplication of spectra. Taking the logarithm then expresses the resulting components as a sum, and the inverse discrete cosine transform cepstrum coefficients (IDCT cepstrum coefficients) are obtained according to equation (1):
$$C(q) = \mathrm{IDCT}\{\log\lvert\mathrm{DCT}\{x(k)\}\rvert\} \qquad (1)$$
In equation (1), DCT is the discrete cosine transform, IDCT is the inverse discrete cosine transform, $x(k)$ is the input speech signal (the preprocessed speech signal), and $C(q)$ is the output, namely the IDCT cepstrum coefficients.
The IDCT cepstrum coefficients form a data matrix. When hierarchical clustering is performed according to the inherent frequency attribute of speech, all columns carry the same kind of attribute, so clustering proceeds in sequence by computing the similarity of adjacent columns.
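
Equation (1) can be illustrated for a single frame. The following is a minimal, standard-library sketch that uses an orthonormal DCT-II and its inverse written as plain loops; the orthonormal scaling, the small `eps` added before the logarithm, and the function names are assumptions made for illustration, since the text does not fix the DCT variant or how zero-valued coefficients are handled.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a sequence (O(N^2) loops, for illustration only)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(X):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(X)
    out = []
    for i in range(n):
        s = X[0] * math.sqrt(1.0 / n)
        s += sum(
            math.sqrt(2.0 / n) * X[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(1, n)
        )
        out.append(s)
    return out


def idct_cepstrum(frame, eps=1e-12):
    """C(q) = IDCT{ log |DCT{x(k)}| } for one preprocessed frame."""
    spectrum = dct(frame)
    # eps avoids log(0); it is an assumption, not part of equation (1).
    log_mag = [math.log(abs(v) + eps) for v in spectrum]
    return idct(log_mag)


frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
cepstrum = idct_cepstrum(frame)
print(len(cepstrum))
```

Applying `idct_cepstrum` to every frame of every speaker and stacking the results would give the data matrix whose adjacent columns are compared during clustering; a practical implementation would more likely use a fast DCT from a numerical library.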

Step 3: Using cluster analysis, compute the similarity between the IDCT cepstrum coefficients obtained in Step 2 and merge the two adjacent classes with the greatest similarity; repeat this process until 24 classes remain. The resulting dynamically divided IDCT cepstrum coefficients (DD-IDCT cepstrum coefficients) are the speech features.

The pre-emphasis is implemented by a digital filter according to equation (2):
$$Y(n) = X(n) - aX(n-1) \qquad (2)$$
In equation (2), $Y(n)$ is the pre-emphasized output signal, $X(n)$ is the input speech signal, $a$ is the pre-emphasis coefficient, and $n$ is the time index.
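
Equation (2) is a first-order difference filter and can be sketched in a few lines. In the sketch below the default `a=0.97` is the value given later in the embodiment; passing the first sample through unchanged (that is, treating X(-1) as 0) is an assumption, since the text does not say how the first sample is handled.

```python
def pre_emphasis(x, a=0.97):
    """Y(n) = X(n) - a * X(n-1); the first sample is passed through unchanged."""
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


# A constant signal is almost cancelled after the first sample,
# showing that low-frequency content is attenuated.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```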

The average power spectrum of the speech signal is affected by glottal excitation and by radiation from the mouth and nose. Above about 800 Hz the high-frequency end falls off at 6 dB/oct (octave), so the higher the frequency, the smaller the corresponding component. The high-frequency part must therefore be boosted before the speech signal is analyzed.

Short-time analysis is applied throughout the whole process of speech analysis. A speech signal is time-varying, but within a short interval (usually 10 to 30 ms) its characteristics remain essentially unchanged, that is, relatively stable, so it can be regarded as a quasi-stationary process; in other words, the speech signal is short-time stationary. Any analysis or processing of a speech signal must therefore be carried out on a short-time basis: the signal is divided into segments and the feature parameters of each segment are analyzed. Each segment is called a frame, and the frame length is generally 10 to 30 ms. For the speech signal as a whole, the result of the analysis is thus a time series of feature parameters made up of the feature parameters of each frame.

In the framing, the pre-emphasized output signal is divided into frames of 20 ms each.
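
A minimal sketch of 20 ms framing follows. The sampling rate (16 kHz in the usage line), the absence of overlap between frames, and dropping a trailing partial frame are all assumptions made for illustration; the text specifies only the 20 ms frame length.

```python
def split_frames(signal, sample_rate, frame_ms=20):
    """Cut a signal into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len   # a trailing partial frame is dropped
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# At an assumed 16 kHz sampling rate, a 20 ms frame holds 320 samples.
frames = split_frames(list(range(1000)), sample_rate=16000)
print(len(frames), len(frames[0]))
```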

After framing, each frame is windowed. The purpose of windowing is to make the speech signal more continuous as a whole, to avoid the Gibbs phenomenon, and to let an originally non-periodic speech signal exhibit some of the characteristics of a periodic function. The windowing uses a Hamming window.
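
The windowing step can be sketched with the standard symmetric Hamming window, w(i) = 0.54 - 0.46 cos(2πi/(N-1)). The function names below are hypothetical, and the symmetric form of the window is an assumption, as the text names only the window type.

```python
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window of the same length."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]


# The window tapers both ends of the frame toward 0.08 and leaves the centre near 1.
windowed = apply_window([1.0] * 5)
print(windowed)
```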

The format conversion is a cepstrum transform.

The cluster analysis method is hierarchical clustering.

The similarity is computed as the Euclidean distance.
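
As a small illustration of the similarity measure, the sketch below computes the Euclidean distance between two columns of a coefficient matrix stored as a list of rows. The helper names `column` and `dis` and the toy matrix are hypothetical; a smaller distance is read as a greater similarity.

```python
import math


def column(matrix, j):
    """Return column j of a matrix given as a list of rows."""
    return [row[j] for row in matrix]


def dis(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy matrix: 2 rows (speakers) by 3 columns (coefficient dimensions).
A = [[0.0, 3.0, 3.0],
     [0.0, 4.0, 4.0]]
print(dis(column(A, 0), column(A, 1)))
print(dis(column(A, 1), column(A, 2)))
```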

Compared with the prior art, the present invention has the following advantages.

First, through an in-depth analysis of the characteristics of the speech feature extraction algorithm based on equal frequency-domain division of DCT cepstrum coefficients, the invention remedies a shortcoming of the prior art, which performs the frequency-domain transformation without making full use of the dynamic features of speech. The invention is more widely applicable and achieves more accurate speaker identification.

Second, by applying unsupervised cluster analysis to speech feature extraction, the invention has the advantages of a simple process, high speed, and low use of computational resources.

To explain the technical means of the embodiments of the present invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The following drawings relate to embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the speech feature extraction algorithm based on dynamic division of the cepstrum coefficients of the inverse discrete cosine transform according to an embodiment of the present invention.
FIG. 2 is a tree graph (dendrogram) of the cluster analysis according to an embodiment of the present invention.

To explain the objects, technical means, and advantages of the embodiments of the present invention more clearly, the technical means in the embodiments are described clearly and completely below with reference to the drawings. The embodiments described are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments of the present invention without creative effort fall within the scope of the present invention.

As shown in FIG. 1, the speech feature extraction algorithm based on dynamic division of the cepstrum coefficients of the inverse discrete cosine transform includes the following steps.

Step 1: Preprocess the speech signal.
Pre-emphasis, framing, and windowing are applied to the speech signal in sequence.
The pre-emphasis is implemented by a digital filter according to equation (2):
$$Y(n) = X(n) - aX(n-1) \qquad (2)$$
In equation (2), $Y(n)$ is the pre-emphasized output signal, $X(n)$ is the input speech signal, $a$ is the pre-emphasis coefficient, and $n$ is the time index. In this embodiment, $a$ is 0.97.

The framing divides the pre-emphasized output signal into frames of 20 ms each.

The windowing uses a Hamming window.

Step 2: Convert the preprocessed speech signal from the time domain to the frequency domain.
The preprocessed speech signal is transformed to the frequency domain, which turns the time-domain convolution into a multiplication of spectra. Taking the logarithm then expresses the resulting components as a sum, and the inverse discrete cosine transform cepstrum coefficients (IDCT cepstrum coefficients) are obtained according to equation (1):
$$C(q) = \mathrm{IDCT}\{\log\lvert\mathrm{DCT}\{x(k)\}\rvert\} \qquad (1)$$
In equation (1), DCT is the discrete cosine transform, IDCT is the inverse discrete cosine transform, $x(k)$ is the input speech signal (the preprocessed speech signal), and $C(q)$ is the output, namely the IDCT cepstrum coefficients. The format conversion is a cepstrum transform.

Step 3: Using cluster analysis, compute the similarity between the IDCT cepstrum coefficients obtained in Step 2 and merge the two adjacent classes with the greatest similarity; repeat this process until 24 classes remain. The resulting dynamically divided IDCT cepstrum coefficients are the speech features. The specific steps are as follows.

Let matrix $A$ denote the $n$-dimensional IDCT cepstrum coefficients of the $m$ speakers obtained in Step 2. As shown in FIG. 2, the vectors $V_1, V_2, \ldots, V_n$ for each dimension of the IDCT cepstrum coefficients are regarded as $n$ classes, and the Euclidean distance between $V_i$ and $V_j$ is
$$\mathrm{Dis}(V_i, V_j) = \lVert V_i - V_j \rVert_2 = \sqrt{\sum_{k=1}^{m} (a_{ki} - a_{kj})^2}$$
where $a_{ki}$ denotes the $k$-th component of $V_i$.
The specific steps of the cluster analysis are as follows.

First clustering round:
$l_1 = \mathrm{Dis}(V_1, V_2)$
$l_2 = \mathrm{Dis}(V_2, V_3)$
$\vdots$
$l_{n-1} = \mathrm{Dis}(V_{n-1}, V_n)$
If $i = \arg\min(l_1, l_2, l_3, \ldots, l_{n-1})$, the clustering result is
$(V_1), (V_2), \ldots, (V_i + V_{i+1}), \ldots, (V_n)$, that is, [formula not reproduced in this text].

After updating, the distances become
$l_{i-1} = \mathrm{Dis}(V_{i-1}, (V_i + V_{i+1}))$
$l_i = \mathrm{Dis}((V_i + V_{i+1}), V_{i+2})$
$l_{i+1} = l_{i+2}$
$\vdots$
$l_{n-2} = l_{n-1}$
Delete $l_{n-1}$
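
One merge round with the incremental distance update can be sketched as follows. The sketch reads the notation $(V_i + V_{i+1})$ literally as an element-wise sum of the two vectors, which is an assumption; the function names and the 0-based indexing are likewise illustrative only.

```python
import math


def dis(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def merge_once(clusters, l):
    """Merge the closest adjacent pair and update only the affected distances.

    clusters: list of vectors; l[k] is dis(clusters[k], clusters[k + 1]).
    Returns new lists; the inputs are not modified.
    """
    i = min(range(len(l)), key=l.__getitem__)            # i = argmin(l), 0-based
    # Assumption: "V_i + V_{i+1}" is taken as an element-wise sum.
    merged = [a + b for a, b in zip(clusters[i], clusters[i + 1])]
    clusters = clusters[:i] + [merged] + clusters[i + 2:]
    l = l[:i] + l[i + 1:]                                 # shift the tail down, drop one entry
    if i > 0:
        l[i - 1] = dis(clusters[i - 1], clusters[i])      # left neighbour vs merged class
    if i < len(clusters) - 1:
        l[i] = dis(clusters[i], clusters[i + 1])          # merged class vs right neighbour
    return clusters, l


cols = [[0.0], [1.0], [1.5], [4.0]]
dists = [dis(cols[k], cols[k + 1]) for k in range(len(cols) - 1)]
cols, dists = merge_once(cols, dists)
print(cols, dists)
```

Only the two distances next to the merged class are recomputed; every other entry is simply shifted down by one position, which matches the update rule listed above.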

Second clustering round:
If $j = \arg\min(l_1, l_2, l_3, \ldots, l_{n-2})$, the clustering result is
$(V_1), (V_2), \ldots, (V_i + V_{i+1}), \ldots, (V_j + V_{j+1}), \ldots, (V_n)$, that is, [formula not reproduced in this text].

After updating again, the distances become
$l_{j-1} = \mathrm{Dis}(V_{j-1}, (V_j + V_{j+1}))$
$l_j = \mathrm{Dis}((V_j + V_{j+1}), V_{j+2})$
$l_{j+1} = l_{j+2}$
$\vdots$
$l_{n-3} = l_{n-2}$
Delete $l_{n-2}$

As described above, hierarchical clustering continues until 24 classes remain. The resulting dynamically divided IDCT cepstrum coefficients are the speech features, which are fed into a GMM model for identification in order to assess the feasibility of the algorithm.
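
The complete loop can be sketched as below. For brevity the sketch recomputes all adjacent distances in every round instead of updating them incrementally, treats a merged class as the element-wise sum of its members (an assumption based on the $V_i + V_{i+1}$ notation), and also returns which original columns ended up in each class; the function name `dynamic_division` is hypothetical, and the GMM stage is not shown.

```python
import math


def dis(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dynamic_division(columns, target=24):
    """Merge the closest adjacent classes until `target` classes remain.

    Returns (clusters, groups), where groups[k] lists the original column
    indices that were merged into clusters[k].
    """
    clusters = [list(c) for c in columns]
    groups = [[k] for k in range(len(clusters))]
    while len(clusters) > target:
        l = [dis(clusters[k], clusters[k + 1]) for k in range(len(clusters) - 1)]
        i = min(range(len(l)), key=l.__getitem__)   # closest adjacent pair
        # Assumption: merging is an element-wise sum of the two vectors.
        merged = [a + b for a, b in zip(clusters[i], clusters[i + 1])]
        clusters[i:i + 2] = [merged]
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return clusters, groups


# Toy example with 5 columns reduced to 3 classes.
toy = [[0.0], [0.1], [5.0], [5.2], [9.0]]
classes, members = dynamic_division(toy, target=3)
print(members)
```

Because only adjacent classes are ever merged, each resulting class covers a contiguous run of the original columns, so `members` directly describes the data-dependent division boundaries.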

Finally, the following should be noted. The embodiments above merely illustrate the technical means of the present invention and do not limit its scope of protection. Although the invention has been described in detail with reference to these embodiments, a person skilled in the art will understand that the technical means described in the embodiments may be modified, or some or all of their technical features replaced by equivalents, without the essence of the corresponding technical means departing from the scope of the technical means of the embodiments of the present invention.

(Appendix)
(Appendix 1)
A speech feature extraction algorithm based on dynamic division of the cepstrum coefficients of the inverse discrete cosine transform, comprising:
Step 1, in which pre-emphasis, framing, and windowing are applied to a speech signal in sequence as preprocessing;
Step 2, in which the preprocessed speech signal is converted from the time domain to the frequency domain, that is, the time-domain convolution is converted into a multiplication of frequency-domain spectra, the logarithm is taken so that the resulting components are expressed as a sum, and the cepstrum coefficients of the inverse discrete cosine transform are obtained according to equation (1):
$$C(q) = \mathrm{IDCT}\{\log\lvert\mathrm{DCT}\{x(k)\}\rvert\} \qquad (1)$$
where DCT is the discrete cosine transform, IDCT is the inverse discrete cosine transform, $x(k)$ is the input speech signal (the preprocessed speech signal), and $C(q)$ is the output, namely the cepstrum coefficients of the inverse discrete cosine transform; and
Step 3, in which cluster analysis is used to compute the similarity between the cepstrum coefficients of the inverse discrete cosine transform obtained in Step 2, the two adjacent classes with the greatest similarity are merged, and this process is repeated until 24 classes remain, the resulting dynamically divided cepstrum coefficients of the inverse discrete cosine transform being the speech features.

（付記２）
前記プリエンファシスは、デジタルフィルターにより実現され、具体的な過程は次式（２）に従って行い、
Ｙ（ｎ）＝Ｘ（ｎ）−ａＸ（ｎ−ｌ）式（２）
式（２）中、Ｙ（ｎ）はプリエンファシスされた後の出力信号であり、Ｘ（ｎ）は入力された音声信号であり、ａはプリエンファシス係数であり、ｎは時刻である、ことを特徴とする付記１に記載の抽出アルゴリズム。 (Appendix 2)
The pre-emphasis is realized by a digital filter, and the specific process is performed according to the following equation (2).
Y (n) = X (n) -aX (n-l) equation (2)
In equation (2), Y (n) is the output signal after pre-emphasis, X (n) is the input audio signal, a is the pre-emphasis coefficient, and n is the time. The extraction algorithm according to Appendix 1, which comprises the above.

（付記３）
前記フレーム処理は、前記プリエンファシスされた後の出力信号を１フレーム当たり２０ｍｓになるように分段することである、ことを特徴とする付記１に記載の抽出アルゴリズム。 (Appendix 3)
The extraction algorithm according to Appendix 1, wherein the frame processing is to divide the output signal after the pre-emphasis into 20 ms per frame.

(Appendix 4)
The extraction algorithm according to Appendix 1, wherein the windowing is Hamming windowing.
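The Hamming windowing of one frame can be sketched as below, using the standard symmetric definition w(n) = 0.54 - 0.46 cos(2πn / (N - 1)). The text does not give the window formula, so this definition and the helper names are assumptions.

```python
import math


def hamming_window(length):
    """Symmetric Hamming window of the given length (standard definition)."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def apply_window(frame):
    """Multiply a frame sample-by-sample with a Hamming window of equal length."""
    window = hamming_window(len(frame))
    return [sample * weight for sample, weight in zip(frame, window)]
```

The window tapers both ends of the frame toward 0.08 and leaves the centre near 1, which reduces spectral leakage before the transform of equation (1).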

(Appendix 5)
The extraction algorithm according to Appendix 1, wherein the format conversion is a cepstrum transform.

(Appendix 6)
The extraction algorithm according to Appendix 1, wherein the cluster analysis method is a hierarchical clustering method.

(Appendix 7)
The extraction algorithm according to Appendix 1, wherein the similarity is calculated as the Euclidean distance.
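As an illustration of the similarity measure, the sketch below computes the Euclidean distance between neighbouring columns of a coefficient matrix. The row-major list-of-lists layout (one row per speaker, one column per cepstrum dimension) and the function names are assumptions made for the example.

```python
import math


def column(matrix, j):
    """Return column j of a row-major matrix as a list."""
    return [row[j] for row in matrix]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def adjacent_column_distances(matrix):
    """Distances between each pair of neighbouring columns, left to right."""
    n_cols = len(matrix[0])
    return [
        euclidean(column(matrix, j), column(matrix, j + 1))
        for j in range(n_cols - 1)
    ]
```

A smaller distance means a higher similarity, so the pair of adjacent columns with the smallest distance is the one that would be merged first.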

Claims

Step 1: pre-emphasis, frame processing, and windowing are performed in sequence on the speech signals of m speakers;
Step 2: the preprocessed speech signals of the m speakers are converted from the time domain to the frequency domain. That is, the time-domain convolution is converted into a product of spectra in the frequency domain, the logarithm is taken so that the components appear in additive form, and the cepstrum coefficients of the inverse discrete cosine transform are obtained. The format conversion follows equation (1):
$$C(q) = \mathrm{IDCT}\,\log\lvert \mathrm{DCT}\{x(k)\}\rvert \tag{1}$$
In equation (1), DCT is the discrete cosine transform, IDCT is the inverse discrete cosine transform, $x(k)$ is the input speech signal, i.e. the preprocessed speech signals of the m speakers, and $C(q)$ is the output, i.e. the cepstrum coefficients of the inverse discrete cosine transform of the m speakers; and
Step 3: using cluster analysis, the pairwise similarity between columns of the cepstrum coefficients of the inverse discrete cosine transform of the m speakers obtained in step 2 is calculated, the two adjacent columns with the highest similarity are merged, and this process is repeated until 24 columns remain; the resulting dynamically divided cepstrum coefficients of the inverse discrete cosine transform are the speech features of the m speakers;
the algorithm comprising the above steps, wherein:
the frame processing divides the pre-emphasized output signal into frames of 20 ms each;
in step 3, the similarity is calculated as the Euclidean distance, by the following method.
$A$ in equation (2) denotes the cepstrum coefficient matrix of the n-dimensional inverse discrete cosine transform of the m speakers obtained in step 2, and $V_1, V_2, \ldots, V_n$ denote the column vectors of $A$, one per dimension of the cepstrum coefficients:
$$A = [V_1, V_2, \ldots, V_n] \tag{2}$$
The Euclidean distance between two columns $V_i$ and $V_j$ is
$$\mathrm{Dis}(V_i, V_j) = \lVert V_i - V_j \rVert_2$$
Next, in step 3, the clustering down to 24 columns is performed by the following hierarchical clustering.
First column clustering:
$$\begin{aligned}
l_1 &= \mathrm{Dis}(V_1, V_2)\\
l_2 &= \mathrm{Dis}(V_2, V_3)\\
&\;\;\vdots\\
l_{n-1} &= \mathrm{Dis}(V_{n-1}, V_n)
\end{aligned}$$
If $i = \arg\min(l_1, l_2, l_3, \ldots, l_{n-1})$, the column clustering result is
$$(V_1), (V_2), \ldots, (V_i + V_{i+1}), \ldots, (V_n)$$
The Euclidean distances between adjacent columns are then updated as follows:
$$\begin{aligned}
l_{i-1} &= \mathrm{Dis}(V_{i-1}, (V_i + V_{i+1}))\\
l_i &= \mathrm{Dis}((V_i + V_{i+1}), V_{i+2})\\
l_{i+1} &= l_{i+2}\\
&\;\;\vdots\\
l_{n-2} &= l_{n-1}
\end{aligned}$$
and $l_{n-1}$ is deleted.
Second column clustering:
If $j = \arg\min(l_1, l_2, l_3, \ldots, l_{n-2})$, the column clustering result is
$$(V_1), (V_2), \ldots, (V_i + V_{i+1}), \ldots, (V_j + V_{j+1}), \ldots, (V_n)$$
The Euclidean distances between adjacent columns are updated again:
$$\begin{aligned}
l_{j-1} &= \mathrm{Dis}(V_{j-1}, (V_j + V_{j+1}))\\
l_j &= \mathrm{Dis}((V_j + V_{j+1}), V_{j+2})\\
l_{j+1} &= l_{j+2}\\
&\;\;\vdots\\
l_{n-3} &= l_{n-2}
\end{aligned}$$
and $l_{n-2}$ is deleted.
The hierarchical clustering is continued in this way until the clustering result has 24 columns;
A speech feature extraction algorithm based on dynamic division of the cepstrum coefficients of the inverse discrete cosine transform, characterized in that m is a large number of speakers.
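The adjacent-column merging of step 3 can be sketched as follows. The claim writes a merged cluster as (V_i + V_{i+1}) without saying how it is represented when distances are recomputed; this sketch assumes the size-weighted element-wise mean of the member columns. The function name, the returned `groups` list, and the row-major matrix layout are likewise hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def merge_adjacent_columns(matrix, target=24):
    """Greedily merge the closest pair of adjacent columns until target remain.

    matrix: list of m rows (speakers), each holding n cepstrum coefficients.
    Returns (reduced_matrix, groups); groups[k] lists the original column
    indices that were merged into output column k.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    n = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n)]
    groups = [[j] for j in range(n)]
    # dists[j] is the distance l_{j+1} between current columns j and j + 1.
    dists = [euclidean(cols[j], cols[j + 1]) for j in range(n - 1)]
    while len(cols) > target:
        i = min(range(len(dists)), key=dists.__getitem__)  # arg min
        wi, wj = len(groups[i]), len(groups[i + 1])
        # Assumption: the merged column is the size-weighted mean of its members.
        merged = [(wi * a + wj * b) / (wi + wj)
                  for a, b in zip(cols[i], cols[i + 1])]
        cols[i:i + 2] = [merged]
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
        del dists[i]  # later distances shift left by one position
        if i > 0:
            dists[i - 1] = euclidean(cols[i - 1], cols[i])
        if i < len(cols) - 1:
            dists[i] = euclidean(cols[i], cols[i + 1])
    reduced = [[c[r] for c in cols] for r in range(len(matrix))]
    return reduced, groups
```

Because only neighbouring columns are ever merged, each output column covers a contiguous range of the original cepstrum dimensions, which is what makes the division "dynamic" rather than a fixed filter bank. Only the two distances touching the merged column are recomputed per round, mirroring the update rules in the claim.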

The extraction algorithm according to claim 1, wherein the pre-emphasis is implemented by a digital filter according to equation (3):
$$Y(n) = X(n) - aX(n-1) \tag{3}$$
In equation (3), $Y(n)$ is the output signal after pre-emphasis, $X(n)$ is the input speech signal, $a$ is the pre-emphasis coefficient, and $n$ is the time index.

The extraction algorithm according to claim 1, wherein the windowing is Hamming windowing.

The extraction algorithm according to claim 1, wherein the format conversion is a cepstrum transform.