JP2008026836A

JP2008026836A - Method, device, and program for evaluating similarity of voice

Info

Publication number: JP2008026836A
Application number: JP2006202641A
Authority: JP
Inventors: Junya Ura; 純也浦
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-07-25
Filing date: 2006-07-25
Publication date: 2008-02-07

Abstract

PROBLEM TO BE SOLVED: To express the similarity of voices to be compared as a numerical value by evaluating the similarity objectively and quantitatively.

SOLUTION: Frequency analysis units 10 and 20 perform frequency analysis on an input voice sample sequence and a reference sample sequence, respectively, to calculate amplitude spectrum sequences. An autocorrelation calculator 30 calculates autocorrelation function values of the amplitude spectrum sequence obtained from the reference sample sequence. A cross-correlation calculator 40 calculates cross-correlation function values between the amplitude spectrum sequences obtained from the input voice sample sequence and the reference sample sequence. A similarity calculator 50 derives the similarity between the input voice sample sequence and the reference sample sequence by dividing the maximum of the cross-correlation function values by the maximum of the autocorrelation function values.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a method, an apparatus, and a program for evaluating the similarity of voices.

Frequency analysis is commonly used to capture the characteristics of a voice; formant analysis of the human voice is a typical example. Given two voices, each can be frequency-analyzed and the two resulting amplitude spectrum distributions compared visually, which allows an intuitive judgment of whether the voices are similar.

Patent Document 1: JP-A-1-291515

However, even when the amplitude spectrum distributions of the voices to be compared are examined side by side, it is generally difficult to evaluate objectively and quantitatively how similar the two voices are.

The present invention has been made in view of the circumstances described above, and its object is to provide technical means for objectively and quantitatively evaluating the similarity of voices to be compared.

To achieve this object, the present invention provides a voice similarity evaluation method comprising: a first frequency analysis step of performing frequency analysis of input voice data to calculate a first amplitude spectrum sequence; a second frequency analysis step of performing frequency analysis of reference voice data to calculate a second amplitude spectrum sequence; an autocorrelation calculation step of calculating autocorrelation function values of the second amplitude spectrum sequence; a cross-correlation calculation step of calculating cross-correlation function values between the first amplitude spectrum sequence and the second amplitude spectrum sequence; and a similarity calculation step of calculating the similarity between the input voice data and the reference voice data by dividing the maximum of the cross-correlation function values by the maximum of the autocorrelation function values. The invention also provides a similarity evaluation apparatus that calculates voice similarity according to this method, and a program that causes a computer to execute the method.

According to the invention, the similarity of voices to be compared can be evaluated objectively and quantitatively and expressed as a numerical value.

One technique for comparing waveforms using cross-correlation and autocorrelation is disclosed in Patent Document 1. In that technique, cross-correlation function values between the impulse response obtained from a filter and an ideal impulse response are calculated, and whether the filter characteristics are appropriate is judged by whether these values fall within a tolerance range centered on the autocorrelation function values of the ideal impulse response. The present invention, however, does not calculate cross-correlation and autocorrelation function values for the two waveforms themselves, as Patent Document 1 does. It calculates them for the frequency analysis results of the two waveforms and derives the similarity between the waveforms from the result. This is the distinguishing feature of the present invention.

Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a voice similarity evaluation apparatus according to an embodiment of the present invention. FIG. 2 is a diagram showing the processing performed by the similarity evaluation apparatus.

As shown in FIG. 1, the similarity evaluation apparatus according to this embodiment comprises frequency analysis units 10 and 20, an autocorrelation calculation unit 30, a cross-correlation calculation unit 40, and a similarity calculation unit 50.

The frequency analysis unit 10 receives from outside the input voice sample sequence to be evaluated. From this sequence it sequentially extracts blocks, each consisting of a predetermined number of samples, arranged on the time axis so that adjacent blocks overlap by a predetermined length (in this example, 25% of the block length), as shown in FIG. 2. For each block it executes the FFT process 11 and the ABS (absolute value) process 12 shown in FIG. 1. The FFT process 11 multiplies the samples in the block by a Hanning window and then applies an FFT (fast Fourier transform) to the windowed samples. The ABS process 12 calculates, for each block, the absolute values of the N complex spectrum values (N is a predetermined integer) obtained from the FFT and outputs them as the amplitude spectrum sequence y(n) (n = 0 to N−1).
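The block extraction, windowing, and magnitude steps described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `hann`, `dft_magnitude`, and `amplitude_spectra` are hypothetical, a naive O(N²) DFT stands in for the FFT so that the sketch needs only the standard library, and the periodic form of the Hann window is an assumption because the source does not specify it.

```python
import cmath
import math


def hann(n_points):
    """Periodic Hann window of length n_points (the window form is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n_points) for i in range(n_points)]


def dft_magnitude(block):
    """Magnitudes |X(k)| for k = 0..N-1; a naive DFT stands in for FFT process 11."""
    n_points = len(block)
    magnitudes = []
    for k in range(n_points):
        acc = sum(
            block[i] * cmath.exp(-2j * math.pi * k * i / n_points)
            for i in range(n_points)
        )
        magnitudes.append(abs(acc))  # corresponds to ABS process 12
    return magnitudes


def amplitude_spectra(samples, block_len, overlap=0.25):
    """Return one amplitude spectrum sequence per block.

    Consecutive blocks share `overlap` (a fraction) of the block length.
    """
    hop = block_len - int(block_len * overlap)
    window = hann(block_len)
    spectra = []
    for start in range(0, len(samples) - block_len + 1, hop):
        block = [s * w for s, w in zip(samples[start:start + block_len], window)]
        spectra.append(dft_magnitude(block))
    return spectra
```

With a block length of 16 and 25% overlap the hop is 12 samples, so a 28-sample input yields two blocks. In practice an FFT routine from a numerical library would replace `dft_magnitude`.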

The frequency analysis unit 20, in turn, receives from outside a reference sample sequence representing the voice that serves as the reference for evaluating the similarity of the input voice samples. Using an FFT process 21 and an ABS process 22 like those of the frequency analysis unit 10, it divides the reference sample sequence into blocks, applies the FFT to each block, and outputs the amplitude spectrum sequence x(n) (n = 0 to N−1) of N values based on the FFT result.

The autocorrelation calculation unit 30 calculates, for each block, the autocorrelation function values Rxx(m) (m = −N+1 to N−1) of the amplitude spectrum sequence x(n) (n = 0 to N−1) obtained from the reference sample sequence. More specifically, as shown in FIG. 1, it executes for each block: an FFT process 31 on the amplitude spectrum sequence x(n) (n = 0 to N−1); an ABS process 32 that calculates the absolute values of the spectrum sequence obtained by the FFT process 31 and outputs the resulting sequence; an IFFT (inverse fast Fourier transform) process 33 on the sequence obtained by the ABS process 32; and a REAL process 34 that selects and outputs the real part of the complex sequence obtained by the IFFT process 33. This yields the autocorrelation function values Rxx(m) (m = −N+1 to N−1) of equation (1).
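For reference, the textbook linear autocorrelation over the lag range stated above can be written as below. This is a reconstruction under assumptions: treating out-of-range samples as zero and omitting any normalization factor such as 1/N are choices made here, and the patent's own equation may differ in scaling.

```latex
R_{xx}(m) = \sum_{n=0}^{N-1} x(n)\, x(n+m), \qquad m = -N+1, \dots, N-1,
\qquad x(n) = 0 \ \text{for } n < 0 \text{ or } n > N-1 .
```

Under this definition the maximum lies at m = 0, where R_xx(0) is the sum of the squared amplitude values x(n)².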

The cross-correlation calculation unit 40, in turn, calculates for each block the cross-correlation function values Rxy(m) (m = −N+1 to N−1) between the amplitude spectrum sequence x(n) (n = 0 to N−1) obtained from the reference sample sequence and the amplitude spectrum sequence y(n) (n = 0 to N−1) obtained from the input voice sample sequence. More specifically, as shown in FIG. 1, it executes for each block: an FFT process 41 on the amplitude spectrum sequence y(n) (n = 0 to N−1); a multiplication process 42 that multiplies the spectrum sequence obtained by the FFT process 31 by the spectrum sequence obtained by the FFT process 41; an IFFT process 43 on the spectrum sequence obtained by the multiplication process 42; and a REAL process 44 that selects and outputs the real part of the complex sequence obtained by the IFFT process 43. This yields the cross-correlation function values Rxy(m) (m = −N+1 to N−1) of equation (2).
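The quantity being computed can be illustrated with a direct summation over lags. This sketch deliberately does not reproduce the FFT, multiplication, and IFFT pipeline of processes 41 to 44; the function name `cross_correlation`, the lag sign convention Rxy(m) = Σ x(n)·y(n+m), the zero treatment of out-of-range samples, and the lack of normalization are all assumptions made for illustration.

```python
def cross_correlation(x, y):
    """Return {m: Rxy(m)} for m = -N+1..N-1, with Rxy(m) = sum_n x(n) * y(n + m).

    Samples outside 0..N-1 are treated as zero (linear, not circular, correlation).
    """
    n_points = len(x)
    if len(y) != n_points:
        raise ValueError("x and y must have the same length")
    result = {}
    for m in range(-n_points + 1, n_points):
        total = 0.0
        for n in range(n_points):
            if 0 <= n + m < n_points:
                total += x[n] * y[n + m]
        result[m] = total
    return result
```

Calling `cross_correlation(x, x)` gives the autocorrelation Rxx(m) as a special case, so the same helper covers both calculation units in this sketch.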

The similarity calculation unit 50 then calculates and outputs, for each block, the similarity D given by equation (3) from two values, as shown in FIG. 2: the maximum value Rxy-max among the cross-correlation function values Rxy(m) (m = −N+1 to N−1) calculated by the cross-correlation calculation unit 40, and the maximum value Rxx-max among the autocorrelation function values Rxx(m) (m = −N+1 to N−1) calculated by the autocorrelation calculation unit 30. In a coordinate system with frequency on the horizontal axis and amplitude on the vertical axis, D indicates how similar the waveform of the amplitude spectrum sequence y(n) (n = 0 to N−1) of a block of the input voice sample sequence is to the waveform of the amplitude spectrum sequence x(n) (n = 0 to N−1) of a block of the reference sample sequence. Because the audible characteristics of a voice appear in the distribution of its amplitude spectrum, D accurately represents how similar the input voice sample sequence and the reference sample sequence are as voices.
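The per-block similarity can be sketched as the ratio of the two maxima. The names `correlation_max` and `similarity_percent` are hypothetical, the correlation is computed by direct summation under the same assumed lag convention as a textbook linear correlation, and the factor of 100 reflects only that the source reports D in percent.

```python
def correlation_max(x, y):
    """Maximum over lags m = -N+1..N-1 of sum_n x(n) * y(n + m)."""
    n_points = len(x)
    best = float("-inf")
    for m in range(-n_points + 1, n_points):
        total = sum(
            x[n] * y[n + m] for n in range(n_points) if 0 <= n + m < n_points
        )
        best = max(best, total)
    return best


def similarity_percent(x_ref, y_in):
    """D = Rxy-max / Rxx-max for one block, expressed as a percentage."""
    rxx_max = correlation_max(x_ref, x_ref)
    if rxx_max == 0:
        raise ValueError("reference spectrum is all zeros; D is undefined")
    rxy_max = correlation_max(x_ref, y_in)
    return 100.0 * rxy_max / rxx_max
```

The guard for an all-zero reference block is an addition of this sketch; the source does not say how such a block is handled.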

In this embodiment, the similarity D is obtained by dividing the maximum value Rxy-max of the cross-correlation function values by the maximum value Rxx-max of the autocorrelation function values, for the following reason. First, in a coordinate system with frequency on the horizontal axis and amplitude on the vertical axis, Rxy-max becomes larger the more closely the waveform of the amplitude spectrum sequence y(n) (n = 0 to N−1) obtained from the input voice sample sequence resembles the waveform of the amplitude spectrum sequence x(n) (n = 0 to N−1) obtained from the reference sample sequence. In this sense, Rxy-max is a value that depends on the similarity between the amplitude spectrum sequences y(n) and x(n).

However, even when the amplitude spectrum sequence x(n) (n = 0 to N−1) of the reference sample sequence and the amplitude spectrum sequence y(n) (n = 0 to N−1) of the input voice sample sequence are stretched or compressed along the vertical axis by the same factor while keeping the same waveforms, Rxy-max increases or decreases accordingly. For example, if the amplitude values of the reference sample sequence and the input voice sample sequence are both doubled, their cross-correlation function values become four times larger even though the similarity of the waveforms themselves has not changed. Therefore, given the maximum value Rxy-max-a obtained in one block ka and the maximum value Rxy-max-b obtained in another block kb, Rxy-max-a > Rxy-max-b does not by itself mean that the waveforms of y(n) and x(n) are more similar in block ka than in block kb.
This is because the amplitude spectrum sequence x(n) (n = 0 to N−1) of the reference sample sequence used to calculate the cross-correlation function values is not the same in block ka and block kb. In that sense, Rxy-max varies with the reference sample sequence on which it is based and is a relative similarity that is not suited to comparison between blocks.
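The factor of four, and the way the division by Rxx-max introduced earlier cancels it, can be checked numerically. The two short sequences below are hypothetical amplitude spectra chosen only for illustration, and `correlation_max` is a hypothetical helper using direct summation with an assumed lag convention.

```python
def correlation_max(x, y):
    """Maximum over lags m = -N+1..N-1 of sum_n x(n) * y(n + m)."""
    n_points = len(x)
    return max(
        sum(x[n] * y[n + m] for n in range(n_points) if 0 <= n + m < n_points)
        for m in range(-n_points + 1, n_points)
    )


x = [1.0, 4.0, 2.0, 0.5]  # hypothetical reference amplitude spectrum
y = [0.8, 3.5, 2.5, 0.4]  # hypothetical input amplitude spectrum
x2 = [2.0 * v for v in x]  # both spectra doubled
y2 = [2.0 * v for v in y]

rxy_max = correlation_max(x, y)
rxy_max_scaled = correlation_max(x2, y2)

# Dividing by the reference autocorrelation maximum removes the common scale.
d_before = rxy_max / correlation_max(x, x)
d_after = rxy_max_scaled / correlation_max(x2, x2)
```

Here `rxy_max_scaled` is four times `rxy_max`, while `d_before` and `d_after` agree, which is the property the ratio is meant to provide.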

What this embodiment calculates is not such a relative similarity. It is a similarity that serves as an absolute measure: it shows objectively and quantitatively how similar the waveforms of the amplitude spectrum sequences y(n) (n = 0 to N−1) and x(n) (n = 0 to N−1) are, and it can also be used for comparison between blocks.

More specifically, in this embodiment the similarity D is the result of dividing the maximum value Rxy-max of the cross-correlation function values by the maximum value Rxx-max of the autocorrelation function values, as shown in equation (3) above. If, in a given block, the amplitude spectrum sequence y(n) (n = 0 to N−1) of the input voice sample sequence is exactly the same as the amplitude spectrum sequence x(n) (n = 0 to N−1) of the reference sample sequence, the similarity D obtained from equation (3) is 100%. As the waveform of y(n) becomes distorted relative to the waveform of x(n), D moves away from 100% accordingly.
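The 100% case can be checked directly. The derivation below is a sketch that assumes equation (3) has the form of the ratio Rxy-max / Rxx-max expressed as a percentage, which is how the text describes it; the symbols with a superscript max are shorthand introduced here for Rxy-max and Rxx-max.

```latex
y(n) = x(n)\ \ (n = 0, \dots, N-1)
\;\Longrightarrow\;
R_{xy}(m) = R_{xx}(m)\ \text{for all } m
\;\Longrightarrow\;
D = \frac{R_{xy}^{\max}}{R_{xx}^{\max}} \times 100\,\% = 100\,\% .
```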

The similarity D obtained in this embodiment thus shows, on the same scale in every block, how similar the two amplitude spectrum sequences y(n) (n = 0 to N−1) and x(n) (n = 0 to N−1) being compared are. It can therefore be used for comparison between blocks, and in that sense it is an absolute measure.

FIGS. 3 and 4 show the effect of this embodiment. FIG. 3 shows an operation example in which the L-channel and R-channel voice sample sequences of a piece of music are used as the reference sample sequences. These sequences are compression-encoded according to a certain algorithm, the resulting encoded data are decoded, and the decoded L-channel and R-channel voice sample sequences are supplied to the similarity evaluation apparatus of this embodiment as the input voice sample sequences. FIG. 4 shows an operation example in which a completely unrelated reference sample sequence and input voice sample sequence are supplied to the apparatus. In these figures, the maximum value Rxy-max of the cross-correlation function values and the maximum value Rxx-max of the autocorrelation function values are given in dB, and the similarity D in %.

Comparing FIGS. 3 and 4 shows that the similarity D calculated in this embodiment indicates, objectively and quantitatively, the similarity between the waveforms of the reference sample sequence and the input voice sample sequence. In the example of FIG. 3, the input voice sample sequence is the reference sample sequence after compression encoding and decoding, so it is the reference sample sequence with noise from the encoding and decoding process superimposed. Because of this noise, blocks in which the similarity between the two sequences drops slightly appear at random. FIG. 3 shows exactly this: D varies between blocks but is generally close to 100%. In contrast, when the input voice sample sequence and the reference sample sequence are completely unrelated, D varies widely between blocks, as the example of FIG. 4 shows.
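The per-block comparison can be tried end to end on synthetic signals. The sketch below does not reproduce the data of FIG. 3 or FIG. 4: the test signals, the block length of 32, the periodic Hann window, the naive DFT used in place of an FFT, and all function names are assumptions chosen so that the example runs with the standard library alone.

```python
import cmath
import math


def amplitude_spectrum(block):
    """Hann-window one block, then return |DFT| (naive DFT in place of an FFT)."""
    n = len(block)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n)) for i, s in enumerate(block)
    ]
    return [
        abs(sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)))
        for k in range(n)
    ]


def correlation_max(x, y):
    """Maximum over lags m = -N+1..N-1 of sum_n x(n) * y(n + m)."""
    n = len(x)
    return max(
        sum(x[i] * y[i + m] for i in range(n) if 0 <= i + m < n)
        for m in range(-n + 1, n)
    )


def block_similarities(reference, candidate, block_len=32, overlap=0.25):
    """Similarity D (percent) for each pair of blocks at the same position."""
    hop = block_len - int(block_len * overlap)
    last_start = min(len(reference), len(candidate)) - block_len
    result = []
    for start in range(0, last_start + 1, hop):
        x = amplitude_spectrum(reference[start:start + block_len])
        y = amplitude_spectrum(candidate[start:start + block_len])
        result.append(100.0 * correlation_max(x, y) / correlation_max(x, x))
    return result


# Synthetic stand-ins only; these are not the signals used in the patent figures.
reference = [
    math.sin(2.0 * math.pi * 3 * i / 32) + 0.5 * math.sin(2.0 * math.pi * 7 * i / 32)
    for i in range(128)
]
# Reference plus a small deterministic disturbance, standing in for coding noise.
near_copy = [s + 0.001 * math.cos(1.7 * i) for i, s in enumerate(reference)]
# A tone at a different frequency, standing in for an unrelated signal.
unrelated = [math.sin(2.0 * math.pi * 11 * i / 32) for i in range(128)]

d_near = block_similarities(reference, near_copy)
d_unrelated = block_similarities(reference, unrelated)
```

In this synthetic case the near copy scores close to 100% in every block and the unrelated tone scores well below it. Because the test signals are stationary, the block-to-block variation seen with real audio does not appear here.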

As described above, according to this embodiment, when there are two voices to be compared, their similarity can be evaluated objectively and quantitatively.

The embodiment has been described above for the case where the present invention is embodied as an apparatus. The invention can also be implemented by creating a program that causes a computer to function as the similarity evaluation apparatus shown in FIG. 1 and distributing this program to users.

FIG. 1 is a block diagram showing the configuration of a similarity evaluation apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the processing performed by the similarity evaluation apparatus.
FIG. 3 is a diagram showing the effect of the embodiment.
FIG. 4 is a diagram showing the effect of the embodiment.

Explanation of symbols

10, 20: frequency analysis units; 30: autocorrelation calculation unit; 40: cross-correlation calculation unit; 50: similarity calculation unit.

Claims

1. A voice similarity evaluation method comprising:
a first frequency analysis step of performing frequency analysis of input voice data to calculate a first amplitude spectrum sequence;
a second frequency analysis step of performing frequency analysis of reference voice data to calculate a second amplitude spectrum sequence;
an autocorrelation calculation step of calculating autocorrelation function values of the second amplitude spectrum sequence;
a cross-correlation calculation step of calculating cross-correlation function values between the first amplitude spectrum sequence and the second amplitude spectrum sequence; and
a similarity calculation step of calculating the similarity between the input voice data and the reference voice data by dividing the maximum of the cross-correlation function values by the maximum of the autocorrelation function values.

2. A voice similarity evaluation apparatus comprising:
first frequency analysis means for performing frequency analysis of input voice data to calculate a first amplitude spectrum sequence;
second frequency analysis means for performing frequency analysis of reference voice data to calculate a second amplitude spectrum sequence;
autocorrelation calculation means for calculating autocorrelation function values of the second amplitude spectrum sequence;
cross-correlation calculation means for calculating cross-correlation function values between the first amplitude spectrum sequence and the second amplitude spectrum sequence; and
similarity calculation means for calculating the similarity between the input voice data and the reference voice data by dividing the maximum of the cross-correlation function values by the maximum of the autocorrelation function values.

3. The similarity evaluation apparatus according to claim 2, wherein:
the first frequency analysis means and the second frequency analysis means divide the input voice data and the reference voice data, respectively, into blocks each having a predetermined time length, and output the first amplitude spectrum sequence and the second amplitude spectrum sequence in units of blocks;
the autocorrelation calculation means calculates the autocorrelation function values in units of blocks;
the cross-correlation calculation means calculates the cross-correlation function values in units of blocks; and
the similarity calculation means calculates the similarity in units of blocks.

4. A computer program causing a computer to execute:
a first frequency analysis step of performing frequency analysis of input voice data to calculate a first amplitude spectrum sequence;
a second frequency analysis step of performing frequency analysis of reference voice data to calculate a second amplitude spectrum sequence;
an autocorrelation calculation step of calculating autocorrelation function values of the second amplitude spectrum sequence;
a cross-correlation calculation step of calculating cross-correlation function values between the first amplitude spectrum sequence and the second amplitude spectrum sequence; and
a similarity calculation step of calculating the similarity between the input voice data and the reference voice data by dividing the maximum of the cross-correlation function values by the maximum of the autocorrelation function values.