JPWO2007015489A1

JPWO2007015489A1 - Voice search apparatus and voice search method

Info

Publication number: JPWO2007015489A1
Application number: JP2007529275A
Authority: JP
Inventors: 佐藤　寧; 寧佐藤
Original assignee: Kyushu Institute of Technology NUC
Current assignee: Kyushu Institute of Technology NUC
Priority date: 2005-08-01
Filing date: 2006-08-01
Publication date: 2009-02-19
Anticipated expiration: 2026-08-01
Also published as: WO2007015489A1; JP4961565B2

Abstract

Provided is a voice search device that needs no standard voice patterns, is unaffected by individual differences in voice, and achieves high search accuracy. The device includes partial voice search means that searches pitch-equalized search target voice data, obtained by equalizing the pitch period of the voiced sounds in the search target voice data, for partial voice data whose distance measure in the voice feature space to pitch-equalized query voice data, obtained by equalizing the pitch period of the voiced sounds in the query voice data, is at or below a predetermined threshold. Equalizing the pitch period makes the search almost insensitive to gender differences and individual differences in the voice band, so voice search can be performed with high accuracy.

Description

The present invention relates to a voice search device for finding, within stored search target voice data, the portions that match a given voice.

In recent years there has been growing demand for multimedia databases that extract, from a large body of stored video and audio data, only the information the viewer most wants. A typical example is a News On Demand (NOD) system, which extracts only the news the viewer most wants from among many stored news programs.

To build such a multimedia database, a voice search technique is needed that finds, within stored video and audio data such as TV news, the portions matching the voice of a search keyword (hereinafter "query voice").

A known voice search device that searches search target voice data for portions matching a query voice is the one described in Patent Document 1.

FIG. 12 shows the configuration of the voice search device described in Patent Document 1. In this device, when a voice signal is input to the voice signal input unit 102 of the search data generation unit 100, the signal is stored in the recording unit 201 as search target voice data, together with the video search index generated by the video search index generation unit 104. In synchronization with the voice signal, a video signal is input to the video signal input unit 101 and stored in the recording unit 201 as stored video data. The query voice, in turn, is input from the keyword input unit 203 of the search processing unit 200 and matched against the search target voice data in the keyword pattern matching unit 205, and the best-matching voice signal is output from the voice signal output unit 207. These processes are outlined below.

First, when a voice signal is input to the voice signal input unit 102, the voice feature pattern extraction unit 103 divides the input voice into 10 msec analysis frames. It applies a fast Fourier transform to each analysis frame to generate acoustic characteristic data for the frequency band of the utterance, and then converts this data into N-dimensional vector data composed of acoustic feature quantities (hereinafter "feature pattern"). The acoustic feature quantities used include the short-time spectrum of the input voice in the utterance frequency band or its logarithm, and the logarithmic energy of the input voice within a fixed time.

Next, the video search index generation unit 104 retrieves the first standard voice pattern from the voice feature pattern storage unit 105.

Here, 500 standard voice patterns are stored in advance in the voice feature pattern storage unit 105. A standard voice pattern is obtained by analyzing utterances collected in advance from multiple speakers and statistically processing and standardizing the voice feature patterns extracted in subword units (#V, #CV, #CjV, CV, CjV, VC, QC, VQ, VV, V#, where C is a consonant, V a vowel, j a palatalized sound (yōon), Q a geminate consonant (sokuon), and # silence).

For one voice section to be processed, the video search index generation unit 104 calculates the similarity between the first standard voice pattern and the voice feature pattern of the input voice by speech recognition processing such as DP matching or an HMM (Hidden Markov Model). It then detects the section showing the highest similarity to the first standard voice pattern as a "subword section". Hereinafter, the similarity of a subword section is called its "score". The video search index generation unit 104 outputs the set consisting of the phoneme symbol, the utterance section (start time, end time), and the score of the subword section as a "video search index".

In the same way, subword sections are detected for the second and subsequent standard voice patterns, and a video search index is output for each detected subword section.

Once video search indexes have been generated for all standard voice patterns in the current voice section, the video search index generation unit 104 moves on to the adjacent next voice section and performs the same processing. Processing ends when video search indexes have been created over the entire input voice.

The voice data of the input voice and the video search indexes are stored in the recording unit 201 as search target voice data. FIG. 13 shows part of the lattice structure of the video search indexes stored in the recording unit 201. In FIG. 13, the end of each voice section of the input voice, divided in units of 10 msec, is taken as the end of each video search index generated for that section, and the video search indexes of the same voice section are arranged in the order in which they were generated. This lattice structure of video search indexes is called a "phoneme similarity table". A "lattice" is a table that lists, for a variety of consecutive voice sections, multiple phoneme or word candidates together with their likelihoods (see Non-Patent Document 1, p. 198).

A video scene is searched for with a query voice as follows. First, the query voice serving as the search keyword is input to the keyword input unit 203. The keyword conversion unit 204 converts the query voice into a time series of subwords. Next, the keyword pattern matching unit 205 picks up from the phoneme similarity table only the subwords that make up the query voice. It then connects the picked-up subwords on the lattices without gaps, in the order of the subword sequence obtained from the search keyword.

For example, when "sora" (空, "sky") is input to the keyword input unit 203 as the query voice, the keyword conversion unit 204 generates the subword sequence "SO", "OR", "RA". The keyword pattern matching unit 205 picks up the subwords "SO", "OR", and "RA" from the phoneme similarity table and connects them without gaps. Specifically, it takes the subword "RA" from the lattice at some time, takes the preceding subword "OR" from the lattice at the start time of "RA", and takes the subword "SO" from the lattice at the start time of "OR". It then concatenates "SO", "OR", and "RA", aligned on the end of the last subword "RA".

For the keyword restored by concatenating the subwords in this way ("SO", "OR", "RA" in the example above), the sum of the scores of its subwords is calculated as the score of the restored keyword.

In the same way, restored keywords are created in turn for every time by shifting the end time of the subword "RA", and the score of each restored keyword is calculated (see FIG. 14).
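The backward chaining described above can be sketched as follows. This is a minimal illustration, not the implementation of Patent Document 1: the `IndexEntry` layout, the function names, and the rule of taking the highest-scoring candidate at each step are assumptions made for the example.

```python
from typing import NamedTuple

class IndexEntry(NamedTuple):
    """One video search index entry (hypothetical layout)."""
    subword: str
    start: int    # start time, in 10 msec frames
    end: int      # end time, in 10 msec frames
    score: float

def restore_keyword(entries, subwords, end_time):
    """Chain the subwords backward from end_time.

    Returns (start_time, total_score), or None if the chain breaks.
    """
    total = 0.0
    t = end_time
    for sw in reversed(subwords):
        # Candidates are entries of this subword that end exactly at t.
        candidates = [e for e in entries if e.subword == sw and e.end == t]
        if not candidates:
            return None
        best = max(candidates, key=lambda e: e.score)  # assumed tie-break rule
        total += best.score
        t = best.start  # the preceding subword must end where this one starts
    return t, total

def search(entries, subwords):
    """Try every end time of the last subword; best-scoring keywords first."""
    end_times = sorted({e.end for e in entries if e.subword == subwords[-1]})
    results = []
    for t in end_times:
        restored = restore_keyword(entries, subwords, t)
        if restored is not None:
            start, total = restored
            results.append((total, start, t))
    return sorted(results, reverse=True)
```

Each result tuple holds the total score, the start time of the first subword, and the end time of the last subword; the start time is what the control unit would convert into a time code.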

The control unit 202 calculates the time code of the corresponding video signal from the start time of the first subword of the restored keywords with the highest scores. It then controls playback of the corresponding portions of the stored video data and search target voice data held in the recording unit 201.

JP 2000-236494 A (Japanese Patent No. 3252282)
JP 2005-91709 A
Sadaoki Furui (古井貞煕), "Acoustic and Speech Engineering" (音響・音声工学), Kindai Kagaku Sha (近代科学社), pp. 194-210

The conventional voice search device described above performs speech recognition using the standard voice patterns stored in the voice feature pattern storage unit 105, based on the similarity between the query voice and the standard voice patterns. Raising the recognition accuracy therefore requires preparing many standard voice patterns. As the number of standard voice patterns grows, however, the processing time of the similarity calculation increases or the arithmetic circuit becomes larger. Moreover, when a query voice that is not registered as a standard voice pattern is input, it cannot be recognized correctly, so the voice search function may fail to work properly.

In addition, even for the same phoneme, the frequency band of the voice normally differs between men and women, and differs between individuals of the same sex. These differences affect the similarity between the standard voice patterns and the query voice, which limits the recognition accuracy.

An object of the present invention is therefore to provide a voice search device that needs no standard voice patterns, is unaffected by individual differences in voice, and achieves high search accuracy.

A first configuration of the voice search device according to the present invention is a voice search device (voice retrieval device) that searches search target voice data (retrieval voice-data) for partial voice data (partial voice-data) that matches or is similar to query voice data (query voice-data). It is characterized by comprising partial voice search means that searches pitch-equalized search target voice data (pitch-equalized retrieval voice-data), obtained by equalizing the pitch period of the voiced sounds of the search target voice data, for partial voice data whose distance measure (or similarity measure (likelihood measure)) in the voice feature space to pitch-equalized query voice data, obtained by equalizing the pitch period of the voiced sounds of the query voice data, is at or below a predetermined threshold (or at or above a predetermined threshold).

Equalizing the pitch periods of the search target voice data and the query voice data in this way removes gender differences and individual differences in the voice band. The distance measure or similarity measure in the feature space between the pitch-equalized search target voice signal and the pitch-equalized query voice signal is therefore almost unaffected by such differences and is determined by the phoneme sequence that the voice represents. Using this distance measure or similarity measure as the matching index thus makes highly accurate voice search possible.

Here, the "feature quantity" may be, for example, the short-time spectrum of the voice in the utterance frequency band or its logarithm, or the logarithmic energy within a fixed time. When the short-time spectrum is used, the feature quantity may be, for example, the time series of per-band feature data obtained with a bank of about 10 to 30 band-pass filter channels, a spectrum calculated directly with a short-time FFT, a cepstrum obtained by cepstrum transformation, a correlation data sequence calculated with a correlation function, an LPC coefficient sequence obtained from LPC analysis, PARCOR coefficients, or LSP frequencies.

Various "distance measures" can be used, depending on the feature quantity. For example, when the short-time spectrum is used as the feature quantity, the options include the simple Euclidean distance, a distance weighted to reflect auditory sensitivity, the Euclidean distance in a space projected to lower dimension by a statistical analysis such as discriminant analysis or principal component analysis, the Mahalanobis distance, the Itakura-Saito distance, the COSH measure, the WLR measure (weighted likelihood ratio), the PWLR measure (power-weighted likelihood ratio), the Euclidean distance between LPC cepstra, and the weighted Euclidean distance between LPC cepstra.

Note that the distance measure d(x, y) between feature quantities x and y (in general, vector quantities) need not satisfy the triangle inequality as a distance in the mathematical sense does. It is desirable, however, that it have symmetry and positivity, and an algorithm that computes d(x, y) efficiently must exist.
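The defining expression for symmetry and positivity does not survive in this text. The form below is the standard statement of the two properties, given as a reconstruction rather than a quotation of the original equation.

```latex
% Symmetry
d(x, y) = d(y, x)
% Positivity
d(x, y) > 0 \quad (x \neq y), \qquad d(x, y) = 0 \quad (x = y)
```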

A "similarity measure" is a measure of how similar two feature quantities are. For example, a similarity defined as a function of the two feature quantities x and y can be used.
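The equation for the similarity is not reproduced in this text. Purely as an illustration of what such a measure can look like, and not as the definition used in the original, one widely used similarity between two feature vectors is the normalized inner product (cosine similarity):

```latex
s(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
```

Under this assumed definition, s(x, y) approaches 1 as the two vectors become more alike, which is why a similarity measure is tested against a lower threshold while a distance measure is tested against an upper one.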

A second configuration of the voice search device according to the present invention is characterized in that, in the first configuration, the device comprises pitch period equalizing means that generates the pitch-equalized query voice data by equalizing the pitch period of the voiced sounds of the query voice data, and feature data generation means that generates data obtained by converting the pitch-equalized query voice data into time-series data of feature quantities (hereinafter "query feature data (query feature-data)"), and in that the partial voice search means searches the partial voice data contained in the pitch-equalized search target voice data for data whose feature quantities have a distance measure (or similarity measure) to the query feature data that is at or below a predetermined threshold (or at or above a predetermined threshold).

With this configuration, when query voice data is input, the pitch period equalizing means equalizes the pitch period of its voiced sounds. The feature data generation means then computes the feature quantities of the pitch-equalized query voice data and generates the query feature data. The partial voice search means extracts partial voice data by applying a threshold test to the distance measure (or similarity measure) between the partial voice data of the pitch-equalized search target voice data and the query feature data. Voice data that matches or is similar to the query voice data can thus be found within the search target voice data.

A third configuration of the voice search device according to the present invention is characterized in that, in the first or second configuration, the partial voice search means comprises: partial voice selection means that, while moving the selection position, successively selects partial data of the same phoneme length as the query voice data (hereinafter "selected feature data") from search target feature data obtained by converting the pitch-equalized search target voice data into time-series data of feature quantities; feature measure calculation means that calculates the distance measure (or similarity measure) between each piece of selected feature data and the query feature data; and match position determination means that, when the distance measure (or similarity measure) is at or below a predetermined threshold (or at or above a predetermined threshold), outputs the position in the search target voice data corresponding to the selected feature data.

With this configuration, partial voice data whose distance measure (or similarity measure) in the feature space to the pitch-equalized query voice data is at or below a predetermined threshold (or at or above a predetermined threshold) can be extracted from the search target voice data.

The procedure by which the partial voice selection means "moves the selection position" is not particularly limited. For example, the start position of the partial voice data may be moved step by step from the beginning of the search target voice data toward the end, or conversely the end position of the partial voice data may be moved step by step from the end of the search target voice data toward the beginning.
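The selection, measure calculation, and match position determination of the third configuration can be sketched as a sliding-window search. This is a minimal sketch under stated assumptions: feature data is represented as a list of per-frame feature vectors, the window is moved from the beginning toward the end, the simple Euclidean distance serves as the distance measure, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of feature vectors."""
    return math.sqrt(sum((x - y) ** 2
                         for va, vb in zip(a, b)
                         for x, y in zip(va, vb)))

def find_matches(target, query, threshold):
    """Return the start positions whose window is within threshold of the query."""
    n = len(query)
    hits = []
    # Move the selection position from the beginning toward the end.
    for pos in range(len(target) - n + 1):
        window = target[pos:pos + n]          # the "selected feature data"
        if euclidean(window, query) <= threshold:
            hits.append(pos)                  # match position output
    return hits
```

With a similarity measure instead of a distance measure, the comparison would be reversed to `>= threshold`.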

A fourth configuration of the voice search device according to the present invention is characterized in that, in the third configuration, the device comprises voice storage means that stores the search target feature data.

Storing the search target voice data in the voice storage means in advance, in the form of search target feature data, makes it possible to search quickly for partial voice data similar to the query voice data.

A fifth configuration of the voice search device according to the present invention is characterized in that, in the third or fourth configuration, the device comprises second pitch period equalizing means that generates the pitch-equalized search target voice data by equalizing the pitch period of the voiced sounds of the search target voice data, and second feature data generation means that generates the search target feature data by converting the pitch-equalized search target voice data into time-series data of feature quantities.

With this configuration, even when the pitch period of the voiced sounds of the search target voice data in the voice database has not been equalized, the feature quantities of pitch-equalized search target voice data can be obtained by equalizing the pitch period with the second pitch period equalizing means and computing the feature quantities with the second feature data generation means.

A sixth configuration of the voice search device according to the present invention is characterized in that, in the second or fifth configuration, the pitch period equalizing means (or the second pitch period equalizing means) comprises pitch detection means that detects the pitch frequency of the query voice data (or the search target voice data), residual calculation means that calculates the difference between the pitch frequency and a predetermined reference frequency, and a frequency shifter that equalizes the pitch frequency of the query voice data (or the search target voice data) so that the difference is minimized.
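The pitch detection and residual calculation stages can be sketched as follows. The text does not say how the pitch is detected, so the autocorrelation-peak method, the 60 to 400 Hz search range, and the function names here are assumptions for illustration; the frequency shifter that consumes the residual is not shown.

```python
import math

def detect_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the pitch frequency (Hz) from the autocorrelation peak.

    Only lags corresponding to frequencies in [f_min, f_max] are searched,
    and the frame must be longer than sample_rate / f_min samples.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag

def pitch_residual(pitch_hz, reference_hz):
    """Difference between the detected pitch and the reference frequency."""
    return pitch_hz - reference_hz

# Example: a 200 Hz tone sampled at 8 kHz, compared with a 150 Hz reference.
frame = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(800)]
residual = pitch_residual(detect_pitch(frame, 8000), 150.0)
```

A frequency shifter would then shift the signal so as to drive this residual toward zero, which moves the pitch of every speaker to the same reference frequency.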

With this configuration, the pitch period equalizing means (or the second pitch period equalizing means) can equalize the pitch frequency of the query voice data (or the search target voice data).

A seventh configuration of the voice search device according to the present invention is characterized in that, in any one of the first to sixth configurations, the search target feature data and the query feature data are time series of subband data obtained by orthogonally transforming the pitch-equalized search target voice data and the pitch-equalized query voice data, respectively.

Using subbands as the feature quantities in this way allows the search target feature data and the query feature data to be obtained at high speed with a simple filter bank, an FFT, a DFT, or the like.
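A time series of subband data can be sketched with a plain DFT. This is an illustration under assumptions, not the transform specified by the text: the frames do not overlap, the subbands are equal-width groups of DFT bins, and the subband value is the summed squared magnitude. A direct O(n²) DFT is used to keep the example dependency-free; an FFT would be used in practice.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the first n/2 DFT bins of a real-valued frame."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def subband_energies(frame, num_bands):
    """Group the DFT bins into equal-width subbands and sum their energies."""
    mags = dft_magnitudes(frame)
    width = len(mags) // num_bands
    return [sum(m * m for m in mags[b * width:(b + 1) * width])
            for b in range(num_bands)]

def subband_series(signal, frame_len, num_bands):
    """Time series of subband data, one vector per non-overlapping frame."""
    return [subband_energies(signal[i:i + frame_len], num_bands)
            for i in range(0, len(signal) - frame_len + 1, frame_len)]
```

Applied to both the pitch-equalized query and the pitch-equalized search target, this yields two sequences of equal-dimension vectors that can be compared frame by frame.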

An eighth configuration of the voice search device according to the present invention is characterized in that, in the second or fifth configuration, the device comprises first section dividing means that averages the query feature data over each phoneme section and converts it into time-series data of the averages, and second section dividing means that averages the search target feature data over each phoneme section and converts it into time-series data of the averages, and in that the feature measure calculation means calculates the distance measure (or similarity measure) between the time-series data of averages generated by the first and second section dividing means.

Averaging the feature quantities over each phoneme section and using the averages for the matching decision reduces the influence of noise and fluctuation and improves search accuracy. Each feature quantity is also discretized in time by phoneme section, which removes the effect of temporal stretching and compression of the voice. The matching decision therefore reduces to a simple comparison, with no need for a computation-heavy method such as DP matching, so the device configuration is simplified and computation is sped up.
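The averaging step can be sketched as below. The phoneme section boundaries are assumed to be given as (start, end) frame index pairs; how they are detected is outside this sketch, and the function name is hypothetical.

```python
def average_by_segment(features, boundaries):
    """Average per-frame feature vectors over each phoneme section.

    features:   list of per-frame feature vectors (lists of floats)
    boundaries: list of (start, end) frame indices, end exclusive
    """
    averaged = []
    for start, end in boundaries:
        segment = features[start:end]
        dim = len(segment[0])
        averaged.append([sum(v[d] for v in segment) / len(segment)
                         for d in range(dim)])
    return averaged

# The same utterance spoken at two speeds: the slow version repeats each frame.
fast = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
slow = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0],
        [10.0, 10.0], [10.0, 10.0]]
fast_avg = average_by_segment(fast, [(0, 2), (2, 3)])
slow_avg = average_by_segment(slow, [(0, 4), (4, 6)])
```

Both calls produce one averaged vector per phoneme section, so the two sequences have the same length regardless of speaking rate and can be compared element by element without time warping.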

A ninth configuration of the voice search device according to the present invention is characterized in that, in any one of the first to eighth configurations, the device comprises: phoneme labeling processing means that generates a query phoneme string (or a search target phoneme string) by performing phoneme labeling on the query voice data (or the search target voice data); phoneme string measure calculation means that determines the distance measure (or similarity measure) between the search target phoneme string corresponding to the selected feature data and the query phoneme string; and total measure calculation means that calculates the linear sum (hereinafter "total distance measure (or total similarity measure)") of the distance measure (or similarity measure) of the feature quantities output by the feature measure calculation means and the distance measure (or similarity measure) of the phoneme strings output by the phoneme string measure calculation means, and in that the match position determination means outputs the position in the search target voice data corresponding to the selected feature data when the total distance measure (or total similarity measure) is at or below a predetermined threshold (or at or above a predetermined threshold).

Taking the phoneme string measure into account in the matching decision, in addition to the feature measure, further improves search accuracy.
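The linear sum of the two measures can be sketched as follows. The text does not specify the phoneme string distance or the weights, so the edit (Levenshtein) distance and the weight parameters `w_feature` and `w_phoneme` are assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def total_distance(feature_dist, query_phonemes, target_phonemes,
                   w_feature=1.0, w_phoneme=1.0):
    """Linear sum of the feature distance and the phoneme string distance."""
    phoneme_dist = edit_distance(query_phonemes, target_phonemes)
    return w_feature * feature_dist + w_phoneme * phoneme_dist
```

The match position determination then compares `total_distance(...)` with the threshold in place of the feature distance alone.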

The voice search method according to the present invention is a voice search method for searching search target voice data for partial voice data that matches or is similar to query voice data. It is characterized by having a partial voice search step of searching pitch-equalized search target voice data, obtained by equalizing the pitch period of the voiced sounds of the search target voice data, for partial voice data whose distance measure (or similarity measure) in the voice feature space to pitch-equalized query voice data, obtained by equalizing the pitch period of the voiced sounds of the query voice data, is at or below a predetermined threshold (or at or above a predetermined threshold).

A second configuration of the voice search method according to the present invention is characterized in that, in the first configuration, the method comprises a pitch period equalizing step of generating the pitch-equalized query voice data by equalizing the pitch period of the voiced sounds of the query voice data, and a feature data generation step of generating data obtained by converting the pitch-equalized query voice data into time-series data of feature quantities (hereinafter "query feature data"), and in that the partial voice search step searches the partial voice data contained in the pitch-equalized search target voice data for data whose feature quantities have a distance measure (or similarity measure) to the query feature data that is at or below a predetermined threshold (or at or above a predetermined threshold).

A third configuration of the voice search method according to the present invention is characterized in that, in the first or second configuration, the partial voice search step has: a partial voice selection step of successively selecting, while moving the selection position, partial data of the same phoneme length as the query voice data (hereinafter "selected feature data") from search target feature data obtained by converting the pitch-equalized search target voice data into time-series data of feature quantities; a feature measure calculation step of calculating the distance measure (or similarity measure) between each piece of selected feature data and the query feature data; and a match position determination step of outputting, when the distance measure (or similarity measure) is at or below a predetermined threshold (or at or above a predetermined threshold), the position in the search target voice data corresponding to the selected feature data.

The fourth configuration of the voice search method according to the present invention is characterized in that, in the third configuration, the method comprises a voice storage step of storing the search target feature data.

The fifth configuration of the voice search method according to the present invention is characterized in that, in the third or fourth configuration, the method comprises: a second pitch period equalization step of generating the pitch-equalized search target voice data by equalizing the pitch period of the voiced sound of the search target voice data; and a second feature data generation step of generating the search target feature data by converting the pitch-equalized search target voice data into time-series data of feature quantities.

The sixth configuration of the voice search method according to the present invention is characterized in that, in the second or fifth configuration, the pitch period equalization step (or the second pitch period equalization step) includes: a pitch detection step of detecting the pitch frequency of the query voice data (or the search target voice data); a residual calculation step of calculating the difference between the pitch frequency and a predetermined reference frequency; and a frequency shift step of equalizing the pitch frequency of the query voice data (or the search target voice data) so that the difference is minimized.

The seventh configuration of the voice search method according to the present invention is characterized in that, in any one of the first to sixth configurations, the search target feature data and the query feature data are time series of subband data obtained by orthogonally transforming the pitch-equalized search target voice data and the pitch-equalized query voice data, respectively.

The eighth configuration of the voice search method according to the present invention is characterized in that, in the second or fifth configuration, the method comprises: a first section dividing step of averaging the query feature data for each phoneme section and converting it into time-series data of average values; and a second section dividing step of averaging the search target feature data for each phoneme section and converting it into time-series data of average values, and in that, in the feature quantity measure calculation step, the distance measure (or similarity measure) between the time-series data of average values generated in the first and second section dividing steps is calculated.
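A minimal sketch of the per-phoneme averaging performed by the section dividing steps is given below. The function name `average_by_segment` and the representation of phoneme sections as `(start, end)` frame ranges with an exclusive end are assumptions made for illustration; every section is assumed to contain at least one frame.

```python
def average_by_segment(features, segments):
    """Average the feature vectors inside each (start, end) frame range.

    The end index is exclusive and every range is assumed non-empty.
    """
    averaged = []
    for start, end in segments:
        frames = features[start:end]
        dim = len(frames[0])
        averaged.append([sum(f[k] for f in frames) / len(frames)
                         for k in range(dim)])
    return averaged

# Five frames split into two phoneme sections of 2 and 3 frames.
features = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 0.0], [14.0, 0.0]]
segments = [(0, 2), (2, 5)]
means = average_by_segment(features, segments)
print(means)
```

Because each phoneme contributes exactly one averaged vector regardless of its duration, two utterances of the same phoneme string spoken at different speeds yield series of the same length.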

The ninth configuration of the voice search method according to the present invention is characterized in that, in any one of the first to eighth configurations, the method comprises: a phoneme labeling step of generating a query phoneme sequence (or a search target phoneme sequence) by performing phoneme labeling on the query voice data (or the search target voice data); a phoneme sequence measure calculation step of determining a distance measure (or similarity measure) between the query phoneme sequence and the search target phoneme sequence corresponding to the selected feature data; and a total measure calculation step of calculating the linear sum (hereinafter referred to as the "total distance measure (or total similarity measure)") of the feature quantity distance measure (or similarity measure) output in the feature quantity measure calculation step and the phoneme sequence distance measure (or similarity measure) output in the phoneme sequence measure calculation step, and in that, in the matching position determination step, the position in the search target voice data corresponding to the selected feature data is output when the total distance measure (or total similarity measure) is equal to or less than a predetermined threshold (or equal to or greater than a predetermined threshold).
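The linear combination of the two measures might look like the sketch below. The text does not say how the phoneme sequence distance is determined, so the Levenshtein edit distance used here is a swapped-in assumption, as are the weights `w_feature` and `w_phoneme` and the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences (assumed measure)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def total_distance(feature_dist, query_phonemes, target_phonemes,
                   w_feature=1.0, w_phoneme=0.5):
    """Linear sum of the feature distance and the phoneme sequence distance."""
    return (w_feature * feature_dist
            + w_phoneme * edit_distance(query_phonemes, target_phonemes))

# One substituted phoneme ("s" vs "t") plus a small feature distance.
d = total_distance(0.1, ["k", "a", "s", "a"], ["k", "a", "t", "a"])
print(d)
```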

The program according to the present invention is characterized in that, when read and executed by a computer, it causes the computer to function as the voice search device of any one of the first to eighth configurations.

As described above, according to the present invention, the pitch periods of the search target voice data and the query voice data are equalized, and the voice search is performed by matching feature quantities on voice data from which gender and individual differences in the voice band have been removed. The search is therefore almost unaffected by gender and individual differences in the voice band, and the accuracy of the voice search can be improved.

In addition, the feature quantities of the pitch-equalized search target voice data and query voice data are averaged for each phoneme section, and the voice search is performed by matching the time series of the averaged feature quantities. This reduces the influence of noise and fluctuation and removes the influence of temporal expansion and contraction of the voice. As a result, the accuracy of the voice search can be improved.

A diagram showing the overall configuration of the voice search device 1 according to Embodiment 1 of the present invention.
A block diagram showing the configuration of the voice encoder 2 of FIG. 1.
A block diagram showing the configuration of the pitch period equalizing means 10 of FIG. 2.
A diagram outlining the signal processing in the pitch detection means 21 and the pitch averaging means 22.
A diagram showing the formant characteristics of the voiced sound "a".
A diagram showing the autocorrelation, cepstrum waveform, and frequency characteristics of the unvoiced sound "su".
A diagram showing the internal configuration of the frequency shifter 23.
A diagram showing another example of the internal configuration of the frequency shifter 23.
A block diagram showing the configuration of the voice decoder 5 of FIG. 1.
A block diagram showing the configuration of the partial voice search means 6 of FIG. 1.
An explanatory diagram of the number of quantization bits.
A diagram showing the configuration of the voice search device described in Patent Document 1.
A diagram showing part of the lattice structure of the video search index stored in the recording unit 201.
A diagram showing the structure of the lattice connected in order to calculate the score of each restored keyword.

Explanation of symbols

1 Voice search device
2 Voice encoder
3 Voice storage means
4 Data reading means
5 Voice decoder
6 Partial voice search means
10 Pitch period equalizing means
11 Feature data generation means
12a, 12b Output switching means
13 Quantizer
14 Pitch-equalized waveform encoder
15 Differential bit calculator
16 Pitch information encoder
17 Phoneme labeling processing means
18 Resampler
19 Analyzer
20 Resistor
21 Input pitch detection means
22 Pitch averaging means
23 Frequency shifter
24 Output pitch detection means
25 Residual calculation means
26 PID controller
27 Pitch detection means
28 BPF
29 Frequency counter
31 BPF
32 Frequency counter
34 Amplifier
36 Capacitor
41 Oscillator
42 Modulator
43 BPF
44 VCO
45 Demodulator
51 Pitch-equalized waveform decoder
52 Inverse quantizer
53 Synthesizer
54 Pitch information decoder
55 Pitch frequency detection means
56 Subtractor
57 Adder
58 Frequency shifter
59 Output switching means
61 Operation switching means
62 Partial voice selection means
63, 64 Section dividing means
65 Feature quantity measure calculation means
66 Phoneme sequence measure calculation means
67 Total measure calculation means
68 Matching position determination means

The best mode for carrying out the present invention is described below with reference to the drawings.

FIG. 1 is a diagram showing the overall configuration of the voice search device according to Embodiment 1 of the present invention. The voice search device 1 of Embodiment 1 comprises a voice encoder 2, voice storage means 3, data reading means 4, a voice decoder 5, and partial voice search means 6.

Search target voice data and query voice data are input to the voice encoder 2 as input voice data. The voice encoder 2 equalizes the pitch period of the voiced sound in the input voice data and converts the data into time-series data of feature quantities (feature data). In doing so, the pitch period information of the input voice data is separated from the feature data, encoded, and output as encoded pitch data. The feature data, on the other hand, is output as a subband waveform. The voice encoder 2 also encodes the feature data and outputs it as encoded feature data. In addition, the voice encoder 2 performs phoneme labeling on the input voice data and outputs phoneme label data consisting of the phoneme label and time interval of each phoneme.

The voice storage means 3 stores the search target voice data that the voice encoder 2 has decomposed and encoded into encoded feature data, encoded pitch data, and phoneme label data. The encoded feature data and encoded pitch data stored in the voice storage means 3 constitute the encoded search target feature data.

The data reading means 4 reads partial data of the encoded search target voice data (encoded feature data, encoded pitch data, and phoneme label data) from the voice storage means 3 in accordance with a data selection signal.

The voice decoder 5 decodes the encoded feature data and encoded pitch data read by the data reading means 4 and outputs them as feature data or as output voice data.

The partial voice search means 6 searches the encoded search target voice data stored in the voice storage means 3 for partial data that matches or is similar to the query voice data.

FIG. 2 is a block diagram showing the configuration of the voice encoder 2 of FIG. 1. The voice encoder 2 comprises pitch period equalizing means 10, feature data generation means 11, output switching means 12a and 12b, a quantizer 13, a pitch-equalized waveform encoder 14, a differential bit calculator 15, a pitch information encoder 16, and phoneme labeling processing means 17.

The pitch period equalizing means 10 equalizes the pitch period of the voiced sound of the input voice data x_in(t). The input voice data whose pitch period has been equalized (hereinafter referred to as "pitch-equalized voice data") x_out(t) is output from the output terminal Out_1.

The feature data generation means 11 converts the pitch-equalized voice data x_out(t) output from the output terminal Out_1 into time-series data of feature quantities. In this embodiment, the short-time frequency spectrum is used as the feature quantity.

The feature data generation means 11 consists of a resampler 18 and an analyzer 19 (a Modified Discrete Cosine Transformer: MDCT).

The resampler 18 resamples each pitch section of the pitch-equalized voice data x_out(t) output from the output terminal Out_1 of the pitch period equalizing means 10 so that every section has the same number of samples, and outputs the result as fully equalized voice data x_eq(t).
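The resampling of each pitch section to a common sample count can be illustrated with plain linear interpolation. This is a simplified sketch: the function names, the representation of pitch sections by boundary sample indices, and the endpoint-inclusive interpolation are all assumptions, and a practical resampler would normally use a band-limited interpolator.

```python
def resample_section(section, n_out):
    """Linearly interpolate one pitch section to exactly n_out samples."""
    n_in = len(section)
    if n_in == 1:
        return [section[0]] * n_out
    out = []
    for k in range(n_out):
        # Map output index k onto the input index range [0, n_in - 1].
        pos = k * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        i = min(int(pos), n_in - 2)
        frac = pos - i
        out.append(section[i] * (1 - frac) + section[i + 1] * frac)
    return out

def equalize_sections(signal, boundaries, n_out):
    """boundaries lists the start index of each pitch section plus the end index."""
    return [resample_section(signal[a:b], n_out)
            for a, b in zip(boundaries, boundaries[1:])]

# Two pitch sections of 5 and 3 samples, both resampled to 3 samples.
signal = [0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 2.0, 4.0]
sections = equalize_sections(signal, [0, 5, 8], n_out=3)
print(sections)
```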

The analyzer 19 applies a modified discrete cosine transform to the fully equalized voice data x_eq(t) over a fixed number of pitch sections and generates a short-time frequency spectrum (hereinafter referred to as "feature data") X(f). That is, in this embodiment the feature data is given as a time series of vector quantities consisting of short-time frequency spectra (Equation (4)).

Here, t is time, and X_fi(t) (i = 1, 2, …, n) is the short-time spectral value of the subband of frequency f_i at time t.
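As a rough illustration of how such a time series of subband vectors could be produced, the sketch below evaluates the standard MDCT definition directly. The function names, the block length, and the hop size are assumptions; no analysis window is applied and the O(N^2) evaluation is chosen for readability, not speed.

```python
import math

def mdct(block):
    """Direct MDCT of a block of 2N samples, returning N coefficients:
    X[k] = sum_n x[n] * cos(pi/N * (n + 1/2 + N/2) * (k + 1/2))."""
    two_n = len(block)
    n_half = two_n // 2
    return [
        sum(block[n] * math.cos(math.pi / n_half
                                * (n + 0.5 + n_half / 2) * (k + 0.5))
            for n in range(two_n))
        for k in range(n_half)
    ]

def feature_series(x_eq, block_len, hop):
    """Time series of MDCT vectors taken over overlapping blocks of x_eq."""
    return [mdct(x_eq[s:s + block_len])
            for s in range(0, len(x_eq) - block_len + 1, hop)]

# 16 samples, blocks of 8 samples with 50% overlap -> 3 frames of 4 subbands.
x_eq = [math.sin(2 * math.pi * n / 8) for n in range(16)]
frames = feature_series(x_eq, block_len=8, hop=4)
print(len(frames), len(frames[0]))
```

Because the resampler gives every pitch section the same number of samples, choosing `block_len` as a whole number of pitch sections keeps each block aligned with the pitch.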

The output switching means 12a switches the output destination of the feature data X(f) generated by the analyzer 19 between the partial voice search means 6 and the voice storage means 3 in accordance with a switching signal input from the partial voice search means 6. Specifically, when search target voice data is input as the input voice data, the feature data X(f) is routed to the voice storage means 3. When query voice data is input as the input voice data, the feature data X(f) is routed to the partial voice search means 6.

The quantizer 13 quantizes the feature data X(f) according to a predetermined quantization curve. The pitch-equalized waveform encoder 14 encodes the feature data X(f) output by the quantizer 13 and outputs it as encoded feature data. An entropy coding method such as Huffman coding or arithmetic coding is used for this encoding.

The differential bit calculator 15 subtracts the target number of bits from the code amount of the encoded feature data output by the pitch-equalized waveform encoder 14 and outputs the difference (hereinafter referred to as the "differential bit count"). The quantizer 13 translates the quantization curve according to this differential bit count so that the code amount of the encoded feature data falls within the target number of bits.
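The rate-control loop formed by the quantizer, the encoder, and the differential bit calculator can be sketched as below. Several simplifications are assumed: the "translation of the quantization curve" is modeled as scaling a uniform quantization step, the code amount is estimated from the symbol entropy instead of running a real Huffman or arithmetic coder, and all names and constants are hypothetical.

```python
import math
from collections import Counter

def quantize(values, step):
    """Uniform quantization with the given step size."""
    return [round(v / step) for v in values]

def estimated_bits(symbols):
    """Ideal entropy-coded size in bits (stand-in for a real entropy coder)."""
    counts = Counter(symbols)
    total = len(symbols)
    return sum(-c * math.log2(c / total) for c in counts.values())

def fit_to_budget(values, target_bits, step=0.01, max_iter=50):
    """Coarsen the quantizer until the estimated code amount meets the target."""
    for _ in range(max_iter):
        symbols = quantize(values, step)
        diff_bits = estimated_bits(symbols) - target_bits  # differential bit count
        if diff_bits <= 0:
            break
        step *= 1.25      # coarser quantization lowers the code amount
    return step, symbols, diff_bits

values = [math.sin(0.1 * i) for i in range(64)]
step, symbols, diff_bits = fit_to_budget(values, target_bits=100)
print(step, diff_bits)
```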

The pitch information encoder 16 encodes the residual period signal ΔV_pitch and the reference period signal AV_pitch output by the pitch period equalizing means 10 and outputs them as encoded pitch data. An entropy coding method such as Huffman coding or arithmetic coding is used for this encoding.

The phoneme labeling processing means 17 divides the input voice data into phoneme sections and performs phoneme labeling on each phoneme section. It then outputs phoneme label data consisting of the phoneme labels and time interval information.

The output switching means 12b switches the output destination of the phoneme label data generated by the phoneme labeling processing means 17 between the partial voice search means 6 and the voice storage means 3. Specifically, when search target voice data is input as the input voice data, the phoneme label data is routed to the voice storage means 3. When query voice data is input as the input voice data, the phoneme label data is routed to the partial voice search means 6.

FIG. 3 is a block diagram showing the configuration of the pitch period equalizing means 10 of FIG. 2. The pitch period equalizing means 10 comprises input pitch detection means 21, pitch averaging means 22, a frequency shifter 23, output pitch detection means 24, residual calculation means 25, and a PID controller 26.

The input pitch detection means 21 detects, from the input voice data x_in(t), the fundamental frequency of the pitch contained in that voice signal. Various methods of detecting the fundamental frequency of the pitch have been devised to date; this embodiment shows a representative one. The input pitch detection means 21 comprises pitch detection means 27, a band-pass filter (hereinafter referred to as "BPF") 28, and a frequency counter 29.

The pitch detection means 27 detects the fundamental period of the pitch, T_0 = 1/f_0, from the input voice data x_in(t). Suppose, for example, that the input voice data x_in(t) has the waveform shown in FIG. 4(a). The pitch detection means 27 first applies a short-time Fourier transform to this waveform and derives a spectrum waveform X(f) such as that shown in FIG. 4(b).

Normally, a voice waveform contains many frequency components besides the pitch, and the spectrum waveform obtained here has many additional frequency components besides the fundamental frequency of the pitch and its harmonics. It is therefore generally difficult to extract the fundamental frequency f_0 of the pitch directly from the spectrum waveform X(f). The pitch detection means 27 therefore applies a Fourier transform again to the spectrum waveform X(f). This yields a spectrum waveform with a sharp peak at F_0 = 1/Δf_0, the reciprocal of the spacing Δf_0 of the pitch harmonics contained in X(f) (see FIG. 4(c)). By detecting the position F_0 of this peak, the pitch detection means 27 detects the fundamental frequency of the pitch, f_0 = Δf_0/2 = F_0/2.

The pitch detection means 27 also determines from the spectrum waveform X(f) whether the input voice data x_in(t) is a voiced or an unvoiced sound. For a voiced sound it outputs 0 as the noise flag signal V_noise; for an unvoiced sound it outputs 1 as the noise flag signal V_noise. Voiced and unvoiced sounds are distinguished by detecting the slope of the spectrum waveform X(f). FIG. 5 shows the formant characteristics of the voiced sound "a", and FIG. 6 shows the autocorrelation, cepstrum waveform, and frequency characteristics of the unvoiced sound "su". For a voiced sound, as in FIG. 5, the spectrum waveform X(f) shows formant characteristics that are, overall, large on the low-frequency side and decrease toward the high-frequency side. An unvoiced sound, by contrast, shows frequency characteristics that, overall, increase toward the high-frequency side, as in FIG. 6. Whether the input voice data x_in(t) is voiced or unvoiced can therefore be determined by detecting the overall slope of the spectrum waveform X(f).
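The slope test described above can be sketched as follows. The use of a least-squares line fitted to the magnitude spectrum, the naive DFT, and the names `magnitude_spectrum`, `spectral_slope`, and `noise_flag` are assumptions for illustration; the text only states that the overall slope of X(f) is detected.

```python
import cmath
import math

def magnitude_spectrum(x):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short frames)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_slope(mags):
    """Least-squares slope of magnitude against bin index."""
    n = len(mags)
    mean_k = (n - 1) / 2
    mean_m = sum(mags) / n
    num = sum((k - mean_k) * (m - mean_m) for k, m in enumerate(mags))
    den = sum((k - mean_k) ** 2 for k in range(n))
    return num / den

def noise_flag(x):
    """Return 0 (voiced) for a falling spectrum and 1 (unvoiced) otherwise."""
    return 0 if spectral_slope(magnitude_spectrum(x)) < 0 else 1

N = 64
# Synthetic "voiced" frame: harmonics whose amplitude decays with frequency.
voiced = [sum(math.sin(2 * math.pi * 4 * h * t / N) / h ** 2
              for h in range(1, 6)) for t in range(N)]
# Synthetic "unvoiced" frame: energy that grows toward high frequencies.
unvoiced = [0.2 * math.sin(2 * math.pi * 12 * t / N)
            + 0.6 * math.sin(2 * math.pi * 20 * t / N)
            + 1.0 * math.sin(2 * math.pi * 28 * t / N) for t in range(N)]
print(noise_flag(voiced), noise_flag(unvoiced))
```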

When the input voice data x_in(t) is an unvoiced sound, no pitch exists, so the fundamental frequency f_0 of the pitch output by the pitch detection means 27 is a meaningless value.

The BPF 28 is a narrow-band band-pass filter whose passband can be set externally. The BPF 28 sets the fundamental frequency f_0 of the pitch detected by the pitch detection means 27 as the center frequency of its passband (see FIG. 4(d)). The BPF 28 then filters the input voice data x_in(t) and outputs a nearly sinusoidal waveform at the fundamental frequency f_0 of the pitch (see FIG. 4(e)).

The frequency counter 29 counts the time interval between the zero-cross points of the nearly sinusoidal waveform output by the BPF 28 and thereby outputs the fundamental period of the pitch, T_0 = 1/f_0. The detected fundamental period T_0 of the pitch is output as the output signal of the input pitch detection means 21 (hereinafter referred to as the "fundamental period signal") (see FIG. 4(f)).
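The zero-cross counting performed by the frequency counter can be illustrated on an already band-pass-filtered, nearly sinusoidal signal. The function name, the use of rising crossings only, and the linear interpolation of the crossing instant are assumptions of this sketch.

```python
import math

def period_from_zero_crossings(x, sample_rate):
    """Mean interval between rising zero crossings, in seconds.

    Returns None when fewer than two crossings are found.
    """
    crossings = []
    for i in range(1, len(x)):
        if x[i - 1] < 0.0 <= x[i]:
            # Linearly interpolate the crossing instant between two samples.
            frac = -x[i - 1] / (x[i] - x[i - 1])
            crossings.append((i - 1 + frac) / sample_rate)
    if len(crossings) < 2:
        return None
    return (crossings[-1] - crossings[0]) / (len(crossings) - 1)

fs = 8000                                   # assumed sampling rate in Hz
tone = [math.sin(2 * math.pi * 125 * n / fs + 0.3) for n in range(800)]
t0 = period_from_zero_crossings(tone, fs)   # close to 1/125 s = 8 ms
print(t0)
```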

The pitch averaging means 22 averages the fundamental period signal T_0 of the pitch output by the pitch detection means 27; an ordinary low-pass filter (hereinafter referred to as "LPF") is used for this purpose. The pitch averaging means 22 smooths the fundamental period signal V_pitch so that it becomes almost constant in time within a phoneme. This smoothed fundamental period is used as the reference period T_s (reference frequency f_s = 1/T_s) (see FIG. 4(g)).
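A first-order low-pass filter is one simple way to obtain such a smoothed reference period. The filter order, the coefficient `alpha`, and the function name are assumptions; the text only specifies an ordinary LPF.

```python
def smooth_period(periods, alpha=0.1):
    """One-pole low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    y = periods[0]          # start from the first measurement
    out = []
    for p in periods:
        y += alpha * (p - y)
        out.append(y)
    return out

# Measured fundamental periods (seconds) jittering around 8 ms.
raw = [0.0080, 0.0082, 0.0079, 0.0081, 0.0080, 0.0078, 0.0082, 0.0080]
ts = smooth_period(raw)
print(ts)
```

A smaller `alpha` gives a steadier reference period at the cost of a slower response when the speaker's pitch really changes.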

The frequency shifter 23 equalizes the pitch period of the voice signal by shifting the pitch frequency of the input voice data x_in(t) in the direction that brings it closer to the reference frequency f_s.

The output pitch detection means 24 detects, from the output voice data output by the frequency shifter 23 (hereinafter referred to as "pitch-equalized voice data") x_out(t), the fundamental period T_0' of the pitch contained in the pitch-equalized voice data x_out(t). The output pitch detection means 24 can basically have the same configuration as the input pitch detection means 21. In this embodiment, the output pitch detection means 24 comprises a BPF 31 and a frequency counter 32.

The BPF 31 is a narrow-band BPF whose passband can be set externally. The BPF 31 sets the fundamental frequency f_0 of the pitch detected by the pitch detection means 27 as the center frequency of its passband. The BPF 31 then filters the pitch-equalized voice data x_out(t) and outputs a nearly sinusoidal waveform at the fundamental frequency f_0' of the pitch. The frequency counter 32 counts the time interval between the zero-cross points of the nearly sinusoidal waveform output by the BPF 31 and thereby outputs the fundamental period of the pitch, T_0' = 1/f_0'. The detected fundamental period T_0' of the pitch is output as the output signal of the output pitch detection means 24.

The residual calculation means 25 outputs the residual period ΔT_pitch obtained by subtracting the reference period T_s output by the pitch averaging means 22 from the fundamental period T_0' output by the output pitch detection means 24. This residual period ΔT_pitch is input to the frequency shifter 23 via the PID controller 26. The frequency shifter 23 shifts the pitch frequency of the input voice data in the direction that brings it closer to the reference frequency f_s, in proportion to the residual frequency 1/ΔT_pitch.

The PID controller 26 consists of an amplifier 34 and a resistor 20 connected in series, and a capacitor 36 connected in parallel with the amplifier 34. The PID controller 26 serves to prevent oscillation of the feedback loop formed by the frequency shifter 23, the output pitch detection means 24, and the residual calculation means 25.

Although the PID controller 26 is drawn as an analog circuit in FIG. 3, it may also be implemented as a digital circuit.
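A digital version of the feedback loop can be sketched with a toy model in which the shifter simply adds a frequency offset to the pitch. This is a strong simplification: the controller below uses integral action only rather than a full PID law, and the gain, the step count, and the function name `equalize_pitch` are assumptions chosen so that the loop converges for the example values.

```python
def equalize_pitch(f_in, t_ref, gain=5000.0, steps=60):
    """Iterate shifter -> output pitch -> residual period -> controller.

    f_in  : pitch frequency of the input in Hz
    t_ref : reference period T_s in seconds
    Returns the output pitch frequency observed at each step.
    """
    shift = 0.0
    history = []
    for _ in range(steps):
        f_out = f_in + shift              # frequency shifter
        residual = 1.0 / f_out - t_ref    # residual period T_0' - T_s
        shift += gain * residual          # integral action only (assumed)
        history.append(f_out)
    return history

# Input pitch 110 Hz, reference period 10 ms (reference frequency 100 Hz).
history = equalize_pitch(f_in=110.0, t_ref=0.01)
print(history[0], history[-1])
```

With these values the frequency error roughly halves on every iteration; a much larger gain would make this toy loop overshoot or oscillate, which is the behavior the controller is meant to prevent.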

FIG. 7 is a diagram showing the internal configuration of the frequency shifter 23. The frequency shifter 23 comprises an oscillator 41, a modulator 42, a BPF 43, a voltage controlled oscillator (hereinafter referred to as "VCO") 44, and a demodulator 45.

The oscillator 41 outputs a modulation carrier signal C_1 of constant frequency that is used to frequency-modulate the input voice data x_in(t). The band of a voice signal is normally about 8 kHz (see FIG. 7(i)). A frequency of about 20 kHz is therefore normally used for the modulation carrier signal C_1 generated by the oscillator 41 (hereinafter referred to as the "modulation carrier frequency").

The modulator 42 frequency-modulates the modulation carrier signal C_1 output by the oscillator 41 with the input voice data x_in(t) and generates a modulated signal. The modulated signal is centered on the modulation carrier frequency and has sidebands (an upper sideband and a lower sideband) on both sides, each with the same bandwidth as the band of the voice signal (see FIG. 7(ii)).

The BPF 43 is a band-pass filter whose lower cutoff frequency is the modulation carrier frequency and whose passband is wider than the bandwidth of the input voice data. The modulated signal output from the BPF 43 is therefore a signal from which only the upper sideband has been extracted (see FIG. 7(iii)).

The VCO 44 outputs a signal (hereinafter referred to as the "demodulation carrier signal") obtained by frequency-modulating a signal of the same frequency as the modulation carrier signal C_1 output by the oscillator 41 with the signal ΔV_pitch of the residual period ΔT_pitch (hereinafter referred to as the "residual period signal") input from the residual calculation means 25 via the PID controller 26.

The demodulator 45 demodulates the upper-sideband-only modulated signal output by the BPF 43 with the demodulation carrier signal output by the VCO 44 and restores the voice signal (see FIG. 7(iv)). At this time, the demodulation carrier signal is modulated by the residual period signal. As a result, when the modulated signal is demodulated, the deviation of the pitch frequency of the input voice data x_in(t) from the reference frequency f_s is cancelled. That is, the pitch period of the input voice data x_in(t) is equalized to the reference period T_s.
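The arithmetic behind this modulate, filter, and demodulate chain can be worked through for a single spectral component of frequency f. The derivation below is a sketch under assumptions not stated in the text: the modulator and demodulator are treated as ideal multipliers (mixers), f_c denotes the modulation carrier frequency, Δf denotes the frequency offset that the residual period signal imposes on the VCO, and constant amplitude factors after filtering are omitted.

```latex
\begin{aligned}
\text{modulator 42:}\quad
  \cos(2\pi f t)\,\cos(2\pi f_c t)
  &= \tfrac12\bigl[\cos 2\pi (f_c + f)t + \cos 2\pi (f_c - f)t\bigr] \\
\text{BPF 43 (upper sideband only):}\quad
  &\phantom{=}\ \cos 2\pi (f_c + f)t \\
\text{demodulator 45:}\quad
  \cos 2\pi (f_c + f)t\,\cos 2\pi (f_c + \Delta f)t
  &= \tfrac12\bigl[\cos 2\pi (f - \Delta f)t + \cos 2\pi (2 f_c + f + \Delta f)t\bigr]
\end{aligned}
```

After the component near 2f_c is removed, what remains lies at f − Δf, so every component of the voice signal is shifted by the same amount. As an illustrative case with made-up numbers, a pitch of 110 Hz and a reference frequency of 100 Hz would call for Δf = 10 Hz, which moves the pitch fundamental to 100 Hz.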

図８は、周波数シフタ２３の内部構成の他の例を表す図である。図８においては、図７の発信器４１とＶＣＯ４４とを入れ替えた構成とされている。この構成によっても、図７の場合と同様に、入力音声データｘ_in（ｔ）のピッチ周期は、基準周期Ｔ_ｓに等化することができる。FIG. 8 is a diagram illustrating another example of the internal configuration of the frequency shifter 23. In FIG. 8, the transmitter 41 and the VCO 44 in FIG. 7 are replaced. Also with this configuration, the pitch period of the input audio data x _in (t) can be equalized to the reference period T _{s as} in the case of FIG.

FIG. 9 is a block diagram showing the configuration of the speech decoder 5 of FIG. 1. The speech decoder 5 decodes the voice signal encoded by the speech encoder 2. It includes a pitch-equalized waveform decoder 51, an inverse quantizer 52, a synthesizer 53, a pitch information decoder 54, a pitch frequency detection means 55, a difference unit 56, an adder 57, a frequency shifter 58, and an output switching means 59.

Encoded feature data and encoded pitch data are input to the speech decoder 5. The encoded feature data is the data output from the pitch-equalized waveform encoder 14 of FIG. 2, and the encoded pitch data is the data output from the pitch information encoder 16 of FIG. 2.

The pitch-equalized waveform decoder 51 decodes the encoded feature data and restores the quantized feature data of each subband (hereinafter the "quantized feature data"). The inverse quantizer 52 inversely quantizes the quantized feature data and restores the feature data of the n subbands, X(f) = {X(f_1), X(f_2), ..., X(f_n)}.

The synthesizer 53 applies an inverse modified discrete cosine transform (hereinafter "IMDCT") to the feature data X(f) and generates time-series data for one pitch section (hereinafter the "equalized voice signal") x_eq(t). The pitch frequency detection means 55 detects the pitch frequency of the equalized voice signal x_eq(t) and outputs it as the equalized pitch frequency signal V_eq.

Meanwhile, the pitch information decoder 54 decodes the encoded pitch data to restore the reference frequency signal AV_pitch and the residual frequency signal ΔV_pitch. The difference unit 56 subtracts the equalized pitch frequency signal V_eq from the reference frequency signal AV_pitch and outputs the difference as the reference frequency change signal ΔAV_pitch. The adder 57 adds the residual frequency signal ΔV_pitch and the reference frequency change signal ΔAV_pitch and outputs the sum as the modified residual frequency signal ΔV_pitch″.

The frequency shifter 58 has the same configuration as the frequency shifter 23 shown in FIG. 7 or FIG. 8. Here, the equalized voice signal x_eq(t) is input to the input terminal In, and the modified residual frequency signal ΔV_pitch″ is input to the VCO 44. The VCO 44 outputs a signal (hereinafter the "demodulation carrier signal") obtained by frequency-modulating a signal of the same carrier frequency as the modulation carrier signal C_1 output from the oscillator 41 with the modified residual frequency signal ΔV_pitch″ input from the adder 57. In this case, the frequency of the demodulation carrier signal is the carrier frequency plus the residual frequency.

As a result, the frequency shifter 58 adds the fluctuation component to the pitch period of each pitch section of the equalized voice signal x_eq(t), and the voice signal x_res(t) is restored.
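The arithmetic performed by the difference unit 56, the adder 57, and the VCO 44 in the decoder is simple enough to write out. The sketch below treats each signal as a frequency value in hertz, which is an assumption (the specification describes them only as signals), and the function names are illustrative rather than taken from the text.

```python
def modified_residual(av_pitch, dv_pitch, v_eq):
    """Return the modified residual frequency from the decoded pitch signals."""
    # Difference unit 56: reference frequency minus detected equalized pitch.
    d_av_pitch = av_pitch - v_eq
    # Adder 57: residual frequency plus the reference frequency change.
    return dv_pitch + d_av_pitch


def demodulation_carrier_hz(carrier_hz, residual_hz):
    """Frequency of the demodulation carrier: carrier plus residual."""
    return carrier_hz + residual_hz
```

For instance, with AV_pitch = 120, ΔV_pitch = 5, and V_eq = 118, the reference frequency change is 2 and the modified residual is 7.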

The output switching means 59 switches the destination of the feature data X(f) generated by the inverse quantizer 52 between the synthesizer 53 and the partial voice search means 6, according to the switching signal input from the partial voice search means 6. Specifically, during a partial voice search operation the feature data X(f) is routed to the partial voice search means 6, and when the search target voice data is to be output externally it is routed to the synthesizer 53.

FIG. 10 is a block diagram showing the configuration of the partial voice search means 6 of FIG. 1. The partial voice search means 6 includes an operation switching means 61, a partial voice selection means 62, section division means 63 and 64, a feature quantity measure calculation means 65, a phoneme string measure calculation means 66, a total measure calculation means 67, and a matching position determination means 68.

The operation switching means 61 outputs a switching signal that switches the operation of the voice search device 1 between the input/output of search target voice data to and from the voice storage means 3 and the partial voice search performed by the partial voice search means 6.

The partial voice selection means 62 outputs a data selection signal for selecting partial voice data from the search target feature data (more precisely, the encoded search target feature data) stored in the voice storage means 3. The data selection signal is input to the data reading means 4, which selects and reads the search target feature data stored in the voice storage means 3 accordingly.

The section division means 63 divides the feature data (subband waveforms) of the query voice, input from the analyzer 19 of the speech encoder 2, into phoneme sections according to the time-interval information in the phoneme label data of the query voice input from the phoneme labeling processing means 17. It then averages the feature data within each phoneme section and outputs the averages as time-series data to the feature quantity measure calculation means 65.

The section division means 64 divides the feature data (subband waveforms) of the search target voice, input from the inverse quantizer 52 of the speech decoder 5, into phoneme sections according to the time-interval information in the phoneme label data of the search target voice input from the data reading means 4. It then averages the feature data within each phoneme section and outputs the averages as time-series data to the feature quantity measure calculation means 65.
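The per-phoneme averaging performed by the section division means can be sketched as follows. The data layout is an assumption made for illustration: each frame is a list of subband values with one timestamp, and each phoneme section is a half-open interval (start, end). The function name is hypothetical.

```python
def average_by_phoneme(frames, frame_times, phoneme_sections):
    """Average subband feature vectors over each phoneme section.

    frames: list of feature vectors, one per subframe (equal length).
    frame_times: timestamp of each frame, same length as frames.
    phoneme_sections: list of (start, end) intervals, end exclusive.
    Returns one averaged vector per section (None for an empty section).
    """
    averaged = []
    for start, end in phoneme_sections:
        members = [f for f, t in zip(frames, frame_times) if start <= t < end]
        if not members:
            averaged.append(None)
            continue
        n_bands = len(members[0])
        averaged.append([sum(f[i] for f in members) / len(members)
                         for i in range(n_bands)])
    return averaged
```

The result is one averaged vector per phoneme section, which is the form of time series the distance calculation described next operates on.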

The feature quantity measure calculation means 65 calculates the distance measure D_1(X_q, X_o) between the feature data input from the section division means 63 and 64. Here, the distance measure is expressed as a linear sum of the correlation coefficients of the subband waveforms that make up the feature data.
That is, the feature data of the query voice is denoted X_q(f) and the feature data of the search target voice is denoted X_o(f), and these are expressed by equations (5) and (6), respectively.

The correlation coefficient of each subband element of the feature data X_q(f) and X_o(f) is given by equation (7). Here, t_j denotes the j-th phoneme section, X_{q,fi}(t_j) is the time average of the feature data X_{q,fi}(t) over the j-th phoneme section, and X_{o,fi}(t_j) is the time average of the feature data X_{o,fi}(t) over the j-th phoneme section.

In the first embodiment, the distance measure D_1(X_q, X_o) between the feature data is defined by equation (10).

Here, w_i is a weighting coefficient, which is set as appropriate.

The phoneme string measure calculation means 66 receives the phoneme label data of the query voice from the phoneme labeling processing means 17 of the speech encoder 2 and the phoneme label data of the search target voice from the data reading means 4. It calculates the distance measure D_2 between these phoneme label data using a predetermined inter-phoneme distance table, that is, a table listing the distance measure between the two phonemes for every pair of phonemes.
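A table lookup of this kind might look like the sketch below. The specification does not say how the per-phoneme distances are combined into D_2, so the plain sum used here is an assumption, as are the table contents, the default distance for pairs missing from the table, and the function name.

```python
def phoneme_sequence_distance(query, target, table, default=1.0):
    """Sum inter-phoneme distances position by position (assumed combination).

    table maps an unordered phoneme pair, stored as a tuple in either order,
    to a distance. Identical phonemes contribute 0.
    """
    if len(query) != len(target):
        raise ValueError("sequences must have the same number of phonemes")
    total = 0.0
    for q, o in zip(query, target):
        if q == o:
            continue
        key = (q, o) if (q, o) in table else (o, q)
        total += table.get(key, default)
    return total
```

For example, with a hypothetical table `{("a", "o"): 0.5, ("k", "g"): 0.25}`, the sequences `["k", "a"]` and `["g", "o"]` are at distance 0.75.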

The total measure calculation means 67 calculates the total distance measure D as the linear sum of the distance measure D_1(X_q, X_o) between the feature data, calculated by the feature quantity measure calculation means 65, and the distance measure D_2 between the phoneme label data, calculated by the phoneme string measure calculation means 66. That is, the total distance measure D is expressed by equation (11).

D = W_1 D_1(X_q, X_o) + W_2 D_2    (11)

Here, W_1 and W_2 are weighting coefficients, which are determined as appropriate.

The matching position determination means 68 determines whether the distance measure D is less than or equal to a predetermined threshold D_th and, when D ≤ D_th, outputs a data selection signal that selects the corresponding partial data.
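The weighted combination and the threshold test can be written in a few lines. This is a minimal sketch: the default weights of 1.0 and the function names are assumptions, since the specification only says that W_1, W_2, and D_th are chosen as appropriate.

```python
def total_distance(d1, d2, w1=1.0, w2=1.0):
    """Linear sum of the feature distance D_1 and the phoneme distance D_2."""
    return w1 * d1 + w2 * d2


def is_match(d, d_th):
    """A candidate matches when its total distance does not exceed D_th."""
    return d <= d_th
```

Because the comparison is inclusive, a candidate whose distance equals the threshold is still selected.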

The operation of the voice search device 1 of this embodiment, configured as described above, is explained below.

[1] Storage operation for search target voice data
First, the operation for storing search target voice data in the voice storage means 3 is described. In this case, the operation switching means 61 of the partial voice search means 6 outputs, as the switching signal, the level that represents the input/output operation for search target voice data (for example, H level). As a result, the output switching means 12a of the speech encoder 2 outputs the feature data X(f) generated by the analyzer 19 to the quantizer 13, and the output switching means 12b of the speech encoder 2 outputs the phoneme label data generated by the phoneme labeling processing means 17 to the voice storage means 3. The output switching means 59 of the speech decoder 5 outputs the feature data X(f) generated by the inverse quantizer 52 to the synthesizer 53.

When input voice data x_in(t) is input to the speech encoder 2 as search target voice data, the input pitch detection means 21 of the pitch period equalizing means 10 determines whether x_in(t) is voiced or unvoiced and outputs the noise flag signal V_noise to the output terminal Out_4. It also detects the pitch frequency from x_in(t) and outputs the fundamental frequency signal V_pitch to the pitch averaging means 22. The pitch averaging means 22 averages the fundamental frequency signal V_pitch (a weighted average, since an LPF is used) and outputs the result as the reference frequency signal AV_pitch. The reference frequency signal AV_pitch is output from the output terminal Out_3 and is also input to the residual calculation means 25.
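Averaging with a low-pass filter amounts to a weighted average in which recent values count more. The sketch below uses a one-pole (exponential) filter as a stand-in; the filter type, the coefficient `alpha`, and the function name are assumptions, since the text does not specify the LPF.

```python
def pitch_average(v_pitch, alpha=0.25):
    """One-pole low-pass filter over a sequence of pitch frequencies.

    Each output is a weighted average of the inputs seen so far, with
    weights that decay geometrically into the past.
    """
    if not v_pitch:
        return []
    avg = v_pitch[0]  # initialize on the first sample to avoid a start-up ramp
    out = []
    for v in v_pitch:
        avg += alpha * (v - avg)
        out.append(avg)
    return out
```

A larger `alpha` makes the reference frequency follow the detected pitch more quickly; a smaller one makes it steadier.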

Meanwhile, the frequency shifter 23 shifts the frequency of the input voice data x_in(t) and outputs the result to the output terminal Out_1 as pitch-equalized voice data x_out(t). In the initial state the residual frequency signal ΔV_pitch is 0 (reset state), so the frequency shifter 23 passes the input voice data x_in(t) unchanged to the output terminal Out_1 as the pitch-equalized voice data x_out(t).

Next, the output pitch detection means 24 detects the pitch frequency f_0′ of the output voice data from the frequency shifter 23. The detected pitch frequency f_0′ is input to the residual calculation means 25 as the pitch frequency signal V_pitch′.

The residual calculation means 25 generates the residual frequency signal ΔV_pitch by subtracting the reference frequency signal AV_pitch from the pitch frequency signal V_pitch′. The residual frequency signal ΔV_pitch is output to the output terminal Out_2 and is also input to the frequency shifter 23 via the PID controller 26.

The frequency shifter 23 sets its frequency shift amount in proportion to the residual frequency signal ΔV_pitch input via the PID controller 26. If ΔV_pitch is positive, the shift amount is set so as to lower the frequency by an amount proportional to ΔV_pitch; if ΔV_pitch is negative, the shift amount is set so as to raise the frequency by an amount proportional to ΔV_pitch.

Through this feedback control, the pitch period of the input voice data x_in(t) is held at the reference period 1/f_s at all times, and the pitch period of the pitch-equalized voice data x_out(t) is equalized.
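The behaviour of this loop can be illustrated with a toy simulation that tracks only the pitch frequency, not the waveform. It is a sketch under stated assumptions: the PID controller 26 is replaced by integral action alone, the gain and step count are arbitrary, and the input pitch and reference are held constant.

```python
def equalize_pitch(input_pitch_hz, reference_hz, gain=0.5, steps=40):
    """Simulate the feedback loop; returns the output pitch at each step."""
    shift = 0.0  # reset state: no frequency shift at the start
    history = []
    for _ in range(steps):
        output_pitch = input_pitch_hz - shift   # role of frequency shifter 23
        residual = output_pitch - reference_hz  # role of residual calculation means 25
        shift += gain * residual                # integral action (stand-in for PID 26)
        history.append(output_pitch)
    return history
```

With this update rule the residual shrinks by a factor of (1 - gain) per step, so an input at 120 Hz with a 100 Hz reference starts at 120 Hz and settles on 100 Hz.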

In this way, the pitch period equalizing means 10 separates the information contained in the input voice data x_in(t) into:
(a) information indicating whether the sound is voiced or unvoiced;
(b) information representing the voice waveform of one pitch section;
(c) information on the reference pitch frequency; and
(d) residual frequency information representing the deviation of the pitch frequency of each pitch section from the reference pitch frequency.
The information (a) to (d) is output, respectively, as the noise flag signal V_noise, the pitch-equalized voice data x_out(t) whose pitch period has been equalized to the reference period 1/f_s (the reciprocal of the weighted average of the past pitch frequencies of the input voice data), the reference frequency signal AV_pitch, and the residual frequency signal ΔV_pitch. The noise flag signal V_noise is output from the output terminal Out_4, the pitch-equalized voice data x_out(t) from Out_1, the reference frequency signal AV_pitch from Out_3, and the residual frequency signal ΔV_pitch from Out_2.

The pitch-equalized voice data x_out(t) is a voice signal from which the jitter and variation components of the pitch frequency, which change with gender, individual speaker, phoneme, emotion, and conversation content, have been removed; it is a flat, mechanical voice signal without intonation. Consequently, the pitch-equalized voice data x_out(t) of the same voiced sound has almost the same waveform regardless of gender, individual speaker, phoneme, emotion, or conversation content, and voiced sounds can be matched accurately by comparing pitch-equalized voice data x_out(t).

Furthermore, because the pitch period of the pitch-equalized voice data x_out(t) of a voiced sound is equalized to the reference period 1/f_s, performing subband coding over a fixed number of pitch sections concentrates the frequency spectrum X_out(f) of x_out(t) into the subband components at the harmonics of the reference frequency. Since voice has a high waveform correlation between pitch sections, the spectral intensity of each subband component changes only slowly over time. Highly efficient coding is therefore possible by encoding each subband component and omitting the other noise components. In addition, the reference frequency signal AV_pitch and the residual frequency signal ΔV_pitch vary only within a narrow range inside a single phoneme, owing to the nature of voice, so they too can be encoded with high efficiency. As a whole, the voiced component of the input voice data x_in(t) can thus be encoded with high efficiency.

Next, in each pitch section, the resampler 18 calculates the resampling period by dividing the reference frequency signal AV_pitch by a fixed resampling number n. It then resamples the pitch-equalized voice data x_out(t) at that resampling period and outputs the result as equal-sample-count voice data x_eq(t). The number of samples in one pitch section of the pitch-equalized voice data x_out(t) thereby becomes a constant value.
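Resampling one pitch section to a fixed number of samples can be sketched as below. Several points are assumptions: the resampling period is taken to be the reference period 1/f_s divided by n, linear interpolation is used (the specification does not name an interpolation method), and the section is treated as periodic so that the last interval wraps to the first sample.

```python
def resampling_period(reference_hz, n):
    """Resampling period in seconds: reference period (1/f_s) divided by n."""
    return 1.0 / (reference_hz * n)


def resample_pitch_section(samples, n):
    """Linearly interpolate one pitch section onto n equally spaced points."""
    m = len(samples)
    out = []
    for k in range(n):
        pos = k * m / n               # fractional index into the original section
        i = int(pos)
        frac = pos - i
        nxt = samples[(i + 1) % m]    # wrap around: section assumed periodic
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out
```

Whatever the original sample count of a section, the output always has exactly n samples, which is what lets the following transform use a fixed block size.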

Next, the analyzer 19 divides the equal-sample-count voice data x_eq(t) into subframes, each spanning a fixed number of pitch sections, and generates the frequency spectrum signal X(f) by applying a modified discrete cosine transform to each subframe.

Here, the length of one subframe is an integer multiple of one pitch period. In this embodiment, the subframe length is one pitch period (n samples). Accordingly, n frequency spectrum signals {X(f_1), X(f_2), ..., X(f_n)} are output, where the frequency f_1 is the first harmonic of the reference frequency, f_2 is the second harmonic, and f_n is the n-th harmonic.

By dividing the signal into subframes whose length is an integer multiple of one pitch period and orthogonally transforming each subframe for subband coding, the frequency spectrum signal of the voice waveform data is concentrated into the spectrum of the harmonics of the reference frequency. Moreover, owing to the nature of voice, the waveforms of consecutive pitch sections within the same phoneme are similar, so the spectra of the harmonic components of the reference frequency are similar between adjacent subframes, which raises the coding efficiency.
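For readers unfamiliar with the modified discrete cosine transform, the following sketch shows the textbook MDCT/IMDCT pair with a sine window and 50% overlap. It is not the analyzer 19 or the synthesizer 53 as specified: the window choice, the 2/n normalization, the hop of n samples, and the function names are assumptions. Each transform consumes 2n samples (two pitch periods when n is the sample count per period) and yields n coefficients.

```python
import math


def mdct(block):
    """MDCT of 2n (already windowed) samples, giving n coefficients."""
    n = len(block) // 2
    return [sum(block[m] * math.cos(math.pi / n * (m + 0.5 + n / 2) * (k + 0.5))
                for m in range(2 * n))
            for k in range(n)]


def imdct(coeffs):
    """Inverse MDCT: n coefficients to 2n time-aliased samples (2/n scaling)."""
    n = len(coeffs)
    return [2.0 / n * sum(coeffs[k] * math.cos(math.pi / n * (m + 0.5 + n / 2) * (k + 0.5))
                          for k in range(n))
            for m in range(2 * n)]


def sine_window(n):
    """Sine window of length 2n; satisfies w[m]^2 + w[m+n]^2 = 1."""
    return [math.sin(math.pi * (m + 0.5) / (2 * n)) for m in range(2 * n)]


def analyze(signal, n):
    """Windowed MDCT of overlapping 2n-sample blocks taken every n samples."""
    w = sine_window(n)
    return [mdct([signal[s + m] * w[m] for m in range(2 * n)])
            for s in range(0, len(signal) - 2 * n + 1, n)]


def synthesize(frames, n):
    """Windowed IMDCT with overlap-add; aliasing cancels between neighbours."""
    w = sine_window(n)
    out = [0.0] * (n * (len(frames) + 1))
    for i, coeffs in enumerate(frames):
        block = imdct(coeffs)
        for m in range(2 * n):
            out[i * n + m] += block[m] * w[m]
    return out
```

Only samples covered by two overlapping blocks are reconstructed exactly; the first and last n samples of the signal are covered by a single block and remain aliased.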

Next, the quantizer 13 quantizes the frequency spectrum signal X(f). The quantizer 13 refers to the noise flag signal V_noise and switches the quantization curve depending on whether V_noise is 0 (voiced) or 1 (unvoiced).

When the noise flag signal V_noise is 0 (voiced), the quantization curve is one in which the number of quantization bits decreases as the frequency increases, as shown in FIG. 2(a). This reflects the frequency characteristic of voiced sound, which is large in the low-frequency range and decreases toward higher frequencies, as shown in FIG. 5.

When the noise flag signal V_noise is 1 (unvoiced), on the other hand, the quantization curve is one in which the number of quantization bits increases as the frequency increases, as shown in FIG. 2(b). This reflects the frequency characteristic of unvoiced sound, which increases toward higher frequencies, as shown in FIG. 6.

Switching the quantization curve in this way selects the optimal quantization curve for voiced or unvoiced sound.

As a supplementary note, the number of quantization bits is explained here. As shown in FIGS. 11(a) and 11(b), the data format used by the quantizer 13 consists of a real part (FL), which holds the digits below the radix point, and an exponent part (EXP), which represents a power of 2. When a nonzero number is represented, the exponent part (EXP) is adjusted so that the leading bit of the real part (FL) is always 1.

For example, when the real part (FL) is 4 bits and the exponent part (EXP) is 2 bits, quantization with 4 bits and quantization with 2 bits proceed as follows (see FIGS. 11(c) and 11(d)).

(1) Quantization with 4 bits
(Example 1) X(f) = 8 = [1000]_2 (where []_2 denotes binary notation) gives
FL = [1000]_2, EXP = [100]_2
(Example 2) X(f) = 7 = [0111]_2 gives
FL = [1110]_2, EXP = [011]_2
(Example 3) X(f) = 3 = [0011]_2 gives
FL = [1100]_2, EXP = [010]_2

(2) Quantization with 2 bits
(Example 1) X(f) = 8 = [1000]_2 gives
FL = [1000]_2, EXP = [100]_2
(Example 2) X(f) = 7 = [0111]_2 gives
FL = [1100]_2, EXP = [011]_2
(Example 3) X(f) = 3 = [0011]_2 gives
FL = [1100]_2, EXP = [010]_2

In other words, when quantizing with n bits, the leading n bits of the real part (FL) are kept and the remaining bits are set to 0 (see FIG. 11(d)).
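The FL/EXP examples above can be reproduced with a few integer operations. The sketch below is an illustration only: it handles non-negative integers, leaves the width of the exponent unbounded, and uses hypothetical function names. A value is represented as 0.FL × 2^EXP, with FL normalized so that its leading bit is 1.

```python
def encode(value, fl_bits=4, n_bits=4):
    """Return (FL, EXP) for a non-negative integer, keeping n_bits of FL."""
    if not 0 <= n_bits <= fl_bits:
        raise ValueError("n_bits must lie between 0 and fl_bits")
    if value == 0:
        return 0, 0
    exp = value.bit_length()                 # smallest EXP with value < 2**EXP
    fl = (value << fl_bits) >> exp           # top fl_bits bits, leading bit is 1
    mask = ((1 << n_bits) - 1) << (fl_bits - n_bits)
    return fl & mask, exp                    # bits beyond the first n_bits are zeroed


def decode(fl, exp, fl_bits=4):
    """Value represented by (FL, EXP): 0.FL times 2**EXP."""
    return fl * 2 ** exp / 2 ** fl_bits
```

For instance, `encode(7)` gives `(0b1110, 0b011)`, while `encode(7, n_bits=2)` gives `(0b1100, 0b011)`, which decodes to 6.0.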

Next, the pitch-equalized waveform encoder 14 encodes the quantized frequency spectrum signal X(f) output from the quantizer 13 by entropy coding and outputs encoded feature data. It also outputs the code amount (number of bits) of the encoded feature data to the difference bit calculator 15. The difference bit calculator 15 subtracts a predetermined target number of bits from the code amount of the encoded feature data and outputs the difference bit count. The quantizer 13 shifts the quantization curve for voiced sound up or down in parallel according to the difference bit count.

For example, suppose the quantization curve for {f_1, f_2, f_3, f_4, f_5, f_6} is {6, 5, 4, 3, 2, 1}. If a difference bit count of 2 is input, the quantizer 13 shifts the quantization curve downward by 2, giving {4, 3, 2, 1, 0, 0}. If a difference bit count of -2 is input, the quantizer 13 shifts the quantization curve upward by 2, giving {8, 7, 6, 5, 4, 3}.

Shifting the quantization curve for voiced sound up and down in this way adjusts the code amount of the encoded feature data of each subframe to approximately the target number of bits.
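The curve adjustment in the example above reduces to a subtraction clamped at zero. The function names in this sketch are hypothetical; the clamping follows from the worked example, in which entries that would become negative are shown as 0.

```python
def difference_bits(code_bits, target_bits):
    """Role of difference bit calculator 15: code amount minus target."""
    return code_bits - target_bits


def shift_quantization_curve(curve, diff_bits):
    """Shift every entry by -diff_bits, never going below 0 bits."""
    return [max(0, bits - diff_bits) for bits in curve]
```

A positive difference (too many bits produced) lowers the curve and a negative difference raises it, so repeated application steers the code amount toward the target.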

In parallel with this, the pitch information encoder 16 encodes the reference frequency signal AV_pitch and the residual frequency signal ΔV_pitch.

Meanwhile, the phoneme labeling processing means 17 divides the input voice data x_in(t) into phoneme sections and labels each phoneme section. Many techniques for dividing voice into phoneme sections and for phoneme labeling are known in the field of speech recognition, and any of these known methods can be used here. The phoneme labeling processing means 17 outputs, as phoneme label data, the phoneme labels obtained by the labeling together with phoneme section information giving the time interval of each label.

The encoded feature data, encoded pitch data, and phoneme label data generated as described above are output to the voice storage means 3 and stored.

[2] Operation of the speech decoder
When the data reading means 4 reads encoded feature data and encoded pitch data from the voice storage means 3, these data are input to the speech decoder 5.

The pitch-equalized waveform decoder 51 of the speech decoder 5 decodes the encoded feature data and restores the quantized frequency spectrum signal of each subband (hereinafter the "quantized frequency spectrum signal"). The inverse quantizer 52 inversely quantizes the quantized frequency spectrum signal and restores the frequency spectrum signals of the n subbands, X(f) = {X(f_1), X(f_2), ..., X(f_n)}.

The synthesizer 53 applies an inverse modified discrete cosine transform (hereinafter "IMDCT") to the frequency spectrum signal X(f) and generates time-series data for one pitch section (hereinafter the "equalized voice signal") x_eq(t). The pitch frequency detection means 55 detects the pitch frequency of the equalized voice signal x_eq(t) and outputs it as the equalized pitch frequency signal V_eq.

[3] Partial voice data search operation using query voice data
Next, the operation of searching for partial voice data using query voice data is described. In this case, the operation switching means 61 of the partial voice search means 6 outputs, as the switching signal, the level that represents the partial voice search operation (for example, L level). As a result, the output switching means 12a of the speech encoder 2 outputs the feature data X(f) generated by the analyzer 19 to the partial voice search means 6, and the output switching means 12b of the speech encoder 2 outputs the phoneme label data generated by the phoneme labeling processing means 17 to the partial voice search means 6. The output switching means 59 of the speech decoder 5 outputs the feature data X(f) generated by the inverse quantizer 52 to the partial voice search means 6.

First, the query voice data is input to the speech encoder 2 as input voice data x_in(t).

ピッチ周期等化手段１では、上述のように、入力音声データｘ_ｉｎ（ｔ）の有声音のピッチ周期を等化し、ピッチ等化音声データｘ_ｏｕｔ（ｔ）として出力端子Out_１から出力する。また、特徴データ生成手段１９は、上述のように、ピッチ等化音声データｘ_ｏｕｔ（ｔ）を短時間スペクトルの時系列からなる特徴データＸ（ｆ）に変換する。特徴データＸ（ｆ）は、出力切替手段１２ａを介して部分音声検索手段６へ出力される。As described above, the pitch period equalizing means 1 equalizes the pitch period of the voiced sound of the input sound data x _in (t) and outputs the equalized sound data x _out (t) from the output terminal Out_1. In addition, as described above, the feature data generation unit 19 converts the pitch equalized speech data x _out (t) into feature data X (f) that includes a time series of a short-time spectrum. The feature data X (f) is output to the partial speech search means 6 via the output switching means 12a.

Meanwhile, as described above, the phoneme labeling processing means 17 divides the input voice data x_in(t) into phoneme sections and performs phoneme labeling on each phoneme section. It then outputs the phoneme labels and the phoneme section information as phoneme label data.

Next, the partial voice selection means 62 of the partial voice search means 6 outputs a data selection signal for reading the encoded feature data, encoded pitch data, and phoneme label data stored in the voice storage means 3 sequentially from the beginning of the data. The length of each piece of partial data read out is set to the same phoneme length as the query voice data. The data reading means 4 reads the partial data from the voice storage means 3 in accordance with the data selection signal.
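The sequential selection of partial data can be pictured as a window that slides over the stored phoneme sequence. The following is a minimal sketch, assuming the window advances one phoneme at a time (the step size is not stated here) and using a hypothetical function name and toy labels.

```python
from typing import Iterator, List, Tuple


def select_partial_windows(target_labels: List[str],
                           query_len: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start position, window) for each window of query_len phonemes,
    scanning from the beginning of the stored data."""
    for start in range(len(target_labels) - query_len + 1):
        yield start, target_labels[start:start + query_len]


# Hypothetical stored phoneme labels and a query that is 3 phonemes long.
windows = list(select_partial_windows(["k", "a", "s", "a", "n"], 3))
```

If the stored data is shorter than the query, the generator yields nothing, which corresponds to there being no candidate partial data.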

The phoneme label data read by the data reading means 4 is input to the partial voice search means 6.

Meanwhile, the partial data of the encoded feature data and the encoded pitch data read by the data reading means 4 is input to the speech decoder 5. In the speech decoder 5, as described above, the pitch equalized waveform decoder 51 decodes the encoded feature data and the inverse quantizer 52 performs inverse quantization, thereby generating feature data, which is output to the partial voice search means 6.

Hereinafter, the partial data of the search target feature data input from the speech decoder 5 to the partial voice search means 6 is referred to as “selected feature data”.

In the partial voice search means 6, when the feature data of the query voice (hereinafter referred to as “query feature data”) and the phoneme label data are input from the speech encoder 2, the section dividing means 63 averages the query feature data for each phoneme section and converts it into time-series data of average values. To do so, the query feature data is divided into time sections based on the phoneme section information contained in the phoneme label data, and the average value is taken within each time section. The time-series data of average values is input to the feature quantity scale calculation means 65.

Likewise, when the selected feature data and the phoneme label data are input from the speech decoder 5 and the data reading means 4, the section dividing means 64 averages the selected feature data for each phoneme section and converts it into time-series data of average values. This time-series data of average values is input to the feature quantity scale calculation means 65.
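The per-section averaging performed by the section dividing means 63 and 64 can be sketched as follows. This is an illustration only: it assumes each phoneme section is given as a (start, end) pair of frame indices with the end exclusive, and the names are hypothetical.

```python
from typing import List, Tuple

Frame = List[float]  # one feature vector (e.g. one short-time spectrum)


def average_by_phoneme(frames: List[Frame],
                       sections: List[Tuple[int, int]]) -> List[Frame]:
    """Return one averaged feature vector per phoneme section."""
    averaged = []
    for start, end in sections:
        chunk = frames[start:end]
        if not chunk:
            raise ValueError("empty phoneme section")
        dim = len(chunk[0])
        averaged.append([sum(f[k] for f in chunk) / len(chunk)
                         for k in range(dim)])
    return averaged


# Hypothetical data: three frames, the first two belonging to one phoneme.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sections = [(0, 2), (2, 3)]
means = average_by_phoneme(frames, sections)
```

After averaging, the query and the selected partial data each have exactly one vector per phoneme, so two sequences of the same phoneme length can be compared element by element regardless of how long each phoneme was spoken.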

The feature quantity scale calculation means 65 calculates the distance measure D_1(X_q, X_o) between the time-series data of average values input from the section dividing means 63 and the section dividing means 64 in accordance with equation (10).
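Equation (10) is not reproduced in this passage, so the following sketch does not implement it. As a stand-in, it assumes a mean squared Euclidean distance between the two averaged sequences; the function name and the choice of distance are assumptions made purely for illustration.

```python
from typing import List


def feature_distance(x_q: List[List[float]], x_o: List[List[float]]) -> float:
    """Assumed stand-in for D_1(X_q, X_o): squared Euclidean distance
    between corresponding averaged vectors, averaged over phoneme sections."""
    if len(x_q) != len(x_o) or not x_q:
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for q_vec, o_vec in zip(x_q, x_o):
        total += sum((a - b) ** 2 for a, b in zip(q_vec, o_vec))
    return total / len(x_q)


# Hypothetical averaged sequences for a query and a selected segment.
d1 = feature_distance([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 3.0]])
```

Whatever the exact form of equation (10), the value serves the same purpose: a smaller D_1 means the selected feature data lies closer to the query feature data in the feature space.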

Meanwhile, the phoneme string scale calculation means 66 calculates, using an inter-phoneme distance measure table, the distance measure D_2 between the phoneme label data of the query voice input from the speech encoder 2 and the phoneme label data of the search target voice input from the data reading means 4.
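A table-based phoneme string distance might look like the sketch below. The table entries, the default distance for pairs missing from the table, and the position-wise summation are all assumptions for illustration; the actual inter-phoneme distance measure table is not given in this passage.

```python
from typing import Dict, List, Tuple

# Hypothetical inter-phoneme distance table (symmetric, values invented).
PHONEME_TABLE: Dict[Tuple[str, str], float] = {
    ("k", "g"): 0.3,
    ("s", "z"): 0.3,
    ("a", "o"): 0.5,
}


def phoneme_pair_distance(p: str, q: str,
                          table: Dict[Tuple[str, str], float],
                          default: float = 1.0) -> float:
    """Look up the distance between two phonemes; identical phonemes are 0."""
    if p == q:
        return 0.0
    return table.get((p, q), table.get((q, p), default))


def phoneme_string_distance(query: List[str], target: List[str],
                            table: Dict[Tuple[str, str], float]) -> float:
    """Assumed form of D_2: sum of pairwise distances at each position."""
    if len(query) != len(target):
        raise ValueError("phoneme strings must have the same length")
    return sum(phoneme_pair_distance(p, q, table)
               for p, q in zip(query, target))


d2 = phoneme_string_distance(["k", "a"], ["g", "a"], PHONEME_TABLE)
```

Because the partial data is read out with the same phoneme length as the query, a position-wise comparison is sufficient in this sketch and no alignment step is needed.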

The total scale calculation means 67 calculates the total distance measure D in accordance with equation (11), by taking the linear sum of the distance measure D_1(X_q, X_o) between the feature data calculated by the feature quantity scale calculation means 65 and the distance measure D_2 of the phoneme label data calculated by the phoneme string scale calculation means 66.

The matching position determination means 68 determines whether or not the distance measure D is equal to or less than a predetermined threshold D_th and, if D ≦ D_th, outputs a data selection signal for selecting the partial data in question. The operation switching means 61 then outputs, as the switching signal, a level representing the partial voice search operation (for example, the L level).
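The combination of the two measures and the threshold decision can be sketched as follows. Equation (11) is not reproduced in this passage, so the weights w1 and w2 of the linear sum are hypothetical parameters, as are the function names and the sample values.

```python
from typing import List, Tuple


def total_distance(d1: float, d2: float,
                   w1: float = 1.0, w2: float = 1.0) -> float:
    """Linear sum of the feature distance D_1 and the phoneme distance D_2.
    The weights are assumptions; the source only states that a linear sum is taken."""
    return w1 * d1 + w2 * d2


def find_matches(candidates: List[Tuple[int, float, float]], d_th: float,
                 w1: float = 1.0, w2: float = 1.0) -> List[int]:
    """Return the positions whose total distance D satisfies D <= D_th."""
    matches = []
    for position, d1, d2 in candidates:
        if total_distance(d1, d2, w1, w2) <= d_th:
            matches.append(position)
    return matches


# Hypothetical (position, D_1, D_2) triples for three selected segments.
candidates = [(0, 2.0, 1.0), (1, 0.25, 0.25), (2, 0.5, 1.0)]
hits = find_matches(candidates, d_th=1.0)
```

Raising the weight on D_2 makes the decision depend more on the phoneme labels, while raising the weight on D_1 makes it depend more on the spectral features.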

As a result, the partial data found in the search target data is output as output voice data.

This embodiment can also be applied to searching for information in a multimedia database in which audio information and video are recorded together.

The present invention can be used in voice databases, multimedia databases containing voice information, and the like.

Claims

A voice search device that searches partial voice data that matches or is similar to query voice data from search target voice data,
the device comprising partial voice search means for searching, within pitch equalized search target voice data obtained by equalizing the pitch period of the voiced sound of the search target voice data, for partial voice data whose distance measure (or similarity measure), in a voice feature space, with respect to pitch equalized query voice data obtained by equalizing the pitch period of the voiced sound of the query voice data is equal to or less than a predetermined threshold (or equal to or greater than a predetermined threshold).

Pitch period equalizing means for generating the pitch equalized query voice data by equalizing the pitch period of the voiced sound of the query voice data;
Feature data generating means for generating data obtained by converting the pitch equalized query voice data into time-series data of feature quantities (hereinafter referred to as “query feature data”);
With
2. The voice search device according to claim 1, wherein the partial voice search means searches for partial voice data included in the pitch equalized search target voice data whose feature quantity has a distance measure (or similarity measure) with respect to the query feature data that is equal to or less than a predetermined threshold (or equal to or greater than a predetermined threshold).

The partial voice search means
Of the search target feature data obtained by converting the pitch equalization search target speech data into time-series data of feature quantities, partial data for the same phoneme length as the query speech data (hereinafter referred to as “selected feature data”) is used. Partial voice selection means for sequentially selecting while moving the selection position;
Feature quantity scale calculation means for calculating a distance scale (or similarity scale) between each of the selected feature data and the query feature data;
When the distance measure (or similarity measure) is equal to or smaller than a predetermined threshold (or equal to or greater than a predetermined threshold), a matching position determination unit that outputs a position in search target audio data corresponding to the selected feature data;
The voice search device according to claim 1, further comprising:

4. A voice search apparatus according to claim 3, further comprising voice storage means for storing the search target characteristic data.

Second pitch period equalizing means for generating the pitch equalization search target voice data by equalizing the pitch period of the voiced sound of the search target voice data;
Second feature data generating means for generating the search target feature data by converting the pitch equalization search target voice data into time-series data of feature quantities;
The voice search device according to claim 3 or 4, further comprising:

The pitch period equalizing means (or the second pitch period equalizing means)
Pitch detection means for detecting a pitch frequency of the query voice data (or the search target voice data);
Residual calculating means for calculating a difference between the pitch frequency and a predetermined reference frequency;
6. A voice search apparatus according to claim 2, further comprising a frequency shifter for equalizing a pitch frequency of the query voice data (or the search target voice data) so that the difference is minimized.

The search target feature data and the query feature data are respectively time series of subband data obtained by orthogonal transformation of the pitch equalization search target speech data and the pitch equalization query speech data. The voice search device according to any one of claims 1 to 6.

A first section dividing means for averaging the query feature data for each phoneme section and converting it into time-series data of an average value;
A second section dividing means for averaging the search target feature data for each phoneme section and converting it into time-series data of an average value;
With
6. The feature quantity scale calculating means calculates a distance scale (or similarity scale) between time series data of average values generated by the first and second section dividing means. The voice search device described.

Phoneme labeling processing means for generating a query phoneme sequence (or search target phoneme sequence) by performing phoneme labeling on the query speech data (or the search target speech data);
Phoneme string scale calculation means for determining a distance scale (or similarity scale) between the search target phoneme string corresponding to the selected feature data and the query phoneme string;
Total scale calculation means for calculating a linear sum (hereinafter referred to as the “total distance measure (or total similarity measure)”) of the distance measure (or similarity measure) of the feature quantities output by the feature quantity scale calculation means and the distance measure (or similarity measure) of the phoneme strings output by the phoneme string scale calculation means;
With
The coincidence position determination means outputs a position in search target audio data corresponding to the selected feature data when the total distance scale (or total similarity scale) is equal to or smaller than a predetermined threshold (or higher than a predetermined threshold). The voice search device according to any one of claims 1 to 8.

A voice search method for searching partial voice data that matches or is similar to query voice data from search target voice data,
the method comprising a partial voice search step of searching, within pitch equalized search target voice data obtained by equalizing the pitch period of the voiced sound of the search target voice data, for partial voice data whose distance measure (or similarity measure), in a voice feature space, with respect to pitch equalized query voice data obtained by equalizing the pitch period of the voiced sound of the query voice data is equal to or less than a predetermined threshold (or equal to or greater than a predetermined threshold).

A pitch period equalizing step for generating the pitch equalized query voice data by equalizing the pitch period of the voiced sound of the query voice data;
A feature data generation step of generating data (hereinafter referred to as “query feature data”) obtained by converting the pitch equalization query voice data into time-series data of feature amounts;
With
In the partial voice search step, a search is made for partial voice data included in the pitch equalized search target voice data whose feature quantity has a distance measure (or similarity measure) with respect to the query feature data that is equal to or less than a predetermined threshold (or equal to or greater than a predetermined threshold). The voice search method according to claim 9.

In the partial voice search step,
Of the search target feature data obtained by converting the pitch equalization search target speech data into time-series data of feature quantities, partial data for the same phoneme length as the query speech data (hereinafter referred to as “selected feature data”) is used. A partial voice selection step for sequentially selecting while moving the selection position;
A feature amount scale calculating step for calculating a distance measure (or similarity measure) between each of the selected feature data and the query feature data;
When the distance measure (or similarity measure) is equal to or less than a predetermined threshold (or greater than or equal to a predetermined threshold), a matching position determination step of outputting a position in search target audio data corresponding to the selected feature data;
The voice search method according to claim 9 or 10, characterized by comprising:

The voice search method according to claim 11, further comprising a voice storage step of storing the search target characteristic data.

A second pitch period equalization step of generating the pitch equalization search target voice data by equalizing the pitch period of the voiced sound of the search target voice data;
A second feature data generation step of generating the search target feature data by converting the pitch equalization search target voice data into time-series data of feature amounts;
The voice search method according to claim 11 or 12, characterized by comprising:

In the pitch period equalization step (or second pitch period equalization step),
A pitch detection step of detecting a pitch frequency of the query voice data (or the search target voice data);
A residual calculation step of calculating a difference between the pitch frequency and a predetermined reference frequency;
The voice search method according to claim 11, further comprising a frequency shift step of equalizing a pitch frequency of the query voice data (or the search target voice data) so that the difference is minimized.

The search target feature data and the query feature data are respectively time series of subband data obtained by orthogonal transformation of the pitch equalization search target speech data and the pitch equalization query speech data. The voice search method according to any one of claims 10 to 15.

A first section division step of averaging the query feature data for each phoneme section and converting the average to time-series data of an average value;
A second section dividing step of averaging the search target feature data for each phoneme section and converting the averaged time-series data into average values;
Have
The distance scale (or similarity scale) between the time series data of the average values generated in the first and second section division steps is calculated in the feature quantity scale calculation step. Or the voice search method of 14.

A phoneme labeling step of generating a query phoneme sequence (or search target phoneme sequence) by performing phoneme labeling on the query speech data (or the search target speech data);
A phoneme sequence scale calculation step for determining a distance measure (or similarity measure) between the search target phoneme sequence corresponding to the selected feature data and the query phoneme sequence;
A total scale calculation step of calculating a linear sum (hereinafter referred to as the “total distance measure (or total similarity measure)”) of the distance measure (or similarity measure) of the feature quantities output in the feature quantity scale calculation step and the distance measure (or similarity measure) of the phoneme strings output in the phoneme string scale calculation step;
With
In the matching position determination step, when the total distance measure (or total similarity measure) is equal to or less than a predetermined threshold (or equal to or greater than a predetermined threshold), a position in the search target voice data corresponding to the selected feature data is output. The voice search method according to claim 10.

A program that causes a computer to function as the voice search device according to any one of claims 1 to 8 by being read and executed by the computer.