JP2011128903A

JP2011128903A - Sequence signal retrieval device and sequence signal retrieval method

Info

Publication number: JP2011128903A
Application number: JP2009286883A
Authority: JP
Inventors: Tomoyoshi Akiba; 友良秋葉; Taisuke Kaneko; 泰輔金子
Original assignee: Toyohashi University of Technology NUC
Current assignee: Toyohashi University of Technology NUC
Priority date: 2009-12-17
Filing date: 2009-12-17
Publication date: 2011-06-30

Abstract

[Problem] To provide a sequence signal search device and a sequence signal search method that can efficiently process multiple candidates obtained from a sequence signal containing errors.
[Solution] A two-dimensional array is formed from the syllable sequence of the speech recognition result for the speech data and the syllable sequence of the search term. The elements of the array are the distances between syllables, that is, their similarities, so that the array defines a plane, and searching the speech data for the search term is realized by detecting a straight line on this plane. Indexing that takes distance into account enables fast detection, and considering multiple speech recognition candidates enables accurate detection. Because distance-aware indexing lets the search proceed from candidates that are close, or approximately close, in distance, there is no need to cut off the search, and an appropriate threshold can be set. Furthermore, because the search is not cut off at a threshold, solutions can be found in order, starting from the closest or approximately closest candidates, even when the noise is large.
[Selected drawing] FIG. 1

Description

The present invention relates to a search device and a search method, and in particular to a search device and a search method for searching sequence signals, such as searching for words used in speech data or searching text data that contains errors after OCR.

With the spread of recording and editing equipment for audio, images, and video, and with the development of information and communication networks such as the Internet, anyone can now easily create and publish content, and an information explosion of multimedia content is under way. Such content often carries no metadata other than a file name and a title, so it is difficult to reach the desired content with conventional text-based search technology alone. For content that includes spoken language, on the other hand, large-vocabulary continuous speech recognition makes it possible to search using linguistic information. Search technology targeting such spoken-language information is called "Spoken Document Retrieval" or simply "speech search", and it is indispensable in the era of the multimedia information explosion.

Within spoken document retrieval, the problem of locating the positions at which an input search term (also called a query or pattern) appears in speech data is called Spoken Term Detection (STD), Known Item Retrieval, spoken keyword search, or simply speech search, and it is actively studied in the field of speech information processing. In 1997, Known Item Retrieval was evaluated in the Spoken Document Retrieval track (SDR Track) of TREC, the evaluation workshop organized by NIST in the United States [Non-Patent Document 1]. In 2006, NIST again set Spoken Term Detection as a research task, and since then STD research emphasizing the detection of unknown words has been pursued actively [Non-Patent Document 2]. ICASSP (International Conference on Acoustics, Speech and Signal Processing), a leading international conference on speech information processing, also held a special session on spoken document retrieval in 2009. In Japan, a working group of the Special Interest Group on Spoken Language Processing of the Information Processing Society of Japan is preparing a test collection for STD evaluation [Non-Patent Document 3].

Most conventional approaches to STD handle speech recognition errors within the framework of search methods for text, an area in which research is comparatively advanced. The SDR Track of the aforementioned TREC showed that performance can be improved by using multiple lower-ranked candidates in addition to the top-ranked speech recognition candidate. Subsequent methods focused on efficient representations of multiple recognition candidates, typical examples being the Confusion Network [Non-Patent Document 4] and TALE (Time-Anchored Lattice Expansion) [Non-Patent Document 5], and for the search itself most of them used indexing of the same kind as in error-free text search. Because these methods use exact-match indexes, an occurrence that is missing from the index cannot be detected at all. The handling of words outside the recognition vocabulary (also called unknown words) is a particular problem. Methods that additionally use phoneme or syllable recognition results have been proposed to address this, but the lower recognition rate is a problem. Moreover, if a large number of recognition candidates is used to cope with misrecognition, the index grows and search efficiency deteriorates.

Meanwhile, Teshima et al. have proposed a search method that applies suffix-array-based indexing to STD [Non-Patent Document 6]. It can be regarded as applying approximate string matching to STD, but it is difficult to handle multiple recognition candidates with a suffix array (or suffix tree), so search accuracy is a problem. In addition, because all of the conventional methods apply text-based indexing as it is, the index itself contains only binary match/mismatch information. Consequently, distances must be computed during or after the search, for example by DP (Dynamic Programming) matching.

The problem of finding portions of text data that are similar to a pattern is called "Approximate String Matching". Approximate string matching methods fall into two classes: online methods, which assume that the text is given on the spot and perform matching without preprocessing it, and offline methods, which assume that the text is given in advance and preprocess it to build an index (indexing).

Indexing methods for offline approximate string matching are classified into the following three types [Non-Patent Document 7].

(1) Methods based on a suffix tree or a suffix array
(2) Methods using an n-gram index
(3) Methods using an index on a metric space [Non-Patent Document 8]

When these methods are applied to STD, (1) cannot handle multiple speech recognition candidates, and (2) requires that a detected position contain at least a certain proportion of portions that exactly match the search term. Furthermore, both conventional methods (1) and (2) are designed for the problem setting "find all candidates within a given error bound"; adapting them to a problem such as "find the N closest candidates" incurs additional cost, for example re-running the search with a new threshold. The metric-space indexing approach (3), on the other hand, has not previously been applied to STD.

[Non-Patent Document 1] John S. Garofolo, Cedric G. P. Auzanne and Ellen M. Voorhees, "The TREC Spoken Document Retrieval Track: A Success Story", In Proceedings of TREC-9, pp. 107-129, 1999.
[Non-Patent Document 2] NIST, "The Spoken Term Detection Evaluation": http://www.itl.nist.gov/iad/mig/tests/std/2006/index.html
[Non-Patent Document 3] 伊藤慶明, 西崎博光, 胡新輝, 南条浩輝, 秋葉友良, 相川清明, 河原達也, 中川聖一, 松井知子, 山下洋一, "Construction of a Test Collection for Spoken Term Detection: Interim Report" (in Japanese), IPSJ SIG Technical Report, Vol.2009-SLP-78 No.4, 2009.
[Non-Patent Document 4] L. Mangu, E. Brill and A. Stolcke, "Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Network", Computer Speech and Language, Vol.14, No.4, pp.373-400, 2000.
[Non-Patent Document 5] P. Yu, Y. Shi and F. Seide, "Approximate Word-Lattice Indexing with Text Indexers: Time-Anchored Lattice Expansion", In Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.5248-5251, 2008.
[Non-Patent Document 6] 手島茂樹, 桂田浩一, 新田恒雄, "A Fast Spoken Keyword Search System Using a Suffix Array" (in Japanese), Proceedings of the Acoustical Society of Japan 2009 Autumn Meeting, 1-R-26 (2009-9).
[Non-Patent Document 7] G. Navarro, et al., "Indexing methods for approximate string matching", IEEE Data Eng. Bull., 24(4):12-27, 2001.
[Non-Patent Document 8] G. Navarro, et al., "A Metric Index for Approximate String Matching", Theoretical Computer Science (TCS) 352(1-3):266-279, 2006.

Because the conventional methods are based on indexing techniques for unambiguous text, they have difficulty with (1) handling multiple candidates and (2) handling continuous distances, both of which arise with speech recognition results, and they therefore sacrifice either accuracy or efficiency. Regarding (1), because binary match/mismatch indexing is applied, either multiple candidates cannot be handled at all, or handling them increases the number of candidates and degrades efficiency. Regarding (2), handling ambiguous information such as speech requires continuous distances (likelihood, confidence, probability, and so on).

The conventional methods aim at discrete distances (for example, "differs in n characters"), and the index itself does not use distance information. Distances therefore have to be computed during or after the search, for example by DP matching. In addition, the conventional methods require a threshold for the search, and setting an appropriate threshold is difficult.

Accordingly, an object of the present invention is to provide a search device and a search method that can handle multiple candidates efficiently and simplify the search processing.

To achieve the above object, the present inventors arrived at the present invention as a result of intensive study.

That is, the present invention as a sequence signal search device is, in essence, a sequence signal search device comprising: signal feature extraction means for dividing the sequence signal information to be searched into predetermined units and extracting a signal feature for each unit; similarity distance calculation means for calculating a distance indicating the similarity in feature quantity between each signal feature extracted by the signal feature extraction means and a reference signal feature; feature vector generation means for generating, for each reference signal feature, a feature vector in which the signal features are arranged in order starting from the minimum similarity distance calculated by the similarity distance calculation means; a storage device for storing the sequence signal information, the reference signal features, the distances indicating the similarity in feature quantity of the signal features, and the feature vectors; search signal feature extraction means for dividing search signal information into predetermined units and extracting a search signal feature for each unit; signal feature sequence generation means for aligning the feature vectors of the reference signal features that match the search signal features and generating a predetermined signal sequence by selecting elements in order from the minimum value of each feature vector; determination means for determining a search result when the signal feature sequence generated by the signal feature sequence generation means identifies part or all of the search signal features; and output means for outputting the search result.

The determination means of the above invention may be determination means that stores a sequence in which the search signal information is arranged from its beginning and a sequence in which the sequence signal information is arranged from its beginning, and that determines a search result when the signal feature sequence generated by the signal feature sequence generation means is aligned in a straight line on the matrix formed by the two sequences.

In the above invention, the sequence signal information may be speech data, and the signal features and the reference signal features may be signal features characterized by phonemes, syllables, or n-grams of phonemes or syllables.

The present invention as a sequence signal search method, on the other hand, is, in essence, a method of searching via a computer connected to information stored in a storage device, the method consisting of a preprocessing stage and a run-time processing stage. The preprocessing stage consists of: a signal feature extraction step of dividing the sequence signal information to be searched into predetermined units and extracting a signal feature for each unit; a similarity distance calculation step of calculating a distance indicating the similarity in feature quantity between each signal feature extracted in the signal feature extraction step and a reference signal feature; and a feature vector generation step of generating, for each reference signal feature, a feature vector in which the signal features are arranged in order starting from the minimum similarity distance calculated in the similarity distance calculation step. The run-time processing stage consists of: a search signal feature extraction step of dividing search signal information into predetermined units and extracting a search signal feature for each unit; a signal feature sequence generation step of aligning the feature vectors of the reference signal features that match the search signal features and generating a predetermined signal sequence by selecting elements in order from the minimum value of each feature vector; a determination step of determining a search result when the signal feature sequence generated in the signal feature sequence generation step identifies part or all of the search signal features; and an output step of outputting the search result.

The determination step of the above invention may be a determination step that stores a sequence in which the search signal information is arranged from its beginning and a sequence in which the sequence signal information is arranged from its beginning, and that determines a search result when the signal feature sequence generated in the signal feature sequence generation step is aligned in a straight line on the matrix formed by the two sequences.

In the above invention, the signal feature extraction step may be a step of dividing speech data into units of phonemes, syllables, or n-grams of phonemes or syllables and extracting a signal feature for each unit, and the search signal feature extraction step may be a step of dividing character data into units of phonemes, syllables, or n-grams of phonemes or syllables and extracting a signal feature for each unit.

The search method configured as above can be regarded as belonging, among conventional techniques, to the methods that use an index on a metric space. However, whereas the conventional methods index directly in the metric space of whole strings, the present invention divides the metric space into partial metric spaces that can be recombined linearly and performs metric-space indexing in each partial space.

The search method of the present invention is an STD technique that searches large-scale speech data at high speed for portions similar to an input search term, in descending order of similarity. Beyond keyword search in speech data, it is applicable in general to searching for similar portions in any sequence for which a distance or similarity, in particular a continuous-valued one, is defined between the units into which the search term is divided (for speech search, phonemes, syllables, n-grams of syllables or phonemes, and so on) and each position of the data to be searched. For example, it can be applied to approximate string matching and to searching text that contains errors after OCR. In particular, because it remains applicable when the data to be searched has multiple candidates at each position, it is well suited to similar-sequence search in sequences that contain errors.

According to the present invention, indexing that takes distance into account enables fast detection, and considering multiple speech recognition candidates enables accurate detection.

Distance-aware indexing also has the favorable property that the search proceeds from candidates that are close, or approximately close, in distance. In the conventional techniques the index does not take distance into account, so all possible candidates must be treated equally; a threshold for cutting off the search is therefore required, and setting an appropriate threshold is difficult. With the search method of the present invention, no threshold needs to be set, and this problem is avoided.

Owing to this property, the search device of the present invention makes new ways of operating an STD system possible. Because the search method of the present invention does not cut off the search at a threshold, it has the favorable property that solutions can be found in order, starting from the closest or approximately closest candidates, no matter how large the noise is. This property allows a system to be operated in ways that were not possible before. For example, detection can be continued until the first N solutions are found. Alternatively, because a long detection time suggests that good results are unlikely (for instance, there are many competing candidates at similar distances, or the search term does not occur at all), the search can be cut off after a fixed time. Operation that exploits this property can thus be designed into the system.

A schematic diagram of search term detection by straight-line detection on the xy plane.
A schematic diagram of the syllable distance array.
A schematic diagram showing the vector (stack) of positions i sorted in order of distance.
A flowchart of the preprocessing (indexing) algorithm.
A flowchart of the search algorithm.
A flowchart of the search algorithm.
A schematic diagram showing an example of neighborhood generation.
A schematic diagram of the inter-syllable distance taking the Confusion Network into account.
A schematic diagram of the inter-syllable distance taking insertions, deletions, and errors into account.
A schematic diagram of the index data creation device used for preprocessing in an embodiment of the present invention.
A schematic diagram of the search processing device in an embodiment of the present invention.
A graph showing the STD experimental results for continuous DP in an embodiment of the present invention.
A graph showing the STD experimental results in an embodiment of the present invention.

Embodiments of the present invention are described below. The basic principle of the present invention is characterized by an efficient indexing (preprocessing) means that exploits the nature of the STD problem and by a search method (run-time processing) that uses straight-line detection.

Examples of sequence signal information include speech data, text data, music data, and image data; this embodiment is described using speech data as an example. Units into which speech data can be divided include phonemes, syllables, and n-grams of phonemes or syllables; for ease of explanation, the description assumes that the speech data is divided into syllables and that each syllable is recognized as a signal feature. The signal feature of each syllable and its similarity may be measured by the physical difference between the uttered speech waveform and a reference syllable waveform, or by a criterion derived from an analysis method that focuses on differences in the manner of utterance. More specifically, syllables can be modeled with Hidden Markov Models (HMMs), and the distance between the models can be used as the measure. Accordingly, the term "syllable" in the following description may include the "signal feature of each syllable" characterized as above, and the reference signal feature means the signal feature of each syllable when it is pronounced correctly.

Consider a plane whose x axis is the syllable sequence of the speech recognition result for the speech data and whose y axis is the syllable sequence of the search term (FIG. 1). STD can be viewed as the problem of detecting a straight line in this plane. The xy plane can be represented as a two-dimensional array D[i][j] (0 ≤ i < n, 0 ≤ j < m) of size speech data length n × search term length m, whose elements are the distances (similarities) between syllables.

In this case, detecting a straight line y = x + c of slope 1 (the most common matching case), for example, can be formulated as the problem of computing Equation 1 for each i (0 ≤ i < n) and detecting positions where T[i] is small as similar portions.
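
Equation 1 itself is not reproduced in this text. From the definition of D[i][j] and a slope-1 line that starts at data position i, it can plausibly be reconstructed as shown here. This is an inference from the surrounding definitions, not the original typeset formula, and the summation range j = 0, ..., m-1 (every syllable of the search term) is an assumption.

```latex
T[i] = \sum_{j=0}^{m-1} D[i+j][j], \qquad 0 \le i < n
```

Under this reading, terms with i + j ≥ n fall outside the array, so start positions near the end of the data would need to be skipped or penalized.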

Hereinafter, the sum of <expression> over j (0 ≤ j < m-1) is written Σ_j <expression>.

In straight-line detection for images, the two-dimensional array D[i][j] becomes available only on the spot, so everything must be computed online.
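
As a point of comparison, the fully online computation can be sketched as follows: every distance is evaluated at query time and the slope-1 diagonal sum T[i] is taken for each start position. This is a minimal illustration, not the patent's implementation. The function names, the toy 0/1 distance, and the romanized syllables are assumptions, and only start positions at which the whole search term fits are scored.

```python
from typing import Callable, List, Sequence


def online_line_scores(
    recognized: Sequence[str],
    query: Sequence[str],
    dist: Callable[[str, str], float],
) -> List[float]:
    """Return T[i] = sum_j dist(query[j], recognized[i + j]) for each start i."""
    n, m = len(recognized), len(query)
    scores = []
    for i in range(n - m + 1):  # only starts where the whole query fits
        scores.append(sum(dist(query[j], recognized[i + j]) for j in range(m)))
    return scores


def zero_one(a: str, b: str) -> float:
    """Toy 0/1 distance (hypothetical; the text uses continuous distances)."""
    return 0.0 if a == b else 1.0


recognized = ["to", "yo", "ha", "shi", "da", "i"]
query = ["ha", "shi"]
scores = online_line_scores(recognized, query, zero_one)
best = min(range(len(scores)), key=scores.__getitem__)  # start with smallest T[i]
```

Each query costs on the order of n × m distance evaluations here, which is the cost that offline preprocessing is meant to avoid.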

In STD, by contrast, the speech data can be assumed to be known before the search. For each unit a into which a search term may be divided (a phoneme, a syllable, an n-gram of syllables or phonemes, and so on; hereinafter called a syllable), the array D(a)[i] (0 ≤ i < n) of distances at each position of the speech data can therefore be computed offline (FIG. 2). At search time, the two-dimensional array D[i][j] can be constructed simply by stacking the distance arrays D(a[0])[i], D(a[1])[i], ..., D(a[m-1])[i] in order along the y axis according to the syllable sequence a[0], a[1], ..., a[m-1] of the search term (that is, D[i][j] = D(a[j])[i]).
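
The split between offline and query-time work described above can be sketched as follows. The helper names, the dictionary layout of the distance table, and the toy 0/1 distance are assumptions made for illustration; the text itself only specifies that D(a)[i] is computed in advance and that D[i][j] = D(a[j])[i].

```python
from typing import Callable, Dict, List, Sequence

Distance = Callable[[str, str], float]


def precompute_distances(
    recognized: Sequence[str], syllables: Sequence[str], dist: Distance
) -> Dict[str, List[float]]:
    """Offline step: D(a)[i] for every syllable a and every data position i."""
    return {a: [dist(a, r) for r in recognized] for a in syllables}


def assemble_plane(
    table: Dict[str, List[float]], query: Sequence[str], n: int
) -> List[List[float]]:
    """Query-time step: D[i][j] = D(a[j])[i], obtained purely by lookup."""
    return [[table[a][i] for a in query] for i in range(n)]


def zero_one(a: str, b: str) -> float:
    """Toy 0/1 distance (hypothetical stand-in for a continuous distance)."""
    return 0.0 if a == b else 1.0


recognized = ["ka", "sa", "ta"]
table = precompute_distances(recognized, ["ka", "sa", "ta"], zero_one)
plane = assemble_plane(table, ["sa", "ta"], len(recognized))
```

In this toy case the zero entries plane[1][0] and plane[2][1] lie on a slope-1 diagonal, which corresponds to the query "sa ta" matching from data position 1.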

Furthermore, the distance array D(a)[i] can be sorted offline in ascending order of distance; that is, the signal features are arranged in order starting from the minimum similarity distance. For a syllable a, let S(a) be the stack obtained by sorting the positions i in the speech data in ascending order of distance according to D(a)[i], and let S(a).top be the top of that stack (FIG. 3).

FIG. 4 shows a flowchart of the algorithm described above for computing D(a)[i] and S(a). The procedure is explained below with reference to FIG. 4.

First, a syllable Confusion Network CN[i] is prepared, for example by running speech recognition on the speech data (step A1). The syllable set A is initialized to contain all syllables (step A2). One syllable a is taken out of A, and the position i is initialized to 0 (step A3). The distance between a and CN[i] is computed and stored in the distance array D(a)[i], and i is incremented (step A4). If the position i has reached the document length n, processing proceeds to step A6; otherwise it returns to step A4 (step A5). The document positions are sorted in ascending order of the computed D(a)[i] to obtain the sorted position vector S(a) (step A6). If syllables still remain in the syllable set A, processing returns to step A3 (step A7). Finally, the position vector S(a) and the distance vector D(a)[i] (together called the feature vector) are output (step A8).
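Steps A1 to A8 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `build_index` and `toy_distance`, the representation of the confusion network as a list of candidate sets, and the 0/1 distance are all hypothetical, and a sorted list stands in for the stack S(a).

```python
def build_index(cn, syllables, distance):
    """Compute D(a)[i] and S(a) for every syllable a (steps A2 to A8)."""
    n = len(cn)
    dist = {}   # dist[a][i] plays the role of D(a)[i]
    stack = {}  # stack[a] plays the role of S(a); stack[a][0] is S(a).top
    for a in syllables:                                    # steps A3 and A7
        dist[a] = [distance(a, cn[i]) for i in range(n)]   # steps A4 and A5
        # Step A6: positions sorted by ascending distance.
        stack[a] = sorted(range(n), key=lambda i: dist[a][i])
    return dist, stack                                     # step A8

def toy_distance(a, slot):
    # Hypothetical distance: 0 if a is among the candidates at this position.
    return 0 if a in slot else 1

# Step A1: a made-up confusion network with four positions.
cn = [{"to"}, {"yo", "jo"}, {"ha"}, {"shi"}]
dist, stack = build_index(cn, ["to", "yo", "ha", "shi"], toy_distance)
```

Because Python's sort is stable, positions with equal distance keep their original order, so `stack["yo"]` starts with position 1 and continues with 0, 2, 3.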

Using the S(a) computed in the preprocessing as a search index yields an efficient detection method that examines positions in order of how promising they are. More specifically, fast search word detection is possible with the following procedure.

FIG. 5 shows a flowchart of the search processing. The procedure is explained below with reference to FIG. 5.

First, a search word is input and, following its syllable sequence a[0], a[1], ..., a[m−1], the sequence of stacks S(a[0]), S(a[1]), ..., S(a[j]), ..., S(a[m−1]) computed in advance by the preprocessing is prepared. In addition, R = {} and count[i] = 0 (0 ≤ i < n) are initialized (step B1). From the set U = {S(a[0]).top, S(a[1]).top, ..., S(a[j]).top, ..., S(a[m−1]).top} consisting of the top element of each stack, a j is chosen by some criterion, S(a[j]).top is popped from the stack S(a[j]), and for the position i = S(a[j]).top − j, 1 is added to the counter count[i] that records the number of votes (that is, a vote is cast) (step B2). If count[i] ≥ k does not yet hold for a given value k, processing returns to step B2 (step B3). Otherwise, the position i is added to the search result candidate set R (step B4). Each candidate i in R that satisfies some criterion is output as a search result, and output candidates are removed from R (step B5). If the termination condition is not satisfied, processing returns to step B2 (step B6).

Note that in the above procedure there is basically no need to compute a distance for each candidate at all. As described below, depending on the criterion used in step B5, the distance of each candidate, T[i] = Σ_j D(a[j])[i+j], must be computed, but it is obtained without any expensive computation simply by adding up the precomputed per-syllable distances D(a[j])[i+j].
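As a small illustration of why this is cheap, the candidate distance can be written as a sum of table lookups. The function name `total_distance` and the distance values below are assumptions for illustration only.

```python
def total_distance(dist, query, i):
    """Candidate distance: sum of precomputed D(a[j])[i + j] over the query."""
    return sum(dist[a][i + j] for j, a in enumerate(query))

# Made-up distance arrays D(a)[i] for a 4-position recording.
dist = {
    "yo":  [1, 0, 1, 1],
    "ha":  [1, 1, 0, 1],
    "shi": [1, 1, 1, 0],
}
query = ["yo", "ha", "shi"]

# Only start positions where the whole query fits are scored.
n, m = len(dist["yo"]), len(query)
scores = [total_distance(dist, query, i) for i in range(n - m + 1)]
```

Here position 1 scores 0 because every query syllable matches exactly along the diagonal starting there.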

A possible criterion for choosing j in step B2 is to take the element with the smallest distance in U, that is, argmin_j D(a[j])[S(a[j]).top]. Possible termination conditions for step B6 include "until the first search result is obtained", "until the first N search results are obtained", "until the distance of a search result exceeds a given threshold", and "until a given amount of time has elapsed". Step B5 need not be executed in every iteration; for example, an implementation may execute it once each time step B4 has been executed a fixed number of times.

The choice of k in step B3 and of the criterion in step B5 affects both accuracy and efficiency. Reducing k in step B3 improves efficiency but lowers accuracy. For this reason, the results are evaluated against some criterion in step B5, and only those that are safe to output are output.

The simplest implementation sets k = m (the search word length), imposes no criterion in step B5, and outputs each search result immediately. To improve efficiency, k can be made smaller while still imposing no criterion in step B5. For example, with k = αm for some real number α (0 < α < 1), a position is output once it has received votes from a fraction α of the search word length.

FIG. 6 shows a flowchart of these simple implementations that impose no criterion in step B5.

In FIG. 6, steps B1, B2, B3, and B6 are the same as the corresponding steps in FIG. 5. In step B45 of FIG. 6, the position i is output directly as a search result.
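The simple voting search of FIG. 6 might be sketched in Python as follows. This is a hedged illustration, not the patented implementation: the function name `search`, the per-syllable cursors that emulate popping from a stack, the argmin rule for choosing j, the "first N results" termination condition, and the toy index are all assumptions.

```python
def search(dist, stack, query, k, max_results=1):
    """Vote for start positions and output a position once it has k votes."""
    m, n = len(query), len(dist[query[0]])
    cursor = [0] * m          # step B1: cursor[j] points at the top of S(a[j])
    count = {}                # step B1: vote counters, implicitly 0
    results = []
    while len(results) < max_results:             # step B6 (assumed condition)
        # Step B2: choose the stack whose top element has the smallest distance.
        best_j = None
        for j, a in enumerate(query):
            if cursor[j] < len(stack[a]):
                d = dist[a][stack[a][cursor[j]]]
                if best_j is None or d < best_d:
                    best_j, best_d = j, d
        if best_j is None:
            break                                 # every stack is exhausted
        pos = stack[query[best_j]][cursor[best_j]]
        cursor[best_j] += 1                       # pop
        i = pos - best_j                          # start position voted for
        if i < 0 or i + m > n:
            continue                              # query would not fit here
        count[i] = count.get(i, 0) + 1
        if count[i] == k:                         # step B3
            results.append(i)                     # step B45: output directly
    return results

# Made-up index for a 4-position recording (see the preprocessing steps).
dist = {"yo": [1, 0, 1, 1], "ha": [1, 1, 0, 1], "shi": [1, 1, 1, 0]}
stack = {"yo": [1, 0, 2, 3], "ha": [2, 0, 1, 3], "shi": [3, 0, 1, 2]}
query = ["yo", "ha", "shi"]
hits = search(dist, stack, query, k=len(query))
```

With k = m the toy query is found at position 1 after three pops; lowering k makes a position qualify after fewer votes, at the cost of accuracy.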

In FIG. 5, when k = 1 in particular and the following Criterion 1 is used in step B5, an optimal-solution algorithm is obtained: it guarantees that no candidate is output that is worse (has a larger distance) than a candidate output in a later iteration.

(Criterion 1) Σ_j D(a[j])[i+j] ≤ Σ_j S(a[j]).top holds, where S(a[j]).top on the right-hand side denotes the distance of the stack-top element.

Furthermore, if step B5 outputs candidates in order starting from the smallest distance, the algorithm outputs solutions in ascending order of distance. The set R can be implemented efficiently with an order-preserving data structure such as a binary search tree or a B-tree.

The validity of Criterion 1 follows from the lemmas below.

(Lemma 1) For a position i that has not yet received any vote, the final distance T[i] defined by Equation 2 is greater than or equal to the sum of the stack-top distances, Σ_j S(a[j]).top.

(Proof) Since each stack is sorted in ascending order, for a position i that has not yet received a vote and for any syllable position j,

D(a[j])[i+j] ≥ D(a[j])[S(a[j]).top]

holds. Summing over j therefore gives

T[i] = Σ_j D(a[j])[i+j] ≥ Σ_j D(a[j])[S(a[j]).top].

(End of proof)

(Lemma 2) Any candidate position i whose distance is smaller than Σ_j S(a[j]).top has received a vote at some syllable position j at least once.

(Proof) This is immediate from the contrapositive of Lemma 1.

It follows that no candidate with a distance smaller than Σ_j S(a[j]).top exists outside the candidate set R, which guarantees that the optimal solution is obtained.

The above is the basic algorithm, but many variations are possible. The description so far assumed the simplest case, a straight line of slope 1, y = x + c, which as a distance measure corresponds to the Hamming distance between strings. By applying neighborhood generation, a technique used in approximate string matching, vote targets T[i] can also be prepared for straight lines and polylines in the neighborhood of that line and the same computation carried out, so that the method extends to the edit distance, a more general distance measure between strings, and to other distances (FIG. 7).

Any distance determined by the syllable a and the position i in the speech data can be used for the distance array D(a)[i]. More complex distances are possible, such as a distance that takes multiple candidates into account using a Confusion Network or TALE (Time-Anchored Lattice Expansion), which are compact representations of multiple speech recognition results (FIG. 8); a distance that expresses the plausibility of the recognition; and a distance that also takes the distances to adjacent syllables into account so as to tolerate deviations from the straight line caused by insertion and deletion errors (FIG. 9).

[Preprocessing] The index data (the stacks S(a) described in the embodiment above) is created by the following procedure (FIG. 10).

1. Speech data 201 recorded in a storage device 102 such as a hard disk or CD-ROM is prepared. The speech data 201 is recognized using a speech recognition decoder installed on a computer 101, and a syllable Confusion Network 202 representing the multiple recognition candidates obtained as the recognition result is created on the storage device 102. A simpler recognition result, a plain syllable string, may be used as-is in place of the syllable Confusion Network 202. Instead of syllables, any unit obtained by dividing a search word, such as phonemes or n-grams of syllables or phonemes, may also be used (these are collectively called syllables below).

2. For a syllable a, the computer 101 computes the distance to each of the multiple syllable candidates at each position i of the syllable Confusion Network 202 and takes the minimum of these inter-syllable distances. The distances computed for all positions are stored on the storage device 102 as a syllable distance array 203. This operation is repeated for every syllable to create a syllable distance array 203 for each syllable.

3. For each syllable a, the positions i are sorted by the computer in ascending order according to the syllable distance array 203, and the resulting syllable position vector 204 is created on the storage device.
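Step 2 of the procedure above, taking the minimum distance over the candidates at each Confusion Network position, could look like the following sketch. The names `slot_distance` and `zero_one`, the list-of-lists network layout, and the 0/1 unit distance are assumptions chosen for illustration.

```python
def slot_distance(a, candidates, unit_distance):
    """Distance between syllable a and one network position: the minimum
    distance to any of the recognition candidates at that position."""
    return min(unit_distance(a, c) for c in candidates)

def zero_one(a, c):
    # Placeholder unit distance: 0 for identical syllables, 1 otherwise.
    return 0 if a == c else 1

# Made-up confusion network: each position lists its recognition candidates.
cn = [["to"], ["yo", "jo"], ["ha", "wa"], ["shi"]]

# Syllable distance array for the syllable "wa" over all positions.
distance_array = [slot_distance("wa", slot, zero_one) for slot in cn]
```

Position 2 gets distance 0 even though "wa" is only the second candidate there, which is how the index takes multiple recognition candidates into account.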

[Search processing] Search processing is performed using the syllable position vectors (stacks) created in the preprocessing (FIG. 11).

1. The speech data 401 and the per-syllable syllable position vectors 403 created in [Preprocessing] are prepared in a storage device 302 such as a hard disk. If the search uses a method that requires the distance between a syllable and a position, the per-syllable syllable distance arrays 402 are also prepared in the storage device 302. If the text of the speech recognition result is to be returned as the search result, that text may be prepared on the storage device in place of the speech data 401.

2. The search word (syllable string) given by the system user is input to the computer 301 using an input device 303 such as a keyboard. For every syllable in the input syllable string, the corresponding syllable position vector 403 is read from the storage device 302 into the computer 301, and a stack is built in the computer's memory. One stack is prepared for each syllable making up the search word, so there are as many stacks as the search word has syllables. If the search uses a method that requires the distance between a syllable and a position, the corresponding syllable distance arrays 402 are also read into the memory of the computer 301.

Besides a keyboard, any input device capable of inputting a syllable string, such as speech recognition or handwriting recognition, can serve as the input device 303. Instead of being read into memory, the syllable distance array 402 can be left on the storage device 302 and consulted by random access when needed. The syllable position vector 403 likewise need not be read from the storage device into memory all at once; the vector can be divided from its head into blocks of a given length and read one block at a time as needed. The tail of a syllable position vector, which holds the entries with large distances, is unlikely to be used and can be deleted to save storage capacity on the storage device 302.
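Block-wise reading of a syllable position vector could be arranged with a lazy iterator, as in the sketch below. The `load_block` callback stands in for whatever storage access is actually used; it, the block size, and the stored values are hypothetical.

```python
from itertools import islice

def iter_positions(load_block, block_size):
    """Yield positions in index order, fetching one block at a time."""
    block_no = 0
    while True:
        block = load_block(block_no, block_size)
        if not block:
            return                # end of the (possibly truncated) vector
        yield from block
        block_no += 1

# Stand-in for a position vector kept on the storage device.
stored_vector = [7, 2, 9, 4, 0, 5]
loaded_blocks = []

def load_block(block_no, block_size):
    loaded_blocks.append(block_no)    # record which blocks were fetched
    start = block_no * block_size
    return stored_vector[start:start + block_size]

# Consuming three positions touches only the first two blocks.
first_three = list(islice(iter_positions(load_block, 2), 3))
```

Truncating the tail of the stored vector fits the same interface: the iterator simply ends earlier.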

3. The positions where the search word occurs are found by the method described in the embodiment above, referring to and manipulating the per-syllable stacks in the memory of the computer 301. If the search uses a method that requires the distance between a syllable and a position, the syllable distance arrays 402 in the computer's memory (or on the storage device) are also consulted.

4. Based on the detected occurrence position, the speech around that position is extracted from the speech data 401 on the storage device 302 and returned as the search result. When the text of the speech recognition result is used, the text at the occurrence position is returned as the search result. The range output as a search result can be chosen freely to suit the application: the search word itself, or a wider span containing it such as a sentence or a document. The search result may be presented directly to the user through an output device 304 such as an audio playback device or a display, or it may be used as input to any device that provides a service, such as machine translation, a speech synthesizer, or a Web server.

The storage device 302, input device 303, output device 304, and computer 301 that make up the apparatus described in [Preprocessing] and [Search processing] above may be connected directly, or they may be distributed over a network and interconnected by communication to form the apparatus.

The speech data 401, syllable distance arrays 402, and syllable position vectors 403 held on the storage device 302 can also each be placed on a different storage device. For example, the speech data 401 can be placed on a storage device of a server reached over a network, with only the syllable position vectors placed on a storage device directly attached to the computer 301.

The input/output device 303 can also be implemented as a client on the Web, with the remaining components built on a Web server.

The apparatus described in the embodiment was implemented and evaluated experimentally by the following procedure.

An STD system was built for the speech data of 177 lectures (about 44 hours) in the Corpus of Spontaneous Japanese (CSJ), a corpus of recorded Japanese academic lectures and simulated lectures. The 1-best speech recognition results for the CSJ, included in a test collection for spoken document retrieval that targets the CSJ, were expanded into phoneme strings and indexed using the phoneme as the division unit.

As the distance measure between a phoneme a and a search target position i, the Hamming distance between phoneme distinctive features [Non-Patent Document 6] was used.
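A Hamming distance over distinctive features can be illustrated as below. The three-feature table (voiced, nasal, labial) is a made-up toy and is not the feature set of Non-Patent Document 6; the names `FEATURES` and `hamming` are likewise assumptions.

```python
# Toy feature table: (voiced, nasal, labial). Illustrative only.
FEATURES = {
    "p": (0, 0, 1),
    "b": (1, 0, 1),
    "m": (1, 1, 1),
}

def hamming(a, b):
    """Number of distinctive features on which phonemes a and b differ."""
    return sum(x != y for x, y in zip(FEATURES[a], FEATURES[b]))

distance_p_m = hamming("p", "m")
```

Under this toy table "p" and "b" differ only in voicing, so they are closer to each other than "p" and "m".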

The search words were taken from a test collection for search word detection targeting the CSJ [Non-Patent Document 3].

The method was compared with continuous DP matching using a distance threshold under the same conditions as the above example. The experimental results are shown in FIG. 12 and FIG. 13. They show that search efficiency is greatly improved without degrading search performance.

101 ... Computer
102 ... Storage device
201 ... Speech data
202 ... Confusion Network
203 ... Syllable distance array
204 ... Syllable position vector
301 ... Computer
302 ... Storage device
303 ... Input device
304 ... Output device
401 ... Speech data
402 ... Syllable distance array
403 ... Syllable position vector

Claims

1. A sequence signal search apparatus comprising:
signal feature extraction means for dividing the sequence signal information to be searched into predetermined units and extracting a signal feature for each unit;
similarity distance calculation means for calculating a distance indicating the similarity of feature quantities between the signal feature extracted by the signal feature extraction means and a reference signal feature;
feature vector generation means for generating, for each reference signal feature, a feature vector in which signal features are arranged in order starting from the minimum value of the similarity distance calculated by the similarity distance calculation means;
a storage device that stores the sequence signal information, the reference signal features, the distances indicating the similarity of the feature quantities of the signal features, and the feature vectors;
search signal feature extraction means for dividing search signal information into predetermined units and extracting a search signal feature for each unit;
signal feature sequence generating means for aligning the feature vectors of the reference signal features that match the search signal features and selecting sequentially from the minimum value of each feature vector to generate a predetermined signal sequence;
a determination unit that determines a search result when the signal feature sequence generated by the signal feature sequence generating means identifies part or all of the search signal features; and
output means for outputting the search result.

2. The sequence signal search apparatus according to claim 1, wherein the determination unit stores a column in which the search signal information is arranged from its beginning and a column in which the sequence signal information is arranged from its beginning, and determines a search result when the signal feature sequence generated by the signal feature sequence generating means is aligned in a straight line on the matrix formed by the two columns.

3. The sequence signal search apparatus described above, wherein the sequence signal information is speech data, and the signal feature and the reference signal feature are each a phoneme, a syllable, or an n-gram of phonemes or syllables.

4. A sequence signal search method for searching information stored in a storage device by means of a computer connected to the storage device, the method comprising a preprocessing step and a runtime processing step, wherein
the preprocessing step comprises:
a signal feature extraction step of dividing the sequence signal information to be searched into predetermined units and extracting a signal feature for each unit;
a similarity distance calculation step of calculating a distance indicating the similarity of feature quantities between the signal feature extracted in the signal feature extraction step and a reference signal feature; and
a feature vector generation step of generating, for each reference signal feature, a feature vector in which signal features are arranged in order starting from the minimum value of the similarity distance calculated in the similarity distance calculation step; and
the runtime processing step comprises:
a search signal feature extraction step of dividing search signal information into predetermined units and extracting a search signal feature for each unit;
a signal feature sequence generation step of aligning the feature vectors of the reference signal features that match the search signal features and selecting sequentially from the minimum value of each feature vector to generate a predetermined signal sequence;
a determination step of determining a search result when the signal feature sequence generated in the signal feature sequence generation step identifies part or all of the search signal features; and
an output step of outputting the search result.

5. The sequence signal search method according to claim 4, wherein the determination step stores a column in which the search signal information is arranged from its beginning and a column in which the sequence signal information is arranged from its beginning, and determines a search result when the signal feature sequence generated in the signal feature sequence generation step is aligned in a straight line on the matrix formed by the two columns.

6. The sequence signal search method described above, wherein the signal feature extraction step divides speech data into units of phonemes, syllables, or n-grams of phonemes or syllables and extracts a signal feature for each unit, and the search signal feature extraction step divides character data into units of phonemes, syllables, or n-grams of phonemes or syllables and extracts a signal feature for each unit.