JP2011175046A

JP2011175046A - Voice search device and voice search method

Info

Publication number: JP2011175046A
Application number: JP2010038011A
Authority: JP
Inventors: Seiichi Nakagawa; 聖一中川
Original assignee: Toyohashi University of Technology NUC
Current assignee: Toyohashi University of Technology NUC
Priority date: 2010-02-23
Filing date: 2010-02-23
Publication date: 2011-09-08
Anticipated expiration: 2030-02-23
Also published as: JP5590549B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice search device and a voice search method that detect desired speech, from voice or text input, in a large amount of speech data containing recognition errors or unknown words that are not in the dictionary.
SOLUTION: Search input is given by voice or text, and the speech data is recognized by large-vocabulary continuous speech recognition. Unknown words are indexed, and the speech data of an unknown word, or the search word, is divided into phonemes or syllables so that a plurality of detection candidates is generated for the unknown-word index. A search result is thereby presented from a large amount of speech data containing recognition errors or unknown words that are not in the dictionary.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a voice search device and a voice search method that detect desired speech in speech data in response to voice or text input.

The Internet holds a large amount of audio information, such as news audio, posted videos, and podcasts (referred to here as speech documents), and the amount grows every year. Speech data held by individual organizations, such as recordings of meetings, lectures, and call-center conversations, is also increasing. Users need the information they want to be detected quickly and correctly in such enormous volumes of speech data. Moreover, search queries are often proper nouns or newly coined words (e.g., swine flu), many of which do not exist in existing large-scale dictionaries. In conventional text search, such words are still represented correctly as character strings, so entering the query as a character string usually retrieves them correctly and causes few problems.

For speech documents, by contrast, the simplest approach is usually to convert the speech into a word sequence with a large-vocabulary continuous speech recognizer and then apply text search techniques. However, words that are not in the recognition dictionary (usually 20,000 to 100,000 words), that is, unknown words, cannot be retrieved. Speech recognition errors are also frequent, and words affected by them cannot be found by ordinary text search.

One way to handle unknown query words in speech documents is to enlarge the recognition dictionary and thereby reduce the number of unknown words (for example, Non-Patent Document 1). However, it is impossible to register every proper noun and similar word in a dictionary. Recognition errors also remain unavoidable even with a larger dictionary; words that occur infrequently are especially prone to misrecognition, so the problem of recognition errors is not solved.

Therefore, the basic approach for unknown words is to use speech recognition to convert the speech document into a sequence of symbols such as phonemes or syllables, the basic units of spoken language (other basic units are also possible), and then to match the query, expressed in the same symbols, against that sequence (for example, Non-Patent Document 2). To cope with errors in phoneme or syllable recognition, it is common to represent multiple recognition candidates efficiently as a graph structure and to search that structure (for example, Patent Document 1 and Non-Patent Document 3). Various techniques have been devised to speed up matching between this structure and the phoneme or syllable sequence of the query (Non-Patent Documents 3 and 4), but the search time still grows in proportion to the amount of speech documents to be searched.

Patent Document 1: JP 2009-271117 A

Non-Patent Document 1: Kuriki (栗城吾央), Yoshiaki Itoh, Kazunori Kojima, Masaaki Ishigame, Kazuyo Tanaka: Search term detection by query expansion using vocabulary on the Web, IEICE Technical Report (Speech), SP2009-84 (2009.12)
Non-Patent Document 2: Takaaki Hori, et al.: Open-vocabulary utterance search using confusion networks, Proceedings of the Acoustical Society of Japan Meeting, 1-3-10 (2007.9)
Non-Patent Document 3: Yoshiaki Itoh, et al.: Integration of multiple subwords in vocabulary-free spoken document retrieval, IPSJ Journal, Vol. 50, No. 2, pp. 524-533 (2009.2)
Non-Patent Document 4: Naoyuki Kanda, Takashi Sumiyoshi, Masahito Togami, Yasunari Obuchi: Performance evaluation of a multi-stage rescoring method for open-vocabulary spoken utterance search, Proceedings of the 2nd Spoken Document Processing Workshop, pp. 73-78 (2008.2)

Fast search that is robust to unknown words is desired for large collections of speech documents. As described above, however, conventional methods basically use speech recognition to convert the speech document into a sequence of symbols such as phonemes or syllables, the basic units of spoken language, and then match the query, expressed in the same symbols, against that sequence. To cope with recognition errors, the conventional methods represent multiple recognition candidates efficiently as a graph structure and search that structure.

However, this approach has two problems: the search time grows in proportion to the amount of speech documents to be searched, and recognition errors that do not appear among the multiple recognition candidates cannot be handled.

Two methods have been proposed to address these problems. The first indexes the phoneme symbol sequence with an inverted index and, at search time, searches the index table while allowing for recognition errors such as substitution errors. This makes the search faster, but searching the index still takes time.

The second method builds the index with phoneme and syllable recognition errors already taken into account. This allows fast search, but the query is often detected at too many locations, so the detected locations are then verified by detailed symbol-sequence matching with the conventional method to narrow down the candidates. The problem with this method is that the detailed matching takes time in proportion to the duration of the speech documents.

To solve the above problems of the prior art, the inventor has devised a voice search device and a voice search method having the following features.

The voice search device according to claim 1 comprises:
a search word input unit that accepts voice or text input; a continuous speech database storage unit that stores the speech to be searched;
a large-vocabulary continuous speech recognition unit that recognizes speech data from the input unit and the database storage unit;
a continuous speech data recognition result storage unit that stores the recognition results of the large-vocabulary continuous speech recognition unit;
an unknown word index creation unit that assigns an index to unknown words in the continuous speech database;
a phoneme/syllable recognition unit that recognizes the speech data of unknown words by dividing it into phonemes or syllables, the basic units of speech;
a phoneme/syllable data storage unit that stores the recognition results of the phoneme/syllable recognition unit; and
a phoneme/syllable sequence search unit that presents at least one search candidate,
wherein the phoneme/syllable recognition unit has a function of generating a plurality of detection candidates for an unknown word to which the index has been assigned.

The voice search device according to claim 2 is
the voice search device according to claim 1,
wherein the phoneme/syllable recognition unit has a function of assuming substitution errors and insertion errors from at least one recognition result candidate, assigning the recognition results as an index, and presenting detection candidates.

The voice search device according to claim 3 is
the voice search device according to claim 1 or claim 2,
wherein the phoneme/syllable recognition unit has a function of assuming deletion errors from at least one recognition result candidate, assuming those deletion errors on the search word, assigning an index, and presenting detection candidates.

The voice search device according to claim 4 is
the voice search device according to claim 3,
wherein the phoneme/syllable recognition unit has a function of assuming substitution errors and insertion errors and/or deletion errors from at least one recognition result candidate, assigning the recognition of those errors as an index, and selecting, on the basis of a preset threshold, among the detection candidates obtained using the information of the divided search word.

The voice search device according to claim 5 is
the voice search device according to any one of claims 1 to 4,
wherein the phoneme/syllable recognition unit has a function of assigning an index to at least one recognition result candidate using the Bhattacharyya distance between phonemes and between syllables, and
presenting detection candidates according to the distance to the second or third recognition result candidate, based on acoustic similarity to the first recognition result candidate. The distance between phonemes and between syllables can be defined in various ways; normally a definition corresponding to the measure used in the recognition system is used.

The voice search device according to claim 6 is
the voice search device according to any one of claims 1 to 5,
wherein the phoneme/syllable recognition unit has a function of assigning an index to at least one recognition result candidate using the log likelihood defined by Equation 1,

[Equation 1]

and presenting detection candidates on the basis of the log likelihood of the recognition result.

The voice search method according to claim 7 comprises:
a search word input step that accepts voice or text input; a continuous speech database storage step for the speech to be searched; a large-vocabulary continuous speech recognition step of recognizing speech data from the input unit and the database storage unit;
a continuous speech data recognition result storage step of storing the recognition results of the large-vocabulary continuous speech recognition unit;
an unknown word index step of assigning an index to unknown words in the continuous speech database;
a phoneme/syllable recognition step of recognizing the speech data of unknown words by dividing it into phonemes or syllables, the basic units of speech;
a phoneme/syllable recognition result storage step of storing the recognition results of the phoneme/syllable recognition unit; and
a phoneme/syllable search step of presenting at least one search candidate,
wherein the phoneme/syllable recognition step has a function of generating a plurality of detection candidates for an unknown word to which the index has been assigned.

The voice search method according to claim 8 is
the voice search method according to claim 7,
wherein the phoneme/syllable recognition step has a function of assuming substitution errors and insertion errors from at least one recognition result candidate, assigning the recognition as an index, and presenting detection candidates.

The voice search method according to claim 9 is
the voice search method according to claim 7 or claim 8,
wherein the phoneme/syllable recognition step has a function of assuming deletion errors from at least one recognition result candidate, assuming the recognition of those deletion errors on the search word, assigning an index, and presenting detection candidates.

The voice search method according to claim 10 is
the voice search method according to claim 9,
wherein the phoneme/syllable recognition step has a function of assuming substitution errors and insertion errors and/or deletion errors from at least one recognition result candidate, assigning the recognition of those errors as an index, and selecting, on the basis of a preset threshold, among the detection candidates obtained using the information of the divided search word.

The voice search method according to claim 11 is
the voice search method according to any one of claims 7 to 10,
wherein the phoneme/syllable recognition step has a function of assigning an index to at least one detection candidate using the Bhattacharyya distance between phonemes and between syllables, and
presenting detection candidates according to the distance to the second or third detection candidate, based on acoustic similarity to the first detection candidate. The distance between phonemes and between syllables can be defined in various ways; normally a definition corresponding to the measure used in the recognition system is used.

The voice search method according to claim 12 is
the voice search method according to any one of claims 7 to 11,
wherein the phoneme/syllable recognition step has a function of assigning an index to at least one detection candidate using the log likelihood defined by Equation 1,

[Equation 1]

and presenting detection candidates on the basis of the log likelihood of the recognition result.

As a result, known words are converted into word sequences by large-vocabulary continuous speech recognition, while unknown words and misrecognized words are recognized by phoneme/syllable recognition as sequences of phonemes or syllables, units more basic than words. This makes it possible to provide a voice search device and a voice search method that search, from voice or text input, a large amount of speech data containing recognition errors or unknown words that are not in the dictionary.
In addition, when search candidates are presented using the divided search information, selecting detection candidates against a preset threshold improves search efficiency.

FIG. 1 shows (a) a block diagram of the voice search device and (b) a block diagram explaining the algorithm of the voice search method.
FIG. 2 is an explanatory diagram showing the procedure for creating a trigram array.
FIG. 3 is an explanatory diagram showing the procedure for creating a trigram array when substitution errors are included.
FIG. 4 is an explanatory diagram showing the procedure for creating a trigram array when insertion errors are included.
FIG. 5 is an explanatory diagram showing the division of a search word into trigrams.
FIG. 6 is an explanatory diagram showing the procedure for creating a trigram array when deletion errors are included.
FIG. 7 is an explanatory diagram showing how substitution errors are handled by combining the insertion-error and deletion-error countermeasures.
FIG. 8 is an explanatory diagram showing the internal representation of a trigram array.

The present invention provides a means of searching speech at high speed in a large speech database that contains recognition errors and unknown words.

Specifically, known words are converted into word sequences by a conventional large-vocabulary recognizer and then searched by ordinary text search. For unknown words and misrecognized words, the speech is recognized as a sequence of syllables or phonemes, units more basic than words, and the search is run on that result. Because the phoneme or syllable sequences of unknown words cannot be recognized without error, recognition errors are assumed to be present. There are three kinds of recognition error: substitution, insertion, and deletion. Substitution errors are handled by using multiple recognition candidates; where this alone is insufficient, they can be handled by combining the insertion-error and deletion-error countermeasures described next. Insertion errors are handled by searching with possible insertions in the recognition result taken into account. Deletion errors are handled by deleting phonemes or syllables from the query's phoneme/syllable sequence.

The speech database to be searched is recognized offline in advance. From the result, n-grams of phonemes or syllables are built with errors taken into account; for three syllables, for example, this gives an array of triples (hereinafter, the trigram array). Each entry is stored together with its position in the recognition result and a distance or likelihood indicating the degree of recognition error, and the entries are kept as an index in lexicographic order. A query word is divided into three-syllable units, the trigram array index is searched for each unit by binary search or a similar method, and the per-unit results are integrated to produce the final search result. This method searches quickly without lowering search accuracy relative to conventional methods: the search time for 10,000 hours of speech data is within one second. The trigram array index also requires less storage than the original audio files, which makes it practical.

Queries can be entered into the device as either text or voice. Voice input is converted to text by a large-vocabulary continuous speech recognizer or a phoneme/syllable continuous speech recognizer. Recognition errors in the latter are handled with the same techniques used for recognition errors in the speech data being searched.

The core of the invention is to attach to each index entry the distance or likelihood corresponding to recognition errors, which enables very fast search with equivalent performance without performing detailed matching against the query string.

Embodiments of the present invention are described below with reference to the drawings. They are illustrative examples only; the details of implementation are not limited to these examples and drawings.

FIG. 1 shows a block diagram of the speech document search device according to the invention and the overall processing flow. The recognition and indexing of speech documents, which are performed offline, are described first.

Speech to be searched is retrieved from the storage unit for the target speech document data (ア), recognized by the large-vocabulary continuous speech recognizer for known-word search (イ), converted into a word sequence, and stored in the storage unit for large-vocabulary continuous speech recognition results (ウ) (S3). For the recognized word sequences of known words, the index creation unit for known-word search (エ) tabulates each word and its positions of occurrence in an inverted-index data structure.

In parallel, the phoneme/syllable recognition unit for searching misrecognized and unknown words (オ) also recognizes the speech as a sequence of phonemes or syllables, the basic recognition units of speech. Because phoneme and syllable recognition is particularly difficult, multiple recognition candidates are output, here in lattice form (the output format is arbitrary), and stored in the storage unit for phoneme/syllable recognition results (カ) (the above is S4). These phoneme/syllable recognition results are likewise indexed and tabulated, as for known words, by the phoneme/syllable index creation unit for unknown and misrecognized word search (キ), using the following method.

Taking n consecutive phonemes or syllables as a unit (hereinafter, an n-gram), each n-gram is listed in a table together with the starting position at which it occurs in the phoneme/syllable recognition result and its distance, which serve as the index. N-grams are created from the beginning to the end of the speech document, shifting by one phoneme or syllable at a time. The entries are arranged in lexicographic order so that they can be searched quickly by binary search (the above is S5).

The procedure of S5 is explained using the example in FIG. 2, in which there is only one recognition candidate. N-grams created from the first-candidate phoneme/syllable recognition result alone are given a distance of 0.

For simplicity, let n = 3. The first syllable 3-gram (trigram) of the speech document is "fu u ri", and its starting position is 0, so its index is 0, its distance is 0, and its insertion-error field is 0. The next trigram is "u ri e", with index 1, distance 0, and insertion-error field 0. The same operation is repeated to the end of the speech document, after which the trigrams are sorted in lexicographic order. When the same trigram occurs at several places, the identical trigrams appear consecutively after sorting. There are various ways of storing such runs of identical trigrams; for example, a single representative entry can be kept in the main table and the remaining occurrences stored in a separate table.
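The single-candidate case can be sketched as follows. The entry layout `(n-gram, position, distance, insertion flag)` and the function name `build_trigram_array` are assumptions for illustration; only the first four syllables come from the example in the text, and the trailing "ki" is made up to give a third trigram.

```python
def build_trigram_array(syllables, n=3):
    """Slide a window of n syllables over the 1-best recognition result.

    Each entry is (n-gram, start position, distance, insertion flag).
    With only the first candidate, distance and insertion flag are both 0.
    """
    entries = []
    for start in range(len(syllables) - n + 1):
        gram = tuple(syllables[start:start + n])
        entries.append((gram, start, 0.0, 0))
    entries.sort()  # lexicographic order, so the table can be binary-searched
    return entries


array = build_trigram_array(["fu", "u", "ri", "e", "ki"])
for entry in array:
    print(entry)
# (('fu', 'u', 'ri'), 0, 0.0, 0)
# (('ri', 'e', 'ki'), 2, 0.0, 0)
# (('u', 'ri', 'e'), 1, 0.0, 0)
```

Sorting tuples puts identical trigrams next to each other, ordered by position, which matches the observation that repeated trigrams appear as a run in the sorted table.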

FIG. 3 shows the use of multiple recognition candidates to handle substitution errors. Here the number of recognition candidates per syllable is three. Trigram index entries are created as in FIG. 2, but the example in FIG. 3 shows a trigram formed by taking the first candidate for its first syllable, the second candidate for its middle syllable, and the third candidate for its last syllable. The distance in this case is the sum of the distances from the first candidates, that is, d(e, u) + d(ki, ri), where d(syllable i, syllable j) denotes the distance between syllable i and syllable j. These inter-syllable distances are defined in advance and stored in a table. The distance between phonemes or syllables is defined as the Bhattacharyya distance, and the distance between syllable a and syllable b is given by Equation 2.

[Equation 2] (Bhattacharyya distance)

The Bhattacharyya distance represents the distance between multidimensional normal distributions. A syllable model is represented by M states, each consisting of a sum of multiple normal distributions.
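Equation 2 itself is not reproduced in this text. For orientation only, the standard Bhattacharyya distance between two multivariate normal distributions with means \mu_a, \mu_b and covariances \Sigma_a, \Sigma_b is shown below. How the patent combines this quantity over the M states and mixture components of each syllable model is not stated here, so the block should not be read as Equation 2.

```latex
D_B(a, b) = \frac{1}{8} (\mu_a - \mu_b)^{\top} \Sigma^{-1} (\mu_a - \mu_b)
          + \frac{1}{2} \ln \frac{\det \Sigma}{\sqrt{\det \Sigma_a \, \det \Sigma_b}},
\qquad \Sigma = \frac{\Sigma_a + \Sigma_b}{2}
```

The first term measures the separation of the means and the second the mismatch of the covariances; the distance is 0 when the two distributions are identical.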

In this example there is no insertion error, so the insertion-error field is 0. In this way, trigram index entries are created for all combinations of the multiple recognition candidates.
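A minimal sketch of enumerating all candidate combinations, assuming the lattice is simplified to a list of per-position candidate lists (best candidate first). The names `trigrams_with_substitutions`, `toy_distance`, and the distance values in `TOY_DISTANCE` are invented for illustration; a real system would look the distances up in the precomputed inter-syllable table.

```python
from itertools import product

# Hypothetical inter-syllable distances; real values would come from the precomputed table.
TOY_DISTANCE = {("u", "e"): 0.5, ("ri", "ki"): 0.25}


def toy_distance(first, other):
    """Distance between the first candidate and the chosen candidate (0 if identical)."""
    if first == other:
        return 0.0
    return TOY_DISTANCE.get((first, other), TOY_DISTANCE.get((other, first), 1.0))


def trigrams_with_substitutions(candidates, distance, n=3):
    """candidates[t] lists the recognition candidates at position t, best first."""
    entries = []
    for start in range(len(candidates) - n + 1):
        window = candidates[start:start + n]
        for combo in product(*window):
            # Sum of the distances of each chosen syllable from the first candidate.
            dist = sum(distance(slot[0], chosen) for slot, chosen in zip(window, combo))
            entries.append((combo, start, dist, 0))  # insertion flag stays 0
    entries.sort()
    return entries


lattice = [["fu"], ["u", "e"], ["ri", "ki"]]
entries = trigrams_with_substitutions(lattice, toy_distance)
for entry in entries:
    print(entry)
# (('fu', 'e', 'ki'), 0, 0.75, 0)
# (('fu', 'e', 'ri'), 0, 0.5, 0)
# (('fu', 'u', 'ki'), 0, 0.25, 0)
# (('fu', 'u', 'ri'), 0, 0.0, 0)
```

The entry for "fu e ki" carries d(e, u) + d(ki, ri), the same form of sum as in the FIG. 3 example. With k candidates per position the table grows by a factor of up to k^n, which is the storage cost paid for avoiding detailed matching at search time.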

FIG. 4 shows an example of handling insertion errors. For simplicity, only the first-candidate recognition result is shown. Trigram index entries are created from this sequence on the assumption that insertion errors may have occurred. In the example of FIG. 4, the recognition result "ku" at position 1 is assumed to be an insertion error, and the trigram "fu u ri" is created by skipping that syllable. Because this entry was created by assuming an insertion error, its insertion field is 1. Insertion errors are not assumed without limit; the assumption is matched to the insertion-error tendency of the actual speech recognizer, for example that one syllable in three may be inserted.
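The skip operation can be sketched as below, under the assumption that at most one inner syllable of each four-syllable window is treated as inserted. The function name and the entry layout `(n-gram, position, distance, insertion flag)` are hypothetical; the patent's exact enumeration in FIG. 4 is not reproduced here.

```python
def trigrams_with_insertion(syllables, n=3):
    """Build plain n-grams plus n-grams that skip one assumed inserted syllable."""
    entries = []
    # Plain n-grams: no insertion assumed, flag 0.
    for start in range(len(syllables) - n + 1):
        entries.append((tuple(syllables[start:start + n]), start, 0.0, 0))
    # Take n + 1 syllables and drop one inner syllable as a presumed insertion.
    # Dropping the first or last syllable would only reproduce a plain n-gram.
    for start in range(len(syllables) - n):
        window = list(syllables[start:start + n + 1])
        for skip in range(1, n):
            gram = tuple(window[:skip] + window[skip + 1:])
            entries.append((gram, start, 0.0, 1))  # flag 1: insertion assumed
    entries.sort()
    return entries


entries = trigrams_with_insertion(["fu", "ku", "u", "ri"])
for entry in entries:
    print(entry)
# (('fu', 'ku', 'ri'), 0, 0.0, 1)
# (('fu', 'ku', 'u'), 0, 0.0, 0)
# (('fu', 'u', 'ri'), 0, 0.0, 1)
# (('ku', 'u', 'ri'), 1, 0.0, 0)
```

Skipping "ku" at position 1 yields "fu u ri" with the insertion flag set, as in the FIG. 4 example.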

In practice, substitution and insertion errors are handled together, so the operations of both FIG. 3 and FIG. 4 are applied to the multiple-candidate recognition results.

With the above method, the speech documents to be searched are indexed offline. The online search for a query word is described next.

When the query word entered by typing or by voice at the query input unit (ク) (S1) is a known word (S2), the search unit for known words (ケ) searches for it using ordinary text search techniques (S6) and obtains the search result (S7). The result is displayed to the user by the known-word search result display unit (サ).

When the query is an unknown word, that is, a word not in the speech recognition dictionary (S2), the search unit for unknown and misrecognized words (コ) converts it into a phoneme/syllable sequence (S8) and divides it into units of n consecutive symbols, i.e., n-grams (assume here that it is divided into m parts). For each n-gram independently, the index table in which the speech documents have been indexed by n-gram as described above is searched at high speed by binary search (S9), and the search results are obtained (S10). The results are displayed to the user by the display unit for unknown and misrecognized word search results (シ).
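The binary-search lookup of step S9 can be sketched with Python's standard `bisect` module. The class name `TrigramIndex` and the sample entries are assumptions; entries use the illustrative layout `(n-gram, position, distance, insertion flag)`.

```python
from bisect import bisect_left, bisect_right


class TrigramIndex:
    """Sorted table of (n-gram, position, distance, insertion flag) entries."""

    def __init__(self, entries):
        self.entries = sorted(entries)
        # Keys are kept in a parallel list so each lookup is O(log N).
        self.keys = [entry[0] for entry in self.entries]

    def lookup(self, gram):
        """Return every entry whose n-gram equals `gram`."""
        lo = bisect_left(self.keys, gram)
        hi = bisect_right(self.keys, gram)
        return self.entries[lo:hi]


# Made-up index entries for illustration.
index = TrigramIndex([
    (("u", "ri", "e"), 1, 0.0, 0),
    (("fu", "u", "ri"), 0, 0.0, 0),
    (("fu", "u", "ri"), 7, 0.5, 1),
])
print(index.lookup(("fu", "u", "ri")))
# [(('fu', 'u', 'ri'), 0, 0.0, 0), (('fu', 'u', 'ri'), 7, 0.5, 1)]
```

Because the lookup cost depends on the logarithm of the table size rather than on the duration of the audio, the search time does not grow in proportion to the amount of speech data.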

Various modifications of the above high-speed search for unknown words and misrecognized words are conceivable. For example, because the search unit is limited to a fixed length (the trigram), a calculation method or a table that maps a given trigram one-to-one to its location in the table can also be used. The search result generally consists, for each of the m n-grams, of multiple occurrence positions in the speech together with scores (a distance or a likelihood). Among the candidates found independently for the m n-grams, those whose occurrence positions are consecutive without overlapping one another are taken as valid search position candidates. For these, the m consecutive scores are summed, and candidates that satisfy a preset threshold condition are returned as search results. The score reflects whether a hit came from an n-gram that assumed an insertion error, and whether it came from an n-gram that assumed a deletion error in the search term (described below); there are various ways to reflect this. When the search term cannot be divided evenly into n-gram units, it is divided with overlap allowed. For example, a word of seven syllables is divided into three trigrams at positions 1 to 3, 3 to 5, and 5 to 7. FIG. 5 shows how words of other syllable lengths are divided.
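The splitting and merging steps can be sketched as follows. This is an illustration under stated assumptions, not the method of FIG. 5: the start positions are spread evenly over the query (which reproduces the 1 to 3, 3 to 5, 5 to 7 split for seven syllables), scores are treated as distances so that smaller is better, and the names `split_query` and `merge_hits` are hypothetical.

```python
def split_query(syllables, n=3):
    """Split a query into m n-grams, overlapping only when len is not a multiple of n.

    Returns a list of (start, ngram) with 0-based start positions.
    """
    length = len(syllables)
    if length < n:
        raise ValueError("query is shorter than n")
    m = -(-length // n)  # ceil(length / n)
    if m == 1:
        return [(0, tuple(syllables))]
    starts = [k * (length - n) // (m - 1) for k in range(m)]
    return [(s, tuple(syllables[s:s + n])) for s in starts]

def merge_hits(segments, hits_per_segment, threshold):
    """Keep positions where all segments occur at consistent offsets.

    hits_per_segment[k] maps a document position to the score of segment k there.
    Scores are assumed to be distances; a candidate is kept when the sum of
    its m scores does not exceed the threshold.
    """
    results = []
    first_start = segments[0][0]
    for pos, score in sorted(hits_per_segment[0].items()):
        total = score
        for (start, _), hits in zip(segments[1:], hits_per_segment[1:]):
            expected = pos + (start - first_start)
            if expected not in hits:
                break  # this segment does not continue the match
            total += hits[expected]
        else:
            if total <= threshold:
                results.append((pos, total))
    return results

print(split_query(list("abcdefg")))  # seven symbols, three overlapping trigrams
```

The insertion and deletion penalties mentioned in the text would be folded into the per-segment scores before `merge_hits` is called.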

For deletion errors in the recognition results, deletions are allowed in the phoneme or syllable string of the search term, and each resulting string is treated as a new search term and searched in the same way. Deletion errors are not assumed without limit, however; they are matched to the deletion-error tendency of the speech recognizer. Normally, one deletion is assumed per three consecutive syllables. FIG. 6 shows how trigrams are created to handle deletion errors in the search term. Thus, when an unknown-word search term is given, deletion errors are assumed, the search term is divided into multiple n-gram units, and the index table is searched independently for each. The results are then integrated to obtain the result for the search term. For a trigram created with a deletion assumed, that fact is recorded as shown in FIG. 6.
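As a rough illustration of treating a query with an assumed deletion as a new search term, the sketch below generates every variant of a query with exactly one syllable dropped and records which syllable was dropped. This is a simplification: it does not reproduce the trigram construction of FIG. 6 or the one-per-three-syllables rate, and the name `deletion_variants` is hypothetical.

```python
def deletion_variants(syllables):
    """Return (variant, deleted_index) for each distinct single-syllable deletion."""
    syl = list(syllables)
    seen = set()
    variants = []
    for i in range(len(syl)):
        variant = tuple(syl[:i] + syl[i + 1:])
        if variant not in seen:  # repeated syllables can give identical variants
            seen.add(variant)
            variants.append((variant, i))
    return variants

# Hypothetical four-syllable query used only for illustration.
for variant, dropped in deletion_variants(["to", "yo", "ha", "shi"]):
    print(dropped, variant)
```

Each variant would then be split into n-grams and searched like any other query, with the recorded deletion used later to penalize its score.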

FIG. 7 shows an example of handling substitution errors by combining the insertion-error countermeasure applied to the recognition results with the deletion-error countermeasure applied to the search term. Using the two together also makes it possible to handle substitution errors.

The description so far has used the distance from the first candidate as the measure for recognition errors. However, the output of the recognizer is accompanied by a likelihood (log posterior probability) that expresses the confidence of the recognition result, so this value can be used instead. The log posterior probability is defined by Equation 1 above. Equation 1 represents the log probability that the HMM of syllable string S generates the input feature parameter sequence a_i a_{i+1} ... a_j, spanning the i-th to the j-th time segment of the speech input pattern.

The method of the present invention was applied to a database of academic lecture speech from the Corpus of Spontaneous Japanese, with a total speech duration of 44 hours. The voice search device of FIG. 1 was built in the C language on a personal computer (Intel (registered trademark) Xeon (registered trademark) X5365, 3 GHz, 33 GB of memory) in which a CPU, memory, external storage device and the like are operatively electrically connected. To evaluate the search for unknown words in particular, continuous syllable recognition was first performed and recognition results were output up to the third candidate (referred to as a syllable lattice). A trigram array was then built as an index over these recognition results. FIG. 8 shows the internal representation of the trigram array in the storage device (SIL denotes the sentence-start symbol). For Japanese, there are 116 syllable types including those used for loanwords, so the combination of an index and three syllables can be stored in 4 bytes, that is, one long integer. The index table for the 44 hours of speech data occupied 1.5 GB. This is less than the storage required for the original speech waveform (3600 × 44 hours × 16 kHz × 2 bytes = 5 GB).
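The 4-byte claim and the waveform size can be checked with a short sketch. With 116 syllable types, each syllable ID fits in 7 bits, so three syllables take 21 bits and leave 11 bits of a 32-bit word. The bit layout below, the `extra` field standing in for the "index" part, and the function names are all assumptions made for illustration; FIG. 8 defines the actual representation.

```python
NUM_SYLLABLES = 116   # syllable inventory size stated in the description
EXTRA_BITS = 11       # 32 bits - 3 * 7 bits; hypothetical use of the remainder

def pack_trigram(s1, s2, s3, extra=0):
    """Pack three syllable IDs (0..115) and an 11-bit field into one 32-bit integer."""
    for s in (s1, s2, s3):
        if not 0 <= s < NUM_SYLLABLES:
            raise ValueError("syllable ID out of range")
    if not 0 <= extra < (1 << EXTRA_BITS):
        raise ValueError("extra field out of range")
    # Syllables occupy the high bits, so integer order follows trigram order.
    return (s1 << 25) | (s2 << 18) | (s3 << 11) | extra

def unpack_trigram(value):
    """Inverse of pack_trigram."""
    return ((value >> 25) & 0x7F, (value >> 18) & 0x7F,
            (value >> 11) & 0x7F, value & 0x7FF)

def waveform_bytes(hours, sample_rate_hz=16_000, bytes_per_sample=2):
    """Raw waveform size: seconds * samples per second * bytes per sample."""
    return 3600 * hours * sample_rate_hz * bytes_per_sample

print(waveform_bytes(44))  # about 5 GB, as stated in the description
```

Placing the syllables in the high-order bits means that sorting the packed integers also sorts the trigrams, which suits a binary-searched table.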

The distance between syllables was defined as the Bhattacharyya distance between the normal distributions of the speech feature vectors in each state of the syllable-unit HMMs. For the search score, when an insertion error is taken into account, α times the number of insertions is added to the score, and when a deletion error is assumed, β times the number of deletions is added.
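The two ingredients of the score can be sketched as below. The first function is the standard Bhattacharyya distance between two normal distributions, written here for diagonal covariances, which is an assumption; the description does not say whether the HMM states use diagonal covariances or how the per-state distances are combined into one syllable distance. The second function applies the α and β penalties; both function names are hypothetical.

```python
import math

def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = 0.5 * (v1 + v2)                          # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v           # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))  # covariance-mismatch term
    return dist

def penalized_score(base, insertions, deletions, alpha, beta):
    """Add alpha per assumed insertion and beta per assumed deletion to a distance score."""
    return base + alpha * insertions + beta * deletions

# Two one-dimensional unit-variance Gaussians whose means differ by 2.
print(bhattacharyya_diag([0.0], [1.0], [2.0], [1.0]))
```

Because the penalties are added to a distance, hits that rely on assumed insertions or deletions rank below exact hits with the same acoustic distance.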

As search terms, 43 words were used as unknown words: each was uttered four times or fewer (about once per 10 hours) in the 44 hours of speech data (about 220,000 words uttered) and is not in the 20,000-word dictionary of the large-vocabulary continuous speech recognizer. The total number of occurrences is 142, an average of about three occurrences per word (once per 15 hours, or in other words once in every 70,000 uttered words). It is easy to see that this is a very difficult search problem.

First, Table 1 shows the syllable recognition performance.

Here,
Correct rate = 1.0 − substitution rate − deletion rate,
Recognition accuracy = 1.0 − substitution rate − deletion rate − insertion rate.
This performance is that of the speech recognizer itself; it improves year by year but is not directly related to the present invention. Note, however, when interpreting the evaluation results of the present invention, that better syllable recognition performance leads to better search performance. According to Table 1, 87% of the uttered syllables were correctly recognized within the top three candidates. The insertion error rate is 3% and the deletion error rate is 6%.

Next, Table 2 shows the search results for unknown words.

For comparison, Table 3 shows the results of a method that does not use the trigrams with distance that form the basis of the present invention, but instead searches with plain trigrams and removes excess candidates by detailed DP matching.

Here,
Recall = number correctly detected / total number of search-term occurrences,
Precision = number correctly detected / number detected.
In the tables, "no narrowing" denotes the result of searching with a conventional trigram array without distances. Table 3 shows the case where the candidate sections in that result are narrowed down by DP matching, that is, by detailed syllable-to-syllable comparison. Comparing Tables 2 and 3 shows that the performance is almost the same. Roughly speaking, when an unknown word that appears 4 times in 44 hours (about once per 10 hours) is searched for, 20 candidate locations (about one per 2 hours) are detected, of which 2 are correct.

The search time, on the other hand, is 2.5 ms per search term with the method of the present invention and 15 ms with our conventional method that also uses DP matching. The search time of the present method is proportional to the logarithm of the duration of the speech to be searched, whereas that of our conventional method is linearly proportional to it. For example, when searching 10,000 hours of speech data, the present method takes on the order of a few tens of milliseconds per search term, while our conventional method takes about 5 seconds.

ア: Storage unit for the speech document data to be searched
イ: Large-vocabulary continuous speech recognition unit for known-word search
ウ: Storage unit for large-vocabulary continuous speech recognition results
エ: Index creation unit for known-word search
オ: Phoneme/syllable recognition unit for searching misrecognized words and unknown words
カ: Storage unit for phoneme/syllable recognition results
キ: Phoneme/syllable index creation unit for searching unknown words and misrecognized words
ク: Search word input unit (typing or voice input)
ケ: Search unit for known words
コ: Search unit for unknown words and misrecognized words
サ: Search result display unit for known words
シ: Display unit for search results of unknown words and misrecognized words

Claims

A voice search device comprising:
a search term input unit for voice or text input, and a storage unit for the continuous speech database to be searched;
a large vocabulary continuous speech recognition unit for recognizing speech data from the input unit and the database storage unit;
a continuous speech data recognition result storage unit for storing a recognition result of the large vocabulary continuous speech recognition unit;
an unknown word index creating unit for indexing unknown words in the continuous speech database;
a phoneme / syllable recognition unit that recognizes speech data of unknown words by dividing them into phonemes or syllables, which are basic units of speech;
a phoneme / syllable recognition result storage unit for storing a recognition result of the phoneme / syllable recognition unit; and
a phoneme / syllable string search unit that presents at least one search candidate,
wherein the phoneme / syllable recognition unit has a function of generating a plurality of detection candidates for an unknown word to which the index is assigned.

The voice search device according to claim 1,
wherein the phoneme / syllable recognition unit has a function of assuming substitution errors and insertion errors from at least one recognition result candidate, assigning the assumed misrecognitions as an index, and presenting detection candidates.

The voice search device according to claim 1 or 2,
wherein the phoneme / syllable recognition unit has a function of assuming a deletion error from at least one recognition result candidate, treating the result with the assumed deletion error as a search word, assigning it as an index, and presenting detection candidates.

The voice search device according to claim 3,
wherein the phoneme / syllable recognition unit has a function of assuming substitution errors, insertion errors, and/or deletion errors from at least one recognition result candidate, assigning the assumed misrecognitions as an index, and selecting detection candidates, based on a preset threshold value, from the set of detection candidates obtained using the information of the divided search word.

The voice search device according to claim 1, wherein:
the phoneme / syllable recognition unit assigns an index to at least one recognition result candidate using an assumed distance between phonemes and between syllables,
and has a function of presenting detection candidates based on the distance of the second or third recognition result candidate from the first recognition result candidate, the distance reflecting acoustic similarity to the first recognition result candidate.

The voice search device according to claim 1, wherein:
The phoneme / syllable recognition unit assigns an index to at least one or more detection candidates using a log likelihood defined by Equation 1,

and has a function of presenting detection candidates based on the log likelihood of the recognition result candidates.

A voice search method comprising:
a search word input step by voice or text, and a step of storing the continuous speech database to be searched;
a large vocabulary continuous speech recognition step for recognizing speech data from the input unit and the database storage unit;
a continuous speech data recognition result storing step for storing a recognition result of the large vocabulary continuous speech recognition unit;
an unknown word index creation step for indexing unknown words in the continuous speech database;
a phoneme / syllable recognition step for recognizing the speech data of unknown words by dividing them into phonemes or syllables, which are the basic units of speech;
a phoneme / syllable recognition result storing step for storing a recognition result of the phoneme / syllable recognition unit; and
a phoneme / syllable search step for presenting at least one search candidate,
wherein the phoneme / syllable recognition step has a function of generating a plurality of detection candidates for an unknown word to which the index is assigned.

The voice search method according to claim 7,
wherein the phoneme / syllable recognition step has a function of assuming substitution errors and insertion errors from at least one recognition result candidate, assigning the assumed misrecognitions as an index, and presenting detection candidates.

The voice search method according to claim 7 or 8,
wherein the phoneme / syllable recognition step has a function of assuming a deletion error from at least one recognition result candidate, treating the result with the assumed deletion error as a search word, assigning it as an index, and presenting detection candidates.

The voice search method according to claim 9,
wherein the phoneme / syllable recognition step has a function of assuming substitution errors, insertion errors, and/or deletion errors from at least one recognition result candidate, assigning the assumed misrecognitions as an index, and selecting detection candidates, based on a preset threshold value, from the detection candidates obtained using the information of the divided search word.

The voice search method according to claim 7, wherein:
In the phoneme / syllable recognition step, at least one recognition result candidate is indexed using a virtual distance between phonemes and between syllables,
and detection candidates are presented based on the distance of the second or third recognition result candidate from the first recognition result candidate, the distance reflecting acoustic similarity to the first recognition result candidate.

The voice search method according to claim 7, wherein:
In the phoneme / syllable recognition step, at least one recognition result candidate is indexed using a log likelihood defined by Equation 1,

and detection candidates are presented based on the log likelihood of the recognition result.