JP2010123005A

JP2010123005A - Document data retrieval device

Info

Publication number: JP2010123005A
Application number: JP2008297387A
Authority: JP
Inventors: Shin Jo; ▲シン▼ 徐; Tsuneo Kato; 恒夫加藤; Hisashi Kawai; 恒河井; Masaki Naito; 正樹内藤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2008-11-20
Filing date: 2008-11-20
Publication date: 2010-06-03
Anticipated expiration: 2028-11-20
Also published as: JP5308786B2

Abstract

PROBLEM TO BE SOLVED: To provide a document (lyrics) data retrieval device for a text document retrieval system that searches a large number of text documents, which can retrieve the target content or document by evaluating similarity at high speed even when the input search string does not match the original text.

SOLUTION: Words that are likely to be used in searches (search words) are predicted and registered, and their acoustic similarity distance values to the search target words in each lyrics file are calculated in advance. An index file 22 is then created that records, for each keyword (a search word that may appear in a search string), the acoustically similar word in the lyrics file, the acoustic similarity distance value between the two words, the position information of the search target word in the lyrics file, and importance information. This speeds up the similarity calculation in similar-text retrieval.

COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a search device used in content search systems based on content-related terms and in text search systems that search large volumes of document data (text documents). More specifically, it relates to a document data search device that can find the target content or document data by evaluating similarity at high speed even when the input search word does not match the original text in the document data.

Text search devices perform exact-match keyword search. To improve usability, however, they are also expected to support "fuzzy search", which treats strings that do not match exactly, such as 「インタフェース」 (intafēsu) and 「インターフェス」 (intāfesu), two spellings of "interface", as matching.

For example, the similar-text search device described in Patent Document 1 creates N-gram elements (sequences of N characters) for each text and calculates the degree of agreement between the N-gram elements of multiple texts. It achieves fast similar-text search by outputting the search candidates in descending order of agreement.
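As a rough illustration of this kind of N-gram matching, the sketch below computes character bigrams for two strings and scores their overlap. The Dice coefficient used as the agreement score and the function names `char_ngrams` and `ngram_agreement` are assumptions made for the example; the actual scoring used in Patent Document 1 is not given in this text.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams contained in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_agreement(a, b, n=2):
    """Dice coefficient over the two n-gram sets (an assumed scoring rule)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    shared = len(grams_a & grams_b)
    return 2 * shared / (len(grams_a) + len(grams_b))


# The two spellings of "interface" each have 6 bigrams and share 3 of them.
print(ngram_agreement("インタフェース", "インターフェス"))  # 0.5
```

A score like this rewards shared character runs, which is why it handles spelling variants well but gives little help when a misheard word shares no characters with the original.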

The speech search system described in Patent Document 2 searches for words and phrases in speech data that has been processed by speech recognition. To cope with cases where the string being searched for was misrecognized or is an unknown word, the system converts the input search string into a phoneme sequence. Using the search algorithm employed in continuous word speech recognition, it then expands the query into similar words or similar word sequences that could appear in the recognition results of the target speech data, and searches with those.
Take the search word (string) 「ハリーポッター」 ("Harry Potter") as an example. Word candidates such as 「ハリー」 (Harry), 「ポスター」 (poster), 「は」 (the particle wa), and 「リポーター」 (reporter) are picked up from the recognition results of the speech data. These candidates are then concatenated to expand the query into word sequences that are acoustically close to the search string, such as 「ハリー」+「ポスター」 and 「は」+「リポーター」. Because the expansion keeps the acoustic distance to the search string small, it yields the word sequences, misrecognitions included, that are most likely to be the recognition result of the search string.
JP 2003-288366 A; JP 2006-31278 A

Consider a document data search device that searches for songs using part of the lyrics as the search keyword, and a user who searches with lyrics that were misheard and remembered incorrectly. For example, the original lyrics contain the string 「無数の光」 (musū no hikari, "countless lights"), but the user enters 「まっすぐの光」 (massugu no hikari, "straight light") as the search string.
In this case, even if string edit distances are calculated or N-gram indexing is performed as in the similar-text search device of Patent Document 1, 「無数」 and 「まっすぐ」 are too far apart in edit distance and similar measures, so the target song information (lyrics text) cannot be found.

The device of Patent Document 2, for its part, must calculate the acoustic distance between the search string and the candidate words on every search in order to expand the search string into word sequences. In particular, when the search query contains many strings (two or more non-contiguous lyric keywords) and there are many lyrics files to search (tens of thousands or more in a commercial search system), this online computation takes time and the search becomes very slow.

The present invention was proposed in view of these circumstances. Its object is to provide a document data search device for a text document search system that searches a large number of text documents, which can find the target content or document by evaluating similarity at high speed even when the input search string does not match the original text.

To achieve the above object, the invention of claim 1 is a search device that has an input interface for entering a search string and that searches a plurality of search target text data for document data using the search string. It is characterized by comprising the following elements.
Word extraction means. This means receives a plurality of text data in advance and extracts the words that make up the document data.
Keyword registration means. This means registers, as keywords in a list, search words that may be included in the search string.
INDEX file creation means. This means creates a search INDEX file using acoustic similarity distance values calculated by comparing the kana readings or phonemes into which the words extracted by the word extraction means and the keywords are decomposed.
Similarity calculation means. Based on the search words obtained from the input search string, this means refers to the INDEX file and calculates the similarity of each search target document data to the input search words.
Output interface. This interface outputs the document data search results based on the similarity.

The invention of claim 2 is the document data search device of claim 1, characterized in that the acoustic similarity distance value of the INDEX file creation means is calculated as follows. Means for calculating the edit distance between the keyword and the extracted word converts each of them into a kana reading sequence. DP matching is then performed on the two kana reading sequences with the syllable as the unit, by calculation means that assigns desired values to the distance between syllables (the penalty for two different syllables), the penalty for inserting a syllable, and the penalty for dropping a syllable.

The invention of claim 3 is the document data search device of claim 1, characterized in that the acoustic similarity distance value of the INDEX file creation means is calculated as follows. Means for calculating the edit distance between the two phoneme sequences of the keyword and the extracted word converts the reading of each word into a phoneme sequence. DP matching is then performed on the two phoneme sequences with the phoneme as the unit, by calculation means that assigns desired values to the distance between phonemes (the penalty for two different phonemes), the penalty for inserting a phoneme, and the penalty for dropping a phoneme.

The invention of claim 4 is the document data search device of any one of claims 1 to 3, characterized in that the INDEX file creation means creates a file that registers importance information whose parameters are the position information and appearance frequency, within the document data, of the extracted words extracted from the search target text data.

The invention of claim 5 is the document data search device of claim 1, in which the output interface comprises means for calculating a fitness based on the acoustic similarity distance values and outputting the search results in descending order of fitness with the one or more search words obtained from the search string by way of the INDEX file. It is characterized in that the search target document data are narrowed down to the top N candidates in descending order of fitness with the search string according to the INDEX file, DP matching is calculated between the phoneme sequence of the search string and the phoneme sequence of the text of each of the N candidates, and the output order is determined using the result as the similarity.
The fitness calculated from the acoustic similarity distance value is, for example, a value that includes the reciprocal of the acoustic similarity distance value plus an arbitrary number.
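The fitness and the top-N narrowing of claim 5 can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout, and the example distances are hypothetical, and the offset of 0.1 is the value the embodiment later uses for β. The second stage, re-ranking the N candidates by phoneme-level DP matching, is not shown.

```python
def fitness(distance, offset=0.1):
    """Reciprocal of (distance + offset); a positive offset avoids division by zero."""
    return 1.0 / (distance + offset)


def top_n_candidates(distances, n, offset=0.1):
    """Rank documents by fitness, highest first, and keep the top n.

    distances: dict mapping a document id to its acoustic similarity distance.
    """
    ranked = sorted(distances,
                    key=lambda doc: fitness(distances[doc], offset),
                    reverse=True)
    return ranked[:n]


# Hypothetical distances for three lyrics files.
distances = {"S1": 0.25, "S2": 0.0, "S3": 0.667}
print(top_n_candidates(distances, n=2))  # ['S2', 'S1']
```

Because the fitness is a decreasing function of the distance, ranking by fitness gives the same order as ranking by ascending distance; the reciprocal form mainly makes "higher is better" scores available for later weighting.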

In other words, the characteristic feature of the invention is that the document data search device predicts the words that may be input, calculates in advance the acoustic similarity distance between each such word and the search target words in the document data, and has means for creating a search INDEX file. For the candidate word in the document data that is acoustically most similar to each word that may be searched for, the INDEX file is based on the acoustic similarity distance value between the two words, the position of the candidate word in the document data, and importance information.

According to the invention, even when the user searches with a misheard and misremembered search string that does not exactly match the search target words, candidates that are acoustically similar to the search string can be presented, so the desired text document (content) can be found.
In addition, because the acoustic similarity distances between the search keywords and the search target words are calculated in advance when the INDEX file is created, no acoustic similarity distance needs to be calculated at search time. This reduces search time and speeds up similar-text search.

An embodiment of the document data search device of the invention is described below with reference to the block diagram of FIG. 1.
The document data search device 1 is an example applied to a device for searching song lyrics. It comprises an input interface 2 for entering part of a lyric, such as a phrase, as the search string; an output interface 3 for outputting the search results for the lyrics data (document data); a lyrics database 4 storing the plurality of lyrics data (lyrics files) to be searched; and a control unit 10 that searches the lyrics data stored in the lyrics database 4 using the search string entered through the input interface 2.

The control unit 10 comprises word extraction means 11, which receives a plurality of lyrics data (text data) in advance and extracts the words that make up the lyrics (document data); keyword registration means 12, which registers as keywords in a list the search words that may be searched for (search words that may be included in a search string); INDEX file creation means 13, which creates a search INDEX file using the acoustic similarity distance values between the extracted words (search target words) and the registered keywords; and similarity calculation means 14, which refers to the INDEX file and calculates the similarity of the search target document data to the input search words.

In the control unit 10 of the document data search device 1, the INDEX file used at search time is created from information supplied in advance to the word extraction means 11, the keyword registration means 12, and the INDEX file creation means 13.
The procedure by which the control unit 10 creates the INDEX file is described below with reference to FIG. 2.

First, the word extraction means 11 performs morphological analysis on the document data (lyrics data) of the lyrics files S1 to Si stored in the lyrics database 4 to extract words, and creates an extracted word list (with readings) Li 20 for each lyrics file. The morphological analysis tool used here supports Japanese and English part-of-speech classification and can also handle spelling variations in loanwords and in English written in kana.

Next, the keyword registration means 12 predicts the words that will be entered for searching and creates a keyword list (A1, ..., An) 21 in which they are registered as keywords. To give the word set of the keyword list 21 flexibility (that is, to handle misheard and misremembered words), the keywords (search words that may be included in a search string) include not only all the words contained in the lyrics stored in the lyrics database but also words extracted from language corpora in domains other than lyrics.

For each word in the extracted word list Li 20 of each lyrics file, the acoustic similarity distance to keyword An is calculated with the acoustic similarity distance calculation method described later. The word Wni with the smallest acoustic similarity distance (the word in lyrics file Si that is acoustically most similar to An) is extracted for each file, and the INDEX file 22 is created by registering the distance value dni (the acoustic similarity distance between An and its similar word Wni in lyrics file Si).
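The index construction just described can be sketched as a nested loop over keywords and lyrics files. Everything concrete here is an assumption for illustration: the function names, the dictionary layout, the example word lists, and in particular `toy_distance`, which is a simple stand-in and not the DP matching of the embodiment. The sketch also records the 1-based positions of the nearest word, which the INDEX file may hold as well.

```python
def toy_distance(a, b):
    """Stand-in distance (NOT the embodiment's DP matching): position-wise
    mismatches plus the length difference, divided by the longer length."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return (mismatches + abs(len(a) - len(b))) / longer


def build_index(keywords, lyrics, distance):
    """For each keyword A_n and lyrics file S_i, record the nearest word W_ni,
    its distance d_ni, and its 1-based positions p_ni in the file."""
    index = {}
    for keyword in keywords:
        index[keyword] = {}
        for file_id, words in lyrics.items():
            if not words:
                continue
            nearest = min(words, key=lambda w: distance(keyword, w))
            positions = [i + 1 for i, w in enumerate(words) if w == nearest]
            index[keyword][file_id] = {
                "word": nearest,
                "distance": distance(keyword, nearest),
                "positions": positions,
            }
    return index


# Hypothetical word lists (kana readings) for two lyrics files.
lyrics = {
    "S1": ["しんおおさか", "おおつか", "えき"],
    "S2": ["おおさか", "の", "まち", "おおさか"],
}
index = build_index(["おおさか"], lyrics, toy_distance)
print(index["おおさか"]["S1"])  # nearest word おおつか, distance 0.25, position 2
print(index["おおさか"]["S2"])  # nearest word おおさか, distance 0.0, positions 1 and 4
```

All of the distance work happens inside `build_index`, which reflects the point of the design: the expensive comparisons are done once, offline, so that a search only needs dictionary lookups.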

The acoustic similarity distance values in the INDEX file 22 can be calculated as follows.
The INDEX file creation means converts the readings of the keyword (search word) and the extracted word (search target word) into kana reading sequences or phoneme sequences, using either means for calculating the edit distance between the kana readings of the two words or means for calculating the edit distance between their two phoneme sequences, and obtains the similarity distance value by DP matching.

When the acoustic similarity distance value is obtained by calculating the edit distance between kana reading sequences, the means for calculating the edit distance between words first converts the reading of each word into a kana reading sequence. DP matching is then performed on the two kana reading sequences with the syllable as the unit. The distance between kana is taken to be the distance between the distributions of the acoustic models representing each syllable, and the distance for inserting or dropping a syllable is set to 1.

When the acoustic similarity distance value is obtained by calculating the edit distance between phoneme sequences, the means for calculating the edit distance between the two phoneme sequences first converts the reading of each word into a phoneme sequence. DP matching is then performed on the two phoneme sequences with the phoneme as the unit. The distance between phonemes is taken to be the distance between the probability distributions of the acoustic models representing each phoneme, and the distance for inserting or dropping a phoneme is set to 1.

That is, the search keyword An and the search target word are both converted into kana reading sequences or phoneme sequences, DP matching is performed on the two sequences with the character or phoneme as the unit, and the distance value produced by the DP matching calculation is the acoustic similarity distance value.
DP matching is a method for calculating the similarity between two data sequences. It is explained below with concrete calculation examples.
DP matching uses predefined penalties: 1 point when characters do not match, and 1 point each time a character is shifted by one position (an insertion or a deletion).

For example, when comparing kana readings, 「おおさか」 (Osaka) and 「おおつか」 (Otsuka), each four kana long, differ at one position and match at three. The mismatched part costs 1 × 1 = 1 and the matched part costs 0, for a total penalty of 1 + 0 = 1 point. To normalize the distance to a value between 0 and 1.0, the normalized penalty is calculated as total penalty / string length, giving 1/4 = 0.25. The string length used as the denominator is the number of characters in the longer of the two strings.
When comparing the six-character 「しんおおさか」 (Shin-Osaka) with the four-character 「おおさか」, the shorter string is regarded as having been shifted, or stalled, by two characters, so the comparison becomes 「しんおおさか」 versus 「おおおおさか」. The two shifts cost 1 × 2 = 2 and the two mismatches cost 1 × 2 = 2, so the normalized penalty is (2 + 2)/6 = 0.667 points. In DP matching a smaller penalty means higher similarity, so when the search word is 「おおさか」, 「おおつか」 is judged to be more similar than 「しんおおさか」.

In the same example, comparing 「おおさか」 and 「おおつか」 as phoneme sequences means comparing O, O, S, A, K, A with O, O, TS, U, K, A. There are no shifts (0 × 1 = 0) and two mismatches (2 × 1 = 2). The normalized penalty is stated as (0 + 2)/8 = 0.25 points, although dividing by the six phonemes listed for each word would give 2/6 ≈ 0.333. The denominator used for normalization is the number of phonemes in the longer phoneme sequence.
Comparing 「おおさか」 and 「しんおおさか」 as phoneme sequences means comparing O, O, S, A, K, A with S, I, N, O, O, S, A, K, A. Treating the comparison as O, O, O, O, O, S, A, K, A versus S, I, N, O, O, S, A, K, A, there are three shifts (3 × 1 = 3) and three mismatches (3 × 1 = 3), so the normalized penalty is (3 + 3)/9 = 0.667 points. At the phoneme level as well, when the search word is 「おおさか」, 「おおつか」 is judged to be more similar than 「しんおおさか」.
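The worked examples can be reproduced with a small dynamic-programming routine. The recurrence below is an assumption, since the text does not spell one out: it is a stretch-style alignment in which advancing only one sequence (repeating a symbol of the other) costs one shift penalty, and every aligned pair of unequal symbols costs one mismatch penalty. It was chosen because it yields the 0.25 and 0.667 figures above; a plain Levenshtein distance would give 2/6 for 「おおさか」 versus 「しんおおさか」. The function name and parameters are hypothetical.

```python
def dp_matching_distance(a, b, mismatch=1.0, shift=1.0):
    """Normalized DP-matching penalty between two non-empty symbol sequences.

    Assumed recurrence: a step that advances only one sequence costs `shift`,
    and every aligned pair of unequal symbols costs `mismatch`.
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    rows, cols = len(a), len(b)
    cost = [[inf] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            local = 0.0 if a[i] == b[j] else mismatch
            if i == 0 and j == 0:
                best = 0.0
            else:
                diagonal = cost[i - 1][j - 1] if i > 0 and j > 0 else inf
                stay_b = cost[i - 1][j] + shift if i > 0 else inf
                stay_a = cost[i][j - 1] + shift if j > 0 else inf
                best = min(diagonal, stay_b, stay_a)
            cost[i][j] = local + best
    # Normalize by the length of the longer sequence, as in the text.
    return cost[-1][-1] / max(rows, cols)


# Kana-level comparison (one character per kana).
print(dp_matching_distance("おおさか", "おおつか"))                # 0.25
print(round(dp_matching_distance("おおさか", "しんおおさか"), 3))  # 0.667

# Phoneme-level comparison (lists of phoneme symbols).
osaka = ["O", "O", "S", "A", "K", "A"]
otsuka = ["O", "O", "TS", "U", "K", "A"]
shin_osaka = ["S", "I", "N", "O", "O", "S", "A", "K", "A"]
print(round(dp_matching_distance(osaka, otsuka), 3))      # 0.333
print(round(dp_matching_distance(osaka, shin_osaka), 3))  # 0.667
```

For the phoneme pair 「おおさか」 and 「おおつか」 the sketch returns 2/6 ≈ 0.333 because it divides by the six phonemes passed in, whereas the text states 0.25. The `mismatch` and `shift` arguments correspond to the penalties that the claims allow to be set to other values.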

In the acoustic similarity distance calculation described above, the penalty for inserting or dropping a character (for kana readings) or a phoneme is 1, and the distance between syllables or phonemes (the penalty for two different phonemes) is 1, but other penalty values may be used.
For example, the inter-model distances of the phoneme acoustic models used in recognition may be used as the distances between phonemes. In this case the distance between phonemes can be defined as the Mahalanobis distance between the probability distributions of the acoustic models representing each phoneme. When the acoustic model of each phoneme is modeled with a single state and a single Gaussian distribution, the Mahalanobis distance ADij between the two models (phonemes i and j) is given by Equation 1 below.

In Equation 1, K is the number of dimensions of the MFCC (Mel-Frequency Cepstrum Coefficient) vector (K = 12), and μik and σik are the k-th elements of the mean and variance MFCC vectors of phoneme i, respectively.

Alternatively, the distance between phonemes may be calculated from an inter-phoneme confusion matrix, which is obtained from recognition experiments or the like and whose elements are expressed as probabilities. An example of calculating inter-phoneme distances from this confusion matrix is described with reference to Table 1.
Table 1 expresses as a matrix the probability that each input phoneme a, i, u, e, o, ... is heard as a, i, u, e, o, .... For example, it shows that the probability of phoneme a being heard as a is 0.9, as i is 0.2, as u is 0.3, and so on. The reciprocal of the probability in the confusion matrix can be defined as the distance between phonemes.
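As a small sketch of the reciprocal rule, the snippet below turns confusion probabilities into distances. Only the three probabilities quoted in the text are used; the dictionary layout, the function name, and the choice to return infinity for a missing or zero probability are assumptions made for the example.

```python
import math

# Only the three probabilities quoted in the text; the rest of Table 1 is not filled in.
confusion = {"a": {"a": 0.9, "i": 0.2, "u": 0.3}}


def confusion_distance(confusion, spoken, heard):
    """Distance between phonemes as the reciprocal of the confusion probability.

    A missing or zero probability is treated as infinitely far (assumption).
    """
    p = confusion.get(spoken, {}).get(heard, 0.0)
    return math.inf if p <= 0.0 else 1.0 / p


for heard in ("a", "i", "u"):
    print(heard, round(confusion_distance(confusion, "a", heard), 3))
# a 1.111
# i 5.0
# u 3.333
```

Note that under this rule a phoneme's distance to itself is 1/0.9 ≈ 1.11 rather than 0, so values taken from such a matrix are relative penalties rather than a metric in the strict sense.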

In addition to the acoustic similarity distance values, the INDEX file created by the INDEX file creation means 13 may also register information that serves as a criterion for selecting lyrics data for a search word (ranking factor values). Examples include the position information, within the lyrics data (document data), of the word similar to the search word (the search target word present in the lyrics) and importance information calculated with the appearance frequency as a parameter.

The position information is stored in the INDEX file as the appearance position pni of the similar word candidate (pni is the position at which similar word candidate Wni appears in lyrics file Si, that is, a number indicating which word in the file it is). If there is more than one appearance position pni, all of them are registered.
The importance information is defined as the feature tf·idf, which evaluates how important the search target word is within the lyrics data. With tf the appearance frequency of the search target word in a given lyrics data and idf = log(total number of search target documents / number of documents containing the search target word), the feature is their product, tf·idf.
To normalize tf·idf to a value between 0 and 1.0, the following calculation is performed. Let tf·idfni be the tf·idf value of word Wni in lyrics file Si, and let Σtf·idfi be the sum of the tf·idf values of all words in Si. The normalized value for Wni is tf·idf'ni = tf·idfni / Σtf·idfi.
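The importance calculation can be sketched as below. Several details are assumptions: tf is taken as the raw count of the word in the file, the logarithm is the natural logarithm (the base cancels in the normalization), the sum runs over the distinct words of the file, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def normalized_tfidf(documents):
    """Return, per document, each word's tf*idf divided by the document's tf*idf sum.

    documents: dict mapping a document id to its list of words.
    """
    total_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each word once per document

    result = {}
    for doc_id, words in documents.items():
        tf = Counter(words)  # raw count (assumed reading of "appearance frequency")
        raw = {w: tf[w] * math.log(total_docs / doc_freq[w]) for w in tf}
        total = sum(raw.values())
        # If every word occurs in every document, all idf values are 0.
        result[doc_id] = {w: (v / total if total > 0 else 0.0)
                          for w, v in raw.items()}
    return result


docs = {
    "S1": ["hikari", "sora", "hikari", "umi"],
    "S2": ["sora", "umi"],
    "S3": ["sora", "kaze"],
}
scores = normalized_tfidf(docs)
print(scores["S1"])
```

In this toy corpus "sora" appears in every file, so its idf and normalized score are 0, while "hikari", which occurs twice in S1 and nowhere else, receives most of S1's weight.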

Then, for every keyword A1 to An registered in the keyword list 21, a table is created that registers the word Wni with the smallest acoustic similarity distance, the distance value dni from the acoustic similarity distance calculation, the appearance position pni, and the normalized feature value tf·idf'. For lyrics file S10, for example, the table of the INDEX file 22 is created as shown in Table 2.

The similarity calculation means 14 of the control unit 10 calculates the similarity from the values registered in the table: the word Wni with the smallest acoustic similarity distance, the distance value dni from the acoustic similarity distance calculation, the appearance position pni, and the feature value tf·idf'.

The output interface 3 outputs the document data search results based on the result calculated by the similarity calculation means 14 of the control unit 10. It is configured to output the search results in descending order of similarity (or fitness) to the search string according to the INDEX file.

Next, a search of the lyrics files by a search character string, using the document data search device in which the INDEX file is registered, will be described with reference to the flowchart of FIG. 3 and to FIG. 2.
First, the user inputs a character string (a lyric line or phrase), which is a set of words the user remembers hearing, into the input interface 2 as the search character string (step 101).
In the control unit 10, the input search character string is subjected to the same morphological analysis as was used when creating the INDEX file (step 102), and the word sequence of the resulting search query word list Q1, ..., Qm is extracted (step 103).
Next, the search query word list is collated with the keyword list of the INDEX file 22 (step 104), and the matching keywords among A1 to An are extracted. For example, if the search query word list contains Q1, Q2, and Q3, and Q1 matches A5, Q2 matches A7, and Q3 matches A9, then A5, A7, and A9 are extracted.
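Steps 102 to 104 can be sketched as follows. This is an illustration under stated assumptions: tokenize is a hypothetical placeholder that splits on whitespace in place of the morphological analysis, and collate is a hypothetical name for the collation against the keyword list.

```python
def tokenize(search_string):
    """Placeholder for morphological analysis (assumption: whitespace split)."""
    return search_string.split()


def collate(query_words, keyword_list):
    """Split query words into those found in the keyword list and the rest.

    Order of the query words is preserved, since later steps use the word
    order to check adjacency.
    """
    keywords = set(keyword_list)
    matched = [word for word in query_words if word in keywords]
    unmatched = [word for word in query_words if word not in keywords]
    return matched, unmatched


matched, unmatched = collate(tokenize("aoi sora umi"), ["sora", "umi", "kaze"])
```

Returning the unmatched words separately makes it easy to hand them to a slower path that does not rely on the precomputed index.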

For the extracted keywords (or keyword sequence), the INDEX file 22 is referenced to obtain, for each lyrics file, the word candidate acoustically most similar to each keyword, its acoustic similarity distance value, and the position information of the similar word candidate; the similarity is then calculated (step 105) and lyrics files are selected (step 106).
Since Q1 matches A5, Q2 matches A7, and Q3 matches A9, the acoustic similarity distance Di of the search query Q1, Q2, Q3 for the lyrics file Si is d5i + d7i + d9i, and the similar word candidate sequence Wi for the lyrics file Si is W5i, W7i, W9i. The appearance position information Pi of the similar word candidates W5i, W7i, W9i in the lyrics file Si is p5i, p7i, p9i.
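The aggregation for one lyrics file can be sketched as a few lookups and a sum. The index layout (keyword, then file id, then a tuple of word, distance, position) and the numeric values below are assumptions chosen for illustration; aggregate_for_file is a hypothetical name.

```python
def aggregate_for_file(index, matched_keywords, file_id):
    """Collect Di, Wi and Pi for one lyrics file from precomputed entries.

    index[keyword][file_id] is assumed to be a (word, distance, position)
    tuple prepared when the index was built.
    """
    entries = [index[keyword][file_id] for keyword in matched_keywords]
    total_distance = sum(distance for _, distance, _ in entries)  # Di
    words = [word for word, _, _ in entries]                      # Wi
    positions = [position for _, _, position in entries]          # Pi
    return total_distance, words, positions


# Illustrative values only; they do not come from the source.
index = {
    "A5": {"Si": ("W5i", 0.0, 12)},
    "A7": {"Si": ("W7i", 1.5, 13)},
    "A9": {"Si": ("W9i", 0.5, 14)},
}
D_i, W_i, P_i = aggregate_for_file(index, ["A5", "A7", "A9"], "Si")
```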

Next, the fitness between the search character string and each lyrics file hit by the search is calculated, and the files are ranked accordingly (step 107).
The fitness can be calculated by a simple method in which the fitness of the lyrics file Si is 1/(Di + 0.1), where Di is the acoustic similarity distance value of the lyrics file Si. More precisely, the fitness of a lyrics file is defined as 1/(Di + β), and β is set to 0.1.
In addition, the appearance position information Pi of the similar word candidate sequence Wi for the keyword sequence is read out, and the adjacency of the similar word candidates is checked against the search word sequence. An adjacency weight αi between the words is then applied to the acoustic similarity distance Di (the stronger the adjacency, the higher the value of αi). As a calculation example, when Q1 matches A5, Q2 matches A7, and Q3 matches A9, the adjacency weight αi for the lyrics file Si can be calculated by Equation 2 below. When there is only one word, αi in Equation 2 is taken to be 1.

(Equation 2)
αi = 1 / [(p7i - p5i)^2 + (p9i - p7i)^2]
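Equation 2 is given for three matched words. A minimal sketch that generalizes it to any number of matched words, by summing the squared gaps between consecutive positions, is shown below; this generalization is an assumption, as is the decision to raise an error when the denominator is zero, a case the text does not address. The name adjacency_weight is hypothetical.

```python
def adjacency_weight(positions):
    """Adjacency weight for the positions of the similar word candidates.

    For three positions [p5i, p7i, p9i] this is Equation 2:
        1 / ((p7i - p5i)^2 + (p9i - p7i)^2)
    A single word gives 1, as stated in the text.
    """
    if len(positions) <= 1:
        return 1.0
    squared_gaps = sum(
        (later - earlier) ** 2
        for earlier, later in zip(positions, positions[1:])
    )
    if squared_gaps == 0:
        # Not covered by the source: all candidates share one position.
        raise ValueError("adjacency weight undefined for identical positions")
    return 1.0 / squared_gaps
```

Candidates that appear consecutively in the lyrics, such as positions 3, 4, 5, give 1/2, while widely separated candidates give a weight close to 0.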

The ranking fitness used to rank the lyrics files is calculated by combining the acoustic similarity distance value Di with the adjacency weight αi and the feature value tf·idf. As a calculation example, the ranking fitness is the value given by αi/(Di + 0.1) + tf·idf'. Alternatively, the value of tf·idf' may be added to the fitness only when Di is 0.
When ranking the lyrics files, the fitness is evaluated against the data of all lyrics files stored in the lyrics database.
After the lyrics files have been ranked by their ranking fitness values, the top N lyrics files in descending order of fitness, together with their song information, are output from the output interface 3 as the search results (step 108).
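The ranking step can be sketched as below. The function names are hypothetical, and the tf·idf' value is passed in as a single number per lyrics file because the text here does not say how it is combined across several search words. The only_when_exact flag models the stated alternative of adding tf·idf' only when Di is 0.

```python
def ranking_fitness(total_distance, alpha, tfidf, beta=0.1, only_when_exact=False):
    """Ranking fitness alpha / (Di + beta) + tf-idf'."""
    fitness = alpha / (total_distance + beta)
    if not only_when_exact or total_distance == 0:
        fitness += tfidf
    return fitness


def top_n(fitness_by_file, n):
    """Return the n (file_id, fitness) pairs with the highest fitness."""
    ranked = sorted(fitness_by_file.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Illustrative values only.
scores = {
    "S1": ranking_fitness(0.0, 0.5, 0.2),
    "S2": ranking_fitness(2.0, 0.5, 0.2),
    "S3": ranking_fitness(0.5, 1.0, 0.0),
}
best_two = top_n(scores, 2)
```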

In the example above, when the INDEX file 22 is created, the acoustic similarity distance to the keyword An is calculated for each word in the word list Li 20 created for each lyrics file, and the word Wni with the closest acoustic similarity distance (the word in the lyrics file Si acoustically most similar to An) is extracted. Alternatively, the top M word candidates by acoustic similarity distance (the M words in the lyrics file Si acoustically most similar to An) may be extracted, and their acoustic similarity distance values Di and adjacency weights αi stored.
In this case, multiple similar words in the lyrics file Si become candidates for each keyword that is a search word of the search word sequence. When calculating the ranking fitness, the highest value of αi/(Di + 0.1) + tf·idf' among the candidates may be used as the ranking fitness for the lyrics file Si.
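One way to read the top-M variant is to try every combination of one candidate per keyword and keep the highest ranking fitness. The sketch below does this by brute force. Several points are assumptions not stated in the text: candidates are (distance, position, tfidf) tuples, the tf·idf' values of a combination are summed, and combinations whose adjacency denominator is zero are skipped. The name best_fitness is hypothetical.

```python
from itertools import product


def best_fitness(candidates_per_keyword, beta=0.1):
    """Highest alpha / (D + beta) + tf-idf' over all candidate combinations.

    candidates_per_keyword holds, for each matched keyword in query order,
    a list of up to M (distance, position, tfidf) tuples.
    """
    best = None
    for combo in product(*candidates_per_keyword):
        total_distance = sum(c[0] for c in combo)
        positions = [c[1] for c in combo]
        if len(positions) > 1:
            gaps = sum((b - a) ** 2 for a, b in zip(positions, positions[1:]))
            if gaps == 0:
                continue  # assumption: ignore candidates at the same position
            alpha = 1.0 / gaps
        else:
            alpha = 1.0
        fitness = alpha / (total_distance + beta) + sum(c[2] for c in combo)
        if best is None or fitness > best:
            best = fitness
    return best
```

The number of combinations grows as M to the power of the number of matched keywords, so this form is only practical for small M and short queries.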

To further improve the accuracy of the similarity calculation, the search target lyrics files may be narrowed down to the top N candidates in descending order of the ranking fitness value obtained from the INDEX file; the character string of each of the N candidate lyrics files and the input search character string (the input phrase before the morphological analysis described above) are then converted into phoneme strings, and DP matching is calculated. By calculating the fitness from the resulting distance values and ranking in descending order of fitness, a ranking with higher similarity accuracy can be displayed.
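The re-ranking pass can be sketched with a weighted edit distance computed by dynamic programming. This is a sketch under assumptions: phoneme strings are given as lists of symbols (the conversion from text to phonemes is not shown), a single substitution penalty is used for every pair of different phonemes, and candidates are ordered by ascending distance, which matches descending fitness for any fitness that decreases as distance grows. The names dp_distance and rerank are hypothetical.

```python
def dp_distance(a, b, substitution=1.0, insertion=1.0, deletion=1.0):
    """Weighted edit distance between two symbol sequences (DP matching)."""
    # previous[j] is the cost of turning the first i-1 symbols of a
    # into the first j symbols of b.
    previous = [j * insertion for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        current = [i * deletion]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + deletion,       # drop x
                current[j - 1] + insertion,   # insert y
                previous[j - 1] + (0.0 if x == y else substitution),
            ))
        previous = current
    return previous[-1]


def rerank(query_phonemes, candidate_phonemes):
    """Order candidate file ids by ascending DP distance to the query."""
    return sorted(
        candidate_phonemes,
        key=lambda file_id: dp_distance(query_phonemes, candidate_phonemes[file_id]),
    )
```

Since this pass runs only on the N files that survived the index-based ranking, its cost stays bounded even though each comparison is quadratic in the string lengths.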

According to the above configuration, if a word similar or identical to a search word of the input search character string exists in the word list An (search target words) of the INDEX file 22, values such as the similarity distance are already registered as precomputed data, so calculating the similarity from these values speeds up similar-text search.
Although the above example describes a document data search device that searches lyrics data, the invention is not limited to lyrics data and can be applied to similar-text search over text such as document data.

If an input word is not in the keyword list, the word (or word sequence) is converted into a phoneme string in real time, DP matching against the phoneme string of each search target lyrics file is calculated, and the search is performed using the resulting distance value as the similarity.
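The text does not specify how a short word is aligned against the phoneme string of a whole lyrics file. One possible reading, shown here purely as an assumption, is approximate substring matching: a DP variant in which the match may start and end anywhere in the file, so that the unmatched remainder of the lyrics is not penalized. The name substring_distance is hypothetical, and unit penalties are used for illustration.

```python
def substring_distance(pattern, text, substitution=1.0, insertion=1.0, deletion=1.0):
    """Smallest weighted edit distance between pattern and any substring of text."""
    # A first row of zeros lets the match start at any position in text.
    previous = [0.0] * (len(text) + 1)
    for i, x in enumerate(pattern, 1):
        current = [i * deletion]
        for j, y in enumerate(text, 1):
            current.append(min(
                previous[j] + deletion,       # pattern symbol has no counterpart
                current[j - 1] + insertion,   # extra symbol inside the match
                previous[j - 1] + (0.0 if x == y else substitution),
            ))
        previous = current
    # Taking the minimum lets the match end at any position in text.
    return min(previous)
```

Under this reading, the file with the smallest distance is the most similar, and the cost per file is proportional to the product of the word length and the file length, which is why the precomputed index is preferred whenever the word is in the keyword list.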

A functional block diagram showing an example of an embodiment of the document data search device of the present invention.
An explanatory diagram for describing the INDEX file creation processing in the document data search device of the present invention.
A flowchart showing an example of the search processing in the document data search device of the present invention.

Explanation of symbols

1 ... document data search device, 2 ... input interface, 3 ... output interface, 4 ... lyrics database, 10 ... control unit, 11 ... word extraction means, 12 ... keyword registration means, 13 ... INDEX file creation means, 14 ... similarity calculation means, 20 ... extracted word list, 21 ... keyword list, 22 ... INDEX file.

Claims

A document data search apparatus that comprises an input interface for inputting a search character string and that searches for document data among a plurality of search target text data using the search character string, the apparatus comprising:
word extraction means for taking a plurality of items of sentence data as input in advance and extracting the words constituting the document data;
keyword registration means for registering, as keywords in a list, search words that may be included in the search character string;
INDEX file creation means for creating a search INDEX file using acoustic similarity distance values calculated by comparing the reading kana or phonemes into which the extracted words extracted by the word extraction means and the keywords are decomposed;
similarity calculation means for calculating the similarity of the search target document data by referring to the INDEX file, based on search words obtained from the input search character string; and
an output interface that outputs the document data search results based on the similarity.

The document data retrieval apparatus according to claim 1, wherein the acoustic similarity distance value of the INDEX file creation means is calculated by means for calculating an edit distance, which converts the keyword and the extracted word each into a reading kana string and performs DP matching in syllable units on the two reading kana strings, and which assigns desired numerical values to the distance between syllables (the penalty value between different syllables), the penalty value for syllable insertion, and the penalty value for syllable dropout.

The document data retrieval apparatus according to claim 1, wherein the acoustic similarity distance value of the INDEX file creation means is calculated by means for calculating the edit distance between the two phoneme strings of the keyword and the extracted word, which converts each reading into a phoneme string and performs DP matching in phoneme units on the two phoneme strings, and which assigns desired numerical values to the distance between phonemes (the penalty value between different phonemes), the penalty value for phoneme insertion, and the penalty value for phoneme dropout.

The document data search device according to claim 3, wherein the INDEX file creation means creates a search INDEX file in which importance information is registered, the importance information using as parameters the position information and appearance frequency, within the document data, of the extracted words extracted from the search target sentence data.

The document data search apparatus according to claim 1, wherein the output interface includes means for calculating a fitness based on the acoustic similarity distance values and outputting the search results in descending order of fitness, as determined from the INDEX file, to the one or more search words obtained from the search character string, and wherein the search target document data candidates are narrowed down to the top N in descending order of fitness to the search character string as determined from the INDEX file, DP matching is calculated between the phoneme string of the search character string and the phoneme string of the character string of each of the N search target document data candidates, and the output order is determined based on the similarity resulting from that calculation.