JP5866084B2

JP5866084B2 - Search device

Info

Publication number: JP5866084B2
Application number: JP2015551295A
Authority: JP
Inventors: 相川 勇之; 伍井 啓恭
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2014-01-16
Filing date: 2014-01-16
Publication date: 2016-02-17
Anticipated expiration: 2034-01-16
Also published as: JPWO2015107659A1; WO2015107659A1

Description

The present invention relates to a search device capable of fuzzy searching through a large volume of registered address data and facility name data, using as the search key not only an official name but also an abbreviation or a vaguely remembered name.

When searching for an address or a facility name, the user does not necessarily remember the exact name and may use a common name, an abbreviation, or a vaguely remembered and incorrect name as the search key.
In addition, on terminals and devices that have no keyboard as an input device, such as car navigation systems and smartphones, a search may be based on the result of speech recognition of an audio signal from a microphone or on the result of character recognition of touch panel input. Input through such devices contains input errors, including recognition errors and user operation mistakes such as mistyped keys.
In either case, a technique is needed for fuzzy searching not only for the official name but also for names whose character string or pronunciation is similar.

One example of a technique that enables such fuzzy search is the search device disclosed in Patent Document 1. That device searches for similar-word candidates based on the degree of substring match with the input keyword, then extracts from those candidates the similar words whose edit distance to the input keyword is small and adds them to the search keywords, thereby performing a fuzzy full-text search. For example, when "アセトアルデヒド" (acetaldehyde) is input as the search keyword, the device searches for similar-word candidates containing substrings of it such as "アセト", "アルデ", and "ヒド". If this search yields similar-word candidates such as "アセトアルデイド" and "アセトアルドール", the device then calculates the edit distance between the input keyword "アセトアルデヒド" and each candidate, and also uses the similar word with a small edit distance, "アセトアルデイド", in the full-text search, thereby reducing missed results.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2005-11078 (JP 2005-11078 A)

However, because the technique described in Patent Document 1 uses similar-word candidates for the preliminary narrowing, it has difficulty handling user input written as a run of several words with no separators. It also requires the search target to be morphologically analyzed and split into words in advance.

Furthermore, when searching for similar-word candidates, raising the precision of the preliminary narrowing requires an index of trigrams or longer n-grams, which makes the dictionary data very large and therefore unsuitable for embedded devices such as car navigation systems and mobile phones. Conversely, if memory capacity is taken into account and the index used for the preliminary narrowing is limited to bigrams or shorter, effective narrowing becomes difficult and the edit distance calculation takes a long time.

Moreover, because the preliminary narrowing stage does not take the distribution of errors into account, the differences may be concentrated at the front of the string or, conversely, at the back. From the standpoint of correcting input errors, the narrowing result is then somewhat inappropriate, and the candidates may not be reduced sufficiently before the edit distance calculation. For example, suppose the user intends to input "TSUNASHIMA" but mistakenly inputs "TSUNAHIMA". With the conventional technique, the preliminary narrowing result also includes strings such as "KUMASHIMA" (TS→K) and "TSUNAHARA" (IM→AR). Viewed as input errors, however, these are unnatural because the characters that differ from the original data are clustered in one place.

The present invention has been made to solve these problems, and its object is to provide a search device that obtains fuzzy search results quickly and with little memory.

The search device according to the present invention comprises: a syllable bigram dictionary that stores the readings of the search target data divided into units of two syllables; a name search dictionary that stores official names; a similar reading candidate creation unit that searches the syllable bigram dictionary for reading strings similar to partial strings of a given reading string and combines the retrieved strings to create similar reading candidates; and a name search unit that searches the name search dictionary for official names based on the similar reading candidates created by the similar reading candidate creation unit.

The search device according to the present invention searches the syllable bigram dictionary for reading strings similar to partial strings of the reading string, combines the retrieved strings to create similar reading candidates, and searches for official names. Fuzzy search results can thus be obtained quickly and with little memory.

FIG. 1 is a block diagram showing the search device of Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the search processing in the search device of Embodiment 1.
FIG. 3 is a block diagram showing the details of the similar reading candidate creation unit in the search device of Embodiment 1.
FIG. 4 is an explanatory diagram of the syllable bigram dictionary in the search device of Embodiment 1.
FIG. 5 is an explanatory diagram showing the concept of the similar reading candidate tree structure data in the search device of Embodiment 1.
FIG. 6 is an explanatory diagram of the similar reading candidate tree structure data in the search device of Embodiment 1.
FIG. 7 is a flowchart showing the similar reading candidate creation process in the search device of Embodiment 1.
FIG. 8 is a flowchart showing the syllable bigram reading string search process in the search device of Embodiment 1.
FIG. 9 is an explanatory diagram showing the relationship between the name search dictionary and the name search unit in the search device of Embodiment 1.
FIG. 10 is an explanatory diagram of the name list in the search device of Embodiment 1.
FIG. 11 is a block diagram showing the search device of Embodiment 2 of the present invention.
FIG. 12 is a block diagram showing the details of the name look-ahead similar reading candidate creation unit in the search device of Embodiment 2.
FIG. 13 is a flowchart showing the name look-ahead similar reading candidate creation process in the search device of Embodiment 2.
FIG. 14 is a block diagram showing the search device of Embodiment 3 of the present invention.
FIG. 15 is a block diagram showing the details of the delimiter-limited similar reading candidate creation unit in the search device of Embodiment 3.
FIG. 16 is a flowchart showing the delimiter-limited similar reading candidate creation process in the search device of Embodiment 3.
FIG. 17 is an explanatory diagram showing an example of candidate rejection by the input reading delimiter determination process in the search device of Embodiment 3.

Hereinafter, to explain the present invention in more detail, embodiments for carrying out the invention are described with reference to the accompanying drawings.
Embodiment 1.
FIG. 1 is a block diagram showing a search device according to Embodiment 1 of the present invention. Facility name search in a car navigation system is used as the example below, but the present invention is not limited to it and is applicable to search processing on embedded devices in general, such as address search and electronic manual search.

The search device shown in FIG. 1 comprises an input unit 101, a similar reading candidate creation unit 103, a syllable bigram dictionary 104, a name search unit 106, and a name search dictionary 107. The input unit 101 accepts user input operations, for example through a software keyboard or speech recognition, and converts them into reading string data 102. The similar reading candidate creation unit 103 creates a similar reading candidate list 105 by referring to the syllable bigram dictionary 104, which stores the readings of the search target data divided into units of two syllables. The name search unit 106 refers to the name search dictionary 107 and outputs, as search result data 108, the name data that contains each reading string in the similar reading candidate list 105.

The search device of the present invention is implemented on a computer, and the input unit 101, the similar reading candidate creation unit 103, and the name search unit 106 are realized by a CPU executing software corresponding to each function. Alternatively, any of these processing units may be implemented in dedicated hardware. The syllable bigram dictionary 104 and the name search dictionary 107 are stored on a storage device such as a hard disk drive.

The processing of Embodiment 1 is described below with reference to FIGS. 1 to 10 as appropriate. FIG. 2 shows the basic flow of the search processing.
Step ST201 is the similar reading candidate creation process. The similar reading candidate creation unit 103 refers to the syllable bigram dictionary 104 and creates similar reading candidates by combining the syllable bigrams retrieved from it. The details of this process are described later.
Step ST202 is the name search process. The name search unit 106 refers to the name search dictionary 107 and outputs, as search result data 108, the data whose reading contains a reading string from the similar reading candidate list 105. The details of the name search process are described later.

Separating step ST201, which creates similar reading candidates, from step ST202, which searches for names, has the following advantages. Fuzzy search processing, which tends to require a large index and heavy computation, is confined to step ST201 and operates on syllable bigrams, so the number of target records is small and the growth in index size and computation is contained. The name search of step ST202, where the number of search targets is very large, is performed as a simple prefix match search with no fuzzy matching, which allows processing that prioritizes speed and memory efficiency.
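The prefix match search of step ST202 can be illustrated with a minimal sketch. It assumes the name readings are held in a sorted list and uses binary search to find the first possible match; the function name `prefix_search` and the sample readings are hypothetical and are not taken from the name search dictionary 107.

```python
from bisect import bisect_left

def prefix_search(sorted_readings, prefix):
    """Return every reading in the sorted list that starts with prefix."""
    # Binary search finds the first entry that is >= prefix.
    start = bisect_left(sorted_readings, prefix)
    hits = []
    for reading in sorted_readings[start:]:
        # Matches are contiguous in a sorted list, so stop at the first miss.
        if not reading.startswith(prefix):
            break
        hits.append(reading)
    return hits

# Hypothetical readings, sorted once when the dictionary is built.
readings = sorted(["TSUNASHIMA", "TSUNASHIMAEKI", "TSUNAKI", "KUMASHIMA"])
print(prefix_search(readings, "TSUNASHI"))
```

Because no fuzzy matching is done at this stage, the cost per similar reading candidate is one binary search plus the number of hits.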

Next, the similar reading candidate creation process (step ST201) is described in detail.
FIG. 3 shows the detailed configuration of the similar reading candidate creation unit 103. The similar reading candidate tree structure data generation unit 301 uses the syllable bigram dictionary search unit 302 to obtain information on syllable bigrams similar to each partial string of the input reading string, and combines this information to generate similar reading candidate tree structure data 303. This data is described in detail later. The similar reading candidate selection unit 304 calls the per-syllable reading edit distance determination unit 305, the reading edit distance cumulative value determination unit 306, the reading edit distance cumulative value calculation unit 307, the new candidate count determination unit 308, the similar reading candidate score order selection unit 309, and the reading appearance probability cumulative value calculation unit 310, and selects from the similar reading candidate tree structure data 303 generated by the similar reading candidate tree structure data generation unit 301 according to the predetermined condition of each. The operation and selection condition of each processing unit are described later. After the similar reading candidate tree structure data 303 has been generated and selected for every character position of the input reading string, the similar reading candidate extraction unit 311 extracts a list of similar reading candidate strings (the similar reading candidate list 105) from it.

The syllable bigram dictionary 104 is created by dividing the reading strings assigned to the search target data into syllables in advance and collecting every reading string of two consecutive syllables that appears in those syllable-divided readings. A syllable is a kind of segmental unit that divides a continuous stream of speech sounds. It is built around a vowel, either the vowel alone or the vowel accompanied by one or more consonants before or after it. In Japanese it corresponds to a single hiragana character. Syllable bigrams, each formed by two adjacent syllables, are far fewer in number than the character trigrams used by the device described in Patent Document 1, which has the advantage of keeping the dictionary data small.

FIG. 4 shows an example of the syllable bigram dictionary. The dictionary data consists of a part that stores the data (data storage unit 401) and a character index 402 for searching the dictionary quickly. In the data storage unit 401, each record stores a syllable reading bigram ID (104a), the reading of the first syllable (104b), the reading of the second syllable (104c), and an appearance probability (104d). The records are built from the syllable-delimited reading strings assigned to the search target data. For example, from the reading string "MI TSU BI SHI DE N KI" of the search target data, the syllable bigrams "MI TSU", "TSU BI", "BI SHI", "SHI DE", "DE N", and "N KI" are extracted and stored in the records of the data storage unit 401. The frequency of each syllable bigram in the source data is calculated separately in advance, and the appearance probability (104d) is calculated from it and stored.
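As an illustration of how such records could be built, the following minimal sketch extracts syllable bigrams from space-delimited readings and derives an appearance probability. The text only says the probability is calculated from the frequency, so using the relative frequency is an assumption, and the names `syllable_bigrams` and `build_records` are hypothetical.

```python
from collections import Counter

def syllable_bigrams(syllables):
    """Return every pair of adjacent syllables, in order of appearance."""
    return [(syllables[i], syllables[i + 1]) for i in range(len(syllables) - 1)]

def build_records(readings):
    """Build records with ID, first syllable, second syllable, probability."""
    counts = Counter()
    for reading in readings:
        counts.update(syllable_bigrams(reading.split()))
    total = sum(counts.values())
    # Assumption: appearance probability = relative frequency of the bigram.
    return [
        {"id": bigram_id, "first": first, "second": second, "prob": count / total}
        for bigram_id, ((first, second), count)
        in enumerate(sorted(counts.items()), start=1)
    ]

records = build_records(["MI TSU BI SHI DE N KI"])
for record in records:
    print(record)
```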

The character index 402 stores index information for fast access to the data storage unit 401. Specifically, it holds inverted index information (104f) for quickly obtaining the syllable bigrams that contain each character of the input reading string. In the example of FIG. 4, the character bigram (104e) "AK" appears in the record whose syllable reading bigram ID is 2 and in the record whose ID is 100.
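A minimal sketch of such an inverted index follows. It assumes that the index keys are the character bigrams of the concatenated two-syllable reading; the sample records are hypothetical and chosen only so that "AK" maps to IDs 2 and 100 as in the example, and `build_char_index` and `lookup` are illustrative names.

```python
from collections import Counter, defaultdict

def char_bigrams(text):
    """Return the set of two-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def build_char_index(records):
    """Map each character bigram to the IDs of the records that contain it."""
    index = defaultdict(list)
    for bigram_id, first, second in records:
        for key in sorted(char_bigrams(first + second)):
            index[key].append(bigram_id)
    return dict(index)

def lookup(index, fragment):
    """Count, per record ID, how many character bigrams of fragment it shares."""
    hits = Counter()
    for key in char_bigrams(fragment):
        for bigram_id in index.get(key, []):
            hits[bigram_id] += 1
    return hits

# Hypothetical records: (syllable reading bigram ID, first syllable, second syllable).
records = [(2, "TA", "KA"), (7, "TSU", "NA"), (100, "NA", "KA")]
index = build_char_index(records)
print(index["AK"])
print(lookup(index, "NAK"))
```

The hit count returned by `lookup` is one possible input to a coarse ranking performed before any edit distance is calculated.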

Next, an outline of the similar reading candidate tree structure data 303 is given with reference to FIG. 5. The similar reading candidate tree structure data 303 is an intermediate data structure generated in order to enumerate the various possibilities. Each syllable bigram candidate [n] in the figure represents information obtained by searching the syllable bigram dictionary 104. Each candidate holds the reading of the first syllable, the reading of the second syllable, the character positions within the input reading string, and so on (in FIG. 5 these are shown only for syllable bigram candidate [1] and omitted for the other candidates). Syllable bigram candidates [1] to [3] are similar to the partial string starting at character position 0 of the input reading string. Candidates [5] and [6] are similar to the partial string starting at character position 1. Candidates [7] to [10] are similar to the partial string starting at character position 3. Candidates [5] and [6] follow candidate [3] and represent the similar reading candidates "TO SU NA" and "TO SU MA", respectively. Candidates [7] and [8] follow candidate [1] and represent the similar reading candidates "TSU NA SHI" and "TSU NA KI", respectively. In this way, the similar reading candidate tree structure data 303 represents candidates for reading strings similar to the input reading string as combinations of syllable bigrams.

FIG. 6 shows a concrete example of the similar reading candidate tree structure data 303, a concrete form of the tree shown conceptually in FIG. 5. Each row of the table in FIG. 6 corresponds to one of the syllable bigram candidates in FIG. 5. The data of each candidate consists of a candidate ID (303a), first syllable reading (303b), second syllable reading (303c), first syllable start position (303d), second syllable start position (303e), second syllable end position (303f), first syllable edit distance (303g), second syllable edit distance (303h), appearance probability (303i), and parent node ID (303j). How each item is set is described later. The links of the tree structure are expressed by the parent node ID (303j).
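The record layout and the parent links can be sketched as follows. This is an illustration only: the field values (positions, distances, probabilities) are hypothetical rather than copied from FIG. 6, and using 0 as the parent ID of a candidate with no parent is an assumption. The function `reading_of` is a hypothetical helper that recovers a similar reading by walking the parent links.

```python
from dataclasses import dataclass

ROOT = 0  # assumed marker meaning "no parent"; not specified in the text

@dataclass
class Candidate:
    cand_id: int       # 303a candidate ID
    first: str         # 303b first syllable reading
    second: str        # 303c second syllable reading
    first_start: int   # 303d first syllable start position
    second_start: int  # 303e second syllable start position
    second_end: int    # 303f second syllable end position
    first_dist: int    # 303g first syllable edit distance
    second_dist: int   # 303h second syllable edit distance
    prob: float        # 303i appearance probability
    parent_id: int     # 303j parent node ID

def reading_of(tree, cand_id):
    """Walk parent links to the root and return the syllables in reading order."""
    node = tree[cand_id]
    syllables = [node.second]
    while node.parent_id != ROOT:
        node = tree[node.parent_id]
        syllables.append(node.second)
    syllables.append(node.first)  # the root also contributes its first syllable
    return list(reversed(syllables))

# Hypothetical rows for candidates [1], [7], and [8].
tree = {
    1: Candidate(1, "TSU", "NA", 0, 3, 5, 0, 0, 0.02, ROOT),
    7: Candidate(7, "NA", "SHI", 3, 5, 7, 0, 1, 0.01, 1),
    8: Candidate(8, "NA", "KI", 3, 5, 7, 0, 1, 0.01, 1),
}
print(" ".join(reading_of(tree, 7)))
print(" ".join(reading_of(tree, 8)))
```

Storing only a parent ID per row keeps each candidate a fixed-size record, and shared prefixes such as "TSU NA" are stored once.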

FIG. 7 shows the detailed flow of the similar reading candidate creation process (step ST201). The following description takes as an example the case where the reading string "TSUNAHIMA" shown in FIG. 5 is input. Steps ST701 to ST713 form a loop executed once per character of the input reading, repeating steps ST702 to ST712 while incrementing the variable cur by 1. The variable cur represents a character position in the input reading string; when the reading string shown in FIG. 5 is input, it is updated sequentially from 0 to 8.

Step ST702 is the syllable bigram reading string search process. Here, the syllable bigram dictionary search unit 302 searches the syllable bigram dictionary 104 for syllable bigrams whose reading is similar to the partial string starting at character position cur of the input reading string.

FIG. 8 shows the detailed flow of the syllable bigram reading string search process (step ST702). Step ST801 is the character-level search process. Here, the syllable bigrams containing each character of the partial string starting at character position cur are retrieved using the character index 402 shown in FIG. 4. When cur = 0, the data with syllable reading bigram IDs 200, 201, 202, and 203 is retrieved. When cur = 1, the data with IDs 204 and 205 is retrieved. When cur = 3, the data with IDs 100, 101, 102, and 103 is retrieved. The search results are provisionally registered in the similar reading candidate tree structure data 303: the candidates with candidate IDs 1, 2, 3, and 4 when cur = 0, the candidates with IDs 5 and 6 when cur = 1, and the candidates with IDs 7, 8, 9, and 10 when cur = 3. First, the first syllable reading (303b), the second syllable reading (303c), and the appearance probability (303i), which are obtained from the syllable bigram dictionary 104, are stored in each candidate. Then the first syllable start position (303d), the second syllable start position (303e), and the second syllable end position (303f) are calculated from the character position cur and the reading length of each syllable, and stored.

Step ST802 is the edit distance calculation process. Here, the edit distance is calculated between the syllable bigram data retrieved in step ST801 and the partial string starting at character position cur of the input string. Because edit distance calculation is computationally expensive, the results of the character-level search in step ST801 are first ranked by factors such as the number of matching characters and the appearance probability of the characters. The edit distance is calculated only for a predetermined number of top-ranked results, which are then re-ranked according to the calculated edit distance. The edit distance is calculated separately for the first syllable reading against a partial string of the input reading and for the second syllable reading against a partial string of the input reading. The results are stored in the first syllable edit distance (303g) and the second syllable edit distance (303h) of the data provisionally registered in the similar reading candidate tree structure data 303.
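The two-stage ranking described here can be sketched as below. The text does not name the edit distance variant, so the standard Levenshtein distance (insertions, deletions, and substitutions each costing 1) is assumed; `rerank`, its `top_k` parameter, and the sample readings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]

def rerank(fragment, coarse_ranked, top_k):
    """Compute the costly edit distance only for the top_k coarse results."""
    shortlist = coarse_ranked[:top_k]
    # sorted() is stable, so ties keep their coarse-ranking order.
    return sorted(shortlist, key=lambda reading: edit_distance(reading, fragment))

print(edit_distance("SHI", "HI"))
print(rerank("HI", ["KU", "SHI", "HA", "MA"], top_k=3))
```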

Step ST703 is the per-syllable reading edit distance determination process. Here, the per-syllable reading edit distance determination unit 305 keeps the candidates whose per-syllable edit distance calculated in step ST802 is at most a predetermined value and rejects the others. In this example, candidates whose edit distance for each syllable is 1 or less are kept, and syllable bigram candidate [4] in FIG. 5 is rejected. To mark a candidate as rejected, its parent node ID (303j) is changed to -1.

In the double loop from step ST704 to step ST710, every syllable bigram candidate obtained as a result of step ST703 is checked against every existing syllable bigram candidate to determine whether they can be connected, and new syllable bigram candidates are generated after candidate selection based on the cumulative reading edit distance. When cur = 0 there are no preceding candidates, so the connection determination of step ST706 is always true and processing proceeds to step ST707. When cur > 0, the connection between each syllable bigram candidate newly obtained up to step ST703 and the existing syllable bigram candidates is determined. Two candidates are connectable if the first syllable reading of the new syllable bigram matches the second syllable reading of the existing syllable bigram. For example, when cur = 1, the first syllable reading of candidates [5] and [6] matches the second syllable reading of candidate [3], so 3 is stored as the parent node ID of candidates [5] and [6]. Likewise, when cur = 3, 1 is stored as the parent node ID of candidates [7] and [8], and 2 as the parent node ID of candidates [9] and [10].
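The connection rule can be illustrated with the following minimal sketch. It implements only the reading match stated in the text; the readings given to candidates [2], [9], and [10] are hypothetical, as are the function name `connect` and the use of 0 as the parent ID when there is no preceding candidate.

```python
ROOT = 0  # assumed parent ID for candidates that have no preceding candidate

def connect(new_candidates, existing_candidates):
    """Return (new ID, parent ID) pairs for every connectable combination."""
    links = []
    for new in new_candidates:
        if not existing_candidates:
            # cur = 0: nothing precedes, so the determination is always true.
            links.append((new["id"], ROOT))
            continue
        for old in existing_candidates:
            # Connectable when the new first syllable equals the old second syllable.
            if new["first"] == old["second"]:
                links.append((new["id"], old["id"]))
    return links

# Hypothetical readings; only the link pattern follows the example in the text.
existing = [
    {"id": 1, "first": "TSU", "second": "NA"},
    {"id": 2, "first": "TSU", "second": "MA"},
]
new = [
    {"id": 7, "first": "NA", "second": "SHI"},
    {"id": 8, "first": "NA", "second": "KI"},
    {"id": 9, "first": "MA", "second": "SHI"},
    {"id": 10, "first": "MA", "second": "KI"},
]
links = connect(new, existing)
print(links)
```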

Step ST707 is the candidate selection process based on the cumulative reading edit distance. Here, the reading edit distance cumulative value determination unit 306 selects syllable bigram candidates according to the cumulative reading edit distance calculated by the reading edit distance cumulative value calculation unit 307.

The cumulative reading edit distance is calculated from the first syllable edit distance (303g) and the second syllable edit distance (303h) in the similar reading candidate tree structure data 303. For a given syllable bigram candidate, it is obtained by following the parent node IDs back to the syllable bigram candidate at character position 0 while adding up the second syllable edit distances (303h), and finally adding the first syllable edit distance (303g). For example, for syllable bigram candidate [8] shown in FIGS. 5 and 6, the cumulative reading edit distance is 1 + 0 + 0 = 1. For syllable bigram candidate [10], it is 1 + 1 + 0 = 2.
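The accumulation can be written as a short walk over the parent links. In this sketch the individual distances are hypothetical values chosen only to reproduce the two sums in the example, and it is assumed that the first syllable edit distance added at the end is that of the candidate at character position 0.

```python
def cumulative_distance(tree, cand_id):
    """Sum second-syllable distances up to the root, then add its first-syllable distance."""
    node = tree[cand_id]
    total = node["second_dist"]
    # Follow parent IDs until the candidate that starts at character position 0.
    while node["first_start"] != 0:
        node = tree[node["parent"]]
        total += node["second_dist"]
    return total + node["first_dist"]

# Hypothetical per-syllable distances for candidates [1], [2], [8], and [10].
tree = {
    1: {"first_start": 0, "first_dist": 0, "second_dist": 0, "parent": 0},
    2: {"first_start": 0, "first_dist": 0, "second_dist": 1, "parent": 0},
    8: {"first_start": 3, "first_dist": 0, "second_dist": 1, "parent": 1},
    10: {"first_start": 3, "first_dist": 1, "second_dist": 1, "parent": 2},
}
print(cumulative_distance(tree, 8))
print(cumulative_distance(tree, 10))
```

Only the root's first syllable distance is added because every other first syllable coincides with the second syllable of its parent and has already been counted there.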

The threshold Th1 that decides whether a candidate is adopted is determined by the length L of the input reading string. A longer input reading string is more likely to contain errors, so a larger value is set. For example, L/4 is used as the threshold, and similar reading candidates whose cumulative reading edit distance exceeds it are rejected. When cur = 0, syllable bigram candidates [1], [2], and [3] all have a cumulative reading edit distance of 1 or less, so all of them are adopted. When cur = 1, syllable bigram candidate [6] has a cumulative edit distance of 2 and is rejected. When cur = 3, syllable bigram candidate [10] is likewise rejected.
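The selection by Th1 amounts to a simple filter, sketched below. The text does not state how the length L is counted, so the sketch takes the threshold as a parameter and uses 1, the value consistent with the worked example, as an assumption. The cumulative distances of candidates [7] and [9] are hypothetical, and `select_by_th1` is an illustrative name.

```python
def select_by_th1(cumulative, th1):
    """Split candidate IDs into kept and rejected by cumulative edit distance."""
    kept = [cid for cid, dist in sorted(cumulative.items()) if dist <= th1]
    rejected = [cid for cid, dist in sorted(cumulative.items()) if dist > th1]
    return kept, rejected

# Cumulative distances at cur = 3; values for [7] and [9] are hypothetical.
cumulative = {7: 0, 8: 1, 9: 1, 10: 2}
kept, rejected = select_by_th1(cumulative, th1=1)
print(kept, rejected)
```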

Step ST708 is the similar reading candidate tree structure update process, which formally registers the candidates provisionally registered in step ST801. For example, when cur = 0, the rows with candidate IDs 1, 2, and 3 in FIG. 6 are formally registered. The row with candidate ID 4 was already rejected in step ST703 and is therefore not formally registered. It is shown in FIG. 6 to make the correspondence with FIG. 5 easier to follow, but it may be overwritten in the processing from cur = 1 onward to limit memory usage. Similarly, when cur = 1, the candidate with candidate ID 5 is formally registered, whereas the candidate with candidate ID 6 is rejected in step ST707 and is not. Likewise, when cur = 3, the candidates with candidate IDs 7, 8, and 9 are formally registered.

When the double loop from step ST704 to step ST710 ends, the process proceeds to step ST711. In step ST711, the new candidate number determination unit 308 decides whether selection is needed, based on the number of new syllable bigram candidates added at the current character position cur.

If the number of new candidates exceeds a predetermined threshold Th2, the process proceeds to step ST712, where the similar reading candidate score order selection unit 309 sorts the syllable bigram candidates by score, keeps the higher-scoring candidates, and rejects the lower-scoring ones. The threshold Th2 is set as appropriate according to external requirements that the function must meet, such as memory usage and speed.

The score is calculated by the reading appearance probability cumulative value calculation unit 310. Specifically, the score of a syllable bigram candidate is obtained by following parent node IDs back to the syllable bigram candidate at character position 0 and summing the appearance probabilities (303i) along the way. Alternatively, the cumulative edit distance calculated by the reading edit distance cumulative value calculation unit 307 may be used as the score; it is calculated by the same procedure as in step ST707.
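The selection in steps ST711 and ST712 might look like the following sketch, which scores each new candidate by its cumulative appearance probability and trims the list only when it exceeds Th2. The function names, the dictionary layout, and the `keep` parameter (how many top candidates survive) are assumptions made for illustration; the text only says that higher-scoring candidates are kept.

```python
from typing import Dict, List, Optional


def cumulative_probability(parents: Dict[int, Optional[int]],
                           probs: Dict[int, float],
                           candidate_id: int) -> float:
    """Sum appearance probabilities (303i) from the candidate up to the
    candidate at character position 0 (whose parent is None)."""
    score = 0.0
    node: Optional[int] = candidate_id
    while node is not None:
        score += probs[node]
        node = parents[node]
    return score


def select_by_score(new_ids: List[int],
                    parents: Dict[int, Optional[int]],
                    probs: Dict[int, float],
                    th2: int,
                    keep: int) -> List[int]:
    """Keep every new candidate unless there are more than th2 of them;
    in that case keep only the `keep` highest-scoring ones."""
    if len(new_ids) <= th2:
        return list(new_ids)
    ranked = sorted(new_ids,
                    key=lambda c: cumulative_probability(parents, probs, c),
                    reverse=True)
    return ranked[:keep]


# Hypothetical data: candidates 2, 3, 4 all hang off root candidate 1.
parents = {1: None, 2: 1, 3: 1, 4: 1}
probs = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.0625}
print(select_by_score([2, 3, 4], parents, probs, th2=2, keep=2))  # [2, 3]
```

If the cumulative edit distance is used as the score instead, a smaller value indicates a better candidate, so the sort direction would presumably be reversed.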

When the loop from step ST701 to step ST713 ends, the process proceeds to step ST714. Although not shown in FIG. 7, the connection determination of step ST706 may also be performed immediately after step ST701. If none of the existing syllable bigram candidates has a second-syllable end position (303f) equal to the current character position cur, every result of steps ST702 and ST703 would be rejected in step ST706 anyway. Performing this connection check at the head of the loop, and skipping step ST702 and the subsequent steps when no connectable candidate exists, avoids that wasted work.

In step ST714, the similar reading candidate extraction unit 311 extracts the actual similar reading candidate character strings from the similar reading candidate tree structure data 303. A similar reading candidate character string is obtained for each syllable bigram candidate that is a leaf of the tree structure data and whose second-syllable end position (303f) reaches the end of the input reading character string: following parent node IDs back to the syllable bigram candidate at character position 0, the second-syllable readings (303c) are concatenated in turn, the first-syllable reading (303b) is appended last, and the resulting sequence is reversed.
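The extraction step can be illustrated with the sketch below. The `Node` class, its field names, and the sample tree are hypothetical; in particular, treating the second-syllable end position as an exclusive character offset is an assumption, and the data is not taken from the figures.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    parent_id: Optional[int]  # None at character position 0
    first_reading: str        # first-syllable reading (303b)
    second_reading: str       # second-syllable reading (303c)
    second_end: int           # second-syllable end position (303f), assumed exclusive


def extract_reading(tree: Dict[int, Node], leaf_id: int) -> str:
    """Collect second-syllable readings up the parent chain, append the
    root's first-syllable reading last, then reverse the sequence."""
    parts: List[str] = []
    node = tree[leaf_id]
    while True:
        parts.append(node.second_reading)
        if node.parent_id is None:
            parts.append(node.first_reading)
            break
        node = tree[node.parent_id]
    return "".join(reversed(parts))


def extract_candidates(tree: Dict[int, Node], input_length: int) -> List[str]:
    """Extract a reading for every leaf whose second syllable ends at the
    end of the input reading character string."""
    parent_ids = {n.parent_id for n in tree.values()}
    return [extract_reading(tree, cid) for cid, n in tree.items()
            if cid not in parent_ids and n.second_end == input_length]


tree = {
    1: Node(None, "TSU", "NA", 5),
    2: Node(1, "NA", "HI", 7),
    3: Node(2, "HI", "MA", 9),
    4: Node(2, "HI", "ME", 9),
    5: Node(1, "NA", "HA", 7),  # a leaf, but it stops short of the input end
}
print(extract_candidates(tree, 9))  # ['TSUNAHIMA', 'TSUNAHIME']
```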

This concludes the details of the similar reading candidate creation process in step ST201 of FIG. 2. Next, the name search process in step ST202 is described.

In the name search process, the name search unit 106 refers to the name search dictionary 107, searches for name data containing the readings in the similar reading candidate list 105, and outputs it. Here, as an example, the search method disclosed in JP 2010-205119 A is used. Because the details are given in that publication, only an outline of the search process is described below.

FIG. 9 shows a configuration example of the name search dictionary 107. Using the double-array index data 901, the name search unit 106 locates the node corresponding to a given similar reading candidate character string. Next, using the minimum/maximum child node index 902, it finds, among the child nodes of that node, the node whose character string is smallest in dictionary order and the node whose character string is largest. Finally, it extracts from the name list 903 every name from the one corresponding to the minimum node through the one corresponding to the maximum node, and outputs them as the search result data 108.

FIG. 10 shows an example of the name list 903. The name list 903 contains at least a name ID (903a) that uniquely identifies each name, the reading character string (903b) of that name, and type information (903c). In FIG. 10, two rows share the same name ID, 3. This is because the reading character string (903b) is expanded and indexed in advance so that it can be searched starting from a character in the middle of the string.
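The idea of pre-expanding readings and then returning a contiguous range of the sorted list can be sketched as follows. This deliberately swaps in a sorted Python list with binary search (`bisect`) in place of the double-array index and the minimum/maximum child node index; the sample names, the per-character expansion, and the `"\uffff"` upper-bound sentinel are all assumptions for illustration.

```python
import bisect
from typing import List, Tuple

Row = Tuple[str, int, str]  # (reading suffix, name ID, type)


def build_name_list(names: List[Tuple[int, str, str]]) -> List[Row]:
    """Expand each reading into all of its suffixes so that a search can
    start from the middle of a name, then sort in dictionary order."""
    rows: List[Row] = []
    for name_id, reading, kind in names:
        for start in range(len(reading)):
            rows.append((reading[start:], name_id, kind))
    rows.sort()
    return rows


def search(rows: List[Row], candidate: str) -> List[int]:
    """Return the IDs of all names having a suffix that begins with
    `candidate`, i.e. the rows between the smallest and largest match."""
    keys = [r[0] for r in rows]
    lo = bisect.bisect_left(keys, candidate)
    hi = bisect.bisect_left(keys, candidate + "\uffff")  # just past the last match
    return sorted({r[1] for r in rows[lo:hi]})


rows = build_name_list([
    (1, "tokyoeki", "station"),
    (2, "kyoto", "city"),
    (3, "shinkyogoku", "street"),
])
print(search(rows, "kyo"))   # [1, 2, 3]
print(search(rows, "kyot"))  # [2]
```

Because the same name appears once per suffix, several rows carry the same name ID, which mirrors the repeated name ID in the name list described above.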

The description above used, as an example, the search method based on a double-array index described in JP 2010-205119 A. However, the name search process in step ST202 may use any method that can quickly retrieve, for each similar reading candidate character string in the similar reading candidate list 105, the name data containing that reading character string. For example, a database for embedded devices may be used, or the information held in the name list 903 shown in FIG. 9 may be embedded in tree-structured index data for high-speed search.

As described above, Embodiment 1 provides the syllable bigram dictionary 104, which stores the readings of the search target data divided into two-syllable units, together with means for searching the syllable bigram dictionary 104 for reading character strings similar to parts of the reading character string entered by the user and combining them to create similar reading candidates. As a result, no prior word segmentation is needed, and flexible, high-speed fuzzy search becomes possible in languages such as Japanese, Korean, and Chinese, in which a user may enter several words in succession.
In addition, the similar reading candidate selection unit 304 creates similar reading candidates while selecting candidates according to predetermined criteria. The number of combinations therefore does not grow explosively even when the number of input characters is large, and similar reading candidate character strings can be created quickly and with little memory.

As described above, the search device of Embodiment 1 includes a syllable bigram dictionary that stores the readings of the search target data divided into two-syllable units; a name search dictionary that stores official names; a similar reading candidate creation unit that searches the syllable bigram dictionary for reading character strings similar to parts of a given reading character string and combines the retrieved strings to create similar reading candidates; and a name search unit that searches the name search dictionary for official names based on the similar reading candidates created by the similar reading candidate creation unit. The device can therefore find and output the original official name by fuzzy search even when the search key is a common name, an abbreviation, or a misremembered name, and it does so quickly and with little memory.

Further, in the search device of Embodiment 1, the similar reading candidate creation unit includes a similar reading candidate selection unit that selects among the similar reading candidates created by combining the retrieved character strings, so fuzzy search results can be obtained quickly and with little memory.

Further, in the search device of Embodiment 1, the similar reading candidate selection unit includes a syllable-by-syllable reading edit distance determination unit that selects similar reading candidates for which the edit distance between the reading of each syllable retrieved from the syllable bigram dictionary and the corresponding partial character string of the given reading character string is at most a predetermined value, and rejects all other similar reading candidates. Fuzzy search results can therefore be obtained quickly and with little memory.

Further, in the search device of Embodiment 1, the similar reading candidate selection unit includes a reading edit distance cumulative value calculation unit that calculates the cumulative edit distance between the reading of each syllable retrieved from the syllable bigram dictionary and the corresponding partial character string of the given reading character string, and a reading edit distance cumulative value determination unit that selects similar reading candidates whose cumulative reading edit distance is at most a predetermined value and rejects all other similar reading candidates. Fuzzy search results can therefore be obtained quickly and with little memory.

Further, in the search device of Embodiment 1, the similar reading candidate selection unit includes a new candidate number determination unit that determines whether the number of similar reading candidates exceeds a predetermined value, and a similar reading candidate score order selection unit that, when the new candidate number determination unit determines that the value is exceeded, calculates the scores of the similar reading candidates and keeps only a predetermined number of the highest-scoring candidates. Fuzzy search results can therefore be obtained quickly and with little memory.

Further, in the search device of Embodiment 1, the similar reading candidate score order selection unit calculates the scores of the similar reading candidates from the cumulative reading edit distance calculated by the reading edit distance cumulative value calculation unit, so fuzzy search results can be obtained quickly and with little memory.

Further, in the search device of Embodiment 1, the similar reading candidate selection unit includes a reading appearance probability cumulative value calculation unit that calculates the cumulative appearance probability of the syllables retrieved from the syllable bigram dictionary, and the similar reading candidate score order selection unit calculates the scores of the similar reading candidates from the result of that calculation. Fuzzy search results can therefore be obtained quickly and with little memory.

Embodiment 2.
FIG. 11 is a configuration diagram showing the search device of Embodiment 2. In Embodiment 2, a name pre-read similar reading candidate creation unit 1101 is provided in place of the similar reading candidate creation unit 103 of Embodiment 1. The name pre-read similar reading candidate creation unit 1101 creates a similar reading candidate list 105a by referring to the syllable bigram dictionary 104, which stores the readings of the search target data divided into two-syllable units. In doing so, it also refers to the name search dictionary 107 as part of the candidate selection process. The rest of the configuration is the same as in Embodiment 1 shown in FIG. 1, so corresponding parts are given the same reference numerals and their description is omitted.

FIG. 12 shows the detailed configuration of the name pre-read similar reading candidate creation unit 1101. A name presence/absence determination unit 1201 refers to the name search dictionary 107 and determines whether a similar reading candidate exists in the name dictionary. The rest of the configuration is the same as in Embodiment 1 shown in FIG. 3, so corresponding parts are given the same reference numerals and their description is omitted.

FIG. 13 is a flowchart showing the name pre-read similar reading candidate creation process in Embodiment 2. Except for the name presence/absence determination process in step ST1301, the processing is the same as that shown in FIG. 7 for Embodiment 1, so only step ST1301 is described. In step ST1301, the similar reading candidate extraction unit 311 first extracts a similar reading candidate character string for each syllable bigram candidate that results from step ST707. Next, the name presence/absence determination unit 1201 refers to the name search dictionary 107 and determines whether this similar reading candidate character string exists in the name search dictionary 107. If it does not, the subsequent name search process would fail for that candidate, so the candidate is rejected at this point.
The name presence/absence determination process in step ST1301 does not need the final name data itself, so it uses only the double-array index data 901 (see FIG. 9), which makes the lookup faster.
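A minimal sketch of this early existence check is shown below. It uses a sorted list of indexed reading strings with binary search as a stand-in for the double-array index data 901; the function names and the sample keys are hypothetical.

```python
import bisect
from typing import List


def reading_exists(sorted_keys: List[str], candidate: str) -> bool:
    """Return True if at least one indexed reading starts with `candidate`.

    In a sorted list, the first key >= candidate starts with candidate
    whenever any key does, so a single binary search suffices.
    """
    i = bisect.bisect_left(sorted_keys, candidate)
    return i < len(sorted_keys) and sorted_keys[i].startswith(candidate)


def prune(candidates: List[str], sorted_keys: List[str]) -> List[str]:
    """Reject partial candidates that cannot lead to any name."""
    return [c for c in candidates if reading_exists(sorted_keys, c)]


keys = sorted(["kyoeki", "kyoto", "tokyoeki"])
print(prune(["kyo", "kya", "to"], keys))  # ['kyo', 'to']
```

Only the keys are consulted here; no name record is fetched, which corresponds to the point that the final name data is not needed for this check.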

As described above, in the search device of Embodiment 2, the similar reading candidate selection unit includes a name presence/absence determination unit that searches the name search dictionary for each similar reading candidate, selects the candidates that can be part of a name, and rejects all other similar reading candidates. Only those similar reading candidates that will actually yield a result in the name search are kept, which makes the fuzzy search even faster and more memory-efficient.

Embodiment 3.
FIG. 14 is a configuration diagram showing the search device of Embodiment 3. In Embodiment 3, a reading conversion input unit 1401 and a delimiter-limited similar reading candidate creation unit 1403 are provided in place of the input unit 101 and the similar reading candidate creation unit 103 of Embodiment 1. In the reading conversion input unit 1401, the user enters a reading character string and converts it into a written character string, as in Japanese kana-kanji conversion, Chinese pinyin-to-character conversion, or Korean jamo-to-Hangul conversion. The result is acquired as delimited reading character string data 1402, in which the reading character string is delimited so that each segment corresponds to one character of the written form. The delimiter-limited similar reading candidate creation unit 1403 uses the delimited reading character string data 1402 and, as in Embodiment 1, refers to the syllable bigram dictionary 104 to create a similar reading candidate list 105b. The rest of the configuration is the same as in Embodiment 1 shown in FIG. 1, so corresponding parts are given the same reference numerals and their description is omitted.

FIG. 15 is a configuration diagram showing the details of the delimiter-limited similar reading candidate creation unit 1403. A delimiter-limited similar reading candidate tree structure data generation unit 1501 uses the syllable bigram dictionary search unit 302 to obtain information on syllable bigrams similar to each partial character string of the delimited reading character string data 1402, and combines this information to generate the similar reading candidate tree structure data 303. The rest of the configuration is the same as in Embodiment 1 shown in FIG. 3, so corresponding parts are given the same reference numerals and their description is omitted.

FIG. 16 is a flowchart showing the delimiter-limited similar reading candidate creation process in Embodiment 3. Except for the input reading delimiter determination process in step ST1601, the processing is the same as that shown in FIG. 7 for Embodiment 1, so only step ST1601 is described. In step ST1601, the delimiter information of the input reading is used to decide whether to register a syllable bigram search result in the similar reading candidate tree structure data 303. Specifically, it is determined whether the current character position cur and the reading length of the first syllable of the search result agree with the delimiters of the input reading character string; if they do not, the candidate is rejected.

FIG. 17 shows an example in which a candidate is rejected in step ST1601. In this example, the delimited input reading character string is "TSU/NA/HI/MA", whereas syllable bigram candidate [3] uses two syllables to cover the single delimited segment "TSU". Its syllable boundaries do not match the input delimiters, so it is rejected.
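One way to read the delimiter check is sketched below: the delimited input is turned into a set of boundary offsets, and a search result is accepted only if its first syllable starts and ends on a boundary. The function names, the `first_len` parameter (the number of input characters covered by the first syllable), and this exact acceptance rule are assumptions for illustration rather than the method as specified.

```python
from typing import Set


def boundaries(delimited: str, sep: str = "/") -> Set[int]:
    """Return the character offsets at which segments start or end,
    measured in the reading string with delimiters removed."""
    offsets = {0}
    pos = 0
    for part in delimited.split(sep):
        pos += len(part)
        offsets.add(pos)
    return offsets


def matches_delimiters(cur: int, first_len: int, bounds: Set[int]) -> bool:
    """Accept a result only if its first syllable spans from one input
    delimiter to another."""
    return cur in bounds and (cur + first_len) in bounds


bounds = boundaries("TSU/NA/HI/MA")
print(sorted(bounds))                    # [0, 3, 5, 7, 9]
print(matches_delimiters(0, 3, bounds))  # True: the first syllable covers "TSU"
print(matches_delimiters(0, 1, bounds))  # False: the syllable ends inside "TSU"
```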

As described above, the search device of Embodiment 3 includes a syllable bigram dictionary that stores the readings of the search target data divided into two-syllable units; a name search dictionary that stores official names; a delimiter-limited similar reading candidate creation unit that searches the syllable bigram dictionary for reading character strings similar to parts of a delimited reading character string divided into syllable units, combines the retrieved strings to create similar reading candidates, checks whether the syllable delimiters in the delimited reading character string match the search results from the syllable bigram dictionary, and rejects the syllable bigram search results whose delimiters do not match; and a name search unit that searches the name search dictionary for official names based on the similar reading candidates created by the delimiter-limited similar reading candidate creation unit. Unnecessary syllable bigram candidates that do not match the input can therefore be rejected, which makes the fuzzy search even faster and more memory-efficient.

Within the scope of the invention, the embodiments may be freely combined, and any component of any embodiment may be modified or omitted.

As described above, the search device according to the present invention searches the syllable bigram dictionary for reading character strings similar to parts of a reading character string, combines the retrieved strings to create similar reading candidates, and searches for official names. It is applicable to search processing on embedded devices in general, such as facility name search in car navigation systems, address search, and electronic manual search.

101: input unit; 102: reading character string data; 103: similar reading candidate creation unit; 104: syllable bigram dictionary; 105, 105a, 105b: similar reading candidate list; 106: name search unit; 107: name search dictionary; 108: search result data; 301: similar reading candidate tree structure data generation unit; 302: syllable bigram dictionary search unit; 303: similar reading candidate tree structure data; 304: similar reading candidate selection unit; 305: syllable-by-syllable reading edit distance determination unit; 306: reading edit distance cumulative value determination unit; 307: reading edit distance cumulative value calculation unit; 308: new candidate number determination unit; 309: similar reading candidate score order selection unit; 310: reading appearance probability cumulative value calculation unit; 311: similar reading candidate extraction unit; 1101: name pre-read similar reading candidate creation unit; 1201: name presence/absence determination unit; 1401: reading conversion input unit; 1402: delimited reading character string data; 1403: delimiter-limited similar reading candidate creation unit; 1501: delimiter-limited similar reading candidate tree structure data generation unit.

Claims

A search device comprising:
a syllable bigram dictionary that stores readings of search target data divided into two-syllable units;
a name search dictionary that stores official names;
a similar reading candidate creation unit that searches the syllable bigram dictionary for reading character strings similar to parts of a given reading character string, and creates similar reading candidates by combining the retrieved character strings; and
a name search unit that searches the name search dictionary for an official name based on the similar reading candidates created by the similar reading candidate creation unit.

The search device according to claim 1, wherein the similar reading candidate creation unit includes a similar reading candidate selection unit that selects similar reading candidates created by combining the searched character strings.

The search device according to claim 2, wherein the similar reading candidate selection unit includes a syllable-by-syllable reading edit distance determination unit that selects similar reading candidates for which the edit distance between the reading of each syllable retrieved from the syllable bigram dictionary and the corresponding partial character string of the given reading character string is at most a predetermined value, and rejects all other similar reading candidates.

The search device according to claim 2, wherein the similar reading candidate selection unit includes: a reading edit distance cumulative value calculation unit that calculates a cumulative value of the edit distance between the reading of each syllable retrieved from the syllable bigram dictionary and the corresponding partial character string of the given reading character string; and a reading edit distance cumulative value determination unit that selects similar reading candidates whose cumulative reading edit distance calculated by the reading edit distance cumulative value calculation unit is at most a predetermined value, and rejects all other similar reading candidates.

The search device according to claim 2, wherein the similar reading candidate selection unit includes: a new candidate number determination unit that determines whether the number of similar reading candidates exceeds a predetermined value; and a similar reading candidate score order selection unit that, when the new candidate number determination unit determines that the predetermined value is exceeded, calculates scores of the similar reading candidates and keeps only a predetermined number of the highest-scoring similar reading candidates.

The search device, wherein the similar reading candidate score order selection unit calculates the score of the similar reading candidate based on the cumulative value of the reading edit distance calculated by the reading edit distance cumulative value calculation unit.

The search device according to claim 5, wherein the similar reading candidate selection unit includes a reading appearance probability cumulative value calculation unit that calculates a cumulative value of the appearance probabilities of the syllables retrieved from the syllable bigram dictionary, and the similar reading candidate score order selection unit calculates the score of the similar reading candidate based on the calculation result of the reading appearance probability cumulative value calculation unit.

The search device according to claim 2, wherein the similar reading candidate selection unit includes a name presence/absence determination unit that searches the name search dictionary for each similar reading candidate, selects the similar reading candidates that can be part of a name, and rejects all other similar reading candidates.

A search device comprising:
a syllable bigram dictionary that stores readings of search target data divided into two-syllable units;
a name search dictionary that stores official names;
a delimiter-limited similar reading candidate creation unit that searches the syllable bigram dictionary for reading character strings similar to parts of a delimited reading character string divided into syllable units, creates similar reading candidates by combining the retrieved character strings, determines whether the syllable delimiters in the delimited reading character string match the search results from the syllable bigram dictionary, and rejects the syllable bigram search results whose delimiters do not match; and
a name search unit that searches the name search dictionary for an official name based on the similar reading candidates created by the delimiter-limited similar reading candidate creation unit.