JP5132430B2

JP5132430B2 - Information processing apparatus, information processing method, and program for generating first and last name candidates

Info

Publication number: JP5132430B2
Application number: JP2008141209A
Authority: JP
Inventors: 剛志福田; 弘一高橋
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2008-05-29
Filing date: 2008-05-29
Publication date: 2013-01-30
Anticipated expiration: 2028-05-29
Also published as: JP2009289052A

Description

The present invention relates to name search, and more particularly to an information processing apparatus, an information processing method, and a program for generating candidate full names (surname and given name) written in kanji.

In recent years, advances in transportation and network technology have increased international human activity. When individuals or companies carry out activities across multiple countries, they complete various procedures by stating the individual's full name or the company's name. As international activity increases, cases are expected to arise in which searches must be performed using that information in a country other than the one where the procedure was completed. One illustrative case is international money laundering: the identity of an individual who opened a bank account, sent a remittance, or made a withdrawal must be recognized by sharing information among multiple countries and by handling the name notations of different cultural spheres, without losing the identity of the name.

Name search is not needed only for international activity. In Japan, for example, processing such as so-called name aggregation (nayose), which must determine whether bank accounts, health insurance records, and the like refer to the same person, needs to generate, with high accuracy and efficiency, the similar names that can arise from an acquired name.

Names written in Indo-European languages, including English and languages written in Cyrillic or Greek letters, are written in phonetic characters such as the alphabet. They are therefore well suited to search by character string, and search, including similarity search on character strings, is relatively easy. Methods that compare personal names written in the alphabet by computing a similarity score are used for personal name search. One example of a system and method for searching personal names written in the alphabet is the automated name search system described in US Pat. No. 6,963,871 B1 (Patent Document 1).

In addition, the document at the URI http://publibfp.boulder.ibm.com/epubs/pdf/c1912860.pdf (Non-Patent Document 1) discloses the Global Name Analytics (GNA) system, which searches personal names by using the similarity between names written in the alphabet.

In cultural spheres that use ideographic characters, exemplified by CJK (Chinese, Japanese, Korean), name search faces problems that differ from those in the Indo-European cultural sphere, owing to the existence of variant character forms and differences in phonetic characteristics. In the CJK sphere, ideographs are used, and a single ideograph can have multiple pronunciations. Furthermore, when a name in kanji is input to an information processing apparatus in a variant or abbreviated character form, a comparison between kanji character codes produces a miss, which lowers search accuracy. When names written in ideographs are searched directly by their written form, the discrepancies involved include not only differences in pronunciation but also transcription errors, clerical errors, and OCR misrecognition, so the similarity search must take variant forms, abbreviated forms, and similar shapes into account. Compared with name similarity search in the Indo-European sphere, there are additional issues to consider, and accuracy is a problem. In other words, in cultural spheres that use ideographs, similarity between names with the same characters but different readings must also be judged.

Furthermore, even ideographs have pronunciations, so a name written down from hearing, for example, may be recorded incorrectly because it was misheard. In such a case, searching only for same-character, different-reading names cannot generate similar name candidates with high accuracy.

Techniques for searching names in kanji strings, for example in Japanese, are known. An information retrieval apparatus that searches documents written in kanji, such as Japanese documents, by dividing them into words (tokens) of a set unit is described in, for example, Japanese Patent Application Laid-Open No. 2004-206473 (Patent Document 2). An information retrieval technique that searches personal names written in kanji by assigning different weights to the surname and the given name is described in, for example, Japanese Patent Application Laid-Open No. 2004-295797 (Patent Document 3). Japanese Patent Application Laid-Open No. H7-230472 (Patent Document 4) describes a method for correcting misread personal names. Patent Document 4 describes processing that consults both a personal-name reading-to-kanji table and a personal-name kanji-to-reading table in succession and passes all readings or kanji obtained by the table search to a search program.

The information retrieval disclosed in Patent Documents 1 to 3 and Non-Patent Document 1 can provide sufficient accuracy and searchability for personal names written in the alphabet or other Roman character codes. Patent Documents 2 and 3 also make it possible to search documents written in kanji, where each character is defined by a kanji code, and to extract personal names from a document and reflect them in weighting.

However, names written in kanji, katakana, and the like have widely varying pronunciation characteristics, and variant character forms also exist. Because of the diversity of kanji readings, it is difficult to generate names written in kanji accurately from names written as alphabetic strings such as Roman letters.

In addition, when names written in kanji, katakana, and the like are input to an information processing apparatus, a name that looks similar may be entered by mistake. When OCR or the like is used for input, a written kanji may be misconverted by the OCR into a similar kanji and thus entered incorrectly. The techniques of the patent and non-patent documents above do not efficiently generate the names that can derive from a name written in kanji while allowing for transcription errors, clerical errors, misrecognition, and the like.

In this regard, Patent Document 4 can correct the reading and kanji of a name to the correct ones, but it is not aimed at searching names themselves. Nor does Patent Document 4 perform processing that generates the maximum-likelihood kanji estimated from alphabetic characters, or address the generation of name candidates that have the same characters with different readings or that have similar pronunciation characteristics.

For Japanese names in particular, about 150,000 surnames are known and at least 300,000 to 400,000 given names are known to exist, with diverse phonetic characteristics, so searching the surname and given name together is also a problem in terms of processing efficiency. Moreover, when the surname and given name are searched at the same time, the generation of possible surname and given-name combinations is limited for names that may contain transcription or clerical errors in the input kanji string or OCR misrecognition.

Patent Document 1: US Pat. No. 6,963,871 B1
Patent Document 2: JP 2004-206473 A
Patent Document 3: JP 2004-295797 A
Patent Document 4: JP H7-230472 A
Non-Patent Document 1: http://publibfp.boulder.ibm.com/epubs/pdf/c1912860.pdf

Accordingly, there has been a need for an information processing apparatus, an information processing method, and a program that receive a name written in kanji, perform name search with high accuracy, and generate from the search results kanji name candidates that may correspond to the input name. In addition, from the viewpoint of international cooperation, for example against money laundering, when names written in the alphabet are searched together with kanji names, there has been a need to improve the integration of the search with name-candidate search in Indo-European name notation and to generate highly accurate kanji name candidates.

The present invention has been made in view of the problems of the prior art described above. It efficiently performs similarity search on names written in Japanese, Korean, or Chinese, covering both same-character, different-reading names and names with similar pronunciation characteristics, and generates candidates for the names that may arise from an input name with high accuracy and efficiency. In the present invention, name candidates expressed as alphabetic strings are generated automatically from a name input as a kanji string by referring to a kanji name/romaji name conversion dictionary. During automatic generation, an identifier for distinguishing the surname from the given name in the generated alphabetic string is produced. This enables surname search and given-name search that distinguish the name order of the CJK cultural sphere from that of the Indo-European sphere.

The generated alphabetic string is divided into search terms for the surname and the given name, which are used for two kinds of similarity search. The first is a same-character, different-reading search specific to kanji: the surname search term and the given-name search term are used directly to search the kanji name/alphabetic name conversion dictionary and obtain surname and given-name candidates that have the same characters but different readings. The second searches an alphabetic name dictionary with the surname and given-name search terms. The search of the alphabetic name dictionary is a similarity search based on pronunciation similarity, that is, on differences in pronunciation characteristics such as the types of vowels and consonants and the word length.
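The second kind of search can be illustrated with a minimal sketch. The source does not specify how pronunciation similarity is scored, so this example substitutes the generic string similarity ratio from Python's `difflib` as a stand-in; the dictionary contents, the function name `similar_names`, and the threshold of 0.75 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the alphabetic name dictionary.
ALPHABETIC_NAMES = ["SAITO", "SAIDO", "SATO", "KATO", "TAKAHASHI"]


def similar_names(term, names, threshold=0.75):
    """Return the names whose similarity to `term` meets the threshold, best first.

    SequenceMatcher.ratio() is a generic string similarity in [0, 1]; it is
    used here only as a placeholder for a pronunciation-based measure.
    """
    scored = [(SequenceMatcher(None, term, name).ratio(), name) for name in names]
    kept = [(score, name) for score, name in scored if score >= threshold]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in kept]


if __name__ == "__main__":
    print(similar_names("SAITO", ALPHABETIC_NAMES))
```

A real system would presumably weight vowel and consonant substitutions differently, as the description suggests, rather than treating every character edit alike.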

The information processing apparatus then receives the search results and uses them as search terms for a reverse search of the kanji name/romaji name conversion dictionary, generating surname candidates and given-name candidates with similar pronunciation characteristics. In the present invention, the term "similar reverse search" means a search process that looks up, from an alphabetic string, the kanji corresponding to that alphabetic string. The term "reverse search" means the process of looking up kanji from an alphabetic string.
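A reverse search of this kind can be sketched as an inverted index over a kanji-to-romaji table. The three dictionary entries and the names `build_reverse_index` and `reverse_search` are illustrative assumptions, not the contents or interface of the dictionary described here.

```python
from collections import defaultdict

# Illustrative entries only: each kanji given name maps to its possible readings.
KANJI_TO_ROMAJI = {
    "裕子": ["YUKO", "HIROKO"],
    "優子": ["YUKO"],
    "弘子": ["HIROKO"],
}


def build_reverse_index(kanji_to_romaji):
    """Invert the table so that each romaji reading maps to the kanji that can produce it."""
    index = defaultdict(set)
    for kanji, readings in kanji_to_romaji.items():
        for reading in readings:
            index[reading].add(kanji)
    return index


REVERSE_INDEX = build_reverse_index(KANJI_TO_ROMAJI)


def reverse_search(romaji):
    """Return the kanji candidates registered for a romaji string (empty list if none)."""
    return sorted(REVERSE_INDEX.get(romaji, set()))


if __name__ == "__main__":
    # Forward lookup gives the same-character, different-reading variants of one name.
    print(KANJI_TO_ROMAJI["裕子"])
    # Reverse lookup gives every kanji name that shares a reading.
    print(reverse_search("YUKO"))
```

Combining the two directions expands one kanji name into its alternative readings and then into other kanji names that share those readings.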

In the present invention, the surname and given name are searched separately, so the search can absorb differences between cultural spheres and can be integrated with name search in the Indo-European sphere. The search results can also be used to generate name candidates covering the surname and given-name combinations that could plausibly arise from the input name. In addition, the present invention enables broad similarity search, particularly for Japanese names, whose pronunciation characteristics are diverse.

The information processing apparatus of the present invention can be configured as a server connected to a network. The present invention also provides a computer-executable information processing method and program for performing the processing described above.

As described above, the present invention can provide an information processing apparatus, an information processing method, and a program that receive a name written in kanji, perform name search with high accuracy, generate from the search results kanji name candidates that may correspond to the input name in terms of same-character, different-reading variants and pronunciation similarity, and can be integrated with name search in the Indo-European sphere.

The present invention is described below with reference to the embodiments shown in the drawings, but it is not limited to those embodiments.

FIG. 1 shows an embodiment of an information processing system 100 that includes the information processing apparatus 130 of this embodiment. The information processing system 100 includes a network 120 and a plurality of clients 112 that are connected to the network 120, are operated by users, and access the information processing apparatus 130 via the network 120.

The information processing apparatus 130 can be a server machine, configured as a web server, an SNS server, a dedicated name-search server, or the like. In this embodiment the network 120 preferably includes a network such as the Internet, but it may also include a WAN (Wide Area Network), a LAN (Local Area Network), or the like.

The information processing apparatus 130 also manages a plurality of databases in order to generate names written in kanji. The first database 140 is a kanji name/romaji name database. It is used when the information processing apparatus 130 receives a name written in kanji and converts the kanji string that makes up the name into an alphabetic string. The second database 150 is an alphabetic name database. It performs similarity search using the pronunciation similarity of kanji names converted to alphabetic characters and returns alphabetic strings for surnames and given names to the information processing apparatus 130 as search results, according to their degree of match with the input alphabetic string.

In this embodiment, kanji means the characters used in languages written with kanji, typified by Japanese, Chinese, and Korean, which are collectively referred to as CJK.

An alphabetic string is, for Japanese, a string of alphabetic characters that reproduces the pronunciation of kanji, written in the Hepburn system, the Kunrei system, or the like; it is referred to below as romaji. This embodiment is also applicable to Chinese and Korean, using alphabetic strings that reproduce the pronunciation of ideographs such as kanji. Although the embodiment applies to names across the kanji cultural sphere designated by CJK, for convenience the description below assumes that a kanji name is converted to romaji and then converted back to kanji names.

The information processing apparatus 130 described above can use a CISC-architecture microprocessor (MPU) such as a PENTIUM (registered trademark) or PENTIUM (registered trademark) compatible chip, or a RISC-architecture microprocessor such as a POWERPC (registered trademark). It is controlled by an operating system such as WINDOWS (registered trademark) 200X, UNIX (registered trademark), or LINUX (registered trademark). It processes requests from the clients 112 by running server programs such as CGI programs, servlets, APACHE, or IIS, implemented in a programming language such as C++, JAVA (registered trademark), JAVA (registered trademark) BEANS, PERL, or RUBY.

The client 112 can be implemented using a personal computer, a workstation, or the like, and its microprocessor may be any known single-core or multi-core processor. The client 112 is controlled by an operating system such as WINDOWS (registered trademark), UNIX (registered trademark), LINUX (registered trademark), or MAC OS. The client 112 accesses the information processing apparatus 130 to issue requests, obtain information, and so on, either over an HTTP transaction infrastructure using browser software such as Internet Explorer (trademark), Mozilla, Opera, or Netscape Navigator (trademark), or over a transaction infrastructure such as CORBA (Common Object Request Broker Architecture) using a dedicated client program.

FIG. 2 is a functional block diagram showing the software modules 200 of the information processing apparatus 130 of this embodiment. The software modules 200 are realized as functional units on the information processing apparatus when executable objects stored on a hard disk are loaded into an execution space such as the RAM of the information processing apparatus 130 and executed by the microprocessor.

As shown in FIG. 2, the information processing apparatus 130 receives a name written in kanji sent over the network 120 and passes it to the network adapter 210, which includes a NIC (network interface card) or the like and provides the physical-layer and data-link-layer functions of the OSI reference model. The name received by the network adapter 210 is sent to the software modules, which are implemented as a server program such as a CGI program or a servlet.

The name is first sent to the alphabetic string generation unit 220. The detailed configuration and processing of the alphabetic string generation unit 220 are described in the specification of Japanese Patent Application No. 2008-117538 (attorney docket number JP9080054); an outline of the processing follows.

The alphabetic string generation unit 220 includes a character normalization unit, which normalizes the kanji string that makes up the name. In this embodiment, "normalization" means unifying variant character forms into a single form that the information processing apparatus 130 can use consistently in subsequent processing. For example, for the Japanese name 齋藤 (Saito), the character 齋 and its variants 斎 and 齊 are unified into 斉. Similarly, 籐, a variant of 藤, is unified into 藤.
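The normalization step described above can be sketched as a character-level mapping. The table below contains only the four variant characters named in this description; the names `VARIANT_TO_CANONICAL` and `normalize_name` are hypothetical, and a real table would cover far more characters.

```python
# Variant forms named in the description, mapped to the unified form.
VARIANT_TO_CANONICAL = {
    "齋": "斉",
    "斎": "斉",
    "齊": "斉",
    "籐": "藤",
}

# str.maketrans accepts a dict whose keys are single characters.
_TRANSLATION_TABLE = str.maketrans(VARIANT_TO_CANONICAL)


def normalize_name(kanji_name):
    """Replace each known variant character with its unified form; leave others unchanged."""
    return kanji_name.translate(_TRANSLATION_TABLE)


if __name__ == "__main__":
    for name in ("齋藤", "斎籐", "高橋"):
        print(name, "->", normalize_name(name))
```

Normalizing before any dictionary lookup means that character-code comparison no longer misses names that differ only in variant forms.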

After normalizing the name, the alphabetic string generation unit 220 calls the morphological analysis unit, which searches the morpheme dictionary 280 and generates alphabetic strings corresponding to the normalized kanji string. The morpheme dictionary 280 can be implemented as a set of tables in the first database 140. To support this processing, it includes a table that registers kanji (strings) together with the corresponding alphabetic strings. The morpheme dictionary 280 also registers, for each morpheme, an associated attribute: surname, given name, or attached word such as さん (san) or 殿 (dono).

The combination candidate creation unit receives the combination information generated by the morphological analysis unit and analyzes the connections between the attribute identifiers attached to surnames and given names. In this analysis, it extracts the attribute identifiers attached to consecutive surname and given-name morphemes and determines a connection identifier from their sequence. The connection identifier is determined by referring to the attribute identifier attached to each morpheme. A different score value is assigned depending on whether the morpheme sequence is surname-given name (SG), surname-surname (SS), given name-surname (GS), or given name-given name (GG), and the score is also used as an indicator of the validity of a combination candidate.

The combination candidate creation unit creates, for every generated morpheme sequence, a combination candidate list whose records consist of the fields (morpheme, connection identifier, morpheme, connection identifier, ...), with each connection identifier, which characterizes the connection between morphemes, inserted between the morphemes it connects. The list is generated so that the morphemes and connection identifiers for one name candidate form one record. From the generated surname and given-name candidates, the combination candidate creation unit also identifies the cultural sphere to which the name belongs and generates a cultural-sphere weighting that identifies whether the name combination is, for example, Japanese, Korean, or Chinese within CJK.
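The record layout can be sketched as follows, assuming each morpheme arrives as a pair of its alphabetic string and an attribute identifier, "S" for surname or "G" for given name. The pair representation and the function name `build_candidate_record` are assumptions for illustration; only the SG, SS, GS, and GG identifiers come from the description.

```python
def build_candidate_record(morphemes):
    """Interleave morphemes with connection identifiers.

    `morphemes` is a list of (alphabetic string, attribute) pairs, where the
    attribute is "S" (surname) or "G" (given name). The connection identifier
    between two neighbours is the concatenation of their attributes.
    """
    record = []
    for position, (text, attribute) in enumerate(morphemes):
        record.append(text)
        if position + 1 < len(morphemes):
            next_attribute = morphemes[position + 1][1]
            record.append(attribute + next_attribute)
    return record


if __name__ == "__main__":
    # CJK order: surname followed by given name.
    print(build_candidate_record([("SAITO", "S"), ("YUKO", "G")]))
    # Indo-European order: given name followed by surname.
    print(build_candidate_record([("YUKO", "G"), ("SAITO", "S")]))
```

Each returned list corresponds to one record of the combination candidate list, one record per name candidate.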

The information processing apparatus 130 thereafter uses the connection identifier generated above as an identifier for distinguishing the surname from the given name and adds it at the appropriate position in the alphabetic string. The identifier indicates which part of the alphabetic string sequence is the surname and which part is the given name. It is used for the separate surname and given-name searches and for specifying the surname search term and the given-name search term when searching the alphabetic name dictionary 290-2. In this embodiment the connection identifier is described as a value such as SG that indicates the order of surname and given name, but any identifier, such as a space character, can be used as long as it separates the surname sequence from the given-name sequence.
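As a minimal sketch of how the identifier can yield separate search terms, the example below assumes the simplest case of a three-field record (string, identifier, string) and handles only the SG and GS identifiers. The function name `split_search_terms` and the dictionary keys are hypothetical.

```python
def split_search_terms(record):
    """Return the surname and given-name search terms from a three-field record.

    `record` is assumed to be [first string, connection identifier, second string],
    where the identifier "SG" means surname then given name and "GS" means
    given name then surname.
    """
    first, identifier, second = record
    if identifier == "SG":
        return {"surname": first, "given": second}
    if identifier == "GS":
        return {"surname": second, "given": first}
    raise ValueError(f"unsupported connection identifier: {identifier!r}")


if __name__ == "__main__":
    print(split_search_terms(["SAITO", "SG", "YUKO"]))
    print(split_search_terms(["YUKO", "GS", "SAITO"]))
```

Both calls produce the same search terms, which is how the identifier lets one search handle either name order.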

The alphabetic string generation unit 220 also includes a score calculation unit. For a given record in the name candidate list, the score calculation unit obtains the values of the fields that are not NULL, obtains the score value assigned to each of those field values, and sums the score values obtained for the non-NULL fields of the record being processed.

The alphabetic string generation unit 220 then extracts from the name candidate list the name candidate with the smallest total score as the maximum-likelihood combination candidate. Referring to a notation conversion dictionary that registers, for each kanji, the alphabetic notation closest to its pronunciation, it assigns the alphabetic notation registered for the maximum-likelihood candidate to the surname and the given name and outputs the result as the maximum-likelihood alphabetic notation of the name candidate.
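The scoring and selection steps can be sketched as below. The description states only that different sequences receive different scores and that the smallest total wins, so the numeric values in `FIELD_SCORES`, the use of `None` for NULL fields, and the function names are assumptions made for illustration.

```python
# Hypothetical score table: lower means a more plausible sequence.
FIELD_SCORES = {"SG": 0, "GS": 1, "SS": 2, "GG": 2}


def total_score(record, field_scores, default=0):
    """Sum the scores assigned to the non-NULL (non-None) field values of one record."""
    return sum(field_scores.get(value, default) for value in record if value is not None)


def most_likely_candidate(records, field_scores):
    """Return the record with the smallest total score."""
    return min(records, key=lambda record: total_score(record, field_scores))


if __name__ == "__main__":
    candidates = [
        ["SAITO", "SG", "YUKO", None, None],    # one surname, one given name
        ["SAI", "SS", "TO", "SG", "YUKO"],      # surname split into two morphemes
    ]
    for record in candidates:
        print(record, total_score(record, FIELD_SCORES))
    print(most_likely_candidate(candidates, FIELD_SCORES))
```

Field values without an entry in the table, such as the morpheme strings here, contribute the default score of zero in this sketch.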

The modules of FIG. 2 are described further. The alphabet character string generation unit 220 sends the surname and the given name to the similar reverse search unit 240. The similar reverse search unit 240 receives the names output by the alphabet character string generation unit 220 directly and performs a similar reverse search of the name conversion dictionary 290-1. This similar reverse search is performed so that names written with the same kanji but read differently (same-character different-reading names) can be found for the alphabetic character strings of the generated combination candidates. The same-character different-reading reverse search is particularly useful for given names, whose readings vary widely. The same-character different-reading names that are found are sent to the alphabet character string search unit 230, added as same-character different-reading given-name candidates, and used as search terms of the alphabet character string search unit 230. In another embodiment, the same-character different-reading reverse search may be performed only for the given name, and the surname may be sent directly to the alphabet character string search unit 230 so that only the phonetic similarity search is performed for it.

The similar reverse search unit 240 sends the same-character different-reading names it has found to the alphabet character string search unit 230 together with the given name received from the alphabet character string generation unit 220. The alphabet character string search unit 230 then sets the received surname as the surname search term and sets the given names, including the same-character different-reading given names, as given-name search terms. The alphabet character string search unit 230 then searches the alphabet name dictionary 290-2 and retrieves the alphabetic character strings that match or are similar to the alphabetic character string being searched for.

Whether two alphabetic character strings are similar can be judged from predetermined characteristic values of the search terms, for example the proportion of matching consonants or vowels with similar pronunciation, or the proportion of matching characters in the surname or given-name strings. These characteristic values are referred to collectively as pronunciation characteristics, and in this embodiment similarity of pronunciation characteristics is referred to as phonetic similarity. The similarity determination itself is not the subject of the present invention, so a more detailed description is omitted.
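As a rough illustration of a character-string match ratio, the sketch below uses Python's standard `difflib.SequenceMatcher`. This is a stand-in chosen for the example, not the similarity measure of the patent, which leaves the determination unspecified; the function names and the 0.8 threshold are likewise assumptions.

```python
from difflib import SequenceMatcher

def string_match_ratio(a: str, b: str) -> float:
    """Proportion of matching characters between two romanized names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_phonetically_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as similar when their match ratio reaches the threshold."""
    return string_match_ratio(a, b) >= threshold
```

With this stand-in, "takahashi" and "takanashi" differ in one character and count as similar, while "takahashi" and "matsui" do not. A production system would more likely compare phonetic features than raw characters.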

The search by the alphabet character string search unit 230 is performed mainly to carry out the phonetic similarity search. In this embodiment, the alphabet character string search unit 230 can use the Global Name Recognition system from International Business Machines Corporation, which implements Global Name Analytics.

In the embodiment shown in FIG. 2, the alphabet character string search unit 230 is shown as a module of the information processing apparatus 130. In other embodiments, however, the alphabet character string search unit 230 can be separated out functionally, for example as a separate server apparatus.

The phonetic similarity search and the same-character different-reading search are described in detail below. The search by the alphabet character string search unit 230 returns, for example, the alphabetic character strings "takahashi" and "takanashi", which have close pronunciation characteristics, and thereby makes it possible to extract names with similar pronunciation such as 「高橋」 (Takahashi) and 「高梨」 (Takanashi). For given names, the alphabet character string search unit 230 likewise returns results such as "Takeshi" and "Takashi", and the similar reverse search extracts names that share a reading with a search result but are written with different characters, such as 「健」, 「剛志」, and 「武」. As described above, the similar reverse search unit 240 thus makes it possible to extract similar names that include surname candidates and given-name candidates registered as phonetically similar, beyond the terms that the alphabet character string search unit 230 used as search terms, which widens the range of full-name candidates.

The same-character different-reading search, on the other hand, is performed when the similar reverse search unit 240 receives the alphabetic character string corresponding to the given name from the alphabet character string generation unit 220, and it is applied in particular to generating candidates for given names, whose readings vary widely. For example, if the input given name is 「恭子」, the similar reverse search unit 240 first obtains "kyouko" and then reverse-references the name conversion dictionary 290-1 to check whether the same kanji is registered with other readings. In this example, "yasuko" is also registered for 「恭子」 = "kyouko", so "yasuko" is returned as a search result. The similar reverse search unit 240 sends the values "kyouko" and "yasuko" to the alphabet character string search unit 230 so that the phonetic similarity search is also performed for the same-character different-reading given name, which expands the set of given-name candidates.
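The reverse lookup for alternative readings can be sketched as follows. The table `NAME_TABLE` is a toy stand-in for the name conversion dictionary 290-1; its layout (kanji mapped to a list of readings) and its contents are assumptions made for illustration.

```python
# Toy stand-in for the name conversion dictionary 290-1: kanji -> registered readings.
NAME_TABLE = {
    "恭子": ["kyouko", "yasuko"],
    "今日子": ["kyouko"],
    "康子": ["yasuko"],
}

def alternative_readings(reading: str) -> list[str]:
    """Find kanji registered with `reading`, then collect their other readings."""
    found: list[str] = []
    for readings in NAME_TABLE.values():
        if reading in readings:
            for other in readings:
                if other != reading and other not in found:
                    found.append(other)
    return found

# Search terms passed on to the phonetic similarity search.
search_terms = ["kyouko"] + alternative_readings("kyouko")
```

For the input "kyouko" the lookup finds 「恭子」, which is also registered as "yasuko", so both readings become search terms.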

The similar reverse search unit 240 receives the search results generated by the alphabet character string search unit 230, performs a reverse search of the name conversion dictionary 290-1 again, and generates surname candidates and given-name candidates as the corresponding kanji strings.

The similar reverse search unit 240 sends its search results to the kanji string output unit 250. In the illustrated embodiment, the kanji string output unit 250 generates an output result 260 from the output of the similar reverse search unit 240, consisting of the kanji (string), the alphabetic notation, the corresponding reading, and so on. The output result 260 shown in FIG. 2 is created and output as a table 270 whose records hold the kanji string of each name candidate, its alphabetic notation, and the corresponding reading.

The kanji string output unit 250 also receives the surname candidates and the given-name candidates generated by the alphabet character string generation unit 220, forms their union, and creates full-name candidates by associating the surname candidates with the given-name candidates. The resulting full-name candidates are ranked using, for example, how commonly each surname and given name is used, and can be listed in descending order of rank.
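A minimal sketch of the union, pairing, and ranking step follows. The function name, the `commonness` table, and its values are hypothetical, and multiplying the two commonness values is only one possible ranking rule; the text says no more than that usage rates can be used.

```python
from itertools import product

def build_full_name_candidates(surname_lists, given_lists, commonness):
    """Union the candidate lists, pair every surname with every given name,
    and sort the pairs from most to least common."""
    surnames = set().union(*surname_lists)
    givens = set().union(*given_lists)

    def rank(pair):
        surname, given = pair
        return commonness.get(surname, 0.0) * commonness.get(given, 0.0)

    # Sort by descending rank; break ties on the pair itself for a stable order.
    return sorted(product(surnames, givens), key=lambda p: (-rank(p), p))

# Hypothetical usage rates, invented for the example.
commonness = {"高橋": 0.9, "高梨": 0.2, "恭子": 0.5, "康子": 0.6}
ranked = build_full_name_candidates(
    [["高橋", "高梨"], ["高橋"]],   # surname candidates from two searches
    [["恭子"], ["康子", "恭子"]],   # given-name candidates from two searches
    commonness,
)
```

Duplicates across the two searches collapse in the union, so the example yields four full-name candidates.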

FIG. 3 shows an embodiment of the conversion tables 300 and 310 included in the name conversion dictionary 290-1 of this embodiment. The conversion table 300 is used for converting given names, and the conversion table 310 is used for converting surnames. In the conversion tables 300 and 310, each kanji notation is registered in association with its corresponding alphabetic notations. When converting kanji to alphabetic notation, the alphabet character string generation unit 220 uses the conversion tables 300 and 310 to obtain the alphabetic character strings and generates full-name candidates.

Using the embodiment shown in FIG. 3 as an example, suppose the similar reverse search unit 240 has obtained "Takeshi" when reverse-searching a given name. It reverse-references the given-name conversion table 300, finds the kanji notations for which "Takeshi" is registered, obtains "Ken" and "Tsuyoshi", and sends "Ken" and "Tsuyoshi" to the alphabet character string search unit 230. The conversion table 310 shown in FIG. 3 is used for the reverse conversion of surnames, that is, to generate the corresponding surname candidates from the phonetic similarity search results of the alphabet character string search unit 230.

As shown in FIG. 3, the name conversion dictionary 290-1 also includes an attention character list 320. The attention character list 320 is referenced only in the similar reverse search that obtains same-character different-reading names. When a retrieved surname candidate or given-name candidate contains a character that appears in the attention character list 320, the similar reverse search unit 240 creates an additional surname or given-name candidate in which that character is replaced with its attention character, and passes it to the kanji string output unit 250 as an additional search result. The surname and given-name candidates created by this replacement are added to the appropriate records of the output result 260 and output as the table 270 or the like.
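The attention character replacement can be sketched as below. The pairs 明/朋 and 助/肋 are the visually similar kanji that the description itself gives as examples; the dictionary layout and the function name are assumptions made for illustration.

```python
# Visually similar kanji pairs cited in the description as examples.
ATTENTION_CHARS = {"明": "朋", "朋": "明", "助": "肋", "肋": "助"}

def attention_variants(name: str) -> list[str]:
    """Return extra candidates in which one listed character is swapped
    for its look-alike; names without listed characters yield nothing."""
    variants = []
    for i, ch in enumerate(name):
        if ch in ATTENTION_CHARS:
            variants.append(name[:i] + ATTENTION_CHARS[ch] + name[i + 1:])
    return variants
```

For example, a candidate 「明子」 would also produce 「朋子」 as a reference candidate, which is useful when the input may come from OCR.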

FIG. 4 shows a flowchart of the full-name search method of this embodiment. The process of FIG. 4 starts at step S400, and in step S401 the kanji for a full name are acquired. In step S402, the similarity search process is executed. The similarity search process of step S402 performs both searches of this embodiment: the search for same-character different-reading name candidates and the search for phonetically similar name candidates.

In step S403, it is determined whether any full names remain to be processed. If names remain, for example in a request queue (yes), the process branches back to step S401 and repeats until no unprocessed names remain. If it is determined in step S403 that no names remain (no), the process branches to step S404, generates the result output, and ends.
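The loop of FIG. 4 can be sketched as follows. The function `process_requests` and its `similarity_search` callback are hypothetical names; the sketch checks the queue before taking each name, which is a slight simplification of the flowchart order (acquire, search, then test for remaining names).

```python
from collections import deque

def process_requests(queue: deque, similarity_search) -> list:
    """Process queued names until none remain, then return the collected results."""
    results = []
    while queue:                                          # S403: names remaining?
        name = queue.popleft()                            # S401: acquire the kanji name
        results.append((name, similarity_search(name)))   # S402: similarity search
    return results                                        # S404: generate result output

# Usage with a placeholder search that simply echoes the name.
pending = deque(["高橋恭子", "松井秀喜"])
output = process_requests(pending, lambda name: [name])
```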

When the result output is generated, the original kanji that the alphabet character string generation unit 220 normalized may be stored. In the result output, the corresponding kanji of the surname or given name may then be replaced with the pre-normalization kanji for display, or the names containing the kanji before and after normalization may be displayed side by side, to make the search results easier to recognize.

FIG. 5 shows a flowchart of the full-name search process of this embodiment, including the similar reverse search process. The process of FIG. 5 starts at step S500. In step S501, an alphabetic character string is generated from the kanji, and the search terms for the surname and the given name are set by referring to the name sequence identifier.

In step S502, the similar reverse search unit 240 performs the similar reverse search for names (same-character different-reading search) using the surname search term and the given-name search term directly. In step S503, it generates the same-character different-reading candidates corresponding to the acquired given name and sends them to the alphabet character string search unit 230 together with the given-name search term. In step S504, the alphabet name dictionary 290-2 is searched for surname candidates and given-name candidates, including phonetically similar ones, and step S505 waits for the search to finish. If step S505 determines that the search has finished (yes), the full-name candidates are output in step S506. If the search has not finished (no), the process branches back to step S504 and the search continues until no surname or given-name candidates remain.

In step S506, the kanji string output unit 250 receives the results of each search and generates the search result. In step S507, the table 270 is created to hold and output the results produced by the same-character different-reading search and the phonetic similarity search, and the process ends in step S508. The table 270 is created by forming a union so that surname candidates and given-name candidates that appear in the results of more than one similarity search are not duplicated. Given-name candidates are then registered in association with surname candidates, which produces search results in table or list form. At this point, replacing or sorting so that surnames or given names containing the pre-normalization characters appear first makes it easier to recognize matches in the results.
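The deduplication and the "original spelling first" ordering can be sketched as follows. The function name is hypothetical, and treating 「髙橋」 as the pre-normalization form of 「高橋」 is an assumption made only for the example.

```python
def order_with_original_first(candidates: list[str], original: str) -> list[str]:
    """Drop duplicate candidates, then move the pre-normalization spelling to the top."""
    unique = list(dict.fromkeys(candidates))        # order-preserving deduplication
    # sorted() is stable, so the remaining candidates keep their relative order.
    return sorted(unique, key=lambda c: c != original)

merged = order_with_original_first(["高梨", "髙橋", "高橋", "高梨"], original="髙橋")
```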

The process shown in FIG. 5 searches using the name sequence identifier that distinguishes the surname from the given name. There are two reasons. First, where the number of distinct surnames and the number of distinct given names differ greatly, as in Japanese, separating the searches makes them more efficient. Second, the alphabet character string search unit 230 can absorb differences between cultural spheres and use the result of the automatic surname and given-name combination processing from kanji directly, so that a global name search can be run automatically together with names from the Indo-European language sphere.

FIG. 6 shows an embodiment of the given-name candidates 600 generated by given-name generation in the full-name search of this embodiment. In the embodiment shown in FIG. 6, 「恭子」 is input as the given name 610. The alphabet character string generation unit 220 acquires the input given name 610, determines its cultural sphere, and generates the alphabetic character string "kyoko" 620. The alphabet character string search unit 230, having received "kyoko" 620, obtains through the reverse search of the similar reverse search unit 240 the alphabetic character string "yasuko" 630, which is a different reading of the same characters as "kyoko" 620. The alphabet character string search unit 230 also performs the phonetic similarity search for "kyouko" 620 and "yasuko" 630 and returns alphabetic character strings such as "kiyoko" 640 and "yatsuko" 650 as search results.

The similar reverse search unit 240 performs the similar reverse conversion for each of "kyoko" 620, "yasuko" 630, "kiyoko" 640, and "yatsuko" 650, and generates a given-name candidate set 670 containing 「今日子」 and 「杏子」, a set 660 containing 「康子」 and 「八洲子」, a set 680 containing 「喜世子」 and 「喜洋子」, and a set 690 containing 「弥津子」 and 「八津子」, among others. These given-name candidate sets are registered in the given-name field of the table 270 and, combined with each of the surname candidate sets, are registered as possible full-name candidates.

FIG. 7 shows an embodiment of the surname candidates 700 generated by the surname search in the full-name search of this embodiment. The alphabet character string generation unit 220 receives 「高橋」 as the surname 710 and sends the alphabetic character string "takahashi" 720 to the alphabet character string search unit 230. The alphabet character string search unit 230 searches for "takahashi" 720 and, through the phonetic similarity search, also obtains "takanashi" 730 and returns it as a search result.

The similar reverse search unit 240 performs a search for each of "takahashi" 720 and "takanashi" 730 and generates the surname candidate sets 740 and 750, respectively. These surname candidate sets 740 and 750 are written to the surname field of the table 270 and can be output as the table 270 together with the corresponding alphabetic notation. As a result, the names that could be derived from the input kanji string are output. An attention kanji list can also be added to the name conversion dictionary 290-1. For example, kanji that look alike, such as 「明」 and 「朋」 or 「助」 and 「肋」, can be registered, and when either one of a pair is retrieved, the other can be added to the output data as a reference name. The attention kanji list is a suitable embodiment for flagging possible misreads when the input comes from OCR or the like.

FIG. 8 shows an embodiment of a graphical user interface (GUI) that displays the output of the alphabet character string search unit 230 in this embodiment. In the GUI 800 of the embodiment shown in FIG. 8, the search results for the surname (SN) and the given name (GN) are displayed in a display window 810. The upper part of the display window 810 lists the surname (SurName: SN) search results, and the lower part lists the given-name (GivenName: GN) search results. In the embodiment shown in FIG. 8, the input is GN = hideki and SN = matsui, and the search results similar to the input name are displayed as SN and GN.

The search results shown in FIG. 8 are sent to the similar reverse search unit 240, which reverse-searches for the names written in kanji. In the embodiment shown in FIG. 8, 15 surnames and 15 given names are displayed in order of phonetic similarity. The variety of similar names generated can be adjusted, however, for example by passing only the top three surnames and given names by similarity to the similar reverse search unit 240.
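Limiting the hand-off to the most similar results can be sketched with the standard `heapq.nlargest`. The similarity values below are invented for illustration; FIG. 8 itself is not reproduced in this text.

```python
import heapq

def top_k(scored_names: list[tuple[str, float]], k: int = 3) -> list[str]:
    """Keep only the k names with the highest similarity score, best first."""
    best = heapq.nlargest(k, scored_names, key=lambda item: item[1])
    return [name for name, _ in best]

# Hypothetical (name, similarity) pairs for a surname search.
surname_hits = [("matsui", 1.00), ("matsue", 0.91), ("matsuo", 0.88),
                ("matsuki", 0.85), ("masui", 0.80)]
to_reverse_search = top_k(surname_hits, k=3)
```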

As described above, this embodiment takes a name obtained in kanji, separates the surname and the given name in the alphabetic character string, and performs the name search in a way that includes given names with many possible readings. A highly accurate name search can therefore be performed regardless of differences in kanji glyph forms. In addition, because an identifier of the name sequence appropriate to the cultural sphere can be included when the alphabet character string search unit 230 performs the name search, the embodiment avoids problems such as searching given-name data with a surname candidate, can be integrated into a global name search system, and can perform searches with higher accuracy and efficiency.

Furthermore, searching separately by surname and given name allows parallel searches that match the number of distinct surnames and the number of distinct given names. It also generates more names similar to the input kanji name than a search on the full name as a whole would, so that kanji names can be generated that account for possible transcription errors, clerical errors, and misrecognition.

The functions of this embodiment can be implemented by an apparatus-executable program written in an object-oriented programming language or the like, such as C++, Java (registered trademark), Java (registered trademark) Beans, Java (registered trademark) Applet, Java (registered trademark) Script, Perl, or Ruby. The program can be stored and distributed on an apparatus-readable recording medium such as a hard disk drive, CD-ROM, MO, flexible disk, EEPROM, or EPROM, and can be transmitted over a network in a format that other apparatuses can handle.

Although the present embodiment has been described above, the present invention is not limited to that embodiment. It can be modified within the range that a person skilled in the art could conceive of, including other embodiments, additions, changes, and deletions, and any such form falls within the scope of the present invention as long as it provides the operation and effects of the present invention.

FIG. 1 is a functional block diagram of the information processing system of this embodiment.

FIG. 2 is a functional block diagram showing the software modules of the information processing apparatus of this embodiment.

FIG. 3 is a diagram showing an embodiment of the registration table contents of the name conversion dictionary of this embodiment.

FIG. 4 is a flowchart of the information processing method for performing the full-name search of this embodiment.

FIG. 5 is a flowchart of an embodiment of the similarity search process, including the similar reverse search, of this embodiment.

FIG. 6 is a diagram showing an embodiment of the given-name candidates generated by the full-name search of this embodiment.

FIG. 7 is a diagram showing an embodiment of the surname candidates generated by the surname search in the full-name search of this embodiment.

FIG. 8 is a diagram showing an embodiment of a graphical user interface (GUI) that displays the output of the alphabet character string search unit in this embodiment.

Explanation of symbols

100: information processing system, 112: client, 120: network, 130: information processing apparatus, 140: kanji name / romaji name database, 150: alphabet name database, 200: software modules, 210: network adapter, 220: alphabet character string generation unit, 230: alphabet character string search unit, 240: similar reverse search unit, 250: kanji string output unit, 260: output result, 270: table, 280: morpheme dictionary, 290-1: name conversion dictionary, 290-2: alphabet name dictionary, 300: conversion table, 310: conversion table, 320: attention character list

Claims

An information processing apparatus for generating full-name candidates similar to a full name (surname and given name) expressed in kanji, the information processing apparatus comprising:
an alphabet character string generation unit that receives the full name, searches a first database, and generates an alphabetic character string corresponding to the full name together with an identifier that distinguishes the surname from the given name;
an alphabet character string search unit that receives the alphabetic character string, distinguishes the surname from the given name, searches a second database separately with a surname search term and a given-name search term, and obtains phonetically similar search results for each of the surname search term and the given-name search term, including same-character different-reading terms given as alphabetic character strings;
a similar reverse search unit that uses the surname search term and the given-name search term of the alphabetic character string to return, to the alphabet character string search unit, same-character different-reading surname and given-name search terms for names written as kanji strings, obtains the search results, and performs a similar reverse search of the first database to generate surname candidates and given-name candidates; and
a kanji string output unit that outputs, from the generated surname candidates and given-name candidates, full-name candidates similar to the full name.

The information processing apparatus according to claim 1, wherein the first database includes a kanji name / alphabet name conversion dictionary, and the similar reverse search unit performs a reverse search of the first database by directly using the surname search term and the given-name search term and acquires each of the same-character different-reading search terms.

The information processing apparatus according to claim 2, wherein the second database includes an alphabet name dictionary, and the alphabet character string search unit performs a phonetic similarity search on the alphabet name dictionary using the same-character different-reading surname search terms and given-name search terms.

The kanji string output unit generates a union from the last name candidate and the first name candidate generated by the similar search for the same-character different sound and the similar search for the phonetic similarity, and outputs the union as the last name candidate. The information processing apparatus described in 1.

An information processing method for generating first and last name candidates similar to a first and last name expressed in Chinese characters, the information processing method comprising:
Receiving the first and last name and searching a first database to generate an alphabet string corresponding to the first and last name, together with an identifier that distinguishes the surname from the first name;
Receiving the alphabet string, identifying the surname and the first name, separating a surname search term and a first name search term, and searching a second database to obtain pronunciation-similar search results for the surname search term and the first name search term, including the same-character, different-sound readings given by the alphabet string;
Using the surname search term and the first name search term of the alphabet string, obtaining the search results of the alphabet string search unit for each of the search terms, including the same-character, different-sound surname and first name search terms written as kanji strings, and performing a similar reverse search on the first database to generate surname candidates and first name candidates; and
Outputting first and last name candidates similar to the first and last name from the generated surname candidates and first name candidates.
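
The sequence of steps recited above can be illustrated with a small sketch. This is not the claimed implementation: the dictionary contents, the function names, and the `phonetic_key` normalization are all assumptions introduced for the example, and the input is simplified to an already separated surname and first name rather than a full name with an identifier.

```python
from itertools import product

# Hypothetical first database: kanji -> alphabetic readings, per name part.
KANJI_TO_ALPHA = {
    "surname": {"河野": ["kono", "kawano"], "川野": ["kawano"], "高野": ["kono", "takano"]},
    "first": {"太郎": ["taro"], "太朗": ["taro"], "多朗": ["taro"]},
}
# Hypothetical second database: alphabetic name dictionary, per name part.
ALPHA_NAMES = {
    "surname": ["kono", "kawano", "takano", "kohno"],
    "first": ["taro", "tarou", "jiro"],
}

def phonetic_key(alpha):
    """Crude, assumed stand-in for pronunciation similarity: fold long-o spellings."""
    return alpha.replace("ou", "o").replace("oh", "o")

def candidates_for(kind, kanji):
    # Step 1: kanji -> alphabetic readings, including different readings of the same characters.
    readings = KANJI_TO_ALPHA[kind].get(kanji, [])
    # Step 2: pronunciation-similar alphabetic strings from the second database.
    keys = {phonetic_key(r) for r in readings}
    similar = {a for a in ALPHA_NAMES[kind] if phonetic_key(a) in keys} | set(readings)
    # Step 3: reverse search of the first database for kanji sharing a similar reading.
    similar_keys = {phonetic_key(a) for a in similar}
    return sorted(
        k for k, rs in KANJI_TO_ALPHA[kind].items()
        if any(phonetic_key(r) in similar_keys for r in rs)
    )

def full_name_candidates(surname, first):
    # Step 4: combine surname candidates and first name candidates into full names.
    return [s + f for s, f in product(candidates_for("surname", surname),
                                      candidates_for("first", first))]

if __name__ == "__main__":
    for name in full_name_candidates("河野", "太郎"):
        print(name)
```

In this toy data, 川野 is reached through the shared reading "kawano" and 高野 through the shared reading "kono", which is the kind of same-character, different-sound expansion the method describes.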

The information processing method according to claim 5, wherein the first database includes a conversion dictionary between kanji names and alphabetic names, the method further comprising a step in which the similar reverse search unit performs a reverse search of the first database directly using the surname search term and the first name search term to retrieve each of the same-character, different-sound search terms.

The information processing method according to claim 6, wherein the second database includes an alphabetic first and last name dictionary, the method comprising a step in which the alphabet string search unit performs a phonetic similarity search on the alphabetic first and last name dictionary using the same-character, different-sound surname search term and first name search term.

The information processing method according to claim 7, wherein the step of outputting first and last name candidates similar to the first and last name comprises generating a union of the surname candidates and first name candidates produced by the same-character, different-sound similarity search and by the phonetic similarity search, and outputting the union.
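
The union of the two result sets described above can be sketched as follows. The function name `union_candidates` and the sample candidate lists are assumptions for illustration; the claims do not prescribe an ordering, so first-seen order is kept here merely for readability.

```python
def union_candidates(heteronym_hits, phonetic_hits):
    """Merge two candidate lists into a duplicate-free union, keeping first-seen order."""
    return list(dict.fromkeys([*heteronym_hits, *phonetic_hits]))

if __name__ == "__main__":
    # Hypothetical surname candidates from the two searches.
    from_heteronyms = ["河野", "川野"]
    from_phonetics = ["川野", "高野"]
    print(union_candidates(from_heteronyms, from_phonetics))
```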

An apparatus-executable program for generating first and last name candidates similar to first and last names expressed in Chinese characters, the program comprising:
Means for receiving the first and last name and searching a first database to generate an alphabet string corresponding to the first and last name, together with an identifier that distinguishes the surname from the first name;
Means for receiving the alphabet string, identifying the surname and the first name, separating a surname search term and a first name search term, and searching a second database to obtain pronunciation-similar search results for the surname search term and the first name search term, including the same-character, different-sound readings given by the alphabet string;
Means for using the surname search term and the first name search term of the alphabet string to obtain the search results of the alphabet string search unit for each of the search terms, including the same-character, different-sound surname and first name search terms written as kanji strings, and for performing a similar reverse search of the first database to generate surname candidates and first name candidates; and
A program which functions as means for outputting first and last name candidates similar to the first and last name from the generated surname candidates and first name candidates.