JP2010044597A - Information processor, information processing method, information processing system, and program

Info

Publication number: JP2010044597A
Application number: JP2008208297A
Authority: JP
Inventors: Tamayo Takagi; Tsuyoshi Fukuda; Kiyoshi Kumada; Koji Fukuda
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2008-08-13
Filing date: 2008-08-13
Publication date: 2010-02-25
Anticipated expiration: 2028-08-13
Also published as: JP5348964B2

Abstract

PROBLEM TO BE SOLVED: To estimate the similarity of phrases written in several types of characters.

SOLUTION: An information processing apparatus 110 includes: a character determination unit 118 that receives a character sequence representing a full name and determines the type of characters in the sequence; an other-notation acquisition unit 120 that generates, from the character sequence, other notations written in different types of characters, including ideograms or phonograms, and builds a notation vector containing at least two different types of full-name notation including phonograms; a similarity calculation unit 122 that applies a similarity determination suited to each character type and calculates scores measuring similarity for the elements of the notation vector; and a similarity score calculation unit 124 that calculates similarity scores for full-name candidates from the scores calculated by the similarity calculation unit 122.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a technique for judging the similarity of personal names, and more particularly to an information processing apparatus, an information processing method, an information processing system, and a program that judge the similarity of personal names written in Indo-European languages, Afro-Asiatic languages, Japanese, Chinese, Korean, and the like, regardless of the language.

In recent years, advances in transportation and network technology have made cross-border human activity increasingly common. When individuals or companies act across several countries, they carry out various procedures under their personal or corporate names. As international activity grows, it will become necessary to run searches on that information in countries other than the one where the procedure took place. One example is international money laundering: the identity of an individual who opened a bank account, sent a remittance, or made a withdrawal must be recognized from information shared among several countries, across the name notations of different cultural spheres, without losing the identity of the name.

Name search is not needed only for international activity. In Japan, for example, name-based record consolidation (so-called nayose), which must decide whether a bank account, a health insurance record, and so on refer to the same person, requires generating, accurately and efficiently, the similar names that can arise from an acquired name together with their likelihoods.

Names written in Indo-European and Afro-Asiatic languages, including English and languages written in Cyrillic or Greek script, use only phonograms such as alphabets. They can therefore be searched as character strings, and searching, including similarity search on strings, is comparatively easy. Methods that compare names written in an alphabet by computing a similarity score are already used for name search. One such system is the automated name searching system described in US Pat. No. 6,963,871 B1 (Patent Document 1).

In addition, the document at http://publibfp.boulder.ibm.com/epubs/pdf/c1912860.pdf (Non-patent Document 1) discloses the Global Name Analytics (GNA) system, which searches for personal names using the similarity of names written in Indo-European languages.

In cultural spheres that use ideograms, exemplified by CJKV (Chinese, Japanese, Korean, Vietnamese), variant character forms exist, and the same character can be read differently depending on the character sequence in which it appears. Automatically judging the similarity of names that contain ideograms with sufficient accuracy therefore raises problems that do not arise in name search for Indo-European languages. Examples include the following.

(a) In the CJKV cultural spheres there are old, new, and variant character forms that originally represent the same ideogram and share the same reading. For example, 「斉」, 「齊」, 「斎」, and 「齋」 can each appear in the ideographic notation of a name written with the same phonograms. As a result, neither the identity of the phonograms nor the identity or similarity of the ideograms can be judged automatically.

(b) In addition, when Japanese is written in Roman letters, the notation fluctuates: the phonogram string 「タロウ」 corresponds to "Taro", "Taroh", "Tarou", and so on. Kana notation fluctuates as well, as in 「トモロヲ」 and 「トモロオ」, or 「トモエ」, 「トモゑ」, and 「トモヱ」.

(c) Furthermore, large volumes of character data are now processed with optical reading systems such as OCR, which recognize characters automatically by bit-comparing an image from an image reader against character templates. Possible OCR misrecognition of visually similar characters, such as 「カ」 and 「ケ」, 「萩」 and 「荻」, or 「富」 and 「冨」, must therefore be taken into account. Transcription mistakes and clerical errors by staff involving such look-alike characters must also be considered. For these reasons it has been difficult to judge automatically, by computer, names in a language that has both ideograms and phonograms.
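The three problems listed above can be illustrated with a minimal Python sketch. The tables below contain only the examples quoted in the text, and the function names, the folding rules, and the choice of a canonical form are assumptions made for illustration; they are not the method claimed in this publication.

```python
import re

# Hypothetical tables holding only the examples quoted in the text.
# (a) Variant forms that the text treats as one class; the canonical pick is arbitrary.
VARIANT_CLASS = {"齊": "斉", "斎": "斉", "齋": "斉"}
# (b) Kana spellings that denote the same sound in a name.
KANA_FOLD = {"ヲ": "オ", "ヱ": "エ", "ゑ": "エ"}
# (c) Pairs of characters that OCR or a clerk may confuse.
OCR_CONFUSABLE = [("カ", "ケ"), ("萩", "荻"), ("富", "冨")]


def fold_variants(text: str) -> str:
    """Map every variant ideogram to one representative of its class."""
    return "".join(VARIANT_CLASS.get(ch, ch) for ch in text)


def fold_kana(text: str) -> str:
    """Collapse kana spelling fluctuations such as ヲ/オ."""
    return "".join(KANA_FOLD.get(ch, ch) for ch in text)


def fold_romaji(text: str) -> str:
    """Collapse long-o spellings ("ou", "oh") unless a vowel follows (lossy)."""
    return re.sub(r"o[uh](?![aiueo])", "o", text.lower())


def ocr_neighbors(text: str) -> set:
    """Return every string obtained by swapping one confusable character."""
    swap = {}
    for a, b in OCR_CONFUSABLE:
        swap[a], swap[b] = b, a
    return {text[:i] + swap[ch] + text[i + 1:]
            for i, ch in enumerate(text) if ch in swap}
```

Folding of this kind only reduces each problem to exact matching. It does not rank candidates, which is the gap the scoring approach described later in the document is meant to fill.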

Various techniques for searching names in kanji strings, for example in Japanese, are known. Japanese Unexamined Patent Publication No. H11-338859 (Patent Document 2) discloses a name input device that converts a name entered in furigana into Roman letters. The device comprises: a name dictionary that manages the correspondence between kanji spellings of names and their furigana, together with furigana attribute information indicating, for each furigana character, whether a prescribed romaji conversion must be applied; search means that searches the name dictionary for the names indicated by the entered furigana and obtains their furigana attribute information; selection means that selects one of the names found; and conversion means that converts the entered furigana into Roman letters while applying the prescribed romaji conversion as directed by the furigana attribute information of the selected name.

Japanese Unexamined Patent Publication No. 2000-251017 (Patent Document 3) describes a word dictionary creation device comprising: first storage means storing "name" strings that represent the names of organizations or individuals; second storage means storing "surname" strings and "given name" strings of personal names together with type data indicating their type; third storage means storing "general name" strings together with type data indicating their type; fourth storage means for storing a matching word dictionary used in word search; and dictionary generation means that judges whether a component of a "name" string in the first storage means matches a string in the second or third storage means and, when it does, stores the "name" string in the fourth storage means as the matching word dictionary together with the type data of the matched string.

Japanese Unexamined Patent Publication No. 2007-305046 (Patent Document 4) discloses an information processing apparatus comprising: means for storing in advance first information that associates kanji with phonograms representing their readings; acquisition means for acquiring the kanji of an individual's name and second information contained in that individual's e-mail address; generation means for generating candidate readings of the kanji of the name on the basis of the first information; and determination means for determining the reading of the kanji of the name from the result of matching the second information against the candidates.

Japanese Unexamined Patent Publication No. 2004-318699 (Patent Document 5) describes an account-holder name analysis method. The method searches a surname dictionary for words that prefix-match the input name. If the type of a match is a corporate type or an occupation, processing ends at that point on the ground that the name is not an individual's. If the type is a surname, the method searches a given-name dictionary for words that suffix-match the input name. It then narrows the candidate splits between surname and given name down to one, and finally judges whether the narrowed result is an individual.

Japanese Unexamined Patent Publication No. H10-171799 (Patent Document 6) describes a name analysis method that adds a boundary between surname and given name, and a boundary for each kanji character, to a personal name and furigana entered without a delimiter. The method splits the input name into surname and given name using a surname dictionary and a given-name dictionary. When only one of the two, or neither, is found in those dictionaries, it uses a character dictionary of per-character readings to assign furigana to each kanji, and judges whether the result is correct from character boundary information used to resolve ambiguity in the readings.

Patent Document 1: US Pat. No. 6,963,871 B1
Patent Document 2: Japanese Unexamined Patent Publication No. H11-338859
Patent Document 3: Japanese Unexamined Patent Publication No. 2000-251017
Patent Document 4: Japanese Unexamined Patent Publication No. 2007-305046
Patent Document 5: Japanese Unexamined Patent Publication No. 2004-318699
Patent Document 6: Japanese Unexamined Patent Publication No. H10-171799
Non-patent Document 1: http://publibfp.boulder.ibm.com/epubs/pdf/c1912860.pdf

Patent Documents 2 to 6 each disclose a technique for analyzing names automatically by computer. In the technique of Patent Document 2, however, the operator associates phonograms with ideograms; it does not disclose a technique by which a computer automatically associates an ideogram sequence with a phonogram sequence. The technique of Patent Document 3 builds a dictionary that takes possible similarity into account and allows a customer name to be selected with that dictionary, but it says nothing about how similarity is applied to select a name automatically, and it does not aim to associate ideograms with phonograms.

Patent Document 4 associates ideograms with phonograms by selecting, as the phonogram sequence for the kanji used in a name, the one most frequently applied to those kanji. It makes no judgment, however, as to whether the ideogram sequence being processed should in fact be represented by the most frequent sequence. Patent Document 4 also requires, along with the kanji name, information that suggests the name, such as an e-mail address. When a surname, a given name, or a full name is entered alone in phonograms or ideograms, the correspondence between phonograms and ideograms therefore cannot be determined uniquely.

Nor does it address which character form should be used when a name contains clerical errors, transcription mistakes, or differences in character form. It gives no criterion, for example, for the similarity between 「高田俊：タカダタカシ」 (Takada Takashi) and 「高田健：タカダタケシ」 (Takada Takeshi). It does not aim at ideogram-to-phonogram conversion that allows for OCR misconversion, and it does not associate names that differ in both phonogram sequence and ideogram sequence as possible candidates.

Patent Documents 5 and 6 likewise register kanji and readings in association with frequencies and convert between them. They associate ideograms and phonograms using appearance frequency alone, and do not generate name candidates that appearance frequency would not associate with the input ideogram or phonogram sequence.

Moreover, none of the techniques in Patent Documents 1 to 6 addresses the problem of judging, for a name entered either in any phonograms (including those of Indo-European or Afro-Asiatic languages, hiragana, katakana, and Hangul) or in ideograms, the names that the input may represent, including the most likely ideogram or phonogram sequence.

That is, a technique has been needed that, for a name entered as an ideogram sequence or a phonogram sequence, generates the most likely name candidates and, taking notations in other languages into account, can judge not only identical or similar name candidates but also names that are dissimilar in their phonetic characteristics.

The present invention has been made in view of the above problems of the prior art. In the present invention, an information processing apparatus receives a character sequence describing a personal name and generates a notation vector containing name notations in at least two different types of phonograms, or in at least two types of phonograms and ideograms.

For each element of the notation vector, a similarity is calculated against the kanji sequences or kana sequences managed by the information processing apparatus, producing a score sub-matrix that contains the scores calculated for the name notations in the notation vector. From the score sub-matrix, a similarity score is then calculated that takes into account the similarities across at least several of the name notations. This similarity score is used as a cross-language measure of similarity for the phrases that make up a name.
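The relationship between a notation vector, a score sub-matrix, and a combined similarity score can be sketched as follows. The text does not specify the per-notation similarity function or how the scores are combined, so this sketch assumes `difflib.SequenceMatcher` ratios and a weighted mean; the names `NOTATIONS`, `score_submatrix`, and `similarity_scores` are hypothetical.

```python
from difflib import SequenceMatcher

# Order of the notations inside a notation vector (an assumption of this sketch).
NOTATIONS = ("kana", "alphabet", "kanji")


def notation_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the source does not name a specific measure."""
    return SequenceMatcher(None, a, b).ratio()


def score_submatrix(query: dict, candidates: list) -> list:
    """One row per managed candidate, one column per notation of the query vector."""
    return [[notation_similarity(query[n], cand[n]) for n in NOTATIONS]
            for cand in candidates]


def similarity_scores(submatrix: list, weights=(1.0, 1.0, 1.0)) -> list:
    """Combine each row into one score by a weighted mean (assumed combination rule)."""
    total = sum(weights)
    return [sum(w * s for w, s in zip(weights, row)) / total for row in submatrix]


query = {"kana": "タカダ", "alphabet": "takada", "kanji": "高田"}
candidates = [
    {"kana": "タカダ", "alphabet": "takada", "kanji": "高田"},
    {"kana": "タカタ", "alphabet": "takata", "kanji": "高田"},
    {"kana": "スズキ", "alphabet": "suzuki", "kanji": "鈴木"},
]
scores = similarity_scores(score_submatrix(query, candidates))
```

With these inputs the identical candidate scores highest and the unrelated name lowest. Different weights would let one notation, for example kanji, dominate the ranking.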

In the present invention, name notations in at least two types of phonograms are generated from the character sequence; in a specific embodiment these are a kana sequence and an alphabet sequence. When the character sequence is a kana sequence, a surname or given-name notation in ideograms such as kanji is also provided.

The score sub-matrices are combined for each surname or given-name phrase to form a scoring matrix. The scoring matrix is used to judge the likelihood of transliterations of the surname and the given name. The surnames and given names registered in the scoring matrix are combined to generate full-name candidates, as well as similar full-name candidates that are neither homophones of those candidates nor written with the same characters.

According to the present invention, name candidates can be generated by judging similarity across several different notations of an input character sequence. For the generated candidates, similar name candidates can also be generated that are neither homophonic nor homographic, covering possible clerical errors, transcription mistakes, abbreviations, and conversion errors. The invention thus provides an information processing apparatus, an information processing method, an information processing system, and a program that can handle name notations worldwide.

The invention also provides an information processing apparatus, an information processing method, and a program that can judge multilingual likelihood for a name entered as a phonogram or ideogram sequence and weight the name candidates in the target phonogram or ideogram sequence.

In addition, the invention provides an information processing apparatus, an information processing method, and a program that can judge whether names are identical or similar across many languages, such as Indo-European and Afro-Asiatic languages, Japanese, Chinese, Korean, and Vietnamese, and can further identify names with dissimilar phonetic characteristics.

The present invention is described below by way of embodiments, but it is not limited to them. FIG. 1 is a functional block diagram of an information processing system 100 of the present embodiment. The information processing system 100 shown in FIG. 1 includes a server computer (hereinafter simply "server") 110, a client computer (hereinafter simply "client") 150, and a network 112 connecting the server 110 and the client 150. The client 150 accepts input of a character sequence taken to be a surname, a given name, or a full name written in kana, kanji, or an Indo-European, Afro-Asiatic, or other language, and sends a search request to the server 110 using, for example, the HTTP protocol.

In the present embodiment the term "kana" covers both katakana and hiragana. The server 110 receives a name query request via the network 112, which includes a LAN, the Internet, and the like, and executes name generation processing. Using the character sequence in the received name query request, the server 110 generates other notations corresponding to it. When it receives a kana sequence, for example, it generates an alphabet sequence, as an example of an Indo-European notation, and a kanji or kanji sequence, as an example of ideograms, and uses them for name search.

The server 110 can use a CISC microprocessor such as a PENTIUM (registered trademark) or PENTIUM (registered trademark)-compatible chip, or a RISC microprocessor such as a POWERPC (registered trademark); the microprocessor may be single-core or multi-core. Examples of the operating system (OS) on the server 110 include WINDOWS (registered trademark) 200X, UNIX (registered trademark), and LINUX (registered trademark).

The server 110 executes server programs such as CGI, servlets, APACHE, and IIS, implemented in programming languages such as C++, JAVA (registered trademark), JAVA (registered trademark) BEANS, PERL, and RUBY, and processes requests received via a network (not shown). The server 110 can be implemented as a web server, or as a dedicated server that enables distributed computing using CORBA (Common Object Request Broker Architecture).

The client 150 is implemented on a personal computer, a workstation, or the like, and its microprocessor (MPU) may be any known single-core or dual-core processor. The client 150 can run an OS such as WINDOWS (registered trademark), UNIX (registered trademark), LINUX (registered trademark), or MAC OS (registered trademark).

When the client functions as a web client, it accesses the web server over the HTTP protocol using browser software such as Internet Explorer (registered trademark), Mozilla, Opera, or Netscape (registered trademark) Navigator. When a distributed computing environment such as CORBA is used, the client 150 can instead implement a dedicated client program for communicating with the server 110.

The functional block configuration of the server 110 is described in detail below. Each functional block of the server 110 is functional means realized on the server 110 by loading a program into the RAM of the server 110 and having the CPU or MPU execute it, so that the server 110 functions as that block.

The server 110 includes a network adapter 114, a character sequence acquisition unit 116, and a character determination unit 118. The network adapter 114 receives a name query request, processes the transaction at the TCP/IP level, and passes the request to the processing units that execute the name candidate generation processing of the present embodiment. The character sequence acquisition unit 116 extracts from the payload of the received packet the character sequence in the name query request, which contains a surname notation, a given-name notation, or both, and registers it in suitable memory as the character sequence to be processed. In the present embodiment, "surname notation" means the character string corresponding to the surname (family name), "given-name notation" means the character string corresponding to the given name, and a character sequence that may contain either or both is referred to as a "full-name notation".

The character determination unit 118 analyzes the characters of the registered character sequence and judges whether it consists of, for example, katakana, hiragana, kanji, or alphabetic characters. Characters can be analyzed by examining their character codes, and any code system such as SJIS, GB, or UNICODE can be used. For global name search, however, UNICODE handles the character types to be judged most appropriately, so character code comparisons below are assumed to use UNICODE (for example, UNICODE 2.0). After judging the characters of the received sequence, the character determination unit 118 identifies the language area of the sequence for use in subsequent processing.
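Character type determination from Unicode code points, as described above, might look like the following minimal sketch. The block boundaries used are the standard Hiragana, Katakana, and CJK Unified Ideographs blocks; the function names and the "other" fallback are assumptions, and extension blocks and full-width Latin letters are deliberately left out.

```python
def char_type(ch: str) -> str:
    """Classify one character by its Unicode code point."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:      # Hiragana block
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:      # Katakana block
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs block
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "other"


def sequence_types(text: str) -> set:
    """Return the set of character types found in a sequence, ignoring whitespace."""
    return {char_type(ch) for ch in text if not ch.isspace()}
```

A sequence that yields a single type, such as `{"katakana"}`, can be routed directly to the matching conversion step, while a mixed result such as `{"kanji", "hiragana"}` signals that the surname and given name may need to be handled separately.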

The server 110 further includes an other-notation acquisition unit 120, a similarity calculation unit 122, and a similarity score calculation unit 124. The other-notation acquisition unit 120 examines the received character sequence and, when it judges the sequence to be kana, performs transliteration in the present embodiment to obtain alphabet and kanji notations. The alphabet notations include romanizations such as Hepburn, Kunrei, and ISO 3602, as well as surnames or given names from Indo-European language areas. Depending on the specific purpose of the information processing system 100, notations other than the alphabet, such as Greek, Cyrillic, or Arabic script, can also be obtained.

For example, when the other notation acquisition unit 120 determines that the character sequence is a kana sequence, it refers to the phonogram conversion table 130 and performs transliteration that converts the sequence into the corresponding ideograms, which in a more specific embodiment are kanji. The phonogram conversion table 130 can be configured as a full name dictionary, intended for surnames and given names, that registers kanji notations in association with the possible kana notations.

Furthermore, the phonogram conversion table 130 of this embodiment registers kanji that may be used in personal names regardless of cultural area. In addition to the phonogram-to-ideogram correspondence table, it contains cultural-area determination data indicating the cultural areas to which each ideogram may belong. For example, when the input character sequence is “ハリ”, the other notation acquisition unit 120 refers to the phonogram conversion table 130 and obtains “張” as a personal-name kanji. It can then refer to the cultural-area determination data, generate a code indicating that “張” may belong to the Japanese, Chinese, or Korean language area, and assign that code to the kanji.

When the received character sequence is kana, the other notation acquisition unit 120 generates an alphabetic sequence in addition to obtaining kanji. This is done by kana-to-alphabet conversion using a hidden Markov model. When the character sequence is determined to be kanji, the unit performs a reverse lookup of the phonogram conversion table 130 to obtain the kana notation and then converts that kana notation to the alphabet. In either case, the character sequence is registered as a notation vector {kana, alphabet, kanji}.

When the character sequence is alphabetic, the server 110 does not convert it to kanji and generates the notation vector with the {kanji} field left blank. The kana notation for an alphabetic sequence can be generated by referring to the alphabet-kana correspondence table that is used when building the transition probability table described later, converting to the extent that kana corresponding to the alphabetic sequence can be found. Alternatively, it can be generated with the hidden Markov model described later by treating kana as the target series.
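The construction of notation vectors for the three input types can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionaries `KANA_TO_KANJI` and `KANA_TO_ALPHA` are hypothetical stand-ins for the phonogram conversion table 130 and for the hidden-Markov-model conversion (which is replaced here by a plain lookup), and the class and function names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical toy data standing in for table 130 and the kana-to-alphabet conversion.
KANA_TO_KANJI: Dict[str, List[str]] = {"タケシ": ["健"], "ケン": ["健"]}
KANA_TO_ALPHA: Dict[str, str] = {"タケシ": "Takeshi", "ケン": "Ken", "ジェームズ": "James"}
ALPHA_TO_KANA: Dict[str, str] = {alpha: kana for kana, alpha in KANA_TO_ALPHA.items()}


@dataclass(frozen=True)
class NotationVector:
    kana: Optional[str]
    alphabet: Optional[str]
    kanji: Optional[str]  # left as None (blank) for alphabetic input


def from_kana(kana: str) -> List[NotationVector]:
    """Kana input: look up kanji candidates and convert kana to the alphabet."""
    alphabet = KANA_TO_ALPHA.get(kana)
    return [NotationVector(kana, alphabet, kanji) for kanji in KANA_TO_KANJI.get(kana, [None])]


def from_kanji(kanji: str) -> List[NotationVector]:
    """Kanji input: reverse lookup of the table to find kana, then convert to the alphabet."""
    return [
        NotationVector(kana, KANA_TO_ALPHA.get(kana), kanji)
        for kana, candidates in KANA_TO_KANJI.items()
        if kanji in candidates
    ]


def from_alphabet(alphabet: str) -> NotationVector:
    """Alphabetic input: no kanji is generated; kana is filled in only if a match is found."""
    return NotationVector(ALPHA_TO_KANA.get(alphabet), alphabet, None)
```

For instance, `from_alphabet("James")` yields a vector whose kanji field stays blank, mirroring the behavior described for alphabetic input.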

When an alphabetic sequence is input as the character sequence, the similarity is calculated by GNR processing, so processing that uses the similarity can proceed without problems even if no kana or kanji is generated at all.

After the other notation acquisition unit 120 has obtained the set of notation vectors that may correspond to the character sequence, the notation vectors are passed to the similarity calculation unit 122. The similarity calculation unit 122 uses several methods to calculate a similarity score for each of the alphabet, kana, and kanji elements of a notation vector.

Of the character types in a notation vector, kanji similarity can be determined by referring to the phonogram conversion table 130 or the like and checking whether the kanji in the notation vector matches a character code registered in the table. For example, a binary reference value can be assigned: 1.0 when the character codes match exactly and 0 when they do not.

The server 110 also manages a similar model table 132, a variant character table 134, and an identical/dissimilar kanji table 136. These are used to generate similar full name candidates: homophones written with different characters, identical characters with different readings, and names that are neither but may result from transcription errors, OCR conversion errors, and the like. The similar model table 132 and the variant character table 134 can also be used to provide, in place of the binary reference value for kanji identity or dissimilarity, multi-valued reference values between 0 and 1 that take variant characters and error-proneness into account.

The similarity calculation unit 122 also obtains a similarity for the alphabetic sequence. To do so, it extracts the alphabetic sequence and, in a specific embodiment, passes it to the GNR processing unit 160 configured as a processing unit of the server 110, and obtains the similarity to the alphabetic names registered by the GNR processing unit 160. For the similarity determination executed by the GNR processing unit 160, the methods disclosed in detail in, for example, Patent Document 1 and Non-Patent Document 1 can be used.

In another embodiment, instead of the GNR processing unit 160 inside the server 110, GNR processing can be provided by a GNR server 140 dedicated to that processing. In this case, the similarity calculation unit 122 sends the alphabetic sequence whose similarity is to be determined to the GNR server 140 via the network 112 and obtains the similarity in the response from the GNR server 140.

The similarity calculation unit 122 further calculates the similarity between kana sequences. For this purpose, it extracts the code of the kana sequence from the notation vector and compares it with the kana sequences registered in the phonogram conversion table 130. In yet another embodiment, the alphabetic sequence generated from the kana sequence can be sent to the GNR processing unit 160 or the GNR server 140, and the similarity value returned by GNR processing can be used as the similarity for the kana sequence.

When comparing kana sequences, the similarity calculation unit 122 refers to the phonogram conversion table 130 for the character string contained in the character sequence and generates possible reference kana sequences. For example, when “タケシ” (Takeshi) is input as the character sequence, it extracts the associated kanji “健”, extracts “ケン” (Ken), a kana sequence associated with that kanji code, and generates it as a reference kana sequence. The similarity calculation unit 122 then passes the generated “ケン” to the other notation acquisition unit 120, which generates the corresponding alphabetic sequence “Ken”, making it possible to calculate scores between the target sequence and the reference sequence.

The code of the target kana sequence is then compared against the code of the reference kana sequence, for example by sequential forward comparison, and a score is calculated from the matches and mismatches of character codes, for example as (total number of matched character codes) / (total number of character codes in the target kana sequence and the reference kana sequence). Each calculated set of scores is registered as a score sub-matrix.
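One possible reading of this kana score is sketched below. The text does not say how the totals are counted, so this example assumes that a matched position counts once in each sequence, which makes two identical sequences score 1.0; that normalization and the name `kana_match_score` are assumptions of the sketch, not statements from the specification.

```python
def kana_match_score(target: str, reference: str) -> float:
    """Position-by-position (forward) comparison of two kana sequences.

    Assumed normalization: each matched position is counted in both
    sequences, so the score is 2 * matches / (len(target) + len(reference)).
    """
    total = len(target) + len(reference)
    if total == 0:
        return 0.0
    # zip stops at the shorter sequence, so trailing characters count as mismatches.
    matches = sum(1 for t, r in zip(target, reference) if t == r)
    return 2.0 * matches / total
```

Under this reading, “タケシ” against “ケン” scores 0 because no position matches, which agrees with the mismatch case described for the score sub-matrix of FIG. 7.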

In this embodiment, when a corresponding character code for a kanji or kana cannot be found, character templates can be prepared as image data, as is done in OCR for example, and the similarity can be determined by bitmap comparison of the kana or kanji against the character templates. The similarity calculation unit 122 generates score vectors for the entire set of reference notation vectors generated by the other notation acquisition unit 120 and stores them in suitable memory such as RAM, L2 cache, or L3 cache.

The server 110 further includes a similarity score calculation unit 124 and a full name candidate generation unit 126. The score sub-matrices generated by the similarity calculation unit 122 are associated with their notation vectors and sent to the similarity score calculation unit 124. The similarity score calculation unit 124 registers the received score sub-matrices for each phrase that constitutes a surname or a given name and generates a scoring matrix. In the embodiment described, each 3 × 3 score sub-matrix is registered as one set, and the scoring matrix is defined as a square matrix in which the score sub-matrix extracted for the surname and the score sub-matrix extracted for the given name are arranged in a set order, such as surname followed by given name.

The similarity score calculation unit 124 then calculates a similarity score from the element values of the generated score sub-matrices. The similarity score can be obtained as an inner product of the elements of a score vector with a suitable weight vector, or by extracting elements according to set rules and computing from them. This processing is described in more detail later.
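The inner-product variant can be sketched as below. The weight values in the usage note and the renormalization over non-blank elements are assumptions introduced for this example; the text only states that an inner product with a suitable weight vector may be used.

```python
from typing import Optional, Sequence


def similarity_score(scores: Sequence[Optional[float]], weights: Sequence[float]) -> float:
    """Weighted inner product of a score vector, skipping blank (None) elements.

    Assumption: weights of blank elements are dropped and the remaining
    weights are renormalized, so a blank kanji field does not lower the score.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    pairs = [(s, w) for s, w in zip(scores, weights) if s is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return 0.0
    return sum(s * w for s, w in pairs) / total_weight
```

With hypothetical weights `[0.4, 0.4, 0.2]` for alphabet, kana, and kanji, a vector that matches only on kanji scores about 0.2, while a vector with matching alphabet and kana and a blank kanji element scores 1.0.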

The server 110 also includes a full name similarity calculation unit 128, implemented either as an independent processing unit of the server 110 or as a functional module of the full name candidate generation unit 126 described later. From the full name candidates generated by the similarity score calculation unit 124, the full name similarity calculation unit 128 extracts the candidate with the highest total similarity score or, in another embodiment, the top ten candidates. For the extracted candidates, it generates as similar full name candidates those names that use variant characters, or that merely look similar while having entirely different phonetic characteristics.

The full name candidate generation unit 126 receives the output of the full name similarity calculation unit 128, generates a full name candidate list containing the full name candidates or similar full name candidates that correspond to the received character sequence, and returns the list to the client 150 in table form.

In line with the similarity criteria, the full name candidate list at this stage can account for names that are homophones written with different characters or identical characters read differently, and it can give sufficient results for particular applications. Furthermore, when the full name candidate generation unit 126 includes the full name similarity calculation unit 128 as a module, the full name similarity calculation unit 128 can take candidates with high total similarity scores from the list, generate further candidates that are variant-character forms or look similar and could arise from transcription errors, recognition errors, or clerical errors, and add them to the list.

In yet another embodiment, the full name similarity calculation unit 128 can be implemented as an independent processing unit of the server 110. In that embodiment, the unit receives a full name in kanji from the similarity score calculation unit 124, or kanji sent from the client 150, generates full name candidates that may be variant-character forms or similar in appearance, and returns them to the client 150 together with similarity index values. This processing is described in more detail later so that it can be carried out by those skilled in the art. In addition, the descriptions of Japanese Patent Application No. 2008-117538 (IBM attorney docket number JP9080054), Japanese Patent Application No. 2008-141209 (IBM attorney docket number JP9080081), and Japanese Patent Application No. 2008-168087 (IBM attorney docket number JP9080115) may be incorporated in their entirety as part of this specification.

FIG. 2 illustrates the processing sequence 200 for similarity generation executed by the server 110 of this embodiment. A character sequence input from the client 150 may be alphabetic, kana, or kanji, depending on the application. In yet another embodiment it may be a character string from the Afro-Asiatic language area, such as Arabic. The server 110 determines the character code of the character sequence received from the client and generates the other notations, that is, those expressed in character codes other than its own. For example, when the character sequence is kana 210, it generates the alphabetic sequence 230 and the kanji 220 and sends them to the similarity calculation unit 122.

For the alphabetic sequence 230, the similarity calculation unit 122 calls the GNR similarity search, shown as box 240, and requests processing. For the kana sequence 210, as shown in box 250, it can run the GNR similarity search after conversion to an alphabetic sequence, perform a kana-to-kana character code comparison, or, in yet another embodiment, apply bitmap comparison against character templates.

For the kanji 220, as shown in box 260, a binary reference value can be given by comparing character codes, or a score indicating similarity can be calculated from cost values using the tables 132, 134, and 136 shown in FIG. 1. As box 260 also shows, when character codes cannot be compared, similarity can be determined by bitmap comparison using character templates or the like. With the processing sequence 200 of FIG. 2, an efficient and accurate full name search can be executed using an appropriate similarity calculation regardless of the character sequence received from the client 150.

FIG. 3 shows a flowchart of the information processing method for generating full name candidates in this embodiment. The processing starts at step S300, and in step S301 a character sequence is received from the client 150. In step S302, the character codes of the received sequence are determined. The embodiment described distinguishes kanji, kana, and alphabet, but the character sequences that may be received, such as Hangul, Arabic, Greek, or Cyrillic, are not particularly limited and depend on the purpose and application of full name candidate generation.

In step S303, the other notations are obtained and registered as a notation vector of kanji, katakana, and alphabet. A notation that is not available is left blank: for example, when the character sequence is alphabetic, no kanji is generated from the alphabetic sequence and the kanji element remains blank. In step S304, the character type of the sequence is determined as kanji, alphabet (kana), or kana. If the sequence is kanji, processing passes to step S310. If it is kana, processing passes to step S305 or step S309, depending on the processing sequence adopted.

If the character sequence is alphabetic, processing passes to step S306. In step S305, the kana sequence is converted to the alphabet, and the alphabetic sequence is passed to step S306, where it is sent to the GNR server 140 or the GNR processing unit 160. In step S307, the similarity is received from the GNR server 140 or the GNR processing unit 160, and processing passes to step S308.

In step S309, on the other hand, the character codes of kanji are compared with each other, or bitmap comparison against character templates is performed, and the resulting similarity is passed to step S308. In step S308, a score is calculated using the similarities for at least two notations, according to the cultural area of the character sequence. At least two are used because, for example, when the character sequence received from the client 150 is alphabetic, it is not forcibly converted to kanji and the kanji field of the notation vector is left blank. Also in step S308, a similarity score that gives an index of similarity for the score sub-matrix is calculated from the calculated scores so that full name candidates can be extracted, and a full name candidate list is generated. When generation of the list is complete, processing passes to step S310 and ends.

Through the processing of FIG. 3, the user of the client 150 can obtain the full name candidates that correspond to the input character sequence together with their likelihood, and can judge which full names are candidates for that sequence.

FIG. 4 shows a flowchart of the kana-to-alphabet conversion executed by the other notation acquisition unit 120 of this embodiment. The processing starts at step S400, and a kana sequence is received in step S401. In step S402, the kana sequence is decomposed into phonemes. Decomposition follows the sounds of the phonograms, and long vowels and geminate consonants (sokuon) are included in the immediately preceding phoneme. For example, for the kana sequence “インフォメーション”, the decomposition is “イ” + “ン” + “フォ” + “メー” + “ショ” + “ン”.
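The decomposition rule of step S402 can be sketched in a few lines. The set `ATTACH_TO_PREVIOUS` and the function name `decompose` are assumptions made for this example; the sketch attaches small kana, the long-vowel mark, and the sokuon to the preceding unit, which reproduces the example given in the text.

```python
from typing import List

# Characters merged into the preceding unit: small kana, long-vowel mark, sokuon.
ATTACH_TO_PREVIOUS = set("ァィゥェォャュョヮーッ")


def decompose(kana: str) -> List[str]:
    """Split a katakana sequence into phoneme-like units."""
    units: List[str] = []
    for ch in kana:
        if ch in ATTACH_TO_PREVIOUS and units:
            units[-1] += ch   # attach to the immediately preceding unit
        else:
            units.append(ch)  # start a new unit
    return units
```

Calling `decompose("インフォメーション")` returns the six units `イ`, `ン`, `フォ`, `メー`, `ショ`, `ン`.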

In step S403, a transition probability model is applied to the phoneme decomposition to calculate the conversion likelihood χ, and in step S404 the alphabetic sequence with the highest conversion likelihood χ is obtained. The transition probability model can be derived by linguistic analysis of phonemes and phoneme sequences, so that a hidden Markov model can be applied with kana as the observed series and the alphabet as the target series. In step S405, the alphabet conversion result is passed to the similarity calculation unit 122 with a request for similarity calculation, and the processing ends in step S406.

FIG. 5 shows a state transition diagram of the alphabet conversion using the hidden Markov model described with FIG. 4. The character sequence to be converted is “インフォメーション”. The other notation acquisition unit 120 registers, for each phoneme, the phoneme-to-phoneme transition probability π_i and the inter-phoneme transition probability p_ij as a transition probability table, and applies the kana-to-alphabet conversion. As shown in FIG. 5, the likelihoods are given as log likelihoods. The log likelihoods are accumulated over the phonemes to compute the conversion likelihood χ given by equation (1) below, and the alphabetic sequence with the highest conversion likelihood is output to the similarity calculation unit 122.
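The idea of accumulating log likelihoods over phonemes and keeping the best alphabetic sequence can be illustrated as follows. This is only a toy sketch: it does not reproduce equation (1), all probabilities and candidate spellings are invented for the example, and it uses an exhaustive search over candidate combinations rather than a Viterbi-style dynamic program, which a real hidden Markov model decoder would normally use.

```python
import math
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical per-phoneme candidate spellings with log probabilities.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "ケ": {"ke": math.log(0.9), "que": math.log(0.1)},
    "ン": {"n": math.log(0.8), "m": math.log(0.2)},
}
# Hypothetical log probabilities for moving from one spelling to the next.
TRANSITIONS: Dict[Tuple[str, str], float] = {("ke", "n"): math.log(0.9)}
DEFAULT_TRANSITION = math.log(0.5)


def best_romanization(phonemes: List[str]) -> Tuple[str, float]:
    """Return the spelling with the highest summed log likelihood and that value."""
    best_text, best_ll = "", -math.inf
    for combo in product(*(CANDIDATES[p].items() for p in phonemes)):
        spellings = [spelling for spelling, _ in combo]
        ll = sum(log_p for _, log_p in combo)
        ll += sum(TRANSITIONS.get(pair, DEFAULT_TRANSITION)
                  for pair in zip(spellings, spellings[1:]))
        if ll > best_ll:
            best_text, best_ll = "".join(spellings), ll
    return best_text, best_ll
```

With the toy tables above, the phoneme list `["ケ", "ン"]` is converted to `ken`, since that combination has the largest summed log likelihood.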

FIG. 6 shows an embodiment of the similar model table 132, the variant character table 134, and the identical/dissimilar kanji table 136 described with FIG. 1. The values shown in FIG. 6 are costs corresponding to the risk of judging the associated kanji to be similar. The tables in FIG. 6 are used by the full name similarity calculation unit 128 to calculate path costs when extracting similar full name candidates.

In another embodiment, the similar model table 132 and the variant character table 134 can also be used by the similarity calculation unit 122 to calculate multi-valued reference values for the similarity between kanji. In this case, for example, the cost values shown in FIG. 6 can be used to give the kanji similarity score as score = 1 / (cost + α), where α is a suitable constant, smaller than the minimum cost value, that avoids underflow and overflow. Any other suitable formulation may also be used to give the multi-valued reference value. In the embodiment of FIG. 6 the tables 132, 134, and 136 are combined into a single table, but each table can also be created and registered separately.

In the table of FIG. 6, the off-diagonal elements of table 610 form the area that registers the dissimilarity costs of the identical/dissimilar kanji table 136, and the off-diagonal elements of table 620 correspond to the variant character table 134. The off-diagonal elements of table 630 form the area corresponding to the similar model table 132, which registers kanji that look alike. To make the data structure explicit, the embodiment of FIG. 6 is shown as also registering the costs for identical and dissimilar kanji. The diagonal elements of the table in FIG. 6 correspond to the identity part of the identical/dissimilar kanji table 136.

FIG. 6 is described, for purposes of explanation, as registering cost values for all entries. In a preferred form of this embodiment, only the similar model table 132 and the variant character table 134 are implemented explicitly, and the costs corresponding to the identical/dissimilar kanji table 136 are set as fixed values in the program or the like, which reduces the amount of table data and the memory consumed by the tables. In that case, the server 110 can judge kanji to be identical or dissimilar on the basis that they are not registered in the similar model table 132 or the variant character table 134, and proceed accordingly.

When two kanji are dissimilar, a cost of 2 is assigned, and when they are identical the cost is 0. Kanji that look alike are likely to cause transcription errors, clerical errors, OCR conversion errors, and so on, so their costs span a range. “萩” and “荻” are not entirely different: they can be confused through clerical error or OCR misconversion, and their phonetic characteristics are also similar, so a cost of 0.3 is given. The kanji “朋” and “明”, by contrast, look alike but have entirely different phonetic characteristics, so a higher cost of 0.4 is given despite the visual similarity.

Pairs such as “富” and “冨”, “高” and “▲高▼” (the form with two horizontal strokes between two vertical strokes beneath the lid radical), and “斉” and “齊” are variant characters that tend to be written in abbreviated form, and a cost of 0.1 is assigned to them.
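The cost lookup with fixed values for identical and dissimilar kanji, together with the score = 1 / (cost + α) conversion, might look like the following sketch. The cost values are the ones quoted in the text; the dictionary layout, the function names, and the value α = 0.05 are assumptions made for this illustration.

```python
IDENTICAL_COST = 0.0   # fixed value instead of a stored table entry
DISSIMILAR_COST = 2.0  # fixed value for pairs found in neither table

# Sparse tables keyed by unordered kanji pairs (costs as quoted in the text).
SIMILAR_MODEL_TABLE = {frozenset("萩荻"): 0.3, frozenset("朋明"): 0.4}
VARIANT_TABLE = {frozenset("富冨"): 0.1, frozenset("斉齊"): 0.1}


def kanji_cost(a: str, b: str) -> float:
    """Cost of treating kanji a and b as the same character."""
    if a == b:
        return IDENTICAL_COST
    key = frozenset((a, b))
    if key in VARIANT_TABLE:
        return VARIANT_TABLE[key]
    if key in SIMILAR_MODEL_TABLE:
        return SIMILAR_MODEL_TABLE[key]
    return DISSIMILAR_COST


def cost_to_score(cost: float, alpha: float = 0.05) -> float:
    """score = 1 / (cost + alpha); alpha is a hypothetical small constant."""
    return 1.0 / (cost + alpha)
```

Keeping only the variant and look-alike pairs in the dictionaries reflects the memory-saving arrangement in which identical and dissimilar costs are fixed values rather than table entries.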

FIG. 7 shows embodiments of the score sub-matrix 700 corresponding to character sequences that may be input to the server 110 in this embodiment. Each row is a score vector corresponding to the notation vector generated by the other notation acquisition unit 120, whose element values in the embodiment of FIG. 7 are {Takeshi, タケシ, 健}. The columns hold the scores calculated for the kana sequences and kanji that the similarity calculation unit 122 extracted from the phonogram conversion table 130 when executing the score calculation.

For example, the kanji “健” is assigned the kana “ケン” and “タケシ”. As a result, the notation vector {Takeshi, タケシ, 健} and the reference notation vector {Ken, ケン, 健} are generated, and the score sub-matrix 710 is produced. In the score sub-matrix 710, the alphabetic sequences and the kana sequences do not match, so a degree of match of 0 is given for them. For the kanji, the readings fall within those registered for the character code of “健” and within what alphabet conversion can generate, so a degree of match of 1.0 is given.

Score sub-matrix 720, on the other hand, shows an embodiment in which the server 110 receives an alphabetic character sequence, here "James". In score sub-matrix 720, the kana sequence "ジェームズ" corresponding to the alphabet sequence was found, so the notation vector is generated as {James, ジェームズ, null}.

In this case, in the embodiment being described, only the kana sequence "ジェームズ" is registered for the alphabet sequence "James" in the reference notation vector, so a degree of match = 1.0 is assigned for both the alphabet and the kana. Because the kanji element is left blank (null), the degree of match with kanji is also left blank (null).

Score sub-matrix 730 shows an embodiment in which the server 110 receives the kanji "毛" as the character sequence. For the kanji "毛", readings such as "モウ", "ケ", "ゲ", and "マオ" are registered as kana, and the alphabet conversion process extracts Mou, Mao, Ge, Ke, and so on. In the illustrated score sub-matrix 730, a degree of match = 1.0 is always assigned for the kanji; for the other elements, a degree of match = 0 is assigned unless the score sub-matrix was generated from a notation vector and a reference notation vector that contain the corresponding alphabet and kana.
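The degrees of match quoted for sub-matrices 710 and 720 can be reproduced with a minimal sketch that compares two notation vectors element by element. It covers only the same-type comparisons (alphabet with alphabet, kana with kana, kanji with kanji) and uses exact matching; the function name and the use of `None` for a blank element are assumptions made for illustration.

```python
from typing import List, Optional

Notation = Optional[str]  # None stands for a blank (null) element


def match_scores(query: List[Notation],
                 reference: List[Notation]) -> List[Optional[float]]:
    """Compare (alphabet, kana, kanji) vectors element by element.

    Gives 1.0 for an exact match, 0.0 for a mismatch, and None when
    either side is blank, so that a blank stays blank in the result.
    """
    scores: List[Optional[float]] = []
    for q, r in zip(query, reference):
        if q is None or r is None:
            scores.append(None)
        else:
            scores.append(1.0 if q == r else 0.0)
    return scores


# The two cases discussed in the text.
print(match_scores(["Takeshi", "タケシ", "健"], ["Ken", "ケン", "健"]))
print(match_scores(["James", "ジェームズ", None], ["James", "ジェームズ", None]))
```

The first call yields a match only on the kanji element, and the second yields matches on alphabet and kana with the kanji element left blank, which agrees with the values described for sub-matrices 710 and 720.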

Furthermore, in this embodiment a similarity score is calculated for each of the score sub-matrices 710, 720, and 730 to produce a similarity for the sub-matrix as a whole. FIG. 8 shows an assignment map 800 that explains how the scores corresponding to each notation are assigned when the similarity score is calculated. The similarity can simply be calculated as the average of the individual scores. However, because each notation permits judgments that are specific to full names, this embodiment can also generate the similarity score using different processing for each element. The first embodiment is a method that generates a similarity vector from the score sub-matrix.

The similarity vector is formed by extracting the elements of the score vectors in a set order and, in the embodiment described here, is given by the following expression (2).

In this embodiment, weights are also assigned according to the position of each element value in the assignment map 800: W1, ..., W9 are assigned to S1, ..., S9 respectively, and the weight vector is given by the following expression (3).

The similarity score for the score sub-matrix as a whole can be given by the following expression (4), using the inner product of the two vectors.
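Expressions (2) to (4) are not reproduced in this text. A reconstruction consistent with the surrounding description, assuming the nine elements are taken in the order S1, ..., S9 of the assignment map 800 and using Score as an assumed name for the result, would be:

```latex
\begin{align}
\mathbf{S} &= (S_1, S_2, \ldots, S_9) \tag{2} \\
\mathbf{W} &= (W_1, W_2, \ldots, W_9) \tag{3} \\
\mathrm{Score} &= \mathbf{S} \cdot \mathbf{W} = \sum_{i=1}^{9} W_i S_i \tag{4}
\end{align}
```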

The weights can be set as appropriate for the specific application, for example when kana-kana matches, kana-kanji matches, or alphabet-kana matches are to be emphasized. When generating full-name match candidates in kana notation for Japanese names, the weight of S5 is raised; when generating candidates for kana-kana and kana-kanji matches, the weights of S5, S6, S8, and S9 are raised. For foreign names, where alphabet-alphabet and alphabet-kana matches should be emphasized, the weights of S1, S2, and S4 can be set high. The weights can be set by system configuration or by user selection to suit the full-name candidates to be generated.
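As a rough sketch of the weighted score, the inner product of the similarity vector and a weight vector can be computed as follows. The weight values and preset names are hypothetical, and skipping blank (null) elements is an assumption; the text does not say how blanks enter the sum.

```python
from typing import Optional, Sequence


def weighted_score(s: Sequence[Optional[float]], w: Sequence[float]) -> float:
    """Inner product of similarity vector S1..S9 and weights W1..W9.

    Blank (None) elements contribute nothing; this handling is an assumption.
    """
    if len(s) != 9 or len(w) != 9:
        raise ValueError("expected nine elements in each vector")
    return sum(wi * si for si, wi in zip(s, w) if si is not None)


# Hypothetical weight presets (index 0 is S1, index 8 is S9).
JAPANESE_KANA_KANJI = [0.5, 0.5, 0.5, 0.5, 2.0, 2.0, 0.5, 2.0, 2.0]  # S5, S6, S8, S9 raised
FOREIGN_ALPHABET = [2.0, 2.0, 0.5, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5]     # S1, S2, S4 raised

# A vector with no kanji notation: S3 and S7 are blank.
s = [1.0, 1.0, None, 1.0, 1.0, 0.0, None, 0.0, 0.0]
print(weighted_score(s, FOREIGN_ALPHABET))     # alphabet-oriented preset
print(weighted_score(s, JAPANESE_KANA_KANJI))  # kana/kanji-oriented preset
```

With these illustrative numbers the alphabet-oriented preset scores the vector higher than the kana/kanji-oriented preset, which is the effect the weighting is meant to have for names without kanji notation.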

The similarity scores of score sub-matrices 710, 720, and 730 can also be calculated by other methods. Other embodiments are described below; the element assignment mapping is as shown in FIG. 8. In particular, the distinction between Japanese names and foreign names, including Chinese and Korean names, depends on whether the kana notation and the kanji notation match.

For this reason, instead of calculating an overall similarity score for the score sub-matrix, the elements that reflect the characteristics of each element value can be examined to generate a similarity score that enables cultural-sphere determination. For example, as described above, the element values S5, S6, S8, and S9, which include the correspondence with kanji notation, serve as feature values. A threshold TH_j is therefore set for the element values, and when the similarities of S8 and S9 both exceed TH_j, the name is determined to be Japanese and the similarity score is calculated accordingly.

Furthermore, the second embodiment for calculating the similarity score enables cultural-sphere determination by exploiting the fact that no kanji notation is generated for character sequences associated with foreign names, in particular those of Indo-European and Afro-African language groups. For example, when S3 and S7 in FIG. 8 are null, a similarity score for foreign names is calculated, that is, for names other than the Japanese, Chinese, or Korean names found in Japan, China, Taiwan, South Korea, Vietnam, Malaysia, and so on. When the kanji-kanji similarity is higher than a set threshold TH_k and the kana-kana similarity is lower than a set threshold TH_c, a similarity score for Chinese or Korean names can be calculated. Expression (5) below gives pseudocode for processing that uses these determinations.

In the above expression, cult_category is a cultural-sphere identification variable that holds a cultural-sphere identification value. Depending on how the processing is implemented, the server 110 can pass the value of cult_category to another processing unit to have it execute cultural-sphere identification processing. In still other embodiments, the similarity score calculation can be changed by using, as auxiliary information, the result of a cultural-sphere determination made from name likelihood or by detecting characters characteristic of a cultural sphere, such as kokuji (kanji created in Japan), the acute accent (accent aigu), and the umlaut.
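The pseudocode of expression (5) is not reproduced in this text. The following is a hedged sketch of the determinations described above, not the original pseudocode: the category labels, the threshold values, the function name, and the order of the checks are all assumptions. Only the variable name cult_category and the thresholds TH_j, TH_k, and TH_c come from the description.

```python
from typing import Optional, Sequence

# Hypothetical threshold values for TH_j, TH_k and TH_c.
TH_J, TH_K, TH_C = 0.8, 0.8, 0.5


def classify_culture(s: Sequence[Optional[float]]) -> str:
    """Return a cult_category value for a similarity vector S1..S9.

    Index 0 is S1 and index 8 is S9; None stands for a blank element.
    """
    s3, s5, s7, s8, s9 = s[2], s[4], s[6], s[7], s[8]
    if s3 is None and s7 is None:
        # No kanji notation was generated for this name.
        return "non_kanji"
    if s8 is not None and s9 is not None and s8 > TH_J and s9 > TH_J:
        # Kanji agrees with both the kana reading and the kanji itself.
        return "japanese"
    if s9 is not None and s5 is not None and s9 > TH_K and s5 < TH_C:
        # Kanji matches but the kana readings differ.
        return "chinese_korean"
    return "undetermined"


cult_category = classify_culture([1.0, 1.0, None, 1.0, 1.0, None, None, None, None])
print(cult_category)
```

The returned value could then be handed to whichever processing unit selects the similarity score calculation for that cultural sphere.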

FIG. 9 shows an embodiment of a scoring matrix 900 built from score sub-matrices as its partial matrices. Each score sub-matrix is built per phrase, where a phrase constitutes a surname or given name that the server 110 generates by referring to an appropriate dictionary or the like and that the GNR server 140 or the GNR processing unit 160 can process. The scoring matrix 900 is generated by arranging the score sub-matrices for phrases that are possible surnames or given names of the name under evaluation, in the order in which the character sequences are recognized.

The similarity score for a particular character or phrase can be calculated from, for example, the 3×3 elements of score sub-matrix 910 of the scoring matrix 900, using expression (4) or (5) above. To indicate the position of a 3×3 score sub-matrix, it is referred to as (Name_i, Name_j), and the phrase units are shown in FIG. 9 as Name1, Name2, Name3, and so on. In this notation, Name1 corresponds to the surname "齊籐", Name2 to the surname "斎藤", and Name3 to, for example, "西藤". Given names can be registered in the scoring matrix 900 in the same sequence, or, in other embodiments, a separate scoring matrix for given names can be generated and similarity determined by the same method. When the received character sequence is alphabetic, "James" can be used as the unit for Name1.

In the embodiment shown in FIG. 9, the score sub-matrices (Name_i, Name_{j±1}) adjacent to score sub-matrix 910, and the partial elements 920 and 930 of those sub-matrices, can be used to judge consistency, for example of the cultural-sphere determination, across the full-name sequence. The partial elements 920 and 930 correspond to the element values S2, S3, and S6 in FIG. 8. As shown in expressions (4) and (5) above, the kana-alphabet similarity of S2 and the kanji-kana correspondence similarities of S3 and S6 allow phrases to be joined on a basis other than similarity alone when, for example, a full-name sequence is generated, so that more accurate full-name candidates can be produced.

For the scoring matrix 900 shown in FIG. 9, expression (4) or (5) is used to extract the phrases corresponding to surnames and given names in descending order of the similarity score of the score sub-matrix (Name_i, Name_j), for example the top three of each. The surname phrases and given-name phrases are concatenated to produce a total similarity score, and the full-name candidate with the highest total similarity score, or the top K candidates by total similarity score (K is a positive integer), are extracted and registered in the result list. At this stage the server 110 can also pass the result to the client 150 as a full-name candidate list.
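The extraction and joining step can be sketched as follows. The function name, the example phrases and scores, and the choice of summing the two phrase scores to obtain the total similarity score are assumptions made for illustration.

```python
import heapq
from itertools import product
from typing import List, Tuple

Scored = Tuple[str, float]  # (phrase, similarity score)


def top_full_names(surnames: List[Scored], given_names: List[Scored],
                   per_phrase: int = 3, k: int = 3) -> List[Scored]:
    """Join the best surname and given-name phrases and rank by total score."""
    best_sur = heapq.nlargest(per_phrase, surnames, key=lambda p: p[1])
    best_giv = heapq.nlargest(per_phrase, given_names, key=lambda p: p[1])
    # Total score is taken as the sum of the two phrase scores (an assumption).
    combined = [(sur + giv, sur_score + giv_score)
                for (sur, sur_score), (giv, giv_score) in product(best_sur, best_giv)]
    return heapq.nlargest(k, combined, key=lambda p: p[1])


# Hypothetical phrase scores.
surnames = [("斎藤", 0.75), ("西藤", 0.5)]
given_names = [("健", 1.0), ("建", 0.25)]
for name, total in top_full_names(surnames, given_names, k=2):
    print(name, total)
```

Limiting each side to its best few phrases before joining keeps the number of combinations small, at the cost of missing a full name whose parts rank low individually.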

The scoring matrix 900 of FIG. 9 also makes it possible to judge whether two full names refer to the same person. For example, even when the kana sequences differ, if both are similar to the same kanji sequence, the probability Pr(kana sequence 1 | kana sequence 2) that they are in fact different persons can be determined statistically in advance, and the full-name candidate can be displayed together with the probability (1 - Pr) that they are the same person.

This embodiment can further adopt a configuration that generates full-name candidates reflecting possible abbreviated writing, transcription errors, and OCR misrecognition, and adds them to the full-name candidate list. This allows generation of full-name candidates that are neither homophones nor written with the same characters. In this embodiment, full-name sequences are generated as candidates when they are likely to be written in abbreviated form, prone to transcription errors, or susceptible to OCR misrecognition.

FIG. 10 is a flowchart of the process for extracting similar full-name candidates that are neither homophones nor written with the same characters. The process of FIG. 10 starts at step S1000. The process does not operate on the phrases that make up the full-name sequence; it evaluates each kanji in order from the beginning to the end of the character sequence, although the reverse order is equally acceptable. In step S1001, the kanji making up the full-name sequence are checked from beginning to end and, for the kanji contained in the sequence, the variant character table 134 and the similar model table 132 (the tables shown in FIG. 6) are consulted to generate reference full-name candidates and a route map.

Here, the reference full-name candidates are obtained by exhaustively extracting surnames and given names that may contain the kanji in the generated full-name candidate. For the extracted names, the tables of FIG. 6 are consulted, and names in which a kanji is replaced by a variant or a visually similar kanji can be set as similar full-name candidates. Specifically, for surname = "萩原", surnames such as "萩野" and "萩" can be expected in addition to "萩原", and when visual similarity and errors are taken into account, "荻原", "荻野", and "荻" also become possible candidates. When some kanji cannot be put into correspondence between the full-name candidate and a reference full-name candidate, as with "伯耆原" against "萩原", a suitably large number is set to raise the cost so that the reference candidate is excluded from the similar full-name candidates.

The process of FIG. 10 is explained with a concrete full-name sequence: the case where the full-name sequence with the highest similarity score obtained as described for FIG. 9 is "荻野明▲高▼" and "萩原朋高" is generated in step S1001. The sequence "萩原朋高" is generated as a candidate because "荻" and "萩", and "明" and "朋", are similar at costs (0.3 and 0.4 respectively) no greater than the threshold set for the similar model table 132, and "高" and "▲高▼" are variants assigned cost = 0.1, which is also no greater than the set threshold.

For surname = "萩野", the surnames "荻野" and "萩原" are extracted; the kanji "野" and "原" are neither similar nor variants, so cost = 2.0 is assigned. The cost threshold can be set as appropriate in view of the specific application, accuracy, and processing efficiency. In general, setting it too low restricts the similar full-name candidates and so reduces how completely candidates are enumerated, while setting it too high reduces processing efficiency.

For the full-name candidate "荻野明▲高▼" and the generated "萩原朋高", step S1001 determines, position by position, whether the kanji are identical by comparing their character codes. If they are not identical (no), the process passes to step S1002; if they are identical, the cost is set to 0 and the process likewise passes to step S1002. Step S1002 determines whether the kanji being processed is a variant. If it is not (no), the process passes to step S1003; if it is (yes), the variant character table 134 is looked up in step S1008 to obtain the cost, and the process passes to step S1003. Step S1003 determines whether the characters being compared are registered as an entry of visually similar characters that give rise to transcription errors, miswriting, or misrecognition.

If the kanji being processed is not registered in such an entry, the process passes to step S1005; if it is registered (yes), the similar model table 132 is looked up in step S1009 to obtain the registered cost, and the process passes to step S1005. In step S1005, the registered costs are summed along the route map to calculate the route cost, and the full-name sequence with the lowest cost is selected and registered as a similar full-name candidate for the full-name candidate. The process ends at step S1006. In other embodiments, the registered similar full-name candidate is sent to the full-name candidate generation unit 126 and added to the full-name candidate list as additional information for the candidates generated by the process described with FIG. 3.
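The checks of FIG. 10 can be sketched as a per-character walk over two names of equal length. The dictionaries below are hypothetical stand-ins for the variant character table 134 and the similar model table 132, "髙" is assumed to be the glyph written "▲高▼", and the second reference name in the usage example is invented for illustration.

```python
from typing import Dict, FrozenSet, List

# Hypothetical stand-ins for variant character table 134 and
# similar model table 132, holding only pairs quoted in the text.
VARIANT_TABLE_134: Dict[FrozenSet[str], float] = {frozenset("高髙"): 0.1}
SIMILAR_TABLE_132: Dict[FrozenSet[str], float] = {
    frozenset("萩荻"): 0.3,
    frozenset("朋明"): 0.4,
}
DISSIMILAR = 2.0  # cost quoted in the text for dissimilar kanji


def position_costs(candidate: str, reference: str) -> List[float]:
    """Walk both names kanji by kanji in the order of the FIG. 10 checks."""
    costs = []
    for a, b in zip(candidate, reference):
        pair = frozenset((a, b))
        if a == b:                       # S1001: same character code
            cost = 0.0
        elif pair in VARIANT_TABLE_134:  # S1002 / S1008: variant lookup
            cost = VARIANT_TABLE_134[pair]
        elif pair in SIMILAR_TABLE_132:  # S1003 / S1009: similar-shape lookup
            cost = SIMILAR_TABLE_132[pair]
        else:
            cost = DISSIMILAR
        costs.append(cost)
    return costs


def closest(candidate: str, references: List[str]) -> str:
    """S1005: pick the reference name with the lowest summed cost."""
    return min(references, key=lambda r: sum(position_costs(candidate, r)))


print(position_costs("荻野明髙", "萩原朋高"))
print(closest("荻野明髙", ["萩原朋高", "荻野朋高"]))
```

The sketch collapses the flowchart into one chain of checks per character; the flowchart itself passes through steps S1002 and S1003 even when an earlier check has already set a cost.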

FIG. 11 shows an embodiment of the route cost calculation process 1100 used in the process of FIG. 10. In the route map 1110 of FIG. 11, the column direction holds the full-name candidate extracted by the process described with FIG. 9, and the row direction holds the full-name sequence of the similar full-name candidate generated by the process described with FIG. 10. The cost between the kanji "萩" and "荻", which also sound alike, is 0.3 as shown in FIG. 6. "原" and "野", by contrast, are dissimilar in the embodiment described, so the route passes along the sides of the corresponding rectangle of the route map 1110. Each side of a rectangle is given route cost = 1.0, which keeps the route cost consistent with the table of FIG. 6. The kanji "明" and "朋" are set to 0.4, and the kanji "高" and "▲高▼" are variants and so are assigned cost = 0.1.

In the route map of FIG. 11, where a cost no greater than the set threshold is registered, the route proceeds along the diagonal, and the route cost is calculated along the minimum-cost route 1120. Where no cost at or below the threshold is registered, the route proceeds along the sides of the rectangle that connect the end points of the diagonal; when the kanji strings do not match at all, the route passes only along the outermost sides, as shown by the maximum-cost route 1130. Selecting the route according to kanji similarity from the first character to the last and accumulating the cost in this way gives route cost = 2.8 for the example. This route cost can be used as a similarity index value indicating how similar the full names are.
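The route cost of 2.8 can be reproduced with a short sketch of the accumulation: a diagonal step when the registered cost is within the threshold, and two rectangle sides at 1.0 each otherwise. The table contents, the threshold value of 0.5, the large cost used for names of different length, and the use of "髙" for "▲高▼" are assumptions.

```python
from typing import Dict, FrozenSet

# Hypothetical registered costs, limited to pairs quoted in the text.
REGISTERED: Dict[FrozenSet[str], float] = {
    frozenset("萩荻"): 0.3,
    frozenset("朋明"): 0.4,
    frozenset("高髙"): 0.1,
}
SIDE_COST = 1.0    # cost of one rectangle side, as stated in the text
LARGE_COST = 1e6   # assumed value for names that cannot be aligned


def route_cost(candidate: str, reference: str, threshold: float = 0.5) -> float:
    """Accumulate the route cost from the first character to the last."""
    if len(candidate) != len(reference):
        return LARGE_COST
    total = 0.0
    for a, b in zip(candidate, reference):
        cost = 0.0 if a == b else REGISTERED.get(frozenset((a, b)))
        if cost is not None and cost <= threshold:
            total += cost            # diagonal step through the cell
        else:
            total += 2 * SIDE_COST   # two sides of the rectangle
    return total


print(round(route_cost("荻野明髙", "萩原朋高"), 2))
```

The four steps contribute 0.3, 2.0, 0.4, and 0.1, so the rounded total is 2.8, matching the figure given in the text.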

When the full-name candidate and a reference full-name candidate differ in number of characters, a cost given as a large number is assigned at the point where a character that cannot be put into correspondence is detected, so that candidates such as "伯耆原" are excluded from the similar full-name candidates. In the embodiment described, a lower route cost indicates a more likely full-name candidate, so among the generated candidates, those with a cost no greater than a set threshold, for example, are added to the full-name candidate list to form the final result list.

In the embodiment of FIG. 11, the similar full-name candidates to be matched and the route map 1110 are generated in advance using the tables of FIG. 6, and the route cost is then calculated. In still other embodiments, transition probabilities for miswriting or transcription errors between kanji are assigned, and the Viterbi algorithm is used to search for a route between the full-name candidate registered in the column direction and the full-name candidates registered in the row direction; the full-name candidate giving the route with the minimum log likelihood is generated as a similar full-name candidate and added to the result list.
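A minimal Viterbi sketch of this alternative is shown below. It treats the intended kanji as hidden states and the written kanji as observations, and it reads "minimum log likelihood" as the minimum negative log-likelihood, which is an interpretation rather than something the text states. Every probability in the example is hypothetical.

```python
import math
from typing import Dict, List, Tuple


def viterbi(observed: str,
            states: List[str],
            start: Dict[str, float],
            trans: Dict[Tuple[str, str], float],
            confuse: Dict[Tuple[str, str], float]) -> Tuple[str, float]:
    """Return the most likely intended kanji string and its negative log-likelihood.

    confuse[(intended, seen)] is the probability of writing `seen` for `intended`.
    Missing entries are treated as a tiny probability.
    """
    def nll(p: float) -> float:
        return -math.log(max(p, 1e-12))

    # best[s] = (cost of the cheapest route ending in state s, that route)
    best = {s: (nll(start.get(s, 0.0)) + nll(confuse.get((s, observed[0]), 0.0)), s)
            for s in states}
    for seen in observed[1:]:
        step = {}
        for s in states:
            # Cheapest way to arrive in state s from any previous state.
            prev_cost, prev_path = min(
                (best[p][0] + nll(trans.get((p, s), 0.0)), best[p][1]) for p in states)
            step[s] = (prev_cost + nll(confuse.get((s, seen), 0.0)), prev_path + s)
        best = step
    cost, path = min(best.values())
    return path, cost


# Hypothetical probabilities: "萩原" is the more common surname, and
# "萩" is sometimes written as "荻".
states = ["萩", "荻", "原"]
start = {"萩": 0.5, "荻": 0.5}
trans = {("萩", "原"): 0.9, ("荻", "原"): 0.1}
confuse = {("萩", "萩"): 0.75, ("萩", "荻"): 0.25, ("荻", "荻"): 0.75, ("原", "原"): 1.0}
path, cost = viterbi("荻原", states, start, trans, confuse)
print(path, round(cost, 4))
```

With these numbers the written name "荻原" decodes to "萩原", because the higher transition probability outweighs the confusion penalty on the first character.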

The processes of FIG. 10 and FIG. 11 can be configured as a module of the full-name candidate generation unit 126 of the server 110, or as an independent processing unit, the full-name similarity calculation unit 128, which determines full-name similarity. The full-name similarity calculation unit 128 receives a kanji sequence, executes each process, sets the received kanji sequence in the column of FIG. 11, generates similar full-name candidates and sets them in the rows, calculates the route costs, and can return the result to the client 150.

FIG. 12 shows a GUI 1200 for entering a kanji sequence. For example, after receiving a list of full-name candidates from the server 110, the client 150 may wish to know whether there are other possible candidates. In that case the client 150 can display the GUI 1200 of FIG. 12, enter the kanji sequence of interest, and obtain possible full-name candidates.

In the GUI 1200 of FIG. 12, the display window 1210 shows an indication 1220 that the GUI is for kanji string similarity determination. The client user enters the kanji sequence of interest in the input field 1230 and clicks the "OK" button to send it to the server 110. The server applies the processing described above to generate similar full-name candidates for the full name "萩野明▲高▼" sent from the client 150, and can return the list to the client as a similar candidate list 1250.

Whether the full-name similarity calculation unit 128 is configured as a module of the full-name candidate generation unit 126 or as an independent processing unit can be chosen as appropriate for the purpose of the specific application when it is implemented on the server 110.

Furthermore, an information processing apparatus, information processing method, and program can be provided that, from nothing more than the input of a surname, given name, or full name as an ideographic or phonetic character sequence, can judge similarity across multiple languages such as CJKV and Indo-European languages, covering not only full-name candidates that are identical or similar in phonetic characteristics but also candidates that are dissimilar in phonetic characteristics.

As described above, the present invention can generate full-name candidates by using the input character sequence to judge similarity across a plurality of different notations. It can further generate, for the generated candidates, similar full-name candidates that are neither homophones nor written with the same characters, accounting for possible miswriting, transcription errors, abbreviated writing, conversion errors, and the like. It can thereby provide an information processing apparatus, information processing method, information processing system, and program that accommodate global full-name notation.

The above functions of this embodiment can be realized by a machine-executable program written in an object-oriented programming language or the like, such as C++, Java (registered trademark), Java (registered trademark) Beans, Java (registered trademark) Applet, Java (registered trademark) Script, Perl, or Ruby. The program can be stored on and distributed via a machine-readable recording medium such as a hard disk device, CD-ROM, MO, flexible disk, EEPROM, or EPROM, and can be transmitted over a network in a format usable by other apparatus.

This embodiment has been described above, but the present invention is not limited to it. Other embodiments, additions, modifications, deletions, and the like may be made within the range that a person skilled in the art can conceive, and any such form falls within the scope of the present invention as long as it exhibits the operation and effects of the present invention.

FIG. 1 is a functional block diagram of the information processing system 100 of this embodiment.
FIG. 2 is a diagram explaining the processing sequence 200 for similarity generation executed by the server 110 of this embodiment.
FIG. 3 is a flowchart of the information processing method for generating full-name candidates in this embodiment.
FIG. 4 is a flowchart of the kana-alphabet conversion process executed by the other notation acquisition unit 120 of this embodiment.
FIG. 5 is a state transition diagram of alphabet conversion using the hidden Markov model described with FIG. 4.
FIG. 6 shows embodiments of the similar model table 132, the variant character table 134, and the identical/dissimilar kanji table 136 described with FIG. 1.
FIG. 7 shows embodiments of the score sub-matrix 700 using notation vectors corresponding to character sequences that the server 110 may receive as input.
FIG. 8 shows the element assignment map 800 for explaining generation of the similarity vector created when a similarity score is calculated.
FIG. 9 shows an embodiment of the scoring matrix 900, which contains score sub-matrices as partial matrices.
FIG. 10 is a flowchart of the process for extracting similar full-name candidates that are neither homophones nor written with the same characters.
FIG. 11 shows an embodiment of the route cost calculation process 1100 used in the process of FIG. 10.
FIG. 12 shows the GUI 1200 for entering a kanji sequence.

Explanation of symbols

100 ... information processing system, 110 ... server, 112 ... network, 114 ... network adapter, 116 ... character sequence acquisition unit, 118 ... character determination unit, 120 ... other notation acquisition unit, 122 ... similarity calculation unit, 124 ... similarity score calculation unit, 126 ... first and last name candidate generation unit, 128 ... first and last name similarity calculation unit, 130 ... phonetic character conversion table, 132 ... similar model table, 134 ... variant character table, 136 ... same/dissimilar kanji table, 140 ... GNR server, 150 ... client computer, 160 ... GNR processing unit

Claims

An information processing apparatus for calculating first and last name similarity, comprising:
a character determination unit that receives a character sequence indicating a first and last name and determines the character type of the characters constituting the received character sequence;
an other notation acquisition unit that generates, from the character sequence, other notations in which the character sequence is written in other character types including ideograms or phonograms, and generates a notation vector including at least two different types of first and last name notations including phonograms;
a similarity calculation unit that performs different similarity determinations depending on the character type and calculates scores that give a measure of similarity for the elements of the notation vector; and
a similarity score calculation unit that calculates a similarity score for first and last name candidates using the scores calculated by the similarity calculation unit.

The information processing apparatus according to claim 1, wherein the other notation acquisition unit generates the other notation using phonograms corresponding to the character sequence when the character type of the character sequence is determined to be alphabetic, and generates the other notation using ideograms when the character sequence is determined to be kana.
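
The character determination and other-notation generation recited above can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names, the Unicode ranges used to classify characters, and the tiny lookup tables are all assumptions made for illustration. In particular, the phonetic character conversion table 130 and the hidden Markov model used for kana-to-alphabet conversion are not reproduced; a dictionary lookup stands in for them.

```python
# Hypothetical toy tables (assumptions); a real system would use full dictionaries.
ALPHA_TO_KANA = {"takagi": "タカギ"}
KANA_TO_KANJI = {"タカギ": ["高木", "髙木"]}
KANJI_TO_KANA = {"高木": "タカギ", "髙木": "タカギ"}


def classify_char(ch: str) -> str:
    """Classify one character by Unicode block (ranges chosen for this sketch)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
        return "kanji"
    if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana blocks
        return "kana"
    if ch.isascii() and ch.isalpha():
        return "alphabet"
    return "other"


def classify_sequence(seq: str) -> str:
    """Return the single character type of the sequence, or 'mixed'."""
    types = {classify_char(c) for c in seq if not c.isspace()}
    types.discard("other")
    return types.pop() if len(types) == 1 else "mixed"


def notation_vector(seq: str) -> dict:
    """Build a notation vector: the input plus one other notation, if known."""
    kind = classify_sequence(seq)
    vec = {kind: seq}
    if kind == "alphabet":
        # Alphabetic input: add a phonogram (kana) notation.
        kana = ALPHA_TO_KANA.get(seq.lower())
        if kana:
            vec["kana"] = kana
    elif kind == "kana":
        # Kana input: add an ideogram (kanji) notation; take the first candidate.
        kanji = KANA_TO_KANJI.get(seq)
        if kanji:
            vec["kanji"] = kanji[0]
    elif kind == "kanji":
        # Kanji input: add a phonogram notation so the vector includes one.
        kana = KANJI_TO_KANA.get(seq)
        if kana:
            vec["kana"] = kana
    return vec
```

With these toy tables, `notation_vector("Takagi")` yields an alphabet entry and a kana entry, while `notation_vector("タカギ")` yields a kana entry and a kanji entry. Taking only the first kanji candidate is a simplification; several candidates could equally be carried forward.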

The information processing apparatus according to claim 1, wherein the information processing apparatus manages a similar model table that associates ideograms having similar appearances and a variant character table that registers variant forms of the ideograms, and further comprises a first and last name similarity calculation unit that looks up the similar model table and the variant character table to calculate a path cost that accounts for appearance similarity and variant character differences.
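
One way to picture a path cost that draws on a similar model table and a variant character table is the following Python sketch. It is a simplification under stated assumptions, not the path cost calculation process 1100 of the embodiment: the table contents, the three cost constants, and the position-by-position comparison are all hypothetical.

```python
# Hypothetical tables (assumptions): unordered pairs of related kanji.
VARIANT_TABLE = {frozenset(("高", "髙"))}   # variant forms of the same character
SIMILAR_TABLE = {frozenset(("木", "本"))}   # characters that look alike

# Assumed costs: variants are cheapest, look-alikes cost more, others cost most.
VARIANT_COST = 0.1
SIMILAR_COST = 0.5
MISMATCH_COST = 1.0


def char_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    pair = frozenset((a, b))
    if pair in VARIANT_TABLE:
        return VARIANT_COST
    if pair in SIMILAR_TABLE:
        return SIMILAR_COST
    return MISMATCH_COST


def path_cost(x: str, y: str) -> float:
    """Sum per-position costs; unmatched trailing characters count as mismatches."""
    cost = sum(char_cost(a, b) for a, b in zip(x, y))
    return cost + MISMATCH_COST * abs(len(x) - len(y))
```

Under these assumed costs, 高木 and 髙木 differ only by a variant form and receive a cost of 0.1, whereas 高木 and 高本 differ by a look-alike character and receive 0.5. A lower cost indicates a closer candidate.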

The information processing apparatus according to claim 3, wherein the information processing apparatus further selects first and last name candidates using the similarity score and sends them to the sender of the character sequence as a first and last name candidate list, and the first and last name similarity calculation unit is a module of the first and last name candidate generation unit or an independent module of the information processing apparatus.

The information processing apparatus according to claim 1, wherein the similarity calculation unit generates a score sub-matrix in which the scores for the elements of the notation vectors to be evaluated are registered, and the similarity score calculation unit uses the scores of the score sub-matrix to calculate a similarity score that gives an index of the similarity of the score sub-matrix.
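
The relationship between per-element scores and the aggregated similarity score can be sketched in Python as follows. This is an illustrative assumption rather than the layout of the score sub-matrix 700: the sub-matrix is reduced to one score per notation type, kanji are compared by exact match, kana and alphabetic notations are compared with `difflib.SequenceMatcher` from the Python standard library, and the similarity score is taken to be the mean. None of these choices is stated in the claims.

```python
from difflib import SequenceMatcher


def score_kanji(a: str, b: str) -> float:
    """Ideograms: exact match only (assumed rule for this sketch)."""
    return 1.0 if a == b else 0.0


def score_phonetic(a: str, b: str) -> float:
    """Phonograms and alphabet: ratio of matching characters, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


# A different similarity determination is selected per character type.
SCORERS = {"kanji": score_kanji, "kana": score_phonetic, "alphabet": score_phonetic}


def score_submatrix(query: dict, candidate: dict) -> dict:
    """Register one score per notation type present in both notation vectors."""
    return {
        kind: SCORERS[kind](query[kind], candidate[kind])
        for kind in query
        if kind in candidate and kind in SCORERS
    }


def similarity_score(sub: dict) -> float:
    """Aggregate the registered scores; the mean is an assumed choice."""
    return sum(sub.values()) / len(sub) if sub else 0.0
```

For example, comparing a query vector containing 高木, タカギ, and takagi with a candidate containing 髙木, タカギ, and takagi registers 0.0 for the kanji element and 1.0 for the other two, giving a mean of about 0.67. A weighted sum, or a kanji scorer that consults a variant character table, would be equally plausible.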

An information processing method, executed by an information processing apparatus, for determining similarity between first and last names, the method comprising the steps, performed by the information processing apparatus, of:
receiving a character sequence indicating a first and last name and determining the character type of the characters constituting the received character sequence;
generating, from the character sequence, other notations in which the character sequence is written in other character types including ideograms or phonetic characters, and generating a notation vector including at least two different types of first and last name notations including phonetic characters;
performing different similarity determinations depending on the character type and calculating scores that give a measure of similarity for the elements of the notation vector; and
calculating a similarity score for first and last name candidates using the scores calculated by the similarity calculation unit.

The information processing method according to claim 6, wherein the step of generating the notation vector includes generating the other notation using phonetic characters corresponding to the character sequence when the character type of the character sequence is determined to be alphabetic, and generating the other notation using ideographic characters when the character sequence is determined to be kana.

The information processing method according to claim 6, further comprising a step of calculating a path cost that accounts for appearance similarity and variant character differences by looking up a similar model table that associates ideographic characters having similar appearances and a variant character table that registers variant forms of the ideographic characters.

The information processing method according to claim 8, further comprising: selecting the first and last name candidates using the similarity score, and sending the selected first and last name candidates as a first and last name candidate list to a sender of the character sequence.

The information processing method according to claim 6, wherein the step of calculating the score includes generating a score sub-matrix in which the scores for the elements of the notation vectors to be evaluated are registered, and
the step of calculating the similarity score includes calculating, using the scores of the score sub-matrix, a similarity score that gives an index of the similarity of the score sub-matrix.

A computer-executable program for causing an information processing apparatus to execute an information processing method for determining first and last name similarity, the program causing the information processing apparatus to execute the steps of:
receiving a character sequence indicating a first and last name and determining the character type of the characters constituting the received character sequence;
generating, from the character sequence, other notations in which the character sequence is written in other character types including ideograms or phonetic characters, and generating a notation vector including at least two different types of first and last name notations including phonetic characters;
performing different similarity determinations depending on the character type and calculating scores that give a measure of similarity for the elements of the notation vector; and
calculating a similarity score for first and last name candidates using the scores calculated by the similarity calculation unit.

The program according to claim 11, wherein the step of generating the notation vector includes generating the other notation using phonetic characters corresponding to the character sequence when the character type of the character sequence is determined to be alphabetic, and generating the other notation using ideograms when the character sequence is determined to be kana.

The program according to claim 12, further causing execution of a step of calculating a path cost that accounts for appearance similarity and variant character differences by looking up a similar model table that associates ideographic characters having similar appearances and a variant character table that registers variant forms of the ideographic characters.

The program according to claim 13, further comprising: selecting the first and last name candidates using the similarity score and sending the selected first and last name candidates to a sender of the character sequence as a first and last name candidate list.

The program according to claim 11, wherein the step of calculating the score includes generating a score sub-matrix in which the scores for the elements of the notation vectors to be evaluated are registered, and
the step of calculating the similarity score includes calculating, using the scores of the score sub-matrix, a similarity score that gives an index of the similarity of the score sub-matrix.

An information processing system for calculating first and last name similarity, the information processing system comprising:
a client computer connected via a network; and
a server computer that receives a character sequence indicating a first and last name from the client computer and determines the similarity of the first and last name, the server computer comprising:
a character determination unit that receives the character sequence indicating the first and last name and determines the character type of the characters constituting the received character sequence;
an other notation acquisition unit that generates, from the character sequence, other notations in which the character sequence is written in other character types including ideograms or phonetic characters, and generates a notation vector including at least two different types of first and last name notations including phonetic characters;
a similarity calculation unit that performs different similarity determinations depending on the character type and calculates scores that give a measure of similarity for the elements of the notation vector; and
a similarity score calculation unit that calculates a similarity score for first and last name candidates using the scores calculated by the similarity calculation unit.

The information processing system according to claim 16, wherein the other notation acquisition unit generates the other notation using phonetic characters corresponding to the character sequence when the character type of the character sequence is determined to be alphabetic, and generates the other notation using ideographic characters when the character sequence is determined to be kana.

The information processing system according to claim 16, wherein the information processing system manages a similar model table that associates ideographic characters having similar appearances and a variant character table that registers variant forms of the ideographic characters, and further comprises a first and last name similarity calculation unit that, for other first and last name candidates, looks up the similar model table and the variant character table to calculate a path cost that accounts for appearance similarity and variant character differences.

The information processing system according to claim 18, wherein the information processing system further selects first and last name candidates using the similarity score and sends them to the sender of the character sequence as a first and last name candidate list, and the first and last name similarity calculation unit is a module of the first and last name candidate generation unit or an independent module of the information processing apparatus.

The information processing system according to claim 16, wherein the similarity calculation unit generates a score sub-matrix in which the scores for the elements of the notation vectors to be evaluated are registered, and the similarity score calculation unit uses the scores of the score sub-matrix to calculate a similarity score that gives an index of the similarity of the score sub-matrix.

An information processing apparatus for calculating first and last name similarity, comprising:
a character determination unit that receives a character sequence indicating a first and last name and determines the character type of the characters constituting the received character sequence;
an other notation acquisition unit that generates, from the character sequence, other notations in which the character sequence is written in other character types including ideograms or phonograms, and generates a notation vector including at least two different types of first and last name notations including phonograms;
a similarity calculation unit that performs different similarity determinations depending on the character type and calculates scores that give a measure of similarity for the elements of the notation vector; and
a similarity score calculation unit that calculates a similarity score for first and last name candidates using the scores calculated by the similarity calculation unit, wherein:
the other notation acquisition unit generates the other notation using phonograms corresponding to the character sequence when the character type of the character sequence is determined to be alphabetic, and generates the other notation using ideograms when the character sequence is determined to be kana;
the information processing apparatus manages a similar model table that associates ideograms having similar appearances and a variant character table that registers variant forms of the ideograms, and includes a first and last name similarity calculation unit that looks up the similar model table and the variant character table to calculate a path cost that accounts for appearance similarity and variant character differences;
the information processing apparatus selects first and last name candidates using the similarity score and sends them to the sender of the character sequence as a first and last name candidate list;
the first and last name similarity calculation unit is a module of the first and last name candidate generation unit or an independent module of the information processing apparatus; and
the similarity calculation unit generates a score sub-matrix in which the scores for the elements of the notation vectors to be evaluated are registered, and the similarity score calculation unit uses the scores of the score sub-matrix to calculate a similarity score that gives an index of the similarity of the score sub-matrix.