JP4303027B2

JP4303027B2 - Apparatus and method for converting lexical data to data

Info

Publication number: JP4303027B2
Application number: JP2003115287A
Authority: JP
Inventors: 秀之武井; 英明岩下; 文彦杉浦; 幸子彌永
Original assignee: Bank of Tokyo Mitsubishi UFJ Trust Co
Current assignee: MUFG Bank Ltd
Priority date: 2003-04-21
Filing date: 2003-04-21
Publication date: 2009-07-29
Anticipated expiration: 2023-04-21
Also published as: JP2004318753A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の単語を含む字句を当該複数の単語により特定される１つの情報を含む別のデータに変換する装置及び方法に関する。
【０００２】
【従来の技術】
例えば、銀行ではいわゆる電文等についてＣＩＦ解析処理が必要になることがある（なお、ＣＩＦは顧客情報ファイル（Customer Information File）を意味する。）。詳細には、銀行間あるいは銀行内での電文の処理で、例えば図１の参照番号３０で示されるような電文中の字句「ＧＥＴＲＯＮＩＣＳＦＯＯＤＳＣＯ．ＬＴＤ１−２−３４ＡＫＡＳＡＫＡ」の中の複数の単語「ＧＥＴＲＯＮＩＣＳ」、「ＦＯＯＤＳ」及び「ＡＫＡＳＡＫＡ」の組み合わせを図１の参照番号３２に示される顧客コード「１２３−４５６７８」に変換することが必要になる。
【０００３】
従来は、この変換処理を次のように行っていた。即ち、複数の単語の組み合わせとそれに対応する顧客コードとの顧客コード・テーブルを予め記憶装置に格納しておく。次いで、入力データから変換すべき複数の単語を抽出して、その抽出された複数の単語と顧客コード・テーブルの中の複数の単語とを文字列比較を行い、一致した場合顧客コードに変換していた（そのような例として、特許文献１参照。）。
【０００４】
【特許文献１】
特開２００２−５６００５号公報
【０００５】
【発明が解決しようとする課題】
しかしながら、このような文字列比較は、１バイト単位で行うため、とりわけ大量のデータが対象になるときには、当該処理に要する検索時間（seek time）の関係上、高速に処理を行うことができないという問題があった。
【０００６】
従って、本発明の課題は、複数の単語を含む字句を当該複数の単語により特定される１つの情報を含む別のデータに高速に変換して、当該データの入力を受けるコンピュータでの処理を可能とすることにある。
【０００７】
【課題を解決するための手段】
上記課題は、本発明の一局面に従った、複数の単語を含む字句を、当該複数の単語が表す対象に予め設定された識別情報を含む別のデータに変換する装置において、複数の単語が予め登録された基本語辞書、及び、前記基本語辞書に登録されている単語のうちキーとして選択されたキー単語と、当該キー単語を他の単語と組み合わせた単語の組と、前記単語の組が表す対象に予め設定された前記識別情報と、が関連付けて予め登録された名称辞書を記憶する記憶手段と、前記複数の単語を含む字句を前記識別情報を含む別のデータに変換する処理エンジンとを備え、前記処理エンジンは、前記基本語辞書をメモリに記憶させ、変換対象の字句を単語に分解し、前記メモリに記憶させた前記基本語辞書において前記分解した個々の単語が記憶されているメモリ・アドレスを各々取得することで、前記変換対象の字句を表すメモリ・アドレスの組み合わせを取得し、前記分解した個々の単語のうち前記名称辞書に前記キー単語として登録されている単語を抽出すると共に、前記名称辞書に登録されている前記単語の組のうち、少なくとも前記抽出した単語を含む単語の組について、前記メモリに記憶させた前記基本語辞書において前記単語の組を構成する個々の単語が記憶されているメモリ・アドレスを各々取得し、少なくとも前記抽出した単語を含む前記単語の組を構成する個々の単語を前記取得したメモリ・アドレスに各々置き換えることで、少なくとも前記抽出した単語を含む前記単語の組を前記単語の組を表すメモリ・アドレスの組み合わせに変換した名称辞書をメモリに記憶させ、前記メモリに記憶した前記名称辞書に登録されている、前記抽出した単語を含む前記単語の組を表すメモリ・アドレスの組み合わせのうち、当該メモリ・アドレスの組み合わせにおける個々のメモリ・アドレスが、前記変換対象の字句を表すメモリ・アドレスの組み合わせにおける個々のメモリ・アドレスの何れかと同じであるメモリ・アドレスの組み合わせを選択し、前記変換対象の字句を、前記選択したメモリ・アドレスの組み合わせと関連付けて前記名称辞書に登録されている前記識別情報を含む別のデータに変換する装置により解決される。
【０００８】
本発明の装置の一形態によれば、前記処理エンジンは、前記名称辞書に登録されている全ての単語の組について、前記メモリに記憶させた前記基本語辞書において前記単語の組を構成する個々の単語が記憶されているメモリ・アドレスを各々取得し、前記個々の単語を前記取得したメモリ・アドレスに各々置き換えることで、前記全ての単語の組を前記単語の組を表すメモリ・アドレスの組み合わせに各々変換した名称辞書をメモリに記憶させることが好ましい。
【０００９】
また、本発明の装置の一形態によれば、前記処理エンジンは、前記名称辞書をメモリに記憶させた際に、前記メモリに記憶させた前記基本語辞書に対し、前記基本語辞書に登録されている単語のうち前記名称辞書に前記キー単語として登録されている単語に、前記メモリに記憶させた前記名称辞書において前記単語を前記キー単語として他の単語と組み合わせた単語の組が記憶されているメモリ・アドレスを付加しておき、前記分解した個々の単語のうち、前記メモリに記憶させた前記基本語辞書において前記メモリ・アドレスが付加されている単語を、前記名称辞書に前記キー単語として登録されている単語として抽出することが好ましい。
【００１０】
また、本発明の装置の一形態によれば、前記処理エンジンは、前記名称辞書に登録されている前記単語の組のうち前記抽出した単語を含む単語の組についてのみ、前記メモリに記憶させた前記基本語辞書において前記単語の組を構成する個々の単語が記憶されているメモリ・アドレスを各々取得し、前記抽出した単語を含む前記単語の組を構成する個々の単語を前記取得したメモリ・アドレスに各々置き換えることで、前記抽出した単語を含む前記単語の組についてのみ前記単語の組を表すメモリ・アドレスの組み合わせに変換した名称辞書のうち、前記抽出した単語を含み前記メモリ・アドレスの組み合わせに変換した前記単語の組及び当該単語の組と関連付けて前記名称辞書に登録された識別情報をメモリに記憶させることが好ましい。
【００１１】
また、本発明の装置の一形態によれば、前記記憶手段に記憶されている名称辞書は、前記キー単語と、前記キー単語を他の単語と組み合わせた単語の組と、当該単語の組に設定されたコードと、が関連付けて予め登録された核名称辞書、及び、前記核名称辞書に登録されているコードと、当該コードと関連付けられた単語の組に更に組み合わせる別の単語と、前記コードと関連付けられた単語の組に更に前記別の単語を組み合わせた単語の組が表す対象に予め設定された識別情報と、が関連付けて予め登録されたフル名称辞書から構成されており、前記処理エンジンは、前記核名称辞書に登録されている前記単語の組のうち、少なくとも前記抽出した単語を含む単語の組について、前記メモリに記憶させた前記基本語辞書において前記単語の組を構成する個々の単語が記憶されているメモリ・アドレスを各々取得し、少なくとも前記抽出した単語を含む前記単語の組を構成する個々の単語を前記取得したメモリ・アドレスに各々置き換えることで、少なくとも前記抽出した単語を含む前記単語の組を前記単語の組を表すメモリ・アドレスの組み合わせに変換した核名称辞書をメモリに記憶させると共に、前記フル名称辞書に登録されている前記別の単語のうち、少なくとも前記抽出した単語を含む単語の組に設定されたコードと関連付けられた前記別の単語について、前記メモリに記憶させた前記基本語辞書において前記別の単語が記憶されているメモリ・アドレスを取得し、少なくとも前記抽出した単語を含む前記単語の組に設定されたコードと関連付けられた前記別の単語を前記取得したメモリ・アドレスに置き換えたフル名称辞書をメモリに記憶させ、前記メモリに記憶した前記核名称辞書に登録されている、前記抽出した単語を含む前記単語の組を表すメモリ・アドレスの組み合わせのうち、当該メモリ・アドレスの組み合わせにおける個々のメモリ・アドレスが、前記変換対象の字句を表すメモリ・アドレスの組み合わせにおける個々のメモリ・アドレスの何れかと同じであるメモリ・アドレスの組み合わせを選択した後に、選択したメモリ・アドレスの組み合わせと関連付けて前記核名称辞書に登録されている前記コードを抽出し、前記メモリに記憶した前記フル名称辞書に前記抽出した前記コードと関連付けて登録されている前記別の単語を表すメモリ・アドレスのうち、当該メモリ・アドレスが、前記変換対象の字句を表すメモリ・アドレスの組み合わせから前記選択したメモリ・アドレスの組み合わせを除外した残りのメモリ・アドレスの何れかと同じであるメモリ・アドレスを選択し、前記変換対象の字句を、前記選択したメモリ・アドレスと関連付けて前記フル名称辞書に登録されている前記識別情報を含む別のデータに変換することが好ましい。
【００１２】
また、本発明の装置の一形態によれば、前記記憶手段に記憶されている名称辞書は、当該名称辞書に登録されている単語の組のうち、前記単語の組を構成する単語の一部が相違しかつ表す対象が同一の複数の単語の組が、前記識別情報としての同一の情報と関連付けられていることが好ましい。
【００１３】
上記課題は、複数の単語を含む字句を、当該複数の単語が表す対象に予め設定された識別情報を含む別のデータに変換する方法において、複数の単語が予め登録された基本語辞書、及び、前記基本語辞書に登録されている単語のうちキーとして選択されたキー単語と、当該キー単語を他の単語と組み合わせた単語の組と、前記単語の組が表す対象に予め設定された前記識別情報と、が関連付けて予め登録された名称辞書を記憶する記憶手段を備えたコンピュータにより、前記基本語辞書をメモリに記憶させ、変換対象の字句を単語に分解し、前記メモリに記憶させた前記基本語辞書において前記分解した個々の単語が記憶されているメモリ・アドレスを各々取得することで、前記変換対象の字句を表すメモリ・アドレスの組み合わせを取得し、前記分解した個々の単語のうち前記名称辞書に前記キー単語として登録されている単語を抽出すると共に、前記名称辞書に登録されている前記単語の組のうち、少なくとも前記抽出した単語を含む単語の組について、前記メモリに記憶させた前記基本語辞書において前記単語の組を構成する個々の単語が記憶されているメモリ・アドレスを各々取得し、少なくとも前記抽出した単語を含む前記単語の組を構成する個々の単語を前記取得したメモリ・アドレスに各々置き換えることで、少なくとも前記抽出した単語を含む前記単語の組を前記単語の組を表すメモリ・アドレスの組み合わせに変換した名称辞書をメモリに記憶させ、前記メモリに記憶した前記名称辞書に登録されている、前記抽出した単語を含む前記単語の組を表すメモリ・アドレスの組み合わせのうち、当該メモリ・アドレスの組み合わせにおける個々のメモリ・アドレスが、前記変換対象の字句を表すメモリ・アドレスの組み合わせにおける個々のメモリ・アドレスの何れかと同じであるメモリ・アドレスの組み合わせを選択し、前記変換対象の字句を、前記選択したメモリ・アドレスの組み合わせと関連付けて前記名称辞書に登録されている前記識別情報を含む別のデータに変換する処理を行わせる方法により解決される。
【００２１】
【発明の実施の形態】
本発明の好適な実施形態を以下図面を参照して説明する。
図１は、本発明の好適な実施形態による字句をデータに変換する装置の基本構成を示す図である。図１において、１０はメイン・フレーム・コンピュータ、パーソナル・コンピュータ、マイクロプロセッサ等の任意のデータ処理装置より構成される処理エンジンを、１２はメイン・メモリを、１４は基本語辞書を、１６は核名称辞書を、１８はフル名称辞書をそれぞれ示す。基本語辞書１４、核名称辞書１６及びフル名称辞書１８は、磁気ディスク等のハード・ディスク（図示せず）に格納されているが、これに限定されず、いずれの他の形式の記憶装置に格納され得る。処理エンジン１０として機能するデータ処理装置と、メイン・メモリ１２、及び基本語辞書１４、核名称辞書１６及びフル名称辞書１８を格納するハード・ディスクとは通常のデータ・バス等（図示せず）により相互に結合されている。
【００２２】
図２は、基本語辞書１４に事前に登録されている単語（以下、「基本語」とも言う。）をメイン・メモリ１２上にメモリ・アドレスをシンボルとしてシンボル化即ちメモリ展開した状態を示す。なお、本明細書における単語あるいは基本語には、普通名詞、固有名詞、略語が含まれるのは勿論、その他、ある意味を有するいずれの一組の記号も含まれる。図２に示すように、基本語辞書１４の一例は、項目として、キー、品詞、名称の属性、コードの属性を含むが、本発明の基本語辞書としては基本語を登録するための項目であるキーを少なくとも含めばよく、その他の項目は上記のものに限定されるものではない。基本語辞書１４は、変換すべき字句に登録されていない基本語を含む場合、新たな基本語を登録し、また登録済みの基本語で使用しなくなった場合に削除できる構造であることが好ましい。処理エンジン１０は、変換処理を開始する前に、図２に示すように、基本語辞書１４に登録されている基本語をメイン・メモリ１２上にメモリ・アドレスをシンボルとしてシンボル化即ちメモリ展開する。即ち、各登録内容のエントリポイントとしてメモリ・アドレスが割り振られる。具体的には、キーの欄の基本語「ＡＫＡＳＡＫＡ」はメイン・メモリ１２上のメモリ・アドレス１００番という場所に情報が格納され、キーの欄の基本語「ＢＡＮＫ」はメモリ・アドレス１０１番という場所に情報が格納される等々である。基本語をメモリ上へ展開するときに、各基本語に対して後述するようにメモリ・アドレスを格納するための「名称パターン」という項目を付加してメモリ展開する。なお、変換すべき字句に含まれる基本語が事前に分かっている場合には、用いられる基本語だけをメイン・メモリ１２上に展開してもよく、更に、用途によっては、変換処理速度が遅くなるが、基本語辞書１４に登録されている基本語の一部分をメモリ展開し、未展開の基本語が変換処理に必要になったとき追加的にメモリ展開するようにしてもよい。
【００２３】
図３は、核名称辞書１６及びフル名称辞書１８のそれぞれに事前に登録されている核名称及びフル名称をメイン・メモリ１２上にメモリ・アドレスをシンボルとしてシンボル化即ちメモリ展開した状態を示す。図３に示すように、核名称辞書１６の項目は、キー、名称パターン、コードから成る。核名称辞書１６の名称パターンの項目には、基本語辞書１４に登録されている基本語のうちで、変換すべき可能性のある基本語の組み合わせに含まれる２つの基本語が事前に登録されている。具体的には、核名称辞書１６の第１行には「ＧＥＴＲＯＮＩＣＳ」と「ＦＯＯＤＳ」とが、第２行には「ＧＥＴＲＯＮＩＣＳ」と「ＳＨＯＫＵＨＩＮ」とが、第３行には「ＧＥＴＲＯＮＩＣＳ」と「ＢＡＮＫ」とがそれぞれ文字列として登録されている。これらの名称パターンに共通する基本語は「ＧＥＴＲＯＮＩＣＳ」であり、この基本語が核名称辞書１６のキーの項目に登録されている。コードには、各名称パターンとの関連を表すための記号が登録される。名称パターンの「ＧＥＴＲＯＮＩＣＳＦＯＯＤＳ」と「ＧＥＴＲＯＮＩＣＳＳＨＯＫＵＨＩＮ」とはその意味内容がおなじであることから、コードとして同じ記号「＃ＧＥＴＲＯ＃」が割り当てられるのが好ましいが、異なっていてもよい。核名称辞書１６は、変換すべき字句に、登録されていない基本語を含む組み合わせがある場合、新たな基本語を含む組み合わせを登録し、また登録済みの組み合わせで使用しなくなった場合に削除できるようにされていることが好ましい。
【００２４】
フル名称辞書１８の項目も、図３に示すように、キー、名称パターン、コードから成る。フル名称辞書１８の名称パターンの項目には、変換すべき可能性のある基本語の組み合わせの中で核名称辞書１６の名称パターンに示された基本語の組み合わせに対応する記号と、それと組になる基本語とが組になって事前に登録されている。具体的には、フル名称辞書１８の名称パターンの第１行には「＃ＧＥＴＲＯ＃」と「ＡＫＡＳＡＫＡ」とが、第２行には「＃ＧＥＴＲＯ＃」と「ＯＳＡＫＡ」とがそれぞれ事前に登録されている。これらの名称パターンに共通する記号は「＃ＧＥＴＲＯ＃」であるので、フル名称辞書１８のキーにはその記号が登録される。フル名称辞書１８のコードには、名称パターンに対応する変換後の目的のデータ、この場合には顧客コードが登録されている。具体的には、「ＧＥＴＲＯＮＩＣＳＦＯＯＤＳＡＫＡＳＡＫＡ」及び「ＧＥＴＲＯＮＩＣＳＳＨＯＫＵＨＩＮＡＫＡＳＡＫＡ」の両方の顧客コードは、「１２３−４５６７８」であるので、その顧客コードがフル名称辞書１８のコードの第１行に、また、「ＧＥＴＲＯＮＩＣＳＦＯＯＤＳＯＳＡＫＡ」及び「ＧＥＴＲＯＮＩＣＳＳＨＯＫＵＨＩＮＯＳＡＫＡ」の両方の顧客コードは、「１０１−２３４５６」であるので、その顧客コードがフル名称辞書１８のコードの第２行にそれぞれ登録される。フル名称辞書１８は、変換すべき字句に、登録されていない基本語を含む組み合わせがある場合、新たな基本語を含む組み合わせを登録し、また登録済みの組み合わせで使用しなくなった場合に削除できるようにされていることが好ましい。
【００２５】
なお、この例では、核名称辞書１６及びフル名称辞書１８の名称パターンとしては２つの基本語の組み合わせを用いているが、処理速度が多少遅くなることが許容できる場合には、３つ以上の組み合わせを用いてもよい。また、この例では、核名称辞書１６とフル名称辞書１８と２段の名称辞書を用いているが、用途に応じて、核名称辞書１６のみ、あるいはフル名称辞書１８を２つ以上用いてもよい。
【００２６】
処理エンジン１０は、変換処理を開始する前に、図３に示すように、核名称辞書１６に登録されている名称パターンを、シンボル化され即ちメモリ展開済みの核名称辞書１６の基本語のメモリ・アドレスを参照して、メイン・メモリ１２上にメモリ・アドレスをシンボルとしてシンボル化即ちメモリ展開する。その際、核名称辞書１６のキーが同じものは１グループにまとめてメモリ展開する。具体的には、核名称辞書１６の名称パターンの第１〜３行にある「ＧＥＴＲＯＮＩＣＳ」、「ＦＯＯＤＳ」、「ＳＨＯＫＵＨＩＮ」及び「ＢＡＮＫ」には、メイン・メモリ１２にメモリ展開された基本語辞書１４の基本語とそれに対応するメモリ・アドレスを参照して、「１０７番」、「１０６番」、「１１２番」及び「１０１番」が図３の３４に示すように割り当てられる。そして、核名称辞書１６のキーに「ＧＥＴＲＯＮＩＣＳ」と登録されている３件を名称パターンとしてシンボル化することにより使用していない任意のメモリ・アドレス、例えば２０００番を取得する。詳細には、核名称辞書１６の名称パターンの第１行から第３行は、「ＧＥＴＲＯＮＩＣＳ」の同一のキーを持つので、第１行の名称パターンの「ＧＥＴＲＯＮＩＣＳ」に対応するメモリ・アドレス１０７番のエントリポイントとして、使用していない任意のメモリ・アドレス、例えば２０００番が割り振られる。
【００２７】
次いで、「＃ＧＥＴＲＯ＃」及び「＃ＧＥＴＲＯＢＫ＃」で登録されている核名称辞書１６のコードをシンボル化する。即ち、核名称辞書１６のコードの第１及び２行の「＃ＧＥＴＲＯ＃」及び第３行の「＃ＧＥＴＲＯＢＫ＃」には使用していない任意のメモリ・アドレス、例えば「５００番」及び「５０１番」がそれぞれ割り振られる。但し、５００番及び５０１番には、メモリ・アドレスを格納できる領域が確保されるだけで、「＃ＧＥＴＲＯ＃」及び「＃ＧＥＴＲＯＢＫ＃」が格納されるわけではない。メイン・メモリ１２上の２０００番の第１行には、核名称辞書１６の第１行に対応するよう、「１０７番」、「１０６番」とそれと関連付けられて「５００番」が格納され、メイン・メモリ１２上の２０００番の第２行には、核名称辞書１６の第２行に対応するよう、「１０７番」、「１１２番」とそれと関連付けられて「５００番」が格納され、メイン・メモリ１２上の２０００番の第３行には、核名称辞書１６の第３行に対応するよう、「１０７番」、「１０１番」とそれと関連付けられて「５０１番」が格納される。更に、核名称辞書１６の中の基本語「ＧＥＴＲＯＮＩＣＳ」をキーとするグループとする名称パターンのメモリ・アドレス２０００番を、シンボル化された基本語「ＧＥＴＲＯＮＩＣＳ」と結びつけるため、メモリ展開された基本語辞書１４上のメモリ・アドレス１０７番の「名称パターン」の格納領域に「２０００番」が格納される。
【００２８】
次いで、処理エンジン１０は、変換処理を開始する前に、図３に示すように、フル名称辞書１８に登録されている名称パターンを、シンボル化され即ちメモリ展開済みの核名称辞書１６の基本語のメモリ・アドレス、及び核名称辞書１６のコードに割り当てられたメモリ・アドレスを参照して、メイン・メモリ１２上にメモリ・アドレスをシンボルとしてシンボル化即ちメモリ展開する。その際、フル名称辞書１８のキーが同じものは１グループにまとめてメモリ展開する。具体的には、シンボル化された核名称辞書のメモリ・アドレスをフル名称辞書１８のシンボルに展開する（即ち、紐付けする）ため、フル名称辞書１８の名称パターンの第１〜２行にある「＃ＧＥＴＲＯ＃」には５００番が先に割り当てられているので、そのメモリ・アドレス番号を図３の３６に示すように割り当てる。そして「ＡＫＡＳＡＫＡ」及び「ＯＳＡＫＡ」には、メイン・メモリ１２に展開された基本語辞書１４の基本語とそれに対応するメモリ・アドレスを参照して、「１００番」及び「１１１番」が図３の３６に示すように割り当てられる。そして、フル名称辞書１８の名称パターンの第１行及び第２行は、「＃ＧＥＴＲＯ＃」の同一のキーを持つので、第１行の名称パターンの「＃ＧＥＴＲＯ＃」に対応するメモリ・アドレス５００番のエントリポイントとして、使用していないメモリ・アドレス、例えば８０００番が割り振られる。次いで、フル名称辞書１８をシンボル化して得られたメモリ・アドレス８０００番を核名称辞書１６のシンボル展開（即ち、紐付け）するため、メモリ・アドレス５００番の格納領域に８０００番を格納する。こうして、メモリ・アドレス８０００番の第１行には、「＃５００」及び「＃１００」が変換後の目的データ即ち顧客コード「１２３−４５６７８」と関連付けて格納され、第２行には、「＃５００」及び「＃１１１」が変換後の目的データ即ち顧客コード「１０１−２３５６４」と関連付けて格納される。
【００２９】
なお、フル名称辞書１８が２以上ある場合には、最後のフル名称辞書より前の中間のフル名称辞書のコードには核名称辞書１６のコードの記号（この例では、「＃ＧＥＴＲＯ＃」あるいは「＃ＧＥＴＲＯＢＫ＃」）と類似の記号で各名称パターンを識別可能にする記号が登録される。そして、中間のフル名称辞書のメモリ展開では、その名称パターンのシンボル化はフル名称辞書１８における８０００番での格納状態と同様であるが、８０００番の格納領域の「１２３−４５６７８」及び「１０１−２３５６４」に相当する格納領域に当該中間のフル名称辞書の記号に与えられるメモリ・アドレスが格納される。
【００３０】
次に、入力データの変換処理を図１〜図３並びに図４及び図５を参照して説明する。図４及び図５は、図１に示す変換装置に入力されたデータが変換される過程を説明するための図である。図５のメモリ展開は、図３に示すメモリ展開と同じものであるが、説明の理解を容易にするため、図２に示す基本語辞書１４に記載の全ての基本語のメモリ展開が示されている。
【００３１】
ここで、メイン・メモリ１２上には前述したように基本語辞書１４、核名称辞書１６及びフル名称辞書１８がシンボル化されているとする。そして、図４の参照番号４０で示すデータが入力されたとする。処理エンジン１０は、ステップ４２に示されるように入力データ４０を単語に分解する。次いで、処理エンジン１０は、分解された単語に対応するメモリ・アドレスを、図５に示すメイン・メモリ１２上にメモリ展開された基本語辞書１４ａを参照して取得する。この取得の仕方には二分検索が好ましいが、本発明はいずれの取得方法でもよい。図５の基本語辞書１４ａの中の丸で囲った基本語に対応するメモリ・アドレスが取得される。
【００３２】
次いで、処理エンジン１０は、ステップ４４において、分解された単語のうち、メモリ・アドレスが取得できた単語については、当該単語を取得できたメモリ・アドレスに変換する。なお、＜１−２−３４＞のように基本語辞書１４ａにはない場合にはそのままにしておく。
【００３３】
処理エンジン１０は、ステップ４６において、キーとなる基本語、ここでは「ＧＥＴＲＯＮＩＣＳ」のメモリ・アドレス「１０７番」をキーにして、他のメモリ・アドレス、即ち「１０７番」と「１０６番」、「１０４番」、「１００番」とのうちのいずれかの組が、図５に示すメモリ展開された核名称辞書１６ａの中にあるか検索して、一致した場合には核名称辞書１６の一致したコードのメモリ・アドレス「５００番」を取得する。詳細には、処理エンジン１０は、メモリ展開された基本語辞書１４ａのメモリ・アドレス１０７番の「名称パターン」の格納領域に格納されている２０００番を読み取り、その２０００番に基づいてメモリ展開された核名称辞書１６ａの２０００番に格納されているメモリ・アドレスの組の中で「１０７番」と「１０６番」、「１０４番」、「１００番」とのいずれかとの組み合わせがあるか調べる。この例では、「１０７番」と「１０６番」の組み合わせが一致する（図４のステップ４６で丸を付した組み合わせと図５の核名称辞書１６ａの中で丸を付した行を参照）ので、「５００番」が取得され、「１０７番」と「１０６番」の組み合わせが「５００番」に変換される。
【００３４】
ステップ４８において、処理エンジン１０は、続いて、キーとなる記号のメモリ・アドレス「５００番」をキーにして、他のメモリ・アドレスとの組み合わせ、ここでは「５００番」と「１００番」の組み合わせが、図５に示すメモリ展開されたフル名称辞書１８ａの中にあるか検索して、一致した場合にはフル名称辞書１８の一致したコードを取得する。詳細には、処理エンジン１０は、メイン・メモリ１２内のメモリ・アドレス５００番に格納されているメモリ・アドレス８０００番を読み取り、その８０００番に基づいてメモリ展開されたフル名称辞書１８ａの８０００番に格納されているメモリ・アドレスの組の中で「５００番」と「１００番」の組があるか調べる。この例では、「５００番」と「１００番」の組み合わせが一致する（図４のステップ４８で丸を付した組み合わせと図５のフル名称辞書１８ａの中で丸を付した行を参照）ので、メイン・メモリ１２上の「１２３−４５６７８」が取得され、「５００番」と「１００番」の組み合わせが「１２３−４５６７８」に変換される。その結果、入力データ即ち字句の中の「ＧＥＴＲＯＮＩＣＳＦＯＯＤＳＡＫＡＳＡＫＡ」が所望のデータである顧客コード「１２３−４５６７８」に変換される。
【００３５】
なお、図１の処理エンジン１０内に記載されている処理ブロックと図４の処理ステップとは、図４のステップ４２及び４４が図１の単語認識ブロック２０に、図４のステップ４６が図１の核名称認識ブロック２２に、図４のステップ４８がフル名称認識ブロック２４にそれぞれ対応する。
【００３６】
また、本発明の字句をデータに変換する装置及び方法には、入力データに入力ミス、例えば「ＧＥＴＲＯＮＩＣＳ」を「ＧＥＴＲＯＭＩＣＳ」と入力した場合に、例えば綴りパターン辞書を用いるような、従来の綴り補正機能を持たせてもよく、入力される単語が連続的に綴られている場合に、連語辞書を用いるような、従来の連語処理機能を持たせてもよい。
【００３７】
更に、本発明の字句をデータに変換する装置及び方法には、用途に応じて、図１に示されるように、入力データ３０から参照番号３２に示すように名称「ＧＥＴＲＯＮＩＣＳＦＯＯＤＳＣＯ．ＬＴＤ」を抽出する機能を含めてもよい。
【００３８】
図６は、本発明のシンボル化による単語比較と従来の文字列比較との相違を説明する図である。例えば、入力データ「ＧＥＴＲＯＮＩＣＳＦＯＯＤＳ」を、「ＧＥＴＲＯＮＩＣＳＢＡＮＫ」、「ＧＥＴＲＯＮＩＣＳＥＬＥＣＴＲＯＮＩＣＳ」及び「ＧＥＴＲＯＮＩＣＳＦＯＯＤＳ」の３つの組から一致するのを検索する場合で説明する。本発明では、図６の（ａ）に示すように、これら３つの組６０を上記実施形態で説明したようにメモリ・アドレスをシンボルとしてシンボル化して、６２に示すようにメモリ・アドレスの組に変換する。変換された組の単語は合計６単語になる。しかも、これら６単語は、メモリ・アドレスであるので数字である。従って、メモリ・アドレスに変換された入力データ２単語の数字とこれら６単語の数字とを単語単位で比較するので、非常に高速に比較できる。一方、従来の文字列比較では、図６の（ｂ）に示すように、合計４７文字を文字単位で比較しているので、比較速度は遅くならざるを得なかった。本発明のシンボル化による比較方法は、検索対象が小さい場合でも本質的に従来の文字列比較方法より処理速度が早いが、例えば、銀行業務等のように検索対象のデータが膨大になると処理速度の違いが顕著になり、従来の文字列比較より極めて高速に処理できる。なお、本発明のシンボル化による比較方法では、辞書データをメモリに展開する処理が必要になるが、この処理はシステム起動時の初期処理につき、起動後の比較処理の性能に影響を与えるものではない。
【００３９】
次に、前述した実施形態の変形例を以下に説明する。上記実施形態と同じ構成、動作の部分は説明を省き、相違する部分のみを説明する。処理エンジン１０は、入力データを受け取る前に、基本語辞書１４をメイン・メモリ１２上にメモリ・アドレスをシンボルとしてシンボル化するが、核名称辞書１６及びフル名称辞書１８について事前にメイン・メモリ１２上にシンボル化しない。なお、メモリ展開された基本語辞書１４には、図３に示すような「名称パターン」の格納領域を設ける必要がない。
【００４０】
次いで、処理エンジン１０は、入力データを受け取り、図４のステップ４４までの処理を行う。処理エンジン１０は、次いで、入力データに含まれる単語からキーとなる単語を抽出し、そして核名称辞書１６の中の項目「キー」に抽出された単語を含む組を検索して（図３参照）、メイン・メモリ１２上に、メモリ展開された基本語辞書１４ａ（図５）を参照してメモリ・アドレスをシンボルとしてシンボル化する。例えば、図４に示す入力データ４０が入力された場合、キーの単語として「ＧＥＴＲＯＮＩＣＳ」が抽出され、核名称辞書１６のキーの項目に「ＧＥＴＲＯＮＩＣＳ」を含む組が図３（あるいは図５）におけるメイン・メモリ１２上のメモリ・アドレス２０００番に示されるようにシンボル化される。ここで、処理エンジン１０は、図３における核名称辞書１６の各行とメモリ・アドレス２０００番に示される各行とが任意の従来の技法を用いて関連付けるようにしておく。従って、メモリ・アドレス「５００番」及び「５０１番」を格納しなくてもよい。
【００４１】
処理エンジン１０は、図４のステップ４６と類似の処理を行う。但し、処理エンジン１０は、一致した組、即ち、図４及び図５に示す例では、メモリ・アドレス２０００番の第１行を特定し、それに関連付けられている核名称辞書１６の第１行のコード「＃ＧＥＴＲＯ＃」（図３参照）を抽出する。
【００４２】
処理エンジン１０は、フル名称辞書１８のキーの項目に「＃ＧＥＴＲＯ＃」を含む組を図３（あるいは図５）におけるメイン・メモリ１２上のメモリ・アドレス８０００番に示されるようにシンボル化する。但し、「５００番」を格納しなくてもよい。次いで、処理エンジン１０は、図４のステップ４８と類似の処理を行う。メモリ・アドレス「５００番」を用いない場合は、処理エンジン１０は、メモリ・アドレス８０００番の各行のうち、入力データの中のそれまでのステップで処理していないメモリ・アドレス、この例では「１００番」を含む行を特定して、目的の顧客コード「１２３−４５６７８」に変換する。この変形例は、変換処理速度が前の実施形態より遅くなるが、メイン・メモリ１２の容量が少なくてよい。
【００４３】
【発明の効果】
本発明は、以上説明したように構成され、動作するので、従来の文字列比較において必要とした１バイト単位の検索処理が必要でないことにより検索時間を顕著に削減することができ、その結果複数の単語を含む字句を当該複数の単語により特定される１つの情報を含む別のデータに高速に変換して、当該データの入力を受けるコンピュータでの処理が可能となる。
【図面の簡単な説明】
【図１】図１は、本発明の好適な実施形態による字句をデータに変換する装置の基本構成を示す図である。
【図２】図２は、図１の基本語辞書１４に事前に登録されている単語をメイン・メモリ１２上にメモリ・アドレスをシンボルとしてシンボル化即ちメモリ展開した状態を示す。
【図３】図３は、図１の核名称辞書１６及びフル名称辞書１８のそれぞれに事前に登録されている核名称及びフル名称をメイン・メモリ１２上にメモリ・アドレスをシンボルとしてシンボル化即ちメモリ展開した状態を示す。
【図４】図４は、図１に示す変換装置に入力されたデータが変換される過程を説明するための図の一部である。
【図５】図５は、図１に示す変換装置に入力されたデータが変換される過程を説明するための図の一部である。なお、図５のメモリ展開は、図３に示すメモリ展開と同じものであるが、説明の理解を容易にするため、図２に示す基本語辞書１４に記載の全ての基本語のメモリ展開が示されている。
【図６】図６は、本発明のシンボル化による単語比較と従来の文字列比較との相違を説明する図である。
【符号の説明】
１０処理エンジン
１２メイン・メモリ
１４基本語辞書
１６核名称辞書
１８フル名称辞書
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to an apparatus and a method for converting a lexical phrase including a plurality of words into other data including one piece of information specified by those words.
[0002]
[Prior art]
For example, a bank may need to perform CIF analysis on so-called telegraphic messages (CIF stands for Customer Information File). Specifically, when messages are processed between banks or within a bank, the combination of the words “GETRONICS”, “FOODS”, and “AKASAKA” contained in a phrase such as “GETRONICS FOODS CO. LTD 1-2-34 AKASAKA”, which appears in the message indicated by reference numeral 30 in FIG. 1, must be converted into the customer code “123-45678” indicated by reference numeral 32 in FIG. 1.
[0003]
Conventionally, this conversion has been performed as follows. A customer code table associating combinations of words with the corresponding customer codes is stored in a storage device in advance. The words to be converted are then extracted from the input data and compared, as character strings, with the words in the customer code table; when they match, the words are converted into the customer code (see, for example, Patent Document 1).
[0004]
[Patent Document 1]
JP 2002-56005 A
[0005]
[Problems to be solved by the invention]
However, because such character string comparison is performed one byte at a time, the processing cannot be carried out at high speed, particularly when a large amount of data is involved, owing to the search time (seek time) that the processing requires.
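To make the cost difference concrete, the following minimal sketch counts how many units a matcher has to examine under each approach. The three candidate names are the ones the description uses in its discussion of FIG. 6; the function names are illustrative assumptions and are not part of the patent.

```python
# Candidate names from the FIG. 6 discussion, split into basic words.
CANDIDATES = [
    ("GETRONICS", "BANK"),
    ("GETRONICS", "ELECTRONICS"),
    ("GETRONICS", "FOODS"),
]

def chars_to_compare(candidates):
    """Characters a byte-wise string comparison may have to inspect."""
    return sum(len(word) for pair in candidates for word in pair)

def symbols_to_compare(candidates):
    """Word-sized symbols an address-based comparison has to inspect."""
    return sum(len(pair) for pair in candidates)

print(chars_to_compare(CANDIDATES))    # 47
print(symbols_to_compare(CANDIDATES))  # 6
```

The counts agree with the 47 characters and 6 words that the description cites for this example.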
[0006]
Accordingly, an object of the present invention is to convert a lexical phrase including a plurality of words, at high speed, into other data including one piece of information specified by those words, so that the data can be processed by a computer that receives it as input.
[0007]
[Means for Solving the Problems]
The above object is achieved, according to one aspect of the present invention, by an apparatus for converting a lexical phrase including a plurality of words into other data including identification information set in advance for the object that the plurality of words represents. The apparatus comprises storage means and a processing engine. The storage means stores a basic word dictionary in which a plurality of words are registered in advance, and a name dictionary in which a key word selected as a key from the words registered in the basic word dictionary, a set of words combining the key word with other words, and the identification information set in advance for the object that the set of words represents are registered in advance in association with one another. The processing engine converts the lexical phrase into the other data including the identification information as follows. It stores the basic word dictionary in memory, decomposes the phrase to be converted into words, and acquires the memory address at which each decomposed word is stored in the basic word dictionary in memory, thereby obtaining a combination of memory addresses representing the phrase to be converted. It extracts, from the decomposed words, a word registered as a key word in the name dictionary. For at least those sets of words registered in the name dictionary that include the extracted word, it acquires the memory address at which each word of the set is stored in the basic word dictionary in memory, replaces each word with the acquired address, and stores in memory a name dictionary in which at least those sets have thereby been converted into combinations of memory addresses. From the combinations of memory addresses registered in the name dictionary in memory that represent sets of words including the extracted word, it selects a combination in which every memory address is the same as one of the memory addresses in the combination representing the phrase to be converted. Finally, it converts the phrase to be converted into other data including the identification information registered in the name dictionary in association with the selected combination.
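The single-dictionary form described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: small integers stand in for memory addresses, the phrase is split on whitespace, and the three-word sets in `NAME_DICT` are hypothetical entries built from the words and customer codes that appear in the description.

```python
# Integers stand in for the memory addresses of basic words.
BASIC_WORDS = {"AKASAKA": 100, "FOODS": 106, "GETRONICS": 107, "OSAKA": 111}

# key word -> list of (word set, identification information); entries are illustrative.
NAME_DICT = {
    "GETRONICS": [
        (("GETRONICS", "FOODS", "AKASAKA"), "123-45678"),
        (("GETRONICS", "FOODS", "OSAKA"), "101-23456"),
    ],
}

def convert(phrase):
    """Return the identification information for the phrase, or None if nothing matches."""
    words = phrase.split()
    # Combination of addresses representing the phrase; unregistered tokens are skipped.
    phrase_addrs = {BASIC_WORDS[w] for w in words if w in BASIC_WORDS}
    for word in words:
        # Only word sets registered under a key word found in the phrase are examined.
        for word_set, ident in NAME_DICT.get(word, []):
            set_addrs = {BASIC_WORDS[w] for w in word_set}
            # Select the set whose every address also occurs in the phrase.
            if set_addrs <= phrase_addrs:
                return ident
    return None

print(convert("GETRONICS FOODS CO. LTD 1-2-34 AKASAKA"))  # 123-45678
```

The comparison operates on integers only; the character strings are touched once, when each word is looked up in the basic word table.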
[0008]
According to one form of the apparatus of the present invention, the processing engine preferably acquires, for every set of words registered in the name dictionary, the memory address at which each word of the set is stored in the basic word dictionary in memory, replaces each word with the acquired memory address, and stores in memory a name dictionary in which every set of words has thereby been converted into a combination of memory addresses representing that set.
[0009]
Further, according to one form of the apparatus of the present invention, when the processing engine stores the name dictionary in memory, it preferably adds, to each word in the basic word dictionary in memory that is registered as a key word in the name dictionary, the memory address at which the sets of words combining that key word with other words are stored in the name dictionary in memory. Among the decomposed words, the processing engine then extracts those to which such a memory address has been added in the basic word dictionary in memory as the words registered as key words in the name dictionary.
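One way to picture this linkage is a basic word record with an optional pointer field. The sketch below is an assumption-laden illustration: the `BasicWord` class and its field names are hypothetical, and the addresses 100, 106, 107, and 2000 are the values used in the embodiment's worked example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BasicWord:
    text: str
    address: int
    # Address of the group of word sets registered under this key word, if any.
    name_pattern: Optional[int] = None

def build_basic_words():
    table = {
        "AKASAKA": BasicWord("AKASAKA", 100),
        "FOODS": BasicWord("FOODS", 106),
        "GETRONICS": BasicWord("GETRONICS", 107),
    }
    # When the name dictionary is loaded, link the key word to its group.
    table["GETRONICS"].name_pattern = 2000
    return table

def extract_key_words(words, table):
    """Return the decomposed words that carry a name-pattern address."""
    return [w for w in words if w in table and table[w].name_pattern is not None]

print(extract_key_words(["GETRONICS", "FOODS", "1-2-34", "AKASAKA"], build_basic_words()))
# ['GETRONICS']
```

With this layout, deciding whether a word is a key word is a single field check, and the same field gives the location of the word sets to search next.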
[0010]
Further, according to one form of the apparatus of the present invention, the processing engine preferably performs the address conversion only for those sets of words registered in the name dictionary that include the extracted word. For each such set, it acquires the memory address at which each word of the set is stored in the basic word dictionary in memory and replaces each word with the acquired address. Of the name dictionary converted in this way, it stores in memory only the sets of words that include the extracted word, now converted into combinations of memory addresses, together with the identification information registered in the name dictionary in association with those sets.
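This on-demand form can be sketched as a filter over the stored name dictionary. The code is only an illustration: the row layout, the second key word, and the identifiers `ID-1` to `ID-3` are invented for the example, and integers again stand in for memory addresses.

```python
BASIC_WORDS = {"AKASAKA": 100, "BANK": 101, "FOODS": 106, "GETRONICS": 107}

# Stored name dictionary rows: (key word, word set, identification information).
# The "AKASAKA" row and all three identifiers are made up for illustration.
NAME_DICT_ON_DISK = [
    ("GETRONICS", ("GETRONICS", "FOODS"), "ID-1"),
    ("GETRONICS", ("GETRONICS", "BANK"), "ID-2"),
    ("AKASAKA", ("AKASAKA", "BANK"), "ID-3"),
]

def load_sets_for(key_word):
    """Convert only the word sets registered under key_word into address tuples."""
    return [
        (tuple(BASIC_WORDS[w] for w in word_set), ident)
        for key, word_set, ident in NAME_DICT_ON_DISK
        if key == key_word
    ]

print(load_sets_for("GETRONICS"))
# [((107, 106), 'ID-1'), ((107, 101), 'ID-2')]
```

Only the rows for the extracted key word are held in memory, which trades some conversion speed for a smaller memory footprint.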
[0011]
Further, according to one form of the apparatus of the present invention, the name dictionary stored in the storage means preferably consists of a nuclear name dictionary and a full name dictionary. In the nuclear name dictionary, the key word, a set of words combining the key word with other words, and a code set for that set of words are registered in advance in association with one another. In the full name dictionary, a code registered in the nuclear name dictionary, another word to be further combined with the set of words associated with that code, and the identification information set in advance for the object represented by the set of words obtained by adding the other word to the set associated with the code are registered in advance in association with one another. In this form, for at least those sets of words registered in the nuclear name dictionary that include the extracted word, the processing engine acquires the memory address at which each word of the set is stored in the basic word dictionary in memory, replaces each word with the acquired address, and stores the resulting nuclear name dictionary in memory. Likewise, for at least those other words registered in the full name dictionary that are associated with a code set for a set of words including the extracted word, it acquires the memory address at which the other word is stored in the basic word dictionary in memory, replaces the other word with the acquired address, and stores the resulting full name dictionary in memory. From the combinations of memory addresses registered in the nuclear name dictionary in memory that represent sets of words including the extracted word, the processing engine selects a combination in which every memory address is the same as one of the memory addresses in the combination representing the phrase to be converted, and extracts the code registered in the nuclear name dictionary in association with the selected combination. From the memory addresses representing the other words registered in the full name dictionary in memory in association with the extracted code, it then selects a memory address that is the same as one of the memory addresses remaining after the selected combination is excluded from the combination representing the phrase to be converted. Finally, it converts the phrase to be converted into other data including the identification information registered in the full name dictionary in association with the selected memory address.
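The two-stage lookup can be sketched as below. The addresses (100 to 112) and code symbols (500, 501) follow the embodiment's worked example; the dictionary layouts are simplifying assumptions, and in particular the full name dictionary here stores only the address of the additional word rather than a pair.

```python
BASIC_WORDS = {"AKASAKA": 100, "BANK": 101, "FOODS": 106,
               "GETRONICS": 107, "OSAKA": 111, "SHOKUHIN": 112}

# Nuclear name dictionary in memory: key-word address -> [(address pair, code symbol)].
NUCLEAR = {107: [((107, 106), 500), ((107, 112), 500), ((107, 101), 501)]}

# Full name dictionary in memory: code symbol -> [(address of the other word, customer code)].
FULL = {500: [(100, "123-45678"), (111, "101-23456")]}

def convert(phrase):
    """Resolve a phrase to a customer code via the nuclear and full name dictionaries."""
    addrs = [BASIC_WORDS[w] for w in phrase.split() if w in BASIC_WORDS]
    present = set(addrs)
    for key in addrs:
        for pair, code in NUCLEAR.get(key, []):
            if set(pair) <= present:
                # Addresses left over after the matched nuclear pair is excluded.
                remaining = present - set(pair)
                for other, customer_code in FULL.get(code, []):
                    if other in remaining:
                        return customer_code
    return None

print(convert("GETRONICS FOODS CO. LTD 1-2-34 AKASAKA"))  # 123-45678
print(convert("GETRONICS SHOKUHIN OSAKA"))                # 101-23456
```

Because both "FOODS" and "SHOKUHIN" pairs resolve to the same code symbol, the second stage needs one row per location rather than one row per name variant.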
[0012]
Further, according to one form of the apparatus of the present invention, in the name dictionary stored in the storage means, a plurality of sets of words that differ in some of their constituent words but represent the same object are preferably associated with the same information as the identification information.
[0013]
The above object is also achieved by a method for converting a lexical phrase including a plurality of words into other data including identification information set in advance for the object that the plurality of words represents. The method is carried out by a computer having storage means that stores a basic word dictionary in which a plurality of words are registered in advance, and a name dictionary in which a key word selected as a key from the words registered in the basic word dictionary, a set of words combining the key word with other words, and the identification information set in advance for the object that the set of words represents are registered in advance in association with one another. The method causes the computer to perform the following processing: storing the basic word dictionary in memory; decomposing the phrase to be converted into words; acquiring the memory address at which each decomposed word is stored in the basic word dictionary in memory, thereby obtaining a combination of memory addresses representing the phrase to be converted; extracting, from the decomposed words, a word registered as a key word in the name dictionary; for at least those sets of words registered in the name dictionary that include the extracted word, acquiring the memory address at which each word of the set is stored in the basic word dictionary in memory, replacing each word with the acquired address, and storing in memory a name dictionary in which at least those sets have thereby been converted into combinations of memory addresses; selecting, from the combinations of memory addresses registered in the name dictionary in memory that represent sets of words including the extracted word, a combination in which every memory address is the same as one of the memory addresses in the combination representing the phrase to be converted; and converting the phrase to be converted into other data including the identification information registered in the name dictionary in association with the selected combination.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
Preferred embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is a diagram showing the basic configuration of an apparatus for converting a lexical phrase into data according to a preferred embodiment of the present invention. In FIG. 1, reference numeral 10 denotes a processing engine composed of any data processing device such as a mainframe computer, a personal computer, or a microprocessor; 12 denotes a main memory; 14 denotes a basic word dictionary; 16 denotes a nuclear name dictionary; and 18 denotes a full name dictionary. The basic word dictionary 14, the nuclear name dictionary 16, and the full name dictionary 18 are stored on a hard disk such as a magnetic disk (not shown), but they are not limited to this and may be stored in any other type of storage device. The data processing device functioning as the processing engine 10, the main memory 12, and the hard disk storing the basic word dictionary 14, the nuclear name dictionary 16, and the full name dictionary 18 are connected to one another by an ordinary data bus or the like (not shown).
[0022]
FIG. 2 shows the words registered in advance in the basic word dictionary 14 (hereinafter also called “basic words”) after they have been symbolized, that is, expanded in memory, on the main memory 12 with memory addresses as their symbols. In this specification, a word or basic word includes not only common nouns, proper nouns, and abbreviations but also any set of symbols that has some meaning. As shown in FIG. 2, one example of the basic word dictionary 14 has the items key, part of speech, name attribute, and code attribute; the basic word dictionary of the present invention need only include at least the key, which is the item for registering a basic word, and the other items are not limited to those listed. The basic word dictionary 14 preferably has a structure that allows a new basic word to be registered when a phrase to be converted contains a basic word that is not yet registered, and a registered basic word to be deleted when it is no longer used. Before starting the conversion process, the processing engine 10 symbolizes the basic words registered in the basic word dictionary 14 on the main memory 12 with memory addresses as their symbols, as shown in FIG. 2; that is, a memory address is assigned as the entry point of each registered entry. Specifically, the information for the basic word “AKASAKA” in the key column is stored at memory address 100 on the main memory 12, the information for the basic word “BANK” in the key column is stored at memory address 101, and so on. When the basic words are expanded in memory, an item called “name pattern”, which holds a memory address as described later, is added to each basic word. If the basic words contained in the phrases to be converted are known in advance, only those basic words may be expanded on the main memory 12. Depending on the application, and at the cost of slower conversion, only part of the basic words registered in the basic word dictionary 14 may be expanded in memory, with unexpanded basic words expanded additionally when the conversion process needs them.
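The symbolization step can be sketched as a sorted table searched by binary search, which the description later names as the preferred lookup. The table below lists only the basic words whose addresses the description states; other entries are omitted, integers stand in for memory addresses, and the function names are assumptions made for the example.

```python
from bisect import bisect_left

# Basic words and addresses named in the description; other entries are omitted.
ENTRIES = [("AKASAKA", 100), ("BANK", 101), ("FOODS", 106),
           ("GETRONICS", 107), ("OSAKA", 111), ("SHOKUHIN", 112)]
KEYS = [word for word, _ in ENTRIES]  # already in sorted order

def address_of(word):
    """Binary-search the expanded dictionary; None if the word is not registered."""
    i = bisect_left(KEYS, word)
    if i < len(KEYS) and KEYS[i] == word:
        return ENTRIES[i][1]
    return None

def symbolize(phrase):
    """Replace each registered word with its address and leave other tokens as they are."""
    out = []
    for token in phrase.split():
        addr = address_of(token)
        out.append(token if addr is None else addr)
    return out

print(symbolize("GETRONICS FOODS 1-2-34 AKASAKA"))
# [107, 106, '1-2-34', 100]
```

Tokens that are not registered, such as the street number, pass through unchanged, which matches how the description treats them in the conversion steps.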
[0023]
FIG. 3 shows the nuclear names and full names registered in advance in the nuclear name dictionary 16 and the full name dictionary 18, respectively, after they have been symbolized, that is, expanded in memory, on the main memory 12 with memory addresses as their symbols. As shown in FIG. 3, the items of the nuclear name dictionary 16 are key, name pattern, and code. In the name pattern item, two basic words contained in a combination of basic words that may need to be converted are registered in advance, chosen from the basic words registered in the basic word dictionary 14. Specifically, “GETRONICS” and “FOODS” are registered as character strings in the first row of the nuclear name dictionary 16, “GETRONICS” and “SHOKUHIN” in the second row, and “GETRONICS” and “BANK” in the third row. The basic word common to these name patterns is “GETRONICS”, and this basic word is registered in the key item of the nuclear name dictionary 16. In the code item, a symbol expressing the relationship with each name pattern is registered. Because the name patterns “GETRONICS FOODS” and “GETRONICS SHOKUHIN” have the same meaning, they are preferably assigned the same symbol “#GETRO#” as their code, although the symbols may differ. The nuclear name dictionary 16 is preferably arranged so that a new combination can be registered when a phrase to be converted contains a combination including an unregistered basic word, and a registered combination can be deleted when it is no longer used.
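As a rough picture of this table, the rows can be written as (key, name pattern, code) tuples and grouped by key. The tuple layout and the helper name are assumptions; the code “#GETROBK#” for the third row is taken from the description's later discussion of memory expansion.

```python
from collections import defaultdict

# Rows of the nuclear name dictionary as described: (key, name pattern, code).
NUCLEAR_ROWS = [
    ("GETRONICS", ("GETRONICS", "FOODS"), "#GETRO#"),
    ("GETRONICS", ("GETRONICS", "SHOKUHIN"), "#GETRO#"),
    ("GETRONICS", ("GETRONICS", "BANK"), "#GETROBK#"),
]

def group_by_key(rows):
    """Collect the (name pattern, code) entries that share a key word."""
    groups = defaultdict(list)
    for key, pattern, code in rows:
        groups[key].append((pattern, code))
    return dict(groups)

groups = group_by_key(NUCLEAR_ROWS)
print(len(groups["GETRONICS"]))  # 3
```

Grouping by key means a phrase containing “GETRONICS” needs to be checked only against these three patterns.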
[0024]
The entries of the full name dictionary 18 also consist of a key, a name pattern, and a code, as shown in FIG. 3. In the name pattern item of the full name dictionary 18, for each combination of basic words that may be subject to conversion, the symbol corresponding to the combination of basic words given in the name pattern of the nuclear name dictionary 16 and the basic word combined with it are registered in advance as a pair. Specifically, “#GETRO#” and “AKASAKA” are registered in advance in the first line of the name pattern of the full name dictionary 18, and “#GETRO#” and “OSAKA” in the second line. Since the symbol common to these name patterns is “#GETRO#”, that symbol is registered in the key item of the full name dictionary 18. In the code item of the full name dictionary 18, the target data after conversion corresponding to the name pattern, in this case a customer code, is registered. Specifically, since the customer code of both “GETRONICS FOODS AKASAKA” and “GETRONICS SHOKUHIN AKASAKA” is “123-45678”, that customer code is registered in the first line of the code item of the full name dictionary 18, and since the customer code of both “GETRONICS FOODS OSAKA” and “GETRONICS SHOKUHIN OSAKA” is “101-23456”, that customer code is registered in the second line. Like the nuclear name dictionary 16, the full name dictionary 18 is preferably built so that, when a lexical phrase to be converted includes a combination containing a basic word that is not yet registered, the combination containing the new basic word can be registered, and so that a combination that is no longer used can be deleted.
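The relationship between the two dictionaries can be illustrated with a minimal Python sketch. It works on plain strings, before any symbolization into memory addresses, and the names `NUCLEAR_NAME_ROWS`, `FULL_NAME_ROWS`, and `resolve` are hypothetical and not taken from the specification. The code “#GETROBK#” for the “GETRONICS BANK” row is the one the description assigns to the third line of the nuclear name dictionary 16.

```python
# Rows of the nuclear name dictionary 16: (key, name pattern, code).
NUCLEAR_NAME_ROWS = [
    ("GETRONICS", ("GETRONICS", "FOODS"), "#GETRO#"),
    ("GETRONICS", ("GETRONICS", "SHOKUHIN"), "#GETRO#"),
    ("GETRONICS", ("GETRONICS", "BANK"), "#GETROBK#"),
]

# Rows of the full name dictionary 18: (key, name pattern, customer code).
FULL_NAME_ROWS = [
    ("#GETRO#", ("#GETRO#", "AKASAKA"), "123-45678"),
    ("#GETRO#", ("#GETRO#", "OSAKA"), "101-23456"),
]


def resolve(words):
    """Return the customer code for a list of basic words, or None."""
    present = set(words)
    for _key, pattern, code in NUCLEAR_NAME_ROWS:
        if not set(pattern) <= present:
            continue
        # First stage matched: the word pair is represented by its code symbol.
        for _fkey, (symbol, other), customer_code in FULL_NAME_ROWS:
            # Second stage: the code symbol plus one further basic word.
            if symbol == code and other in present:
                return customer_code
    return None
```

Because “GETRONICS FOODS” and “GETRONICS SHOKUHIN” share the code “#GETRO#”, both spellings reach the same full name rows, which is why the full name dictionary needs only one row per location.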
[0025]
In this example, a combination of two basic words is used as the name pattern of the nuclear name dictionary 16 and the full name dictionary 18, but a combination of three or more may be used when a somewhat slower processing speed is acceptable. Also, in this example a two-stage name dictionary consisting of the nuclear name dictionary 16 and the full name dictionary 18 is used; however, depending on the application, only the nuclear name dictionary 16 may be used, or two or more full name dictionaries 18 may be used.
[0026]
Prior to starting the conversion process, the processing engine 10 symbolizes the name patterns registered in the nuclear name dictionary 16 on the main memory 12 using memory addresses as symbols, that is, expands them in memory, as shown in FIG. 3, by referring to the memory addresses of the basic words of the memory-expanded basic word dictionary 14. At that time, entries with the same key in the nuclear name dictionary 16 are grouped into one group and expanded in memory. Specifically, “GETRONICS”, “FOODS”, “SHOKUHIN”, and “BANK” in the first to third lines of the name pattern of the nuclear name dictionary 16 are assigned “107”, “106”, “112”, and “101”, respectively, as indicated by 34 in FIG. 3, by referring to the basic words of the basic word dictionary 14 expanded in the main memory 12 and their corresponding memory addresses. Then, an arbitrary unused memory address, for example 2000, is acquired for the group formed by symbolizing the three name patterns registered with the key “GETRONICS” in the nuclear name dictionary 16. Specifically, since the first to third lines of the name pattern of the nuclear name dictionary 16 share the same key “GETRONICS”, an arbitrary unused memory address, for example 2000, is assigned as the entry point reached from memory address 107, which corresponds to “GETRONICS”.
[0027]
Next, the codes “#GETRO#” and “#GETROBK#” registered in the nuclear name dictionary 16 are symbolized. That is, arbitrary unused memory addresses, for example “500” and “501”, are assigned to “#GETRO#” in the first and second lines of the code item of the nuclear name dictionary 16 and to “#GETROBK#” in the third line, respectively. Note that addresses 500 and 501 only hold an area for storing a memory address; the strings “#GETRO#” and “#GETROBK#” themselves are not stored. In the first row at address 2000 on the main memory 12, “107”, “106”, and “500” are stored in correspondence with the first line of the nuclear name dictionary 16; in the second row, “107”, “112”, and “500” are stored in correspondence with the second line; and in the third row, “107”, “101”, and “501” are stored in correspondence with the third line. Further, in order to link memory address 2000, which holds the name patterns keyed by the basic word “GETRONICS” in the nuclear name dictionary 16, to the symbolized basic word “GETRONICS”, “2000” is stored in the “name pattern” storage area of memory address 107 in the memory-expanded basic word dictionary 14.
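The memory expansion of the nuclear name dictionary described above can be sketched in Python as follows. This is only an illustration under stated assumptions: plain integers stand in for memory addresses, Python dictionaries stand in for storage areas, and the function name `expand_nuclear_dictionary` and its parameters `group_base` and `code_base` are hypothetical. The address values 100 to 112, 500, 501, and 2000 are the ones used in the example.

```python
# Addresses of basic words in the memory-expanded basic word dictionary 14
# (integers simulate memory addresses; only words used in this example are listed).
BASIC_WORD_ADDRESS = {
    "AKASAKA": 100, "BANK": 101, "FOODS": 106,
    "GETRONICS": 107, "OSAKA": 111, "SHOKUHIN": 112,
}

# Rows of the nuclear name dictionary 16: (key, name pattern, code).
NUCLEAR_NAME_ROWS = [
    ("GETRONICS", ("GETRONICS", "FOODS"), "#GETRO#"),
    ("GETRONICS", ("GETRONICS", "SHOKUHIN"), "#GETRO#"),
    ("GETRONICS", ("GETRONICS", "BANK"), "#GETROBK#"),
]


def expand_nuclear_dictionary(rows, word_address, group_base=2000, code_base=500):
    """Replace words and codes by addresses and group rows that share a key."""
    name_pattern_link = {}  # key word address -> group address (107 -> 2000)
    groups = {}             # group address -> rows of (word addr, word addr, code addr)
    code_address = {}       # code symbol -> code address; the string itself is not kept in the rows
    next_group, next_code = group_base, code_base
    for key, pattern, code in rows:
        key_addr = word_address[key]
        if key_addr not in name_pattern_link:
            name_pattern_link[key_addr] = next_group
            groups[next_group] = []
            next_group += 1
        if code not in code_address:
            code_address[code] = next_code
            next_code += 1
        row = tuple(word_address[w] for w in pattern) + (code_address[code],)
        groups[name_pattern_link[key_addr]].append(row)
    return name_pattern_link, groups, code_address


link, groups, codes = expand_nuclear_dictionary(NUCLEAR_NAME_ROWS, BASIC_WORD_ADDRESS)
```

After expansion, every later comparison works on integers only; the original strings are needed again only if the dictionaries are rebuilt.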
[0028]
Next, still before starting the conversion process, the processing engine 10 symbolizes the name patterns registered in the full name dictionary 18 on the main memory 12 using memory addresses as symbols, that is, expands them in memory, as shown in FIG. 3, by referring to the memory addresses of the basic words of the memory-expanded basic word dictionary 14 and the memory addresses assigned to the codes of the nuclear name dictionary 16. At that time, entries with the same key in the full name dictionary 18 are grouped into one group and expanded in memory. Specifically, because the symbols of the full name dictionary 18 are linked to the memory addresses of the symbolized nuclear name dictionary, “#GETRO#” in the first and second lines of the name pattern of the full name dictionary 18, which was previously assigned address 500, is given that memory address, as indicated by 36 in FIG. 3. “AKASAKA” and “OSAKA” are assigned “100” and “111”, respectively, as indicated by 36 in FIG. 3, by referring to the basic words of the basic word dictionary 14 expanded in the main memory 12 and their corresponding memory addresses. Since the first and second lines of the name pattern of the full name dictionary 18 share the same key “#GETRO#”, an unused memory address, for example 8000, is assigned as the entry point reached from memory address 500, which corresponds to “#GETRO#”. Next, in order to link memory address 8000, obtained by symbolizing the full name dictionary 18, to the symbol of the nuclear name dictionary 16, “8000” is stored in the storage area of memory address 500. Thus, in the first row at memory address 8000, “500” and “100” are stored in association with the target data after conversion, that is, the customer code “123-45678”, and in the second row, “500” and “111” are stored in association with the customer code “101-23564”.
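A corresponding sketch for the full name dictionary is given below. As before, integers simulate memory addresses and the names `expand_full_name_dictionary`, `CODE_ADDRESS`, and `group_base` are hypothetical. The code address 500 is assumed to have been assigned already when the nuclear name dictionary was expanded, and the customer codes are copied as quoted in the paragraph above.

```python
# Addresses of the basic words used by the full name dictionary 18.
BASIC_WORD_ADDRESS = {"AKASAKA": 100, "OSAKA": 111}

# Address assigned to the code symbol when the nuclear name dictionary was expanded.
CODE_ADDRESS = {"#GETRO#": 500}

# Rows of the full name dictionary 18: (key, combined basic word, customer code).
FULL_NAME_ROWS = [
    ("#GETRO#", "AKASAKA", "123-45678"),
    ("#GETRO#", "OSAKA", "101-23564"),
]


def expand_full_name_dictionary(rows, word_address, code_address, group_base=8000):
    """Replace symbols and words by addresses and group rows that share a key."""
    code_link = {}  # code address -> group address (500 -> 8000)
    groups = {}     # group address -> rows of (code addr, word addr, customer code)
    next_group = group_base
    for key, other_word, customer_code in rows:
        key_addr = code_address[key]
        if key_addr not in code_link:
            code_link[key_addr] = next_group
            groups[next_group] = []
            next_group += 1
        groups[code_link[key_addr]].append(
            (key_addr, word_address[other_word], customer_code)
        )
    return code_link, groups


code_link, full_groups = expand_full_name_dictionary(
    FULL_NAME_ROWS, BASIC_WORD_ADDRESS, CODE_ADDRESS
)
```

The entry `code_link[500] == 8000` plays the role of the value “8000” stored in the storage area of memory address 500.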
[0029]
When there are two or more full name dictionaries 18, the code item of each intermediate full name dictionary holds, like the code item of the nuclear name dictionary 16 (in this example, “#GETRO#” or “#GETROBK#”), a symbol by which each name pattern can be identified. When an intermediate full name dictionary is expanded in memory, the name patterns are symbolized in the same way as in the storage state at address 8000 for the full name dictionary 18, except that the storage areas that hold “123-45678” and “101-23564” at address 8000 instead hold the memory addresses assigned to the symbols of the intermediate full name dictionary.
[0030]
Next, the input data conversion processing will be described with reference to FIGS. 1 to 5. FIGS. 4 and 5 are diagrams for explaining a process in which data input to the conversion device shown in FIG. 1 is converted. The memory expansion in FIG. 5 is the same as that shown in FIG. 3, but, to make the explanation easier to follow, the memory expansion of all the basic words in the basic word dictionary 14 shown in FIG. 2 is shown.
[0031]
Here, it is assumed that the basic word dictionary 14, the nuclear name dictionary 16, and the full name dictionary 18 have been symbolized on the main memory 12 as described above, and that the data indicated by reference numeral 40 in FIG. 4 is input. The processing engine 10 decomposes the input data 40 into words, as shown in step 42. Next, the processing engine 10 acquires the memory address corresponding to each decomposed word by referring to the memory-expanded basic word dictionary 14a on the main memory 12 shown in FIG. 5. A binary search is preferable for this acquisition, but any search method may be used in the present invention. The memory addresses corresponding to the basic words circled in the basic word dictionary 14a of FIG. 5 are acquired.
[0032]
Next, in step 44, the processing engine 10 converts each decomposed word for which a memory address was obtained into that memory address. A word that does not exist in the basic word dictionary 14a, such as <1-234>, is left as it is.
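Steps 42 and 44 can be sketched in Python using a binary search over a sorted word table. Several details here are assumptions rather than part of the description: the exact spacing of the input string, the rule of stripping trailing punctuation from tokens, and the presence of the word “CO” at address 104 (the description uses address 104 but does not name its word in this passage). The names `BASIC_WORDS`, `address_of`, and `symbolize` are hypothetical.

```python
from bisect import bisect_left

# Memory-expanded basic word dictionary 14a, sorted by word so that a
# binary search can be used. Integers simulate memory addresses.
# ("CO", 104) is an assumption made for this illustration.
BASIC_WORDS = [
    ("AKASAKA", 100), ("BANK", 101), ("CO", 104), ("FOODS", 106),
    ("GETRONICS", 107), ("OSAKA", 111), ("SHOKUHIN", 112),
]
_WORDS = [word for word, _ in BASIC_WORDS]


def address_of(word):
    """Binary-search the sorted word table; return the address or None."""
    i = bisect_left(_WORDS, word)
    if i < len(_WORDS) and _WORDS[i] == word:
        return BASIC_WORDS[i][1]
    return None


def symbolize(text):
    """Step 42: split into words. Step 44: replace known words by addresses."""
    result = []
    for token in text.split():
        addr = address_of(token.strip(".,"))
        # Words without an address are left as they are.
        result.append(addr if addr is not None else token)
    return result


symbols = symbolize("GETRONICS FOODS CO. LTD 1-2-34 AKASAKA")
```

With these assumptions, `symbols` mixes integers for recognized basic words with the untouched strings for everything else, which matches the idea that unrecognized words simply pass through step 44.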
[0033]
In step 46, the processing engine 10 uses a basic word as a key, here the memory address “107” of “GETRONICS”, and searches whether any pairing of the key with one of the other memory addresses, that is, “107” with “106”, “104”, or “100”, exists in the memory-expanded nuclear name dictionary 16a shown in FIG. 5. If a pair matches, the memory address of the matched code, here “500”, is acquired. Specifically, the processing engine 10 reads the address 2000 stored in the “name pattern” storage area of memory address 107 in the memory-expanded basic word dictionary 14a and, based on that address, checks whether the memory address pairs stored at address 2000 of the memory-expanded nuclear name dictionary 16a include a pairing of “107” with “106”, “104”, or “100”. In this example, the pair “107” and “106” matches (see the pair circled in step 46 of FIG. 4 and the circled row of the nuclear name dictionary 16a in FIG. 5), so “500” is acquired and the pair “107” and “106” is converted to “500”.
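Step 46 can be sketched as a lookup over integer tuples. The structures `NAME_PATTERN_LINK` and `NUCLEAR_GROUPS` and the function `match_nuclear_name` are hypothetical names; they simulate, respectively, the “name pattern” storage area of address 107 and the rows stored at address 2000.

```python
# "Name pattern" storage area of the basic word dictionary 14a: key address -> group address.
NAME_PATTERN_LINK = {107: 2000}

# Rows stored at address 2000 of the nuclear name dictionary 16a:
# (key address, other word address, code address).
NUCLEAR_GROUPS = {
    2000: [(107, 106, 500), (107, 112, 500), (107, 101, 501)],
}


def match_nuclear_name(addresses):
    """Return (code address, remaining addresses), or (None, addresses) if nothing matches."""
    numeric = [a for a in addresses if isinstance(a, int)]
    for key in numeric:
        group = NAME_PATTERN_LINK.get(key)
        if group is None:
            continue  # this word is not a key of the nuclear name dictionary
        others = set(numeric) - {key}
        for row_key, other, code in NUCLEAR_GROUPS[group]:
            # Integer comparison only: no character-by-character matching.
            if row_key == key and other in others:
                remaining = [a for a in addresses if a not in (key, other)]
                return code, remaining
    return None, list(addresses)


code, remaining = match_nuclear_name([107, 106, 104, 100])
```

Here the matched pair “107” and “106” is consumed, and the unmatched addresses are carried forward for the next stage.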
[0034]
In step 48, the processing engine 10 then uses the memory address “500” of the code symbol as a key and searches whether a pairing of the key with another memory address, here “500” and “100”, exists in the memory-expanded full name dictionary 18a shown in FIG. 5. If a pair matches, the matched code of the full name dictionary 18 is acquired. Specifically, the processing engine 10 reads the memory address 8000 stored at memory address 500 in the main memory 12 and, based on that address, checks whether the memory address pairs stored at address 8000 of the memory-expanded full name dictionary 18a include the pair “500” and “100”. In this example, the pair “500” and “100” matches (see the pair circled in step 48 of FIG. 4 and the circled row of the full name dictionary 18a in FIG. 5), so “123-45678” on the main memory 12 is acquired and the pair “500” and “100” is converted to “123-45678”. As a result, “GETRONICS FOODS AKASAKA” in the input data, that is, the lexical phrase, is converted into the customer code “123-45678”, which is the desired data.
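Step 48 follows the same pattern one level up. In the sketch below, `CODE_LINK`, `FULL_NAME_GROUPS`, and `match_full_name` are hypothetical names, integers simulate memory addresses, and only the “AKASAKA” row at address 8000 is reproduced to keep the example short.

```python
# Storage area of code address 500: code address -> group address.
CODE_LINK = {500: 8000}

# Rows stored at address 8000 of the full name dictionary 18a:
# (code address, word address, customer code). Only the AKASAKA row is shown.
FULL_NAME_GROUPS = {
    8000: [(500, 100, "123-45678")],
}


def match_full_name(code_address, remaining):
    """Return the customer code for the code address and leftover word addresses."""
    group = CODE_LINK.get(code_address)
    if group is None:
        return None  # no full name rows are keyed by this code
    candidates = {a for a in remaining if isinstance(a, int)}
    for key, other, customer_code in FULL_NAME_GROUPS[group]:
        if key == code_address and other in candidates:
            return customer_code
    return None


customer_code = match_full_name(500, [104, 100])
```

In this sketch, address 104 is ignored because no row at address 8000 pairs it with 500, while address 100 completes the pair and yields the customer code.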
[0035]
The processing blocks shown in the processing engine 10 in FIG. 1 correspond to the processing steps in FIG. 4 as follows: steps 42 and 44 in FIG. 4 correspond to the word recognition block 20 in FIG. 1, step 46 corresponds to the nuclear name recognition block, and step 48 corresponds to the full name recognition block 24.
[0036]
The apparatus and method for converting a lexical phrase into data according to the present invention may include a conventional spelling correction function, for example one using a spelling pattern dictionary, for cases where the input data contains an input error, for example “GETRONICS” entered as “GETROMICS”. They may also include a conventional collocation processing function, for example one using a collocation dictionary, for cases where input words are written run together.
[0037]
Furthermore, the apparatus and method for converting a lexical phrase into data according to the present invention may include a function of extracting the name “GETRONICS FOODS CO. LTD” from the input data 30, as indicated by reference numeral 32 in FIG. 1.
[0038]
FIG. 6 is a diagram for explaining the difference between word comparison by symbolization according to the present invention and conventional character string comparison. As an example, consider searching three sets, “GETRONICS BANK”, “GETRONICS ELECTRONICS”, and “GETRONICS FOODS”, for one that matches the input data “GETRONICS FOODS”. In the present invention, as shown in FIG. 6(a), these three sets 60 are symbolized using memory addresses as symbols, as described in the above embodiment, and converted into sets of memory addresses as shown at 62. The converted sets contain six words in total, and because these six words are memory addresses, they are numbers. The numbers of the two words of the input data, likewise converted into memory addresses, are compared with the numbers of these six words in units of words, so the comparison can be performed very quickly. In conventional character string comparison, by contrast, as shown in FIG. 6(b), a total of 47 characters must be compared in units of characters, so the comparison is inevitably slow. The comparison method based on symbolization according to the present invention is inherently faster than the conventional character string comparison method even when the search target is small, and the difference in processing speed becomes pronounced when the data to be searched is enormous, as in banking operations, so that processing can be performed far faster than with conventional character string comparison. The comparison method using symbolization according to the present invention requires the dictionary data to be expanded in memory; however, because this is initial processing performed at system startup, it does not affect the performance of the comparison processing after startup.
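The figures of six words and 47 characters can be checked with a few lines of Python. The function name `comparison_units` is hypothetical, and the counts are the total number of units available for comparison, not a measurement of running time; an actual comparison routine may stop early on the first mismatch.

```python
CANDIDATES = ["GETRONICS BANK", "GETRONICS ELECTRONICS", "GETRONICS FOODS"]


def comparison_units(candidates):
    """Return (number of words, number of characters excluding spaces)."""
    words = [word for candidate in candidates for word in candidate.split()]
    word_units = len(words)                      # units when comparing by symbol
    char_units = sum(len(word) for word in words)  # units when comparing by character
    return word_units, char_units


word_units, char_units = comparison_units(CANDIDATES)
# word_units is 6 and char_units is 47, matching FIG. 6(a) and FIG. 6(b).
```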
[0039]
Next, a modification of the above-described embodiment will be described. Descriptions of configurations and operations that are the same as in the above embodiment are omitted, and only the differences are described. Before receiving input data, the processing engine 10 symbolizes the basic word dictionary 14 on the main memory 12 using memory addresses as symbols, but does not symbolize the nuclear name dictionary 16 and the full name dictionary 18 on the main memory 12 in advance. In the memory-expanded basic word dictionary 14, the “name pattern” storage area shown in FIG. 3 need not be provided.
[0040]
Next, the processing engine 10 receives the input data and performs the processing up to step 44 in FIG. 4. The processing engine 10 then extracts a key word from the words included in the input data, searches the “key” item of the nuclear name dictionary 16 (see FIG. 3) for the group containing the extracted word, and symbolizes that group using memory addresses as symbols by referring to the memory-expanded basic word dictionary 14a (FIG. 5) on the main memory 12. For example, when the input data 40 shown in FIG. 4 is input, “GETRONICS” is extracted as the key word, and the group having “GETRONICS” in the key item of the nuclear name dictionary 16 is symbolized on the main memory 12 as indicated at memory address 2000 in FIG. 3 (or FIG. 5). Here, the processing engine 10 associates each line of the nuclear name dictionary 16 in FIG. 3 with the corresponding row at memory address 2000 using any conventional technique. Therefore, the memory addresses “500” and “501” need not be stored.
[0041]
The processing engine 10 then performs processing similar to step 46 in FIG. 4. However, the processing engine 10 identifies the matched row, that is, the first row at memory address 2000 in the example shown in FIGS. 4 and 5, and extracts the code “#GETRO#” (see FIG. 3) in the first line of the nuclear name dictionary 16 associated with that row.
[0042]
The processing engine 10 symbolizes the group having “#GETRO#” in the key item of the full name dictionary 18, as indicated at memory address 8000 on the main memory 12 in FIG. 3 (or FIG. 5). However, “500” need not be stored. Next, the processing engine 10 performs processing similar to step 48 in FIG. 4. When the memory address “500” is not used, the processing engine 10 identifies, among the rows at memory address 8000, the row that includes a memory address from the input data not yet processed in the preceding steps, in this example “100”, and converts it to the target customer code “123-45678”. In this modification, the conversion processing speed is slower than in the embodiment described above, but the required capacity of the main memory 12 can be smaller.
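The modification can be sketched as on-demand symbolization: only the basic word dictionary is held as addresses up front, and the name dictionary rows are converted to addresses only for the key word found in the input. The sketch below is a simplified illustration; the function name `convert_lazily` is hypothetical, integers simulate memory addresses, the association between symbolized rows and their code strings is kept by simply carrying the code string in each row (the description leaves this to any conventional technique), and only the “AKASAKA” row of the full name dictionary is reproduced.

```python
# Basic word dictionary 14, symbolized in advance (integers simulate addresses).
BASIC_WORD_ADDRESS = {
    "AKASAKA": 100, "BANK": 101, "FOODS": 106,
    "GETRONICS": 107, "OSAKA": 111, "SHOKUHIN": 112,
}

# Name dictionaries kept as strings until needed: (key, combined word, code).
NUCLEAR_NAME_ROWS = [
    ("GETRONICS", "FOODS", "#GETRO#"),
    ("GETRONICS", "SHOKUHIN", "#GETRO#"),
    ("GETRONICS", "BANK", "#GETROBK#"),
]
FULL_NAME_ROWS = [("#GETRO#", "AKASAKA", "123-45678")]  # AKASAKA row only


def convert_lazily(words):
    """Convert a list of words to a customer code, symbolizing rows on demand."""
    # Steps 42/44: keep the addresses of the words found in the basic word dictionary.
    addresses = [BASIC_WORD_ADDRESS[w] for w in words if w in BASIC_WORD_ADDRESS]
    keys = {key for key, _, _ in NUCLEAR_NAME_ROWS}
    for key_word in words:
        if key_word not in keys:
            continue
        key_addr = BASIC_WORD_ADDRESS[key_word]
        # Symbolize only the nuclear name rows whose key is the extracted key word.
        group = [(key_addr, BASIC_WORD_ADDRESS[other], code)
                 for key, other, code in NUCLEAR_NAME_ROWS if key == key_word]
        for _, other_addr, code in group:
            if other_addr not in addresses:
                continue
            remaining = [a for a in addresses if a not in (key_addr, other_addr)]
            # Symbolize only the full name rows keyed by the extracted code string.
            full_group = [(BASIC_WORD_ADDRESS[other], customer)
                          for fkey, other, customer in FULL_NAME_ROWS if fkey == code]
            for addr, customer in full_group:
                if addr in remaining:
                    return customer
    return None
```

The trade-off matches the one stated for the modification: each conversion repeats some symbolization work, but no address tables for the name dictionaries have to stay resident in memory.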
[0043]
[Effect of the Invention]
Since the present invention is configured and operates as described above, the search time can be greatly reduced by eliminating the 1-byte-unit search processing required in conventional character string comparison. A lexical phrase including a plurality of words can therefore be converted at high speed into other data including one piece of information specified by the plurality of words, enabling processing by a computer that receives that data as input.
[Brief description of the drawings]
FIG. 1 is a diagram showing a basic configuration of an apparatus for converting a lexical word into data according to a preferred embodiment of the present invention.
FIG. 2 is a diagram showing a state in which the words registered in advance in the basic word dictionary 14 of FIG. 1 are symbolized on the main memory 12 using memory addresses as symbols, that is, expanded in memory.
FIG. 3 is a diagram showing a state in which the nuclear names and full names registered in advance in the nuclear name dictionary 16 and the full name dictionary 18 of FIG. 1 are symbolized on the main memory 12 using memory addresses as symbols, that is, expanded in memory.
FIG. 4 is a part of a diagram for explaining a process in which data input to the conversion device shown in FIG. 1 is converted.
FIG. 5 is a part of a diagram for explaining a process in which data input to the conversion device shown in FIG. 1 is converted. The memory expansion in FIG. 5 is the same as that shown in FIG. 3, but, to make the explanation easier to follow, the memory expansion of all the basic words in the basic word dictionary 14 shown in FIG. 2 is shown.
FIG. 6 is a diagram for explaining the difference between word comparison by symbolization of the present invention and conventional character string comparison.
[Explanation of symbols]
10 Processing engine
12 Main memory
14 Basic word dictionary
16 Nuclear name dictionary
18 Full name dictionary

Claims

An apparatus for converting a lexical phrase including a plurality of words into other data including identification information set in advance for the object represented by the plurality of words, the apparatus comprising:
storage means for storing a basic word dictionary in which a plurality of words are registered in advance, and a name dictionary in which a key word selected as a key from among the words registered in the basic word dictionary, a set of words in which the key word is combined with other words, and the identification information set in advance for the object represented by the set of words are registered in advance in association with one another; and
a processing engine that converts the lexical phrase including the plurality of words into other data including the identification information,
wherein the processing engine:
stores the basic word dictionary in a memory;
decomposes the lexical phrase to be converted into words;
acquires the memory address at which each of the decomposed words is stored in the basic word dictionary stored in the memory, thereby acquiring a combination of memory addresses representing the lexical phrase to be converted;
extracts, from among the decomposed words, a word registered as the key word in the name dictionary;
for each set of words registered in the name dictionary that includes at least the extracted word, acquires the memory address at which each word constituting the set is stored in the basic word dictionary stored in the memory, and replaces each word constituting the set with the acquired memory address, thereby storing in a memory a name dictionary in which each set of words including at least the extracted word has been converted into a combination of memory addresses representing that set of words;
selects, from among the combinations of memory addresses that are registered in the name dictionary stored in the memory and that represent sets of words including the extracted word, a combination in which each memory address is the same as one of the memory addresses in the combination of memory addresses representing the lexical phrase to be converted; and
converts the lexical phrase to be converted into other data including the identification information registered in the name dictionary in association with the selected combination of memory addresses.

The apparatus of claim 1, wherein the processing engine, for all sets of words registered in the name dictionary, acquires the memory address at which each word constituting the set is stored in the basic word dictionary stored in the memory, and replaces each word with the acquired memory address, thereby storing in a memory a name dictionary in which every set of words has been converted into a combination of memory addresses representing that set of words.

The apparatus wherein, when storing the name dictionary in the memory, the processing engine adds, to each word that is registered in the basic word dictionary stored in the memory and is registered as the key word in the name dictionary, the memory address at which the sets of words combining that word, as the key word, with other words are stored in the name dictionary stored in the memory, and extracts, from among the decomposed words, a word to which such a memory address has been added in the basic word dictionary stored in the memory as the word registered as the key word in the name dictionary.

The apparatus according to claim 1, wherein the processing engine, only for the sets of words registered in the name dictionary that include the extracted word, acquires the memory address at which each word constituting the set is stored in the basic word dictionary stored in the memory and replaces each word constituting the set with the acquired memory address, and stores in a memory, out of the name dictionary thus converted into combinations of memory addresses only for the sets of words including the extracted word, the sets of words that include the extracted word and have been converted into combinations of memory addresses, together with the identification information registered in the name dictionary in association with those sets of words.

The apparatus wherein the name dictionary stored in the storage means consists of a nuclear name dictionary, in which the key word, a set of words in which the key word is combined with another word, and a code set for the set of words are registered in advance in association with one another, and a full name dictionary, in which a code registered in the nuclear name dictionary, another word to be further combined with the set of words associated with the code, and the identification information set in advance for the object represented by the set of words obtained by further combining that other word with the set of words associated with the code are registered in advance in association with one another, and
the processing engine:
for each set of words registered in the nuclear name dictionary that includes at least the extracted word, acquires the memory address at which each word constituting the set is stored in the basic word dictionary stored in the memory, and replaces each word constituting the set with the acquired memory address, thereby storing in the memory a nuclear name dictionary in which each set of words including at least the extracted word has been converted into a combination of memory addresses representing that set of words;
for each other word registered in the full name dictionary in association with a code set for a set of words including at least the extracted word, acquires the memory address at which that other word is stored in the basic word dictionary stored in the memory, and stores in the memory a full name dictionary in which that other word has been replaced with the acquired memory address;
selects, from among the combinations of memory addresses that are registered in the nuclear name dictionary stored in the memory and that represent sets of words including the extracted word, a combination in which each memory address is the same as one of the memory addresses in the combination of memory addresses representing the lexical phrase to be converted;
extracts the code registered in the nuclear name dictionary in association with the selected combination of memory addresses;
selects, from among the memory addresses representing the other words registered in association with the extracted code in the full name dictionary stored in the memory, a memory address that is the same as one of the memory addresses remaining after the selected combination is removed from the combination of memory addresses representing the lexical phrase to be converted; and
converts the lexical phrase to be converted into other data including the identification information registered in the full name dictionary in association with the selected memory address.

The apparatus according to any one of claims 1 to 5, wherein, in the name dictionary stored in the storage means, a plurality of sets of words that are registered in the name dictionary, differ in some of their constituent words, and represent the same object are associated with the same information as the identification information.

A method for converting a lexical phrase including a plurality of words into other data including identification information set in advance for the object represented by the plurality of words,
the method being performed by a computer comprising storage means for storing a basic word dictionary in which a plurality of words are registered in advance, and a name dictionary in which a key word selected as a key from among the words registered in the basic word dictionary, a set of words in which the key word is combined with other words, and the identification information set in advance for the object represented by the set of words are registered in advance in association with one another, and comprising:
storing the basic word dictionary in a memory;
decomposing the lexical phrase to be converted into words;
acquiring the memory address at which each of the decomposed words is stored in the basic word dictionary stored in the memory, thereby acquiring a combination of memory addresses representing the lexical phrase to be converted;
extracting, from among the decomposed words, a word registered as the key word in the name dictionary;
for each set of words registered in the name dictionary that includes at least the extracted word, acquiring the memory address at which each word constituting the set is stored in the basic word dictionary stored in the memory, and replacing each word constituting the set with the acquired memory address, thereby storing in a memory a name dictionary in which each set of words including at least the extracted word has been converted into a combination of memory addresses representing that set of words;
selecting, from among the combinations of memory addresses that are registered in the name dictionary stored in the memory and that represent sets of words including the extracted word, a combination in which each memory address is the same as one of the memory addresses in the combination of memory addresses representing the lexical phrase to be converted; and
converting the lexical phrase to be converted into other data including the identification information registered in the name dictionary in association with the selected combination of memory addresses.