JPH06231188A

JPH06231188A - Device for sorting name of similar data

Info

Publication number: JPH06231188A
Application number: JP5014880A
Authority: JP
Inventors: Jun Sekine; 純関根; Mitsuru Kawashita; 満川下; Masaru Nakagawa; 優中川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1993-02-01
Filing date: 1993-02-01
Publication date: 1994-08-19

Abstract

PURPOSE: To sort the names of data from the viewpoint of the similarity of data values. CONSTITUTION: Each entry of a dictionary 4 consists of a word number, the name of a word, a segmented word flag, a main word flag, and the word number (representative word number) of a more generalized word (representative word) for that word. For every word constituting a data name input from an input device 2, a dictionary check section 11 identifies the dictionary word whose name coincides with it, and outputs that word's segmented word flag and main word flag together with the name of the word whose word number equals that word's representative word number (the name of the representative word). An important word extraction section 12 extracts the segmented word and the main word from the data name, and then extracts the representative segmented word and the representative main word. A sorting processing section 13 sorts the data names, using the representative segmented word as the major sort key and the representative main word as the minor sort key, and outputs the result from the output device 3.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a similar data name classification device. When different data names that have been given to data having the same intent are to be unified during database analysis and design work, the device identifies, from the similarity of the names, that multiple data items with different names are data having the same intent.

[0002]

[Prior Art] Conventional techniques for classifying similar names include the following. (Technique 1) A technique that decomposes data names into terms and determines the similarity of the data names from the similarity of each term is already established (for example, Japanese Patent Laid-Open No. 3-41560). (Technique 2) A technique that sorts character strings, thereby grouping strings with similar leading characters, is already in practical use in commercial database management systems. Techniques for decomposing data names into terms are described, for example, on pages 27 to 34 of Takenori Makino, "Natural Language Processing," supervised by Shoichi Noguchi (Ohmsha, 1991).

[0003]

[Problems to Be Solved by the Invention] Examples of judging data names by similarity, and the problems of the prior art, are as follows. (Case 1) Classifying "current customer number" and "new customer number": both hold similar values in that each is a number relating to a customer, so they must be judged to be similar data. (Case 2) Classifying "current customer number" and "current customer address": the two hold different kinds of values, a number and an address, so they must be judged not to be similar. (Case 3) Classifying "current customer number" and "current order number": the two are numbers for different subjects, a customer and an order, so they must be judged not to be similar.

[0004] The prior art correctly judges (Case 1) to be similar. In (Case 2) and (Case 3), however, the data names are judged to be similar because they contain many identical terms, even though the data are not similar.
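To make this limitation concrete, the following Python sketch scores the three cases with a naive shared-term count. The function `shared_terms` and the English term segmentations are assumptions made for illustration; they stand in for term-matching approaches in general and are not taken from the cited prior-art publications.

```python
def shared_terms(a: list[str], b: list[str]) -> int:
    """Count the distinct terms that two segmented data names have in common."""
    return len(set(a) & set(b))


# Hypothetical segmentation of "current customer number".
base = ["current", "customer", "number"]

cases = {
    "case 1": ["new", "customer", "number"],       # should be judged similar
    "case 2": ["current", "customer", "address"],  # should be judged not similar
    "case 3": ["current", "order", "number"],      # should be judged not similar
}

scores = {label: shared_terms(base, other) for label, other in cases.items()}
print(scores)
```

Under these assumed segmentations each pair shares two of its three terms, so a measure that only counts matching terms cannot separate Case 1 from Cases 2 and 3.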

[0005] An object of the present invention is to provide a similar data name classification device that judges (Case 2) and (Case 3) above correctly and makes it possible to classify data names whose data hold similar values.

[0006]

[Means for Solving the Problems] To achieve the above object, the similar data name classification device of the present invention comprises: a dictionary in which the terms constituting data names and the types of those terms are registered; means for determining, using the dictionary, the types of the terms constituting a given data name; means for extracting only terms of high importance using the determined term types; and means for classifying similar data names based on the extracted terms of high importance.

[0007]

[Operation] The present invention identifies, among the terms appearing in a data name, the term that expresses the kind of value, such as a number or an address (called the segment word), and the term that expresses the subject the data name refers to (called the main word), and classifies data names with emphasis on these two kinds of terms. That is, given a data name and the terms constituting it as input, the device uses the dictionary to extract the segment word and the main word from those terms, converts each of the two extracted terms into a more general term, and classifies the data name based on the two resulting terms. This enables highly accurate classification of similar data names.

[0008]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings.

[0009] FIG. 1 is a block diagram of the similar data name classification device of the present invention. In the figure, 1 is the classification device main body, 2 is an input device, 3 is an output device, and 4 is a dictionary. The classification device main body 1 consists of a dictionary check unit 11, an important term extraction unit 12, and a classification processing unit 13.

[0010] FIG. 2 shows the structure of one entry in the dictionary 4. An entry consists of a term number 41, which uniquely identifies the term; a term name 42, which is the name of the term; a segment word flag 43, which indicates whether the term can be a segment word (○ = yes, × = no); a main word flag 44, which indicates whether the term can be a main word (○ = yes, × = no); and a representative term number 45, which points to the term number of a more generalized form of the term (called the representative term). A segment word, that is, a term whose segment word flag 43 is "○", is a term such as "number", "count", or "name" that comes at the right end of a data name and indicates what kind of value the data name represents. A main word, that is, a term whose main word flag 44 is "○", is a term such as "customer" or "order" that lies to the left of the segment word in a data name and indicates what the data is about.
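The entry structure of FIG. 2 can be sketched as a small Python record. This is a minimal illustration, not the patent's implementation: the class name `DictEntry`, the field names, the term numbers, and the flag values for "number" are assumptions, since the contents of FIG. 4 are not reproduced in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictEntry:
    term_no: int          # term number 41: uniquely identifies the term
    name: str             # term name 42
    can_be_segment: bool  # segment word flag 43 (True for "○", False for "×")
    can_be_main: bool     # main word flag 44 (True for "○", False for "×")
    rep_term_no: int      # representative term number 45


# Hypothetical entries; the term numbers are invented for illustration.
ENTRIES = [
    DictEntry(1, "number", True, False, 1),    # points to itself: its own representative
    DictEntry(2, "No.", True, True, 1),        # generalizes to "number"
    DictEntry(3, "customer", False, True, 3),  # main word only
]

BY_NAME = {entry.name: entry for entry in ENTRIES}
BY_NUMBER = {entry.term_no: entry for entry in ENTRIES}

# Following representative term number 45 gives the representative term.
rep_of_no = BY_NUMBER[BY_NAME["No."].rep_term_no].name
print(rep_of_no)
```

Storing the representative as a term number rather than a name means that a term which is already general simply points at its own entry.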

[0011] Next, the operation of the dictionary check unit 11, the important term extraction unit 12, and the classification processing unit 13 of the classification device main body 1 in FIG. 1 is described in detail.

[0012] The dictionary check unit 11 receives from the database analysis designer, through the input device 2, a data name, the terms constituting the data name, and the order of those terms. For each term constituting the data name, it identifies the entry in the dictionary 4 whose term name 42 matches the term. It then outputs that entry's segment word flag 43 and main word flag 44, together with the term name 42 of the entry whose term number 41 equals that entry's representative term number 45 (called the representative term name). If the representative term number 45 points to the entry's own term number 41, the entry's own term name 42 is the representative term name; if it points to the term number 41 of another term, that other term's term name 42 is the representative term name.
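The lookup performed by the dictionary check unit 11 can be sketched as follows. The names `Entry`, `Checked`, and `check_terms`, the term numbers, and the entries for "number", "application", and "new" as their own representatives are assumptions for illustration. The text does not say what happens when a term is missing from the dictionary, so this sketch simply assumes every input term is registered.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    term_no: int
    name: str
    can_be_segment: bool
    can_be_main: bool
    rep_term_no: int


class Checked(NamedTuple):
    term: str
    can_be_segment: bool
    can_be_main: bool
    rep_name: str


def check_terms(terms, dictionary):
    """Return flags and representative term name for each term, in input order."""
    by_name = {e.name: e for e in dictionary}
    by_no = {e.term_no: e for e in dictionary}
    result = []
    for term in terms:
        entry = by_name[term]           # assumption: the term is registered
        rep = by_no[entry.rep_term_no]  # may be the entry itself
        result.append(Checked(term, entry.can_be_segment, entry.can_be_main, rep.name))
    return result


# Hypothetical dictionary contents (term numbers invented).
DICTIONARY = [
    Entry(1, "number", True, False, 1),
    Entry(2, "No.", True, True, 1),
    Entry(3, "customer", False, True, 3),
    Entry(4, "application", False, True, 4),
    Entry(5, "new", False, False, 5),
]

checked = check_terms(["new", "application", "customer", "No."], DICTIONARY)
for row in checked:
    print(row)
```

The output keeps the input order of the terms, which the next stage relies on when it scans from right to left.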

[0013] The important term extraction unit 12 examines the segment word flag 43 of each term in order, from the right end of the sequence of terms in the data name toward the left, and extracts the first term found that can be a segment word. It then examines the terms to the left of that segment word, again moving leftward, and extracts the first term found that can be a main word. The important term extraction unit 12 further extracts the term name whose term number equals the representative term number of the extracted segment word candidate (called the representative segment word). Finally, it extracts the term name whose term number equals the representative term number of the extracted main word candidate (called the representative main word).
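The right-to-left scan can be sketched as below. The function name `extract_key` and the tuple layout are assumptions, and returning `None` when no segment word or main word is found is a choice made for this sketch, since the text does not cover that situation.

```python
from typing import Optional

# (term, can_be_segment, can_be_main, representative_term_name)
CheckedTerm = tuple[str, bool, bool, str]


def extract_key(checked: list[CheckedTerm]) -> tuple[Optional[str], Optional[str]]:
    """Return (representative segment word, representative main word)."""
    # Scan from the right end for the first term that can be a segment word.
    seg_idx = None
    for i in range(len(checked) - 1, -1, -1):
        if checked[i][1]:
            seg_idx = i
            break
    if seg_idx is None:
        return None, None  # assumption: no segment word means no key

    # Continue leftward from just left of the segment word for a main word.
    rep_main = None
    for i in range(seg_idx - 1, -1, -1):
        if checked[i][2]:
            rep_main = checked[i][3]
            break
    return checked[seg_idx][3], rep_main


# Flags as described in the text for "new application customer No.".
example = [
    ("new", False, False, "new"),
    ("application", False, True, "application"),
    ("customer", False, True, "customer"),
    ("No.", True, True, "number"),
]
key = extract_key(example)
print(key)
```

Because the second scan stops at the first main word candidate, "application" is never reached once "customer" has been found.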

[0014] The classification processing unit 13 sorts the data names using the representative segment word as the major classification and the representative main word as the minor classification, and outputs the result to the database analysis designer through the output device 3. The analysis designer judges data names whose representative segment words and representative main words are both equal to be similar.
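The two-level sort can be sketched with a composite sort key. The function name `classify` and the record layout are assumptions, and the representative words given for each data name are illustrative values rather than the contents of FIG. 6. Grouping equal keys is added here only to make the similar names easy to read off; the text itself describes a sort.

```python
from itertools import groupby


def classify(records):
    """Sort (data_name, rep_segment_word, rep_main_word) records and group equal keys."""
    def key(record):
        # Major key: representative segment word; minor key: representative main word.
        return (record[1], record[2])

    ordered = sorted(records, key=key)
    return [(k, [r[0] for r in group]) for k, group in groupby(ordered, key=key)]


# Hypothetical input; keys are assumed for illustration.
records = [
    ("current customer number", "number", "customer"),
    ("customer address", "address", "customer"),
    ("new customer No.", "number", "customer"),
]

groups = classify(records)
for group_key, names in groups:
    print(group_key, names)
```

Data names that land in the same group share both representative words and are the ones an analysis designer would treat as similar.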

[0015] FIG. 3 shows a specific example of the data names to be processed and the terms constituting them, and FIG. 4 shows a specific example of the contents of the dictionary 4. A specific example of the processing is described below based on these figures.

[0016] Suppose that the database analysis designer has input, through the input device 2, the data names shown in FIG. 3 and the terms constituting them. The dictionary check unit 11 matches the terms constituting the data names shown in FIG. 3 against the terms in the dictionary 4 shown in FIG. 4 to look up each term's segment word flag, main word flag, and representative term name. The result is as shown in FIG. 5. For example, the data name "new application customer No." consists of the four terms "new", "application", "customer", and "No."; "new" is a term that can be neither a segment word nor a main word, "application" and "customer" are terms that can only be main words, and "No." is a term that can be either a segment word or a main word.

[0017] Based on the information in FIG. 5, the important term extraction unit 12 finds the term that can be a segment word and the term that can be a main word, and from these obtains the representative segment word and the representative main word. FIG. 6 shows the result. For example, for the data name "new application customer No.", FIG. 5 shows that the first term found that can be a segment word, scanning from the right end of the data name, is "No.", whose representative segment word is "number". The first term found that can be a main word, scanning leftward from the term immediately to the left of "No.", is "customer", whose representative main word is likewise "customer". "Application" is a term that can be a main word, but it is not the main word in this data name.

[0018] Based on the information in FIG. 6, the classification processing unit 13 sorts the input data names using the representative segment word as the major classification and the representative main word as the minor classification. FIG. 7 shows the sorted output. FIG. 7 indicates that "current customer number" and "new customer No." are similar to each other, that "reception number" and "contract No." are similar to each other, and that no data name is similar to "customer address". This sorted result is presented to the database analysis designer through the output device 3.

[0019]

[Effects of the Invention] As described above, the similar data name classification device of the present invention makes it possible to classify data names from the viewpoint of the similarity of their data values, so that a database analysis designer can unify data names efficiently.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the similar data name classification device of the present invention.

FIG. 2 is a diagram showing the structure of the dictionary.

FIG. 3 is a specific example of data names to be processed and the terms constituting the data names.

FIG. 4 is a specific example of the dictionary.

FIG. 5 is an output example of the dictionary check unit.

FIG. 6 is an output example of the important term extraction unit.

FIG. 7 is an output example of the classification processing unit.

[Explanation of symbols]

1: classification device main body
2: input device
3: output device
4: dictionary
11: dictionary check unit
12: important term extraction unit
13: classification processing unit

Claims

[Claims]

1. A similar data name classification device comprising: a dictionary in which the terms constituting data names and the types of those terms are registered; means for determining, using the dictionary, the types of the terms constituting a given data name; means for extracting only terms of high importance using the determined term types; and means for classifying similar data names based on the extracted terms of high importance.