JP3470930B2

JP3470930B2 - Natural language analysis method and device

Info

Publication number: JP3470930B2
Application number: JP19069595A
Authority: JP
Inventors: 成人岩瀬
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1995-07-26
Filing date: 1995-07-26
Publication date: 2003-11-25
Anticipated expiration: 2015-07-26
Also published as: JPH0944496A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a natural language analysis method and apparatus, and in particular to processing that extracts words of a specific semantic field from an input character string or, conversely, masks them. More specifically, it relates to a natural language analysis method and apparatus that process character strings expressing addresses.

[0002] In more detail, the invention relates to a natural language analysis method and apparatus for accurately describing addresses in which building names, wing numbers, floors, and room numbers are mixed.

[0003]

[Prior Art] A conventional method performs morphological analysis on an input natural-language sentence, recognizes particles from the character type (kanji, hiragana, katakana, alphabetic characters, digits, and so on) with reference to a word dictionary, and determines meaning without using a dictionary.

[0004] There are also character strings, such as addresses, in which no particles appear. A conventional example of analyzing character strings that include digits, such as addresses, is given in Japanese Patent Laid-Open No. 4-42354. FIG. 6 shows the configuration of that conventional address analysis system. The system shown in the figure comprises an input unit 10 for inputting a character string that includes digits, such as a lot number; a one-character acquisition unit 20 that reads the input character string one character at a time; a one-character determination unit 30 that determines the character type of each character acquired by the one-character acquisition unit 20; and a lot number data storage unit 40 that stores lot number data, including digits, according to the determined character type.
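The per-character flow of this prior-art system can be pictured with a minimal Python sketch. The function names `char_type` and `classify` and the category labels are illustrative assumptions, not terms taken from the cited publication; full-width characters are folded to ASCII with NFKC normalization before they are classified.

```python
import unicodedata

def char_type(ch: str) -> str:
    """Classify one character by script (labels are illustrative)."""
    folded = unicodedata.normalize("NFKC", ch)  # e.g. full-width '２' -> '2'
    if len(folded) != 1:
        return "other"
    if folded.isascii() and folded.isdigit():
        return "digit"
    if folded.isascii() and folded.isalpha():
        return "alpha"
    name = unicodedata.name(folded, "")
    for prefix, label in (("HIRAGANA", "hiragana"),
                          ("KATAKANA", "katakana"),
                          ("CJK UNIFIED IDEOGRAPH", "kanji")):
        if name.startswith(prefix):
            return label
    return "other"

def classify(text: str) -> list:
    """Read the string one character at a time and pair each with its type."""
    return [(ch, char_type(ch)) for ch in text]
```

A classifier of this kind sees only character types, which is why it cannot tell a room number from a digit inside a building name.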

[0005]

[Problems to Be Solved by the Invention] Of the conventional methods above, however, the method that recognizes particles cannot be used for addresses, because no particles appear in them. The method of Japanese Patent Laid-Open No. 4-42354 is effective for analysis down to the chome and lot number, where there is little ambiguity, but it has the following problems when wing numbers, floors, and room numbers are analyzed.

[0006] (1) It cannot handle building names that contain alphanumeric characters.
Example a: in a building name such as 「××ビルパート２」 ("XX Building Part 2"), the "２" is interpreted as a room number.
Example b: in 「築地２号倉庫」 ("Tsukiji No. 2 Warehouse"), 「２号」 ("No. 2") is interpreted as a room number, and the method cannot recognize that it is part of the proper name given to the building.

[0007] (2) It cannot handle the ambiguity of alphanumeric names.
Example c: the "Ｂ" in 「Ｂ１−２３」 means "basement", but it is judged to be part of the room number.
Example d: 「１２３Ｆ」 properly denotes a room number, whereas 「５Ｆ」 denotes a floor, but the two cannot be distinguished.

[0008] As these examples show, the conventional method cannot distinguish whether a string is the proper name of a building, a room number, or a floor number. The present invention has been made in view of the above, and its object is to solve these problems and to provide a natural language analysis method and apparatus capable of accurately analyzing character strings that contain no particles, such as addresses.

[0009] A further object of the present invention is to provide a natural language analysis method and apparatus capable of assigning the appropriate meaning to ambiguous alphanumeric strings such as wing numbers, floors, and room numbers.

[0010]

[Means for Solving the Problems] FIG. 1 is a diagram for explaining the principle of the present invention. The present invention is a natural language analysis method for obtaining address elements from an input address character string, comprising: a word division step (step 1), in which a morphological analysis unit performs morphological analysis on the input address character string, divides it into words using a word dictionary, and stores the result in a morphological analysis result storage unit; a post-processing step (step 2), in which, during word division, a run of consecutive alphabetic characters is regarded as a company name, given the meaning "company name", and stored in the morphological analysis result storage unit; a unit word analysis step (step 3), in which a unit word analysis unit uses unit word analysis rules to classify phrases that are read from the morphological analysis result storage unit and contain address-related unit words into the meanings of address elements; a dependency analysis step (step 5), in which a dependency analysis unit uses dependency rules to check whether phrases that are read from the morphological analysis result storage unit and contain a unit word are part of a building name, and analyzes compound words; and a digit-count/symbol analysis step (step 6), in which a digit-count/symbol analysis unit uses digit-count/symbol analysis rules to classify character strings consisting only of alphanumeric characters into the meanings of address elements, for phrases that are read from the morphological analysis result storage unit and contain no unit word.

[0011] In the digit-count/symbol analysis step of the present invention, the digit-count/symbol analysis unit determines the role of the alphanumeric characters, based on the result of the morphological analysis, from the number of alphanumeric digits or from the position at which a delimiter expressed by a symbol such as a hyphen appears.
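The text does not spell out how a delimiter position maps to a role, so the following Python sketch rests on a loudly stated assumption: in a string such as "B1-23", the part before the hyphen is read as a floor (with a leading "B" meaning basement) and the part after it as a room number. The function name `hyphen_roles`, the returned keys, and the set of hyphen-like characters are all hypothetical.

```python
import re

# Hyphen-like delimiters: ASCII hyphen, U+2010, U+2212 (minus), U+FF0D (full-width).
HYPHEN = re.compile(r"[-\u2010\u2212\uFF0D]")

def hyphen_roles(text: str):
    """Assumed reading: <floor>-<room>; returns None if the shape does not fit."""
    parts = HYPHEN.split(text)
    if len(parts) != 2:
        return None
    head, tail = parts
    m = re.fullmatch(r"(B?)(\d{1,2})", head)  # floors assumed to have at most 2 digits
    if m is None or not tail.isdigit():
        return None
    return {"basement": m.group(1) == "B", "floor": int(m.group(2)), "room": tail}
```

Under this assumption, the "B" of example c is attached to the floor rather than to the room number, which is the distinction the conventional method misses.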

[0012] FIG. 2 is a diagram showing the principle configuration of the present invention. The present invention is a natural language analysis apparatus for obtaining address elements from an input address character string, comprising: a morphological analysis unit 203 that performs morphological analysis on the input address character string, divides it into words using a word dictionary 204, regards a run of consecutive alphabetic characters as a company name and gives it the meaning "company name", and stores the result in a morphological analysis result storage unit 201; a phrase selection unit 205 that selects and reads the phrases to be analyzed from the morphological analysis result storage unit 201; a unit word analysis unit 206 that uses unit word analysis rules 209 to classify phrases that are read from the morphological analysis result storage unit 201 by the phrase selection unit 205 and contain address-related unit words into the meanings of address elements; a dependency analysis unit 207 that uses dependency rules 210 to check whether phrases that are read from the morphological analysis result storage unit 201 by the phrase selection unit 205 and contain a unit word are part of a building name, and analyzes compound words; and a digit-count/symbol analysis unit 208 that uses digit-count/symbol analysis rules 211 to classify character strings consisting only of alphanumeric characters into the meanings of address elements, for phrases that are read from the morphological analysis result storage unit 201 by the phrase selection unit 205 and contain no unit word.

[0013] The digit-count/symbol analysis unit includes means for determining the role of the alphanumeric characters, based on the result of the morphological analysis, from the number of alphanumeric digits or from the position at which a delimiter expressed by a symbol such as a hyphen appears.

[0014] As described above, to extract each kind of information correctly from data in which building names, wing numbers, floors, and room numbers are mixed, it is necessary to perform morphological analysis on the whole character string and to determine the meanings of the words that make up the input data. The present invention uses that result to judge the meaning of the alphanumeric characters.

[0015] First, the meaning of each phrase that has a unit word is determined. Next, dependency relations are analyzed to determine whether a phrase with a unit word forms part of a compound word. Finally, a phrase without a unit word can be judged from the meanings of the nouns before and after it, the meanings of the phrases with unit words, and the presence or absence of symbols such as hyphens.

[0016] Accordingly, the first problem noted above, the inability to handle building names that contain alphanumeric characters, is addressed by judging the meaning of a number from the meanings of the surrounding words. In example a, the judgment follows from the fact that 「パート」 ("Part") is a unit word that precedes a number; in example b, from the fact that 「倉庫」 ("warehouse") follows the word. The second problem, the ambiguity of alphanumeric names, is addressed by looking at the number of digits attached to the letter. A "floor" normally has at most two digits, whereas a "room number" may have one to four digits but most often has three or four; with this knowledge the two can be told apart.

[0017] The present invention thus determines the meaning of a phrase from the meaning assigned to each word and from the unit word, analyzes the meanings of the preceding and following character strings, and judges whether the phrase containing the unit word can form a compound word. For phrases without a unit word, it determines the meaning of the character string from the number of alphanumeric digits and the kind of symbol, so that it can accurately judge whether an alphanumeric string is part of a building name, a "floor", or a "room number".

[0018] It can accurately be judged whether a string is a "floor" or a "room number".

[0019]

[Embodiments of the Invention] FIG. 3 shows the configuration of the address analysis system of the present invention. The system shown in the figure comprises a morphological analysis result storage unit 201, an analysis control unit 202, a morphological analysis unit 203, a word dictionary 204, a phrase selection unit 205, a unit word analysis unit 206, a dependency analysis unit 207, a digit-count/symbol analysis unit 208, unit word analysis rules 209, dependency rules 210, and digit-count/symbol analysis rules 211.

[0020] The analysis control unit 202 controls the morphological analysis result storage unit 201, the morphological analysis unit 203, and the phrase selection unit 205. The morphological analysis unit 203 refers to the word dictionary 204, divides the input natural language character string into words, and performs morphological analysis. In addition to ordinary entries, the word dictionary 204 is assumed to contain words carrying the address-related meanings "building", "wing", "floor", "room number", and so on. Using these entries, the morphological analysis unit 203 assigns the meaning "wing", "floor", or "room number" to alphanumeric characters whose unit word carries that meaning. In addition, if the divided words include a run of two or more consecutive alphabetic characters, that run is treated as a "company name".

[0021] The morphological analysis result storage unit 201 holds the results produced by the morphological analysis unit 203, and these are read out to the phrase selection unit 205 via the analysis control unit 202. The phrase selection unit 205 selects and reads the phrases to be analyzed from the morphological analysis result storage unit 201, and transfers them to the unit word analysis unit 206, the dependency analysis unit 207, and the digit-count/symbol analysis unit 208.
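The routing role of the phrase selection unit 205 can be sketched in Python as follows. The record layout (a dict with `text` and `has_unit_word` keys) and the function name `select_phrases` are assumptions made for illustration; the text does not specify how the stored analysis results are represented.

```python
def select_phrases(store, with_unit_word):
    """Return the stored phrases that do (or do not) contain a unit word."""
    return [p for p in store if p["has_unit_word"] == with_unit_word]

# Hypothetical contents of the storage unit for the input "A棟1階123".
store = [
    {"text": "A棟", "has_unit_word": True},
    {"text": "1階", "has_unit_word": True},
    {"text": "123", "has_unit_word": False},
]

for_unit_and_dependency = select_phrases(store, True)   # handled by units 206 and 207
for_digit_symbol = select_phrases(store, False)         # handled by unit 208
```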

[0022] The unit word analysis unit 206 refers to the unit word analysis rules 209 to determine the meaning of phrases that contain both a unit word and alphanumeric characters, and in doing so determines the meaning of digits and of single letters. When a digit and a single letter are adjacent and two or more candidate meanings result, the candidates are passed to the digit-count/symbol analysis unit 208, which selects among them.

[0023] The dependency analysis unit 207 refers to the dependency rules 210 and analyzes compound words through the dependency relations with the preceding and following words. That is, it checks whether the word before or after the input word has a meaning related to the input word; if a neighboring word is one that gives meaning to the input word, a dependency relation is taken to hold and that meaning is assigned to the input word.

[0024] For alphanumeric strings with no unit word attached, the digit-count/symbol analysis unit 208 refers to the digit-count/symbol analysis rules 211 and analyzes the meaning from the number of digits and the position of symbols such as hyphens. If the unit word analysis unit 206 has left several candidates, it selects one of them.

[0025] The unit word analysis rules 209 are rules for determining the meaning of phrases that contain unit words such as "building", "wing", "floor", and "room number". The dependency rules 210 are rules for judging whether a phrase with a unit word can form part of a compound word.

[0026] The digit-count/symbol analysis rules 211 judge a phrase without a unit word from the meanings of the nouns before and after it, the meanings of the phrases with unit words, and the presence or absence of symbols such as hyphens, and also assign a meaning according to the number of digits. For the digit count, the rule is, for example, that a number of up to two digits is given the meaning "floor number" and a number of three or four digits the meaning "room number".
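The digit-count part of these rules is simple enough to sketch directly in Python. The function name `classify_by_digits` and the "unknown" label for longer numbers are assumptions; the thresholds (up to two digits for a floor, three or four for a room number) come from the text.

```python
def classify_by_digits(number: str) -> str:
    """Assign a meaning to a bare number from its digit count alone."""
    if not number.isdigit():
        raise ValueError(f"not a number: {number!r}")
    if len(number) <= 2:
        return "floor"
    if len(number) <= 4:
        return "room_number"
    return "unknown"  # assumption: the text gives no rule for 5+ digits
```

For instance, `classify_by_digits("5")` yields "floor" and `classify_by_digits("123")` yields "room_number". Since one- and two-digit room numbers also exist, a rule like this is only a default that the surrounding phrases may override.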

[0027] FIG. 4 is a flowchart showing the operation of the address analysis system of the present invention.
Step 101) First, as the word division step, the morphological analysis unit 203 divides the input address character string into words with reference to the word dictionary 204 used for analysis.
Step 102) During word division, in addition to ordinary morphological analysis, the morphological analysis unit 203 performs analysis for company names using the pre-registered rule "two or more alphabetic characters are regarded as a company name": when two or more consecutive alphabetic characters are input, they are given the meaning "company name". As a result, a run of two or more letters is never regarded as a "wing", "floor", or "room number".
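The company-name rule of step 102 can be sketched in Python on an already segmented token list. The function names and the `None` placeholder for "no meaning assigned yet" are illustrative assumptions; the segmentation itself is assumed to come from the dictionary-based word division of step 101.

```python
import unicodedata

def is_latin_letters(token: str) -> bool:
    """True if the token consists only of Latin letters (full-width forms included)."""
    folded = unicodedata.normalize("NFKC", token)
    return folded.isascii() and folded.isalpha()

def tag_company_names(tokens):
    """Give runs of two or more letters the meaning 'company_name'."""
    tagged = []
    for tok in tokens:
        is_company = is_latin_letters(tok) and len(tok) >= 2
        tagged.append((tok, "company_name" if is_company else None))
    return tagged
```

Applied to the tokens `["ABC", "ビル", "B", "1", "F"]`, only "ABC" is tagged; the single letters "B" and "F" stay open for the later steps.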

[0028] Step 103) Next, as the unit word analysis step, the phrase selection unit 205 selects the phrases that contain a unit word. With reference to the unit word analysis rules 209, when a phrase contains a unit word, the meaning of its digits or single letter is determined from the meaning assigned to the words. For example, 「２号館」 ("Building No. 2") contains the unit word 「号館」, so the whole phrase is classified as "wing". 「２階」 ("2nd floor") contains 「階」, so it is classified as "floor". For 「２Ｆ」, however, two candidates are kept at this stage: the meaning "floor" and the plain alphanumeric string "２Ｆ".
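Step 103 can be pictured as a lookup from a trailing unit word to one or more candidate meanings. The Python sketch below is an assumption-laden illustration: the table `UNIT_WORD_RULES`, its labels, and the function name are hypothetical, the entries are drawn from the unit words mentioned in the text, and the input is assumed to use ASCII digits and letters.

```python
# Hypothetical rule table: unit word -> candidate meanings.
UNIT_WORD_RULES = {
    "号館": ["wing"],
    "棟": ["wing"],
    "階": ["floor"],
    "号": ["room_number"],
    "F": ["floor", "alphanumeric"],  # ambiguous: resolved later by digit count
}

def unit_word_candidates(phrase: str) -> list:
    """Return the candidate meanings for a phrase ending in a known unit word."""
    # Longest unit word first, so "号館" is tried before single-character entries.
    for unit in sorted(UNIT_WORD_RULES, key=len, reverse=True):
        if phrase.endswith(unit) and len(phrase) > len(unit):
            return list(UNIT_WORD_RULES[unit])
    return []
```

Returning a list rather than a single label keeps both readings of "2F" alive until the digit-count/symbol analysis chooses between them.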

[0029] Step 104) The dependency analysis unit 207 refers to the dependency rules 210, analyzes the dependency relations with the preceding and following words, and analyzes compound words. For example, in a case such as 『２号倉庫』 ("No. 2 Warehouse"), the following word is one that denotes a building, like 「倉庫」 (warehouse), 「団地」 (housing complex), or 「宿舎」 (dormitory), so "２号" is not a "room number"; together with the following word it takes the meaning "building". Nouns relating to position, such as 「地下」 (basement), are also analyzed.
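A minimal Python sketch of this merge is shown below. It assumes, for illustration only, that phrases arrive as `(text, meaning)` pairs, that only phrases tagged "room_number" or "company_name" are eligible for merging, and that the building-word list is the handful of words named in the text; none of these structures is specified by the source.

```python
BUILDING_WORDS = {"倉庫", "団地", "宿舎", "ビル"}
MERGEABLE = {"room_number", "company_name"}  # assumption: meanings that may be absorbed

def merge_building_names(phrases):
    """Merge a phrase with a following building word into one 'building' phrase."""
    merged = []
    i = 0
    while i < len(phrases):
        text, meaning = phrases[i]
        nxt = phrases[i + 1] if i + 1 < len(phrases) else None
        if meaning in MERGEABLE and nxt is not None and nxt[0] in BUILDING_WORDS:
            merged.append((text + nxt[0], "building"))
            i += 2  # the building word has been consumed
        else:
            merged.append((text, meaning))
            i += 1
    return merged
```

With this rule, `("2号", "room_number")` followed by `("倉庫", "building")` becomes the single phrase `("2号倉庫", "building")`.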

[0030] Step 105) Finally, as the digit-count/symbol analysis step, the meaning of the alphanumeric characters that have no unit word attached, as selected by the phrase selection unit 205, is determined. If a preceding or following phrase has a meaning that was already determined in step 102, the meaning is changed to one other than that meaning.

[0031]

[Example] An example of the present invention is described below. FIG. 5 is a diagram for explaining the operation of one example of the present invention. Examples 1 to 4 are described below, step by step.

[0032] [Example 1] Word division step (step 101): The character string 『Ａ棟１階１２３』 ("Wing A, 1st floor, 123") is input, and the morphological analysis unit 203 divides it into words as follows.

[0033] Ａ／棟／１／階／１２３
"Ａ" is a letter, 「棟」 is a unit word, "１" is a number, 「階」 is a unit word, and "１２３" is a number.
Post-processing step (step 102): Because the letter "Ａ" is a single character, the morphological analysis unit 203 does not regard it as a company name. The letter "Ａ" therefore takes the meaning "wing", and the number "１" takes the meaning "floor". The number "１２３" has no unit word after it and is simply judged to be a number.

[0034] Unit word analysis step (step 103): Next, in the unit word analysis unit 206, the phrases 「Ａ棟」 and 「１階」 contain the unit words 「棟」 and 「階」, so they are given the meanings "wing" and "floor" respectively.
Dependency analysis step (step 104): Next, the dependency analysis unit 207 judges whether the character string contains a word meaning a building. In this example no building word such as 「倉庫」, 「団地」, or 「宿舎」 is present, so processing moves on to the next step.

[0035] Digit-count/symbol analysis step (step 105): The phrase selection unit 205 finds the number "１２３", which has no unit word, so the relation of "１２３" to its neighbors is judged. With reference to the digit-count/symbol analysis rules 211, a phrase meaning "floor" immediately precedes "１２３" and the number has three digits, so it is analyzed as a "room number".

[0036] In Example 1, then, the single letter "Ａ" is followed by the unit word 「棟」, so it is not judged to be a company name and the result is 「Ａ棟」. The number "１" is followed by the unit word 「階」, so the result is 「１階」. The final number "１２３" is judged by its digit count: it has three digits and is therefore judged to be a "room number".
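Example 1 can be traced end to end with a short Python sketch. It starts from tokens that are already segmented and typed, since dictionary-based word division is outside its scope; the function name `analyze`, the token kinds, and the meaning labels are illustrative assumptions rather than terms from the source.

```python
UNIT_MEANING = {"棟": "wing", "階": "floor"}

def analyze(tokens):
    """tokens: list of (surface, kind), kind in {'alpha', 'digit', 'unit'}."""
    result = []
    i = 0
    while i < len(tokens):
        surface, kind = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Unit word analysis: a letter or number followed by a unit word.
        if kind in ("alpha", "digit") and nxt is not None and nxt[0] in UNIT_MEANING:
            result.append((surface + nxt[0], UNIT_MEANING[nxt[0]]))
            i += 2
            continue
        # Digit-count analysis: a bare 3-4 digit number right after a floor.
        if kind == "digit":
            after_floor = bool(result) and result[-1][1] == "floor"
            is_room = after_floor and 3 <= len(surface) <= 4
            result.append((surface, "room_number" if is_room else "number"))
        else:
            result.append((surface, kind))
        i += 1
    return result

tokens = [("A", "alpha"), ("棟", "unit"), ("1", "digit"), ("階", "unit"), ("123", "digit")]
parsed = analyze(tokens)
```

Here `parsed` holds the three address elements "A棟" (wing), "1階" (floor), and "123" (room number), matching the outcome described for Example 1.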

[0037] [Example 2] Word division step (step 101): The character string 『２号倉庫』 ("No. 2 Warehouse") is input, and the morphological analysis unit 203 divides it into words as follows.

[0038] ２／号／倉庫
"２" is a number, 「号」 is a unit word, and 「倉庫」 denotes a building.
Post-processing step (step 102): The character string in this example contains no alphabetic characters, so the analysis result is passed unchanged to the next step.

[0039] Unit word analysis step (step 103): Next, in the unit word analysis unit 206, because the unit word 「号」 is present, 「２号」 is analyzed as denoting a room number.
Dependency analysis step (step 104): Next, because 「２号」 is followed by 「倉庫」, a word meaning a building, the result is changed from "room number" to the warehouse name 「２号倉庫」.

[0040] Digit-count/symbol analysis step (step 105): In this example there are no alphanumeric characters without a unit word, so this step is skipped. In Example 2, then, the unit word 「号」 after the number "２" first yields the room number 「２号」, but because 「倉庫」, which denotes a building, follows, the final judgment is the building 「２号倉庫」.

[0041] [Example 3] Word division step (step 101): The character string 『ＡＢＣビル地下１階』 ("ABC Building, basement level 1") is input, and the morphological analysis unit 203 divides it into words as follows.

[0042] ＡＢＣ／ビル／地下／１／階
"ＡＢＣ" is alphabetic, 「ビル」 denotes a building, 「地下」 denotes a position, "１" is a number, and 「階」 is a unit word.
Post-processing step (step 102): The character string in this example contains the letters "ＡＢＣ"; because there are three of them, they are regarded as a company name.

[0043] Unit word analysis step (step 103): Next, in the unit word analysis unit 206, because the unit word 「階」 is present, 「１階」 is analyzed as denoting a floor number.
Dependency analysis step (step 104): Next, because "ＡＢＣ" is followed by 「ビル」, a word meaning a building, the dependency analysis unit 207 changes the company name "ＡＢＣ" into a building name, giving 「ＡＢＣビル」. Because the position word 「地下」 is also present, the result is 「ＡＢＣビル」 and 「地下１階」 ("ABC Building" and "basement level 1").

[0044] Digit-count/symbol analysis step (step 105): In this example there are no alphanumeric characters without a unit word, so this step is skipped. In Example 3, the run of two or more letters "ＡＢＣ" is first judged to be a company name; because 「ビル」, which denotes a building, follows it, it is judged to be 「ＡＢＣビル」. The position word 「地下」 comes next, and the following number "１" is followed by the unit word 「階」, so the dependency analysis unit 207 judges the remainder to be 「地下１階」.

[0045] [Example 4] Word division step (step 101): The character string 『ＡＢＣビルＢ１Ｆ』 ("ABC Building B1F") is input, and the morphological analysis unit 203 divides it into words as follows.

[0046] ＡＢＣ／ビル／Ｂ／１／Ｆ
"ＡＢＣ" is alphabetic, 「ビル」 denotes a building, "Ｂ" is a letter, "１" is a number, and "Ｆ" is both a letter and a unit word denoting a floor number.
Post-processing step (step 102): The character string in this example contains the letters "ＡＢＣ"; because there are three of them, they are regarded as a company name.

[0047] Unit word analysis step (step 103): Next, in the unit word analysis unit 206, because the alphabetic unit word "Ｆ" is present, two candidates are kept in this example: the meaning 「１階」 ("1st floor") and the string "１Ｆ".
Dependency analysis step (step 104): Next, because "ＡＢＣ" is followed by 「ビル」, a word meaning a building, the dependency analysis unit 207 changes "ＡＢＣ" from a company name to a building name, giving 「ＡＢＣビル」.

[0048] Digit-count/symbol analysis step (step 105): In Example 4, two candidates remain from the earlier processing. Because the letter "Ｂ" precedes the one-digit number "１", "Ｂ" is taken to mean "basement", and the "１Ｆ" that follows is analyzed as denoting the floor number.
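The resolution in Example 4, together with the "１２３Ｆ" versus "５Ｆ" contrast raised earlier, can be sketched in Python as a pattern check plus a digit-count test. The function name `interpret_floor`, the returned dictionary, and the two-digit threshold applied to the "F" suffix are assumptions made for illustration; the input is assumed to use ASCII letters.

```python
import re

def interpret_floor(text: str):
    """Interpret strings shaped like 'B1F', '5F' or '123F'; None if the shape differs."""
    m = re.fullmatch(r"(B?)(\d+)F", text)
    if m is None:
        return None
    basement, digits = m.groups()
    if len(digits) > 2:
        # Too many digits for a floor: treat the whole string as a room number.
        return {"meaning": "room_number", "value": text}
    return {"meaning": "floor", "basement": basement == "B", "number": int(digits)}
```

So `interpret_floor("B1F")` reads "B" as basement and "1F" as the floor, while `interpret_floor("123F")` falls back to a room number.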

[0049] As described above, the meanings of the words that make up the input data are registered in the word dictionary 204, and the knowledge needed to analyze addresses is registered in advance in the unit word analysis rule 209, the dependency analysis rule 210, and the digit number / symbol analysis rule 211. With this in place, after word segmentation the respective analysis units can determine the meaning of alphanumeric strings: they decide the meaning of each phrase (bunsetsu) that contains a unit word, analyze the dependency relations between neighboring words, and make the judgment from the meaning of the phrases containing unit words and from the presence and position of symbols such as hyphens.
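One way to picture "knowledge registered in advance" is a rule table held as data and consulted by the analysis code. The sketch below does this for a phrase that has no unit word, judging each part by its digit count and its position relative to a hyphen. The three rules shown are invented purely for illustration and are not the contents of the digit number / symbol analysis rule 211. The names `Rule`, `DIGIT_SYMBOL_RULES`, and `classify` are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rule:
    description: str
    # applies(part, position, total_parts) -> True if the rule matches
    applies: Callable[[str, int, int], bool]
    meaning: str


# Illustrative rules only (assumptions, not taken from the specification).
DIGIT_SYMBOL_RULES: List[Rule] = [
    Rule("letters before a hyphen",
         lambda p, i, n: n > 1 and i == 0 and p.isalpha(),
         "building_number"),
    Rule("one or two digits before a hyphen",
         lambda p, i, n: n > 1 and i == 0 and p.isdigit() and len(p) <= 2,
         "building_number"),
    Rule("three or four digits",
         lambda p, i, n: p.isdigit() and 3 <= len(p) <= 4,
         "room_number"),
]


def classify(phrase: str) -> List[Tuple[str, str]]:
    """Split a phrase at hyphens and tag each part with the first matching rule."""
    parts = phrase.split("-")
    result = []
    for i, part in enumerate(parts):
        meaning = next(
            (r.meaning for r in DIGIT_SYMBOL_RULES
             if r.applies(part, i, len(parts))),
            "unknown",  # no rule matched
        )
        result.append((part, meaning))
    return result
```

Because the rules live in a list rather than in the control flow, adding or changing a rule does not require touching `classify`, which is the practical benefit of registering the knowledge separately from the analysis units.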

[0050] In the above embodiment, the unit word analysis unit 206, the dependency analysis unit 207, and the digit number / symbol analysis unit 208 each perform their analysis by referring to the corresponding rule. The invention is not limited to this example, however: each analysis unit may instead have its rules built in and perform its analysis by referring to those built-in rules.

[0051] Further, the above embodiment described the processing performed when a character string relating to an address is input as natural language. The invention is not limited to this example and can also be applied to the input of other character strings that take a special form in natural language analysis: classifications and rules corresponding to such character strings are set in advance, and the analysis is performed on the basis of that correspondence.

[0052] The analysis results of the above embodiment may also be stored in storage means. Because the assignment of meanings by analysis has then already been completed, it is possible, during work such as address editing, to set certain editing criteria and process the data, for example by omitting unnecessary entries. The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

[0053]

[Effects of the Invention] As described above, the natural language analysis method and apparatus of the present invention make it possible to resolve an ambiguous alphanumeric string, which may denote a building number, a floor, or a room number, to the more appropriate meaning. Accordingly, the present invention can be applied to the following kinds of work: extracting only the building names from addresses in which building names, building numbers, floors, and room numbers are mixed, compiling them into a database, and using it to create data for tasks such as converting a building name into a formal address; creating data in which the building name is omitted and only the floor and room number are shown, so that privacy is protected while the data needed to distinguish individuals with the same family and given names is retained; and extracting only the floor and room number, with the building name omitted, so that an address fits in a limited space such as a telephone directory.
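The applications listed above all operate on address elements that have already been given meanings. A minimal sketch, assuming the analysis result is available as a list of (text, meaning) pairs with hypothetical tag names such as `building_name`:

```python
def extract_building_name(elements):
    """Return only the building-name portion of a tagged address."""
    return "".join(text for text, meaning in elements
                   if meaning == "building_name")


def mask_building_name(elements):
    """Return the address with the building name omitted."""
    return "".join(text for text, meaning in elements
                   if meaning != "building_name")


# Hypothetical analysis result for the string "ABCビルB1F".
tagged = [
    ("ABC", "building_name"),
    ("ビル", "building_name"),
    ("B", "basement"),
    ("1", "floor_number"),
    ("F", "floor_unit"),
]

building_only = extract_building_name(tagged)   # "ABCビル"
without_building = mask_building_name(tagged)   # "B1F"
```

Extraction would serve the building-name database use, and masking would serve the privacy and limited-space uses. Both are simple filters once the meanings have been assigned.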

[Brief description of drawings]

FIG. 1 is a diagram for explaining the principle of the present invention.

FIG. 2 is a diagram showing the basic configuration of the present invention.

FIG. 3 is a block diagram of the address analysis system of the present invention.

FIG. 4 is a flowchart showing the operation of the address analysis system of the present invention.

FIG. 5 is a diagram for explaining the operation of an embodiment of the present invention.

FIG. 6 is a block diagram of a conventional address analysis system.

[Explanation of symbols]

201 Morphological analysis result storage unit
202 Analysis control unit
203 Morphological analysis unit
204 Word dictionary
205 Phrase selection unit
206 Unit word analysis unit
207 Dependency analysis unit
208 Digit number / symbol analysis unit
209 Unit word analysis rule
210 Dependency analysis rule
211 Digit number / symbol analysis rule

Continuation of front page

(56) References
Transactions of the Institute of Electronics, Information and Communication Engineers, April 25, 1987, Vol. J70-D, No. 4, pp. 832-835
Proceedings of the 49th National Convention (second half of 1994) (3), September 30, 1994, pp. 3-261, 3-262

(58) Fields searched (Int. Cl.7, DB name)
G06F 17/27
G06F 17/21 550
JICST file (JOIS)

Claims

(57) [Claims]

1. A natural language analysis method for obtaining address elements from an input address character string, the method comprising:
a word segmentation step in which a morphological analysis unit performs morphological analysis on the input address character string, segments it into words using a word dictionary, and stores the result in a morphological analysis result storage unit;
a post-processing step in which the morphological analysis unit, in the case of an alphabetic string of a plurality of consecutive characters, regards the string as a company name, assigns it the meaning of a company name, and stores it in the morphological analysis result storage unit;
a unit word analysis step in which a unit word analysis unit classifies a phrase that is read from the morphological analysis result storage unit and contains a unit word relating to an address into the meaning of an address element, using a unit word analysis rule;
a dependency analysis step in which a dependency analysis unit checks, using a dependency rule, whether a phrase that is read from the morphological analysis result storage unit and contains a unit word is part of a building name, and analyzes compound words; and
a digit number / symbol analysis step in which a digit number / symbol analysis unit classifies, using a digit number / symbol analysis rule, a string of consecutive alphanumeric characters into the meaning of an address element, for a phrase that is read from the morphological analysis result storage unit and does not contain a unit word.

2. The natural language analysis method according to claim 1, wherein, in the digit number / symbol analysis step, the digit number / symbol analysis unit determines the role of the alphanumeric characters, based on the result of the morphological analysis, by using the number of digits of the alphanumeric characters or the position at which a delimiter represented by a symbol, including a hyphen, appears.

3. A natural language analysis apparatus for obtaining address elements from an input address character string, the apparatus comprising:
a morphological analysis unit that performs morphological analysis on the input address character string, segments it into words using a word dictionary, and stores the result in a morphological analysis result storage unit, and that, in the case of an alphabetic string of a plurality of consecutive characters, regards the string as a company name, assigns it the meaning of a company name, and stores it in the morphological analysis result storage unit;
a phrase selection unit that selects a phrase to be analyzed and reads it from the morphological analysis result storage unit;
a unit word analysis unit that classifies a phrase that is read by the phrase selection unit from the morphological analysis result storage unit and contains a unit word relating to an address into the meaning of an address element, using a unit word analysis rule;
a dependency analysis unit that checks, using a dependency rule, whether a phrase that is read by the phrase selection unit from the morphological analysis result storage unit and contains a unit word is part of a building name, and analyzes compound words; and
a digit number / symbol analysis unit that classifies, using a digit number / symbol analysis rule, a string of consecutive alphanumeric characters into the meaning of an address element, for a phrase that is read by the phrase selection unit from the morphological analysis result storage unit and does not contain a unit word.

4. The natural language analysis apparatus according to claim 3, wherein the digit number / symbol analysis unit includes means for determining the role of the alphanumeric characters, based on the result of the morphological analysis, by using the number of digits of the alphanumeric characters or the position at which a delimiter represented by a symbol, including a hyphen, appears.