JP2848430B2

JP2848430B2 - Information extraction method

Info

Publication number: JP2848430B2
Application number: JP5217464A
Authority: JP
Inventors: 克尚柴田; 智子平野; 章子菊池
Original assignee: Hitachi Information Systems Ltd
Current assignee: Hitachi Information Systems Ltd
Priority date: 1993-09-01
Filing date: 1993-09-01
Publication date: 1999-01-20
Anticipated expiration: 2014-01-20
Also published as: JPH0773188A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to an information extraction method that extracts necessary information from a sentence, performing pattern matching quickly by first determining the classification of the information to be extracted.

[0002]

[Prior Art] One conventional method for extracting items such as bibliography lists whose entries differ in order and notation is the meaning extraction method described in Japanese Unexamined Patent Publication No. H5-20361. In this method, input records with differing description formats are first divided into words by morphological analysis (word segmentation and part-of-speech tagging). The resulting words are then grouped into sections by a section division process that uses a delimiter table prepared in advance. In the subsequent semantic analysis process, a plurality of semantic analysis routines are executed in a predetermined order on each section to determine what the phrases in the section represent, and the phrases are extracted accordingly.

Another method is described in 『ｙａｃｃ／ｌｅｘプログラムジェネレータｏｎＵＮＩＸ』 ("yacc/lex Program Generator on UNIX") by 五月女健治 (published by 哲学出版). This method builds a mechanism for extracting phrases (words and groups of words) from text using the UNIX commands yacc (a generator that produces parser programs in C) and lex (a generator that produces lexical analyzer programs in C), developed by UNIX System Laboratories. That is, a phrase extraction program is written according to the syntax rules of yacc and lex, and that program is then fed to yacc and lex to generate a phrase extraction program in C.

[0003]

[Problems to Be Solved by the Invention] However, each of the above methods has the following problems. First, the method described in Japanese Unexamined Patent Publication No. H5-20361 relies on delimiters in the sentence (for example, commas and colons) to extract phrases, so it cannot extract the required phrases from a sentence that contains no delimiters. In addition, the features used for semantic analysis of phrases, such as position within the whole, character length, and era names, must be defined in detail in advance.

Next, in the method using yacc and lex, the syntax rules of these UNIX commands are extremely complex, so only someone thoroughly familiar with them can write a program. Moreover, even if a program can be written, yacc and lex are simple program generators and are subject to various restrictions on their use, so an ordinary person cannot create a practical phrase extraction program.

An object of the present invention is to solve these conventional problems and to provide an information extraction method that allows necessary information to be extracted from a sentence even when the sentence has no delimiters and even by a person who is not well versed in programming, and that enables practical extraction of information.

[0004]

[Means for Solving the Problems] To achieve the above object, the information extraction method of the present invention extracts the phrases located between predetermined phrases in an input sentence in which a plurality of predetermined phrases are arranged at arbitrary positions, as follows.

First, two kinds of data are stored in advance: a plurality of predetermined phrases that are used to identify the classification type of an input sentence, and, for each classification type, a plurality of fixed patterns, each consisting of a sequence of predetermined phrases and information used to identify the boundaries of the one or more phrases between those predetermined phrases. From the input sentence, a phrase that matches, or best matches, a stored classification-identifying phrase is extracted to identify the classification type of the input sentence. From the stored fixed patterns of the identified classification type, the fixed pattern that matches, or best matches, the sequence of predetermined phrases in the input sentence is selected. The phrases between the predetermined phrases of the input sentence are then delimited and extracted according to the sequence of predetermined phrases and the information indicated by the selected fixed pattern.

When several fixed patterns are candidates for the best match with the sequence of predetermined phrases in the input sentence, the method determines, for each candidate in turn, whether each of its predetermined phrases exists in the other fixed patterns, counts the number of cases in which it does not, and selects the fixed pattern with the largest count as the best-matching fixed pattern.

[0005]

[Operation] In the present invention, for example, when necessary information (personal status items, 身分事項) is extracted from family register input, every sentence written in the register consists of a set of personal status items and feature data. The method therefore first determines which event type (事件種別: birth, marriage, death, and so on) the input family register data belongs to. It then matches the feature data of the description patterns of the identified event type against the feature data of the input family register data one after another, and regards the description pattern with the most matching feature data as the description pattern of that family register data. The input family register data is then expanded with reference to that description pattern, and the necessary information is extracted.

In this way, the present invention stores in advance, for each fixed pattern, where in the pattern the information to be extracted is located. When necessary information is to be extracted from a sentence written according to a fixed pattern, the method first determines which fixed pattern the sentence belongs to from how many of the feature data of each fixed pattern appear in the sentence. Because information is extracted from the sentence by matching it against the fixed pattern determined in this way, the necessary information can be extracted accurately.

In addition, because the fixed patterns are classified based on specific feature data, once the classification of the sentence has been determined, matching need only be performed against the fixed patterns in that classification. As a result, there is no need to match against all fixed patterns, and information can be extracted quickly.
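
The selection rule described above (take the description pattern whose feature data occur most often in the sentence) can be sketched as follows. This is only an illustration under stated assumptions: the feature lists are borrowed from the worked example in paragraph [0011], while the dictionary layout, the function names, and the choice to ignore the order of the feature data are assumptions of this sketch, not something the source specifies.

```python
# Feature data per description pattern for the "birth" event type.
# The lists come from the worked example in paragraph [0011]; the
# dictionary layout and all names below are assumptions of this sketch.
BIRTH_PATTERNS = {
    "pattern 1": ["で出生", "届出", "から送付入籍"],
    "pattern 2": ["で出生", "国籍保留とともに届出", "から送付入籍"],
    "pattern 3": ["で出生", "届出"],
}


def match_counts(sentence, patterns):
    """Count, for each pattern, how many of its feature data occur in the sentence."""
    return {
        name: sum(1 for feature in features if feature in sentence)
        for name, features in patterns.items()
    }


def best_patterns(sentence, patterns):
    """Return every pattern that ties for the highest match count."""
    counts = match_counts(sentence, patterns)
    top = max(counts.values())
    if top == 0:
        return []  # nothing matched; the source treats this case as an error
    return [name for name, count in counts.items() if count == top]


sentence = "平成５年３月１日東京都千代田区で出生同月２日父届出同月３日同区長から送付入籍"
print(match_counts(sentence, BIRTH_PATTERNS))
print(best_patterns(sentence, BIRTH_PATTERNS))
```

For the sentence above only one pattern reaches the top count. A shorter sentence such as 『平成５年３月５日東京都千代田区で出生同月６日父届出入籍』 leaves two patterns tied, which is the situation the description pattern identification process of FIG. 7 is meant to resolve.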

[0006]

[Embodiment] An embodiment of the present invention is described in detail below with reference to the drawings, taking as an example its application to information extraction from an original family register. FIG. 3 shows an example of a family register handled in this embodiment, and FIG. 4 shows the personal status item event type codes. As shown in FIG. 3, the text written in a family register generally records several events, and each of them records, after a date, an event concerning the person or someone related to the person. These events are sentences expressing, for example, "birth", "marriage", or "death", as shown in FIG. 4. FIG. 3 contains only "birth" and "marriage" entries. The text of a family register thus has a format that is fixed to some extent, and it consists of a plurality of event sentences, as shown below.

Family register text = "event sentence" "event sentence" "event sentence" ... "event sentence"

Each event sentence in turn consists of a set of personal status items and feature data, or of a set of personal status items alone, as shown below.

Event sentence = "personal status item" "feature data" "personal status item" "feature data" ...

Here, a personal status item is an item concerning personal status, such as a date or a place, and feature data is data for identifying the description pattern, that is, characteristic event data such as birth, marriage, or death.

[0007] Event sentences are classified into the following four patterns. Here, NL is an abbreviation of "new line" and denotes a line break.

Pattern 1 = "personal status item" "personal status item"
Pattern 2 = "personal status item" "NL"
Pattern 3 = "personal status item" "feature data"
Pattern 4 = "feature data"

For example, the description pattern data for the sentence 『平成５年３月５日東京都千代田区で出生同月６日父届出入籍』 ("Born in Chiyoda-ku, Tokyo on March 5, 1993 (Heisei 5); notification by father on the 6th of the same month; entered in register") is 『（出生日）（出生地）で出生（届出日）（届出人）届出入籍』, that is, (date of birth) (place of birth) で出生 (notification date) (notifier) 届出入籍. Applying the four patterns above to it gives the following.

Event sentence = "(personal status item) (personal status item) (feature data) (personal status item) (personal status item) (feature data)"

[0008] FIGS. 1 and 2 are operation flowcharts of an information extraction method according to an embodiment of the present invention, showing in particular the process of extracting necessary information (personal status items) from family register input. FIG. 5 is an expanded view of the internal tables, and FIG. 6 shows the data structure of the description pattern definition file.

First, as shown in FIG. 1, description pattern data is read from the description pattern definition file and registered. Feature data for identifying the event type is then created from the entry item event types and expanded into a table. This feature data consists of words such as 『出生』 (birth) and 『婚姻』 (marriage), and is hereinafter called keywords. Feature data for identifying the description pattern is also extracted from the registered description pattern data and expanded into a table (step 400).

The concept of the expanded internal tables is shown in FIG. 5. In FIG. 5, the tables are divided into 『出生』 (birth); 『婚姻届出、婚姻取消により婚姻と婚姻国籍取得』 (marriage notification, marriage following annulment of a marriage, and acquisition of nationality by marriage); 『離婚』 (divorce); and 『国籍取得帰化国籍選択国籍喪失夫国籍妻国籍』 (acquisition of nationality, naturalization, selection of nationality, loss of nationality, husband's nationality, wife's nationality).

As shown in FIG. 6, the description pattern data stored in the description pattern definition file is divided into a plurality of patterns for each event such as birth or marriage, and all the description pattern data needed to expand the personal status items is stored. FIG. 6 shows the following three patterns, in which で出生 means "born at", 届出 means "notification", and 入籍 means "entered in register".

Birth 1 = (date of birth) (place of birth) で出生 (notification date) (notifier) 届出入籍
Birth 2 = (date of birth) (place of birth) で出生 (notification date) (notifier) 届出 (date received) (accepting official) 入籍
Birth 3 = (date of birth) (place of birth) で出生 (notification date) (notifier) (special note) 届出入籍

[0009] In the flow of FIG. 1, family register data is input next (step 401). If multiple family register data records have already been input, the next record to be processed is taken out. It is determined whether all family register data has been input and processed, and processing continues until all of it has been input (step 402).

When family register data has been input, matching is performed between the description pattern data stored in the description pattern data definition file and the family register data input from the family register data file. Before this matching, the entry item event type is identified (step 403). The family register data file stores the data of the personal status item column of the family register as one record per sentence. Because the total number of registered description patterns is enormous, matching against all of them would take too long. Therefore, to avoid matching against all registered description pattern data, the event type is identified before the matching process.

The event type is identified using keywords prepared in advance for identifying the entry item event type, together with the description pattern identification process described below. If the event type is not determined uniquely at this point (4030), matching against the feature data of the description patterns is performed for each candidate event type (4031). In this case, there is no need to match against the feature data of description patterns that belong to event types other than that of the target sentence, so the description pattern can be determined uniquely.
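
The keyword-based identification of the event type (事件種別) in step 403 can be pictured with a small lookup. This is a minimal sketch, assuming one keyword per event type: the keywords 出生 (birth), 婚姻 (marriage), and 死亡 (death) are named in the source, but the table layout and the function name `candidate_event_types` are hypothetical.

```python
# Keyword table for identifying the event type.
# The three keywords appear in the source; the table layout and the
# function name are assumptions made for this sketch.
EVENT_KEYWORDS = {
    "birth": ["出生"],
    "marriage": ["婚姻"],
    "death": ["死亡"],
}


def candidate_event_types(sentence):
    """Return the event types whose keywords occur in the sentence."""
    return [
        event_type
        for event_type, keywords in EVENT_KEYWORDS.items()
        if any(keyword in sentence for keyword in keywords)
    ]


sentence = "平成５年３月５日東京都千代田区で出生同月６日父届出入籍"
print(candidate_event_types(sentence))
```

When the function returns more than one candidate, the event type is not uniquely determined, and only the description patterns of the returned candidates would need to be matched, which is where the saving over matching every registered pattern comes from.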

[0010] Once the entry item event type has been determined in this way (step 4030), the description pattern is identified next. The description pattern is identified by matching the input data against the feature data (description pattern identification data) of all description patterns of the target event type, in the order in which they were expanded in the internal table (step 404). If a single matching pattern is determined (step 405), matching ends (step 408). If not, the description pattern identification process shown in the flow of FIG. 7 is used to narrow the candidates down to one pattern (step 4050), as described later. If even this process cannot single out one pattern (406), the pattern is identified when the personal status items are expanded (step 407), as described later. If no description pattern matches, an error results.

FIG. 7 is a flowchart showing the description pattern identification process of FIG. 1. In this process (step 4050), each feature datum of every candidate description pattern is first checked in turn to see whether it exists in the feature data of the other candidate description patterns. The number of cases in which it does not exist is counted for each description pattern (step 209), and the description pattern with the largest count is taken as the target for expanding the personal status items.

This process applies when matching the feature data of the sentence against the fixed patterns fails to single out the fixed pattern with the most matching feature data, that is, when several fixed patterns tie for the most matching feature data. In that case, among the fixed patterns that remain, it is determined for each fixed pattern in turn whether its feature data exists in the feature data of the other fixed patterns, the number of cases in which it does not is counted for each fixed pattern, and the fixed pattern with the largest count is regarded as the fixed pattern of the sentence.

The rationale is as follows. A sentence often cannot be assigned to a single fixed pattern when a term in it is absent from the feature data of the prepared fixed patterns, that is, when the term is judged to be a null character. It is therefore considered best to search for the fixed pattern that has the most feature data that is not present in the feature data of the other fixed patterns and would thus be judged to be null characters.

[0011] For example, consider the processing when the following three description patterns are the candidates.

Description pattern 1: 『で出生』 ("born at"), 『届出』 ("notification"), 『から送付入籍』 ("sent from ..., entered in register")
Description pattern 2: 『で出生』, 『国籍保留とともに届出』 ("notification together with reservation of nationality"), 『から送付入籍』
Description pattern 3: 『で出生』, 『届出』

First, the number of cases in which the feature data of description pattern 1 (『で出生』, 『届出』, 『から送付入籍』) does not exist in the feature data of the other description patterns (2 and 3) is obtained.

『で出生』 exists in both description patterns 2 and 3, so the count is not incremented.
『届出』 exists in both description patterns 2 and 3, so the count is not incremented.
『から送付入籍』 exists in description pattern 2 but not in description pattern 3, so the count is incremented by 1.

The count for description pattern 1 is therefore 1. Next, the number of cases in which the feature data of description pattern 2 (『で出生』, 『国籍保留とともに届出』, 『から送付入籍』) does not exist in the feature data of the other description patterns (1 and 3) is obtained.

『で出生』 exists in both description patterns 1 and 3, so the count is not incremented.
『国籍保留とともに届出』 exists in neither description pattern 1 nor 3, so the count is incremented by 2.
『から送付入籍』 exists in description pattern 1 but not in description pattern 3, so the count is incremented by 1.

The count for description pattern 2 is therefore 3. Finally, the number of cases in which the feature data of description pattern 3 (『で出生』, 『届出』) does not exist in the feature data of the other description patterns (1 and 2) is obtained.

『で出生』 exists in both description patterns 1 and 2, so the count is not incremented.
『届出』 exists in both description patterns 1 and 2, so the count is not incremented.

The count for description pattern 3 is therefore 0.
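
The counts in this worked example can be reproduced with a short sketch. One point is an interpretation rather than something the source states: because 『届出』 is treated as present in description pattern 2, whose own feature data is 『国籍保留とともに届出』, the sketch assumes that "exists in another pattern" means "is contained in one of that pattern's feature data". The function and variable names are likewise hypothetical.

```python
# Candidate description patterns and their feature data, as listed in the
# worked example. Names and data layout are assumptions of this sketch.
CANDIDATES = {
    "pattern 1": ["で出生", "届出", "から送付入籍"],
    "pattern 2": ["で出生", "国籍保留とともに届出", "から送付入籍"],
    "pattern 3": ["で出生", "届出"],
}


def exists_in(feature, other_features):
    """Assumed test: the feature is contained in some feature datum of the other pattern."""
    return any(feature in other for other in other_features)


def absence_counts(patterns):
    """For each pattern, count (feature, other pattern) pairs where the feature is absent."""
    counts = {}
    for name, features in patterns.items():
        count = 0
        for feature in features:
            for other_name, other_features in patterns.items():
                if other_name == name:
                    continue  # a pattern is never compared with itself
                if not exists_in(feature, other_features):
                    count += 1
        counts[name] = count
    return counts


counts = absence_counts(CANDIDATES)
print(counts)
print(max(counts, key=counts.get))
```

Under that containment assumption the sketch yields 1, 3, and 0 for description patterns 1, 2, and 3, matching the counts given in the text. With strict equality instead of containment, 『届出』 would count as absent from description pattern 2 and the figures would differ.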

[0012] As a result of the check, description pattern 2 has the largest count, so description pattern 2 is taken as the target for expanding the personal status items. In other words, this processing method selects, as the expansion target, the description pattern that has the most feature data not present in the feature data of the other description patterns.

FIG. 7 shows the flow of this description pattern identification process. Until no target patterns (1, 2, 3) remain to be compared (step 201), the counter and the mismatch area are initialized (step 202), and until no feature data of the target pattern remains (step 203), initial settings for matching are made (step 204). When matching against the comparison target patterns is finished, processing moves to the next target pattern (steps 205, 203). If the comparison target is the same pattern, processing moves to the next target pattern without counting (steps 206, 205). If the comparison target is not the same pattern, initial settings for matching are made (steps 206, 207), and if the matching result is a null character (a meaningless character), the count is incremented (steps 208, 209). If the feature data exists, the same-pattern check is performed and processing moves on to matching (steps 210, 205, 206, 207).

When the feature data to be compared in the target pattern is exhausted (step 203), the value of the mismatch count area is compared with the counter value (step 212). If the mismatch area value is larger, processing moves on to comparison with the next target pattern. If the counter value is larger, the counter value is written to the mismatch area and the description pattern number is stored (step 213), after which processing moves on to comparison with the next target pattern. When no target patterns remain to be compared (step 201), the process ends.

[0013] Returning to FIGS. 1 and 2: after the description pattern identification process of FIG. 7 has been performed (step 4050), the pattern has been uniquely determined (step 406), and matching has ended (step 408), the personal status items are expanded using the identified description pattern data (step 409). An example of the expansion follows.

[Example of personal status item expansion]

Input data: 『平成５年３月１日東京都千代田区で出生同月２日父届出同月３日同区長から送付入籍』 ("Born in Chiyoda-ku, Tokyo on March 1, 1993 (Heisei 5); notification by father on the 2nd of the same month; sent from the head of the same ward on the 3rd of the same month; entered in register")

Description pattern: (date of birth) (place of birth) で出生 (notification date) (notifier) 届出 (date received) (accepting official) から送付入籍

The input is matched against the description pattern data and the personal status items are expanded.

(a) The first piece of information in the description pattern data is "date of birth", and the next is determined to be a personal status item. Since "date of birth" is a date, the input is cut at 『日』 ("day") and 『平成５年３月１日』 is extracted.
(b) The next piece of information is "place of birth", and what follows it is determined to be a separator. The input is therefore cut at the feature data 『で出生』 and 『東京都千代田区』 (Chiyoda-ku, Tokyo) is extracted.
(c) The next piece of information is "notification date", and what follows it is determined to be a personal status item. Since "notification date" is a date, the input is cut at 『日』 and 『同月２日』 (the 2nd of the same month) is extracted.
(d) The next piece of information is "notifier", and what follows it is a separator. The input is therefore cut at the feature data 『届出』 and 『父』 (father) is extracted.
(e) The next piece of information is "date received", and what follows it is a personal status item. Since "date received" is a date, the input is cut at 『日』 and 『同月３日』 (the 3rd of the same month) is extracted.
(f) The next piece of information is "accepting official", and what follows it is a separator. The input is therefore cut at the feature data 『から送付入籍』 and 『同区長』 (head of the same ward) is extracted.
(g) The next piece of information is NL (new line). Expansion of the personal status items of the input data therefore ends here.

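The expansion walk-through in paragraph [0013] (cut a date at 『日』, cut any other item where the following feature data begins) can be sketched as below. This is an illustration only: the tuple layout of the pattern, the English field names, and the error handling are assumptions of this sketch, and the source does not say how the description pattern data is actually encoded.

```python
# Description pattern from the expansion example. Each element is either
# ("item", name, kind) for a personal status item (身分事項) or
# ("sep", literal) for feature data acting as a separator. The layout and
# the English field names are assumptions made for this sketch.
PATTERN = [
    ("item", "date_of_birth", "date"),
    ("item", "place_of_birth", "text"),
    ("sep", "で出生"),
    ("item", "notification_date", "date"),
    ("item", "notifier", "text"),
    ("sep", "届出"),
    ("item", "date_received", "date"),
    ("item", "accepting_official", "text"),
    ("sep", "から送付入籍"),
]


def expand(sentence, pattern):
    """Cut the sentence into named items by walking the pattern left to right."""
    items = {}
    pos = 0
    for index, element in enumerate(pattern):
        if element[0] == "sep":
            literal = element[1]
            if not sentence.startswith(literal, pos):
                raise ValueError(f"expected {literal!r} at position {pos}")
            pos += len(literal)
            continue
        _, name, kind = element
        following = pattern[index + 1] if index + 1 < len(pattern) else None
        if following is not None and following[0] == "sep":
            # Item followed by feature data: cut where that feature data starts.
            end = sentence.find(following[1], pos)
        elif kind == "date":
            # Date followed by another item: cut just after the first 日.
            found = sentence.find("日", pos)
            end = found + 1 if found != -1 else -1
        else:
            raise ValueError(f"no rule for delimiting item {name!r}")
        if end == -1:
            raise ValueError(f"could not delimit item {name!r}")
        items[name] = sentence[pos:end]
        pos = end
    if pos != len(sentence):
        raise ValueError("text remains after the last pattern element")
    return items


sentence = "平成５年３月１日東京都千代田区で出生同月２日父届出同月３日同区長から送付入籍"
for name, value in expand(sentence, PATTERN).items():
    print(name, value)
```

The sketch raises an error when a separator is missing, which loosely corresponds to a sentence that does not fit the selected description pattern. It handles only the two delimiting rules that appear in the example.
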
【００１４】図１、図２に戻って、身分事項の展開（ス
テップ４０９）に続く処理について述べる。ＮＬまで
展開したならば（ステップ４１０）、身分事項デ−タを
ワークエリアにセットする（ステップ４１１）。ワークエリアにセットされた身分事項デ−タと、記載
パターンデ−タを用いて文章を復元し、入力データとの
整合性をチェックする（ステップ４１２）。不整合が生
じなければ（ステップ４１３）、ワークエリアの身分事
項デ−タをセーブエリアにセットする（ステップ４１
４）。一方、不整合があった場合には、例外記載パター
ンとなり（ステップ４１５）、印刷して出力される（ス
テップ４１５０）。セーブエリアにセットされている身分事項データのう
ち、省略されているものを、決められたル−ルに従って
追加することにより、身分事項データを復元する（ステ
ップ４１６）。例えば、『同月』を『３月』にする等が
これに該当する。以上の手順により、身分事項の展開が
終了したならば（ステップ４１７）、セーブエリアに展
開された身分事項の内容と別ファイル（戸籍簿上の身分
事項以外の情報を格納してあるファイル）の内容との整
合性をチェックする（ステップ４１８）。チェック項目
としては、例えば、父親等の名前、日付の前後関係、元
号に対する年号（西歴との対応）等である。最後に、抽
出されてセーブエリアに展開された身分事項情報を、身
分事項ファイルに格納する（ステップ４１９）。なお、
身分事項への格納は、整合性チェックの結果に関係なく
行われる。また、身分事項デ−タの復元（ステップ４１
６）あるいは整合性のチェック（ステップ４１２）等の
処理は、戸籍デ−タに特有の処理であるため、入力デ−
タが戸籍デ−タ以外の場合には省略することができる。Returning to FIGS. 1 and 2, the processing that follows the expansion of the identity items (step 409) will be described. When the data has been expanded up to NL (step 410), the identity item data is set in the work area (step 411). The sentence is reconstructed from the identity item data set in the work area and the description pattern data, and its consistency with the input data is checked (step 412). If no inconsistency is found (step 413), the identity item data in the work area is set in the save area (step 414). If an inconsistency is found, the sentence is treated as an exceptional description pattern (step 415) and is printed out (step 4150). Any items omitted from the identity item data set in the save area are then filled in according to a predetermined rule to restore the identity item data (step 416); replacing "same month" with "March" is one example. When expansion of the identity item information has been completed by the above procedure (step 417), the contents of the identity item information expanded in the save area are checked for consistency with the contents of a separate file (a file storing family register information other than the identity item information) (step 418). The items checked include, for example, the names of the father and other persons, the chronological order of dates, and the year given for an era name (its correspondence with the Western calendar). Finally, the extracted identity item information expanded in the save area is stored in the identity item file (step 419). This storage is performed regardless of the result of the consistency check. In addition, processing such as restoring the identity item data (step 416) and checking consistency (step 412) is specific to family register data and can therefore be omitted when the input data is not family register data.
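
The restoration of omitted items (step 416) can be illustrated with a minimal sketch. The description does not spell out the "predetermined rule", so the sketch assumes one plausible rule: a placeholder such as "same month" is replaced with the corresponding value from the preceding date. The function name `restore_omitted` and the dictionary layout are hypothetical and are not taken from the patent.

```python
def restore_omitted(dates):
    """Fill 'same <field>' placeholders from the preceding date (assumed rule)."""
    restored = []
    previous = {}
    for date in dates:
        filled = {}
        for field, value in date.items():
            # A value such as "same month" marks an omitted item.
            if value == f"same {field}":
                if field not in previous:
                    raise ValueError(f"no earlier {field} to copy from")
                value = previous[field]
            filled[field] = value
        restored.append(filled)
        previous = filled
    return restored


dates = [
    {"year": "1990", "month": "March", "day": "3"},
    {"year": "same year", "month": "same month", "day": "10"},
]
result = restore_omitted(dates)
print(result[1])
```

In this sketch the second date inherits its year and month from the first, which mirrors the "same month" to "March" example in the text.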

【００１５】[0015]

【発明の効果】以上説明したように、本発明によれば、
（ａ）予め定められた定形パターンに基づき文から情報
を抽出するので、文中に区切り記号がなくても必要な情
報を正確に抽出することができる。また、（ｂ）定形パ
ターンは特定の特徴デ−タに基づいて分類されており、
情報を抽出すべき文がどの分類に属するかを先ず判断し
た後に、該当する分類中の定形パターンとのマッチング
処理を行うので、全ての定形パターンとのマッチング処
理を行う必要がなく、迅速な情報抽出を行うことができ
る。As described above, according to the present invention, (a) information is extracted from a sentence on the basis of predetermined fixed patterns, so the necessary information can be extracted accurately even when the sentence contains no delimiters. In addition, (b) the fixed patterns are classified on the basis of specific feature data, and matching against the fixed patterns of the relevant classification is performed only after first determining which classification the sentence to be processed belongs to. It is therefore unnecessary to match against every fixed pattern, and information can be extracted quickly.
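
The two effects above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation in the patent: the pattern store `PATTERNS`, the feature phrases, the English example sentences, and the functions `classify` and `extract` are all hypothetical, and the sketch accepts only exact matches rather than the "best match" the method also allows. Each fixed pattern is a list of predetermined phrases, and the text found between consecutive phrases is extracted as a field.

```python
import re

# Hypothetical store: classification -> feature phrase and fixed patterns.
# Each fixed pattern pairs the predetermined phrases with the field names
# for the words found between them.
PATTERNS = {
    "birth": {
        "feature": "Born",
        "patterns": [
            (["Born on ", " in ", "; reported by ", "."],
             ["date", "place", "reporter"]),
            (["Born on ", " in ", "."], ["date", "place"]),
        ],
    },
    "marriage": {
        "feature": "Married",
        "patterns": [
            (["Married to ", " on ", "."], ["spouse", "date"]),
        ],
    },
}


def classify(sentence):
    """Determine the classification from a feature phrase in the sentence."""
    for name, entry in PATTERNS.items():
        if entry["feature"] in sentence:
            return name
    return None


def extract(sentence):
    """Match only against the fixed patterns of the sentence's classification."""
    category = classify(sentence)
    if category is None:
        return None
    for phrases, fields in PATTERNS[category]["patterns"]:
        # The predetermined phrases themselves act as the separators.
        regex = "(.+?)".join(re.escape(p) for p in phrases)
        match = re.fullmatch(regex, sentence)
        if match:
            return category, dict(zip(fields, match.groups()))
    return category, None


print(extract("Born on March 3 in Tokyo; reported by father."))
print(extract("Married to Hanako on May 5."))
```

Because `extract` looks only at the patterns stored under the detected classification, the marriage patterns are never tried for a birth sentence, which is the source of the speed-up described in (b).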

[Brief description of the drawings]

【図１】本発明の一実施例として、戸籍簿の入力から必
要な情報を抽出する処理のフローチャートの一部であ
る。FIG. 1 is part of a flowchart of a process for extracting necessary information from family register input, as one embodiment of the present invention.

【図２】同じく情報抽出処理のフローチャートの他の一
部である。FIG. 2 is another part of the flowchart of the information extraction process.

【図３】図１，図２に用いられる戸籍簿の一例を示す図
である。FIG. 3 is a diagram showing an example of a family register used in FIGS. 1 and 2.

【図４】図１，図２で用いられる身分事項事件種別コー
ドの一覧を示す図である。FIG. 4 is a diagram showing a list of the identity item case type codes used in FIGS. 1 and 2.

【図５】図１，図２で用いられる内部テーブルの展開図
である。FIG. 5 is an expanded view of the internal table used in FIGS. 1 and 2.

【図６】図１，図２で用いられる記載パターンデ−タ定
義ファイルのデ−タ構造図である。FIG. 6 is a data structure diagram of the description pattern data definition file used in FIGS. 1 and 2.

【図７】図１，図２における記載パターン特定処理を示
す詳細フローチャートである。FIG. 7 is a detailed flowchart showing the description pattern identification process in FIGS. 1 and 2.

[Explanation of symbols]

０１〜３２身分事項事件種別コード 01 to 32: identity item case type codes

フロントページの続き
(56)参考文献 特開昭56−60935（ＪＰ，Ａ) 特開平２−23413（ＪＰ，Ａ) 特開平５−120269（ＪＰ，Ａ)
(58)調査した分野(Int.Cl.⁶，ＤＢ名) G06F 17/30 ＪＩＣＳＴファイル（ＪＯＩＳ)

Continuation of the front page
(56) References: JP-A-56-60935 (JP, A); JP-A-2-23413 (JP, A); JP-A-5-120269 (JP, A)
(58) Fields searched (Int. Cl.⁶, DB name): G06F 17/30; JICST file (JOIS)

Claims

(57) [Claims]

1. An information extraction method for extracting, from an input sentence in which a plurality of predetermined phrases are arranged at arbitrary positions, the words located between the predetermined phrases, the method comprising: a step of storing in advance, from among the predetermined phrases, the fixed phrases used in each particular classification type of input sentence; a step of storing in advance, for each classification type, a plurality of fixed patterns, each consisting of the arrangement of the plurality of predetermined phrases and information used to identify the boundaries of the one or more words located between the predetermined phrases; a step of identifying the classification type of the input sentence by extracting, from the input sentence, a phrase that matches or best matches one of the fixed phrases; a step of selecting, from the previously stored fixed patterns of the identified classification type, the fixed pattern that matches or best matches the arrangement of the predetermined phrases in the input sentence; and a step of separating and extracting each word located between the predetermined phrases in the input sentence, according to the arrangement of the predetermined phrases and the information represented by the fixed pattern selected in the preceding step.

2. An information extraction method for extracting, from an input sentence in which a plurality of predetermined phrases are arranged at arbitrary positions, the words located between the predetermined phrases, the method comprising: a step of storing in advance a plurality of fixed patterns, each consisting of the arrangement of the plurality of predetermined phrases and information used to identify the boundaries of the one or more words located between the predetermined phrases; a step of selecting, from the fixed patterns stored in the preceding step, the fixed pattern that matches or best matches the arrangement of the predetermined phrases in the input sentence; and a step of separating and extracting each word located between the predetermined phrases in the input sentence, according to the arrangement of the predetermined phrases and the information represented by the fixed pattern selected in the preceding step.

3. The information extraction method according to claim 1 or 2, wherein, in the step of selecting from the stored fixed patterns the fixed pattern that best matches the arrangement of the predetermined phrases in the input sentence, when there are a plurality of candidates for the best-matching fixed pattern, it is determined for each candidate fixed pattern in turn whether each predetermined phrase of that fixed pattern is present in the other fixed patterns, the number of phrases that are not present is counted, and the fixed pattern with the largest count is selected as the best-matching fixed pattern.
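
The tie-breaking rule of claim 3 can be sketched as follows. This is a minimal illustration that assumes each candidate fixed pattern is represented simply as a list of its predetermined phrases; the function name `break_tie` and the sample phrases are hypothetical. For each candidate, the sketch counts the phrases that appear in none of the other candidates and selects the candidate with the largest count.

```python
def break_tie(candidates):
    """Pick the candidate with the most phrases absent from all other candidates."""
    def unique_count(index):
        # Collect every phrase used by the other candidate patterns.
        others = set()
        for j, pattern in enumerate(candidates):
            if j != index:
                others.update(pattern)
        # Count the phrases of this candidate that no other candidate has.
        return sum(1 for phrase in candidates[index] if phrase not in others)

    best = max(range(len(candidates)), key=unique_count)
    return candidates[best]


candidates = [
    ["Born on", "in"],
    ["Born on", "in", "reported by"],
]
chosen = break_tie(candidates)
print(chosen)
```

Here the second candidate is chosen because "reported by" does not occur in the first, giving it a count of 1 against 0. If several candidates share the largest count, this sketch keeps the first of them, which is a choice the claim does not specify.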