JP4653375B2 - Structured document generation apparatus and structured document generation program

Info

Publication number: JP4653375B2
Application number: JP2002208712A
Authority: JP
Inventors: 健太郎尾口; 円博荒木
Original assignee: Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc
Priority date: 2002-07-17
Filing date: 2002-07-17
Publication date: 2011-03-16
Anticipated expiration: 2022-07-17
Also published as: JP2004054431A

Description

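[0001]
[Technical field to which the invention belongs]
The present invention relates to a structured document generation apparatus and a structured document generation program, for example ones suitable for generating data that can be used directly for retrieval.
[0002]
[Prior art and problems to be solved by the invention]
Japanese Patent Laid-Open No. 2001-290801 proposes a document structuring system, a document structuring program, and a computer-readable storage medium (hereinafter "prior art 1") that take a text-format document as input and output a structured document by using the structure of that document.
[0003]
Prior art 1 aims, for example, to reorganize a text-format document for display as a Web page. To decompose the document structure, a document decomposition definition file is defined in advance, and the document is structured using the patterns described in that document definition file. The document structure generated in this way is superficial, consisting of items such as title, author, and date.
[0004]
However, the portion written in natural sentences as the body text is output as-is without any conversion, so it is not in a form that a computer system can use. Consequently, only the superficial information of a structured document output by prior art 1 is used in a database; the specific content, such as the body text written in natural sentences, is not.
[0005]
The present invention has been proposed to solve the above problem. Its object is to provide a structured document generation apparatus and a structured document generation program that generate, from natural sentences, a structured document composed of a semantic structure that a computer can readily recognize.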
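[0018]
[Means for solving the problems]
The invention according to claim 1 comprises: document input means for inputting a document containing natural sentences; phrase analysis means for performing phrase analysis on the document containing natural sentences input by the document input means, and outputting a phrase-analyzed document in which the phrases constituting the document are associated with part-of-speech types; concept dictionary storage means for storing concept dictionaries, each provided for a concept representing the semantic content of the phrase string, and each having (i) a part-of-speech type composition pattern, which is the composition of part-of-speech types of a phrase string containing an attribute value to be extracted and which represents the pattern corresponding to each concept, and (ii) output format information for structuring the attribute value and its corresponding tag and outputting a structured element for each concept; attribute value extraction means for collating the composition pattern of part-of-speech types of a phrase string of the phrase-analyzed document with the part-of-speech type composition patterns of the concept dictionaries stored in the concept dictionary storage means, selecting from the concept dictionary storage means the concept dictionary corresponding to the semantic content of the phrase string, and using the selected concept dictionary to extract an attribute value from the phrase string of the phrase-analyzed document that corresponds to the part-of-speech type composition pattern; and document generation means for generating, based on the attribute value extracted by the attribute value extraction means and the output format information of the selected concept dictionary, a structured document representing a structured element for the concept corresponding to the selected concept dictionary.
[0019]
The invention according to claim 1 is configured by installing the invention according to claim 3 on a computer.
[0020]
The invention according to claim 3 is a structured document generation program that causes a computer to function as: document input means for inputting a document containing natural sentences; phrase analysis means for performing phrase analysis on the document containing natural sentences input by the document input means, and outputting a phrase-analyzed document in which the phrases constituting the document are associated with part-of-speech types; concept dictionary storage means for storing concept dictionaries, each provided for a concept representing the semantic content of the phrase string, and each having (i) a part-of-speech type composition pattern, which is the composition of part-of-speech types of a phrase string containing an attribute value to be extracted and which represents the pattern corresponding to that concept, and (ii) output format information for structuring the attribute value and its corresponding tag and outputting a structured element for that concept; attribute value extraction means for collating the composition pattern of part-of-speech types of a phrase string of the phrase-analyzed document with the part-of-speech type composition patterns of the concept dictionaries stored in the concept dictionary storage means, selecting from the concept dictionary storage means the concept dictionary corresponding to the semantic content of the phrase string, and using the selected concept dictionary to extract an attribute value from the phrase string of the phrase-analyzed document that corresponds to the part-of-speech type composition pattern; and document generation means for generating, based on the attribute value extracted by the attribute value extraction means and the output format information of the selected concept dictionary, a structured document representing a structured element for the concept corresponding to the selected concept dictionary.
[0021]
The phrase analysis means performs phrase analysis on a document containing natural sentences. Here, a document containing natural sentences may consist of natural sentences only, or may be a document in a predetermined format that has natural sentences and predetermined tags. The phrase analysis means associates a part-of-speech type with each phrase constituting the document, and outputs a phrase-analyzed document in which each phrase is associated with its part-of-speech type.
[0022]
The concept dictionary storage means stores concept dictionaries, each having a part-of-speech type composition pattern and output format information. The part-of-speech type composition pattern is the composition of part-of-speech types of a phrase string that contains the attribute value to be extracted from the document, and it represents the pattern corresponding to a concept. The output format information indicates the output format of the document; it is the information used to structure the attribute value and its corresponding tag and to output a structured element for that concept.
[0023]
The attribute value extraction means collates the composition pattern of part-of-speech types of the phrase strings of the phrase-analyzed document with the part-of-speech type composition pattern of the concept dictionary, and finds in the phrase-analyzed document the phrase string that corresponds to that part-of-speech type composition pattern. It then extracts the attribute value from the phrase string it has found.
The document generation means generates a structured document in a predetermined format based on the extracted attribute value and the output format information of the concept dictionary. Even though the input was originally natural sentences, this structured document is a document in a format consisting of attribute values and tags.
[0024]
Therefore, according to the inventions of claims 1 and 3, attribute values are extracted by collating the composition pattern of part-of-speech types of a phrase string of the phrase-analyzed document with the part-of-speech type composition pattern, which represents the composition of part-of-speech types of a phrase string containing the attribute value to be extracted. A structured document representing elements in which the extracted attribute values are structured based on the output format information is then generated, so that a structured document in which meaning has been extracted from natural sentences can be obtained.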
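[0025]
The invention according to claim 2 is the invention according to claim 1, wherein the attribute value extraction means selects, from among a plurality of concept dictionaries stored in the concept dictionary storage means, a concept dictionary having a part-of-speech type composition pattern that matches the composition pattern of part-of-speech types of the phrase-analyzed document, and extracts an attribute value using the selected concept dictionary, and the document creation means generates a structured document based on the attribute value extracted by the attribute value extraction means and the output format information of the concept dictionary selected by the attribute value extraction means.
[0026]
The invention according to claim 2 is configured by installing the invention according to claim 4 on a computer.
[0027]
The invention according to claim 4 is the invention according to claim 3, wherein the attribute value extraction means selects, from among a plurality of concept dictionaries stored in the concept dictionary storage means, a concept dictionary having a part-of-speech type composition pattern that matches the composition pattern of part-of-speech types of the phrase-analyzed document, and extracts an attribute value using the selected concept dictionary, and the document creation means generates a structured document based on the attribute value extracted by the attribute value extraction means and the output format information of the concept dictionary selected by the attribute value extraction means.
[0028]
The concept dictionary storage means stores a separate concept dictionary for each of various concepts that represent the semantic content of phrase strings. The attribute value extraction means identifies the concept of the phrase-analyzed document by selecting, from among the various concept dictionaries, the concept dictionary whose part-of-speech type composition pattern matches the composition pattern of part-of-speech types of the phrase-analyzed document. It then obtains the attribute values for the identified concept by extracting attribute values using the selected concept dictionary.
[0029]
Therefore, according to the inventions of claims 2 and 4, a concept dictionary whose part-of-speech type composition pattern matches the composition pattern of part-of-speech types of the phrase-analyzed document is selected from among concept dictionaries representing the semantic content of various phrase strings, and attribute values are extracted using the selected concept dictionary. Attribute values can thus be extracted reliably whatever concept the document content expresses.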
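[0030]
[Embodiments of the invention]
Preferred embodiments of the present invention are described in detail below with reference to the drawings.
[0031]
FIG. 1 is a block diagram showing the configuration of a structured document generation apparatus according to an embodiment of the present invention.
[0032]
The structured document generation apparatus comprises a keyboard 1 for entering character information by key operation; a mouse 2, which is a pointing device; an input/output port 3 for exchanging information with the outside; a hard disk drive (HDD) 4 that stores, among other things, an application program for executing the arithmetic processing for structured document creation; a microcomputer 5 that executes the arithmetic processing for structured document creation; and an LCD (Liquid Crystal Display) 6 that displays the results computed by the microcomputer 5, for example a structured document.
[0033]
The microcomputer 5 consists of a RAM (Random Access Memory) serving as a data work area, a ROM (Read Only Memory) storing a predetermined control program, a CPU (Central Processing Unit) that executes arithmetic processing, and so on. The microcomputer 5 creates a structured document based on a text document entered from the keyboard 1 or a text document input from the outside via the input/output port 3. A structured document may also be created from a text document already recorded on a recording medium (not shown).
[0034]
FIG. 2 is a block diagram showing the functional configuration of the microcomputer 5. The microcomputer 5 comprises an XML document creation unit 10 that creates an XML document from a text document, a phrase analysis unit 20 that performs phrase analysis, and a meaning extraction unit 30 that extracts meaning and creates a structured document.
[0035]
The XML document creation unit 10 creates an XML document from a natural-sentence text document and information such as the title and creation date of that text document, and supplies the created XML document to the phrase analysis unit 20. The text document may be a document generated by operating the keyboard 1, a document input from the input/output port 3, or a document stored in advance on the HDD 4 or another storage medium.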
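[0036]
FIG. 3 is a block diagram showing the functional configuration of the phrase analysis unit 20.
[0037]
The phrase analysis unit 20 comprises a phrase dictionary 21 that shows the correspondence between words and part-of-speech types and the like; a content character string separation unit 22 that separates the structure tags of an XML document from the content character string, which is natural sentences; a specific word separation/replacement unit 23 that separates and replaces specific words; a specific word part-of-speech determination unit 24 that determines the part of speech of a specific word; a morpheme analysis unit 25 that performs morphological analysis of the content character string; an alternative word replacement unit 26 that replaces alternative words with specific words; a compound phrase generation unit 27 that generates, from a plurality of words, a phrase string that is a compound word or phrase; and a part-of-speech tag assigning unit 28 that assigns part-of-speech tags to phrases, that is, to words, compound words, and phrases.
[0038]
FIG. 4 is a diagram showing the structure of the phrase dictionary 21. The phrase dictionary 21 consists of a dictionary for words and a dictionary for phrases (compound words and phrases).
[0039]
In its word part, the phrase dictionary 21 consists of the word, the part-of-speech type, and the alternative word. As described in detail later, the alternative word is used only for specific words.
[0040]
According to FIG. 4, for example, the word 「増幅」 ("amplification") has the part-of-speech type 「名詞-サ変接続」 (noun, sa-hen connective). This means that 「増幅」 is a noun that combines with the sa-hen (suru) verb. The word 「問題」 ("problem") has the part-of-speech type 「名詞-ナイ形容詞語幹」 (noun, nai-adjective stem). This means that 「問題」 is a noun that serves as the stem of an adjective of the form 「…ない」. The word 「時」 ("time") has the part-of-speech type 「名詞-接尾-副詞可能」 (noun, suffix, adverbial). This means that 「時」 is a noun, acts as the suffix 「…時」, and can also be used adverbially. The word 「が」 has the part-of-speech type 「助詞-格助詞-一般」 (particle, case particle, general). This means that 「が」 is a general case particle. In this way, the part-of-speech type expresses not only the kind of part of speech of the corresponding word but also attributes of that word.
[0041]
Here, "word" covers not only general words such as nouns and particles but also specific words, that is, words newly defined by the phrase dictionary 21.
[0042]
For example, "2S[A-D][0-9]+" shown in FIG. 4 represents a word that begins with 2S, continues with exactly one letter from A to D, and ends with one or more digits. Words that satisfy this condition are transistor part numbers and do not appear in a general dictionary. The phrase dictionary 21 therefore defines words that satisfy this condition as specific words.
[0043]
For the word (specific word) "2S[A-D][0-9]+", the part-of-speech type is 「名詞-固有名詞-識別子-品番-トランジスタ」 (noun, proper noun, identifier, part number, transistor), and the alternative word is 「代替-トランジスタ品番」 ("alternative: transistor part number"). This means that the specific word is a proper noun, an identifier, and a transistor part number. Because the meaning of the specific word cannot be determined from the word itself, the entry also indicates that "transistor part number" is available as its alternative word.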
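The word part of the phrase dictionary described above can be sketched as a small lookup table. The following is a minimal illustration only, not the apparatus's actual data layout: the names `WORD_ENTRIES`, `SPECIFIC_WORD_ENTRIES`, and `look_up` are hypothetical, and the part-of-speech strings are written with ASCII hyphens.

```python
import re

# Ordinary word entries quoted in the description of FIG. 4: word -> POS type.
WORD_ENTRIES = {
    "増幅": "名詞-サ変接続",
    "問題": "名詞-ナイ形容詞語幹",
    "時": "名詞-接尾-副詞可能",
    "が": "助詞-格助詞-一般",
}

# Specific-word entries: (pattern, POS type, alternative word).
SPECIFIC_WORD_ENTRIES = [
    (
        re.compile(r"2S[A-D][0-9]+"),
        "名詞-固有名詞-識別子-品番-トランジスタ",
        "代替-トランジスタ品番",
    ),
]


def look_up(word):
    """Return (pos_type, alternative_word) for a word, or None if unregistered."""
    if word in WORD_ENTRIES:
        return WORD_ENTRIES[word], None  # ordinary words have no alternative word
    for pattern, pos_type, alternative in SPECIFIC_WORD_ENTRIES:
        if pattern.fullmatch(word):  # the whole word must satisfy the pattern
            return pos_type, alternative
    return None
```

`fullmatch` is used so that a string such as "2SE1", whose third character is outside A to D, is not treated as a transistor part number.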
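[0044]
In its phrase-string part (compound words and phrases), the phrase dictionary 21 consists of a "phrase string pattern" and a "part-of-speech type", and it also functions as a dictionary for creating compound words and phrases.
[0045]
The "phrase string pattern" represents the pattern that a sequence must satisfy to count as a compound word or phrase. For example, according to FIG. 4, "(名詞 and not (名詞-接尾-副詞可能) or 記号)*[2,∞]" represents a pattern in which nouns (excluding 名詞-接尾-副詞可能) or symbols (記号) occur two or more times in succession, and the "part-of-speech type" of this pattern is noun phrase (名詞句). In other words, a phrase string that satisfies this condition is handled as a noun phrase.
[0046]
FIGS. 5 and 6 are diagrams explaining a document being analyzed by the phrase analysis unit 20. The following describes the case where the XML document shown in FIG. 5(A) is input to the phrase analysis unit 20.
[0047]
When the XML document shown in FIG. 5(A) is supplied from the XML document creation unit 10, the content character string separation unit 22 separates the structure tags of the XML document from the content character string, which is natural sentences. The content character string separation unit 22 then supplies the structure tags of the XML document shown in FIG. 5(B) to the part-of-speech tag assigning unit 28, and supplies the content character string of the XML document shown in FIG. 5(C) to the specific word separation/replacement unit 23.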
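The separation step can be illustrated with Python's standard `xml.etree.ElementTree`. This is a sketch under assumptions: FIG. 5(A) is not reproduced in the text, so the input document, the root tag `<文書>`, and the function name `separate` are hypothetical; only the `<内容>` tag and its sentence are taken from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; only <内容> and its sentence come from the description.
XML_DOC = "<文書><内容>2SC7777においてA級増幅時に発熱が問題</内容></文書>"


def separate(xml_text, content_tag="内容"):
    """Split a document into its structure-tag skeleton and its content string."""
    root = ET.fromstring(xml_text)
    content = root.find(f".//{content_tag}")
    content_string = content.text or ""
    content.text = None  # keep only the structure tags in the skeleton
    skeleton = ET.tostring(root, encoding="unicode")
    return skeleton, content_string


skeleton, content_string = separate(XML_DOC)
```

The skeleton goes to the tagging stage and the content string goes to specific word handling, mirroring the two outputs of the content character string separation unit 22.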
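[0048]
The specific word separation/replacement unit 23 collates the content character string supplied from the content character string separation unit 22 with the phrase dictionary 21, replaces each specific word contained in the content character string with its alternative word, and separates out the specific word. It then supplies the separated specific word to the specific word part-of-speech determination unit 24, and supplies the content character string, in which the specific word has been replaced by the alternative word, to the morpheme analysis unit 25.
[0049]
Specifically, based on the phrase dictionary 21, the specific word separation/replacement unit 23 replaces the specific word "2SC7777" contained in the content character string with 「代替-トランジスタ品番」. It then separates out the specific word "2SC7777" shown in FIG. 5(D) and supplies it to the specific word part-of-speech determination unit 24, and supplies the replaced content character string shown in FIG. 5(E), 「（代替-トランジスタ品番）においてA級増幅時に発熱が問題」 ("in (alternative: transistor part number), heat generation is a problem during class-A amplification"), to the morpheme analysis unit 25.
[0050]
The specific word part-of-speech determination unit 24 determines the part of speech of the specific word based on the phrase dictionary 21, and supplies the specific word and its part-of-speech type information to the alternative word replacement unit 26. Specifically, it determines the part of speech of "2SC7777" and, as shown in FIG. 5(F), supplies the specific word and part-of-speech type information 「2SC7777-名詞-固有名詞-識別子-品番-トランジスタ」 to the alternative word replacement unit 26.
[0051]
The morpheme analysis unit 25 performs morphological analysis on the content character string supplied from the specific word separation/replacement unit 23 while referring to the phrase dictionary 21. Specifically, it breaks the content character string into individual words and associates each resulting word (including the stand-ins for specific words) with a part-of-speech type. The morpheme analysis unit 25 then supplies the words and their part-of-speech types shown in FIG. 5(G) to the alternative word replacement unit 26. FIG. 5(G) omits the words 「時」, 「に」, 「発熱」, 「が」, and 「問題」, but these words are morphologically analyzed in the same way and are supplied, with their part-of-speech types, to the alternative word replacement unit 26.
[0052]
Based on the information supplied from the specific word part-of-speech determination unit 24 and the morpheme analysis unit 25, the alternative word replacement unit 26 picks out the alternative words from among the morphologically analyzed words and replaces each with the original specific word. Specifically, it replaces 「代替-トランジスタ品番」 with "2SC7777". It then supplies the words and their part-of-speech types shown in FIG. 6(A) to the compound phrase generation unit 27. FIG. 6(A) omits the words 「時」, 「に」, 「発熱」, 「が」, and 「問題」; they are likewise omitted from FIG. 6(B), described below.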
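The round trip performed by units 23 and 26 (replace the specific word before morphological analysis, restore it afterwards) can be sketched as follows. This is an illustration only: the function names are hypothetical, and the token list stands in for the output of a morphological analyzer, which is not implemented here.

```python
import re

SPECIFIC = re.compile(r"2S[A-D][0-9]+")   # specific-word pattern from FIG. 4
ALTERNATIVE = "代替-トランジスタ品番"      # alternative word, ASCII hyphen assumed


def split_off_specific_words(text):
    """Replace each specific word with the alternative word; return both parts."""
    found = SPECIFIC.findall(text)
    replaced = SPECIFIC.sub(f"({ALTERNATIVE})", text)
    return replaced, found


def restore(tokens, found):
    """Put the original specific words back where the alternative word appears."""
    pending = iter(found)
    return [next(pending) if t == ALTERNATIVE else t for t in tokens]


replaced, found = split_off_specific_words("2SC7777においてA級増幅時に発熱が問題")

# Hand-written stand-in for the analyzer output on the replaced string.
tokens = [ALTERNATIVE, "において", "A", "級", "増幅", "時", "に", "発熱", "が", "問題"]
restored = restore(tokens, found)
```

Keeping the separated specific words in order lets the restore step handle several part numbers in one sentence, on the assumption that the analyzer preserves word order.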
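[0053]
The compound phrase generation unit 27 determines whether the consecutive words supplied from the alternative word replacement unit 26 contain a subsequence of words that corresponds to a compound word or phrase in the phrase dictionary 21. If they do, it replaces every such subsequence with the corresponding compound word or phrase.
[0054]
Specifically, the word subsequence 「A」, 「級」, 「増幅」 matches the noun-phrase phrase string pattern of the phrase dictionary 21 shown in FIG. 4. The compound phrase generation unit 27 therefore generates the noun phrase 「A級増幅」 ("class-A amplification") from the word subsequence 「A」, 「級」, 「増幅」, and supplies the phrases and their part-of-speech types shown in FIG. 6(B) to the part-of-speech tag assigning unit 28. When no subsequence of words corresponds to a compound word or phrase in the phrase dictionary 21, the compound phrase generation unit 27 passes the information supplied from the alternative word replacement unit 26 to the part-of-speech tag assigning unit 28 unchanged.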
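The noun-phrase rule "(名詞 and not (名詞-接尾-副詞可能) or 記号)*[2,∞]" can be sketched as a run-merging pass over (word, part-of-speech) pairs. This is a minimal illustration with hypothetical function names. The part-of-speech types given for 「A」, 「級」, 「において」, 「に」, and 「発熱」 are assumptions, since FIG. 5(G) is not reproduced in the text.

```python
def is_phrase_member(pos):
    """(noun and not noun-suffix-adverbial) or symbol."""
    noun = pos.startswith("名詞")
    excluded = pos.startswith("名詞-接尾-副詞可能")
    return (noun and not excluded) or pos.startswith("記号")


def merge_noun_phrases(tokens):
    """Replace every run of two or more member words with one noun phrase."""
    merged, run = [], []

    def flush():
        if len(run) >= 2:
            merged.append(("".join(word for word, _ in run), "名詞句"))
        else:
            merged.extend(run)  # a run of zero or one word is left as it is
        run.clear()

    for word, pos in tokens:
        if is_phrase_member(pos):
            run.append((word, pos))
        else:
            flush()
            merged.append((word, pos))
    flush()
    return merged


TOKENS = [
    ("2SC7777", "名詞-固有名詞-識別子-品番-トランジスタ"),
    ("において", "助詞-格助詞-連語"),   # assumed POS type
    ("A", "記号-アルファベット"),       # assumed POS type
    ("級", "名詞-接尾-一般"),           # assumed POS type
    ("増幅", "名詞-サ変接続"),
    ("時", "名詞-接尾-副詞可能"),
    ("に", "助詞-格助詞-一般"),         # assumed POS type
    ("発熱", "名詞-サ変接続"),          # assumed POS type
    ("が", "助詞-格助詞-一般"),
    ("問題", "名詞-ナイ形容詞語幹"),
]
phrases = merge_noun_phrases(TOKENS)
```

Because 「時」 carries the excluded type 名詞-接尾-副詞可能, it ends the run, so only 「A」, 「級」, 「増幅」 are merged.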
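[0055]
The part-of-speech tag assigning unit 28 assigns to each phrase supplied from the compound phrase generation unit 27 (including compound words and phrases) a part-of-speech tag indicating its part-of-speech type. It then embeds the phrases and their part-of-speech tags as the content of the structure tag ＜内容＞ ("content") supplied from the content character string separation unit 22. As a result, the part-of-speech tag assigning unit 28 generates and outputs a phrase-analyzed XML document, as shown in FIG. 6(C).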
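One possible shape for the phrase-analyzed XML document is sketched below. FIG. 6(C) is not reproduced in the text, so the layout is an assumption: each phrase becomes a child element of `<内容>` whose tag name is its part-of-speech type. The function name and the part-of-speech types not quoted from FIG. 4 are likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def embed_pos_tags(phrases, content_tag="内容"):
    """Wrap each phrase in an element named after its POS type (assumed layout)."""
    content = ET.Element(content_tag)
    for word, pos in phrases:
        child = ET.SubElement(content, pos)
        child.text = word
    return content


PHRASES = [
    ("2SC7777", "名詞-固有名詞-識別子-品番-トランジスタ"),
    ("において", "助詞-格助詞-連語"),   # assumed POS type
    ("A級増幅", "名詞句"),
    ("時", "名詞-接尾-副詞可能"),
    ("に", "助詞-格助詞-一般"),         # assumed POS type
    ("発熱", "名詞-サ変接続"),          # assumed POS type
    ("が", "助詞-格助詞-一般"),
    ("問題", "名詞-ナイ形容詞語幹"),
]
analyzed_xml = ET.tostring(embed_pos_tags(PHRASES), encoding="unicode")
```

With this layout the original sentence can be recovered by concatenating the element texts, while the sequence of tag names gives the part-of-speech composition pattern used later for matching.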
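[0056]
In this way, the phrase analysis unit 20 separates the structure tags and the content character string (natural sentences) of an ordinary XML document, performs morphological analysis on the content character string, assigns a part-of-speech tag to each phrase of the content character string, and embeds the phrases and part-of-speech tags back into the original XML document. That is, by performing morphological analysis on natural sentences and assigning a part-of-speech tag to each phrase that makes up those sentences, the phrase analysis unit 20 makes the composition of the natural sentences explicit and outputs a document from which meaning is easy to extract.
[0057]
In addition, newly defined specific words and their alternative words are registered in the phrase dictionary 21, so the phrase analysis unit 20 uses the alternative word in place of the specific word during morphological analysis and replaces the alternative word with the original specific word afterwards. Morphological analysis can therefore be performed accurately even on specialized or technical terms that are not found in a general dictionary.
[0058]
The phrase dictionary 21 is not limited to the structure shown in FIG. 4 and can store other words and part-of-speech types. Nor is it limited to patterns that build noun phrases from multiple nouns; it can of course also store patterns for generating other compound words and phrases, such as adjective phrases and adverb phrases.
[0059]
FIG. 7 is a block diagram showing the functional configuration of the meaning extraction unit 30. The meaning extraction unit 30 extracts meaning from the phrase-analyzed XML document and generates an XML document in which the extracted meaning is structured.
[0060]
The meaning extraction unit 30 comprises a concept dictionary 31 that stores, for each of various concepts, attribute value extraction patterns and an attribute value output format; a pattern collation unit 32 that performs pattern collation and selects the most suitable item concept dictionary from among a plurality of item concept dictionaries; an attribute value extraction unit 33 that extracts attribute values using the selected item concept dictionary; and a formatting unit 34 that formats the extracted attribute values and their corresponding tags into an XML document and outputs it.
[0061]
FIG. 8 is a block diagram showing the structure of the concept dictionary 31. The concept dictionary consists of a plurality of item concept dictionaries, one for each of various concepts (items) such as 「現状の問題」 ("current problem") and 「解決策」 ("solution"). The description here uses 「現状の問題」 as the example.
[0062]
An item concept dictionary consists of 「項目」 (item), which names the item of that item concept dictionary; 「類義語」 (synonyms), which lists synonyms of the item; 「項目を説明する属性」 (attributes describing the item); and 「ＸＭＬ出力形式」 (XML output format), which gives the XML output format for the extracted attribute values.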
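[0063]
The 「項目を説明する属性」 (attributes describing the item) consist of an 「属性名」 (attribute name) and a 「タグパターン」 (tag pattern). The attribute name indicates what kind of attribute value is to be extracted. The tag pattern represents the composition pattern of the part-of-speech tags of a phrase string that contains the attribute value, together with the position of the attribute value within that composition pattern.
[0064]
Here, 「名詞＊」 stands for any noun or noun phrase. A "$" followed by an attribute name indicates that the surface form of the phrase at that position in the tag pattern becomes the value of that attribute.
[0065]
For example, in FIG. 8, the attribute 「問題」 ("problem") indicates that an attribute value describing what kind of problem it is will be extracted from the phrase-analyzed XML document. The tag pattern of 「問題」 represents the part-of-speech tag composition pattern of a phrase string that contains the attribute value describing the problem.
[0066]
The tag pattern of 「問題」 indicates that the part-of-speech tag of the first phrase is a noun or noun phrase and that the part-of-speech tag of the next phrase may be anything, but that next phrase must be 「が」. Because 「＄問題」 appears as the element of the first part-of-speech tag, the phrase corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name 「問題」.
[0067]
Likewise, the attribute 「対象」 ("target") indicates that an attribute value describing what the target is will be extracted from the phrase-analyzed XML document. The tag pattern of 「対象」 indicates that the part-of-speech tag of the first phrase is a noun or noun phrase and that the part-of-speech tag of the next phrase may be anything, but that next phrase (phrase string) must be 「において」. Because 「＄対象」 appears as the element of the first part-of-speech tag, the phrase corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name 「対象」.
[0068]
Further, the attribute 「問題発生の状況」 ("circumstances in which the problem occurs") indicates that an attribute value describing those circumstances will be extracted from the phrase-analyzed XML document. The tag pattern of 「問題発生の状況」 indicates that the part-of-speech tag of the first phrase is a noun or noun phrase, that the part-of-speech tag of the next phrase is 「名詞-接尾-副詞可能」 (a noun that acts as a suffix and can be used adverbially), and that the part-of-speech tag of the last phrase may be anything, but that last phrase must be 「に」. Because 「＄問題発生の状況」 appears as the element of the first part-of-speech tag, the phrase corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name 「問題発生の状況」.
[0069]
The 「ＸＭＬ出力形式」 (XML output format) specifies that an XML document is to be output in which each "$" followed by an attribute name is replaced by the corresponding extracted attribute value.
[0070]
For example, the first line of the XML output format in FIG. 8 specifies that the placeholder is to be replaced with the attribute value of 「問題」 on output. The elements ＜対象＞ and ＜問題発生時＞, which are nested as content of ＜現状の問題＞, are handled in the same way: the corresponding element content is formed for each and output together with the structure tags.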
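[0071]
Using the concept dictionary 31 structured as described above, the pattern collation unit 32, the attribute value extraction unit 33, and the formatting unit 34 execute the following processing.
[0072]
The pattern collation unit 32 collates the phrase-analyzed XML document supplied from the phrase analysis unit 20 with each item concept dictionary of the concept dictionary 31 and selects the item concept dictionary that corresponds to the XML document. Specifically, it selects, from among the item concept dictionaries, the one whose tag patterns all match the part-of-speech tag composition pattern of the document exactly. The pattern collation unit 32 then supplies the 「項目」 (item) of the selected item concept dictionary to the attribute value extraction unit 33, together with the portion of the XML document that was collated with the item concept dictionary.
[0073]
In this example, the part-of-speech tag composition pattern of the document shown in FIG. 6(C) matches all the tag patterns of the item concept dictionary 「現状の問題」 shown in FIG. 8. The pattern collation unit 32 therefore supplies the item name 「現状の問題」 and the ＜内容＞ element of FIG. 6(C) to the attribute value extraction unit 33.
[0074]
If more than one item concept dictionary has all of its tag patterns exactly matching the part-of-speech tag composition pattern of the document supplied from the phrase analysis unit 20, the conditions of the tag patterns in the item concept dictionaries can be tightened further. For example, 「名詞＊」, which stands for any noun or noun phrase, may be replaced with 「名詞-固有名詞」, which stands for a proper noun, or other conditions may be narrowed.
[0075]
The attribute value extraction unit 33 reads from the concept dictionary 31 the item concept dictionary indicated by the 「項目」 supplied from the pattern collation unit 32, and searches the document supplied from the pattern collation unit 32 for phrase strings that correspond to the tag patterns of that item concept dictionary. From each such phrase string, it extracts as the attribute value the phrase at the position marked with "$" in the tag pattern.
[0076]
Specifically, the attribute value extraction unit 33 searches the ＜内容＞ element of FIG. 6(C) for the phrase strings corresponding to each of the three tag patterns shown in FIG. 8. From the phrase strings it finds, it extracts the phrases at the "$" positions of the respective tag patterns, namely 「発熱」 ("heat generation"), "2SC7777", and 「A級増幅」 ("class-A amplification"), and supplies them to the formatting unit 34 as attribute values.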
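The collation and extraction steps can be sketched together. This is a simplified illustration under stated assumptions, not the claimed implementation: a tag pattern is encoded as a list of hypothetical `(POS prefix or None, required word or None, capture flag)` steps, a pattern counts as matching if it is found anywhere in the phrase sequence, the 「解決策」 pattern is invented for contrast, and the part-of-speech types for 「において」, 「に」, and 「発熱」 are assumed.

```python
PHRASES = [
    ("2SC7777", "名詞-固有名詞-識別子-品番-トランジスタ"),
    ("において", "助詞-格助詞-連語"),   # assumed POS type
    ("A級増幅", "名詞句"),
    ("時", "名詞-接尾-副詞可能"),
    ("に", "助詞-格助詞-一般"),         # assumed POS type
    ("発熱", "名詞-サ変接続"),          # assumed POS type
    ("が", "助詞-格助詞-一般"),
    ("問題", "名詞-ナイ形容詞語幹"),
]

# "名詞" as a prefix covers both nouns and noun phrases (名詞＊).
CONCEPT_DICTIONARY = {
    "解決策": {  # hypothetical pattern, present only to show selection
        "対策": [("名詞", None, True), (None, "を", False)],
    },
    "現状の問題": {
        "問題": [("名詞", None, True), (None, "が", False)],
        "対象": [("名詞", None, True), (None, "において", False)],
        "問題発生の状況": [
            ("名詞", None, True),
            ("名詞-接尾-副詞可能", None, False),
            (None, "に", False),
        ],
    },
}


def find_match(pattern, phrases):
    """Return the first window of phrases that satisfies every step, else None."""
    n = len(pattern)
    for start in range(len(phrases) - n + 1):
        window = phrases[start:start + n]
        if all(
            (pos is None or p.startswith(pos)) and (word is None or w == word)
            for (pos, word, _), (w, p) in zip(pattern, window)
        ):
            return window
    return None


def select_item(dictionary, phrases):
    """Pick the first item whose tag patterns all match the phrase sequence."""
    for item, patterns in dictionary.items():
        if all(find_match(p, phrases) is not None for p in patterns.values()):
            return item
    return None


def extract(patterns, phrases):
    """Collect the phrase at each capture position, keyed by attribute name."""
    values = {}
    for name, pattern in patterns.items():
        window = find_match(pattern, phrases)
        if window is not None:
            for (_, _, capture), (w, _) in zip(pattern, window):
                if capture:
                    values[name] = w
    return values


item = select_item(CONCEPT_DICTIONARY, PHRASES)
attributes = extract(CONCEPT_DICTIONARY[item], PHRASES)
```

Narrowing the first step from the prefix "名詞" to "名詞-固有名詞" would be one way to express the tightening described in paragraph [0074].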
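[0077]
The formatting unit 34 formats the attribute values supplied from the attribute value extraction unit 33 according to the output format of the concept dictionary 31 to generate an XML document, and outputs this XML document to the outside.
[0078]
Specifically, the attribute value 「発熱」 describing 「問題」, the attribute value "2SC7777" describing 「対象」, and the attribute value 「A級増幅」 describing 「問題発生の状況」 are substituted for the corresponding "$" placeholders of the XML output format to form the element content, and the resulting elements are output together with the structure tags in XML document form.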
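The substitution of attribute values into the XML output format can be sketched as plain placeholder replacement. The template below is an assumed reconstruction: FIG. 8 is not reproduced in the text, and only the tags ＜現状の問題＞, ＜対象＞, and ＜問題発生時＞ and the "$" convention come from the description. The function name `fill` is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed reconstruction of the "XML output format" of FIG. 8.
OUTPUT_FORMAT = (
    "<現状の問題>$問題"
    "<対象>$対象</対象>"
    "<問題発生時>$問題発生の状況</問題発生時>"
    "</現状の問題>"
)


def fill(output_format, values):
    """Replace each $name placeholder with its XML-escaped attribute value."""
    # Longest names first, so "$問題発生の状況" is not clobbered by "$問題".
    for name in sorted(values, key=len, reverse=True):
        output_format = output_format.replace("$" + name, escape(values[name]))
    return output_format


structured = fill(
    OUTPUT_FORMAT,
    {"問題": "発熱", "対象": "2SC7777", "問題発生の状況": "A級増幅"},
)
root = ET.fromstring(structured)  # raises ParseError if the result is malformed
```

Replacing the longest attribute names first matters here because one placeholder name is a prefix of another.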
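[0079]
FIG. 9 is a diagram showing the meaning-extracted XML document output from the meaning extraction unit 30. Conventionally, a structured document could be generated from a given document, but if that document contained natural sentences, those sentences were output unchanged. In other words, a conventional structured document contained natural sentences inside the elements of its structure tags. In contrast, in the meaning-extracted XML document shown in FIG. 9, the meaning of the natural sentences that were contained in the elements of the structure tags has been extracted, and the document is one in which that extracted meaning is structured.
[0080]
As described above, the structured document generation apparatus according to this embodiment performs morphological analysis on the natural sentences contained in an XML document, assigns a part-of-speech type to each phrase, collates the composition pattern of these part-of-speech types with tag patterns that represent the part-of-speech type composition of phrase strings containing attribute values, and extracts attribute values from the matching phrase strings. By outputting the extracted attribute values according to a predetermined XML output format, it can generate an XML document in which the natural sentences are structured.
[0081]
In particular, because the structured document generation apparatus has tag patterns prepared in advance for extracting the attribute values of each of various concepts (items), it can first use the tag patterns to identify the concept of the natural sentences, that is, the item concept dictionary, and then extract from the natural sentences the attribute values that describe that concept.
[0082]
In addition, because the structured document generation apparatus has prepared in advance, for each of various concepts (items), an XML output format of the attribute values and structure tags needed to describe that concept, it can easily generate an XML document in which the extracted attribute values are structured simply by outputting them according to the XML output format.
[0083]
As shown in FIG. 9, the XML document generated by the structured document generation apparatus is a structured document in which the concept of the natural sentences is described by structure tags and attribute values. It can therefore be used directly for retrieval and makes it easy to answer queries about its content. Application examples of the XML document are described below.
[0084]
(Application example 1)
This example describes retrieving, from a database storing a large number of documents, the documents that describe "the weight of the product decreases".
[0085]
FIG. 10(A) shows a conventional natural-sentence document stored in the database, (B) shows a conventional search result, and (C) shows the XML document of the present application stored in the database. The XML document of the present application is the XML document generated by the structured document generation apparatus described above from the conventional natural-sentence document shown in (A). The documents in (A) and (C) therefore have the same content, and both indicate that the weight of the product increases.
[0086]
When documents describing "the weight of the product decreases" are retrieved by a general method, the search keywords normally used are 「製品」 ("product"), 「重量」 ("weight"), and 「減少」 ("decrease"). If the database stores conventional documents, a conventional document that merely contains the words 「製品」, 「重量」, and 「減少」 is retrieved by mistake, as shown in (B).
[0087]
If, on the other hand, the database stores the XML documents of the present application, it suffices to follow the structure tags in order. Here, the structure tag relating to "the weight of the product" is searched for first, and then, within the content of that structure tag, the structure tag indicating whether the weight of the product increased or decreased is searched for. Specifically, ＜製品の重量＞ ("weight of the product") is searched for first, and then ＜方向＞ ("direction"), which indicates the change in weight, is searched for within its content. In the XML document of the present application, the content of ＜方向＞ is 「増加」 ("increase"), so the document is not retrieved by mistake.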
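The contrast between keyword search and tag-guided search can be sketched as follows. This is an illustration under assumptions: FIG. 10 is not reproduced in the text, so both the natural sentence and the XML document below are hypothetical stand-ins. Only the tags ＜製品の重量＞ and ＜方向＞, the value 「増加」, and the three keywords come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for FIG. 10(A) and FIG. 10(C).
NATURAL_TEXT = "不良の減少を狙い補強した結果、製品の重量が増加した"
STRUCTURED = "<文書><製品の重量><方向>増加</方向></製品の重量></文書>"


def keyword_hit(text, keywords):
    """Conventional search: every keyword occurs somewhere in the text."""
    return all(keyword in text for keyword in keywords)


def weight_direction(xml_text):
    """Follow <製品の重量> and then <方向>; return its text, or None if absent."""
    node = ET.fromstring(xml_text).find(".//製品の重量/方向")
    return None if node is None else node.text


false_positive = keyword_hit(NATURAL_TEXT, ["製品", "重量", "減少"])
structured_hit = weight_direction(STRUCTURED) == "減少"
```

The keyword test reports a hit even though the sentence describes an increase, whereas the tag-guided test compares the value of ＜方向＞ and rejects the document.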
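[0088]
Therefore, even for a document that an ordinary keyword search would retrieve by mistake, the structured document generation apparatus can generate from it an XML document that is not retrieved by mistake.
[0089]
(Application example 2)
This example describes putting the query "What is the current countermeasure against the machining defect?" to a single document in the database.
[0090]
FIG. 11(A) shows a conventional natural-sentence document stored in the database, and (B) shows the XML document of the present application stored in the database. The documents in (A) and (B) have the same content.
[0091]
If the database stores the conventional document, no answer at all can be given to the query "What is the current countermeasure against the machining defect?".
[0092]
If, on the other hand, the database stores the XML document of the present application, it suffices to follow the structure tags related to the query. Here, the structure tag relating to 「対策」 ("countermeasure") is searched for first, and the content of that structure tag (its subordinate structure tags and their content) is extracted. Combining the extracted structure tags and content makes it possible to answer the query with "The thickness was increased by 2 mm".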
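Collecting the subordinate structure tags of ＜対策＞ can be sketched as below. FIG. 11(B) is not reproduced in the text, so the document structure and the child tag names are hypothetical and chosen only so that the collected pairs correspond to the answer quoted above. Turning the pairs into a natural-language sentence is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for FIG. 11(B); only <対策> is named in the description.
STRUCTURED = (
    "<文書><対策>"
    "<対象>厚さ</対象><変化量>2mm</変化量><方向>増加</方向>"
    "</対策></文書>"
)


def describe(xml_text, tag):
    """Return (tag, text) pairs for everything nested under the first <tag>."""
    node = ET.fromstring(xml_text).find(f".//{tag}")
    if node is None:
        return []
    return [
        (child.tag, (child.text or "").strip())
        for child in node.iter()
        if child is not node
    ]


answer_parts = describe(STRUCTURED, "対策")
```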
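[0093]
Therefore, by extracting and structuring the meaning of natural sentences, the structured document generation apparatus can generate an XML document that makes it easy to answer queries.
[0094]
The present invention is not limited to the embodiment described above, and various design changes can be made within the scope set out in the claims.
[0095]
For example, the concept dictionary 31 is not limited to the structure shown in FIG. 8. This embodiment has been described using the example of one item concept dictionary with three tag patterns, but as many tag patterns can be provided as there are attribute values to be extracted.
[0096]
Also, although this embodiment has been described using XML documents as the example, SGML documents, for instance, may be used instead. In that case, the 「ＸＭＬ出力形式」 (XML output format) of the item concept dictionary is simply replaced with an SGML output format. Given that the aim is to create a structured document from natural sentences, the document format is not particularly limited.
[0098]
[Effects of the invention]
The structured document generation apparatus and structured document generation program according to the present invention extract attribute values by collating the composition pattern of part-of-speech types of a phrase string of the phrase-analyzed document with the part-of-speech type composition pattern, which represents the composition of part-of-speech types of a phrase string containing the attribute value to be extracted. They then structure the extracted attribute values based on the output format information to generate a structured document, so that a structured document in which meaning has been extracted from natural sentences can be obtained.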
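[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of a structured document generation apparatus according to an embodiment of the present invention.
[FIG. 2] A block diagram showing the functional configuration of the microcomputer.
[FIG. 3] A block diagram showing the functional configuration of the phrase analysis unit.
[FIG. 4] A diagram showing the structure of the phrase dictionary.
[FIG. 5] A diagram explaining a document being analyzed by the phrase analysis unit.
[FIG. 6] A diagram explaining a document being analyzed by the phrase analysis unit.
[FIG. 7] A block diagram showing the functional configuration of the meaning extraction unit.
[FIG. 8] A block diagram showing the structure of the concept dictionary.
[FIG. 9] A diagram showing the meaning-extracted XML document output from the meaning extraction unit.
[FIG. 10] (A) shows a conventional natural-sentence document stored in the database, (B) shows a conventional search result, and (C) shows the XML document of the present application stored in the database.
[FIG. 11] (A) shows a conventional natural-sentence document stored in the database, and (B) shows the XML document of the present application stored in the database.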
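[Explanation of reference signs]
20 Phrase analysis unit
21 Phrase dictionary
22 Content character string separation unit
23 Specific word separation/replacement unit
24 Specific word part-of-speech determination unit
25 Morpheme analysis unit
26 Alternative word replacement unit
27 Compound phrase generation unit
28 Part-of-speech tag assigning unit
30 Meaning extraction unit
31 Concept dictionary
32 Pattern collation unit
33 Attribute value extraction unit
34 Formatting unit
[0001]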
BACKGROUND OF THE INVENTION
The present invention relates to a structured document generation apparatus and a structured document generation program, and more particularly to a structured document generation apparatus and a structured document generation program suitable for generating data that can be used directly for retrieval.
[0002]
[Prior art and problems to be solved by the invention]
Japanese Patent Laid-Open No. 2001-290801 proposes a document structuring system, a document structuring program, and a computer-readable storage medium (hereinafter, "prior art 1") that input a text-format document and output a structured document using its document structure.
[0003]
Prior art 1 is intended to reconstruct a text-format document for display on a Web page. In order to decompose the document structure, a document decomposition definition file is defined in advance, and the document is structured using the patterns described in the document definition file. The document structure generated in this way has a surface format such as title, creator, and date.
[0004]
However, the portion written in natural sentences as the body text is output as it is without any conversion, so it is not in a format that can be used in a computer system. For this reason, only the surface information of a structured document output by prior art 1 is used in a database, and there is a problem that specific contents, such as body text written in natural sentences, are not used in the database.
[0005]
The present invention has been proposed in order to solve the above-described problem, and an object thereof is to provide a structured document generation apparatus and a structured document generation program that generate, from natural sentences, a structured document having a semantic structure that can be easily recognized by a computer.
[0018]
[Means for Solving the Problems]
According to the first aspect of the present invention, there are provided: document input means for inputting a document including a natural sentence; phrase analysis means for executing phrase analysis on the document including the natural sentence input by the document input means, and outputting a phrase-analyzed document in which the phrases constituting the document are associated with part-of-speech types; concept dictionary storage means for storing concept dictionaries, each provided for a concept representing the semantic content of the phrase string and each having a part-of-speech type composition pattern, which is the composition of part-of-speech types of a phrase string containing an attribute value to be extracted and represents the pattern corresponding to each concept, and output format information for structuring the attribute value and the corresponding tag and outputting a structured element for each concept; attribute value extraction means for collating the composition pattern of part-of-speech types of a phrase string of the phrase-analyzed document with the part-of-speech type composition patterns of the concept dictionaries stored in the concept dictionary storage means, selecting a concept dictionary corresponding to the semantic content of the phrase string from the concept dictionary storage means, and extracting, using the selected concept dictionary, an attribute value from the phrase string of the phrase-analyzed document corresponding to the part-of-speech type composition pattern; and document generation means for generating, based on the attribute value extracted by the attribute value extraction means and the output format information of the selected concept dictionary, a structured document representing a structured element for the concept corresponding to the selected concept dictionary.
[0019]
The invention described in claim 1 is constituted by installing the invention described in claim 3 in a computer.
[0020]
According to a third aspect of the present invention, there is provided a structured document generation program that causes a computer to function as: document input means for inputting a document including a natural sentence; phrase analysis means for executing phrase analysis on the document including the natural sentence input by the document input means, and outputting a phrase-analyzed document in which the phrases constituting the document are associated with part-of-speech types; concept dictionary storage means for storing concept dictionaries, each provided for a concept representing the semantic content of the phrase string and each having a part-of-speech type composition pattern, which is the composition of part-of-speech types of a phrase string containing an attribute value to be extracted and represents the pattern corresponding to the concept, and output format information for structuring the attribute value and the corresponding tag and outputting a structured element for the concept; attribute value extraction means for collating the composition pattern of part-of-speech types of a phrase string of the phrase-analyzed document with the part-of-speech type composition patterns of the concept dictionaries stored in the concept dictionary storage means, selecting a concept dictionary corresponding to the semantic content of the phrase string from the concept dictionary storage means, and extracting, using the selected concept dictionary, an attribute value from the phrase string of the phrase-analyzed document corresponding to the part-of-speech type composition pattern; and document generation means for generating, based on the attribute value extracted by the attribute value extraction means and the output format information of the selected concept dictionary, a structured document representing a structured element for the concept corresponding to the selected concept dictionary.
[0021]
The phrase analysis means performs phrase analysis on a document including a natural sentence. Here, the document including the natural sentence may be a document consisting only of natural sentences, or may be a document in a predetermined format having natural sentences and predetermined tags. The phrase analysis means then associates a part-of-speech type with each phrase constituting the document, and outputs a phrase-analyzed document in which each phrase is associated with its part-of-speech type.
[0022]
The concept dictionary storage means stores a concept dictionary having a part-of-speech type composition pattern and output format information. Here, the part-of-speech type composition pattern is the composition of part-of-speech types of the phrase string containing the attribute value to be extracted when an attribute value is to be extracted from the document, and it represents a pattern corresponding to a concept. The output format information indicates the output format of the document, and is information for structuring the attribute value and the corresponding tag and outputting a structured element for the concept.
[0023]
The attribute value extraction means collates the composition pattern of part-of-speech types of the phrase strings of the phrase-analyzed document with the part-of-speech type composition pattern of the concept dictionary, and finds the phrase string corresponding to this part-of-speech type composition pattern in the phrase-analyzed document. It then extracts the attribute value from the phrase string that was found.
The document generation means generates a structured document of a predetermined format based on the extracted attribute value and the output format information of the concept dictionary. This structured document is a document of a format composed of attribute values and tags, even though it was natural sentences at the beginning.
[0024]
Therefore, according to the first and third aspects of the invention, attribute values are extracted by collating the composition pattern of part-of-speech types of a phrase string of the phrase-analyzed document with the part-of-speech type composition pattern representing the composition of part-of-speech types of a phrase string containing the attribute value to be extracted, and a structured document representing elements in which the extracted attribute values are structured based on the output format information is generated. A structured document whose meaning is extracted from natural sentences can thereby be obtained.
[0025]
The invention according to claim 2 is the invention according to claim 1, wherein the attribute value extraction means selects a concept dictionary having a part-of-speech type composition pattern that matches the part-of-speech type composition pattern of the phrase-analyzed document from a plurality of concept dictionaries stored in the concept dictionary storage means, and extracts the attribute value using the selected concept dictionary, and the document creating means generates a structured document based on the attribute value extracted by the attribute value extraction means and the output format information of the concept dictionary selected by the attribute value extraction means.
[0026]
The invention described in claim 2 is constituted by installing the invention described in claim 4 in a computer.
[0027]
The invention according to claim 4 is the invention according to claim 3, wherein the attribute value extraction means selects a concept dictionary having a part-of-speech type composition pattern that matches the part-of-speech type composition pattern of the phrase-analyzed document from a plurality of concept dictionaries stored in the concept dictionary storage means, and extracts the attribute value using the selected concept dictionary, and the document creating means generates a structured document based on the attribute value extracted by the attribute value extraction means and the output format information of the concept dictionary selected by the attribute value extraction means.
[0028]
The concept dictionary storage means stores a concept dictionary for each of various concepts representing the semantic content of phrase strings. The attribute value extraction means specifies the concept of the phrase-analyzed document by selecting, from the various concept dictionaries, a concept dictionary having a part-of-speech type composition pattern that matches the part-of-speech type composition pattern of the phrase-analyzed document. Then, the attribute value related to the specified concept is obtained by extracting the attribute value using the selected concept dictionary.
[0029]
Therefore, according to the inventions described in claims 2 and 4, a concept dictionary having a part-of-speech type composition pattern that matches the part-of-speech type composition pattern of the phrase-analyzed document is selected from concept dictionaries representing the semantic content of various phrase strings, and attribute values are extracted using the selected concept dictionary, so that attribute values can be reliably extracted whatever concept the document content expresses.
[0030]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.
[0031]
FIG. 1 is a block diagram showing a configuration of a structured document generation apparatus according to an embodiment of the present invention.
[0032]
The structured document generating apparatus includes a keyboard 1 for inputting character information by key operation, a mouse 2 as a pointing device, an input/output port 3 for inputting and outputting information to and from the outside, a hard disk drive (Hard Disc Drive) 4 that stores an application program and the like for executing the arithmetic processing for structured document creation, a microcomputer 5 that executes the arithmetic processing for structured document creation, and an LCD (Liquid Crystal Display) 6 that displays an operation result of the microcomputer 5, for example, a structured document.
[0033]
The microcomputer 5 includes a RAM (Random Access Memory) that is a data work area, a ROM (Read Only Memory) in which a predetermined control program is stored, a CPU (Central Processing Unit) that executes arithmetic processing, and the like. The microcomputer 5 creates a structured document based on a text document input from the keyboard 1 or a text document input from the outside via the input/output port 3. Note that a structured document may be created from a text document already recorded on a recording medium (not shown).
[0034]
FIG. 2 is a block diagram showing a functional configuration of the microcomputer 5. The microcomputer 5 includes an XML document creation unit 10 that creates an XML document from a text document, a phrase analysis unit 20 that performs phrase analysis, and a meaning extraction unit 30 that extracts meaning and creates a structured document.
[0035]
The XML document creation unit 10 creates an XML document by using a natural text document and information such as the title and creation date of the text document, and supplies the created XML document to the phrase analysis unit 20. The text document may be any of a document generated by operating the keyboard 1, a document input from the input / output port 3, or a document stored in advance in the HDD 4 or other storage medium.
[0036]
FIG. 3 is a block diagram illustrating a functional configuration of the phrase analysis unit 20.
[0037]
The phrase analysis unit 20 includes a phrase dictionary 21 that indicates the correspondence between words and part-of-speech types and the like, a content character string separation unit 22 that separates the structure tags of an XML document from the content character string that is a natural sentence, a specific word separation/replacement unit 23 that separates and replaces specific words, a specific word part-of-speech determination unit 24 that determines the part of speech of a specific word, a morpheme analysis unit 25 that performs morpheme analysis of the content character string, an alternative word replacement unit 26 that replaces alternative words with specific words, a compound phrase generation unit 27 that generates, from a plurality of words, a phrase string that is a compound word or phrase, and a part-of-speech tag assigning unit 28 that attaches part-of-speech tags to phrases, that is, to words, compound words, and phrases.
[0038]
FIG. 4 is a diagram showing the configuration of the phrase dictionary 21. The phrase dictionary 21 includes a dictionary related to words and a dictionary related to phrases (compound words and phrases).
[0039]
The phrase dictionary 21 is composed of words, part-of-speech types, and alternative words with respect to words. Although the alternative word will be described in detail later, it is used only in the case of a specific word.
[0040]
According to FIG. 4, for example, for the word 「増幅」 ("amplification"), the part-of-speech type is 「名詞-サ変接続」 (noun, sa-hen connective). This means that 「増幅」 is a noun and connects to the sa-hen (suru) verb. For the word 「問題」 ("problem"), the part-of-speech type is 「名詞-ナイ形容詞語幹」 (noun, nai-adjective stem). This means that 「問題」 is a noun and becomes the stem of an adjective of the form 「…ない」. Further, for the word 「時」 ("time"), the part-of-speech type is 「名詞-接尾-副詞可能」 (noun, suffix, adverbial). This indicates that 「時」 is a noun, becomes the suffix 「…時」, and can also be used as an adverb. For the word 「が」, the part-of-speech type is 「助詞-格助詞-一般」 (particle, case particle, general). This indicates that 「が」 is a general case particle. In this way, the part-of-speech type represents not only the kind of part of speech of the corresponding word but also the attributes of the corresponding word.
[0041]
Here, the “word” includes a specific word indicating a word newly defined by the phrase dictionary 21 in addition to general words such as nouns and particles.
[0042]
For example, "2S[A-D][0-9]+" shown in FIG. 4 represents a word that starts with 2S, is followed by one letter from A to D, and ends with one or more digits. A word corresponding to the above condition represents a transistor part number and is not in a general dictionary. Therefore, the phrase dictionary 21 defines words that meet the above condition as specific words.
[0043]
Here, for the word (specific word) "2S[A-D][0-9]+", the part-of-speech type is 「名詞-固有名詞-識別子-品番-トランジスタ」 (noun, proper noun, identifier, part number, transistor), and the alternative word is 「代替-トランジスタ品番」 ("alternative-transistor part number"). This indicates that the specific word is a proper noun, an identifier, and a transistor part number. Further, since the meaning of the specific word cannot be understood from the word itself, it indicates that "transistor part number" exists as an alternative word.
[0044]
For phrase strings (compound words and phrases), the phrase dictionary 21 is composed of a "phrase string pattern" and a "part-of-speech type", and also functions as a dictionary for creating compound words and phrases.
[0045]
The "phrase string pattern" represents a pattern that must be satisfied to be established as a compound word or phrase. For example, according to FIG. 4, "(名詞 and not (名詞-接尾-副詞可能) or 記号)*[2,∞]" represents a pattern in which nouns (except for 名詞-接尾-副詞可能) or symbols (記号) continue two or more times, and the "part-of-speech type" of this pattern is a noun phrase. In other words, a phrase string that meets such a condition is handled as a noun phrase.
[0046]
FIGS. 5 and 6 are diagrams for explaining a document analyzed by the phrase analysis unit 20. Hereinafter, a case where the XML document shown in FIG. 5A is input to the phrase analysis unit 20 will be described.
[0047]
When the XML document shown in FIG. 5A is supplied from the XML document creation unit 10, the content character string separation unit 22 separates the structure tag of the XML document from the content character string, which is a natural sentence. Then, the content character string separation unit 22 supplies the structure tag of the XML document shown in FIG. 5B to the part-of-speech tag assignment unit 28, and supplies the content character string of the XML document shown in FIG. 5 to the specific word separation / replacement unit 23.
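The separation step can be illustrated with a short sketch. It assumes a hypothetical input document, because FIG. 5A is not reproduced here: only the `<content>` tag is mentioned in the description, and the `<document>` wrapper, the sample sentence, and the function name `separate_content` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def separate_content(xml_text: str):
    """Split an XML document into its structure tags and its content strings.

    Tail text after a closing tag is ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    contents = []
    for elem in root.iter():
        if elem.text and elem.text.strip():
            contents.append((elem.tag, elem.text.strip()))
            elem.text = None  # leave only the structure tag behind
    structure = ET.tostring(root, encoding="unicode")
    return structure, contents


# Hypothetical input standing in for FIG. 5A.
SAMPLE = (
    "<document><content>"
    "In 2SC7777, heat generation is a problem during class A amplification"
    "</content></document>"
)
structure, contents = separate_content(SAMPLE)
print(structure)  # structure tags only, natural sentence removed
print(contents)   # (tag, natural sentence) pairs for later analysis
```

Keeping the tag name next to each content string is one way to let a later step embed the analyzed phrases back under the same structure tag.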
[0048]
The specific word separation / replacement unit 23 collates the content character string supplied from the content character string separation unit 22 with the phrase dictionary 21, replaces each specific word included in the content character string with its alternative word, and separates the specific word. Then, the separated specific word is supplied to the specific word part-of-speech determination unit 24, and the content character string in which the specific word has been replaced with the alternative word is supplied to the morpheme analysis unit 25.
[0049]
Specifically, the specific word separation / replacement unit 23 replaces the specific word “2SC7777” included in the content character string with “substitution-transistor part number” based on the phrase dictionary 21. Then, the specific word “2SC7777” shown in FIG. 5D is separated and supplied to the specific word part-of-speech determination unit 24, and the replaced content character string shown in FIG. 5, “In (substitution-transistor part number), heat generation is a problem during class A amplification”, is supplied to the morpheme analysis unit 25.
[0050]
The specific word part-of-speech determination unit 24 determines the part of speech of the specific word based on the phrase dictionary 21 and supplies the specific word and its part of speech type information to the alternative word replacement unit 26. Specifically, the part of speech of “2SC7777” is determined, and as shown in FIG. 5F, “2SC7777-noun-proper noun-identifier-part number-transistor” is supplied to the alternative word replacement unit 26 as the specific word and part of speech type information.
[0051]
The morpheme analysis unit 25 performs morpheme analysis on the content character string supplied from the specific word separation / replacement unit 23 with reference to the phrase dictionary 21. Specifically, the content character string is decomposed into individual words, and each decomposed word (including a specific word) is associated with a part of speech type. Then, the morpheme analysis unit 25 supplies each word shown in FIG. 5G and the corresponding part of speech type to the alternative word replacement unit 26. In FIG. 5G, the words “time”, “ni”, “heat generation”, “ga”, and “problem” are omitted, but morpheme analysis is performed on these words in the same way, and each word and its corresponding part of speech type are supplied to the alternative word replacement unit 26.
[0052]
The alternative word replacement unit 26 selects the alternative word from among the morphologically analyzed words based on the information supplied from the specific word part-of-speech determination unit 24 and the morpheme analysis unit 25, and replaces the selected alternative word with the original specific word. Specifically, “substitution-transistor part number” is replaced with “2SC7777”. Then, each word shown in FIG. 6A and the corresponding part of speech type are supplied to the compound word / phrase generation unit 27. In FIG. 6A, the words “time”, “ni”, “heat generation”, “ga”, and “problem” are not shown. These words are also omitted from FIG. 6B described later.
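The replace-then-restore flow around the morpheme analysis can be sketched as follows. This is only an illustration under stated assumptions: the morpheme analysis itself is not implemented and is represented by a hand-written word list, the sample sentence is invented, and the function names `separate_and_replace` and `restore` are hypothetical.

```python
import re

SPECIFIC_WORD = re.compile(r"2S[A-D][0-9]+")
ALTERNATIVE = "substitution-transistor part number"


def separate_and_replace(text):
    """Replace each specific word with the alternative word.

    Returns the replaced text and the separated specific words in order.
    """
    separated = []

    def _swap(match):
        separated.append(match.group(0))
        return ALTERNATIVE

    return SPECIFIC_WORD.sub(_swap, text), separated


def restore(words, separated):
    """Put the original specific words back in place of the alternative word."""
    remaining = iter(separated)
    return [next(remaining) if w == ALTERNATIVE else w for w in words]


replaced, separated = separate_and_replace("2SC7777 overheats")
print(replaced)   # substitution-transistor part number overheats
print(separated)  # ['2SC7777']

# Stand-in for the morpheme analysis result (assumed, not computed here).
analyzed_words = [ALTERNATIVE, "overheats"]
print(restore(analyzed_words, separated))  # ['2SC7777', 'overheats']
```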
[0053]
The compound word / phrase generation unit 27 determines whether the sequence of words supplied from the alternative word replacement unit 26 contains a subsequence of words that corresponds to a compound word / phrase pattern in the phrase dictionary 21, and, if so, replaces every such subsequence with the compound word / phrase.
[0054]
Specifically, the word subsequence “A”, “class”, “amplification” corresponds to the phrase string pattern of the noun phrase in the phrase dictionary 21 shown in FIG. 4. Therefore, the compound word / phrase generation unit 27 generates the noun phrase “class A amplification” from the word subsequence “A”, “class”, “amplification”, and supplies each phrase shown in FIG. 6B and the corresponding part of speech type to the part-of-speech tag assignment unit 28. When the words contain no subsequence corresponding to a compound word / phrase in the phrase dictionary 21, the compound word / phrase generation unit 27 supplies the information supplied from the alternative word replacement unit 26 to the part-of-speech tag assignment unit 28 as it is.
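The noun phrase rule of FIG. 4 (two or more consecutive nouns or symbols, excluding noun-suffix-adverbial possible) can be sketched as a run-merging function. The part of speech labels given to “A” and “class” below are assumptions, since FIG. 6A is not reproduced here, and the function names are hypothetical. Because the sketch joins English glosses with a space, it yields “A class amplification”, which is the phrase the text renders as “class A amplification”.

```python
def is_mergeable(pos: str) -> bool:
    """Noun (except noun-suffix-adverbial possible) or symbol."""
    if pos.startswith("symbol"):
        return True
    return pos.startswith("noun") and not pos.startswith(
        "noun-suffix-adverbial possible"
    )


def merge_noun_phrases(tokens):
    """Merge each run of two or more mergeable tokens into one noun phrase."""
    out, run = [], []

    def flush():
        if len(run) >= 2:
            out.append((" ".join(word for word, _ in run), "noun phrase"))
        else:
            out.extend(run)  # a single noun stays as it is
        run.clear()

    for token in tokens:
        if is_mergeable(token[1]):
            run.append(token)
        else:
            flush()
            out.append(token)
    flush()
    return out


# Assumed labels for illustration only.
TOKENS = [
    ("A", "symbol-alphabet"),
    ("class", "noun-suffix-general"),
    ("amplification", "noun-sa-irregular connection"),
    ("time", "noun-suffix-adverbial possible"),
    ("ni", "particle-case particle-general"),
]
print(merge_noun_phrases(TOKENS))
```

Excluding the adverbial suffix keeps “time” outside the noun phrase, which is what later allows a tag pattern to refer to it separately.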
[0055]
The part-of-speech tag assignment unit 28 assigns, to each phrase (including compound words / phrases) supplied from the compound word / phrase generation unit 27, a part-of-speech tag indicating its part of speech type. Each phrase and its part-of-speech tag are then embedded as elements of the structure tag <content> supplied from the content character string separation unit 22. As a result, the part-of-speech tag assignment unit 28 generates and outputs a phrase-analyzed XML document, as shown in FIG. 6C.
[0056]
In this way, the phrase analysis unit 20 separates the structure tag and the content character string (natural sentence) of a general XML document, performs morpheme analysis on the content character string, assigns a part-of-speech tag to each phrase of the content character string, and embeds each phrase and its part-of-speech tag in the original XML document. That is, by performing morpheme analysis on the natural sentence and assigning a part-of-speech tag to each phrase constituting the natural sentence, the phrase analysis unit 20 can clarify the structure of the natural sentence and output a document from which meaning is easy to extract.
[0057]
In addition, a newly defined specific word and its alternative word are registered in the phrase dictionary 21, so that the phrase analysis unit 20 uses the alternative word in place of the specific word during morpheme analysis and, after the morpheme analysis, replaces the alternative word with the original specific word. This makes accurate morpheme analysis possible even for specialized and technical terms that are not in a general dictionary.
[0058]
The phrase dictionary 21 is not limited to the configuration shown in FIG. 4 and can store other words and part of speech types. Needless to say, the phrase dictionary 21 can store not only patterns for creating noun phrases from a plurality of nouns, but also patterns for generating other compound words / phrases such as adjective phrases and adverb phrases.
[0059]
FIG. 7 is a block diagram illustrating a functional configuration of the meaning extraction unit 30. The meaning extraction unit 30 extracts the meaning from the XML document that has been subjected to the phrase analysis, and generates an XML document in which the extracted meaning is structured.
[0060]
The meaning extraction unit 30 includes: a concept dictionary 31 that stores attribute value extraction patterns and attribute value output formats for various concepts; a pattern matching unit 32 that selects the optimal item concept dictionary from a plurality of item concept dictionaries by pattern matching; an attribute value extraction unit 33 that extracts attribute values using the selected item concept dictionary; and a shaping unit 34 that shapes the extracted attribute values and the corresponding tags into an XML document and outputs it.
[0061]
FIG. 8 is a block diagram showing the configuration of the concept dictionary 31. The concept dictionary is composed of, for example, a plurality of item concept dictionaries configured for various concepts (items) such as “current problem” and “solution”. Here, the “current problem” will be described as an example.
[0062]
The item concept dictionary includes an “item” representing the item of the item concept dictionary, a “synonym” representing a synonym of the “item”, an “attribute describing the item”, and an “XML output format” representing the XML output format of the extracted attribute values.
[0063]
The “attribute describing the item” includes an “attribute name” and a “tag pattern”. The “attribute name” indicates which attribute value is extracted. The “tag pattern” represents the composition pattern of the part-of-speech tags of a phrase string that includes the attribute value, together with the position of the attribute value within that composition pattern.
[0064]
Here, “noun *” represents an arbitrary noun or noun phrase. The combination of “$” and the attribute name following it indicates that the surface form of the phrase at that position in the tag pattern becomes the value of the attribute.
[0065]
For example, in FIG. 8, “problem” indicates that an attribute value describing what kind of “problem” exists is extracted from the phrase-analyzed XML document. The “tag pattern” of “problem” represents the composition pattern of the part-of-speech tags of a phrase string that includes the attribute value describing the “problem”.
[0066]
The “tag pattern” of “problem” indicates that the part-of-speech tag of the first phrase is a noun or noun phrase, and that the part-of-speech tag of the next phrase may be arbitrary but the phrase itself is “ga”. Here, since “$ problem” appears in the first part-of-speech tag element, the phrase corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name “problem”.
[0067]
“Target” indicates that an attribute value describing what kind of “target” is involved is extracted from the phrase-analyzed XML document. The “tag pattern” of “target” indicates that the part-of-speech tag of the first phrase is a noun or noun phrase, and that the part-of-speech tag of the next phrase may be arbitrary but the phrase (phrase string) itself is “in”. In addition, since “$ target” appears in the first part-of-speech tag element, the phrase corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name “target”.
[0068]
Further, “problem occurrence situation” indicates that an attribute value describing the “problem occurrence situation” is extracted from the phrase-analyzed XML document. The “tag pattern” of “problem occurrence situation” indicates that the part-of-speech tag of the first phrase is a noun or noun phrase, that the part-of-speech tag of the next phrase is “noun-suffix-adverbial possible” (a noun that serves as a suffix and can be used adverbially), and that the part-of-speech tag of the last phrase may be arbitrary but the phrase itself is “ni”. In addition, since “$ problem occurrence situation” appears in the first part-of-speech tag element, the phrase corresponding to the first part-of-speech tag becomes the value (attribute value) of the attribute name “problem occurrence situation”.
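One possible in-memory representation of an item concept dictionary is sketched below. The class and field names are hypothetical, the synonym list is left empty because FIG. 8 is not reproduced here, and the output format string is an assumed stand-in for the “XML output format”. Only the “problem” attribute is filled in; the other two attributes would follow the same shape.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PatternElement:
    """One element of a tag pattern."""
    pos_condition: Optional[str] = None  # e.g. "noun *"; None means arbitrary
    surface: Optional[str] = None        # required phrase itself, if any
    capture: Optional[str] = None        # attribute name written after "$"


@dataclass
class Attribute:
    name: str
    tag_pattern: List[PatternElement]


@dataclass
class ItemConceptDictionary:
    item: str
    synonyms: List[str]
    attributes: List[Attribute]
    xml_output_format: str


problem = Attribute(
    name="problem",
    tag_pattern=[
        PatternElement(pos_condition="noun *", capture="problem"),
        PatternElement(surface="ga"),  # assumed gloss of the following particle
    ],
)
current_problem = ItemConceptDictionary(
    item="current problem",
    synonyms=[],  # not known from the text
    attributes=[problem],
    xml_output_format="<current_problem>$problem</current_problem>",  # assumed
)
print(current_problem.attributes[0].tag_pattern[0].capture)  # problem
```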
[0069]
The “XML output format” specifies that an XML document is output in which each “$ attribute name” placeholder is replaced with the corresponding extracted attribute value.
[0070]
For example, the first line of the “XML output format” in FIG. 8 indicates that the attribute value of “problem” is substituted on output. Note that <target> and <when problem occurs>, which are incorporated as elements of <current problem>, indicate that the corresponding elements are formed and output together with the structure tags.
[0071]
The pattern matching unit 32, the attribute value extraction unit 33, and the shaping unit 34 perform the following processes using the concept dictionary 31 configured as described above.
[0072]
The pattern matching unit 32 collates the phrase-analyzed XML document supplied from the phrase analysis unit 20 with each item concept dictionary of the concept dictionary 31 and selects the item concept dictionary corresponding to the XML document. Specifically, from among the item concept dictionaries, it selects the one whose tag patterns all completely match the composition pattern of the part-of-speech tags of the document. Then, the pattern matching unit 32 supplies the “item” of the selected item concept dictionary to the attribute value extraction unit 33, together with the XML document corresponding to that item concept dictionary.
[0073]
For example, here, the composition pattern of the part-of-speech tags of the document shown in FIG. 6C matches the “tag patterns” of the item concept dictionary “current problem” shown in FIG. 8. Accordingly, the pattern matching unit 32 supplies the item name “current problem” and the elements of <content> in FIG. 6C to the attribute value extraction unit 33.
[0074]
Note that when there are a plurality of item concept dictionaries whose tag patterns all completely match the composition pattern of the part-of-speech tags of the document supplied from the phrase analysis unit 20, the conditions of the tag patterns of the item concept dictionaries may be further restricted. For example, instead of “noun *” representing an arbitrary noun or noun phrase, “noun-proper noun” representing a proper noun may be used, or other conditions may be restricted.
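The effect of restricting a condition can be shown with a small sketch. It assumes that part of speech types are hyphen-separated hierarchies, as in the labels quoted in the description, and that a condition matches any type at or below it; the function name `pos_matches` and this matching rule are illustrative assumptions.

```python
def pos_matches(condition: str, pos: str) -> bool:
    """Check one part-of-speech condition of a tag pattern.

    "noun *" accepts any noun or noun phrase; any other condition
    accepts the same type or a more specific type beneath it.
    """
    if condition == "noun *":
        return pos == "noun phrase" or pos.split("-")[0] == "noun"
    return pos == condition or pos.startswith(condition + "-")


PART_NUMBER = "noun-proper noun-identifier-part number-transistor"
COMMON_NOUN = "noun-sa-irregular connection"

print(pos_matches("noun *", PART_NUMBER))            # True
print(pos_matches("noun *", COMMON_NOUN))            # True
print(pos_matches("noun-proper noun", PART_NUMBER))  # True
print(pos_matches("noun-proper noun", COMMON_NOUN))  # False: restricted out
```

With the narrower condition, a pattern that previously matched any noun matches only proper nouns, so fewer item concept dictionaries remain as candidates.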
[0075]
The attribute value extraction unit 33 reads the item concept dictionary indicated by the “item” supplied from the pattern matching unit 32 from the concept dictionary 31, and finds, in the document supplied from the pattern matching unit 32, the phrase string corresponding to each tag pattern of the item concept dictionary. Then, from the phrase string, the phrase corresponding to the “$ attribute name” of the tag pattern is extracted as an attribute value.
[0076]
Specifically, the attribute value extraction unit 33 finds, among the <content> elements in FIG. 6C, the phrase string corresponding to each of the three tag patterns shown in FIG. 8. Then, from each phrase string found, the phrases corresponding to the “$ attribute name” of each tag pattern, namely “heat generation”, “2SC7777”, and “class A amplification”, are extracted and supplied to the shaping unit 34 as attribute values.
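The selection and extraction steps can be sketched together as follows. This is an illustration under assumptions, not the patented implementation: tag patterns are encoded as tuples, part of speech conditions are matched by prefix, the particle glosses “ga”, “in”, and “ni” and the token labels are assumed, and the “solution” dictionary with its “measure” attribute is invented only to show a non-matching candidate.

```python
# A tag pattern is a list of (pos_condition, surface, capture) elements:
#   pos_condition: required prefix of the part-of-speech tag, None = arbitrary
#   surface:       required phrase itself, None = arbitrary
#   capture:       True for the element marked with "$ attribute name"
CONCEPT_DICTIONARY = {
    "solution": {  # hypothetical item that does not match the sample
        "measure": [("noun", None, True), (None, "wo", False)],
    },
    "current problem": {
        "problem": [("noun", None, True), (None, "ga", False)],
        "target": [("noun", None, True), (None, "in", False)],
        "problem occurrence situation": [
            ("noun", None, True),
            ("noun-suffix-adverbial possible", None, False),
            (None, "ni", False),
        ],
    },
}


def find_value(pattern, tokens):
    """Return the captured phrase of the first matching window, else None."""
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            (cond is None or pos.startswith(cond))
            and (surface is None or word == surface)
            for (cond, surface, _), (word, pos) in zip(pattern, window)
        ):
            return next(w for (_, _, cap), (w, _) in zip(pattern, window) if cap)
    return None


def select_and_extract(concept_dictionary, tokens):
    """Select the first item whose tag patterns all match, with its values."""
    for item, attributes in concept_dictionary.items():
        values = {name: find_value(p, tokens) for name, p in attributes.items()}
        if None not in values.values():
            return item, values
    return None


TOKENS = [
    ("2SC7777", "noun-proper noun-identifier-part number-transistor"),
    ("in", "particle-case particle-general"),
    ("class A amplification", "noun phrase"),
    ("time", "noun-suffix-adverbial possible"),
    ("ni", "particle-case particle-general"),
    ("heat generation", "noun-sa-irregular connection"),
    ("ga", "particle-case particle-general"),
    ("problem", "noun-nai adjective stem"),
]
print(select_and_extract(CONCEPT_DICTIONARY, TOKENS))
```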
[0077]
The shaping unit 34 shapes each attribute value supplied from the attribute value extraction unit 33 according to the output format of the concept dictionary 31 to generate an XML document, and outputs the XML document to the outside.
[0078]
Specifically, the attribute value “heat generation” describing the “problem”, the attribute value “2SC7777” describing the “target”, and the attribute value “class A amplification” describing the “problem occurrence situation” are each substituted for the corresponding “$ attribute name” in the “XML output format” to form elements, and the formed elements and structure tags are output in XML document format.
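The substitution performed by the shaping unit can be sketched with a string template. The template layout and the identifier-style names (`current_problem`, `when_problem_occurs`, `$situation`) are assumptions standing in for the “XML output format” of FIG. 8, which is not reproduced here; underscores are used because XML element names and template placeholders cannot contain spaces.

```python
import xml.etree.ElementTree as ET
from string import Template
from xml.sax.saxutils import escape

# Assumed stand-in for the "XML output format" of one item concept dictionary.
OUTPUT_FORMAT = Template(
    "<current_problem>$problem"
    "<target>$target</target>"
    "<when_problem_occurs>$situation</when_problem_occurs>"
    "</current_problem>"
)


def shape(attribute_values: dict) -> str:
    """Substitute each extracted attribute value for its placeholder."""
    escaped = {name: escape(value) for name, value in attribute_values.items()}
    return OUTPUT_FORMAT.substitute(escaped)


xml_text = shape(
    {
        "problem": "heat generation",
        "target": "2SC7777",
        "situation": "class A amplification",
    }
)
ET.fromstring(xml_text)  # raises ParseError if the result is not well-formed
print(xml_text)
```

Escaping the values before substitution keeps the output well-formed even when an extracted phrase contains characters such as “&” or “<”.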
[0079]
FIG. 9 is a diagram illustrating the meaning-extracted XML document output from the meaning extraction unit 30. Conventionally, a structured document can be generated from a predetermined document. However, when a natural sentence is included in the predetermined document, the natural sentence is output as it is. That is, the conventional structured document includes a natural sentence in the element of the structure tag. In contrast, the meaning-extracted XML document shown in FIG. 9 is a document in which the meaning of the natural sentence contained in the element of the structure tag is extracted and the extracted meaning is structured.
[0080]
As described above, the structured document generation apparatus according to the present embodiment performs morpheme analysis on a natural sentence included in an XML document, assigns a part of speech type to each phrase obtained by the morpheme analysis, collates the composition pattern of these part of speech types with tag patterns representing the part-of-speech type composition of phrase strings that include attribute values, and extracts the attribute values from the matching phrase strings. Then, by outputting the extracted attribute values in accordance with a predetermined XML output format, it can generate an XML document in which the natural sentence is structured.
[0081]
In particular, because the structured document generation apparatus prepares in advance tag patterns for extracting each attribute value for each of various concepts (items), it can identify the concept dictionary corresponding to a natural sentence and extract from the natural sentence the attribute values that describe the concept.
[0082]
Further, because the structured document generation apparatus prepares in advance, for each of various concepts (items), an XML output format of the attribute values and structure tags necessary for describing the concept, it can easily generate an XML document in which the extracted attribute values are structured simply by outputting them in accordance with the XML output format.
[0083]
As shown in FIG. 9, the XML document generated by the structured document generation apparatus is a structured document in which the concept of a natural sentence is described by structure tags and attribute values, and is therefore a document that makes it easy to respond to inquiries. Hereinafter, application examples of the XML document will be described.
[0084]
(Application example 1)
A case will be described in which a document describing that “the weight of the product is reduced” is searched from a database storing a large number of documents.
[0085]
FIG. 10A shows a conventional natural sentence document stored in the database, FIG. 10B shows a conventional search result, and FIG. 10C shows the XML document of the present application stored in the database. The XML document of the present application refers to an XML document generated by the above-described structured document generation apparatus from the conventional natural sentence document shown in FIG. 10A. That is, the documents (A) and (C) have the same content, both indicating that the weight of the product increases.
[0086]
Here, when a document describing that “the weight of the product is reduced” is searched for by a general method, “product”, “weight”, and “decrease” are usually used as search keywords. When the conventional document is stored in the database, the conventional document, which contains the words “product”, “weight”, and “decrease”, is erroneously retrieved, as shown in (B).
[0087]
On the other hand, when the XML document of the present application is stored in the database, the structure tags need only be followed in order. Here, a structure tag relating to “product weight” is searched for first, and then a structure tag indicating whether the weight of the product has increased or decreased is searched for among the elements of that structure tag. Specifically, <product weight> is searched for first, and <direction>, which represents the change in weight, is searched for among its elements. In the XML document of the present application, the element of <direction> is “increase”, so the document is not erroneously retrieved.
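The difference between the two searches can be sketched as follows. The natural sentence and the XML fragment are hypothetical stand-ins for FIG. 10A and FIG. 10C, which are not reproduced here; the element name `product_weight` uses an underscore because XML names cannot contain spaces, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for FIG. 10A and FIG. 10C.
NATURAL_TEXT = (
    "The weight of the product increased, although a decrease had been expected."
)
STRUCTURED_XML = (
    "<document><product_weight>"
    "<direction>increase</direction>"
    "</product_weight></document>"
)


def keyword_search(text, keywords):
    """Hit when every keyword occurs somewhere in the text."""
    lowered = text.lower()
    return all(keyword in lowered for keyword in keywords)


def structured_search(xml_text, concept, direction):
    """Hit only when <direction> under the concept tag has the requested value."""
    root = ET.fromstring(xml_text)
    return any(
        node.findtext("direction") == direction for node in root.iter(concept)
    )


print(keyword_search(NATURAL_TEXT, ["product", "weight", "decrease"]))  # True: erroneous hit
print(structured_search(STRUCTURED_XML, "product_weight", "decrease"))  # False
print(structured_search(STRUCTURED_XML, "product_weight", "increase"))  # True
```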
[0088]
Therefore, even from a document that would be erroneously retrieved by an ordinary keyword search, the structured document generation apparatus can generate an XML document that is not erroneously retrieved.
[0089]
(Application example 2)
A case will be described in which the inquiry “What is the current countermeasure against processing defects?” is made to one document in the database.
[0090]
FIG. 11A shows a conventional natural sentence document stored in the database, and FIG. 11B shows an XML document of the present application stored in the database. The documents (A) and (B) have the same contents.
[0091]
When a conventional document is stored in the database, no response can be made even if an inquiry is made "What is the current countermeasure against processing defects?"
[0092]
On the other hand, when the XML document of the present application is stored in the database, the structure tags related to the inquiry need only be followed. Here, a structure tag related to “countermeasure” is searched for first, and the elements of that structure tag (lower structure tags and their elements) are extracted. Then, by combining the extracted structure tags and elements, it is possible to answer the above inquiry by stating that the thickness has been increased by 2 mm.
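Answering an inquiry by following structure tags can be sketched as below. The XML fragment is a hypothetical stand-in for FIG. 11B, which is not reproduced here: the element names `processing_defect`, `target`, `direction`, and `amount` are assumptions chosen so that the result corresponds to the answer given in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for FIG. 11B.
STRUCTURED_XML = (
    "<processing_defect><countermeasure>"
    "<target>thickness</target>"
    "<direction>increase</direction>"
    "<amount>2 mm</amount>"
    "</countermeasure></processing_defect>"
)


def answer(xml_text, tag):
    """Collect the lower structure tags and elements under the queried tag."""
    root = ET.fromstring(xml_text)
    node = root if root.tag == tag else root.find(f".//{tag}")
    if node is None:
        return None  # the document says nothing about this concept
    return {child.tag: (child.text or "").strip() for child in node}


print(answer(STRUCTURED_XML, "countermeasure"))
# {'target': 'thickness', 'direction': 'increase', 'amount': '2 mm'}
```

Combining the three returned elements gives the response that the thickness has been increased by 2 mm.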
[0093]
Therefore, the structured document generation apparatus can generate an XML document that can easily respond to an inquiry by extracting and structuring the meaning of a natural sentence.
[0094]
The present invention is not limited to the above-described embodiments, and various design changes can be made within the scope described in the claims.
[0095]
For example, the concept dictionary 31 is not limited to the configuration shown in FIG. 8. In the present embodiment, the case where one item concept dictionary has three tag patterns has been described as an example, but as many tag patterns as there are attribute values to be extracted can be provided.
[0096]
In the present embodiment, an XML document has been described as an example. However, an SGML document, for example, may be used instead. In that case, the “XML output format” of the item concept dictionary may be changed to an “SGML output format”. As long as a structured document is created from a natural sentence, the document format is not particularly limited.
[0098]
[Effect of the invention]
The structured document generation apparatus and structured document generation program according to the present invention collate the part-of-speech type composition pattern of a phrase string in a phrase-analyzed document with a part-of-speech type composition pattern representing the part-of-speech type composition of a phrase string that contains an attribute value to be extracted, thereby extracting the attribute value, and generate a structured document by structuring the extracted attribute value based on output format information. As a result, a structured document from which meaning has been extracted can be obtained.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a structured document generation apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a functional configuration of a microcomputer.
FIG. 3 is a block diagram showing a functional configuration of a phrase analysis unit.
FIG. 4 is a diagram illustrating a configuration of a phrase dictionary.
FIG. 5 is a diagram illustrating a document being analyzed by a phrase analysis unit.
FIG. 6 is a diagram illustrating a document that is analyzed by a phrase analysis unit.
FIG. 7 is a block diagram showing a functional configuration of a meaning extraction unit.
FIG. 8 is a block diagram illustrating a configuration of a concept dictionary.
FIG. 9 is a diagram illustrating a meaning-extracted XML document output from a meaning extraction unit.
FIG. 10A is a diagram showing a conventional natural sentence document stored in a database, FIG. 10B is a diagram showing a conventional search result, and FIG. 10C is a diagram showing an XML document of the present application stored in the database.
FIG. 11A is a diagram showing a conventional natural sentence document stored in a database, and FIG. 11B is a diagram showing an XML document of the present application stored in the database.
[Explanation of symbols]
20 Phrase analysis unit
21 Phrase dictionary
22 Content character string separation unit
23 Specific word separation / replacement unit
24 Specific word part-of-speech determination unit
25 Morpheme analysis unit
26 Alternative word replacement unit
27 Compound word / phrase generation unit
28 Part-of-speech tag assignment unit
30 Meaning extraction unit
31 Concept dictionary
32 Pattern matching unit
33 Attribute value extraction unit
34 Shaping unit

Claims

A structured document generation apparatus comprising:
document input means for inputting a document including a natural sentence;
phrase analysis means for performing phrase analysis on the document including the natural sentence input by the document input means, and outputting a phrase-analyzed document in which the phrases constituting the document are associated with part-of-speech types;
concept dictionary storage means for storing concept dictionaries, each provided for a concept representing the semantic content of a phrase string and each having a part-of-speech type composition pattern, which is the part-of-speech type composition of a phrase string including an attribute value to be extracted and represents a pattern corresponding to the concept, and output format information for structuring the attribute value and a corresponding tag and outputting an element structured for the concept;
attribute value extracting means for collating the part-of-speech type composition pattern of a phrase string of the phrase-analyzed document with the part-of-speech type composition patterns of the concept dictionaries stored in the concept dictionary storage means, selecting from the concept dictionary storage means the concept dictionary corresponding to the semantic content of the phrase string, and using the selected concept dictionary to extract an attribute value from the phrase string of the phrase-analyzed document corresponding to the part-of-speech type composition pattern; and
document generating means for generating, based on the attribute value extracted by the attribute value extracting means and the output format information of the selected concept dictionary, a structured document representing an element structured for the concept corresponding to the selected concept dictionary.

The attribute value extracting means selects, from among a plurality of concept dictionaries stored in the concept dictionary storage means, a concept dictionary having a part-of-speech type composition pattern that matches the part-of-speech type composition pattern of the phrase-analyzed document, and extracts attribute values using the selected concept dictionary,
2. The structured document is generated based on the attribute value extracted by the attribute value extracting unit and the output format information of the concept dictionary selected by the attribute value extracting unit. Structured document generator.

A structured document generation program for causing a computer to function as:
document input means for inputting a document including a natural sentence;
phrase analysis means for performing phrase analysis on the document including the natural sentence input by the document input means, and outputting a phrase-analyzed document in which the phrases constituting the document are associated with part-of-speech types;
concept dictionary storage means for storing concept dictionaries, each provided for a concept representing the semantic content of a phrase string and each having a part-of-speech type composition pattern, which is the part-of-speech type composition of a phrase string including an attribute value to be extracted and represents a pattern corresponding to the concept, and output format information for structuring the attribute value and a corresponding tag and outputting an element structured for the concept;
attribute value extracting means for collating the part-of-speech type composition pattern of a phrase string of the phrase-analyzed document with the part-of-speech type composition patterns of the concept dictionaries stored in the concept dictionary storage means, selecting from the concept dictionary storage means the concept dictionary corresponding to the semantic content of the phrase string, and using the selected concept dictionary to extract an attribute value from the phrase string of the phrase-analyzed document corresponding to the part-of-speech type composition pattern; and
document generating means for generating, based on the attribute value extracted by the attribute value extracting means and the output format information of the selected concept dictionary, a structured document representing an element structured for the concept corresponding to the selected concept dictionary.

The attribute value extraction unit selects, from among a plurality of concept dictionaries stored in the concept dictionary storage unit, a concept dictionary having a part-of-speech type composition pattern that matches the part-of-speech type composition pattern of the phrase-analyzed document, and extracts attribute values using the selected concept dictionary,
The document creation unit generates a structured document based on the attribute value extracted by the attribute value extraction unit and the output format information of the concept dictionary selected by the attribute value extraction unit. Structured document generator.