JP4934819B2 - Information extraction apparatus, method and program thereof

Info

Publication number: JP4934819B2
Application number: JP2007094339A
Authority: JP
Inventors: 浩郷野村
Original assignee: Kyushu Institute of Technology NUC
Current assignee: Kyushu Institute of Technology NUC
Priority date: 2007-03-30
Filing date: 2007-03-30
Publication date: 2012-05-23
Anticipated expiration: 2027-03-30
Also published as: JP2008250887A

Description

The present invention relates to an apparatus that automatically extracts information of predetermined types from a document.

With the spread of networks, information is increasingly digitized and managed electronically, and it has become difficult to pick out only the necessary information by hand. Under these circumstances, information extraction, which retrieves only the target information from text data, is needed as a technique for managing large amounts of information. Because information extraction is applied across a wide range of uses, such as organizing text, automatically building databases, and generating summaries, more capable and more accurate information extraction techniques are required.
A document summarization apparatus that uses information extraction is disclosed in Japanese Patent Laid-Open No. 2002-288190.

The background-art document summarization apparatus that uses information extraction comprises: a morpheme string input receiving unit that receives input of a morpheme string (including a string consisting of a single morpheme; the same applies below); an element storage unit that stores in advance the morpheme strings to be recognized as elements and the attributes of those elements; a template storage unit that stores templates, each containing a sequence of the stored attributes and information on the character strings to be embedded between the attributes of that sequence; an element search unit that searches the received morpheme string for the stored elements and outputs the elements found as a sequence; and a summary generation unit that obtains the sequence of attributes from the sequence of elements found, selects from the stored templates one that contains that attribute sequence, and generates a summary by embedding the character string information of that template between the morpheme strings of the elements corresponding to each attribute.
JP 2002-288190 A

The background-art document summarization apparatus described above can provide a summarization apparatus suited to summarizing a document with respect to the information the user wants.
However, when information is extracted with templates, generality tends to fall if the documents contain many sentences with complicated constructions such as parallel or nested structures, for example because an item to be extracted may be described across several sentences. As a result, the number of templates grows, many candidate solutions arise, and extraction accuracy drops. In addition, matching against a large number of templates impairs processing speed.

Besides the main product information, documents often contain information that describes the features of a product. Several pieces of feature information frequently appear in a single article, and it is desirable to extract those of particularly high importance.
To solve these problems, an object of the present invention is to provide a highly functional and highly accurate information extraction apparatus that finds in a document the specified items, one or more pieces of information related to them, and information that is important although it is not a specified item.

First, "selling-point" (売り) information as used in the present invention is defined. Next, the usefulness of article headlines for extracting selling-point information and the relationship between the headline and the body are described.
In the present invention, selling-point information is the information of particularly high importance among the information that describes a product's features. However, an article often contains several pieces of feature information, and the importance of each differs greatly depending on the position and viewpoint of whoever assigns the weights. An example of a new-product introduction article handled by the present invention, and of the feature information contained in its body, follows.

FIG. 1 is an explanatory diagram of a new-product introduction article and the feature information contained in it, for the information extraction apparatus according to the embodiment of the present invention.
The headline is "[Business Information] Neuro-controlled heater −− Mitsubishi Electric". The body is "Mitsubishi Electric will release, on August 21, 24 models in 8 types of neuro-controlled kerosene fan heaters that determine the set temperature automatically. They are for rooms of 6 to 16 tatami mats and are priced at 39,000 to 89,800 yen." The feature information here is "[determines the set temperature automatically, neuro control, for 6 to 16 tatami mats]".
Depending on the viewpoint used for weighting, any of these pieces of feature information can be selling-point information. To extract selling-point information from several pieces of feature information, this position and viewpoint must therefore be fixed. The present invention takes the position of the writer of the new-product introduction article, that is, the reporter, and extracts selling-point information from the reporter's viewpoint.

[Feature information in articles]
(a) Feature information in the headline
An article's headline is written by the reporter so that the reader can grasp the content of the body at a glance. Therefore, when feature information appears in the headline, it can be judged to be selling-point information of particularly high importance from the reporter's viewpoint. Of 300 past articles used as analysis data, 253 (about 84%) had headline feature information that was the product's selling-point information. Using the headline is therefore highly effective when extracting selling-point information. An example of an article whose headline feature information is its selling-point information follows.

FIG. 2 is an explanatory diagram of an article whose headline contains selling-point information, for the information extraction apparatus according to the embodiment of the present invention.
The headline is "[Notebook] Ashikaga Bank: an ATM usable by the visually impaired". The body is "Ashikaga Bank, jointly with Oki Electric Industry, has developed the first ATM in Japan that visually impaired people can use. To a conventional ATM it adds a handset like a telephone receiver, a braille display on which the transaction amount and balance are raised in braille on a board, and operation buttons marked in braille. When the machine is operated, the deposit amount and balance appear in braille on the display. The bank says it is 'part of our social contribution activities. We aim to be a better corporate citizen.'" Here, the selling-point information is "usable by the visually impaired".

(b) Relationship between the headline feature information and the body
When feature information in the headline also appears in the body, the body in many articles contained words that supplement that feature information. Therefore, by extracting from the body the words that supplement the feature information, in addition to the feature information that appears in both the headline and the body, more detailed selling-point information can be extracted. An example of an article whose body contains information supplementing the headline feature information follows.

FIG. 3 is an explanatory diagram of an article whose body contains information supplementing the headline feature information, for the information extraction apparatus according to the embodiment of the present invention.
The headline is "[Business Information] TV with two built-in satellite broadcast tuners released −− Sharp". The body is "On April 15 Sharp will release the 29-inch color TV 'Twin BS', the first in the industry with two built-in satellite broadcast tuners. 220,000 yen. While watching a satellite broadcast, the user can record a program on another satellite channel with an existing VTR. Connected by cable to another TV without a built-in tuner, it allows two satellite broadcast programs to be viewed at the same time." Here, the supplementary information is "first in the industry".

(1) An information extraction apparatus according to the present invention includes: article input means; template extraction means for extracting main product information from the input article information using templates; tag pattern matching means for performing pattern matching based on the tags attached to the input article information; article dividing means for dividing the article into a headline and a body based on the pattern matching result; headline morphological analysis means for morphologically analyzing the headline of the divided article; headline particle removal means for removing particles from the phrases of the morphologically analyzed headline; headline feature information matching means for matching the main product information extracted by the templates against the headline phrases from which particles have been removed after morphological analysis; headline feature information extraction means for extracting, as headline feature information, the phrases that indicate information other than the main product information; body morphological analysis means for morphologically analyzing the body of the divided article; body particle removal means for removing particles from the phrases of the morphologically analyzed body; body feature information matching means for matching the headline feature information against the body phrases from which particles have been removed after morphological analysis; body feature information extraction means for extracting the matched body feature information; and selling-point information output means for outputting the headline feature information or the body feature information as selling-point information.

With this configuration, an article is input; main product information is extracted from the input article information using templates; the article is divided into a headline and a body; the headline is morphologically analyzed and particles are removed from its phrases; the main product information extracted by the templates is matched against the headline phrases and the headline feature information is extracted; the body is morphologically analyzed; the headline feature information is matched against the body phrases and the matched body feature information is extracted; and the headline feature information or the body feature information is output as selling-point information. Extraction from highly formulaic text is therefore simple and fast. In addition, information of predetermined types, and important information related to it, can be extracted from a document quickly and easily in concise wording.
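The flow in (1) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the phrases are assumed to be already segmented (a real system would obtain them from a morphological analyzer), particles are stripped with a hypothetical surface-string list `PARTICLES`, the main product information is passed in as if the template step had already run, and preferring body matches over headline features is only one possible reading of "headline feature information or body feature information".

```python
# Hypothetical particle list; a real system would rely on part-of-speech
# tags from a morphological analyzer instead of surface strings.
PARTICLES = ("を", "が", "は", "に", "の", "で", "と")


def strip_particle(phrase: str) -> str:
    """Remove one trailing particle from a phrase, if present."""
    for particle in PARTICLES:
        if phrase.endswith(particle) and len(phrase) > len(particle):
            return phrase[: -len(particle)]
    return phrase


def headline_features(headline_phrases, main_product_info):
    """Headline phrases that are not main product information."""
    stripped = [strip_particle(p) for p in headline_phrases]
    return [p for p in stripped if p not in main_product_info]


def body_features(features, body_phrases):
    """Headline features that also occur among the body phrases."""
    stripped = {strip_particle(p) for p in body_phrases}
    return [f for f in features if f in stripped]


def selling_points(headline_phrases, body_phrases, main_product_info):
    """Return body matches if any, otherwise the headline features (assumption)."""
    head = headline_features(headline_phrases, main_product_info)
    body = body_features(head, body_phrases)
    return body or head


# Hand-segmented phrases loosely based on the FIG. 3 article (assumed segmentation).
headline = ["衛星放送チューナーを", "２台", "内蔵した", "テレビを", "発売", "シャープ"]
body = ["シャープは", "業界で", "初めて", "衛星放送チューナーを", "２台",
        "内蔵した", "２９型カラーテレビを", "発売"]
main_info = {"テレビ", "発売", "シャープ"}  # assumed output of the template step
print(selling_points(headline, body, main_info))
```

Keeping particle removal separate from matching mirrors the separate "particle removal means" and "matching means" in the text, so either step could be swapped for a real analyzer without touching the other.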

(2) An information extraction apparatus according to the present invention includes: article input means; dependency extraction means for extracting main product information from the input article information using dependency analysis; tag pattern matching means for performing pattern matching based on the tags attached to the input article information; article dividing means for dividing the article into a headline and a body based on the pattern matching result; headline morphological analysis means for morphologically analyzing the headline of the divided article; headline particle removal means for removing particles from the phrases of the morphologically analyzed headline; headline feature information matching means for matching the main product information extracted by the dependency analysis against the headline phrases from which particles have been removed after morphological analysis; headline feature information extraction means for extracting, as headline feature information, the phrases that indicate information other than the main product information; body morphological analysis means for morphologically analyzing the body of the divided article; body particle removal means for removing particles from the phrases of the morphologically analyzed body; body feature information matching means for matching the headline feature information against the body phrases from which particles have been removed after morphological analysis; body feature information extraction means for extracting the matched body feature information; and selling-point information output means for outputting the headline feature information or the body feature information as selling-point information.

With this configuration, an article is input; main product information is extracted from the input article information using dependency analysis; the article is divided into a headline and a body; the headline is morphologically analyzed; the main product information extracted by the dependency analysis is matched against the headline phrases and the headline feature information is extracted; the body is morphologically analyzed; the headline feature information is matched against the body phrases and the matched body feature information is extracted; and the headline feature information or the body feature information is output as selling-point information. Extraction is therefore simple and accurate even for complicated, less formulaic text. In addition, information of predetermined types, and important information related to it, can be extracted from a document quickly and easily in concise wording.

(3) An information extraction apparatus according to the present invention includes: article input means; template extraction means for extracting main product information from the input article information using templates whose extraction-accuracy weight is at or above a given threshold; dependency extraction means for extracting main product information, using dependency analysis, from the article information that the templates did not extract; tag pattern matching means for performing pattern matching based on the tags attached to the input article information; article dividing means for dividing the article into a headline and a body based on the pattern matching result; headline morphological analysis means for morphologically analyzing the headline of the divided article; headline particle removal means for removing particles from the phrases of the morphologically analyzed headline; headline feature information matching means for matching the main product information extracted by the templates or by the dependency analysis against the headline phrases from which particles have been removed after morphological analysis; headline feature information extraction means for extracting, as headline feature information, the phrases that indicate information other than the main product information; body morphological analysis means for morphologically analyzing the body of the divided article; body particle removal means for removing particles from the phrases of the morphologically analyzed body; body feature information matching means for matching the headline feature information against the body phrases from which particles have been removed after morphological analysis; body feature information extraction means for extracting the matched body feature information; and selling-point information output means for outputting the headline feature information or the body feature information as selling-point information.

With this configuration, an article is input; main product information is extracted from the input article information using templates whose extraction-accuracy weight is at or above a given threshold; main product information is extracted by dependency analysis from the article information that the templates did not extract; the article is divided into a headline and a body; the headline is morphologically analyzed; the main product information extracted by the templates or by the dependency analysis is matched against the headline phrases and the headline feature information is extracted; the body is morphologically analyzed; the headline feature information is matched against the body phrases and the matched body feature information is extracted; and the headline feature information or the body feature information is output as selling-point information. Template extraction is thus applied only to highly formulaic sentences, which reduces erroneous extraction by template matching and shortens processing time. Performing template extraction first also lightens the load on dependency analysis and improves overall extraction accuracy.
Moreover, information of predetermined types, and important information related to it, can be extracted from a document quickly and easily in concise wording while high functionality and high performance are maintained.
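The two-stage strategy in (3) can be illustrated with a small sketch: only templates whose weight reaches the threshold are tried, and anything they fail to extract is handed to a dependency-based extractor. Everything concrete here is an assumption made for illustration: the English stand-in sentences, the regular expressions, the weight values, the `threshold` default, and the `dependency_extractor` callback, which stands in for a real dependency analysis.

```python
import re

# Hypothetical templates as (pattern, accuracy weight) pairs. The weights are
# made-up values; the text does not say how they are obtained.
TEMPLATES = [
    (re.compile(r"(?P<maker>\w+) will release (?P<product>[\w ]+) on (?P<date>\w+ \d+)"), 0.9),
    (re.compile(r"(?P<maker>\w+) .* (?P<product>\w+)"), 0.3),  # too loose, low weight
]


def extract_with_templates(sentence, threshold=0.8):
    """Try only templates whose weight is at or above the threshold."""
    for pattern, weight in TEMPLATES:
        if weight < threshold:
            continue
        match = pattern.search(sentence)
        if match:
            return match.groupdict()
    return None


def extract_main_info(sentence, dependency_extractor, threshold=0.8):
    """Template extraction first; fall back to dependency-based extraction."""
    result = extract_with_templates(sentence, threshold)
    if result is not None:
        return result, "template"
    return dependency_extractor(sentence), "dependency"


def fake_dependency_extractor(sentence):
    # Placeholder for a real dependency analysis; returns a fixed answer.
    return {"price": "220,000 yen"}


print(extract_main_info("Sharp will release Twin BS on April 15", fake_dependency_extractor))
print(extract_main_info("Priced at 220,000 yen.", fake_dependency_extractor))
```

Filtering by weight before matching means the loose second template never fires at the default threshold, which is the mechanism the text credits with reducing erroneous template matches.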

(4) Where necessary, the information extraction apparatus according to the present invention includes: body dependency relation analysis means for examining the dependency relations in the morphologically analyzed body; supplementary explanation extraction means for extracting modifiers as supplementary explanation information based on the dependency relations; and selling-point information output means for outputting the supplementary explanation information as selling-point information.
With this configuration, the dependency relations in the morphologically analyzed body are examined, modifiers are extracted as supplementary explanation information based on those relations, and the supplementary explanation information is output as selling-point information. Therefore, not only information of predetermined types and important information related to it, but also information that is not of a predetermined type yet seems interesting or important, can be extracted quickly and easily in concise wording. More detailed product information can also be obtained from a small amount of information.
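One way to picture the modifier extraction in (4) is to walk a dependency structure and collect the phrases that modify a feature phrase, directly or through a chain. The sketch below assumes a hypothetical `Phrase` record holding the index of the phrase it modifies, and a hand-built parse of part of the FIG. 3 sentence. A real system would take this structure from a dependency parser, and the text does not state whether indirect modifiers are included, so following chains is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    text: str
    head: int  # index of the phrase this one modifies; -1 for the root


def supplementary_phrases(phrases, feature_indices):
    """Collect non-feature phrases that modify a feature phrase, directly or via a chain."""
    features = set(feature_indices)
    found = []
    for i, phrase in enumerate(phrases):
        if i in features:
            continue
        seen = set()
        j = phrase.head
        while j != -1 and j not in seen:
            if j in features:
                found.append(phrase.text)
                break
            seen.add(j)  # guards against malformed, cyclic input
            j = phrases[j].head
    return found


# Hand-built dependency structure (assumed, not parser output).
example = [
    Phrase("業界で", 1),
    Phrase("初めて", 4),
    Phrase("衛星放送チューナーを", 4),
    Phrase("２台", 4),
    Phrase("内蔵した", 5),
    Phrase("テレビを", 6),
    Phrase("発売", -1),
]
# Indices 2, 3 and 4 are the headline feature phrases found in the body.
print("".join(supplementary_phrases(example, [2, 3, 4])))
```

With this hand-built parse the collected modifiers join to 「業界で初めて」 ("first in the industry"), the supplementary information given for FIG. 3.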

(5) Where necessary, in the information extraction apparatus according to the present invention, the template extraction means includes: article sentence dividing means for dividing the input article at each period; A-template matching means for matching against the A template set, which corresponds to the first line of the article; A-template extraction means for extracting the product feature information matched by the A template set; constraint check means for checking the extracted product feature information item by item; template ID storage means for storing the IDs of the templates that succeeded in extracting information; B-template matching means for matching against the B template set, which corresponds to the second and subsequent lines of the article; B-template extraction means for extracting the product feature information matched by the B template set; and template product association means for associating each extraction solution, that is, the extracted product feature information, with a product.

With this configuration, the input article is divided at each period; the first line of the article is matched against the A template set and the matched product feature information is extracted; the constraints on the extracted product feature information are checked item by item; the IDs of the templates that succeeded in extracting information are stored; the second and subsequent lines are matched against the B template set and the matched product feature information is extracted; and each extraction solution, that is, the extracted product feature information, is associated with a product. For extraction from documents in a limited field with simple sentence structure, the apparatus therefore does not analyze the syntactic elements of the whole text but recognizes specific patterns in the surface word sequence, so extraction is simple and fast.
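The A/B template arrangement in (5) might look like the following sketch, assuming regular expressions as templates. The template IDs `A1` and `B1`, both patterns, the date constraint, and the simplified sample sentence are all hypothetical; the text does not give the actual templates or constraints.

```python
import re

# Hypothetical template sets keyed by template ID.
A_TEMPLATES = {  # applied to the first sentence
    "A1": re.compile(r"(?P<maker>.+?)は、(?P<product>.+?)を(?P<date>\d+月\d+日)発売する"),
}
B_TEMPLATES = {  # applied to the second and later sentences
    "B1": re.compile(r"価格は(?P<price>[\d,]+円)"),
}


def split_sentences(text):
    """Divide the article at each Japanese period, dropping empty pieces."""
    return [s for s in text.split("。") if s]


def check_constraints(fields):
    """Hypothetical per-item constraint: a date must look like '8月21日'."""
    date = fields.get("date")
    return date is None or re.fullmatch(r"\d+月\d+日", date) is not None


def extract_by_templates(text):
    """Return the extracted fields and the IDs of the templates that matched."""
    fields, used_ids = {}, []
    for index, sentence in enumerate(split_sentences(text)):
        templates = A_TEMPLATES if index == 0 else B_TEMPLATES
        for template_id, pattern in templates.items():
            match = pattern.search(sentence)
            if match and check_constraints(match.groupdict()):
                fields.update(match.groupdict())
                used_ids.append(template_id)
                break
    return fields, used_ids


# Simplified sentence modeled on the FIG. 1 article (not the original text).
article = "三菱電機は、石油ファンヒーターを8月21日発売する。価格は39,000円。"
print(extract_by_templates(article))
```

Splitting the templates by sentence position keeps each set small, which fits the stated aim of avoiding a match against one large template collection.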

(6) Where necessary, in the information extraction apparatus according to the present invention, the dependency extraction means includes: dependency tag pattern matching means for performing pattern matching based on the tags attached to the input article; dependency tag dividing means for dividing the article into a headline and a body based on the pattern matching result; headline analysis means for analyzing special symbols contained in the divided headline; headline processing means for treating the words that follow a special symbol in the headline as "vendor" information; body sentence dividing means for dividing the divided body at each period; parenthesized value determination means for determining whether the body contains a numerical value in parentheses; phrase information creation means for creating phrase information by syntactic parsing when a parenthesized value is determined to exist; fixed pattern determination means for determining whether a fixed pattern exists when no parenthesized value is determined to exist; fixed pattern dependency creation means for creating, when a fixed pattern is determined to exist, the set of phrase information that depends on the fixed pattern, using the fixed pattern and the dependency information for it obtained from the phrase information; extraction means for extracting phrase information from that set according to the conditions defined for the fixed pattern and for each format; extraction solution creation means for deleting unnecessary information from the extracted phrase information to create an extraction solution; and dependency association and allocation means for associating and assigning extraction solutions to products.

With this configuration, pattern matching is performed based on the tags attached to the input article, the article is divided into a headline and a body based on the result, the special symbols in the headline are analyzed, and the words that follow a special symbol in the headline are treated as "vendor" information. The body is divided at each period, and it is determined whether the body contains a numerical value in parentheses. If one is determined to exist, phrase information is created by syntactic parsing; if not, it is determined whether a fixed pattern exists, and if one does, the set of phrase information that depends on the fixed pattern is created from the fixed pattern and the dependency information for it obtained from the phrase information.
Phrase information is then extracted from that set according to the conditions defined for the fixed pattern and for each format, unnecessary information is deleted from the extracted phrase information to create an extraction solution, and extraction solutions are associated with and assigned to products. Extraction is thus based on the observation that the information to be extracted depends on a particular phrase, so it works even when the sentence structure is complicated and the surface word sequence shows no specific pattern.
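The first decisions in (6) are simple enough to sketch: reading the vendor from the text after the headline's special symbol, and routing each body sentence by whether it contains a parenthesized number or a fixed pattern. This is an illustration only. The separator `−−` is taken from the sample headlines in the Japanese original, while the list of fixed patterns and the `"skip"` outcome for sentences matching neither test are assumptions, since the text does not say what happens in that case.

```python
import re

HEADLINE_MARK = "−−"  # special symbol seen in the sample headlines


def vendor_from_headline(headline):
    """Return the words after the last special symbol, or None if absent."""
    if HEADLINE_MARK not in headline:
        return None
    return headline.rsplit(HEADLINE_MARK, 1)[1].strip() or None


def has_parenthesized_number(sentence):
    """True if a digit appears inside half-width or full-width parentheses."""
    return re.search(r"[（(][^）)]*\d[^）)]*[）)]", sentence) is not None


def choose_route(sentence, fixed_patterns):
    """Decide how a body sentence is processed, following the order in the text."""
    if has_parenthesized_number(sentence):
        return "parse"  # create phrase information by syntactic parsing
    if any(pattern in sentence for pattern in fixed_patterns):
        return "fixed_pattern"  # build the phrase set that depends on the pattern
    return "skip"  # assumption: the text does not cover this case


FIXED_PATTERNS = ["発売する"]  # hypothetical fixed pattern
headline = "［ビジネス情報］ニューロ制御のヒーター" + HEADLINE_MARK + "三菱電機"
print(vendor_from_headline(headline))
print(choose_route("新型テレビを発売する", FIXED_PATTERNS))
```

Using `rsplit` with a limit of 1 takes the text after the last separator, which matches the sample headlines where the vendor name comes at the very end.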

(7) An information extraction method according to the present invention includes: an article input step; a template extraction step of extracting main product information from the input article information using templates; a dependency extraction step of extracting main product information, using dependency analysis, from article information that was not extracted by the templates; a tag pattern matching step of performing pattern matching based on the tags attached to the input article information; an article division step of dividing the article into a headline and a body based on the pattern matching result; a headline morphological analysis step of morphologically analyzing the headline of the divided article; a headline particle removal step of removing particles from the clauses of the morphologically analyzed headline; a headline feature information matching step of matching the main product information extracted by the templates or by dependency analysis against the headline clauses from which particles have been removed after morphological analysis; a headline feature information extraction step of extracting, as headline feature information, the clauses that indicate information other than the main product information; a body morphological analysis step of morphologically analyzing the body of the divided article; a body particle removal step of removing particles from the clauses of the morphologically analyzed body; a body feature information matching step of matching the headline feature information against the body clauses from which particles have been removed after morphological analysis; a body feature information extraction step of extracting the feature information of the matched body; a body dependency analysis step of examining the dependency relations of the morphologically analyzed body; a supplementary explanation extraction step of extracting modifying phrases as supplementary explanation information based on the dependency relations; and a selling-point information output step of outputting the headline feature information, the body feature information, or the supplementary explanation information as selling-point information.

(8) An information extraction program according to the present invention causes a computer to function as: an article input procedure; a template extraction procedure of extracting main product information from the input article information using templates; a dependency extraction procedure of extracting main product information, using dependency analysis, from article information that was not extracted by the templates; a tag pattern matching procedure of performing pattern matching based on the tags attached to the input article information; an article division procedure of dividing the article into a headline and a body based on the pattern matching result; a headline morphological analysis procedure of morphologically analyzing the headline of the divided article; a headline particle removal procedure of removing particles from the clauses of the morphologically analyzed headline; a headline feature information matching procedure of matching the main product information extracted by the templates or by dependency analysis against the headline clauses from which particles have been removed after morphological analysis; a headline feature information extraction procedure of extracting, as headline feature information, the clauses that indicate information other than the main product information; a body morphological analysis procedure of morphologically analyzing the body of the divided article; a body particle removal procedure of removing particles from the clauses of the morphologically analyzed body; a body feature information matching procedure of matching the headline feature information against the body clauses from which particles have been removed after morphological analysis; a body feature information extraction procedure of extracting the feature information of the matched body; a body dependency analysis procedure of examining the dependency relations of the morphologically analyzed body; a supplementary explanation extraction procedure of extracting modifying phrases as supplementary explanation information based on the dependency relations; and a selling-point information output procedure of outputting the headline feature information, the body feature information, or the supplementary explanation information as selling-point information.
The above summary of the invention does not enumerate all of the features essential to the present invention; sub-combinations of these features may also constitute inventions.

The present invention can be implemented in many different forms and should not be construed as limited to the embodiments described below. The embodiments mainly describe an apparatus, but as will be apparent to those skilled in the art, the present invention can also be implemented as a program usable on a computer. The present invention can be implemented in hardware, in software, or in a combination of software and hardware. The program can be recorded on any computer-readable medium, such as a hard disk, CD-ROM, DVD-ROM, optical storage device, or magnetic storage device. The program can also be recorded on another computer connected via a network.

(First embodiment of the present invention)
[1. Hardware configuration]
FIG. 4 shows the hardware configuration of the information extraction apparatus according to the embodiment of the present invention. The computer 1 includes, for example, a CPU (Central Processing Unit) 2, a main memory 3, an HDD (Hard Disk Drive) 4, a video card 5, a mouse 6, a keyboard 7, and an optical disk 8. This embodiment describes an example in which the information extraction apparatus is built on a single computer, as shown in FIG. 4. However, it will be apparent to those skilled in the art that a WWW system consisting of a web browser apparatus as the client and a WWW server as the server can be used so that the information extraction function is available from the browser apparatus. For example, the information extraction function can be implemented as an add-in on the WWW server, or as a plug-in in the web browser. It is also possible to implement part of the information extraction function as an add-in on the WWW server and the remainder as a plug-in in the web browser.

[2. Block configuration]
FIG. 5 is a block diagram of the information extraction apparatus according to the embodiment of the present invention. The information extraction apparatus includes an input unit 10, a template extraction unit 20, a dependency extraction unit 30, a tag pattern matching unit 40, an article division unit 50, a headline morphological analysis unit 60, a headline particle removal unit 70, a headline feature information matching unit 80, a headline feature information extraction unit 90, a body morphological analysis unit 100, a body particle removal unit 110, a body feature information matching unit 120, a body feature information extraction unit 130, a body dependency analysis unit 140, a supplementary explanation extraction unit 150, and a selling-point information output 160.
The tag pattern matching unit 40 attaches tags to the article and performs pattern matching on the tags. The article division unit 50 divides the article into a headline and a body according to the result of the tag pattern matching. The headline morphological analysis unit 60 morphologically analyzes the divided headline. The morphological analysis system JUMAN can be used for this purpose. JUMAN is a system for Japanese morphological analysis: it takes Japanese text as input, segments the input sentence into words, and determines each morpheme. FIG. 6 shows a morphological analysis result in the information extraction apparatus according to the embodiment of the present invention. The input sentence is "Sega Enterprises has released "Lock-On", a toy that lets players experience a simulated ray-gun battle in the style of a science fiction film".
The headline particle removal unit 70 removes the particles contained in the headline after morphological analysis. The headline feature information matching unit 80 matches the product information extracted by the templates and by dependency analysis against the feature information contained in the headline. The body morphological analysis unit 100 and the body particle removal unit 110 operate in the same way as the morphological analysis and particle removal for the headline. The body feature information matching unit 120 matches the body, from which particles have been removed after morphological analysis, against the headline feature information. The body feature information extraction unit 130 extracts feature information from the matched body.
The body dependency analysis unit 140 examines the dependency relations of the body using a dependency analyzer. The syntactic analysis system KNP can be used for this purpose. KNP is a system for Japanese syntactic analysis: it takes the output of JUMAN as input, groups the morphemes into clauses, and determines the dependency relations between clauses. FIG. 7 shows a dependency analysis result in the information extraction apparatus according to the embodiment of the present invention. The input sentence is the same as in FIG. 6. The supplementary explanation extraction unit 150 extracts modifying phrases as supplementary explanation information based on the result of the dependency analysis. The selling-point information output 160 outputs the extracted feature information.

[3. Operation]
FIG. 8 is a flowchart of the processing in the information extraction apparatus according to the embodiment of the present invention. First, the input unit 10 inputs an article (S100). The template extraction unit 20 then performs extraction using templates (S200); this processing is described later. It is then determined whether extraction succeeded (S300). So that template extraction is applied only to highly formulaic sentences, each template is given a weight based on its extraction accuracy, and only templates whose weight exceeds a certain threshold are used. With these weights, template extraction is performed only on sentences for which extraction accuracy is high, which reduces erroneous extractions in template matching and shortens processing time. Performing template extraction first also reduces the load on dependency analysis and improves overall extraction accuracy. If extraction did not succeed, the dependency extraction unit 30 performs extraction by dependency analysis (S400); this processing is also described later.
The input article has tags attached in advance to the parts corresponding to the index number, the headline, and the body, so the tag pattern matching unit 40 performs pattern matching on the tags (S500). The article division unit 50 then divides the article into a headline and a body (S600). The headline morphological analysis unit 60 morphologically analyzes the headline and segments it into clauses (S700). The headline particle removal unit 70 removes the particles from each clause (S800). The headline feature information matching unit 80 matches the main product information extracted using the templates and dependency analysis against the headline clauses (S900). The headline feature information extraction unit 90 extracts, as headline feature information, the clauses that indicate information other than the main product information (S1000).
For the divided body, the body morphological analysis unit 100 morphologically analyzes the body and segments it into clauses (S1100), and the body particle removal unit 110 removes the particles (S1200). The body feature information matching unit 120 matches each clause against the headline feature information (S1300). The body feature information extraction unit 130 extracts from the body the clauses that indicate the headline feature information (S1400). A synonym dictionary can be consulted in order to extract phrases that are synonyms of the headline feature information. The body dependency analysis unit 140 parses the body of the input article and examines the dependency relations of the clauses extracted from the body (S1500). If there is a phrase that modifies an extracted clause, the supplementary explanation extraction unit 150 extracts it as supplementary information for the feature information (S1600). The selling-point information output 160 outputs the selling-point information (S1700). The selling-point information can be output all together, or the headline feature information, body feature information, and supplementary explanation information can be output individually as needed.
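Steps S800 to S1400 above can be illustrated with a minimal Python sketch. It assumes that morphological analysis has already been done (for example by JUMAN) and that each clause arrives as a list of (surface form, part of speech) pairs; this representation, the part-of-speech label "particle", the function names, and the use of exact string equality for matching are all assumptions of the sketch, not details given in the description.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]   # (surface form, part of speech); hypothetical representation
Clause = List[Morpheme]


def strip_particles(clause: Clause) -> str:
    """Join the surface forms of one clause, dropping morphemes tagged as particles."""
    return "".join(surface for surface, pos in clause if pos != "particle")


def headline_features(headline: List[Clause], product_info: List[str]) -> List[str]:
    """Headline clauses that do not match the main product information
    are kept as headline feature information."""
    stripped = [strip_particles(clause) for clause in headline]
    return [text for text in stripped if text and text not in product_info]


def body_features(body: List[Clause], features: List[str]) -> List[str]:
    """Body clauses that match a headline feature are kept as body feature information."""
    stripped = [strip_particles(clause) for clause in body]
    return [text for text in stripped if text in features]


headline = [
    [("セガ", "noun"), ("が", "particle")],
    [("光線銃", "noun"), ("の", "particle")],
    [("おもちゃ", "noun"), ("を", "particle")],
]
body = [[("光線銃", "noun"), ("で", "particle")], [("遊ぶ", "verb")]]

features = headline_features(headline, product_info=["セガ", "おもちゃ"])
matched = body_features(body, features)
```

A synonym dictionary, as mentioned for S1400, could be added by expanding `features` before the body is matched.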

[4. Extraction by template]
The template extraction processing is described in detail here.
FIG. 9 is a block diagram of template extraction in the information extraction apparatus according to the embodiment of the present invention. The template extraction unit 20 includes an article full-stop division unit 210, an A template matching unit 220, an A template extraction unit 230, a constraint check unit 240, a template ID storage unit 250, a B template matching unit 260, a B template extraction unit 270, and a template correspondence/assignment unit 280. The article full-stop division unit 210 divides the input article at each full stop. The A template matching unit 220 performs matching using the templates corresponding to the first line of the article. The A template extraction unit 230 performs extraction from the matches obtained with the templates corresponding to the first line.
The constraint check unit 240 performs a constraint check on each extraction item. The constraints on extraction items are of two kinds: constraints that depend on the extraction item and constraints that do not. Constraints that depend on the extraction item are defined for items such as "product type", "seller", "price", and "sale date". For example, for strings extracted by a template for "product type", such constraints include "remove character strings enclosed in round parentheses", "remove solution candidates whose word boundaries are wrong", and "remove a candidate if its part-of-speech sequence contains any part of speech other than "noun", "prefix", and the like". Constraints that do not depend on the extraction item eliminate phrases that are clearly meaningless regardless of the nature of the item. Examples are "are the parentheses balanced?", "does the string begin with a comma?", and "does the string begin with a character that is prohibited at the start of a line, such as "ゃ" or "ぁ"?".
The template ID storage unit 250 stores the ID of the template used for the extraction. The B template matching unit 260 performs matching using the templates corresponding to the second and subsequent lines. The B template extraction unit 270 performs extraction from the matches obtained with the templates corresponding to the second and subsequent lines.
The template correspondence/assignment unit 280 establishes the correspondences within the extraction solution. These correspondences are established according to the pattern in which the extraction items appear, based on relationships between extraction items that are set in advance. For example, when "product name" or "product sub-category" appears several times and "price" and "release date" each appear once, the multiple items correspond to the single items. A concrete case is one in which several products, such as beer and juice, are sold for 110 yen on August 13. When "price" and "sale date" also appear several times, the multiple items may correspond to one another. Furthermore, when "product name" appears once and "product sub-category" appears several times, the "product name" corresponds to every "product sub-category". Because such items stand in a superordinate/subordinate relationship, a correspondence between them is set in advance and used for the association.
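The item-independent constraints and the simplest correspondence rule described above can be sketched in Python as follows. The bracket pairs, the set of prohibited leading characters beyond "ゃ" and "ぁ", and all function names are assumptions of this sketch; the description does not list them exhaustively.

```python
PAIRS = {"(": ")", "（": "）", "「": "」", "『": "』"}   # assumed bracket pairs
CLOSERS = set(PAIRS.values())
LEADING_COMMAS = ("、", "，", ",")
# The description names only "ゃ" and "ぁ"; the other small kana are an assumption.
PROHIBITED_START = set("ゃゅょぁぃぅぇぉっ")


def brackets_balanced(text):
    """Return True when every opening bracket has a matching closing bracket."""
    stack = []
    for ch in text:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return False
    return not stack


def passes_generic_constraints(candidate):
    """Item-independent checks: balanced brackets, no leading comma,
    no prohibited leading character."""
    if not candidate:
        return False
    if not brackets_balanced(candidate):
        return False
    if candidate.startswith(LEADING_COMMAS):
        return False
    return candidate[0] not in PROHIBITED_START


def associate(names, prices, dates):
    """Pair each product name with a price and a date.
    A single price or date is shared by all names; equal-length lists are
    paired one to one; any other case is left as None in this sketch."""
    def pick(values, i):
        if len(values) == 1:
            return values[0]
        if len(values) == len(names):
            return values[i]
        return None
    return [(name, pick(prices, i), pick(dates, i)) for i, name in enumerate(names)]


rows = associate(["ビール", "ジュース"], ["110円"], ["8月13日"])
```

The item-dependent constraints (for example the part-of-speech filter for "product type") need morphological analysis output and are therefore left out here.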

The definition of a template is as follows. The templates used for product information extraction are derived from data in which tags have been attached to the items to be extracted in new-product introduction articles, which are the subject of the experiments. Each template retains three kinds of information: (1) the expressions that frequently appear in sentences containing extraction items, (2) the morphemes immediately before and after each extraction item, and (3) the type of each extraction item. The extraction items include, for example, "seller", "product type", "product name", "product sub-category", "price", and "release date". The preprocessing carried out in advance for template extraction, namely template creation and template weighting, is described below.

FIG. 10 is an explanatory diagram of an example of tagged data for analysis in the information extraction apparatus according to the embodiment of the present invention. For an article that introduces several products, as in this example, the correspondence between extraction items for each product is attached as a list of the form "I=<...>-<...>...". Specifically, as shown in FIG. 10, for the sentence "<c1>Fujitsu General</c1> will release the <k1>29-inch television with built-in satellite broadcast (BS) tuner</k1> "<n1>BS-29M55</n1>" and "<n2>BS-29M50</n2>", which use a horizontally long screen with a wide feel, on <d1>September 2</d1>", the list is I=<c1,d1,k1,n1,0,0>-<c1,d1,k1,n2,0,0>.
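The tagged data and the correspondence list of FIG. 10 can be read with a short Python sketch such as the one below. The function names are hypothetical, and treating "0" as an empty slot is an interpretation of the example rather than something the description states outright.

```python
import re

TAG_RE = re.compile(r"<([a-z]\d+)>(.*?)</\1>")


def tagged_items(text):
    """Map each tag ID (for example 'c1') to the text it encloses."""
    return {m.group(1): m.group(2) for m in TAG_RE.finditer(text)}


def parse_correspondence(spec):
    """Parse 'I=<c1,d1,k1,n1,0,0>-<...>' into tuples of tag IDs.
    A field equal to '0' is treated as an empty slot (None)."""
    body = spec.split("=", 1)[1]
    rows = []
    for group in re.findall(r"<([^<>]*)>", body):
        fields = [f.strip() for f in group.split(",")]
        rows.append(tuple(None if f == "0" else f for f in fields))
    return rows


def resolve(text, spec):
    """Replace every tag ID in the correspondence list with its tagged text."""
    items = tagged_items(text)
    return [tuple(items.get(tag) if tag else None for tag in row)
            for row in parse_correspondence(spec)]


sample = ("<c1>富士通ゼネラル</c1>は横長でワイド感のある画面を用いた"
          "<k1>２９型衛星放送（ＢＳ）内蔵テレビ</k1>"
          "「<n1>BS-29M55</n1>」と「<n2>BS-29M50</n2>」を"
          "<d1>９月２日</d1>に発売する")
products = resolve(sample, "I=<c1,d1,k1,n1,0,0>-<c1,d1,k1,n2,0,0>")
```

Each resulting tuple describes one product, with the shared seller, date, and product type repeated in every row.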

Next, using the tagged data, the expressions that frequently appear in sentences containing extraction items are collected and defined as "fixed patterns". A template is created by keeping the fixed pattern and the one morpheme immediately before and after each tag, and replacing everything else with wildcards that match any character string. The morphological analyzer JUMAN can be used to segment the morphemes. The procedure for creating a template from the tagged data above is as follows. First, each "<tag>...</tag>" span in the tagged data is replaced with the extraction item that the tag represents. For example: "{Seller 1}は横長でワイド感のある画面を用いた{Product Type 1}「{Product Name 1}」と「{Product Name 2}」を{Release Date 1}に発売する。" (in English: {Seller 1} will release {Product Type 1} "{Product Name 1}" and "{Product Name 2}", which use a horizontally long screen with a wide feel, on {Release Date 1}). Next, everything other than the one morpheme before and after each extraction item and the fixed pattern ("発売する。", "will release") is replaced with a wildcard. For example: "{Seller 1}は＊用いた{Product Type 1}「{Product Name 1}」と「{Product Name 2}」を{Release Date 1}に発売する。". When a single item takes several values within one sentence, as in this example, the following correspondence list is attached to the completed template. The list is not needed when the sentence covers a single product.
{Seller 1}-{Product Type 1}-{Product Name 1}-{Release Date 1}
{Seller 1}-{Product Type 1}-{Product Name 2}-{Release Date 1}
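The first step of template creation (replacing tags with slots) and the use of a finished template for matching can be sketched in Python as follows. The ASCII slot labels, the helper names, and the choice to compile a template into a regular expression are assumptions of this sketch. The wildcarding step itself is not automated here, because it needs the morpheme boundaries produced by a morphological analyzer; the template string below is written by hand from the example.

```python
import re

# ASCII slot labels chosen for this sketch; the tag letters follow the FIG. 10 example.
SLOT_NAMES = {"c": "seller", "k": "type", "n": "name", "d": "date"}


def tags_to_slots(tagged):
    """Replace each '<c1>...</c1>' span with a slot such as '{seller1}'."""
    def repl(m):
        return "{" + SLOT_NAMES[m.group(1)] + m.group(2) + "}"
    return re.sub(r"<([ckdn])(\d+)>.*?</\1\2>", repl, tagged)


def template_to_regex(template):
    """Compile a template: '*' matches any string (possibly empty) and
    '{slot}' captures a non-empty string under the slot's name."""
    parts = re.split(r"(\*|\{\w+\})", template)
    pieces = []
    for part in parts:
        if part == "*":
            pieces.append(".*?")
        elif part.startswith("{") and part.endswith("}"):
            pieces.append("(?P<%s>.+?)" % part[1:-1])
        else:
            pieces.append(re.escape(part))
    return re.compile("^" + "".join(pieces) + "$")


template = "{seller1}は*用いた{type1}「{name1}」と「{name2}」を{date1}に発売する。"
sentence = ("富士通ゼネラルは横長でワイド感のある画面を用いた"
            "２９型衛星放送（ＢＳ）内蔵テレビ「BS-29M55」と「BS-29M50」を"
            "９月２日に発売する。")
match = template_to_regex(template).match(sentence)
extracted = match.groupdict() if match else {}
```

Lazy quantifiers keep each slot as short as possible, which is one simple way to resolve ambiguity; the weighting scheme described in the text is a separate mechanism for choosing between templates.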

Next, the prioritization of templates is described. In actual extraction, a set of templates is matched against an article, so the solution is rarely unique and several solution candidates exist. The candidates are ranked, and the one with the highest priority is taken as the information extracted from the article. Ranking can be performed over all candidates after every sentence of the article has been matched, or sentence by sentence. The sentence-by-sentence case is described here. Ranking uses weights that are assigned to the templates in advance. The weights are assigned by the following simple method. (1) Each template in the template set is matched against the template creation data using the information extraction system built here, and the number of sentences matched and the number of sentences that matched and from which information was successfully extracted are recorded. (2) The weight is determined by the following formula.
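How precomputed weights could be used to choose among solution candidates for one sentence is sketched below in Python. The weight formula itself is not reproduced in this text, so the sketch takes the weights as given input rather than computing them; the function name, the data layout, and the tie-breaking rule (the earlier candidate wins) are assumptions.

```python
def best_candidate(candidates, weights, threshold=0.0):
    """Return the extraction produced by the highest-weight template.

    candidates: list of (template_id, extracted_items) for one sentence
    weights:    dict mapping template_id to its precomputed weight
    threshold:  templates whose weight does not exceed this value are ignored
    Returns None when no candidate qualifies; ties keep the earlier candidate.
    """
    best = None
    best_weight = threshold
    for template_id, extracted in candidates:
        weight = weights.get(template_id, 0.0)
        if weight > best_weight:
            best, best_weight = extracted, weight
    return best


candidates = [("t1", {"name": "X"}), ("t2", {"name": "Y"}), ("t3", {"name": "Z"})]
weights = {"t1": 0.4, "t2": 0.9, "t3": 0.9}   # illustrative values only
chosen = best_candidate(candidates, weights, threshold=0.5)
```

The threshold corresponds to the idea, stated for step S300, that only templates whose weight exceeds a certain value are used.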

An example of weighting is described for the following example article.

Suppose that there are two templates, <s1> and <s2>, that <s1> is * and <p1> is p1, and that the frequencies are 20 matches and 10 correct answers returned. Since four pieces of information are extracted (s1, s2, s1, and p1), the following expression is obtained.

In general, many cases take the form s1, s2, p1, p2 or s1, p1, s2, p2, but the forms vary. For example, s1 may be further subdivided so that s3, p3, s4, and p4 are embedded in it, as in s1(s3, p3, s4, p4), s2, p2, and there may be no p1 corresponding to s1 in the first place. Even in such cases the above expression can be extended and used, for example by using a value calculated from s3, p3, s4, and p4.
Templates defined in this way are created in large numbers from the training data and are matched against the experimental input data to perform extraction. The template set is classified in advance into "the set of templates created from the first sentence of the template creation data" (referred to as the A templates) and "the set of templates created from the second and subsequent sentences" (referred to as the B templates).

FIG. 11 is a flowchart of the template extraction processing in the information extraction apparatus according to the embodiment of the present invention. First, the article full-stop division unit 210 divides the input article at each full stop (S201). The A template matching unit 220 matches the first sentence of the body against "the set of templates created from the first sentence" (the A templates) (S202). The A template extraction unit 230 performs extraction from the match (S203). The constraint check unit 240 performs a constraint check, for each extraction item, on the character strings extracted by the template (S204). When all constraint checks are passed, the template ID storage unit 250 stores the ID of the template set to which the template that extracted the information belongs (S205). It is then determined whether there is a next sentence (S206). If there is, the B template matching unit 260 matches the next sentence against the templates in "the set of templates created from the second and subsequent sentences" that are subordinate to the stored template ID (S207). The B template extraction unit 270 performs extraction from the match (S208). The processing then returns to the constraint check that the constraint check unit 240 performs, for each extraction item, on the character strings extracted by the template (S204).
If there is no next sentence, the template correspondence/assignment unit 280 associates the extraction solution with products according to the correspondence list attached to the template (S209).
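The control flow of S201 to S208 can be sketched in Python as follows. The sketch assumes that templates are already available as compiled regular expressions with named groups, that A templates carry the ID of their template set, and that B templates are looked up by that ID. It also assumes that processing stops when no A template succeeds on the first sentence, a case the flowchart description does not spell out. All names are hypothetical.

```python
import re


def split_sentences(article):
    """S201: divide the article at each full stop, keeping the full stop."""
    return [s + "。" for s in article.split("。") if s.strip()]


def extract_with_templates(sentences, a_templates, b_templates, check):
    """Apply A templates to the first sentence and the B templates of the
    stored template set to the remaining sentences.

    a_templates: list of (set_id, compiled regex)
    b_templates: dict mapping set_id to a list of compiled regexes
    check:       constraint check applied to every extracted string
    """
    results = []
    set_id = None
    for index, sentence in enumerate(sentences):
        if index == 0:
            pool = a_templates                                  # S202
        elif set_id is None:
            break                                               # no A template succeeded
        else:
            pool = [(set_id, rx) for rx in b_templates.get(set_id, [])]   # S207
        for sid, rx in pool:
            m = rx.search(sentence)                             # S203 / S208
            if m and all(check(v) for v in m.groupdict().values()):      # S204
                results.append(m.groupdict())
                set_id = sid                                    # S205
                break
    return results


a_templates = [("T1", re.compile(r"^(?P<seller>.+?)は(?P<name>.+?)を発売する。$"))]
b_templates = {"T1": [re.compile(r"価格は(?P<price>.+?)。$")]}
article = "A社はXを発売する。価格は110円。"
found = extract_with_templates(split_sentences(article), a_templates, b_templates, check=bool)
```

The final association of the extraction solution with products (S209) would then be applied to `found` using the correspondence list attached to the matching template.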

[5. Extraction by dependency analysis]
Extraction by dependency analysis is described below. FIG. 12 is a block diagram of extraction by dependency analysis in the information extraction apparatus according to the embodiment of the present invention. The dependency extraction unit 30 includes a dependency tag pattern matching unit 310, a dependency tag division unit 320, a headline analysis unit 330, a headline processing unit 340, a body full-stop division unit 350, a parenthesized numerical value determination unit 360, a clause information creation unit 370, a fixed pattern determination unit 380, a fixed pattern dependency creation unit 390, a dependency extraction unit 400, an extraction solution creation unit 410, and a dependency correspondence/assignment unit 420. The Japanese syntactic analysis system KNP described above can be used for parsing.
The dependency tag pattern matching unit 310 attaches tags to the article and performs pattern matching on the tags. The dependency tag division unit 320 classifies the article into headline, body, and end of article according to the result of the tag pattern matching, and divides it accordingly. The headline analysis unit 330 analyzes the special symbols contained in the headline. The headline processing unit 340 uses the phrase that follows a special symbol as the "seller". The body full-stop division unit 350 divides the body at each full stop. The parenthesized numerical value determination unit 360 determines whether there is a numerical value in parentheses. The clause information creation unit 370 creates clause information by parsing when there is a numerical value in parentheses. The fixed pattern determination unit 380 determines whether there is a fixed pattern when there is no numerical value in parentheses. The fixed pattern dependency creation unit 390 creates the fixed pattern dependency from the clause information and the fixed pattern. The dependency extraction unit 400 extracts the dependency relations. The extraction solution creation unit 410 deletes duplicate elements and the like and creates the extraction solution. The dependency correspondence/assignment unit 420 associates the extraction solution with products and assigns it to them.
To examine the dependencies of the information to be extracted, the clauses that receive that information, called "fixed patterns", and the "case forms" of the extracted information that depends on each fixed pattern are defined from an analysis of the training data.

FIG. 13 is an explanatory diagram of the fixed patterns and case forms in the information extraction apparatus according to the embodiment of the present invention. Judging from items (a) to (c) below, the labels c, d, and kns appear to mark candidates for the distributor, the release date, and the product type, product name, and subcategory, respectively.
The first sentence of an article is handled as follows. For the fixed patterns 「発売」 (release), 「販売」 (sale), 「売り」 (sell), 「チェンジ」 (change), 「開発」 (development), 「改良」 (improvement), 「開始」 (start), 「始め」 (begin), 「発表」 (announcement), 「商品化」 (commercialization), 「製品化」 (productization), 「輸入」 (import), and the publication terms 「発刊」, 「発行」, 「出版」, 「創刊」, and 「刊行」, the case forms regarded as extracted information are "unmarked case (未格), ga-case (ガ格)": c; "kara-case (カラ格), adjacent (隣接), no case (無格), ni-case (ニ格)": d; and "wo-case (ヲ格)": kns. For the fixed patterns 「展開」 (rollout), 「参入」 (market entry), and 「提携」 (partnership), the case forms are "unmarked case, ga-case": c and "kara-case, adjacent, no case, ni-case": d.
For the fixed patterns 「変更」 (change), 「強化」 (strengthening), and 「強調」 (emphasis), the case form is "wo-case <- no-case (ヲ格＜−ノ格)": kns. For 「採用」 (adoption) and 「導入」 (introduction), it is "wo-case <- ni-case (ヲ格＜−ニ格)": kns. For 「追加」 (addition), 「設定」 (setting), and 「加え」 (adding), it is "ni-case, wo-case": kns. For 「搭載」 (mounting) and 「装備」 (equipping), it is "no-case (ノ格) for nominals (体言), ni-case for predicates (用言)": kns.
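The first-sentence rows described above can be sketched as a lookup table. This is a minimal illustration, not the patent's implementation: the list layout and the function name `label_for` are assumptions, and only the two rows with plain case lists are encoded (the rows using the "＜−" notation are left out because the text does not spell out how that notation is evaluated).

```python
# Hypothetical encoding of two first-sentence rows of FIG. 13.
# Case names are kept in Japanese exactly as they appear in the text.
FIRST_SENTENCE_PATTERNS = [
    (
        ("発売", "販売", "売り", "チェンジ", "開発", "改良", "開始", "始め", "発表",
         "商品化", "製品化", "輸入", "発刊", "発行", "出版", "創刊", "刊行"),
        {"未格": "c", "ガ格": "c",
         "カラ格": "d", "隣接": "d", "無格": "d", "ニ格": "d",
         "ヲ格": "kns"},
    ),
    (
        ("展開", "参入", "提携"),
        {"未格": "c", "ガ格": "c",
         "カラ格": "d", "隣接": "d", "無格": "d", "ニ格": "d"},
    ),
]


def label_for(fixed_pattern, case_form):
    """Return the label (c, d, or kns) for a clause that depends on
    fixed_pattern with the given case form, or None if no row applies."""
    for patterns, case_map in FIRST_SENTENCE_PATTERNS:
        if fixed_pattern in patterns:
            return case_map.get(case_form)
    return None
```

For instance, a clause depending on 「発売」 in the wo-case is labeled kns, while the same case form on 「提携」 yields no label because that row defines none.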

The second and subsequent sentences are handled as follows. For the fixed pattern 「［製品数］（文末）」 (number of products at the end of a sentence), the case form regarded as extracted information is "no-case (ノ格), appositive unmarked case (同格未格)": kns. For 「発売（文末）」 (release at the end of a sentence), it is "unmarked case (未格)": kns. For 「発売（文頭）」 (release at the start of a sentence), it is the receiving clause: kns. For 「別売り（ノ格）」 (sold separately, in the no-case), it is the receiving clause: kns. For 「別売り（ノ格以外）」 (sold separately, other than the no-case), it is "de-case (デ格)": kns.

The remaining patterns are handled as follows. For the fixed pattern 「［価格表記］」 (price notation), the case form regarded as extracted information is "ga-case, unmarked case, de-case, adjacent": kns. For 「［「」］」 (corner brackets), it is "appositive adnominal (同格連体), no-case": kns. For 「［製品数］，新製品，新商品，新モデル（等）」 (number of products, new product, new item, new model, and the like), it is "no-case, appositive unmarked case": kns.
The rules for extracting solution candidates are described next. Using the fixed patterns and case forms of FIG. 13, solution candidates are selected from the clauses that depend on a fixed pattern, and detailed rules are defined for each item. Under these rules, new candidates are added to the candidate set obtained from the dependencies and unnecessary candidates are deleted from it, producing the extraction result.

(a) Distributor
The case form of a clause containing the distributor is mainly "unmarked case (未格), ga-case (ガ格)", and a clause satisfying one of the following conditions is taken as a distributor candidate. (1) A clause in the first sentence that depends on an active-voice predicate clause in the unmarked case or the ga-case. (2) A clause in the first sentence that depends on a passive-voice predicate clause in the kara-case (カラ格). When several distributors develop a product jointly, two further conditions are added. (3) A clause in a noun-parallel relation with a distributor candidate obtained by (1) or (2). (4) A clause that depends in the to-case (ト格) on a clause containing 「共同」 (joint).
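Conditions (1) and (2) for the distributor can be sketched as a predicate over a clause record. The dictionary keys `sentence_index`, `case`, and `head_voice` are hypothetical names chosen for this illustration; the text does not define a record layout for this step, and conditions (3) and (4) are omitted because they need the parallel-structure information.

```python
def is_distributor_candidate(clause):
    """Check conditions (1) and (2) for a distributor candidate.

    clause is a dict with assumed keys:
      sentence_index: 1-based index of the sentence containing the clause
      case:           case form with which the clause depends on its head
      head_voice:     "active" or "passive" if the head is a predicate
                      clause, otherwise None
    """
    if clause["sentence_index"] != 1:
        return False
    if clause["head_voice"] == "active":
        return clause["case"] in ("未格", "ガ格")   # condition (1)
    if clause["head_voice"] == "passive":
        return clause["case"] == "カラ格"            # condition (2)
    return False
```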

(b) Release date
The case form of a clause containing the release date is mainly "kara-case (カラ格), ni-case (ニ格), no case (無格), adjacent (隣接)", and a clause satisfying one of the following conditions is taken as a release date candidate. (1) A clause that depends, in the kara-case, the ni-case, no case, or adjacency, on the active-voice predicate clause of the first sentence expressing the product release. (2) A clause that depends, in the ni-case, no case, or adjacency, on the passive-voice predicate clause of the first sentence expressing the product release. (3) If (1) and (2) yield no release date candidate, a clause in a later sentence that depends on 「発売」 (release) in one of the case forms of (1), or a date notation at the end of a sentence. When several products are released on different dates, two further conditions are added. (4) When a clause satisfying (1) is part of a predicate-parallel structure, the date-expression clauses within the parallel range. (5) A clause in the second or a later sentence that depends on 「発売」 in one of the case forms of (1), or a date notation at the end of a sentence. The "date notation" used in (1) to (5) refers to date patterns such as "number + 「月」 (month) + number + 「日」 (day)" and 「上旬」, 「中旬」, 「下旬」 (early, middle, and late in the month); a regular expression combining these patterns was created.
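A date-notation expression of the kind described here might look like the following. The text only names the building blocks (number + 「月」 + number + 「日」, and 「上旬」/「中旬」/「下旬」), so the exact pattern below, including the support for both ASCII and full-width digits and for a bare day such as 「２日」, is an assumption rather than the expression actually used.

```python
import re

# Assumed pattern: "<n>月<n>日", "<n>月上旬|中旬|下旬", "<n>月", or "<n>日".
_DIGITS = "[0-9０-９]+"
DATE_NOTATION = re.compile(
    rf"(?:{_DIGITS}月(?:{_DIGITS}日|上旬|中旬|下旬)?|{_DIGITS}日)"
)


def find_date_notations(text):
    """Return every substring of text that matches the assumed date pattern."""
    return [m.group(0) for m in DATE_NOTATION.finditer(text)]
```

Conditions such as "a date notation at the end of a sentence" would additionally check where the match falls in the sentence; that positional check is not shown.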

(c) Product type, product name, and subcategory
The distributor and the release date can be extracted from dependencies on fixed patterns. For the product type, product name, and subcategory, candidates are obtained from the candidates extractable from the fixed patterns together with the dependencies around those candidates. Because candidates are assigned to items only after all of them have been collected, the conditions for the three items are given together here.
Candidates from the fixed patterns of the first sentence are as follows. (1) A clause that depends in the wo-case (ヲ格) on a fixed pattern of the 「発売」 (release) type. (2) A clause that depends in the wo-case or the ni-case (ニ格) on a fixed pattern of the 「追加」 (addition), 「設定」 (setting) type. (3) A clause that depends in the ni-case on 「採用」 (adoption), 「導入」 (introduction), 「強化」 (strengthening), or 「強調」 (emphasis). (4) A clause that depends on a fixed pattern of the 「装備」 (equipping), 「搭載」 (mounting), 「刊行」 (publication) type, in the no-case (ノ格) for a nominal clause and in the ni-case for a predicate clause. The candidates extracted by (1) to (4) are then adjusted as follows. (5) When a candidate contains corner brackets, a clause that depends on the candidate is added to the candidates if its case form is appositive adnominal (同格連体) or the no-case. (6) When a candidate expresses a number of products, such as 「○機種」 (○ models), that candidate is deleted and the clause that depends on the number of products is added instead.
Candidates from the fixed patterns of the second and subsequent sentences are as follows. (7) A clause that depends in "no-case, appositive unmarked case" on a sentence end expressing a number of products. (8) When the clause expressing 「別売」 (sold separately) is in the no-case, the clause that receives it. (9) When the clause expressing 「別売」 is in the de-case (デ格), the clause that depends on it. (10) When the sentence begins with a clause expressing 「発売」, the clause that receives it. (11) When the sentence ends with a fixed pattern expressing 「発売」, a clause that depends on it in the unmarked case (particle: モ).
Candidates from clauses containing a price notation are as follows. (12) For a price notation without parentheses, a clause that depends on it in the unmarked case, the ga-case, the de-case, or adjacency. (13) For a price notation with parentheses, a clause that depends on it as appositive adnominal or adnominal. In every case, when a candidate is in a noun-parallel relation, the clauses parallel to it are also taken as candidates.

(d) Price
Unlike the other five items, the price is extracted by pattern matching on notation patterns. However, some price notations express a difference from another product or a sales target rather than a product price. Therefore, after pattern matching, constraints are applied to the matched patterns to obtain the candidates. The constraints that exclude a price notation from the candidates are as follows. (1) A price notation immediately preceded by one of the following expressions is not a candidate: 「売上高」 (sales revenue), 「売上目標」 (sales target), 「コスト」 (cost), 「年間」 (annual), 「電気代」 (electricity cost), 「ガス代」 (gas cost), 「資金」 (funds), and the like. (2) A price notation immediately followed by one of the following expressions is not a candidate: 「割安」 (bargain-priced), 「下回」 (below), 「切る」 (undercut), 「安く」 (cheaper), 「低価格」 (low price), 「下げる」 (lower), 「値下げ」 (price cut), 「引き下げ」 (reduction), 「値引き」 (discount), 「安い」 (cheap), 「低い」 (low), 「抑える」 (hold down), 「とどめる」 (keep to), 「価格削減」 (price reduction), 「割増」 (surcharge), 「アップ」 (increase), 「高い」 (high), 「値上げ」 (price increase), 「市場規模」 (market size), 「売り上げ」 (sales), 「費用」 (expense), 「電気代」, 「ガス代」, 「資金」, 「売上高」, 「年間売上」 (annual sales), and the like.
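The two exclusion constraints can be sketched as a filter around a price match. The word lists are taken from the text; the price pattern `PRICE` and the window size `WINDOW` used to interpret "immediately before/after" are assumptions made for this illustration, since the text specifies neither.

```python
import re

EXCLUDE_BEFORE = ("売上高", "売上目標", "コスト", "年間", "電気代", "ガス代", "資金")
EXCLUDE_AFTER = (
    "割安", "下回", "切る", "安く", "低価格", "下げる", "値下げ", "引き下げ",
    "値引き", "安い", "低い", "抑える", "とどめる", "価格削減", "割増", "アップ",
    "高い", "値上げ", "市場規模", "売り上げ", "費用", "電気代", "ガス代", "資金",
    "売上高", "年間売上",
)

# Assumed price notation: digits (ASCII or full-width), optional 万/億, then 円.
PRICE = re.compile(r"[0-9０-９,，]+(?:万|億)?[0-9０-９,，]*円")
WINDOW = 6  # assumed number of characters inspected on each side


def price_candidates(sentence):
    """Return price notations in sentence that pass both exclusion checks."""
    results = []
    for m in PRICE.finditer(sentence):
        before = sentence[max(0, m.start() - WINDOW):m.start()]
        after = sentence[m.end():m.end() + WINDOW]
        if any(word in before for word in EXCLUDE_BEFORE):
            continue  # constraint (1)
        if any(word in after for word in EXCLUDE_AFTER):
            continue  # constraint (2)
        results.append(m.group(0))
    return results
```

A fixed character window is the simplest reading of "immediately"; a clause-based check using the parser output would be stricter.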

FIG. 14 is a flowchart of the dependency extraction processing in the information extraction apparatus according to the embodiment of the present invention. First, the dependency tag pattern matching unit 310 performs pattern matching on the tags (S401). Each article carries four tags: (1) "＜索引記事番号＞" (index article number), (2) "＜詳細画面用記事見出し＞" (article headline for the detail screen), (3) "＜記事全文＞" (full article text), and (4) "＜／記事全文＞" (end of full article text). Each time a sentence is input, pattern matching on these tags lets the dependency tag division unit 320 determine whether the input is a headline, body text, or the end of the article (S402). A headline is passed to S403, body text to S405, and the end of the article to S413. For a headline, the headline analysis unit 330 checks whether the headline contains any of the special symbols "−−", "＝", or "――" (S403). If it does, the headline processing unit 340 uses the text following the matched symbol as the distributor (S404). For body text, the body sentence division unit 350 divides the input article at sentence-ending periods (S405). The parenthesized numerical value determination unit 360 then determines whether a price in parentheses is present. For this purpose, a regular expression was created from the price notations found in the analysis data.

Text matching this expression is judged to be a price (S406). When no price in parentheses is matched, the fixed pattern determination unit 380, which stores the fixed patterns in arrays by category, matches them against the body text as appropriate and determines whether the sentence contains a fixed pattern (S407). A sentence that matches no fixed pattern is not processed for extraction, and the flow returns to S401. When a price in parentheses is matched, the clause information creation unit 370 creates clause information (S408). Passing one sentence of the body through KNP divides it into clauses. Clause information is created by extracting, for each clause, the information needed for extraction, assignment, and association: dependency information (the head of each clause), case information (ga-case, wo-case, appositive adnominal, and so on), whether the clause is a predicate (用言) or a nominal (体言), the ranges of noun-parallel and predicate-parallel structures, and the position of the clause in the sentence. The fixed pattern dependency creation unit 390 then creates the fixed pattern dependency (S409). That is, it receives the clause information and the fixed pattern and uses the dependency information to build the set of clause information for the clauses that depend on the fixed pattern.

FIG. 15 is an explanatory diagram of the correspondence between a syntax analysis result and its clause information in the information extraction apparatus according to the embodiment of the present invention. The KNP syntax tree in FIG. 15(a) corresponds to the clause information in FIG. 15(b). Specifically, FIG. 15(a) shows the parse of the sentence 「○○○○は、低コストの店舗監視記録用ビデオ「ＳＲ−Ｌ９００」を２日に販売する」 ("○○○○ will sell the low-cost store surveillance video recorder 'SR-L900' on the 2nd"). In FIG. 15(b), the clause 「２日に」 (on the 2nd) becomes [3, "２日に", [0, 0], ["ニ格", "体言"], "5"], and the clause 「発売する。」 (release) becomes [0, "発売する。", [0, 0], [nil, "用言"], "6"]. As in FIG. 15(b), a clause whose first element equals the first element of the fixed pattern's clause information plus 3 is judged to depend on the fixed pattern. Using the parallel range information among the elements of the clause information, the dependencies of parallel structures are also reflected correctly. The clause information of the receiving clause is appended to the clause information of each dependent clause. By referring to the receiving clause, various constraints can be applied, and the information can be used for assignment and association.
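The "first element + 3" rule can be sketched with the two clause records quoted from FIG. 15(b). The field names in `Clause` are assumptions (the text does not name the elements), the third record is hypothetical and added only so the filter has something to reject, and the sketch ignores the parallel range information.

```python
from typing import NamedTuple, Optional, Tuple


class Clause(NamedTuple):
    level: int                  # first element; name is an assumption
    text: str                   # surface string of the clause
    parallel: Tuple[int, int]   # assumed to be the parallel range information
    case: Optional[str]         # case form, None where the figure shows nil
    kind: str                   # 体言 (nominal) or 用言 (predicate)
    position: str               # position of the clause in the sentence


def dependents_of(fixed, clauses):
    """Return clauses whose first element equals fixed.level + 3."""
    return [c for c in clauses if c.level == fixed.level + 3]


clauses = [
    Clause(6, "低コストの", (0, 0), "ノ格", "体言", "2"),   # hypothetical record
    Clause(3, "２日に", (0, 0), "ニ格", "体言", "5"),       # from FIG. 15(b)
    Clause(0, "発売する。", (0, 0), None, "用言", "6"),     # from FIG. 15(b)
]
fixed_pattern = clauses[-1]
date_like = dependents_of(fixed_pattern, clauses)
```

Here `date_like` holds only the 「２日に」 record, which can then be checked against the case forms listed for the fixed pattern.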

From the set of dependent clauses, the dependency extraction unit 400 performs extraction according to the rules shown in FIG. 13 (S410). In the extraction of the product type, product name, and subcategory, when a set of clauses that depend on a candidate taken from the dependent clause information is needed, the fixed pattern dependency creation class is called from this step. After each extraction is finished, the extracted solution creation unit 410 deletes candidates according to notation patterns, deletes duplicate elements, and produces the extracted solution (S411). It is then determined whether the body has ended (S412). If the body has not ended, the flow returns to S401. If the body has ended, the dependency association/assignment unit 420 performs assignment and association on the extracted solutions (S413). Finally, the solutions can be displayed in the form "article number", "product 1 [distributor, release date, product type, product name, subcategory, price]", "product 2 [same]", and so on (S414).
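The tag dispatch and headline handling at the start of this flow (S401 to S404) can be sketched as follows. The tag strings and special symbols come from the text; the assumption that each input line begins with its tag, the kind names, and the function names are illustrative only.

```python
# Tags as given in the text, mapped to assumed kind names.
TAG_KINDS = {
    "＜索引記事番号＞": "article_number",
    "＜詳細画面用記事見出し＞": "headline",
    "＜記事全文＞": "body",
    "＜／記事全文＞": "article_end",
}
# Special symbols checked in the headline (S403).
HEADLINE_SEPARATORS = ("−−", "＝", "――")


def classify(line):
    """Return (kind, remainder) for a line that starts with a known tag,
    or (None, line) when no tag matches."""
    for tag, kind in TAG_KINDS.items():
        if line.startswith(tag):
            return kind, line[len(tag):]
    return None, line


def distributor_from_headline(headline):
    """Return the text after the first special symbol found (S404),
    or None when the headline contains none of them."""
    for sep in HEADLINE_SEPARATORS:
        if sep in headline:
            return headline.split(sep, 1)[1].strip()
    return None
```

Applied to a headline that ends with a special symbol followed by a company name, `distributor_from_headline` returns that company name.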

[6. Output of selling-point (「売り」) information]
FIG. 16 is an explanatory diagram of an article from which information was extracted together with the words and phrases that serve as supplementary explanation, in the information extraction apparatus according to the embodiment of the present invention.
The headline is "[Business Information] Energy-saving vending machine developed −− Sanden". The body reads: "Sanden announced on the 15th that it has developed an energy-saving vending machine that uses only inexpensive late-night electricity and holds operating costs to 30% of those of conventional machines. It heats and cools canned beverages during the night and keeps the drinks at a suitable temperature during the day without using electricity. To keep heat from entering from outside, the insulation was thickened from the conventional 30 mm to 50 mm, and a dedicated door was devised to prevent temperature changes inside the cabinet when products are restocked. The machine is 194 cm high, 118 cm wide, and 86 cm deep, and holds 520 cans."

When this article is processed by the information extraction apparatus according to the present invention, "energy-saving type" is extracted as the feature information of the headline. As the feature information extracted from the body, "energy-saving type that uses only inexpensive late-night electricity and holds operating costs to 30% of those of conventional machines" is extracted. The same phrase is then output as the selling-point information.

Although the present invention has been described with reference to the above embodiment, the technical scope of the present invention is not limited to the scope described in the embodiment, and various modifications or improvements can be made to it. Embodiments with such modifications or improvements are also included in the technical scope of the present invention. This is apparent from the claims and from the means for solving the problems.

Brief description of the drawings
Each figure relates to the information extraction apparatus according to the embodiment of the present invention.
FIG. 1 is an explanatory diagram of a new product introduction article and the feature information contained in the article.
FIG. 2 is an explanatory diagram of an article whose headline contains selling-point information.
FIG. 3 is an explanatory diagram of an article whose body contains supplementary information for the feature information of the headline.
FIG. 4 is a hardware configuration diagram of the information extraction apparatus.
FIG. 5 is a block diagram of the information extraction apparatus.
FIG. 6 is a processing flowchart.
FIG. 7 shows a morphological analysis result.
FIG. 8 shows a dependency analysis result.
FIG. 9 is a block diagram of template extraction.
FIG. 10 is an explanatory diagram of an example of tagged data for analysis.
FIG. 11 is a flowchart of the template extraction processing.
FIG. 12 is a block diagram of extraction by dependency analysis.
FIG. 13 is an explanatory diagram of the fixed patterns and case forms.
FIG. 14 is a flowchart of the dependency extraction processing.
FIG. 15 is an explanatory diagram of the correspondence between a syntax analysis result and its clause information.
FIG. 16 is an explanatory diagram of an article from which information was extracted together with the words and phrases that serve as supplementary explanation.

Explanation of symbols

1 Computer
2 CPU
3 Main memory
4 HDD
5 Video card
6 Mouse
7 Keyboard
8 Optical disc
10 Article input unit
20 Template extraction unit
30 Dependency extraction unit
40 Tag pattern matching unit
50 Article division unit
60 Headline morphological analysis unit
70 Headline particle removal unit
80 Headline feature information matching unit
90 Headline feature information extraction unit
100 Body morphological analysis unit
110 Body particle removal unit
120 Body feature information matching unit
130 Body feature information extraction unit
140 Body dependency analysis unit
150 Supplementary explanation extraction unit
160 Selling-point information output
210 Article sentence division unit
220 A template matching unit
230 A template extraction unit
240 Constraint check unit
250 Template ID storage unit
260 B template matching unit
270 B template extraction unit
280 Template product association unit
310 Dependency tag pattern matching unit
320 Dependency tag division unit
330 Headline analysis unit
340 Headline processing unit
350 Body sentence division unit
360 Parenthesized numerical value determination unit
370 Clause information creation unit
380 Fixed pattern determination unit
390 Fixed pattern dependency creation unit
400 Dependency extraction unit
410 Extracted solution creation unit
420 Dependency association/assignment unit

Claims

Article input means;
A template extracting means for extracting main product information using a template for input article information;
Tag pattern matching means for performing pattern matching based on the tags attached to the input article information;
Article dividing means for dividing an article into a headline and a body based on the pattern matching result;
A morphological analysis means for headlines that morphologically analyzes the headlines of the divided articles;
A heading particle removal means for removing a particle from a headline phrase subjected to morphological analysis;
A headline feature information matching means for matching main product information extracted by the template with a headline phrase from which particles are removed after morphological analysis;
Headline feature information extraction means for extracting a clause indicating information other than the main product information as headline feature information;
Morphological analysis means for analyzing the body of the divided article,
Text particle removal means for removing particles from morphologically analyzed text clauses;
Text feature information matching means for matching the headline feature information with the text clause from which the particle has been removed after morphological analysis;
Text feature information extracting means for extracting feature information of the matched text;
An information extraction device including sales information output means for outputting headline feature information or text feature information as sales information.

Article input means;
Dependency extraction means for extracting main product information using dependency analysis for input article information;
Tag pattern matching means for performing pattern matching based on the tags attached to the input article information;
Article dividing means for dividing an article into a headline and a body based on the pattern matching result;
A morphological analysis means for headlines that morphologically analyzes the headlines of the divided articles;
A heading particle removal means for removing a particle from a headline phrase subjected to morphological analysis;
A headline feature information matching means for matching main product information extracted by the dependency analysis with a headline clause from which particles are removed after morphological analysis;
Headline feature information extraction means for extracting a clause indicating information other than the main product information as headline feature information;
Morphological analysis means for analyzing the body of the divided article,
Text particle removal means for removing particles from morphologically analyzed text clauses;
Text feature information matching means for matching the headline feature information with the text clause from which the particle has been removed after morphological analysis;
Text feature information extracting means for extracting feature information of the matched text;
An information extraction device including sales information output means for outputting headline feature information or text feature information as sales information.

Article input means;
A template extraction means for extracting main product information from article information input using a template whose extraction accuracy weighting threshold is equal to or greater than a certain value;
Dependency extraction means for extracting main product information using dependency analysis for article information not extracted by the template;
Tag pattern matching means for performing pattern matching based on the tags attached to the input article information;
Article dividing means for dividing an article into a headline and a body based on the pattern matching result;
A morphological analysis means for headlines that morphologically analyzes the headlines of the divided articles;
A heading particle removal means for removing a particle from a headline phrase subjected to morphological analysis;
A headline feature information matching means for matching main product information extracted by the template or dependency analysis with a headline phrase from which particles have been removed after morphological analysis;
Headline feature information extraction means for extracting a clause indicating information other than the main product information as headline feature information;
Morphological analysis means for analyzing the body of the divided article,
Text particle removal means for removing particles from morphologically analyzed text clauses;
Text feature information matching means for matching the headline feature information with the text clause from which the particle has been removed after morphological analysis;
Text feature information extracting means for extracting feature information of the matched text;
An information extraction device including sales information output means for outputting headline feature information or text feature information as sales information.

A text dependency analysis unit for examining the dependency relationship of the text subjected to the morphological analysis;
Supplementary explanation extracting means for extracting modifiers as supplementary explanation information by dependency relation;
4. An information extracting apparatus according to claim 1, further comprising sales information output means for outputting supplementary explanation information as sales information.

The template extraction means includes:
Article punctuation dividing means for dividing the input article at punctuation marks,
A-template matching means for matching the article against an A template set corresponding to the first line of the article;
A-template extraction means for extracting feature information of products matched by the A template set;
Constraint check means for checking, for each extracted item, the constraints on the extracted product feature information;
Template ID storage means for storing the ID of the template from which information could be extracted;
B template matching means for matching a B template set corresponding to the second and subsequent lines of the article;
B template extraction means for extracting feature information of products matched by the B template set;
A template product correspondence means for associating the extraction solution, which is the extracted product feature information, with the product,
The information extraction device according to claim 1 or 3, comprising:
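The two-stage template matching recited above might look something like the sketch below. The templates, item names, and constraints are all hypothetical placeholders (the claims do not give their contents), regular expressions stand in for the templates, and splitting on periods is a simplification of dividing at punctuation.

```python
import re

# Hypothetical template sets: (template ID, compiled pattern).
A_TEMPLATES = [
    ("A1", re.compile(
        r"(?P<vendor>\w+) released (?P<product>[\w\- ]+) on (?P<date>\w+ \d+)")),
]
B_TEMPLATES = [
    ("B1", re.compile(r"The price is (?P<price>\d+) yen")),
]

# Hypothetical per-item constraints; items without an entry always pass.
CONSTRAINTS = {
    "price": lambda v: v.isdigit(),
    "product": lambda v: len(v) > 0,
}


def passes_constraints(items):
    """Check every extracted item against its constraint, if one is defined."""
    return all(CONSTRAINTS.get(k, lambda v: True)(v) for k, v in items.items())


def match_templates(sentence, templates):
    """Return (template ID, extracted items) for the first template that fits."""
    for template_id, pattern in templates:
        m = pattern.search(sentence)
        if m and passes_constraints(m.groupdict()):
            return template_id, m.groupdict()
    return None, {}


def extract_with_templates(article):
    """Apply the A set to the first sentence and the B set to the rest."""
    sentences = [s.strip() for s in re.split(r"[.。]", article) if s.strip()]
    if not sentences:
        return None
    template_id, product = match_templates(sentences[0], A_TEMPLATES)
    if template_id is None:
        return None  # nothing extracted; left to dependency-based extraction
    used_ids = [template_id]  # IDs of templates that yielded information
    for sentence in sentences[1:]:
        b_id, items = match_templates(sentence, B_TEMPLATES)
        if b_id is not None:
            used_ids.append(b_id)
            product.update(items)  # associate the B-set result with the product
    return {"template_ids": used_ids, "product": product}
```

Returning `None` when no A template fits mirrors the division of labor in the claims, where articles not handled by the templates are passed to dependency analysis.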

The dependency extraction means includes:
A dependency tag pattern matching means for performing pattern matching based on a tag attached to the input article;
Dependency tag dividing means for dividing an article into a headline and a text based on the result of the pattern matching;
A headline analysis means for analyzing special symbols included in the divided headlines,
A headline processing means for processing the word behind the special symbol included in the headline as “vendor” information;
Text punctuation dividing means for dividing the divided text into phrases,
Parenthesized-value determination means for determining whether or not a numerical value in parentheses is present in the text;
Clause information creation means for creating clause information by syntax analysis when it is determined that a numerical value in parentheses exists;
Fixed pattern determination means for determining whether or not a fixed pattern exists when it is determined that the numerical value in parentheses does not exist;
A fixed pattern dependency creating means for creating a clause information set related to a fixed pattern using the fixed pattern dependency information obtained from the fixed pattern and the clause information when it is determined that a fixed pattern exists;
Extraction means for extracting clause information from the clause information set created for the fixed pattern, in accordance with the conditions and format prescribed for that fixed pattern;
An extraction solution creation means for creating an extraction solution by deleting unnecessary information from the extracted clause information;
Dependency correspondence and allocation means for associating the extraction solution with, or assigning it to, the product,
The information extraction device according to claim 2 or 3, comprising:
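The branching among the headline processing, the parenthesized-value determination, and the fixed pattern determination can be sketched as below. The special symbol, the fixed patterns, and the returned labels are hypothetical; the sketch only shows the order of the decisions, not the syntax analysis itself.

```python
import re

SPECIAL_SYMBOL = "="  # hypothetical; the claims do not name the symbol
# A number enclosed in ASCII or full-width parentheses.
PAREN_NUMBER = re.compile(r"[(（]\s*\d+(?:[.,]\d+)*\s*[)）]")
# Hypothetical fixed patterns.
FIXED_PATTERNS = [re.compile(r"\bwill (?:release|launch)\b")]


def vendor_from_headline(headline):
    """Treat the text after the special symbol as 'vendor' information."""
    if SPECIAL_SYMBOL not in headline:
        return None
    return headline.split(SPECIAL_SYMBOL, 1)[1].strip()


def choose_analysis(sentence):
    """Decide which processing route applies to a sentence of the text."""
    if PAREN_NUMBER.search(sentence):
        return "syntax_analysis"  # create clause information by syntax analysis
    if any(p.search(sentence) for p in FIXED_PATTERNS):
        return "fixed_pattern"  # build the fixed-pattern clause information set
    return "none"
```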

An article input step,
A template extraction step for extracting main product information using a template for input article information;
A dependency extraction step of extracting main product information using dependency analysis for article information not extracted by the template;
A tag pattern matching step for performing pattern matching based on the tags attached to the input article information;
An article dividing step of dividing an article into a headline and a body based on the pattern matching result;
A headline morphological analysis step for morphological analysis of the divided article headlines;
A headline particle removal step for removing particles from the morphologically analyzed headline clause;
A headline feature information matching step for matching the main product information extracted by the template or dependency analysis with the headline clause from which the particle is removed after the morphological analysis;
A headline feature information extraction step of extracting a clause indicating information other than the main product information as headline feature information;
A morphological analysis step of the body for morphological analysis of the body of the divided article;
A text particle removal step for removing particles from the morphologically analyzed text clauses;
A text feature information matching step for matching the headline feature information with the text clause from which the particle has been removed after morphological analysis;
A text feature information extraction step for extracting feature information of the matched text;
A text dependency analysis step of examining the dependency relations of the morphologically analyzed text;
A supplementary explanation extracting step for extracting modifiers as supplementary explanation information by dependency relation;
An information extraction method including a sales information output step of outputting headline feature information, body feature information, or supplementary explanation information as sales information.
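The overall order of the steps, with template extraction tried first and dependency extraction used only for articles the templates do not cover, can be sketched as follows. The three callables are hypothetical stand-ins for the individual steps, and the shape of the returned dictionary is an assumption made for illustration.

```python
def extract_sales_information(article, template_extract, dependency_extract,
                              extract_features):
    """Run the extraction steps in the order recited in the method.

    template_extract / dependency_extract: return main product information
    or None when nothing could be extracted.
    extract_features: returns headline, body, and supplementary information
    for the article given the main product information.
    """
    main = template_extract(article)
    if main is None:
        # Only articles not handled by the templates reach dependency analysis.
        main = dependency_extract(article)
    if main is None:
        return None
    return {
        "main_product_info": main,
        "sales_info": extract_features(article, main),
    }
```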

Article input procedure,
Template extraction procedure to extract main product information using template for input article information,
A dependency extraction procedure for extracting main product information using dependency analysis for article information not extracted by the template;
Tag pattern matching procedure for performing pattern matching based on the tags attached to the input article information,
An article dividing procedure for dividing an article into a headline and a body based on the pattern matching result;
A morphological analysis procedure of a headline for performing a morphological analysis on the headline of the divided article;
A headline particle removal procedure for removing particles from the morphologically analyzed headline clauses;
A headline feature information matching procedure for matching main product information extracted by the template or dependency analysis with a headline clause from which particles have been removed after morphological analysis;
A headline feature information extraction procedure for extracting headline feature information as a clause indicating information other than the main product information;
A morphological analysis procedure for text that morphologically analyzes the text of a divided article,
Text particle removal procedure to remove particles from morphologically analyzed text clauses;
Body feature information matching procedure for matching the headline feature information and the clause of the body from which particles have been removed after morphological analysis;
Text feature information extraction procedure for extracting feature information of the matched text;
A text dependency analysis procedure for examining the dependency relations of the morphologically analyzed text;
A supplementary explanation extraction procedure for extracting modifiers as supplementary explanation information by the dependency relationship;
An information extraction program for causing a computer to execute a sales information output procedure for outputting headline feature information, text feature information, or supplementary explanation information as sales information.