JP4148247B2

JP4148247B2 - Vocabulary acquisition method and apparatus, program, and computer-readable recording medium

Info

Publication number: JP4148247B2
Application number: JP2005194298A
Authority: JP
Inventors: 浩之戸田; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-07-01
Filing date: 2005-07-01
Publication date: 2008-09-10
Anticipated expiration: 2025-07-01
Also published as: JP2007011892A

Description

The present invention relates to a vocabulary acquisition method and apparatus, a program, and a computer-readable recording medium, and in particular to a vocabulary acquisition method and apparatus, a program, and a computer-readable recording medium for acquiring vocabulary from tagged text such as HTML, XML, and SGML in a computer network typified by the Internet.

In information retrieval over computer networks, search results are frequently very large, so that after searching by keyword the user of a search system is forced to pick out the information actually wanted from the search results obtained.

To address this problem, there is a method that extracts terms that are key within documents, such as proper nouns, from the text of the search results, selects those terms considered important in the search results, and presents them together with the search results, thereby making efficient document retrieval easy.

This allows the user to narrow down to the desired information without examining the search results one by one or devising additional keywords and searching again.

To realize this, a technique for identifying proper nouns and the like in text is required as a basic technology.

The simplest approach is to create a dictionary by hand and extract from the text the words that match dictionary entries.

There is also a technique that keeps no specific dictionary: proper nouns present in documents are identified by hand in advance, extraction rules are created from this training data as patterns at the morpheme (part-of-speech) level, and not only the words contained in the training data but also new words can then be extracted (see, for example, Patent Document 1 and Non-Patent Document 1).
Patent Document 1: JP 2003-331254 A
Non-Patent Document 1: Sekine, S.: Named Entity: History and Future, http://cs.nyu.edu/sekinepaper/NEsurvey 200402.pdf

However, the above prior art has the following problems.

Creating a dictionary by hand allows the relevant portions of text to be identified reliably, but the cost of updating the dictionary is very high, so collecting dictionary words across a wide range of fields and attributes is difficult in practice.

The technique that uses training data automatically generates rules from that data, and can therefore automatically extract from text not only the vocabulary present in the training data but also new vocabulary absent from it.

On the other hand, generating extraction rules requires extracting characteristic patterns from the training data. Data such as person names and place names in newspaper articles, for example, have characteristic appearance patterns and can be extracted with relatively high accuracy; when the target is general Web pages, however, expressions become diverse and highly accurate identification is not always possible. Likewise, for targets such as "store names" and "book titles", where the name itself takes many different forms, unlike person or place names, patterning is difficult and so is highly accurate extraction.

The present invention has been made in view of the above points. Its object is to provide a vocabulary acquisition method and apparatus, a program, and a storage medium storing the program that can automatically acquire data of a specific attribute (such as person names or book titles) from text on a computer network, starting from a small number of data examples, by identifying the patterns in which the data examples appear in text and tagged documents and removing unnecessary words based on the appearance frequency and distribution of the keywords extracted by those patterns.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a vocabulary acquisition method in an apparatus that analyzes electronic text including HTML and XML based on keyword examples of a specific attribute and acquires vocabulary, wherein:
text search means performs a text search step (step 1) of searching, based on externally input keyword examples of a specific attribute, text storage means that stores electronic text including HTML and XML, and acquiring texts that contain the keyword examples;
keyword position specifying means performs a keyword position specifying step (step 2) of specifying the appearance positions of the keyword examples in the retrieved texts;
keyword appearance pattern extraction means performs a keyword appearance extraction step (step 3) of, in the texts acquired in the text search step, acquiring characters one at a time from the appearance position of each keyword of the keyword examples toward the beginning of the text, comparing the acquired characters with one another, acquiring the next character if they are the same, and repeating the comparison, thereby specifying an appearance pattern common to the keyword examples;
keyword candidate extraction means performs a keyword candidate extraction step (step 4) of analyzing the texts based on the appearance position of each keyword of the keyword examples and the appearance pattern common to the keyword examples, and extracting keyword candidates from the texts; and
keyword extraction means performs a keyword extraction step (step 5) of obtaining, for each keyword among the keyword candidates, an evaluation value (N_A^W / N_A) × log(N_A^W / N^W) from the number N_A of texts acquired in the text search step, the number N_A^W of those texts that contain the keyword candidate, and the number N^W of texts stored in the text storage means, and extracting from the keyword candidates the keywords whose evaluation value is higher than a predetermined threshold.

FIG. 2 is a diagram of the principle configuration of the present invention.

The present invention (claim 2) is a vocabulary acquisition apparatus that analyzes electronic text including HTML and XML based on keyword examples of a specific attribute and acquires vocabulary, comprising:
text storage means 2 for storing electronic text including HTML and XML;
text search means 3 for searching the text storage means 2 based on externally input keyword examples of a specific attribute, and acquiring texts that contain the keyword examples;
keyword position specifying means 4 for specifying the appearance positions of the keyword examples in the retrieved texts;
keyword appearance extraction means 5 for, in the texts acquired by the text search means 3, acquiring characters one at a time from the appearance position of each keyword of the keyword examples toward the beginning of the text, comparing the acquired characters with one another, acquiring the next character if they are the same, and repeating the comparison, thereby specifying an appearance pattern common to the keyword examples;
keyword candidate extraction means 6 for analyzing the texts based on the appearance position of each keyword of the keyword examples and the appearance pattern common to the keyword examples, and extracting keyword candidates from the texts; and
keyword extraction means 7 for obtaining, for each keyword among the keyword candidates, an evaluation value (N_A^W / N_A) × log(N_A^W / N^W) from the number N_A of texts acquired by the text search means 3, the number N_A^W of those texts that contain the keyword candidate, and the number N^W of texts stored in the text storage means 2, and extracting from the keyword candidates the keywords whose evaluation value is higher than a predetermined threshold.

The present invention (claim 3) is a vocabulary acquisition program for causing a computer to function as each of the means constituting the vocabulary acquisition apparatus according to claim 2.

The present invention (claim 4) is a computer-readable storage medium storing the vocabulary acquisition program according to claim 3.

According to the present invention, starting from a small number of keywords of a specific attribute, the positions and patterns in which keywords of that attribute appear are extracted automatically; these two features are used to specify, with high accuracy, the rule by which keyword candidates appear; keywords matching that rule are extracted from texts that contain several of the pre-specified keywords; and the final keywords are determined from the appearance frequency and distribution of each extracted keyword. Vocabulary can thus be acquired with high accuracy.

By using the dictionary obtained through this vocabulary acquisition, keywords of a specific attribute can be extracted from text.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 3 shows the configuration of a vocabulary acquisition apparatus according to an embodiment of the present invention.

The vocabulary acquisition apparatus 10 shown in the figure is connected to the Internet 30 and to a data example input device 20, through which keywords of the attribute to be extracted are input as examples, and comprises a text collection unit 1, a text storage unit 2, a text search unit 3, a keyword position specifying unit 4, a keyword appearance pattern extraction unit 5, a keyword candidate extraction unit 6, and a keyword extraction unit 7.

The text collection unit 1 collects HTML and XML text data present on the Internet 30 and stores it in the text storage unit 2. If already-collected text exists on a local disk, collection is performed on that text, which is then stored in the text storage unit 2.

The text storage unit 2 stores the text data collected by the text collection unit 1. The stored text data is HTML text data or tagged text data in XML format.

The text search unit 3 searches the text data stored in the text storage unit 2 based on keyword examples input from the data example input device 20, which consist of proper nouns such as places and persons, attributes, and the like, and identifies the text data that contains several of the keyword examples.

The keyword position specifying unit 4 analyzes the texts, retrieved by the text search unit 3, that contain several of the keywords included in the keyword examples input from the data example input device 20, and specifies the positions at which those keywords occur. FIG. 4 shows an example of specifying the position.

For the HTML shown in FIG. 4(a), when "恋し君へ" and "水戸黄門様" are given as keyword examples, the portions containing the keyword examples are identified, the HTML tag structure is regarded as a tree, and it is determined that "the relevant data exists under the td tag, under the tr tag, under the BODY tag, under the HTML tag". This processing is performed for each keyword. As a result,
"/HTML/BODY/tr/td"
is obtained.

For the HTML shown in FIG. 4(b), when "恋し君へ" and "水戸黄門様" are given as keyword examples,
"/HTML/BODY/"
is obtained as the position at which the keyword examples exist. The obtained appearance positions of the keyword examples are passed to the keyword candidate extraction unit 6.
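The position determination described above, which treats the tag structure as a tree and reports the chain of enclosing tags, might be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the names `TagPathFinder` and `keyword_tag_paths` and the sample markup are hypothetical (FIG. 4 is not reproduced here), and Python's `html.parser` reports tag names in lower case, so the path comes out as `/html/body/tr/td`.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathFinder(HTMLParser):
    """Collects the tag path of each text node containing a keyword example."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = list(keywords)
        self.stack = []   # currently open tags, outermost first
        self.paths = {}   # keyword -> list of tag paths where it occurs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/" + "/".join(self.stack)
        for keyword in self.keywords:
            if keyword in data:
                self.paths.setdefault(keyword, []).append(path)


def keyword_tag_paths(html_text, keywords):
    """Return {keyword: [tag paths]} for one HTML text (hypothetical helper)."""
    finder = TagPathFinder(keywords)
    finder.feed(html_text)
    finder.close()
    return finder.paths


# Hypothetical markup in the spirit of FIG. 4(a).
SAMPLE = (
    "<html><body>"
    "<tr><td>恋し君へ</td><td>2005</td></tr>"
    "<tr><td>水戸黄門様</td><td>2004</td></tr>"
    "</body></html>"
)
print(keyword_tag_paths(SAMPLE, ["恋し君へ", "水戸黄門様"]))
```

When every keyword example reports the same path, that path can serve as the positional constraint handed to the candidate extraction stage.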

The keyword appearance pattern extraction unit 5 analyzes the texts, retrieved by the text search unit 3, that contain several of the keyword examples input from the data example input device 20, and specifies the pattern in which the keywords appear. For example, it identifies the portions containing the keywords "恋し君へ" and "水戸黄門様", acquires characters for each keyword one at a time toward the beginning of the document, compares them across the keywords, acquires the next character if they are the same, and repeats the comparison, thereby extracting the pattern they share.

In FIG. 5(a), when "恋し君へ" and "水戸黄門様" are given as keyword examples,
"<tr><td>ENTITY</td><td>"
is obtained as the appearance pattern of the keywords. The "ENTITY" part of the appearance pattern is the part that contains the vocabulary ultimately to be extracted.

In FIG. 5(b), when "恋し君へ" and "水戸黄門様" are given as keyword examples,
"，ENTITY，"
is obtained as the appearance pattern of the keywords. The obtained appearance pattern of the keyword examples is passed to the keyword candidate extraction unit 6. The "ENTITY" part of the appearance pattern is the part that contains the vocabulary ultimately to be extracted.
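The backward, character-by-character comparison performed by the keyword appearance pattern extraction unit 5 might look like the following Python sketch. The function name `common_left_context` and the sample rows are assumptions made for illustration, and the sketch looks only at the first occurrence of each keyword. The description covers the scan toward the beginning of the text; obtaining the right-hand part of a pattern (such as `</td><td>`) would presumably need a symmetric scan toward the end, which is not shown.

```python
def common_left_context(text, keywords):
    """Return the longest string that directly precedes every keyword example.

    Only the first occurrence of each keyword is considered (a simplification).
    """
    positions = [text.find(keyword) for keyword in keywords]
    positions = [p for p in positions if p >= 0]   # drop keywords not found
    if len(positions) < 2:
        return ""   # nothing to compare against

    context = []
    offset = 1
    # Step one character at a time toward the beginning of the text.
    while all(p - offset >= 0 for p in positions):
        chars = {text[p - offset] for p in positions}
        if len(chars) != 1:   # the characters differ: stop
            break
        context.append(chars.pop())
        offset += 1
    return "".join(reversed(context))


# Hypothetical rows in the spirit of FIG. 5(a).
ROWS = (
    "<tr><td>恋し君へ</td><td>2005</td></tr>"
    "<tr><td>水戸黄門様</td><td>2004</td></tr>"
)
print(common_left_context(ROWS, ["恋し君へ", "水戸黄門様"]))   # <tr><td>
```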

The keyword candidate extraction unit 6 analyzes the texts retrieved by the text search unit 3, based on the positions of the keyword examples specified by the keyword position specifying unit 4 and the appearance pattern of the keyword examples obtained by the keyword appearance pattern extraction unit 5, and extracts keyword candidates.
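As a rough illustration of how an appearance pattern with an ENTITY slot could be applied to a text, the Python sketch below converts the pattern into a regular expression. The helper names `pattern_to_regex` and `extract_candidates` and the third entry "新番組" in the sample data are hypothetical, and the positional (tag path) constraint is left out for brevity.

```python
import re


def pattern_to_regex(pattern, placeholder="ENTITY"):
    """Turn an appearance pattern such as '<tr><td>ENTITY</td><td>' into a regex."""
    left, right = pattern.split(placeholder, 1)
    # The right context is a lookahead so that a delimiter shared by two
    # neighbouring entries (e.g. a comma) can also open the next match.
    return re.compile(re.escape(left) + "(.+?)" + "(?=" + re.escape(right) + ")")


def extract_candidates(text, pattern):
    """Return every string found in the ENTITY slot of the pattern."""
    return [match.group(1) for match in pattern_to_regex(pattern).finditer(text)]


# Hypothetical texts; "新番組" stands for a keyword not among the examples.
TABLE = (
    "<tr><td>恋し君へ</td><td>2005</td></tr>"
    "<tr><td>水戸黄門様</td><td>2004</td></tr>"
    "<tr><td>新番組</td><td>2006</td></tr>"
)
LISTING = "，恋し君へ，水戸黄門様，新番組，"

print(extract_candidates(TABLE, "<tr><td>ENTITY</td><td>"))
print(extract_candidates(LISTING, "，ENTITY，"))
```

Using a lookahead for the right-hand context is a design choice of this sketch: with a comma-delimited pattern, consuming the closing comma would cause every second entry to be skipped.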

The keyword extraction unit 7 determines the keywords to extract from the keyword candidates extracted by the keyword candidate extraction unit 6, based on an evaluation of the appearance frequency and distribution of each keyword.

A criterion α such as the following can be used to evaluate keyword candidates (this is the evaluation value recited in claims 1 and 2):

α = (N_A^w / N_A) × log(N_A^w / N^w)

Here,
・N_A is the number of analysis-target texts identified by the text search unit 3.
・N_A^w is the number of documents, among the analysis-target texts identified by the text search unit 3, that contain the keyword w currently being evaluated.
・N^w is the number of documents in the entire document set that contain the keyword w currently being evaluated.

Candidates whose evaluation value obtained in this way is higher than a predetermined threshold are extracted as keywords.
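The evaluation step can be illustrated with a short Python sketch that uses the evaluation value recited in the claims, (N_A^w / N_A) × log(N_A^w / N^w). The function names, the natural logarithm, the counts, and the threshold are all assumptions made for illustration; the document does not state the base of the logarithm or any concrete threshold.

```python
import math


def evaluation_value(n_a, n_a_w, n_w):
    """(N_A^w / N_A) * log(N_A^w / N^w), natural logarithm assumed."""
    if n_a <= 0 or n_a_w <= 0 or n_w <= 0:
        raise ValueError("all counts must be positive")
    return (n_a_w / n_a) * math.log(n_a_w / n_w)


def select_keywords(doc_counts, n_a, threshold):
    """Keep the candidates whose evaluation value is higher than the threshold.

    doc_counts maps each candidate w to the pair (N_A^w, N^w).
    """
    return [
        word
        for word, (n_a_w, n_w) in doc_counts.items()
        if evaluation_value(n_a, n_a_w, n_w) > threshold
    ]


# Hypothetical counts: "メニュー" is frequent everywhere, so it scores low.
COUNTS = {"恋し君へ": (8, 10), "メニュー": (9, 900)}
print(select_keywords(COUNTS, n_a=10, threshold=-1.0))
```

Since the analysis-target texts are drawn from the stored collection, N_A^w cannot exceed N^w under this reading, so the value as written is never positive and a workable threshold would be zero or negative. A candidate concentrated in the analysis-target texts scores close to zero, while a word that is common throughout the collection scores strongly negative.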

Next, the operation of the above configuration will be described.

The processing of the present invention is divided into two stages: a preprocessing stage for collecting data, and a stage in which vocabulary is actually acquired.

FIG. 6 is a flowchart of the preprocessing in an embodiment of the present invention.

Step 101) The text collection unit 1 accepts a URL from an input device (not shown), collects texts while following links starting from that URL, and stores them in the text storage unit 2.

Step 102) The text search unit 3 reads documents from the text storage unit 2, analyzes them, and creates an index for identifying the documents that contain a keyword specified from the input device (not shown).
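The description does not say what kind of index is built in step 102. As one possible realization, the Python sketch below builds a character-bigram inverted index, which works for text without word boundaries, and verifies each hit with a substring check. The bigram scheme and the names `build_bigram_index` and `search` are assumptions of this sketch, not something stated in the source.

```python
from collections import defaultdict


def build_bigram_index(texts):
    """Map every two-character sequence to the ids of the texts containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for i in range(len(text) - 1):
            index[text[i:i + 2]].add(doc_id)
    return index


def search(index, texts, keyword):
    """Return the sorted ids of the texts that contain the keyword."""
    if len(keyword) < 2:
        candidates = range(len(texts))   # too short for bigrams: scan all
    else:
        grams = [keyword[i:i + 2] for i in range(len(keyword) - 1)]
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Bigram hits are only candidates; confirm with a real substring test.
    return sorted(d for d in candidates if keyword in texts[d])


# Hypothetical stored texts.
TEXTS = ["<td>恋し君へ</td>", "<td>水戸黄門様</td>", "恋し君へ，水戸黄門様，"]
INDEX = build_bigram_index(TEXTS)
print(search(INDEX, TEXTS, "恋し君へ"))   # [0, 2]
```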

Next, the processing for vocabulary acquisition will be described.

FIG. 7 is a flowchart of vocabulary acquisition in an embodiment of the present invention.

Step 201) The text search unit 3 accepts, from the data example input device 20, at least one keyword example serving as an example of the vocabulary to be extracted.

Step 202) The text search unit 3 searches the text storage unit 2 based on the accepted keyword examples and acquires the texts that contain them.
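Step 202 can be pictured as the following filter, written in Python with a plain substring test rather than an index. The function name and the `min_hits` parameter are assumptions of this sketch; the default of 2 reflects the description of the text search unit 3, which identifies texts containing several of the keyword examples.

```python
def texts_containing_examples(texts, examples, min_hits=2):
    """Return (text id, examples found) for texts with at least min_hits examples."""
    selected = []
    for doc_id, text in enumerate(texts):
        found = [keyword for keyword in examples if keyword in text]
        if len(found) >= min_hits:
            selected.append((doc_id, found))
    return selected


# Hypothetical stored texts.
STORED = [
    "<td>恋し君へ</td><td>水戸黄門様</td>",
    "<td>恋し君へ</td>",
    "no examples here",
]
print(texts_containing_examples(STORED, ["恋し君へ", "水戸黄門様"]))
```

Requiring at least two examples per text matters for the later steps, because a common appearance pattern can only be found by comparing the contexts of different keyword examples within the same text.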

Step 203) The keyword position specifying unit 4 analyzes the texts retrieved by the text search unit 3 one at a time and, for each text, specifies the positions at which the keyword examples input from the data example input device 20 appear.

Step 204) The keyword appearance pattern extraction unit 5 analyzes the texts retrieved by the text search unit 3 one at a time and, for each text, specifies the appearance pattern in which the keyword examples input from the data example input device 20 occur.

Step 205) The keyword candidate extraction unit 6 extracts keyword candidates from the texts retrieved in step 202, based on the position information specified by the keyword position specifying unit 4 and the pattern specified by the keyword appearance pattern extraction unit 5.

Step 206) The keyword extraction unit 7 obtains an evaluation value for each keyword candidate from the number of analysis-target texts retrieved by the text search unit 3, the number of those texts that contain the keyword extracted by the keyword candidate extraction unit 6, and the number of documents in the entire document set that contain the keyword being evaluated, and extracts as keywords the keyword candidates whose evaluation value is at or above a preset threshold. Keywords whose evaluation value falls short of the threshold are regarded as unnecessary words and removed from the candidates. This threshold is set empirically. As an alternative, the appearance frequency of each keyword extracted from the texts may be calculated, and keywords whose frequency does not exceed a threshold regarded as unnecessary words and removed from the candidates. This threshold is likewise set empirically.
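The alternative mentioned at the end of step 206, filtering by raw appearance frequency, might be sketched in Python as follows. The function name, the sample candidates, and the threshold value are hypothetical; as the description notes, the threshold would be chosen empirically.

```python
from collections import Counter


def filter_by_frequency(candidates_per_text, threshold):
    """Keep candidates whose total frequency exceeds the threshold.

    candidates_per_text holds one list of extracted candidates per text.
    Candidates at or below the threshold are treated as unnecessary words.
    """
    counts = Counter()
    for candidates in candidates_per_text:
        counts.update(candidates)
    return {word: n for word, n in counts.items() if n > threshold}


# Hypothetical extraction results from three texts.
EXTRACTED = [
    ["恋し君へ", "戻る"],
    ["恋し君へ"],
    ["恋し君へ", "水戸黄門様", "水戸黄門様"],
]
print(filter_by_frequency(EXTRACTED, threshold=1))
```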

Thereafter, the keywords extracted by the keyword extraction unit 7 are stored in storage means 40 such as a dictionary. They may be shown on a display device before being stored in the storage means.

The present invention is not limited to the above embodiment and examples, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to information retrieval technology in computer networks.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram of the principle configuration of the present invention.
FIG. 3 is a configuration diagram of a vocabulary acquisition apparatus according to an embodiment of the present invention.
FIG. 4 is an example of the operation of the keyword position specifying unit according to an embodiment of the present invention.
FIG. 5 is an example of the operation of the keyword appearance pattern extraction unit according to an embodiment of the present invention.
FIG. 6 is a flowchart of the preprocessing in an embodiment of the present invention.
FIG. 7 is a flowchart of vocabulary acquisition in an embodiment of the present invention.

Explanation of symbols

1 Text collection unit
2 Text storage means, text storage unit
3 Text search means, text search unit
4 Keyword position specifying means, keyword position specifying unit
5 Keyword appearance pattern extraction means, keyword appearance pattern extraction unit
6 Keyword candidate extraction means, keyword candidate extraction unit
7 Keyword extraction means, keyword extraction unit
10 Vocabulary acquisition apparatus
20 Data example input device
30 Internet
40 Storage device

Claims

A vocabulary acquisition method in an apparatus that analyzes electronic text including HTML and XML based on keyword examples of a specific attribute and acquires vocabulary, the method being characterized by performing:
a text search step in which text search means searches, based on externally input keyword examples of a specific attribute, text storage means that stores electronic text including HTML and XML, and acquires texts that contain the keyword examples;
a keyword position specifying step in which keyword position specifying means specifies the appearance positions of the keyword examples in the retrieved texts;
a keyword appearance extraction step in which keyword appearance pattern extraction means, in the texts acquired in the text search step, acquires characters one at a time from the appearance position of each keyword of the keyword examples toward the beginning of the text, compares the acquired characters with one another, acquires the next character if they are the same, and repeats the comparison, thereby specifying an appearance pattern common to the keyword examples;
a keyword candidate extraction step in which keyword candidate extraction means analyzes the texts based on the appearance position of each keyword of the keyword examples and the appearance pattern common to the keyword examples, and extracts keyword candidates from the texts; and
a keyword extraction step in which keyword extraction means obtains, for each keyword among the keyword candidates, an evaluation value (N_A^W / N_A) × log(N_A^W / N^W) from the number N_A of texts acquired in the text search step, the number N_A^W of those texts that contain the keyword candidate, and the number N^W of texts stored in the text storage means, and extracts from the keyword candidates the keywords whose evaluation value is higher than a predetermined threshold.

A vocabulary acquisition apparatus that analyzes electronic text including HTML and XML based on keyword examples of a specific attribute and acquires vocabulary, the apparatus being characterized by comprising:
text storage means for storing electronic text including HTML and XML;
text search means for searching the text storage means based on externally input keyword examples of a specific attribute, and acquiring texts that contain the keyword examples;
keyword position specifying means for specifying the appearance positions of the keyword examples in the retrieved texts;
keyword appearance extraction means for, in the texts acquired by the text search means, acquiring characters one at a time from the appearance position of each keyword of the keyword examples toward the beginning of the text, comparing the acquired characters with one another, acquiring the next character if they are the same, and repeating the comparison, thereby specifying an appearance pattern common to the keyword examples;
keyword candidate extraction means for analyzing the texts based on the appearance position of each keyword of the keyword examples and the appearance pattern common to the keyword examples, and extracting keyword candidates from the texts; and
keyword extraction means for obtaining, for each keyword among the keyword candidates, an evaluation value (N_A^W / N_A) × log(N_A^W / N^W) from the number N_A of texts acquired by the text search means, the number N_A^W of those texts that contain the keyword candidate, and the number N^W of texts stored in the text storage means, and extracting from the keyword candidates the keywords whose evaluation value is higher than a predetermined threshold.

A vocabulary acquisition program for causing a computer to function as each of the means constituting the vocabulary acquisition apparatus according to claim 2.

A computer-readable storage medium storing the vocabulary acquisition program according to claim 3.