JP2013105364A

JP2013105364A - Document feature extraction device, document feature extraction method, and document feature extraction program

Info

Publication number: JP2013105364A
Application number: JP2011249430A
Authority: JP
Inventors: Masayuki Sugizaki; 正之杉崎; Yamato Takahashi; 大和高橋; Shigeru Fujimura; 滋藤村; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-11-15
Filing date: 2011-11-15
Publication date: 2013-05-30
Anticipated expiration: 2031-11-15
Also published as: JP5739310B2

Abstract

PROBLEM TO BE SOLVED: To appropriately extract features corresponding to a viewer's browsing intention by using the reference relationships between structured documents. SOLUTION: A browsing history recording unit 2 of a document feature extraction device 1 records the browsing history of each viewer in a browsing history set DB 3. A feature extraction unit 4 extracts a link and the text related to that link from a link-source structured document included in the browsing history in the DB 3. It then extracts, as the body text, the representative portion of the link-destination structured document that contains the extracted information, and extracts words from that body text. A feature recalculation unit 5 calculates weights for the extracted words. An output unit 6 outputs the extracted words in a priority order corresponding to the weights.

Description

The present invention relates to a technique for extracting features from structured documents such as Web pages.

In recent years, computer networks such as the Internet have made it possible to use large volumes of electronic documents (hereinafter simply "documents") and to distribute information to an unspecified large audience. Documents published on such networks use forms of expression that exploit the characteristics of the medium; on the WWW (World Wide Web), many Web pages are structured documents written in a markup language.

For example, a document written in HTML (HyperText Markup Language) not only describes information but also provides "hyperlinks" for referring to documents written by other people and hosted on other computers. Hyperlinks are used, for example, to supplement the information in a document by relying on another document, or to refer to a document with similar content.

One kind of Web service available on the Internet provides Web pages that are edited automatically for each viewer. The approach mainly adopted is for the user to register in advance the information to be browsed; when information matching the registration is collected, it is provided to the user.

However, requiring users to register information restricts who can use the service and is therefore impractical. General Web sites instead attempt to infer a viewer's intention from the browsing history, on the premise that what a viewer browses reflects the information the viewer wants. For example, one proposed technique takes every browsed document, extracts the words that appear in each page, and assigns an importance to each extracted word. Non-Patent Document 1 proposes a technique that records the content of the browsing record when an individual clips (bookmarks) a page.

Once such "document features based on the browsing history" have been extracted, a similarity can be defined between them and a document the viewer has not yet browsed, making it possible to judge whether that document should be presented to the viewer.

Non-Patent Document 1: Shinsuke Kuroda, Shinsuke Nakajima, Kazutoshi Sumiya, Katsumi Tanaka, "Sharing and Searching User Activities in the Web Environment," Proceedings of the 13th Data Engineering Workshop (DEWS2002), A2-4 (2002)

However, to provide appropriately the information a viewer wants to browse, the viewer's needs must be obtained in some form. A simple way to do so is to extract words that express features based on the browsing history, but appropriate word extraction is often difficult.

For example, if word extraction is attempted over the entire browsing record, the extraction target is every browsed document in its entirety. Words are then extracted from parts of an HTML file other than the body, such as menus and advertisements, and documents that are full of unnecessary words, and thus useless for inferring browsing intention, must also be processed.

One way to address this problem is the extraction of partial document information centered on the "anchor text" in Non-Patent Document 1. When the role of each link differs, however, this approach may fail to capture the viewer's needs properly.

The present invention has been made to solve the above problems of the prior art, and its object is to extract features corresponding to a viewer's browsing intention more appropriately by using the reference relationships between structured documents.

To this end, the present invention extracts the links present in a structured document included in the browsing history, together with the text related to each link. For example, it extracts either the character string enclosed by the link, or the character string in the n-th (n = integer) higher-level block element that contains the URL of the link-destination structured document.

The representative portion of the link-destination structured document that contains the character strings of this extracted information is extracted as the body text. For example, the character strings under the N-th (N = integer) higher-level element (tag) that contains all of the extracted link character strings are extracted as the body text of the link-destination structured document. Words are then extracted from this body text as features by general natural language processing such as morphological analysis. This suppresses the extraction of unnecessary words and makes it possible to extract only information in line with the viewer's intention.

Furthermore, overall features are aggregated from the features of each structured document extracted by the feature extraction means, and the features of each document are weighted and recalculated. For example, the words extracted from a document are weighted in consideration of factors such as the order in which words extracted from previously browsed documents were clicked, so that the characteristic words of each document can be given a priority order.

According to the present invention, features corresponding to a viewer's browsing intention can be extracted more appropriately by using the reference relationships between structured documents.

FIG. 1 is a block diagram showing the configuration of a document feature extraction device according to an embodiment of the present invention.
FIG. 2 is a processing flow diagram of the feature extraction unit.
FIG. 3 shows example documents containing reference information in the form of hyperlinks.
FIG. 4(a) shows an example source of the HTML file (y) of document A in FIG. 3, and FIG. 4(b) shows an example source of the HTML file (x) of document B in FIG. 3.
FIG. 5 shows an example browsing history of documents containing reference information.

A document feature extraction device according to an embodiment of the present invention is described below. The device processes structured documents written in a markup language, mainly Web pages on the WWW, and extracts document features by using reference expressions. The extracted document features are used, for example, to recommend information to viewers.

A processing example for HTML files is described here. Taking the browsing history into account, the feature words of a link-destination document are extracted by reflecting, in the link-destination document, the informative portion of the link-source document.

≪Device configuration≫
A configuration example of the feature extraction device is described with reference to FIG. 1. The feature extraction device 1 is implemented on a computer and has ordinary computer hardware resources, for example a CPU, memory (RAM), and a storage device such as a hard disk drive.

Through the cooperation of these hardware resources with software resources (an OS, applications, and so on), the feature extraction device 1 implements a browsing history recording unit 2, a browsing history set DB 3, a feature extraction unit 4, a feature recalculation unit 5, and an output unit 6. The DB 3 is assumed to be built on the storage device.

The recording unit 2 records in the DB 3 the browsing history of each viewer to be processed, that is, the set of HTML documents that the viewer browsed. Each browsing history is input to the feature extraction device 1 through an input unit (not shown); for example, the logs of a search engine can be used. Specifically, for each viewer the recording unit 2 records the "recording date and time" and the "browsed document" and, to simplify the analysis performed by the feature extraction unit 4 and the feature recalculation unit 5, the anchor text that the viewer clicked, that is, the character string enclosed by the <a> tag (ANCHOR tag) of the HTML file.

The feature extraction unit 4 executes two stages: an information extraction stage, which extracts hyperlinks in an HTML file based on the browsing history in the DB 3, and a body extraction stage, which uses the extracted hyperlinks to extract the body text the viewer wants from the link-destination HTML file included in the browsing history.

The browsing history used in each stage is, in principle, finite. It is not impossible to use everything recorded since the recording unit 2 began recording the browsing history in the DB 3, but it is preferable to use a browsing history in some chosen unit, such as a period ("per day", "per hour") or a span such as "from the start of a search on a search site until the browser is closed". The unit is set according to the specifications of the device. Hereinafter, the term "browsing history" means the browsing history in the configured unit.
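As an illustration of the "per day" unit mentioned above, the following is a minimal Python sketch. The function name `split_by_day`, the `(timestamp, url)` record layout, and the sample timestamps and URLs are all hypothetical and do not appear in the specification.

```python
from datetime import datetime
from itertools import groupby

def split_by_day(records):
    """Split (timestamp, url) records into one browsing history per calendar day."""
    ordered = sorted(records, key=lambda record: record[0])
    return [list(group) for _, group in groupby(ordered, key=lambda record: record[0].date())]

# Hypothetical records; the timestamps and URLs are illustrative only.
records = [
    (datetime(2011, 11, 15, 9, 0), "http://example.com/search.html"),
    (datetime(2011, 11, 15, 9, 5), "http://example.com/item1.html"),
    (datetime(2011, 11, 16, 20, 30), "http://example.com/search.html"),
]
histories = split_by_day(records)  # one list of records per day
```

Other units, such as "per hour" or "from the start of a search until the browser is closed", would need a different grouping key or an explicit session boundary; the sketch covers only the per-day case.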

Each stage uses the information extraction method of Japanese Patent Application No. 2011-166460, filed earlier by the present applicant. First, the information extraction stage extracts, from the HTML file of a link-source document included in the browsing history, the hyperlinks that realize reference relationships and the related text around each link (hereinafter "link-related text").

The anchor text recorded in the browsing history can be used to extract the hyperlink (the HTML ANCHOR tag). Specifically, either the character string enclosed by the hyperlink, that is, the anchor text, may be extracted, or, based on the hierarchical (parent-child) relationship of HTML elements (tags), the character string within the n-th (n = positive integer) higher-level block element that contains the URL of the link-destination document may be extracted.

Next, the body extraction stage compares the character strings of the text in the link-destination document referenced by the hyperlink with the character strings extracted in the information extraction stage, and extracts the representative portion of the link-destination document as the body text. Here, the character strings under the N-th (N = positive integer) higher-level element (parent tag) that contains all of the character strings extracted in the information extraction stage are extracted.

This excludes information from unnecessary portions such as advertisements and menus, so that only the portion corresponding to the body text is extracted. Words are then extracted from this body text as feature words (characteristic words) by general natural language processing such as morphological analysis. This suppresses the extraction of unnecessary words and makes it possible to extract only feature words in line with the viewer's intention.

The feature recalculation unit 5 aggregates feature quantities over the entire browsing history for the feature words extracted by the feature extraction unit 4, and calculates weights for the features of each HTML document. Here, the feature words extracted by the feature extraction unit 4 are weighted in consideration of both the browsing history as a whole and the time axis, expressed as the number of documents connected by clicks.

The calculation uses the "words (feature words) of documents browsed in the past" and the "number of occurrences of each word (feature word) in the entire browsing history". For example, a word that appears many times within the past few clicks of the browsing history can be weighted as a "more characteristic word", which places greater emphasis on the feature words in the documents the viewer selected.

Either of two settings may be used: (a) using the entire browsing history, or (b) using specific documents in the browsing history. In setting (a), it is desirable to treat words that are common across the entire browsing history as "more characteristic feature words". A simple calculation method is to use, as the score, the total number of occurrences of the feature word across the documents in the browsing history.

In setting (b), by contrast, it is desirable to calculate in the direction of excluding feature words that appear throughout the browsing history, in other words so that their scores become smaller. A simple calculation method is to compute the feature word score based on Equation (1). Each feature word is weighted according to the computed score.

The output unit 6 outputs the feature words in a priority order corresponding to the weights assigned by the feature recalculation unit 5. For example, they can be output as a ranking by score. The feature words may be output for each document, or the output may be switched according to setting (a) or (b). The output may go to a monitor, a printer, or another device.
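The ranking by score can be sketched in Python as follows. The input scores are the setting (b) values that this description gives for document (c) of FIG. 5; the function name `rank_feature_words` and the alphabetical tie-break are assumptions made for illustration only.

```python
def rank_feature_words(scores):
    """Return (word, score) pairs ordered from the highest score to the lowest."""
    # Ties are broken alphabetically; this tie-break rule is an assumption.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Setting (b) scores given in this description for document (c) of FIG. 5.
scores_setting_b = {"product A": 1, "Y-shirt": 4, "pink": 4}

for rank, (word, score) in enumerate(rank_feature_words(scores_setting_b), start=1):
    print(rank, word, score)
```

The specification does not state how equal scores are ordered, so any stable tie-break would serve equally well here.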

≪Processing example of feature extraction unit 4≫
A processing example of the feature extraction unit 4 is described with reference to FIG. 2. Here, documents A to D in FIG. 3 form the browsing history to be processed. Documents A to D are HTML files provided by an online shopping site: document A lists products, for example as the result of a product search within the site, and documents B to D describe products 1 to 3 in detail, respectively.

In FIG. 3, the underlines under product names 1 to 3 in document A indicate hyperlinks to documents B to D: the hyperlink on product name 1 refers to document B, the hyperlink on product name 2 refers to document C, and the hyperlink on product name 3 refers to document D.

S01: When processing starts, the HTML file (x) of the document to be processed is acquired from the browsing history in the DB 3. Here, as an example, the HTML file of document B is acquired.

S02: It is checked whether the HTML file (y) visited immediately before reaching the HTML file (x) acquired in S01 exists in the browsing history in the DB 3. If the HTML file (y) does not exist, processing ends; if it exists, the HTML file (y) is acquired from the browsing history in the DB 3 and processing proceeds to S03.

Here, document A refers to document B through a hyperlink, so its HTML file is the immediately preceding HTML file of S02. The HTML file of document A is therefore acquired from the browsing history in the DB 3, and S03 begins.
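The lookup in S02 can be sketched as follows. This is a minimal illustration, assuming that the browsing history is a list of records holding the three items the recording unit 2 stores; the class name `Visit`, the function name `preceding_visit`, and the sample URLs are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Visit:
    recorded_at: datetime        # "recording date and time"
    url: str                     # "browsed document"
    anchor_text: Optional[str]   # anchor text clicked to reach the document, if any

def preceding_visit(history: List[Visit], url: str) -> Optional[Visit]:
    """Return the visit recorded immediately before the first visit to url, or None."""
    ordered = sorted(history, key=lambda visit: visit.recorded_at)
    for index, visit in enumerate(ordered):
        if visit.url == url:
            return ordered[index - 1] if index > 0 else None
    return None

# Hypothetical history: document A (product list) followed by document B (product 1).
history = [
    Visit(datetime(2011, 11, 15, 9, 0), "http://example.com/A.html", None),
    Visit(datetime(2011, 11, 15, 9, 5), "http://example.com/B.html", "product name 1"),
]
y = preceding_visit(history, "http://example.com/B.html")  # the visit to document A
```

When `preceding_visit` returns `None`, the flow corresponds to the branch of S02 in which processing ends.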

S03: The hyperlink and the link-related text are extracted from the HTML file (y) acquired in S02 and are used to extract the representative portion of the HTML file (x) as the body text. S03 executes the information extraction stage and the body extraction stage described above. Each stage is described in detail below.

・Information extraction stage
The information extraction stage extracts, from the HTML file (y) of document A, extraction information defined by either (1) or (2) below.
(1) The character string enclosed by the hyperlink (the HTML ANCHOR tag), including the URL
(2) The character string within the block element that contains the URL of the link-destination document (document B) and is the n-th (n = positive integer) block element going up the hierarchy
The scope of the extracted information can be changed through the choice between definitions (1) and (2) and through the value of n. For example, choosing definition (1) extracts the minimum scope, namely the hyperlink (the character string enclosed by the HTML ANCHOR tag). Choosing definition (2) with n set to its maximum value (the number of levels in the HTML element hierarchy) extracts the maximum scope, namely the full text (the entire content of the HTML BODY tag).

HTML elements (tags) form a hierarchical structure in which one element contains another, which in turn contains another. This hierarchy is commonly described in terms of a parent-child relationship, with parent elements, child elements, grandchild elements, and so on. Every element takes part in such a relationship, and "n-th higher-level" in definition (2) refers to this parent-child relationship between elements.

There are two kinds of elements: block elements, which form a block in the display (a heading, a paragraph, and so on), and inline elements, which are displayed as a continuous part of a block element. In the HTML file (y) of document A shown in FIG. 4(a), DIV is a block element, while SPAN and A (ANCHOR) are inline elements.

Seen from the A (ANCHOR) tag, the DIV tag with "id=TR" is the first block element encountered going up the hierarchy, and the DIV tag with "id=TABLE" is the second. The href attribute of the A (ANCHOR) tag specifies the URL of the link destination.

An example of extraction from the HTML file (y) of document A follows. As shown in FIG. 4(a), the HTML file (y) of document A contains three A (ANCHOR) tags, so it contains three hyperlinks. As their href attributes show, these A (ANCHOR) tags refer to documents B to D, respectively. Here, as an example, the extraction scope is set to definition (2) with n = 1. The condition "human-readable text only, excluding tags" is also assumed.

Seen from each A (ANCHOR) tag, the corresponding DIV tag with "id=TR" is the first block element going up the hierarchy, so the character strings enclosed by the SPAN tags under it are extracted. For document B, for example, what is extracted is the content under the uppermost "DIV id=TR": the URL of document B given in the href attribute and the link-related text "product name 1", "price: 100 yen", and "color: red".
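The information extraction stage can be sketched in Python with the standard `html.parser` module. This is an illustrative sketch, not the method of the referenced application: the HTML below is a hypothetical reconstruction of FIG. 4(a) based only on the description above, the URLs and the second product's values are invented, and `BLOCK_TAGS`, `Node`, `TreeBuilder`, and `link_related_text` are assumed names.

```python
from html.parser import HTMLParser

# Assumed subset of HTML block elements; the description names only DIV.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "tr", "td", "body"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # elements with no end tag

class Node:
    def __init__(self, tag, attrs, parent):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []  # child Nodes and text strings in document order

    def texts(self):
        """Human-readable text fragments under this node, tags excluded."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                found.extend(child.texts())
            elif child.strip():
                found.append(child.strip())
        return found

class TreeBuilder(HTMLParser):
    """Builds a simple element tree; assumes well-formed, properly nested HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root", [], None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.children.append(data)

def find_anchor(node, url):
    """Depth-first search for the <a> element whose href equals url."""
    if node.tag == "a" and node.attrs.get("href") == url:
        return node
    for child in node.children:
        if isinstance(child, Node):
            found = find_anchor(child, url)
            if found is not None:
                return found
    return None

def link_related_text(html, url, n=None):
    """Definition (1) when n is None; definition (2) with the given n otherwise."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    anchor = find_anchor(builder.root, url)
    if anchor is None:
        return []
    if n is None:
        return anchor.texts()          # anchor text only
    node, seen = anchor.parent, 0
    while node is not None:            # walk up, counting block elements
        if node.tag in BLOCK_TAGS:
            seen += 1
            if seen == n:
                return node.texts()
        node = node.parent
    return builder.root.texts()        # n exceeds the nesting depth: whole document

# Hypothetical reconstruction of document A; URLs and product 2 values are invented.
DOC_A = """
<div id="TABLE">
  <div id="TR"><span><a href="http://example.com/B.html">product name 1</a></span><span>price: 100 yen</span><span>color: red</span></div>
  <div id="TR"><span><a href="http://example.com/C.html">product name 2</a></span><span>price: 200 yen</span><span>color: blue</span></div>
</div>
"""
related = link_related_text(DOC_A, "http://example.com/B.html", n=1)
```

With n = 1 the sketch returns the three strings under the first "id=TR" DIV; with n = 2 it returns the text of the whole "id=TABLE" DIV, which shows how a larger n widens the extraction scope.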

・Body extraction stage
The body extraction stage extracts, as the body text, the representative portion of the link-destination document, that is, the HTML file (x), that contains all of the character strings extracted in the information extraction stage. This requires a condition that determines how much of the link-destination document counts as the representative portion, that is, the body text. In principle, any portion that contains the extracted character strings would do, but using every character string in the HTML file (x) risks increasing the data volume with useless information.

The scope of the representative portion is therefore specified as follows: the parent element (tag) that contains all of the character strings extracted in the information extraction stage and is the N-th (N = positive integer) such element going up the hierarchy is located, and the character strings under it are taken as the body text. Changing this body extraction condition, that is, the value of N, changes the character strings extracted from the HTML file as the body text.

A processing example for the HTML file (x) of document B follows. Here the body extraction condition is set to N = 1, so the first such parent element is searched for. The storage device is assumed to hold the extracted information "product name 1", "price: 100 yen", and "color: red" obtained in the information extraction stage.

In the HTML file (x) of document B, as shown in FIG. 4(b), "product name 1" is enclosed by the SPAN tag with "id=name", "price: 100 yen" by the SPAN tag with "id=price", and "color: red" by the SPAN tag with "id=color". The search for the "first parent tag that contains all of them" returns the DIV tag.

The "character strings under it" are therefore the character strings under the DIV tag. The condition "human-readable text only, excluding tags" is assumed here as well. As a result, the character string "product name 1, photo, model number: 123456, price: 100 yen, color: red, summary: selling well" is extracted as the representative portion of document B, that is, its body text.
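The body extraction stage can be sketched in the same way. The sketch below is an assumption-laden illustration: the HTML is a hypothetical reconstruction of FIG. 4(b) (the menu and advertisement blocks are invented to show what gets excluded), and `Node`, `TreeBuilder`, `contains_all`, and `extract_body` are assumed names. It treats the nearest element that contains every extracted string as N = 1 and climbs one parent per additional level.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # elements with no end tag

class Node:
    def __init__(self, tag, attrs, parent):
        self.tag, self.attrs, self.parent = tag, dict(attrs), parent
        self.children = []  # child Nodes and text strings in document order

    def texts(self):
        """Human-readable text fragments under this node, tags excluded."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                found.extend(child.texts())
            elif child.strip():
                found.append(child.strip())
        return found

class TreeBuilder(HTMLParser):
    """Builds a simple element tree; assumes well-formed, properly nested HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root", [], None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.children.append(data)

def contains_all(node, phrases):
    """True if every phrase occurs in some text fragment under node."""
    fragments = node.texts()
    return all(any(phrase in fragment for fragment in fragments) for phrase in phrases)

def extract_body(html, phrases, N=1):
    """Text under the N-th enclosing element that contains all phrases."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = builder.root
    if not contains_all(node, phrases):
        return []
    while True:  # descend while a single child element still contains every phrase
        inner = next((child for child in node.children
                      if isinstance(child, Node) and contains_all(child, phrases)), None)
        if inner is None:
            break
        node = inner
    for _ in range(N - 1):  # node is the 1st enclosing element; climb N-1 levels
        if node.parent is not None:
            node = node.parent
    return node.texts()

# Hypothetical reconstruction of document B; the menu and ad blocks are invented.
DOC_B = """
<body>
  <ul id="menu"><li>Home</li><li>Cart</li></ul>
  <div id="item">
    <span id="name">product name 1</span>
    <span>photo</span>
    <span>model number: 123456</span>
    <span id="price">price: 100 yen</span>
    <span id="color">color: red</span>
    <span>summary: selling well</span>
  </div>
  <div id="ad">Sale now on</div>
</body>
"""
phrases = ["product name 1", "price: 100 yen", "color: red"]
body = extract_body(DOC_B, phrases, N=1)
```

With N = 1 the menu and advertisement text is left out; with N = 2 the sketch climbs to the BODY element and those strings come back in, which is the trade-off the value of N controls.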

S04: Words are extracted as feature words (characteristic words) from the body text extracted in S03 by general natural language processing such as morphological analysis, and processing ends.

≪Processing example of feature recalculation unit 5≫
A processing example of the feature recalculation unit 5 is described based on the browsing record in FIG. 5. This browsing history is assumed to be recorded in the DB 3 as the browsing history of viewer A.

In FIG. 5, document (a) is the search page [http://a.co.jp/top.htm] of an online shopping site, and document (b) is the search result page [http://a.co.jp/detail.html] returned in response to a search request from the search page of document (a).

Document (c) is the detail page [http://a.co.jp/detail.html] for product A, referenced by the "product A" hyperlink displayed in document (b), and document (d) is the thanks page [http://a.co.jp/purchase.html], referenced by the "purchase" hyperlink displayed in document (c).

According to the browsing history in FIG. 5, viewer A therefore requested a search for "product A" on the search page of the online shopping site, document (a), and browsed the search result page, document (b), as shown by arrow (1). Viewer A then clicked the "product A" hyperlink displayed in document (b) and browsed document (c), the detail page for product A, as shown by arrow (2).

To purchase product A, viewer A then clicked the "purchase" hyperlink displayed in document (c) and browsed document (d), the thanks page confirming the purchase of product A, as shown by arrow (3).

The feature extraction unit 4 processes documents (a) to (d) of the browsing history recorded in the DB 3 and, as in S01 to S04, uses the "link source - link destination" relationship to extract the feature words of documents (b) to (d). Here it is assumed that "product A" is extracted as the feature word of document (b); "product A", "Y-shirt", and "pink" as the feature words of document (c); and "purchase details" and "product A" as the feature words of document (d).

The feature recalculation unit 5 calculates a weight for each feature word that the feature extraction unit 4 extracted from each document. As described above, this calculation uses the "words (feature words) of documents browsed in the past" and the "number of occurrences of each word (feature word) in the entire history". As an example, the weights of the feature words of document (c) are calculated here using the two most recent history entries as the "documents browsed in the past". Because this example counts occurrences of feature words using documents (a) and (b), it corresponds to setting (b) above.

In that case, the weights for the feature words of document (c) are calculated so that feature words that appear throughout the browsing history of FIG. 5 receive a smaller score. Because the feature word "product A" of document (c) appears in documents (a) to (c), its score is smaller than those of the other feature words, "Y-shirt" and "pink".
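The down-weighting idea can be sketched as follows. The formula used here (occurrence count divided by one plus the number of past documents containing the word) is a hypothetical stand-in chosen only to show the ordering effect; it is not the patent's equation (1), and it is not meant to reproduce the patent's numeric scores. The function name `downweighted_scores` and the one-occurrence-per-document counts are likewise assumptions.

```python
def downweighted_scores(current_words, past_documents):
    """Score the words of the current document, penalizing words already seen.

    current_words:  dict mapping each feature word to its count in the current document
    past_documents: list of sets, one set of feature words per past document
    """
    scores = {}
    for word, count in current_words.items():
        # Number of past documents in which this word already appeared.
        seen = sum(1 for doc in past_documents if word in doc)
        # Hypothetical weight: the more often the word was seen, the lower the score.
        scores[word] = count / (1 + seen)
    return scores


# Past two history entries, documents (a) and (b); both contain "product A".
doc_a = {"product A"}
doc_b = {"product A"}
# Feature words of document (c), assumed to occur once each.
doc_c = {"product A": 1, "Y-shirt": 1, "pink": 1}

scores = downweighted_scores(doc_c, [doc_a, doc_b])
```

Under these assumptions "product A" scores lower than "Y-shirt" and "pink", which matches the qualitative behavior described above.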

Specifically, the weighting score of each feature word of document (c) is calculated by equations (2) to (4). Each of these equations is equation (1) multiplied by the number of occurrences of the feature word in document (c).

Equations (2) to (4) give the feature words of document (c) the scores "Y-shirt = 4, pink = 4, product A = 1". In contrast, when the entire browsing history is used, that is, under setting (a) above, the weighting score is calculated by totaling the occurrences of each feature word across documents (a) to (d). For example, if the score is obtained simply by summing the occurrence counts, the result is "product A = 4, Y-shirt = 1, pink = 1".
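The simple summation for setting (a) can be checked with a short sketch. The per-document word lists below follow the example in the text, with the assumptions that each feature word occurs once per document and that document (a) contributes "product A" (the searched term); the variable names are hypothetical.

```python
from collections import Counter

# Feature words per document in the browsing history of FIG. 5
# (assumed to occur once each in every document that contains them).
history = {
    "a": ["product A"],
    "b": ["product A"],
    "c": ["product A", "Y-shirt", "pink"],
    "d": ["purchase content", "product A"],
}

# Setting (a): total the occurrences of every feature word over the whole history.
totals = Counter(word for words in history.values() for word in words)

# Scores for the feature words of document (c).
scores_c = {word: totals[word] for word in history["c"]}
```

With these inputs `scores_c` maps "product A" to 4 and both "Y-shirt" and "pink" to 1, so plain summation ranks the word common to every page highest, the opposite of the ordering produced by equations (2) to (4).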

Each feature word is weighted according to its calculated score, and the feature words are output through the output unit 6 in the priority order given by that weighting. The feature extraction device 1 can therefore use the "link source-link destination" relationship between documents to extract feature words that better reflect the viewer's browsing intention. This makes it possible to extract, from the viewer's browsing history, the "characteristic words" that serve as basic data for making recommendations to the viewer.
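Output in priority order amounts to sorting the feature words by score. A minimal sketch, using the scores stated for document (c) in the example and a hypothetical variable layout:

```python
# Scores of the feature words of document (c), as given in the example.
scores = {"product A": 1, "Y-shirt": 4, "pink": 4}

# Highest score first; Python's sort is stable, so tied words keep their input order.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for word, score in ranked:
    print(f"{word}\t{score}")
```

How ties are broken is not specified in the text; keeping the input order is an assumption of this sketch.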

Note that the present invention is not limited to HTML documents; structured documents written in other markup languages, such as XML, can also be included in the browsing history. In that case, the feature extraction unit 4 may use XLink (XML Linking Language), which defines links between XML documents.
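For XML documents, the link and its related text could be collected from XLink attributes, as in the sketch below. It uses Python's standard `xml.etree.ElementTree` and the XLink namespace `http://www.w3.org/1999/xlink`; the function name `extract_xlinks`, the element names `catalog` and `item`, and the sample document are hypothetical, and the sketch handles only elements that carry an `xlink:href` attribute.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def extract_xlinks(xml_text):
    """Return (href, text) pairs for every element carrying an xlink:href attribute."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        # ElementTree exposes namespaced attributes as "{namespace}localname".
        href = element.get(f"{{{XLINK_NS}}}href")
        if href is not None:
            text = "".join(element.itertext()).strip()
            links.append((href, text))
    return links


# Hypothetical XML counterpart of the search result page.
sample = (
    '<catalog xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<item xlink:type="simple" xlink:href="http://a.co.jp/detail.html">product A</item>'
    "</catalog>"
)
xlinks = extract_xlinks(sample)
```

The resulting pairs play the same role as the hyperlink and related text extracted from an HTML link source.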

≪Programs, etc.≫
The present invention can also be configured as a document feature extraction program that causes a computer to function as some or all of the units 2 to 6 of the feature extraction device 1. With this program, a computer can execute part or all of the processing procedures of the units 2 to 6.

The program can be provided over a network, for example through a website or by e-mail. It can also be recorded on a recording medium such as a CD-ROM, DVD-ROM, CD-R, CD-RW, DVD-R, DVD-RW, MO, HDD, BD-ROM, BD-R, or BD-RE for storage and distribution. Such a recording medium is read using a recording medium drive, and the program code itself realizes the processing of the above embodiment, so the recording medium also constitutes the present invention.

DESCRIPTION OF SYMBOLS
1 ... Document feature extraction device
2 ... Browsing history recording unit
3 ... Browsing history set DB
4 ... Feature extraction unit (feature extraction means)
5 ... Feature recalculation unit (feature recalculation means)
6 ... Output unit

Claims

A document feature extraction device that extracts features of each structured document based on a browsing history of structured documents in which links realizing reference relationships are expressed, the device comprising:
feature extraction means for extracting a link and the related text of the link from a link-source structured document included in the browsing history, extracting, as a body, a representative portion of the link-destination structured document that contains the extracted information, and extracting features from the extracted body; and
feature recalculation means for aggregating overall features from the features of each structured document extracted by the feature extraction means, and recalculating the features of each document by weighting them.

A document feature extraction method executed by a device that extracts features of each structured document based on a browsing history of structured documents in which links realizing reference relationships are expressed, the method comprising:
a feature extraction step of extracting a link and the related text of the link from a link-source structured document included in the browsing history, extracting, as a body, a representative portion of the link-destination structured document that contains the extracted information, and extracting features from the extracted body; and
a feature recalculation step of aggregating overall features from the features of each structured document extracted in the feature extraction step, and recalculating the features of each document by weighting them.

A document feature extraction program for causing a computer to function as the document feature extraction device according to claim 1.