JPH11328218A

JPH11328218A - Contents attribute information normalization method, information collecting/service providing system, attribute information setting device and program storage recording medium

Info

Publication number: JPH11328218A
Application number: JP10146539A
Authority: JP
Inventors: Tomoharu Hikita; 智治疋田; Masaaki Matsumoto; 政昭松本; Noriko Fujishiro; 典子藤代
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-05-12
Filing date: 1998-05-12
Publication date: 1999-11-30
Anticipated expiration: 2018-05-12
Also published as: JP4042830B2

Abstract

PROBLEM TO BE SOLVED: To construct a database for providing a service without being restricted by the structure and format of an information provider's browsing documents, by normalizing the attribute structure of content attribute information and then normalizing its character expression format and numerical expressions. SOLUTION: An automatic information collection unit 101 of an automatic information collection and classification device 100 crawls the Web sites 120 of information providers on a network 110, collects document files, and extracts information. An attribute extraction unit 102 normalizes the character code and then extracts only the content attribute information from the collected documents. An attribute normalization unit 103 refers to normalization rules 106 and normalizes the content attribute information extracted by the attribute extraction unit 102, whose structure and format follow the browsing document, into a form suited to a search service or the like. The character expression format and the numerical expressions of the structure-normalized content attribute information are then normalized.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a technology for automatically collecting content information distributed over a network and reusing it for search services and the like. More specifically, it relates to a method of integrating and organizing content information collected from a plurality of Web sites and of extracting and normalizing the attributes required for a search service or the like; to an information collection and service providing system, such as a search engine, to which the method is applied; to an attribute information setting device that supports the creation and editing of document files; and to a recording medium storing a program for normalizing content attribute information.

[0002]

2. Description of the Related Art

A few common conventional techniques for automatically collecting content information distributed over a network and reusing it for search services and the like are described below.

[0003] FIG. 24 shows a search engine system that does not take content attribute information into account. In the figure, 2400 is a search engine, 2410 is a network, 2420 denotes Web sites distributed over the network, and 2430 is a user terminal. Each Web site 2420 holds browsing document files (generally written in HTML) prepared by an information provider (IP). Under the control of a control unit 2403, an automatic information collection unit 2401 of the search engine 2400 crawls the Web sites 2420 on the network 2410 and collects the browsing document files (HTML documents). An analysis unit 2402 then analyzes each document file, creates summary information, a keyword index, and the like for each page, and stores this per-page information in a content database (content DB) 2404. When a search request arrives from the user terminal 2430, a search unit 2405 searches the content DB 2404 and returns the search results to the user terminal 2430. Based on the results, the user of the user terminal 2430 directly browses the document files on the Web sites 2420 if necessary.

[0004] FIG. 25 shows the structure of a document file (HTML document) targeted by this conventional technique, and FIG. 26 shows, as a concrete example, a product catalog written in HTML. As shown in FIG. 25(a) and FIG. 26, the target document file (HTML document) stores multiple contents; the boundaries between them are unclear, the field of each content is unclear, and the attribute information of the contents is embedded in the text. The analysis unit 2402 extracts the necessary information from the document file by natural language analysis or the like, but can do so only with low accuracy, so the pages end up stored in the content DB 2404 in a disorganized manner. As shown in FIG. 25(b), a document can be represented as a tree structure, but a browsing document (HTML document) is structured according to its display style or the logical structure of its text, so it is difficult to extract from it attribute information suited to service provision (a search service).

[0005] FIG. 27 shows an automatic collection and classification system in which the content attribute information is fixed. In the figure, 2700 is an automatic information collection and classification device, 2710 is a network, 2720 denotes Web sites distributed over the network, and 2730 is a user terminal. In this conventional example, content attribute information is prepared on the Web site 2720 separately from the document body (HTML document). FIG. 28 shows an example of the document configuration targeted by this conventional technique. It is also possible to include the content attribute information within the browsing document by marking up character strings in the browsing document with tags as content attribute information.

[0006] In the automatic information collection and classification device 2700, under the control of a control unit 2704, an automatic information collection unit 2701 crawls the Web sites 2720 on the network 2710 and collects the relevant files (document files and attribute information files), a separation unit 2702 separates out the content attribute information, an attribute extraction unit 2703 interprets the content attribute information, and the content attribute information is stored almost as is in a content DB 2705. A service providing unit 2706 operates in the same way as the search unit 2405 of FIG. 24. In this conventional example, because the content attribute information is prepared separately from the browsing information, it can be given a structure convenient for service provision.

[0007] FIG. 29 shows an automatic collection and classification system that associates content attribute information with browsing tags and character strings. In the figure, 2900 is an automatic information collection and classification device, 2910 is a network, 2920 denotes Web sites distributed over the network, and 2930 is a user terminal. As in the search engine system of FIG. 24, the Web sites 2920 hold document files (HTML documents) containing browsing information only. The automatic information collection and classification device 2900, however, holds in advance correspondence rules 2904 that map the browsing tags and character strings in browsing document files to content attributes, and refers to these rules to extract content attributes from the browsing documents. An example of a correspondence rule 2904 is a rule stating that a price always appears immediately before the character "円" (yen).
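A correspondence rule of the kind mentioned in [0007] can be illustrated with a short sketch. The regular expression, the function name `extract_price`, and the sample strings are assumptions made for illustration; the patent does not specify how the rules 2904 are encoded.

```python
import re

# Hypothetical encoding of one correspondence rule: a run of digits
# (optionally with thousands separators) immediately before "円" (yen)
# is treated as the value of the "price" attribute.
PRICE_RULE = re.compile(r"([0-9][0-9,]*)\s*円")

def extract_price(text):
    """Return the first price found in text as an int, or None if the rule does not match."""
    match = PRICE_RULE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))
```

Such a rule only works when the page author writes prices in the expected form, which is the constraint on browsing documents that the later paragraphs criticize.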

[0008] In the automatic information collection and classification device 2900, under the control of a control unit 2903, an automatic information collection unit 2901 crawls the Web sites 2920 on the network 2910 and collects the browsing document files (HTML documents). An attribute extraction unit 2902 refers to the correspondence rules 2904, which relate the character strings and the tags representing the browsing structure in a document file to attributes, extracts the content attribute information from the document file, and stores it in a content DB 2905. A service providing unit 2906 operates basically in the same way as the search unit 2405 of FIG. 24. In this conventional example, attribute information can be stored in the content DB 2905 per content (for example, per product), so content-field-specific search, attribute search, and association search are possible.

[0009] Next, the attribute information setting device is described. An attribute information setting device is a device that sets content attribute information by marking up character strings in text information. General structured document creation devices (for example, so-called SGML editors) have equivalent functions.

[0010] FIG. 30 is a block diagram of a conventional attribute information setting device, which comprises an overall menu unit 3001, an attribute setting unit 3002, an attribute deletion unit 3003, an attribute range changing unit 3005, a file input unit 3006, a file output unit 3007, a structure verification unit 3008, and so on. The target document file is input through the file input unit 3006. While viewing the editor screen of the overall menu unit 3001, the user sets, deletes, changes, and changes the range of content attribute information using the functions of the attribute setting unit 3002, the attribute deletion unit 3003, the attribute changing unit 3004, the attribute range changing unit 3005, and so on, and outputs the result through the file output unit 3007. The document object is managed without distinguishing browsing tags from content attribute tags and is verified by the structure verification unit 3008.

[0011] FIG. 31 is a conceptual diagram of the browsing document structure and the content attribute information of a document with attribute tags (generally an XML document). In the figure, white circles denote browsing tags and black circles denote content attribute tags. The structure verification unit 3008 of the conventional attribute information setting device 3000 manages the browsing document and the content attribute information together. Therefore, if it can verify only parent-child relationships, for example, then in a case where there are no grammatical restrictions between white circles and black circles but there are restrictions among white circles and among black circles, those restrictions may not be verifiable. The restrictions consequently have to be loosened, and effective grammar verification is not possible.

[0012]

PROBLEMS TO BE SOLVED BY THE INVENTION

Among the conventional techniques described above, the search engine system that does not take content attribute information into account has the following problems.
(1) The boundaries between contents (for example, product information) within a document are unknown, so classification and search are possible only on a per-page basis.
(2) The field of the content cannot be specified, so search results contain a lot of noise. For example, a user who wants to buy sake and searches with the keyword "sake" (日本酒) gets not only mail-order sake shops but also pages such as sake enthusiasts' trivia.
(3) Content attributes (for example, the price or color of a product) cannot be recognized, so search by attribute is not possible. For example, a search for "sake costing 3,000 yen or less" cannot be performed. Reuse of the information is also difficult; that is, attributes cannot be used to associate the information with other databases.

[0013] In contrast, in the automatic collection and classification system with fixed content attribute information, content attribute information is provided within the document or separately from it, and the system can obtain this attribute information. This enables classification per content, search restricted to a content field, search by content attribute (for example, a search for "sake costing 3,000 yen or less"), and association with other data using content attributes. The problems of the search engine system that ignores content attribute information are thus largely eliminated.
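Once attributes are stored per content, the query from [0013] reduces to a filter on field and price. The following is a minimal sketch; the `Content` record layout and the `search` function are hypothetical and are not the schema of any content DB described in the patent.

```python
from dataclasses import dataclass

@dataclass
class Content:
    field: str       # content field, e.g. "sake" (hypothetical label)
    name: str
    price_yen: int   # price already available as a number

def search(contents, field, max_price_yen):
    """Return the contents in the given field whose price does not exceed max_price_yen."""
    return [c for c in contents if c.field == field and c.price_yen <= max_price_yen]
```

A page-level keyword index cannot answer this query, because neither the content boundaries nor the numeric price are known to it.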

[0014] However, in this automatic collection and classification system with fixed content attribute information, the structure and expression format of a browsing document generally do not match the structure and format of attribute information suited to service provision, so the following new problems arise.
(1) When content attribute information is prepared separately from the browsing document (so-called metadata), the information obtained by browsing the document file directly may not match the information obtained by automatic collection. This is especially likely for a document file describing multiple contents (such as a product catalog), because the same information has to be written twice.
(2) When character strings in the browsing document are marked up with tags as content attribute information (the usual way of using XML), either the document structure and description format within the document file must be restricted, or conversely the information for service provision must be given the same structure and description format as the document file. In particular, if XML is simply used as is, the way attributes are assigned to the same content, the attribute names, the description formats, and so on differ from author to author, and there is no guarantee that attributes suitable for other services have been assigned. That is, documents collected from multiple computers (sites) cannot be stored in a common structure (such as a DB table) designed with the provision of other services in mind.

[0015] On the other hand, in the system that associates content attribute information with browsing tags and character strings, the browsing document file holds only browsing information. The system holds in advance the correspondence rules between the browsing tags and character strings in the file and the content attributes, and refers to them to extract content attributes from the browsing document. Therefore, provided that the correspondence rules are correct, the problems of the search engine system that ignores content attribute information can be solved. Problem (1) of the automatic collection and classification system with fixed content attribute information does not arise either. However, the browsing documents must conform to the correspondence rules, which imposes strong constraints on their structure. In the case of a product catalog or the like, this limits the freedom of expression toward consumers and is a serious problem. It is equivalent to problem (2) of the automatic collection and classification system with fixed content attribute information.

[0016] Next, the conventional attribute information setting device has the following problems.
(1) The author can see how the document appears when an ordinary user browses it, but the device has no function for showing the result of attribute extraction and normalization (of structure and value format) by an automatic collection and classification device or the like, so the data creator cannot know how the data will be used.
(2) Because browsing tags and content attribute tags are not managed separately, verifying the tag structure becomes complicated. As shown in FIG. 31, restrictions among browsing tags and among content attribute tags are generally relatively strong, whereas there are often only loose restrictions between browsing tags and content attribute tags (for example, roughly the well-formedness constraints of XML). In that case, because the two kinds of tags are not managed separately, a structure verification unit that can verify only parent-child relationships has to adopt loose restrictions overall and does not function effectively.

[0017] An object of the present invention is to solve the above problems of the conventional techniques and to realize a service in which content information distributed over a network can be freely reused for purposes other than mere browsing.

[0018] More specifically, an object of the present invention is to allow the information format used when providing a service and the structure and format of the browsing documents prepared by information providers to be set freely, so that content information distributed over a network can be collected automatically and a database for service provision can be built without being bound by the structure and format of the information providers' browsing documents.

[0019] Another object of the present invention is, in a device that sets content attribute information by marking up character strings in text information, to allow the data creator to check how the data will be used and to enable stricter grammar verification and the like of the browsing document structure and the content attribute information structure.

[0020]

MEANS FOR SOLVING THE PROBLEMS

To achieve the above objects, the invention of claim 1 is a method of normalizing content attribute information whose structure and format follow a browsing document into a structure and format that do not depend on the browsing document, the method comprising: a step of extracting content attribute information from a document file in which content attribute information is mixed with browsing information; a step of normalizing the attribute structure of the extracted content attribute information; and a step of normalizing the character expression format and the numerical expressions of the structure-normalized content attribute information.
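The three steps of claim 1 can be sketched as a small pipeline. This is an illustration under stated assumptions, not the patented implementation: the tag names, the normalized attribute names, the use of Unicode NFKC folding for "character expression" normalization, and the digit-stripping price rule are all hypothetical choices.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical attribute tags and normalized attribute names (not from the patent).
ATTRIBUTE_TAGS = {"name", "price"}
NORMALIZED_NAMES = {"name": "product_name", "price": "price_yen"}

def extract_attributes(xml_text):
    """Step 1: pick out only the attribute-tagged text, ignoring browsing tags."""
    root = ET.fromstring(xml_text)
    return [(e.tag, "".join(e.itertext())) for e in root.iter() if e.tag in ATTRIBUTE_TAGS]

def normalize_structure(pairs):
    """Step 2: map document-specific tags onto the normalized attribute names."""
    return {NORMALIZED_NAMES[tag]: value for tag, value in pairs}

def normalize_values(record):
    """Step 3: fold full-width characters (NFKC), then reduce prices to integers."""
    result = {}
    for key, value in record.items():
        text = unicodedata.normalize("NFKC", value).strip()
        if key == "price_yen":
            digits = re.sub(r"[^0-9]", "", text)
            result[key] = int(digits) if digits else None
        else:
            result[key] = text
    return result

def normalize_document(xml_text):
    """Run the three steps in order on one document fragment."""
    return normalize_values(normalize_structure(extract_attributes(xml_text)))
```

The ordering matters in this sketch: value normalization is keyed on the normalized attribute name, so it can only run after the structure step has settled which attribute each value belongs to.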

[0021] The invention of claim 2 is the content attribute information normalization method of claim 1, wherein the attribute structure normalization processing performs content expansion, attribute name normalization, attribute splitting, and normalization to other attributes.
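The operations named in claim 2 are not defined in detail at this point in the text, so the sketch below rests on one plausible reading, stated here as an assumption: content expansion turns one record with a multi-valued attribute into one record per value, attribute name normalization renames author-specific names, and attribute splitting breaks a compound value into separate attributes. Normalization to other attributes is not illustrated. All function names and sample keys are hypothetical.

```python
def expand_contents(record, key):
    """Content expansion (assumed meaning): one record per element of a list-valued attribute."""
    values = record.get(key)
    if not isinstance(values, list):
        return [dict(record)]
    return [{**record, key: value} for value in values]

def rename_attributes(record, name_map):
    """Attribute name normalization: replace author-specific names with common ones."""
    return {name_map.get(k, k): v for k, v in record.items()}

def split_attribute(record, key, separator, new_keys):
    """Attribute splitting: break one compound value into several attributes."""
    result = dict(record)
    parts = [part.strip() for part in result.pop(key).split(separator)]
    result.update(zip(new_keys, parts))
    return result
```

For example, a catalog entry that lists two colors for the same product would become two contents, each with a single `color` value, before being stored.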

[0022] The invention of claim 3 is the content attribute information normalization method of claim 1 or 2, wherein the normalization rules for the normalization processing comprise field-independent and attribute-independent rules, field-dependent and attribute-independent rules, field-independent and attribute-dependent rules, and field-dependent and attribute-dependent rules, and are managed by content field and attribute name.
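One way to manage the four rule classes of claim 3 by content field and attribute name is a table keyed on a (field, attribute) pair with a wildcard. The sketch below is an assumption about a possible layout: the rule names are placeholders, and the general-to-specific ordering of the result is a design choice made here, not something stated in the claim.

```python
ANY = None  # wildcard meaning "independent of field" or "independent of attribute"

# Hypothetical rule table; the values are placeholder rule names.
RULES = {
    (ANY, ANY): ["fold_fullwidth"],           # field-independent, attribute-independent
    ("sake", ANY): ["strip_brand_prefix"],    # field-dependent, attribute-independent
    (ANY, "price"): ["digits_only"],          # field-independent, attribute-dependent
    ("sake", "volume"): ["to_millilitres"],   # field-dependent, attribute-dependent
}

def applicable_rules(field, attribute):
    """Collect the rules that apply to one attribute of one field, general rules first."""
    keys = [(ANY, ANY), (field, ANY), (ANY, attribute), (field, attribute)]
    rules = []
    for key in keys:
        rules.extend(RULES.get(key, []))
    return rules
```

Keeping the general rules in a single shared entry means that a new content field needs only its field-specific additions.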

[0023] The invention of claim 4 is an information collection and service providing system in which Web sites of information providers, a host device, and user terminals are connected via a network, wherein the host device comprises: means for automatically collecting document files from the Web sites distributed over the network; means for extracting the content attribute information contained in the collected document files mixed with browsing information; means for normalizing the extracted content attribute information into a structure and format suited to service provision; means for storing the normalized content attribute information; and means for providing a service in response to requests from the user terminals using the stored content attribute information.

[0024] The invention of claim 5 is an attribute information setting device that sets content attribute information by marking up character strings in a browsing document, the device comprising: means for extracting and normalizing the content attribute information and previewing the result; and means for managing browsing tags and content attribute tags separately and verifying the browsing information and the content attribute information while each ignores the other.
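Verifying each kind of tag while ignoring the other can be sketched as a tree walk in which tags of the other kind are treated as transparent. The tag sets and the parent-child rule tables below are hypothetical examples; the patent does not give a concrete grammar here.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag classes and parent -> allowed-children rules.
# The key None stands for "no enclosing tag of the same class".
VIEW_TAGS = {"html", "body", "ul", "li", "b"}
ATTR_TAGS = {"item", "name", "price"}
VIEW_RULES = {None: {"html"}, "html": {"body"}, "body": {"ul", "b"},
              "ul": {"li"}, "li": {"b"}, "b": set()}
ATTR_RULES = {None: {"item"}, "item": {"name", "price"},
              "name": set(), "price": set()}

def check(elem, tags, rules, parent=None):
    """Return (parent, child) pairs that violate `rules`, skipping tags outside `tags`."""
    errors = []
    if elem.tag in tags:
        if elem.tag not in rules.get(parent, set()):
            errors.append((parent, elem.tag))
        parent = elem.tag  # only tags of this class become the new parent
    for child in elem:
        errors.extend(check(child, tags, rules, parent))
    return errors
```

Because each pass sees only its own tag class, strict rules such as "a price may appear only inside an item" can be enforced even when browsing tags such as `li` or `b` sit between the two in the actual document.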

[0025] The invention of claim 6 is a computer-readable recording medium on which is recorded a program for normalizing content attribute information whose structure and format follow a browsing document into a structure and format that do not depend on the browsing document, the program comprising: a process of extracting content attribute information from a document file in which content attribute information is mixed with browsing information; a process of normalizing the attribute structure of the extracted content attribute information; and a process of normalizing the character expression format and the numerical expressions of the structure-normalized content attribute information.

[0026]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of an information collection and service providing system, shown as one example of a system to which the present invention is applied, that automatically collects content information distributed over a network, extracts and normalizes attributes, and reuses them for a search service and the like. In the figure, 100 is a host device serving as a search engine or the like (here called an automatic information collection and classification device), 110 is a network such as the Internet, 120 denotes Web sites of information providers (IPs) distributed over the network, and 130 is a user terminal that uses the automatic information collection and classification device 100.

[0027] The Web site 120 is equipped with an attribute information setting device 125 that creates browsing document files (for example, XML documents) containing content attribute information. The attribute information setting device 125 makes it possible to embed content attribute information in browsing documents (generally HTML documents), which can take a wide variety of structures, in such a way that the information is held in a single place. The configuration of the attribute information setting device 125 according to the present invention is described later.

[0028] The automatic information collection and classification device 100 comprises: an automatic information collection unit 101 that crawls the Web sites 120 on the network 110 and automatically collects document files (for example, XML documents) containing content attribute information; an attribute extraction unit 102 that extracts the content attribute information from the collected document files; an attribute normalization unit 103 that normalizes the extracted content attribute information for reuse in search services and the like, by structure conversion, conversion of attribute names and attribute value formats, and so on; a control unit 104 that controls these units; a content database (content DB) 105 that stores the content attribute information normalized by the attribute normalization unit 103 per content, for example per product; normalization rules 106, in which the rules for the normalization processing of the attribute normalization unit 103 are managed by content field, attribute name, and so on; and a user service providing unit 108 that uses the contents of the content DB 105 to provide users with services such as search by content attribute information. The correspondence rules 107 drawn with a broken line are optional. Like the correspondence rules 2904 of FIG. 29, they hold the correspondence between the browsing tags and character strings in browsing document files and the content attributes, which allows the attribute extraction unit 102 to extract content attributes even from document files containing only browsing information (generally HTML documents).

[0029] The user accesses the automatic information collection and classification device 100 from the user terminal 130, performs field-specific searches, association searches, and the like, and, based on the search results, directly browses the document files on the Web sites 120.

[0030] FIGS. 2 and 3 show the overall processing flow of the automatic information collection and classification device 100, and FIG. 4 shows the structure of the target documents and how it changes in the course of processing. An example of the content attribute information extraction and normalization processing according to the present invention is described below with reference to FIGS. 2 to 4. A program describing the processes of the flowcharts shown in FIGS. 2 and 3 can be provided recorded on a computer-readable recording medium such as a floppy disk, a memory card, or a CD-ROM.

【００３１】 The documents handled in this embodiment are XML documents. XML is a document structure description (markup) language in which tags can be freely defined; by using content attribute tags defined in XML, content attributes can be extracted automatically. Attributes that are not used for browsing can also be included.

【００３２】 In FIG. 1, an information provider (IP) uses an attribute information setting device 125 to mark up character strings in a browsing document (generally written in HTML) with tags, thereby creating a document file (XML document) that contains content attribute information whose structure and format follow the browsing document, and places the file on the Web site 120. FIG. 4(a) shows the structure of an HTML document, and FIG. 4(b) shows the structure of an XML document. The document files on the Web site 120 may also have been written in XML from the start.

【００３３】 The automatic information collection unit 101 of the automatic information collection/classification apparatus 100 traverses the Web sites 120 on the network 110, automatically collects document files (XML documents), and extracts IP-level information and page-level information from them (steps 201 to 203). The information extracted here is the XML document as is, in which content attribute information with a structure and format that follow the browsing document is mixed with browsing information (FIG. 4(c)).

【００３４】 For the XML documents collected and extracted by the automatic information collection unit 101, the attribute extraction unit 102 first normalizes the character code (step 204) and then extracts only the content attribute information (step 205). Character code normalization is needed because document files on the network are written in a variety of character codes. Extracting only the content attribute information from a document file is possible, for example, by running a tag-interpreting parser (such as an SGML parser) so that it interprets only the content attribute tags and ignores the browsing tags. Which tags are content attribute tags is held in a configuration file that the parser consults; for example, it may be held as default attributes of a DTD in SGML, in which case the same file can be reused as a configuration file for document creation and editing. The content attribute information extracted here has a structure and format that follow the browsing document, as shown in FIG. 4(d).
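The extraction in step 205 can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions, not the parser used in the embodiment: the tag names in `ATTRIBUTE_TAGS`, the class `AttributeExtractor`, and the function `extract_attributes` are hypothetical, and the standard-library `html.parser` stands in for the SGML parser mentioned in the text. Nested attribute tags are simplified so that text is credited to the innermost open attribute tag.

```python
from html.parser import HTMLParser

# Hypothetical configuration: the tags that carry content attributes.
# HTMLParser reports tag names in lower case, so the set is lower case too.
ATTRIBUTE_TAGS = {"category", "name", "price", "volume"}


class AttributeExtractor(HTMLParser):
    """Collects text inside content attribute tags; browsing tags are ignored."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open content attribute tags: [tag, text]
        self.found = []   # (tag, text) pairs in the order the tags close

    def handle_starttag(self, tag, attrs):
        if tag in ATTRIBUTE_TAGS:
            self.stack.append([tag, ""])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1] += data

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            name, text = self.stack.pop()
            self.found.append((name, text.strip()))


def extract_attributes(document: str) -> list:
    parser = AttributeExtractor()
    parser.feed(document)
    parser.close()
    return parser.found


sample = (
    "<html><body><h1><NAME>Sake A</NAME></h1>"
    "<p>Price: <b><PRICE>1000円</PRICE></b></p></body></html>"
)
print(extract_attributes(sample))
```

Keeping the tag list in one configuration object mirrors the idea in the text that the same definition can be shared with the authoring tool.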

【００３５】 If the attribute extraction unit 102 also holds correspondence rules 107 between browsing tags or character strings and content attributes (for example, that a "price" always appears immediately before the character "円" (yen)), it can refer to those rules to extract content attribute information from a browsing document.
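As a rough illustration of such a correspondence rule, the sketch below expresses "the price appears immediately before 円" as a regular expression. The rule table `CORRESPONDENCE_RULES` and the function `extract_by_rules` are hypothetical names chosen for this example; the source does not specify how the rules are encoded.

```python
import re

# Hypothetical correspondence rules: content attribute -> pattern over browsing text.
# The single rule encodes "a price is the number immediately before 円".
CORRESPONDENCE_RULES = {
    "price": re.compile(r"([0-9０-９,，]+)\s*円"),
}


def extract_by_rules(text: str) -> dict:
    """Return every attribute whose rule matches, with all matched strings."""
    found = {}
    for attribute, pattern in CORRESPONDENCE_RULES.items():
        matches = pattern.findall(text)
        if matches:
            found[attribute] = matches
    return found


print(extract_by_rules("<p>純米酒 720ml 1,200円</p>"))
```

The matched strings are left unnormalized here; unifying digits and separators belongs to the later numeric normalization step.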

【００３６】 The attribute normalization unit 103 refers to the normalization rules 106 and normalizes the content attribute information extracted by the attribute extraction unit 102, whose structure and format follow the browsing document, into a form suited to providing services such as a search service. The normalization rules 106 include field-independent/attribute-independent rules, field-dependent/attribute-independent rules, field-independent/attribute-dependent rules, and field-dependent/attribute-dependent rules, and are managed by content field and attribute name.
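One way to picture rules managed by content field and attribute name is a table keyed by the pair (field, attribute), with a wildcard for "independent". The sketch below is an assumption-laden illustration: the table `RULES`, the wildcard `ANY`, the sample rules, and the general-to-specific application order are all choices made for this example and are not stated in the source.

```python
ANY = None  # wildcard: the rule does not depend on this key


def drop_commas(value: str) -> str:
    return value.replace(",", "")


def drop_trailing_sake(value: str) -> str:
    return value[:-1] if value.endswith("酒") else value


# Hypothetical rule table keyed by (content field, attribute name).
RULES = {
    (ANY, ANY): [str.strip],                      # field-independent / attribute-independent
    ("sake", ANY): [],                            # field-dependent / attribute-independent
    (ANY, "price"): [drop_commas],                # field-independent / attribute-dependent
    ("sake", "method"): [drop_trailing_sake],     # field-dependent / attribute-dependent
}


def applicable_rules(field: str, attribute: str) -> list:
    """Collect rules from the most general key to the most specific one."""
    keys = [(ANY, ANY), (field, ANY), (ANY, attribute), (field, attribute)]
    return [rule for key in keys for rule in RULES.get(key, [])]


def normalize_value(field: str, attribute: str, value: str) -> str:
    for rule in applicable_rules(field, attribute):
        value = rule(value)
    return value


print(normalize_value("sake", "method", " 大吟醸酒 "))
print(normalize_value("book", "price", "1,200"))
```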

【００３７】 The attribute normalization unit 103 first recognizes the field of the target content (content attribute information that has the same shape as the browsing document) from the value of the category tag in the content attribute information (step 206). This must be done first because the schema of the content DB 105 differs from field to field. Next, for the target content (content attribute information), the structure that follows the browsing document is normalized into a structure suited to service provision (step 207), and the content attribute information with the normalized structure then undergoes normalization of character expression format (step 208) and normalization of numerical expression (step 209). FIG. 4(e) shows the concept of normalization. Finally, the normalized content attribute information is stored in the content DB 105 in content units (step 210). The normalization of character expression format and the normalization of numerical expression may be performed in the reverse order. They may also involve a conversion of the attribute structure, for example splitting a value into a numeric part and a unit part.

【００３８】 The processing for attribute structure normalization, character expression format normalization, and numerical expression normalization is described in detail below.

【００３９】 Attribute structure normalization is broadly divided into content expansion processing, attribute name normalization, attribute splitting, and conversion to other attributes. Content expansion processing is further divided into two kinds, referred to here as content expansion processing (1) and content expansion processing (2).

【００４０】 Browsing information sometimes has a structure in which one content carries the same attribute several times in order to represent variants of that content, yet for service provision it is often more convenient to store each variant as a separate content. In such cases, content expansion processing (1) is applied to expand the variants into separate contents. For example, when a beverage product has the same contents but its price varies with volume, the browsing document (product catalog) presents the volumes as variants of a single product, and the content attribute information that reuses the document has the same shape. Because each variant is usually handled as its own product when the service is provided, a function for expanding them is essential. FIG. 5 shows a conceptual diagram of content expansion processing (1). The content expansion example in FIG. 4(e) above corresponds to this processing.
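Content expansion processing (1) can be sketched as follows. The dictionary layout, in which repeated attributes are grouped under a `variants` key, and the function name `expand_variants` are assumptions made for this example; the source describes the operation only conceptually (FIG. 5).

```python
def expand_variants(content: dict) -> list:
    """Split one content with repeated attributes into one content per variant.

    Attributes outside 'variants' are shared and copied into every result.
    """
    shared = {key: value for key, value in content.items() if key != "variants"}
    variants = content.get("variants", [])
    if not variants:
        return [shared]
    return [{**shared, **variant} for variant in variants]


catalog_entry = {
    "name": "Sake A",
    "variants": [
        {"volume": "720ml", "price": "1000円"},
        {"volume": "1.8l", "price": "2000円"},
    ],
}

for product in expand_variants(catalog_entry):
    print(product)
```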

【００４１】 On the other hand, browsing information often lists contents of the same genre together in one group. In that case, content expansion processing (2) is applied: the group is regarded as an object, and the attribute information set on that object is expanded as default attributes of its child objects. For example, when a whole series of beverage products belongs to the ginjo category, a catalog may print "Ginjo" as a large heading and leave it implicit that every product listed after it is ginjo. Because each product is handled individually when the service is provided, the attribute "brewing method = ginjo" must be added to each product's information. FIG. 6 shows a conceptual diagram of content expansion processing (2).
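Content expansion processing (2) amounts to pushing the attributes of a group down to its children as defaults. In the sketch below, the `children` key and the function name `expand_group` are hypothetical, and letting a child's own value override the group default is an assumption that follows from treating the group attributes as defaults.

```python
def expand_group(group: dict) -> list:
    """Copy the group's attributes into each child as default values."""
    defaults = {key: value for key, value in group.items() if key != "children"}
    # A child's own attribute wins over the inherited default.
    return [{**defaults, **child} for child in group.get("children", [])]


ginjo_group = {
    "method": "吟醸",
    "children": [
        {"name": "Sake A"},
        {"name": "Sake B", "method": "大吟醸"},
    ],
}

for product in expand_group(ginjo_group):
    print(product)
```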

【００４２】 Attribute name normalization is performed when conceptually identical attributes have different item names. FIG. 7 shows an example of attribute name normalization. Conversely, attribute splitting is performed when an attribute is easier to handle if it is split into several conceptual attributes. FIG. 8 shows an example of attribute splitting.
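The two operations can be sketched as below. The synonym table `CANONICAL_NAMES`, the attribute names, and the choice of splitting a volume string into a number and a unit are assumptions for illustration; the actual examples are those of FIGS. 7 and 8, which are not reproduced in the text.

```python
import re

# Hypothetical synonym table: item names seen in catalogs -> canonical attribute name.
CANONICAL_NAMES = {"値段": "price", "価格": "price", "定価": "price"}


def normalize_names(content: dict) -> dict:
    """Rename conceptually identical attributes to one canonical name."""
    return {CANONICAL_NAMES.get(name, name): value for name, value in content.items()}


def split_volume(content: dict) -> dict:
    """Split a 'volume' such as '720ml' into a numeric value and a unit."""
    result = dict(content)
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*", result.get("volume", ""))
    if match:
        del result["volume"]
        result["volume_value"] = float(match.group(1))
        result["volume_unit"] = match.group(2)
    return result


print(normalize_names({"値段": "1000円", "name": "Sake A"}))
print(split_volume({"name": "Sake A", "volume": "720ml"}))
```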

【００４３】 Conversion to other attributes (normalization) is processing that converts one or more attributes into one or more different attributes. For example, the taste (nomikuchi) is derived from the sake meter value (nihonshudo) and the acidity: the taste is determined by which region of FIG. 9 the pair of values falls in (evaluated with inequalities).
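The inequality-based derivation can be sketched as follows. The boundaries (a sake meter value of 0 and an acidity of 1.5) and the four labels are placeholders invented for this example only; the real regions are those drawn in FIG. 9, which the text does not give numerically.

```python
def classify_taste(sake_meter_value: float, acidity: float) -> str:
    """Derive a taste label from two source attributes using inequalities.

    The thresholds below are illustrative placeholders, not values from FIG. 9.
    """
    dry = sake_meter_value >= 0.0
    rich = acidity >= 1.5
    if dry and rich:
        return "rich-dry"
    if dry:
        return "light-dry"
    if rich:
        return "rich-sweet"
    return "light-sweet"


print(classify_taste(3.0, 1.2))
print(classify_taste(-2.0, 1.8))
```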

【００４４】 Character expression format normalization works as follows. Suppose, for example, that the service is to classify sake brewing methods into three types: daiginjo, ginjo, and ordinary. Whether the browsing document says "大吟醸酒" (daiginjo sake) or "大吟醸のお酒" (sake of the daiginjo type), the value is normalized to "大吟醸" (daiginjo). (If only the expression "大吟醸" were accepted, the wording of product catalogs aimed at general customers would be restricted.) Normalization of code systems and the like is also performed. Conditional statements such as logical expressions can be used in these rules.
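A minimal sketch of this kind of rule, written as ordered conditions, is shown below. The rule list `METHOD_RULES` and the function `normalize_method` are hypothetical; substring tests are used here as the simplest possible condition, and the order matters because "吟醸" is itself a substring of "大吟醸".

```python
# Ordered rules: the first condition that holds decides the canonical value.
METHOD_RULES = [
    (lambda text: "大吟醸" in text, "大吟醸"),
    (lambda text: "吟醸" in text, "吟醸"),
]
DEFAULT_METHOD = "普通"  # "ordinary": used when no condition holds


def normalize_method(text: str) -> str:
    """Map free catalog wording onto one of the three service categories."""
    for condition, canonical in METHOD_RULES:
        if condition(text):
            return canonical
    return DEFAULT_METHOD


for wording in ["大吟醸酒", "大吟醸のお酒", "吟醸酒", "本醸造"]:
    print(wording, "->", normalize_method(wording))
```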

【００４５】 Numeric format normalization is performed, for example, by the procedure shown in FIG. 10. In browsing documents, numbers vary in expression: full-width or half-width digits, Arabic or kanji numerals, with or without comma separators, and so on; these are normalized (step 1001). For numbers with units, the way the unit is written varies; a price, for example, may appear as ￥1000, 1000円, or 1000yen. These notations are recognized (step 1002), and the recognized unit system is converted to the unit system used for service provision (step 1003). Multiple values and numeric ranges are also recognized and converted (step 1004). For example, information such as "1,000 yen or more and less than 2,000 yen, and 3,000 yen or more and less than 4,000 yen" is converted into a form suited to service provision: if the service can handle two numeric ranges in one content, the ranges are stored as they are; if it can handle only one numeric range per content, the information is expanded into two contents and stored.
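Steps 1001 to 1004 can be sketched in Python as below. This is an illustration under assumptions, not the procedure of FIG. 10 itself: the function names, the unit code "JPY", the supported kanji numerals (values below 100,000,000), and the range pattern "A円以上B円未満" are all choices made for this example. Unicode NFKC normalization from the standard library is used to fold full-width characters to half-width.

```python
import re
import unicodedata

KANJI_DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
KANJI_UNITS = {"十": 10, "百": 100, "千": 1000}


def kanji_to_int(text: str) -> int:
    """Convert a kanji numeral such as '千二百' or '一万五千' to an int."""
    total, section, digit = 0, 0, 0
    for ch in text:
        if ch in KANJI_DIGITS:
            digit = digit * 10 + KANJI_DIGITS[ch]
        elif ch in KANJI_UNITS:
            section += (digit or 1) * KANJI_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += ((section + digit) or 1) * 10000
            section, digit = 0, 0
        else:
            raise ValueError(f"not a kanji numeral: {text!r}")
    return total + section + digit


def normalize_number(text: str) -> int:
    """Step 1001: unify character width, separators, and numeral style."""
    text = unicodedata.normalize("NFKC", text).replace(",", "").strip()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return kanji_to_int(text)


def parse_price(text: str) -> dict:
    """Steps 1002-1003: recognize the unit notation and map it to one unit code."""
    text = unicodedata.normalize("NFKC", text).strip()
    for pattern in (r"¥\s*(.+)", r"(.+?)\s*円", r"(.+?)\s*yen"):
        match = re.fullmatch(pattern, text, flags=re.IGNORECASE)
        if match:
            return {"value": normalize_number(match.group(1)), "unit": "JPY"}
    raise ValueError(f"no known price unit in {text!r}")


def parse_ranges(text: str) -> list:
    """Step 1004: recognize one or more 'A円以上B円未満' ranges as (low, high) pairs."""
    text = unicodedata.normalize("NFKC", text)
    pairs = re.findall(r"([0-9,]+円)以上([0-9,]+円)未満", text)
    return [(parse_price(low)["value"], parse_price(high)["value"]) for low, high in pairs]


print(parse_price("￥１０００"), parse_price("千二百円"))
print(parse_ranges("１０００円以上２０００円未満、及び３０００円以上４０００円未満"))
```

Whether the two ranges returned by `parse_ranges` are stored in one content or expanded into two contents depends on the schema of the service, as the paragraph above describes.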

【００４６】 FIGS. 11 to 19 show a concrete example of content attribute information extraction and normalization according to this embodiment. FIG. 11 is a document file (XML document) supplied by an information provider, in which content attribute information has been set by marking up character strings in the browsing document (HTML document) with arbitrarily defined tags. In FIG. 11, the parts enclosed by the tags shown in bold (for example, <MEMO> ... </MEMO>) are the content attribute information.

【００４７】 FIG. 12 shows the data obtained by extracting the content attribute information as is from the document file (XML document) of FIG. 11; the content attribute information still has the structure and format of the information provider's browsing document. FIG. 13 represents the content attribute information of FIG. 12 as a tree. As FIG. 13 shows, this structure and format is not necessarily suited to providing a search service or the like.

【００４８】 FIG. 14 shows the result of applying content expansion processing (2) to the target content (content attribute information) of FIG. 12. FIG. 15 shows the result of further applying content expansion processing (1) to the expansion result of FIG. 14; in the end, the target content of FIG. 12 is expanded into four contents. FIG. 16 shows the result of applying character expression format normalization and numeric format normalization to the content expansion result of FIG. 15.

【００４９】 Representing the normalized content attribute information of FIG. 16 as a tree gives FIGS. 17 and 18. Compared with FIG. 13, it is apparent at a glance that the structure and format are better suited to providing a search service or the like.

【００５０】 FIG. 19 represents the normalized content attribute information of FIG. 16 as a table; the content DB 105 of FIG. 1 stores content attribute information in this form. In this way, a database (relational database) for providing a search service or the like can be built without being bound by the structure and format of the browsing document (HTML document).

【００５１】 Next, the attribute information setting device according to the present invention is described. FIG. 20 is a block diagram showing an embodiment of the attribute information setting device according to the present invention. The attribute information setting device 2000 differs from the conventional device shown in FIG. 30 in that it adds an attribute extraction/normalization preview unit 2008 and a structure verification unit 2009 that manages and verifies multiple tag sets separately. The attribute extraction/normalization preview unit 2008 has basically the same functions as the attribute extraction unit 102, the attribute normalization unit 103, and so on in the automatic information collection/classification apparatus 100 of FIG. 1 (either equivalent or a simplified version); it extracts and normalizes content attribute information from the created document file (XML document) and previews the result. The structure verification unit 2009 manages the tag sets for the browsing document and for the content attribute information separately, and verifies the structure of each while ignoring the other.

【００５２】 The attribute information setting device 2000 is installed and used at the Web site 120, as shown by the device 125 in the system of FIG. 1. With the attribute extraction/normalization preview unit 2008, the information provider (IP) can preview how the automatic information collection/classification apparatus 100 will extract and normalize the content attribute information of a created document file (XML document), and can therefore set more effective content attribute information. FIG. 21 shows the editor screen for the document file (XML document) of FIG. 11, and FIG. 22 shows the corresponding preview screen of the attribute extraction and normalization result. Because the attribute extraction/normalization preview unit 2008 in this example is a simplified version without a unit conversion function, volumes appear in a mixture of "l" and "ml" notation. With the structure verification unit 2009, when handling multiple tag sets that constrain each other only weakly but constrain tags of the same kind strongly, the tag set of the browsing document and the tag set of the content attribute information can each be verified grammatically, for example by ignoring the other set, which allows stricter grammar verification. FIG. 23 shows a conceptual diagram of the management and verification of multiple tag sets by the structure verification unit 2009.
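The idea of verifying each tag set while ignoring the other can be sketched as a nesting check that looks only at tags from one set. The tag sets, the regular expression `TAG`, and the function `well_nested` are hypothetical and cover only open/close nesting; a real verification would also apply the grammar (for example a DTD) of each tag set, and self-closing tags are not handled.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

# Hypothetical tag sets, managed separately.
BROWSING_TAGS = {"p", "b"}
ATTRIBUTE_TAGS = {"item", "price"}


def well_nested(document: str, tag_set: set) -> bool:
    """Check open/close nesting for tags in tag_set only; other tags are ignored."""
    stack = []
    for closing, name in TAG.findall(document):
        name = name.lower()
        if name not in tag_set:
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


# The two sets interleave (<b> closes inside <PRICE>), yet each set is consistent on its own.
mixed = "<p><ITEM>Sake A <b><PRICE>1000円</b></PRICE></ITEM></p>"
print(well_nested(mixed, BROWSING_TAGS), well_nested(mixed, ATTRIBUTE_TAGS))

# Here the attribute tags themselves are crossed, which the per-set check rejects.
broken = "<p><ITEM><PRICE>1000円</ITEM></PRICE></p>"
print(well_nested(broken, ATTRIBUTE_TAGS))
```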

【００５３】[0053]

[Effects of the Invention] In an information collection and service providing system to which the present invention is applied, a host device such as a search engine has functions for converting the structure, attribute names, attribute value formats, and the like of the attribute information in the browsing document files it reuses, which increases the freedom of expression available to browsing documents (for example, product catalogs). Conversely, a database for service provision can be built without being bound by the structure and format of browsing documents that already exist.

【００５４】 In addition, by providing a device that sets content attribute information by marking up character strings in text information with an attribute extraction/normalization preview function and a function for managing and verifying multiple tag sets separately, the data creator can see the result of the attribute extraction and normalization (of structure and value format) performed by the host device, and can therefore set attributes more effectively. Furthermore, when handling multiple tag sets that constrain each other only weakly but constrain tags of the same kind strongly, verifying the grammar of each tag set separately allows stricter grammar verification.

[Brief description of the drawings]

【図１】 FIG. 1 is a block diagram of an embodiment of an information collection and service providing system to which the present invention is applied.

【図２】 FIG. 2 is part of the overall processing flowchart explaining the operation of the system of FIG. 1.

【図３】 FIG. 3 is a continuation of the processing flowchart of FIG. 2.

【図４】 FIG. 4 is a diagram outlining the structure of the documents handled by the present invention and the processing steps.

【図５】 FIG. 5 is a diagram explaining content expansion processing (1) of the attribute structure normalization according to the present invention.

【図６】 FIG. 6 is a diagram explaining content expansion processing (2) of the same attribute structure normalization.

【図７】 FIG. 7 is a diagram explaining attribute name normalization.

【図８】 FIG. 8 is a diagram explaining attribute splitting.

【図９】 FIG. 9 is a diagram explaining normalization to other attributes.

【図１０】 FIG. 10 is a flowchart explaining numeric format normalization.

【図１１】 FIG. 11 is a concrete example of a document file handled by the present invention.

【図１２】 FIG. 12 shows the content attribute information extracted from the document file of FIG. 11.

【図１３】 FIG. 13 is a diagram representing the content attribute information of FIG. 12 as a tree.

【図１４】 FIG. 14 is a diagram showing the result of applying content expansion processing (1) to the content attribute information of FIG. 11.

【図１５】 FIG. 15 is a diagram showing the result of further applying content expansion processing (2) to the data of FIG. 14.

【図１６】 FIG. 16 is a diagram showing the result of applying character expression format normalization and numeric format normalization to the data of FIG. 15.

【図１７】 FIG. 17 is a diagram showing part of the tree representation of the data of FIG. 16.

【図１８】 FIG. 18 is a continuation of the tree representation of FIG. 17.

【図１９】 FIG. 19 is a diagram representing the data of FIG. 16 as a table.

【図２０】 FIG. 20 is a block diagram of an embodiment of the attribute information setting device according to the present invention.

【図２１】 FIG. 21 is a diagram showing the editor screen for the document file of FIG. 11 in the attribute information setting device.

【図２２】 FIG. 22 is a diagram showing the editor preview result corresponding to FIG. 19 in the attribute information setting device.

【図２３】 FIG. 23 is a diagram explaining the management and verification of multiple tag sets in the attribute information setting device according to the present invention.

【図２４】 FIG. 24 is a block diagram of a conventional search engine system that does not take content attribute information into account.

【図２５】 FIG. 25 is a diagram explaining the structure of the documents handled by the system of FIG. 24.

【図２６】 FIG. 26 is a diagram showing a concrete example of a document handled by the system of FIG. 24.

【図２７】 FIG. 27 is a block diagram of a conventional automatic collection and classification system in which the content attribute information is fixed.

【図２８】 FIG. 28 is a diagram explaining the structure of the documents handled by the system of FIG. 27.

【図２９】 FIG. 29 is a block diagram of a conventional system that associates content attribute information with browsing tags and character strings.

【図３０】 FIG. 30 is a block diagram of a conventional attribute information setting device.

【図３１】 FIG. 31 is a diagram explaining the management and verification of tag sets in the conventional attribute information setting device.

[Explanation of symbols]

100: Automatic information collection/classification apparatus (host device)
101: Automatic information collection unit
102: Attribute extraction unit
103: Attribute normalization unit
104: Control unit
105: Content database
106: Normalization rules
107: Correspondence rules
108: Service providing unit
110: Network
120: Web site
130: User terminal
125, 2000: Attribute information setting device
2008: Attribute extraction/normalization preview unit
2009: Structure verification unit

Claims

[Claims]

1. A content attribute information normalization method for normalizing content attribute information having a structure and format that follow a browsing document into a structure and format independent of the browsing document, the method comprising: extracting the content attribute information from a document file in which the content attribute information is mixed with browsing information; performing attribute structure normalization processing on the extracted content attribute information; and performing character expression format normalization processing and numerical expression normalization processing on the content attribute information whose structure has been normalized.

2. The content attribute information normalization method according to claim 1, wherein the attribute structure normalization processing performs content expansion, attribute name normalization, attribute splitting, and normalization to other attributes.

3. The content attribute information normalization method according to claim 1, wherein the normalization rules for the normalization processing comprise field-independent/attribute-independent rules, field-dependent/attribute-independent rules, field-independent/attribute-dependent rules, and field-dependent/attribute-dependent rules, and are managed by content field and attribute name.

4. An information collection and service providing system in which an information provider's Web site, a host device, and a user terminal are connected via a network, wherein the host device comprises: means for automatically collecting document files from Web sites distributed over the network; means for extracting content attribute information contained in the collected document files together with browsing information; means for normalizing the extracted content attribute information into a structure and format suited to service provision; means for storing the normalized content attribute information; and means for providing a service in response to a request from the user terminal using the stored content attribute information.

5. An attribute information setting device for setting content attribute information by marking up character strings in a browsing document, the device comprising: means for extracting and normalizing the content attribute information and previewing the result; and means for managing browsing tags and content attribute tags separately and verifying the browsing information and the content attribute information separately.

6. A computer-readable recording medium storing a program for normalizing content attribute information having a structure and format that follow a browsing document into a structure and format independent of the browsing document, the program comprising: a process of extracting the content attribute information from a document file in which the content attribute information is mixed with browsing information; a process of performing attribute structure normalization processing on the extracted content attribute information; and a process of performing character expression format normalization processing and numerical expression normalization processing on the content attribute information whose structure has been normalized.