JP2003223390A

JP2003223390A - Data extraction/structure conversion processing program, its recording medium, contents generation processing program, its recording medium, and contents reconstruction processing system

Info

Publication number: JP2003223390A
Application number: JP2002019425A
Authority: JP
Inventors: Minoru Morita (守田 実); Shingo Okamoto (岡本 晋吾); Tomoyoshi Inada (稲田 知義); Takuo Nakamura (中村 拓郎)
Original assignee: Fujitsu Social Science Labs Ltd
Current assignee: Fujitsu Social Science Labs Ltd
Priority date: 2002-01-29
Filing date: 2002-01-29
Publication date: 2003-08-08
Anticipated expiration: 2022-01-29
Also published as: JP4084049B2

Abstract

PROBLEM TO BE SOLVED: To provide a method for automatically reconstructing existing contents into contents for a terminal with a different display environment.

SOLUTION: A data extraction/structure conversion processing part 2 obtains Web contents 4, separates them into resources and styles, groups the resources based on tags to generate modules, reconstructs the tree structure of the pages based on link information, rearranges the modules, generates XML data, and stores the data in a contents DB 5. When a browsing request is issued from a portable terminal 13, a contents generation processing part 3 reconstructs the XML data in the contents DB 5 according to a display format template, selected based on specification information about the requesting portable terminal 13 obtained from a data center 14, and returns the result.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a Web content reconstruction processing method that restructures Web content into data in a format compatible with various display processing environments, stores that data, and regenerates the stored data in a display format suited to the display processing environment of the requester. In particular, it relates to a Web content reconstruction processing method that automatically or semi-automatically assigns meaning to Web content, stores data in which the page link information has been reconstructed based on that meaning, and restructures the data so that it can be displayed appropriately on terminals whose display processing environment is constrained.

[0002]

[Prior Art] The data format suited to Web content can differ depending on the means used to browse it. For example, content intended for viewing on an ordinary computer terminal and content suited to viewing on a portable data processing terminal such as a mobile phone or a PDA (Personal Digital Assistant) (hereinafter, "mobile terminal") differ in data description format, data size, screen size, and so on. Therefore, to let a mobile terminal browse Web content consisting of a set of pages written in HTML (HyperText Markup Language), that is, an HTML document, the Web content must be reconstructed for mobile terminals.

[0003] However, existing HTML documents may be written in a wide variety of forms, and the display processing conditions on the mobile terminal side (screen size, displayable page size, and so on) also vary. For this reason, it has been difficult to automate the reconstruction of ordinary Web content into content for mobile terminals.

[0004] Conventionally, reconstruction was done by hand: taking into account the terminal's display screen size, displayable page size, and so on, the Web content was reviewed and partially changed, and the page layout and page structure were altered.

[0005] Alternatively, the Web content was written in advance in a special language, and content for mobile terminals was reconstructed with a dedicated conversion tool.

[0006]

[Problems to Be Solved by the Invention] As the variety of mobile terminals grows, there is demand to reuse existing Web content to enrich the content available to them. Conventionally, however, it has not been possible to automatically reconstruct, from Web content, content suited to another display processing environment while taking the content and page structure into account. Because reconstruction of Web content into content displayable in another display processing environment was done by hand, it imposed a heavy burden in labor and time.

[0007] An object of the present invention is to provide a processing method that automatically or semi-automatically reconstructs content in another format from the original Web content while taking the content and page structure of the Web content into account.

[0008] Another object of the present invention is to provide a program or a processing device for implementing the above processing method on a computer.

[0009]

[Means for Solving the Problems] To achieve the above objects, the present invention is a program for causing a computer to execute data extraction/structure conversion processing that reconstructs intermediate content data in a reusable format from content data written in a tagged markup language. The program causes the computer to execute: a module generation process that extracts, from the content data, resource information (the parts that constitute information meaningful to humans) based on tags (the information that defines the data structure), groups the resource information based on the tags to generate modules, and assigns meaning to the modules; a relationship setting process that assigns relationships between the modules based on the tags and reconstructs the tree structure between the pages of the content data from the link information defined in the content data; a reconstruction process that rearranges the modules according to the relationships between the modules and the tree structure to generate the intermediate content data in a reusable format; and a content data storage process that stores the intermediate content data.

[0010] The present invention is also a program for causing a computer to execute content generation processing that generates, from intermediate content data in a reusable format written in a tagged markup language, content data suited to display on the terminal that issued a browsing request. The program causes the computer to execute: a process of accessing template storage means that holds templates, which are display format information corresponding to the display processing environments of terminals; and a dynamic generation process that accepts a browsing request from the terminal together with type information of the terminal, extracts the corresponding template from the template storage means based on the identification information, and reconstructs the intermediate content data according to the template to generate the content data.

[0011] The present invention is also a recording medium on which is recorded a program for causing a computer to execute data extraction/structure conversion processing that reconstructs intermediate content data in a reusable format from content data written in a tagged markup language. The recorded program causes the computer to execute: a module generation process that extracts, from the content data, resource information (the parts that constitute information meaningful to humans) based on tags (the information that defines the data structure), groups the resource information based on the tags to generate modules, and assigns meaning to the modules; a relationship setting process that assigns relationships between the modules based on the tags and reconstructs the tree structure between the pages of the content data from the link information defined in the content data; a reconstruction process that rearranges the modules according to the relationships between the modules and the tree structure to generate the intermediate content data in a reusable format; and a content data storage process that stores the intermediate content data.

[0012] The present invention is also a recording medium on which is recorded a program for causing a computer to execute content generation processing that generates, from intermediate content data in a reusable format written in a tagged markup language, content data suited to display on the terminal that issued a browsing request. The recorded program causes the computer to execute: a process of accessing template storage means that holds templates, which are display format information corresponding to the display processing environments of terminals; and a dynamic generation process that accepts a browsing request from the terminal together with type information of the terminal, extracts the corresponding template from the template storage means based on the identification information, and reconstructs the intermediate content data according to the template to generate the content data.

[0013] The present invention is also a content reconstruction processing system comprising a data extraction/structure conversion processing unit that reconstructs and stores intermediate content data in a reusable format from first content data written in a first tagged markup language, and a content generation processing unit that generates, from the intermediate content data, second content data written in a second tagged markup language and suited to display on the terminal that issued a browsing request. The data extraction/structure conversion processing unit comprises: module generation processing means that extracts, from the first content data, resource information (the parts that constitute information meaningful to humans) based on tags (the information that defines the data structure), groups the resource information based on the tags to generate modules, and assigns meaning to the modules; relationship setting means that assigns relationships between the modules based on the tags and reconstructs the tree structure between the pages of the first content data from the link information defined in the first content data; reconstruction processing means that rearranges the modules according to the relationships between the modules and the tree structure to generate the intermediate content data in a reusable format; and a content data storage unit that stores the intermediate content data. The content generation processing unit comprises: template storage means that holds templates, which are display format information corresponding to the display processing environments of terminals; and dynamic generation processing means that accepts a browsing request from the terminal together with type information of the terminal, extracts the corresponding template from the template storage means based on the identification information, and reconstructs the intermediate content data according to the template to generate the content data.

[0014] The content reconstruction processing system of the present invention is a system comprising a data extraction/structure conversion processing unit that reconstructs and stores intermediate content data in a reusable format from first content data written in a first tagged markup language, and a content generation processing unit that generates, from the intermediate content data, second content data written in a second tagged markup language and suited to display on the terminal that issued a browsing request. The data extraction/structure conversion processing unit extracts, from the first content data, resource information meaningful to humans based on the information that defines the data structure, groups it based on the tags to generate modules, reconstructs the tree structure between the pages of the first content data from the link information of the first content data, rearranges the modules, and generates and stores the intermediate content data in a reusable format. On accepting a browsing request from the terminal together with type information of the terminal, the content generation processing unit extracts, based on the identification information, the corresponding template from among the templates, which are display format information corresponding to the display processing environments of terminals, and reconstructs the intermediate content data according to that template to generate the content data.

[0015] A program for causing a computer to execute each means, function, or element of the present invention can be stored on a suitable computer-readable recording medium such as a portable storage medium, a semiconductor memory, or a hard disk. The program is provided recorded on such a recording medium, or is provided by transmission and reception over various communication networks through a communication interface.

[0016]

[Embodiments of the Invention] As an embodiment of the present invention, the following describes the processing for reconstructing, from Web content, content to be displayed on a mobile terminal with a small display screen, such as a mobile phone.

[0017] FIG. 1 shows a configuration example of a system according to the present invention. The content reconstruction processing system 1 that implements the present invention comprises a data extraction/structure conversion processing unit 2, a content generation processing unit 3, and a content database (content DB) 5.

[0018] The data extraction/structure conversion processing unit 2 is means that acquires, as Web content 4, the HTML document (a set of HTML pages) making up one site, turns the data portions that are meaningful to humans into components, assigns meaning to the componentized data, and, taking that meaning and the original page structure into account, reconstructs the data into a format reusable in other display processing environments and stores it in the content DB 5. Here, XML (eXtensible Markup Language) data, for example, is generated as the reusable data.

[0019] The content generation processing unit 3 is means that accepts an access request from a terminal such as a mobile terminal, extracts the corresponding XML data from the content DB 5, and, based on the extracted XML data, dynamically generates mobile content 6 using a predetermined display format template so that the content is optimal for the requester. Here, content written in Compact HTML (C-HTML), for example, is generated as the mobile content 6.
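The template-driven generation step in [0019] can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the template table, its `max_items` and `item` fields, the terminal type names, and the `<resource>` element layout of the stored XML are all hypothetical and are not defined by the specification.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical display format templates keyed by terminal type (assumption).
TEMPLATES = {
    "phone-small": {"max_items": 2, "item": "{text}<br>"},
    "pda": {"max_items": 10, "item": "<p>{text}</p>"},
}


def generate_content(xml_data, terminal_type, templates=TEMPLATES,
                     default="phone-small"):
    """Render stored XML for one terminal type using its template."""
    # Fall back to a default template for unknown terminal types.
    template = templates.get(terminal_type, templates[default])
    root = ET.fromstring(xml_data)
    # Take the text of each <resource>, capped by the template's item limit.
    texts = [r.text or "" for r in root.iter("resource")]
    texts = texts[: template["max_items"]]
    body = "".join(template["item"].format(text=escape(t)) for t in texts)
    return f"<html><body>{body}</body></html>"
```

The same stored XML yields different markup per terminal type, which is the point of keeping the intermediate data separate from the display format.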

[0020] As shown in FIG. 2, the content reconstruction processing system 1 is connected, via the Internet 12, to a Web server 11 and to a data center 14 that relays access to the Web content 4.

[0021] The Web server 11 is a server that provides the Web content 4.

[0022] The data center 14 is a device that relays between the Internet 12 and the telephone network or other network to which the mobile terminal 13 connects.

[0023] The mobile terminal 13 is a terminal such as a mobile phone, PHS, or PDA that can connect to the Internet 12 via the data center 14 and can display the mobile content 6 generated by the content reconstruction processing system 1.

[0024] FIG. 3 shows an example of the data structure of the Web content 4. Each piece of content has one or more pages 41.

[0025] A page 41 is the unit of data displayed at one time by browsing means such as a Web browser. Each page has one or more containers 42 and links.

[0026] A container 42 is a part that holds one or more resources 43 or containers 42.

[0027] A resource 43 is data that constitutes information meaningful to humans. Here, a resource 43 is the smallest unit of an HTML document and refers to character (text) data, image data, or the like. In either case, one resource is not a whole sentence or a whole image but the smallest unit modified by HTML tags.

[0028] A style 45 is data associated with the pages 41, containers 42, and resources 43 that make up the content; it is formatting information such as font type, layout, and attributes. Here, all information that modifies a resource 43 such as character data or image data is treated as style 45.

[0029] A link 44 is data indicating a connection between objects; there are links from a resource 43 to a container 42 or to a page 41.
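The content model described in [0024] to [0029] (pages, nested containers, resources, styles, and links) can be sketched as Python dataclasses. This is a minimal illustration, not the data layout of the specification: the class and field names are assumptions, and a link is modeled simply as an optional target string on a resource.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Style:
    """Formatting information (font, layout, attributes) for an object."""
    properties: dict = field(default_factory=dict)


@dataclass
class Resource:
    """Smallest human-meaningful unit: a text run or an image."""
    kind: str                      # "text" or "image"
    content: str                   # the text, or alt text for an image
    link: Optional[str] = None     # link target (container or page), if any
    style: Optional[Style] = None


@dataclass
class Container:
    """Holds resources and/or nested containers."""
    children: List[Union[Container, Resource]] = field(default_factory=list)
    style: Optional[Style] = None

    def resources(self) -> List[Resource]:
        """Return all resources in document order, descending into nesting."""
        found: List[Resource] = []
        for child in self.children:
            if isinstance(child, Resource):
                found.append(child)
            else:
                found.extend(child.resources())
        return found


@dataclass
class Page:
    """Unit of data displayed at one time; owns one or more containers."""
    uri: str
    containers: List[Container] = field(default_factory=list)

    def resources(self) -> List[Resource]:
        return [r for c in self.containers for r in c.resources()]

    def links(self) -> List[str]:
        """Targets of all links that originate from resources on this page."""
        return [r.link for r in self.resources() if r.link]
```

Keeping `Style` as a separate object mirrors the separation of resources from formatting that the later processing steps rely on.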

[0030] FIG. 4 shows a configuration example of the data extraction/structure conversion processing unit 2.

[0031] The data extraction/structure conversion processing unit 2 comprises normalization processing means 201, separation processing means 202, category specifying means 203, module generation processing means 204, relationship setting means 205, customization processing means 206, and reconstruction processing means 207.

[0032] The normalization processing means 201 is means that acquires the HTML document constituting the Web content 4, as the input source, from the Web server 11 on which it is stored, converts the acquired HTML document into XHTML format to normalize its markup, and temporarily stores the converted XHTML data.

[0033] The separation processing means 202 is means that separates resources 43 and styles 45 from the XHTML data generated by the normalization processing means 201 and identifies the link information within the XHTML document.

[0034] The category specifying means 203 is means that identifies the industry category of the XHTML document (the content) by consulting an industry category database (industry category DB) 210 based on word data extracted from the top page of the XHTML document. The industry category DB 210 is a database that manages company names, product names, and the like for each industry category.
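The lookup described in [0034] can be sketched as a keyword match against a small in-memory table. This is an illustration under assumptions: the dictionary below is a hypothetical stand-in for the industry category DB 210, the whitespace/regex tokenization is a simplification (Japanese text would need a proper word segmenter), and picking the category with the most keyword hits is one possible scoring rule that the specification does not state.

```python
import re
from collections import Counter

# Hypothetical stand-in for the industry category DB 210 (assumption).
CATEGORY_DB = {
    "Information and communications": {"software", "network", "server"},
    "Manufacturing": {"factory", "machinery", "assembly"},
}


def identify_category(top_page_text, category_db=CATEGORY_DB):
    """Return the category whose keywords best match the top-page words."""
    # Build a simple word index from the page text.
    words = set(re.findall(r"\w+", top_page_text.lower()))
    # Count how many keywords of each category occur in the word index.
    hits = Counter({category: len(words & keywords)
                    for category, keywords in category_db.items()})
    if not hits:
        return None
    category, count = hits.most_common(1)[0]
    return category if count > 0 else None
```

Returning `None` when nothing matches leaves room for the manual correction step that a semi-automatic workflow would need.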

[0035] The module generation processing means 204 is means that groups the elements of the XHTML data, which have been decomposed into minimum units, by function to generate modules, determines the role of each element within its module, and assigns meaning to the module.

[0036] The relationship setting means 205 is means that assigns module relationships, such as master-slave relationships, between modules.

[0037] The customization processing means 206 is means that, for example in response to instructions entered by a user of this system, corrects the inter-module relationships assigned by the relationship setting means 205 or changes the content of a resource 43.

[0038] The reconstruction processing means 207 is means that reconstructs the tree structure between pages and the like using the link information of the XHTML document and, based on the module relationships and the tree structure, arranges the resources 43, which are the minimum units, to reconstruct XML data for output. The reconstructed XML data is stored in the content DB 5.

[0039] FIG. 5 is a diagram for explaining the processing of the data extraction/structure conversion processing unit 2.

[0040] Based on the URI (Uniform Resource Identifier) of the Web content 4 obtained from the administrator of the Web server 11, the content creator, or the like, the normalization processing means 201 acquires from the Web server 11 all HTML pages that are linked within the same domain. Here, "all HTML pages" means the set of pages that are located on the same processing device and can be referenced by relative paths. Links to other sites are regarded as outside the scope of the same content and are not acquired.
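The page acquisition rule in [0040] amounts to a link traversal that stays within one site. The sketch below shows one way to express it in Python. It is an illustration under assumptions: "same domain" is approximated by comparing scheme and host, the `fetch` callable is a hypothetical stand-in for the HTTP retrieval (so the sketch runs without network access), and only `<a href>` links are followed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site(start_uri, candidate_uri):
    """Approximate 'same domain' as same scheme and host (assumption)."""
    a, b = urlparse(start_uri), urlparse(candidate_uri)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


def collect_pages(start_uri, fetch):
    """Breadth-first walk from start_uri; fetch maps a URI to HTML text."""
    pages = {}
    queue = deque([start_uri])
    while queue:
        uri = queue.popleft()
        if uri in pages:
            continue
        pages[uri] = fetch(uri)
        collector = _HrefCollector()
        collector.feed(pages[uri])
        for href in collector.hrefs:
            # Resolve relative paths and drop any #fragment.
            target, _fragment = urldefrag(urljoin(uri, href))
            if same_site(start_uri, target) and target not in pages:
                queue.append(target)
    return pages
```

Links to other hosts are never passed to `fetch`, which corresponds to treating other sites as outside the scope of the content.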

[0041] Then, as shown in FIG. 5(A), the normalization processing means 201 examines the tags of each page of the HTML document one by one and, following XML syntax, first converts the page into XHTML format, which is easier to process. Ordinary HTML documents suffer from ambiguity about end tags, such as end tags that are never written, and from variability in markup, such as attributes being written in inconsistent ways, so leaving the document as HTML would cause several inconveniences in later processing.

[0042] Here it suffices to be able to treat the HTML document simply as XML data, so the acquired HTML document is converted into XHTML-format data, XHTML being the XML vocabulary closest to HTML. The conversion does not aim at strict conformance with the XHTML specification; its purpose is to remove ambiguity from the markup by supplying missing tags, deleting unnecessary tags, and so on, and to shape the markup into a form that makes subsequent processing easy. Specifically, it refers to the following conversions.

・Close empty elements such as <img>, <br>, and <hr>.
  Example: <br> → <br />
・When tags of the same inline element are nested, treat the duplicate as unnecessary and delete one pair.
  Example: <b><a name="aiueo"><b>あいうえお</b></a></b> → <b><a name="aiueo">あいうえお</a></b>
・Where start tags and end tags cross, nest them correctly according to the following rules.
  Rule 1: When a block element and an inline element cross, put the inline element inside the block element.
  Rule 2: When block elements cross each other, or inline elements cross each other, follow the order of the tag that appeared first.
  Example: <ul><li><b><a name="aiueo">あいうえお</b></a></ul></li> → <ul><li><b><a name="aiueo">あいうえお</a></b></li></ul>
・Supply missing end tags, such as those of <p> and <li>, according to the following rule.
  Rule 3: Close an unclosed tag immediately before the same tag next appears, or when the end tag of its parent tag is encountered.
・Where an attribute value is omitted, fill it in with the attribute name.
  Example: <td align="left" valign="top" nowrap> → <td align="left" valign="top" nowrap="nowrap">
・Distribute a tag that modifies several minimum elements (text, images, and so on) to each element individually.
  Example: <a href="index.html"><img src="pressrelease.gif" alt="プレスリリース" width="87" height="12"/><img src="delta-b.gif" alt="" width="10" height="13"/></a>
  → <a href="index.html"><img src="pressrelease.gif" alt="プレスリリース" width="87" height="12"/></a><a href="index.html"><img src="delta-b.gif" alt="" width="10" height="13"/></a>
  (Here "-" stands for a half-width underscore. The alt text プレスリリース means "press release".)

Next, as shown in FIG. 5(B), the separation processing means 202 splits the XHTML data into separate files, one per minimum-unit resource 43, and further separates the style 45 attached to each resource 43 into its own file. The separation processing means 202 also determines the container 42 that contains each resource 43 and separates the style 45 attached to the container 42. It further identifies the links 44, which are the structural information within the page 41.
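Two of the normalization conversions listed in [0042], closing empty elements and filling in omitted attribute values, can be sketched with Python's standard `html.parser` module. This is a partial illustration only: the crossing-tag rules, duplicate-tag removal, end-tag completion, and anchor distribution are not implemented, and the set of empty elements is limited to the three named in the text.

```python
from html import escape
from html.parser import HTMLParser

# Empty elements named in the text; a real converter would cover more.
VOID_ELEMENTS = {"img", "br", "hr"}


class XhtmlNormalizer(HTMLParser):
    """Re-serializes HTML with closed empty elements and full attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _attrs(self, attrs):
        parts = []
        for name, value in attrs:
            if value is None:          # minimized attribute, e.g. nowrap
                value = name
            parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            self.out.append(f"<{tag}{self._attrs(attrs)} />")
        else:
            self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        # Already self-closed in the input, e.g. <br />.
        self.out.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS:   # empty elements were closed above
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def normalize(html_text):
    parser = XhtmlNormalizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Calling `normalize('<td align="left" nowrap>a<br>b</td>')` yields markup in which `nowrap` carries an explicit value and `<br>` is self-closed; the remaining rules would need a tree-building pass rather than this token-by-token rewrite.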

[0043] In general, it is difficult to extract semantically coherent parts from an HTML document. The reason is that, besides the character data and images that people deal with directly, an HTML document is interspersed with formatting information (styles) that modifies them and with tags that describe the page structure.

[0044] The separation processing means 202 extracts the resources 43 from the XHTML document, and separates the styles 45 that decorate the resources 43, by the following method. Here, the rules for identifying a resource 43 and a style 45 are defined as follows.
・A resource is a text node enclosed in tags.
・One resource includes all the tags of the inline-level elements that enclose the text node (tags that directly decorate characters or images, excluding block-level elements such as the table <table> and the list <ul>). However, among inline elements, the anchor <a> is not included.
・When a line break tag <br> appears, everything up to and including that line break tag forms one resource.
・All tags and attributes other than the line break tag <br> are stored as style information.
・A horizontal rule <hr> forms one resource by itself.
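The extraction rules above can be illustrated with a short sketch based on Python's `xml.etree.ElementTree`. It is a simplification under stated assumptions: the function name `extract`, the `INLINE` tag set, and the dictionary layout of resources and styles are all hypothetical. Only inline tags and image attributes are recorded as style, whereas the rules store every tag and attribute other than <br>.

```python
import xml.etree.ElementTree as ET

# Assumed set of inline-level decoration tags; the anchor <a> is handled separately.
INLINE = {"b", "i", "u", "em", "strong", "font", "span"}


def extract(fragment):
    """Split an XHTML fragment into resources and the styles attached to them."""
    root = ET.fromstring("<root>" + fragment + "</root>")
    resources, styles = [], []

    def add(kind, text, src, link, style):
        res = {"id": len(resources) + 1, "type": kind, "src": src,
               "br": False, "link": link, "text": text}
        resources.append(res)
        if style:
            styles.append({"resource": res["id"], "style": style})

    def add_text(text, link, inline):
        if text and text.strip():
            add("text", text.strip(), "", link, dict.fromkeys(inline, ""))

    def walk(node, link, inline):
        add_text(node.text, link, inline)
        for child in node:
            if child.tag == "br":
                # The line break closes the resource that precedes it.
                if resources:
                    resources[-1]["br"] = True
            elif child.tag == "hr":
                add("hr", "", "", "", {})
            elif child.tag == "img":
                style = {k: v for k, v in child.attrib.items() if k not in ("src", "alt")}
                add("image", child.get("alt", ""), child.get("src", ""), link, style)
            elif child.tag == "a":
                # The anchor is not part of the resource; only its target is kept.
                walk(child, child.get("href", link), inline)
            else:
                nested = inline + [child.tag] if child.tag in INLINE else inline
                walk(child, link, nested)
            add_text(child.tail, link, inline)

    walk(root, "", [])
    return resources, styles


fragment = (
    '<a href="http://www.f.com/"><img src="img.gif" width="415" height="64" alt="F" border="0"/></a>'
    '<b><i>Product Information</i></b><br />'
    '<a href="products/gis/pd.html">Location Information Display System</a>'
)
resources, styles = extract(fragment)
for r in resources:
    print(r)
```

With the sample fragment, which mirrors the example given in this description, the sketch yields one image resource and two text resources, with the line break flag set on the second.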

[0045] An example of part of an HTML document, and examples of the resources 43 and styles 45 separated from it, are shown below.
Example) Part of an HTML document
<a href="http://www.f.com/"> <img src="img.gif" width="415" height="64" alt="F" border="0"/> </a> <b><i>Product Information</i></b><br /> <a href="products/gis/pd.html"> Location Information Display System</a>
Example) Created resources
・Resource r1: <resource id="1" type="image" src="img.gif" br="false" link="http://www.f.com/">F</resource>
・Resource r2: <resource id="2" type="text" src="" br="true" link="">Product Information</resource>
・Resource r3: <resource id="3" type="text" src="" br="false" link="products/gis/pd.html"> Location Information Display System</resource>
Example) Created styles
・Style s1: <style id="1"><width>415</width><height>64</height><border>0</border></style>
・Style s2: <style id="2"><b /><i /></style>
The category specifying means 203 cuts out the data portion of the top page of the XHTML data, excluding the definition words, to generate a word index, and identifies the industry category by searching the industry category DB 210 using words in the word index, such as company names and product names, as keywords. As the industry category, for example, the Japan Standard Industrial Classification is used. If the search does not determine the industry category, a synonym list for the extracted words is generated based on a synonym database (not shown), and the industry category DB 210 is searched using the generated synonym list to identify the industry category.
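The category lookup with its synonym fallback can be sketched as below. The dictionaries standing in for the industry category DB 210 and the synonym database, the function name `identify_category`, and the rule of picking the category with the most keyword hits are all assumptions made for illustration. The specification does not say how competing hits are resolved.

```python
def identify_category(words, category_db, synonyms):
    """Return the industry category for a word index, or None if undetermined."""

    def lookup(candidates):
        hits = {}
        for word in candidates:
            category = category_db.get(word)
            if category:
                hits[category] = hits.get(category, 0) + 1
        # Assumed tie-breaking: the category with the most matching keywords wins.
        return max(hits, key=hits.get) if hits else None

    category = lookup(words)
    if category is None:
        # Fallback: retry with the synonym list generated for the extracted words.
        expanded = [s for word in words for s in synonyms.get(word, [])]
        category = lookup(expanded)
    return category


category_db = {"router": "telecommunications", "loan": "finance"}
synonyms = {"lending": ["loan"]}
print(identify_category(["router", "news"], category_db, synonyms))
print(identify_category(["lending"], category_db, synonyms))
```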

[0046] Next, as shown in FIG. 5(C), the module generation processing means 204 groups the resources 43, which are the elements of the XHTML document broken down into minimum units, by predetermined function, such as table, list, or text paragraph, based on the tags, and generates modules. It then determines the role of each element within its module and assigns that role as meaning.

[0047] That is, based on tags such as TABLE, UL, OL, DL, P, and HR, and TR, TH, TD, and LI, it examines which elements belong together as a table or a list, treats them as one module, and interprets from the constituent elements what the module means.

[0048] For example, when several elements make up one table, the role of each element is determined, such as whether it is an item that serves as a table heading or a value for a heading, the content of the module is interpreted, and semantic information such as "table of ○○" or "list of △△" is added as metadata. As another example, when a module consisting only of link items appears on a page, the semantic information "index" is added to that module.
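The role determination and labelling described here can be sketched as follows. The label strings other than "index", the function names, and the representation of a module as a grouping tag plus a list of resource dictionaries are assumptions. The specification gives only the examples quoted above.

```python
def cell_role(tag):
    """Role of an element inside a table module (assumed mapping)."""
    return {"th": "heading", "td": "value"}.get(tag, "item")


def label_module(tag, resources):
    """Assign semantic information to a module from its grouping tag and resources."""
    # A module made up only of link items is labelled as an index.
    if resources and all(r.get("link") for r in resources):
        return "index"
    if tag == "table":
        return "table"
    if tag in ("ul", "ol", "dl"):
        return "list"
    if tag == "p":
        return "paragraph"
    return "unknown"


menu = [{"link": "news.html"}, {"link": "products.html"}]
print(label_module("ul", menu))
```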

[0049] The style 45 may also be consulted: character data corresponding to the title of a table or list may be extracted from format information, such as the character size or centering of the text of a resource 43, and used as semantic information.

[0050] Then, as shown in FIG. 5(D), the relationship setting means 205 assigns master-subordinate relationships between the modules within a page, module by module. For example, when a table module follows a text module at the same level of the hierarchy and their semantic information is identical or similar, context information (a relationship) indicating that the text module is related to the following table module is added.
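The text-then-table example can be sketched as below. The similarity test (sharing at least one word in the semantic information) is an assumption, since the specification does not define how "identical or similar" is judged. The function name `relate` and the module fields are hypothetical.

```python
def relate(modules):
    """Return (text_module_id, table_module_id) pairs for related neighbours."""
    relations = []
    for prev, nxt in zip(modules, modules[1:]):
        same_level = prev["level"] == nxt["level"]
        text_then_table = prev["kind"] == "text" and nxt["kind"] == "table"
        # Assumed similarity measure: the two meanings share at least one word.
        shared = set(prev["meaning"].lower().split()) & set(nxt["meaning"].lower().split())
        if same_level and text_then_table and shared:
            relations.append((prev["id"], nxt["id"]))
    return relations


modules = [
    {"id": 1, "level": 1, "kind": "text", "meaning": "price list"},
    {"id": 2, "level": 1, "kind": "table", "meaning": "table of price"},
    {"id": 3, "level": 1, "kind": "text", "meaning": "contact"},
]
print(relate(modules))
```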

[0051] Then, as shown in FIG. 5(E), the customization processing means 206 corrects the module relationships assigned by the relationship setting means 205 in accordance with input instructions from the user. The customization processing means 206 also makes dynamic changes according to arbitrary instructions input by the user, such as collecting only heading items by means of the tags to create a new in-page index, substituting a summary as a new resource 43 when the text of a resource 43 is long, or converting tabular data so that each item is expressed in list form.

[0052] Next, as shown in FIG. 5(F), the reconstruction processing means 207 reconstructs the tree structure between pages based on the links 44 of the XHTML document, arranges the resources 43 in order according to this tree structure and the module relationships, and generates an XML document, which is the intermediate data.

[0053] When reconstructing the tree structure between pages, the reconstruction processing means 207 can also give higher priority to modules that are linked to many times and, by weighting keywords that appear in the top page, to modules whose semantic information is identical or similar to those keywords, and arrange the pages containing these modules higher in the hierarchy. For example, when the top page contains keywords such as "press release", "new arrival information", and "update information", pages consisting of modules to which the same semantic information as those keywords has been assigned are arranged so that they are linked directly from the top page.

[0054] In this case, depending on the keyword weighting, the pages can also be arranged so that modules with the semantic information "press release" or "new arrival information" are linked directly from the top page, while modules with the semantic information "update information" are placed at a lower-level link.
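The prioritisation described in the two preceding paragraphs can be sketched as a simple score. The additive scoring formula, the threshold, the two-level depth, and all names here are assumptions. The specification only states that link counts and keyword weights raise a module's priority.

```python
def assign_depth(modules, keyword_weights, threshold, link_weight=1.0):
    """Map each module id to depth 1 (linked from the top page) or depth 2."""
    depths = {}
    for m in modules:
        # Assumed score: inbound link count plus the weight of the matching keyword.
        score = link_weight * m["inbound_links"] + keyword_weights.get(m["meaning"], 0.0)
        depths[m["id"]] = 1 if score >= threshold else 2
    return depths


weights = {"press release": 3.0, "new arrival information": 3.0, "update information": 1.0}
modules = [
    {"id": "pr", "meaning": "press release", "inbound_links": 0},
    {"id": "upd", "meaning": "update information", "inbound_links": 1},
    {"id": "co", "meaning": "company profile", "inbound_links": 5},
]
print(assign_depth(modules, weights, threshold=3.0))
```

With these sample weights, "press release" reaches the top level on its keyword weight alone, while "update information" falls to the lower level, matching the arrangement described in paragraph [0054].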

[0055] The reconstruction processing means 207 can also hold industry-category-specific configuration information 211, in which page configuration information is defined for each predetermined industry category, and reconstruct the tree structure according to the page configuration information selected for the identified industry category.

[0056] When generating the XML data, the reconstruction processing means 207 may also generate age-specific versions of the XML data whose content is changed according to the age of the person accessing it.

[0057] FIG. 6 shows a processing flowchart of the data extraction/structure conversion processing unit.

[0058] In the data extraction/structure conversion processing unit 2, the normalization processing means 201 acquires all the HTML pages in the same site as the Web content 4 (step S1) and converts them into XHTML format (step S2). The separation processing means 202 then separates the resources 43 from the styles 45 (step S3) and extracts the link information (step S4). The category specifying means 203 identifies the industry category of the site (step S5). The module generation processing means 204 groups the resources 43 to generate modules (step S6), and identifies and assigns the meaning of each module from the roles of its elements (step S7). The relationship setting means 205 then relates the modules to one another (step S8), and the customization processing means 206 performs customization (step S9). Finally, the reconstruction processing means 207 reconstructs the tree structure of the site based on the link information (step S10), and arranges the resources 43 in order according to the reconstructed tree structure and the module relationships to reconstruct the content in XML data format (step S11).

[0059] Next, the content generation processing unit 3 will be described. FIG. 7 shows a configuration example of the content generation processing unit 3. The content generation processing unit 3 includes specification information acquisition processing means 301, template selection means 302, dynamic generation processing means 303, and a template database (template DB) 310.

[0060] The specification information acquisition processing means 301 is means for acquiring, when there is an access request from a mobile terminal 13, specification information such as the model name of the mobile terminal 13 and the name of the carrier it uses. As specification information, information such as the region and time of the access request and the age of the person accessing may be added by the data center 14 that relays the access request.

[0061] The template selection means 302 is means for selecting, based on the acquired specification information, the applicable template from the template DB 310, which holds information stored in advance that defines content display formats.

[0062] In the templates stored in the template DB 310, information such as the language that the Web browser of the mobile terminal 13 can interpret, the display screen size, the data capacity, and the supported color types is defined for each model of mobile terminal 13 of each carrier. A template may also include a style sheet that defines page format information.
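Template selection keyed by carrier and model can be sketched as a dictionary lookup. The `Template` fields follow the items listed above, but their names, the sample values, the carrier and model strings, and the fallback to a default template are assumptions not found in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    language: str       # markup language the terminal's browser can interpret
    screen_width: int   # display size, in assumed units of characters
    screen_height: int
    max_bytes: int      # data capacity per page
    colors: int         # number of supported colors


def select_template(spec, templates, default):
    """Pick the template registered for the terminal's carrier and model."""
    return templates.get((spec["carrier"], spec["model"]), default)


# Hypothetical stand-in for the template DB 310.
templates = {
    ("carrier-a", "model-x"): Template("C-HTML", 16, 6, 2048, 256),
    ("carrier-b", "model-y"): Template("C-HTML", 20, 8, 5120, 4096),
}
default = Template("C-HTML", 16, 6, 2048, 2)

spec = {"carrier": "carrier-a", "model": "model-x"}
print(select_template(spec, templates, default))
```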

[0063] The dynamic generation processing means 303 is means for generating the mobile content 6 from the applicable XML data stored in the content DB 5, according to the template selected from the template DB 310. For example, when the language defined in the template is C-HTML, each page of the XML data is divided to fit the display screen size, the links between pages are reconstructed, and the result is rewritten in C-HTML to generate the mobile content 6. When the template includes a style sheet, the dynamic generation processing means 303 generates the mobile content 6 according to the format information of that style sheet.
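Dividing a page and rebuilding the links between the resulting pages can be sketched as follows. Splitting on a byte budget rather than on screen rows, measuring bytes in UTF-8, the `pageN.html` naming scheme, and the function names are all assumptions made for illustration.

```python
def paginate(blocks, max_bytes):
    """Greedily pack markup blocks into pages that stay within max_bytes."""
    pages, current, size = [], [], 0
    for block in blocks:
        n = len(block.encode("utf-8"))
        if current and size + n > max_bytes:
            pages.append(current)
            current, size = [], 0
        # A single block larger than max_bytes still gets a page of its own.
        current.append(block)
        size += n
    if current:
        pages.append(current)
    return pages


def render(pages):
    """Join each page's blocks and append a link to the following page."""
    out = []
    for i, page in enumerate(pages):
        body = "".join(page)
        if i + 1 < len(pages):
            body += '<a href="page%d.html">next</a>' % (i + 2)
        out.append(body)
    return out


pages = paginate(["aaaa", "bbbb", "cc"], max_bytes=8)
print(render(pages))
```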

[0064] When the specification information includes access information such as the access region or the age of the person accessing, the dynamic generation processing means 303 can also extract the applicable XML data from the content DB using this access information as a key and generate the mobile content 6.

[0065] FIG. 8 shows a processing flowchart of the content generation processing unit 3.

[0066] When an access request is accepted from a mobile terminal 13 (step S21), the specification information acquisition processing means 301 acquires the specification information added by the data center 14 (step S22). The template selection means 302 then selects the applicable template from the template DB 310 based on the specification information (step S23). The dynamic generation processing means 303 then extracts the content (XML data) relevant to the access request from the content DB 5 (step S24), generates the mobile content 6 from the XML data according to the template (step S25), and returns the generated mobile content 6 to the requesting mobile terminal 13 (step S26). The following shows examples of the data at each stage when the present invention is applied to reconstruct, from the Web content 4 of an ordinary site, mobile content 6 suited to a small screen size such as that of a mobile phone terminal.

[0067] FIG. 9 shows a display example of Web content suitable for browsing on an ordinary computer terminal or the like. FIG. 10 shows the HTML data around the portion indicated by the dotted line in the display example of the Web content 4 in FIG. 9.

[0068] The HTML data shown in FIG. 10 is converted into XHTML data by the normalization processing means 201 of the data extraction/structure conversion processing unit 2. FIG. 11 shows an example of the XHTML data converted by the normalization processing means 201.

[0069] The XHTML data of FIG. 11 is reconstructed by the processing means of the data extraction/structure conversion processing unit 2, converted into a description in XML, which is a reusable data format, and stored in the content DB 5. FIG. 12 and FIG. 13 show an example of the XML data reconstructed from the XHTML data of FIG. 11.

[0070] After that, when there is an access request from a mobile terminal 13, the content generation processing unit 3 reconstructs the page structure of the XML data shown in FIG. 12 and FIG. 13 according to the template corresponding to the mobile terminal 13, and the mobile content 6 is generated.

[0071] FIG. 14 shows an example of a display format template corresponding to the display processing environment of the mobile terminal 13 that requested access. The template of FIG. 14 defines, as the page structure of the mobile content 6 to be displayed on the mobile terminal 13, that the modules designated container[@id='10'] and container[@id='11'], among the containers 42 that make up the XML data, are laid out as pages in that order. FIG. 15 shows an example of the mobile content 6 generated from the XML data of FIG. 12 and FIG. 13 according to the template of FIG. 14.
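Selecting containers by the expressions container[@id='10'] and container[@id='11'] can be sketched with the limited XPath support in Python's `xml.etree.ElementTree`. The surrounding `<page>` element, the sample content, and the function name `build_page` are assumptions. The actual XML of FIG. 12 and FIG. 13 is not reproduced in this text.

```python
import xml.etree.ElementTree as ET


def build_page(xml_text, container_ids):
    """Return the text of the listed containers, in the order the template gives."""
    root = ET.fromstring(xml_text)
    parts = []
    for cid in container_ids:
        node = root.find(".//container[@id='%s']" % cid)
        if node is not None:
            parts.append("".join(node.itertext()).strip())
    return parts


# Hypothetical intermediate XML; containers appear in a different order than the template.
xml_text = (
    "<page>"
    "<container id='11'><resource>New products</resource></container>"
    "<container id='10'><resource>Press Release</resource></container>"
    "<container id='12'><resource>Footer</resource></container>"
    "</page>"
)
print(build_page(xml_text, ["10", "11"]))
```

The order of the output follows the template, not the stored XML, which is the point of separating the intermediate data from the display format.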

[0072]

[Effects of the Invention] As described above, according to the present invention, the data extraction/structure conversion processing unit 2 reconstructs Web content based on its content and structure, and generates and stores reusable intermediate data (XML data). At the time of an access request, the content generation processing unit 3 automatically generates the mobile content 6 from the stored XML data according to a template selected based on the specification information of the requesting mobile terminal 13. This reduces the burden of reconstructing Web content for an ordinary site into mobile content suited to display on a terminal with a small display screen, such as a mobile phone terminal.

[0073] Further, according to the present invention, the data extraction/structure conversion processing unit 2 identifies and assigns a meaning to each predetermined module of the content and reconstructs the tree structure of the pages based on that meaning. As a result, even for access from a terminal that restricts the amount of data that can be displayed, content with an appropriate page structure can be reconstructed from the XML data and provided.

[0074] The effects of the present invention are listed below.
(1) HTML content can be converted, without human intervention, into a reusable format (XML) according to its content.
(2) When searching on the basis of the reusable format, highly accurate search results can be obtained.
(3) On top of the basic rules for each industry category, it becomes easy to reorder items and change the displayed content so as to reflect the provider's intent.

[Brief description of drawings]

FIG. 1 is a diagram showing a configuration example of a system according to the present invention.

FIG. 2 is a diagram showing an example of the connection relationships of the system according to the present invention.

FIG. 3 is a diagram showing an example of the data structure of Web content.

FIG. 4 is a diagram showing a configuration example of the data extraction/structure conversion processing unit.

FIG. 5 is a diagram for explaining the processing of the data extraction/structure conversion processing unit.

FIG. 6 is a processing flowchart of the data extraction/structure conversion processing unit.

FIG. 7 is a diagram showing a configuration example of the content generation processing unit.

FIG. 8 is a processing flowchart of the content generation processing unit.

FIG. 9 is a diagram showing a display example of ordinary Web content.

FIG. 10 is a diagram showing an example of the HTML data of the Web content shown in FIG. 9.

FIG. 11 is a diagram showing an example of XHTML data converted from the HTML data shown in FIG. 10.

FIG. 12 is a diagram showing an example of XML data reconstructed by the data extraction/structure conversion processing unit.

FIG. 13 is a diagram showing an example of XML data reconstructed by the data extraction/structure conversion processing unit.

FIG. 14 is a diagram showing an example of a template.

FIG. 15 is a diagram showing a display example of mobile content on a mobile terminal.

[Explanation of symbols]

1 Content reconstruction processing system
2 Data extraction/structure conversion processing unit
201 Normalization processing means
202 Separation processing means
203 Category specifying means
204 Module generation processing means
205 Relationship setting means
206 Customization processing means
207 Reconstruction processing means
210 Industry category DB
211 Industry-category-specific configuration information
3 Content generation processing unit
301 Specification information acquisition processing means
302 Template selection means
303 Dynamic generation processing means
310 Template DB
4 Web content (HTML)
5 Content DB (XML)
6 Mobile content (C-HTML)

Continuation of front page
(72) Inventor: Shingo Okamoto, c/o Fujitsu Social Science Laboratory Ltd., 1-403 Kosugicho, Nakahara-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Tomoyoshi Inada, c/o Fujitsu Social Science Laboratory Ltd., 1-403 Kosugicho, Nakahara-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Takuo Nakamura, c/o Fujitsu Social Science Laboratory Ltd., 1-403 Kosugicho, Nakahara-ku, Kawasaki-shi, Kanagawa
F-term (reference): 5B082 GA02 GA07
(54) [Title of the Invention] Data extraction/structure conversion processing program, content generation processing program, data extraction/structure conversion processing program recording medium, content generation processing program recording medium, and content reconstruction processing system

Claims

[Claims]

1. A data extraction/structure conversion processing program for causing a computer to execute data extraction/structure conversion processing that reconstructs intermediate content data in a reusable format from content data described in a tagged markup language, the program causing the computer to execute: module generation processing of extracting, from the content data, resource information, which is a portion constituting information meaningful to a human, based on tags, which are information defining the data structure, grouping the resource information based on the tags to generate modules, and assigning meaning to the modules; relationship setting processing of assigning relationships between the modules based on the tags and reconstructing a tree structure between pages of the content data from link information defined in the content data; reconstruction processing of rearranging the modules according to the relationships between the modules and the tree structure to generate the intermediate content data in a reusable format; and content data storage processing of storing the intermediate content data.

2. The data extraction/structure conversion processing program according to claim 1, wherein, in the reconstruction processing, the computer is caused to execute processing of weighting the modules using the meanings of the modules, and rearranging the modules according to the weighting, the relationships between the modules, and the tree structure to generate the intermediate content data in a reusable format.

3. The data extraction/structure conversion processing program according to claim 1, further causing the computer to execute industry category identification processing of identifying the industry category of the content data based on industry category information, provided in advance, in which company names, product names, and words or phrases relating to company activities are managed for each industry category, wherein, in the reconstruction processing, category-specific configuration information that defines rules for the module configuration of content data for each industry category is provided, and the computer is caused to execute processing of rearranging the modules based on the category-specific configuration information selected according to the industry category of the content data, the relationships between the modules, and the tree structure, to generate the intermediate content data in a reusable format.

4. A content generation processing program for causing a computer to execute content generation processing that generates, from intermediate content data in a reusable format described in a tagged markup language, content data suited to display on a terminal from which a browsing request has been received, the program causing the computer to execute: processing of accessing template storage means that holds templates, which are display format information corresponding to the display processing environment of the terminal; processing of accepting a browsing request from the terminal and type information of the terminal; and dynamic generation processing of extracting the corresponding template from the template storage means based on the identification information and reconstructing the intermediate content data according to the template to generate the content data.

5. The content generation processing program according to claim 4, wherein, in the dynamic generation processing, the computer is caused to execute processing of accepting access information regarding the access of the terminal together with the browsing request, extracting the corresponding template from the template storage means based on the identification information and the access information, and reconstructing the intermediate content data according to the template to generate the content data.

6. A data extraction/structure conversion processing program recording medium on which is recorded a program for causing a computer to execute data extraction/structure conversion processing that reconstructs intermediate content data in a reusable format from content data described in a tagged markup language, the program causing the computer to execute: module generation processing of extracting, from the content data, resource information, which is a portion constituting information meaningful to a human, based on tags, which are information defining the data structure, grouping the resource information based on the tags to generate modules, and assigning meaning to the modules; relationship setting processing of assigning relationships between the modules based on the tags and reconstructing a tree structure between pages of the content data from link information defined in the content data; reconstruction processing of rearranging the modules according to the relationships between the modules and the tree structure to generate the intermediate content data in a reusable format; and content data storage processing of storing the intermediate content data.

7. A content generation processing program recording medium on which is recorded a program for causing a computer to execute content generation processing that generates, from intermediate content data in a reusable format described in a tagged markup language, content data suited to display on a terminal from which a browsing request has been received, the program causing the computer to execute: processing of accessing template storage means that holds templates, which are display format information corresponding to the display processing environment of the terminal; processing of accepting a browsing request from the terminal and type information of the terminal; and dynamic generation processing of extracting the corresponding template from the template storage means based on the identification information and reconstructing the intermediate content data according to the template to generate the content data.

8. A content reconstruction processing system comprising: a data extraction/structure conversion processing unit that reconstructs and stores intermediate content data in a reusable format from first content data described in a first tagged markup language; and a content generation processing unit that generates, from the intermediate content data, second content data described in a second tagged markup language and suited to display on a terminal from which a browsing request has been received, wherein the data extraction/structure conversion processing unit comprises: module generation processing means for extracting, from the first content data, resource information, which is a portion constituting information meaningful to a human, based on tags, which are information defining the data structure, grouping the resource information based on the tags to generate modules, and assigning meaning to the modules; relationship setting means for assigning relationships between the modules based on the tags and reconstructing a tree structure between pages of the first content data from link information defined in the first content data; reconstruction processing means for rearranging the modules according to the relationships between the modules and the tree structure to generate the intermediate content data in a reusable format; and content data storage means for storing the intermediate content data, and wherein the content generation processing unit comprises: template storage means that holds templates, which are display format information corresponding to the display processing environment of the terminal; and dynamic generation processing means for accepting a browsing request from the terminal and type information of the terminal, extracting the corresponding template from the template storage means based on the identification information, and reconstructing the intermediate content data according to the template to generate the content data.