JP2005267344A

JP2005267344A - Document shaping device, document shaping method, and program

Info

Publication number: JP2005267344A
Application number: JP2004080074A
Authority: JP
Inventors: Shingo Iwasaki; 晋吾岩崎
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2004-03-19
Filing date: 2004-03-19
Publication date: 2005-09-29

Abstract

PROBLEM TO BE SOLVED: To reduce the time and effort needed to obtain information by automatically extracting the information a person is looking for from content created using a structured document, and shaping that information.

SOLUTION: A document shaping system comprises an extraction means (208) that extracts, from the object data in a structured document, object data related to a keyword, and a shaping means (211) that shapes the structured document relating to the extracted object data.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a technique for shaping a structured document.

Conventionally, when a person looks for desired information in the vast amount of content data created using structured documents and published on the Internet, the person uses a search site, enters keywords related to the desired information, and obtains URLs indicating content that may contain those keywords. The person then accesses the content at each obtained URL, searches within that content again for the desired information, and only then obtains it.

Patent Document 1 below discloses a method of accessing the Internet using a bookmark set.

Japanese Patent Laid-Open No. 11-265335

Thus, obtaining the desired information requires a two-step process: first finding the URL of content that contains the information, and then searching again within that content. This takes time. Moreover, when the desired information is spread over several pieces of content rather than one, each piece must be accessed and searched separately, which is very laborious and inefficient.

An object of the present invention is to reduce the time and effort needed to obtain information by automatically extracting the information a person wants from content created using structured documents and shaping it.

The document shaping apparatus of the present invention comprises an extraction means that extracts, from the object data in a structured document, object data related to a keyword, and a shaping means that shapes the structured document relating to the extracted object data.
The document shaping method of the present invention comprises an extraction step of extracting, from the object data in a structured document, object data related to a keyword, and a shaping step of shaping the structured document relating to the extracted object data.
The program of the present invention is a program for causing a computer to execute each step of the above document shaping method.

When the user specifies a keyword, the information the user wants can be automatically extracted from structured documents and shaped, so the time and effort needed to obtain the information can be reduced.

Hereinafter, embodiments of the present invention will be described in detail using specific examples.
(First embodiment)
FIG. 1 shows an example of how the structured document shaping apparatus according to the first embodiment of the present invention is used. To obtain desired information from the content data mixed together on the Internet 108, the user gives the apparatus 104 a keyword (one or more character strings) related to that information, issuing the instruction to a personal computer 103 (hereinafter abbreviated as PC) with an input device such as a keyboard or mouse. The structured document data extraction and shaping output apparatus (the apparatus) 104 is incorporated in the PC 103 and transmits the keyword given by the user's instruction to the content search device 106. On receiving the keyword, the content search device 106 connects to the Internet 108 and acquires URLs (Uniform Resource Locators) indicating where content related to the keyword is stored, and the apparatus 104 receives those URLs. The apparatus 104 connects to the Internet 108 and acquires the content data at the received URLs. The apparatus 104 analyzes the structure of the acquired content data (one or more items) and, using the keyword given by the user, extracts from the content data the object data (text data, binary data, and so on) that can be judged to be related to the information the user wants. It then finds relationships among these object data, combines and associates text data with binary data, and gathers the object data together. The object data are then transmitted to the automatic data layout output device 114. The automatic data layout output device 114 automatically combines and shapes the object data and converts them into data suited to the target printing device. The shaped data is output to the printing device 116, a document containing only the information the user wants is printed, and the series of processes ends.

FIG. 2 is a diagram showing the overall flow of processing in the apparatus 104 according to the present embodiment. The flow of processing inside the apparatus 104 is described below with reference to FIG. 2.

Reference numeral 201 denotes the processing performed inside the apparatus 104. A keyword 202 (one or more character strings) is entered by the user and received by the input unit 203. The content search device utilization unit 204 passes the entered keyword to the content search device 106 (a search site or the like), using an API or similar interface provided by the content search device 106, and receives from it URLs indicating where content data related to the keyword is located. The communication unit 205 accesses each received URL over an Internet connection and acquires the content data. The conversion processing unit 206 then normalizes the structure of the acquired content data into a strict form; for example, content data written in HTML (HyperText Markup Language) is converted into XHTML (eXtensible HyperText Markup Language). The analysis processing unit 207 examines the structure of the normalized content data and extracts the tree structure that represents it, together with the object data that make up the content, such as text data and/or binary data. The comparison, determination, and extraction processing unit 208 compares character strings to determine whether each extracted object data item is related to the keyword entered by the user, and outputs the result. When the object data is an image, the unit analyzes how the image data is embedded in the content data. Specifically, it extracts text data located near a description relating to image data, such as 'jpeg' or 'tif', and if that text contains the keyword entered by the user, it determines that the image is related to the keyword. When analysis of the content data shows that some of the extracted object data items are related to one another, the summary processing unit 209 combines those items into a single unit in XML (eXtensible Markup Language) format. The additional information addition processing unit 210 then assigns priorities indicating the importance of the combined object data and attaches, as information, the URL indicating the location of the content data that contains the object data. The storage processing unit 211 stores the combined object data in the data storage area unit 212, which is constituted by a RAM or the like; it also acquires the actual image data and similar entities via the Internet and stores them as well. Finally, it notifies the output unit 213 that the data has been stored. On receiving the notification, the output unit 213 reads the notified object data from the data storage area and outputs it to the automatic data layout output device 214.

Note that the conversion processing unit 206 may be omitted; that is, the structured document may be either HTML or XHTML. When the object data is text data, the comparison, determination, and extraction processing unit 208 analyzes it in units of sentences or paragraphs, determines that a unit is related to the keyword if it contains the keyword, and extracts the object data in those units.
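The sentence-unit or paragraph-unit extraction performed by unit 208 can be illustrated with a minimal Python sketch. The function name `extract_units`, the naive punctuation-based sentence splitter, and the blank-line paragraph rule are assumptions made for illustration; the patent does not specify how units are delimited.

```python
import re


def extract_units(text, keywords, unit="sentence"):
    """Return the sentences (or paragraphs) of `text` that contain a keyword.

    Hypothetical helper: the splitting rules below are illustrative
    assumptions, not taken from the patent.
    """
    if unit == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        units = re.split(r"\n\s*\n", text)
    else:
        # Assumption: a sentence ends with ., !, ? or the Japanese full stop
        # and is followed by whitespace.
        units = re.split(r"(?<=[.!?。])\s+", text)
    units = [u.strip() for u in units]
    # A unit is judged relevant when it contains at least one keyword.
    return [u for u in units if u and any(k in u for k in keywords)]
```

With `unit="sentence"` only the matching sentences are returned; with `unit="paragraph"` the whole paragraph around a match is kept, which trades precision for context.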

The processing of units 207 to 210 described above is explained in more detail using the specific example shown in FIG. 3. In this example, the keywords entered by the user are the three character strings "abc", "xyz", and "123", the related content data is described by the structured document 301 in FIG. 3, and the URL of this content data is "http://www.123・・・".

The communication unit of the apparatus accesses the URL "http://www.123・・・" and acquires the content data described by the structured document 301. The acquired content data is analyzed in order from the top. First, the text data 302 is checked for keywords. It contains none of the keywords "abc", "xyz", and "123", so it is judged to be unnecessary object data. The text data 303 is checked in the same way; it also contains no keyword and is likewise judged unnecessary. Next, the text data 304 is checked. Here the character string "xyz" is found, so the text data 304 is judged to be necessary object data. The object data itself (text data 304), the number of keywords it contains, and the order in which it was analyzed (a unique identifier) are stored. Next, the text data 305 is checked. It contains one occurrence of the character string "123", so it is judged necessary, and the object data, the number of keywords it contains, and its analysis order are stored. Next, the text data 306 is found and checked. The three character strings "abc", "xyz", and "123" are found, so the text data 306 is judged necessary, and the object data itself (text data 306), the number of keywords it contains, and its analysis order are stored. Similarly, the text data 307 is checked. The two character strings "abc" and "xyz" are found, so it is judged necessary, and the object data itself (text data 307), the number of keywords it contains, and its analysis order are stored. Necessary information is then added to the stored object data, and they are combined in XML format as shown at 308.
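The top-down scan described above can be sketched as follows. This is a minimal illustration, assuming that the "number of keywords" means the number of distinct keywords present in a text object (the patent does not state whether repeated occurrences are counted); the names `Hit` and `scan_text_objects` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    order: int          # position in analysis order, used as a unique identifier
    text: str           # the object data itself
    keyword_count: int  # number of distinct keywords found (assumption)


def scan_text_objects(text_objects, keywords):
    """Scan text objects in document order and keep those containing a keyword."""
    hits = []
    for order, text in enumerate(text_objects, start=1):
        count = sum(1 for k in keywords if k in text)
        if count > 0:
            # Necessary object data: remember the text, its count, and its order.
            hits.append(Hit(order, text, count))
        # Objects with no keyword are judged unnecessary and are not stored.
    return hits
```

Objects without any keyword are never stored, which matches the document's later point that unneeded object data does not consume the storage area.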

In the XML data 308, the <contentsno> tag first records the position of this content data in the order in which the multiple content data items were analyzed. In this case only one content data item is analyzed, so the value is "1". Next, each object data item is enclosed in a <data> tag, within which the priority, the number of keywords, and so on are added. Comparing the text data 304, 305, 306, and 307, the text data 306 contains the most keywords and is therefore judged the most important, so it is given the highest priority ("1") in the <priority> tag. Likewise, the text data 307 contains the next largest number of keywords, so it is judged the second most important and given priority "2". The text data 304 and 305 each contain only one keyword, so their priority cannot be decided from the keyword count alone. In this case they are compared by the order in which the object data were analyzed (found). Because the text data 304 was analyzed before the text data 305, the text data 304 is given the higher priority. Finally, the URL of the content data containing the text data 304, 305, 306, and 307 is added in a <link> tag, and the whole is combined into a single XML data item and stored in the data storage area.
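The ranking rule (more keywords first, ties broken by analysis order) and the XML summary can be sketched with the standard library's `xml.etree.ElementTree`. The tag names `contentsno`, `data`, `priority`, and `link` come from the description; the root element `contents`, the tags `keywordcount` and `text`, and the function name `build_summary` are hypothetical, since the exact layout of the XML data 308 is only shown in the figure.

```python
import xml.etree.ElementTree as ET


def build_summary(hits, url, contents_no=1):
    """Build a summary element from (order, text, keyword_count) tuples."""
    # Higher keyword count ranks first; equal counts fall back to analysis order.
    ranked = sorted(hits, key=lambda h: (-h[2], h[0]))

    root = ET.Element("contents")  # hypothetical root element
    ET.SubElement(root, "contentsno").text = str(contents_no)
    for priority, (order, text, count) in enumerate(ranked, start=1):
        data = ET.SubElement(root, "data")
        ET.SubElement(data, "priority").text = str(priority)
        ET.SubElement(data, "keywordcount").text = str(count)  # hypothetical tag
        ET.SubElement(data, "text").text = text                # hypothetical tag
    # URL of the content data from which the objects were extracted.
    ET.SubElement(root, "link").text = url
    return root
```

Calling `ET.tostring(build_summary(hits, url), encoding="unicode")` would serialize the result for storage. Sorting on the pair (negative count, order) expresses both rules in a single key.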

(Second embodiment)
A second embodiment of the present invention will now be described. The first embodiment extracted text data only; this embodiment describes in detail the processing of the apparatus when both text data and image data (binary data) are extracted. In this example, the keywords entered by the user are the three character strings "abc", "xyz", and "123", the related content data is described by the structured document 401 shown in FIG. 4, and the URL of this content data is "http://www.456・・・".

The communication unit of the apparatus accesses the URL "http://www.456・・・" and acquires the content data described by the structured document 401 in FIG. 4. The acquired content data is then analyzed in order from the top. First, the text data 402 is checked for the above keywords. It contains none of the keywords "abc", "xyz", and "123", so it is judged to be unnecessary object data. Next, the data 403 relating to image data is found and checked for keywords. The character string of the alt attribute (alternative text) in the <img> tag contains the two keywords "xyz" and "123", so the data is judged to be necessary object data, and the object data itself (the text data 403 relating to the image data), the number of keywords it contains, and its analysis order are stored. Next, the data 404 relating to image data is checked for keywords. The <img> tag of the data 404 has no alt attribute, so this data alone does not show whether it is related to a keyword, and the decision is deferred. Next, the text data 405 is found and checked. The three character strings "abc", "xyz", and "123" are found in the text data 405, so it is judged to be necessary object data. The object data (text data 405), the number of keywords it contains, and its analysis order are stored. In addition, because the text data 405 is adjacent to the image data description 404, the deferred data 404 indicating image data is also judged to be object data related to the keywords and is stored in association with the text data 405. In other words, when analysis of a data item by itself cannot determine whether it is necessary, this process decides based on the data analyzed before and after it. Next, the text data 406 is checked. It contains one occurrence of the character string "123", so it is judged to be necessary object data, and likewise the object data, the number of keywords it contains, and its analysis order are stored. Necessary information is then added to the stored object data, and they are combined in XML format as shown by the structured document 407.
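The alt-attribute check and the deferred decision for images without an alt attribute can be sketched with Python's standard `html.parser`. This is an illustrative sketch under stated assumptions: only the "following text" half of the before-and-after rule is implemented, "adjacent" is taken to mean the very next object in document order, and the names `ObjectCollector` and `select_objects` are hypothetical.

```python
from html.parser import HTMLParser


class ObjectCollector(HTMLParser):
    """Collect text runs and <img> tags in document order."""

    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            self.objects.append({"kind": "image", "src": a.get("src"), "alt": a.get("alt")})

    def handle_data(self, data):
        if data.strip():
            self.objects.append({"kind": "text", "text": data.strip()})


def count_keywords(s, keywords):
    # Assumption: count distinct keywords present, not repeated occurrences.
    return sum(1 for k in keywords if k in s)


def select_objects(objects, keywords):
    selected = []
    pending = None  # (order, image) whose relevance decision is deferred
    for order, obj in enumerate(objects, start=1):
        if obj["kind"] == "image" and obj["alt"] is None:
            pending = (order, obj)  # no alt text: defer the decision
            continue
        source = obj["alt"] if obj["kind"] == "image" else obj["text"]
        n = count_keywords(source, keywords)
        if n:
            entry = {"order": order, "count": n, "object": obj, "image": None}
            # A relevant text directly after a deferred image adopts that image.
            if obj["kind"] == "text" and pending and pending[0] == order - 1:
                entry["image"] = pending[1]
            selected.append(entry)
        pending = None
    return selected
```

A typical use is `c = ObjectCollector(); c.feed(html); select_objects(c.objects, keywords)`. A deferred image that is not followed by a relevant text is simply dropped here, which is one possible reading of the description.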

In the structured document 407, the <contentsno> tag first records the position of this content data in the order in which the multiple content data items were analyzed. In this case only one content data item is analyzed, so the value is "1". Next, each object data item is enclosed in a <data> tag, within which the priority, the number of keywords, and so on are added. Comparing the object data 403, 404, 405, and 406, the text data 405 contains the most keywords and is therefore judged the most important, so it is given the highest priority ("1") in the <priority> tag. Because the text data 405 is stored in association with the data 404 relating to image data, the two are placed in the same <data> tag as a single block. Next, the character string in the alt attribute of the data 403 relating to image data contains the second largest number of keywords, so the data 403 is judged the second most important and given priority "2". The image data and the alt attribute character string are recorded as two separate items. The text data 406 contains only one keyword, so it is given the lowest priority. Finally, the URL of the content data containing the object data 403, 404, 405, and 406 is added in a <link> tag, and the whole is combined into a single XML data item and stored in the data storage area.

(Third embodiment)
A third embodiment of the present invention will now be described. In contrast to the first and second embodiments, this embodiment is a processing example for the case where the content from which keywords are extracted has a table structure. In this example too, the keywords entered by the user are the three character strings "abc", "xyz", and "123", the related content data is described by the structured document 501 in FIG. 5, and the URL of this content data is "http://www.789・・・".

First, the apparatus accesses the URL "http://www.789・・・" and acquires the content data described by the structured document 501 in FIG. 5. The acquired content data is then analyzed in order from the top. First, the text data 502 contains none of the keywords "abc", "xyz", and "123", so it is judged to be unnecessary object data. Next, the text data 503 is checked for keywords. The character string "xyz" is found, so the text data 503 is judged to be necessary object data, and the object data, the number of keywords it contains, and its analysis order are stored. Next, the data 504 relating to image data is checked. The <img> tag of the data 504 has no alt attribute, so this image data alone does not show whether it is related to a keyword, and the decision is deferred. The data 505 relating to image data is deferred in the same way. Next, the text data 506 is checked. It contains the two character strings "123" and "abc", so it is judged to be necessary object data, and the object data (text data 506), the number of keywords it contains, and its analysis order are stored. Next, the text data 507 is checked. It contains the three character strings "abc", "xyz", and "123", so it is judged to be necessary object data, and the object data (text data 507), the number of keywords it contains, and its analysis order are stored. Furthermore, analysis of the structure of the content data shows that the object data 504 to 507 are enclosed in the same <table> tag, so they can be interpreted as being displayed in the form of the table shown at 508. It can therefore be inferred that the text data 506 is a description of the image data 504 and, likewise, that the text data 507 is a description of the image data 505, so each pair is associated. The data 504 and 505 relating to image data can thus also be judged to be related to the keywords and to be necessary object data. Necessary information is then added to the stored object data, and they are combined in XML format as shown by the structured document 509. In other words, when the relevance of object data enclosed in table tags is examined and at least one item is judged to be necessary, the other object data in the table are judged to be necessary as well.
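The table rule can be sketched as follows: if any cell of a table is relevant, every cell is kept, and each image is grouped with the text that describes it. Pairing an image with the text in the same column is an assumption made for illustration, as are the name `select_table`, the `(kind, value)` cell representation, and the requirement that all rows have the same number of cells; the description only says the pairing is inferred from the table layout 508.

```python
def select_table(rows, keywords):
    """rows: list of table rows; each cell is ("image", src) or ("text", string).

    Returns one group per column, or [] when no cell contains a keyword.
    Assumes every row has the same number of cells.
    """
    cells = [cell for row in rows for cell in row]
    relevant = any(
        kind == "text" and any(k in value for k in keywords)
        for kind, value in cells
    )
    if not relevant:
        return []  # nothing in the table matches, so the table is skipped

    groups = []
    for column in zip(*rows):  # walk the table column by column
        images = [value for kind, value in column if kind == "image"]
        texts = [value for kind, value in column if kind == "text"]
        # Assumption: count distinct keywords found in the column's text.
        count = sum(1 for k in keywords if any(k in t for t in texts))
        groups.append({"images": images, "texts": texts, "count": count})
    return groups
```

Each returned group corresponds to one combined block, so sorting the groups by `count` in descending order would give the priority order described for the summary document.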

In the structured document 509, the <contentsno> tag first records the position of this content data in the order in which the multiple content data items were analyzed. In this case only one content data item is analyzed, so the value is "1". Next, each object data item is enclosed in a <data> tag, within which the priority, the number of keywords, and so on are added. Comparing the object data 503, 504, 505, 506, and 507, the text data 507 contains the most keywords and is therefore judged the most important, so it is given the highest priority ("1") in the <priority> tag. Because the text data 507 is associated with the data 505 relating to image data, the two are placed in the same <data> tag as a single block, as shown at 510. Similarly, the text data 506 is judged the second most important, and the text data 506 and the image data 504 are placed together in the same <data> tag as a single block, as shown at 511. The text data 503 contains only one keyword, so it is given the lowest priority. Finally, the URL of the content data containing the object data 503, 504, 505, 506, and 507 is added in a <link> tag, and the whole is combined into a single XML data item and stored in the data storage area.

When the apparatus 104 is incorporated in a server 103 on the Internet, the user accesses the apparatus 104 via the Internet 108 using a PC or a mobile terminal device.

FIG. 7 shows an example of the hardware configuration of a computer according to the embodiments of the present invention. It shows an example in which the PC or server 103 (including the apparatus 104) of the first to fourth embodiments is realized by a computer. The PC or mobile terminal device 603 and the content search device 106 have the same configuration.

A central processing unit (CPU) 702, a ROM 703, a RAM 704, a network interface 705, an input device 706, an output device 707, and an external storage device 708 are connected to a bus 701.

The CPU 702 processes data and performs calculations, and controls the components connected via the bus 701. The ROM 703 stores in advance a control procedure (computer program) for the CPU 702, and the computer starts up when the CPU 702 executes this program. A computer program is stored in the external storage device 708, and is copied to the RAM 704 and executed. The RAM 704 is used as a work memory for data input/output and transmission/reception, and as temporary storage for controlling each component. The external storage device 708 is, for example, a hard disk storage device or a CD-ROM, and retains its contents even when the power is turned off. The CPU 702 performs the processing of the first to fourth embodiments by executing the computer program in the RAM 704.

The network interface 705 is an interface for connecting to a network such as the Internet 108 (FIGS. 1 and 6). The input device 706 is, for example, a keyboard or a mouse, and is used for making various specifications and inputs. The output device 707 is a display, a printer, or the like.

本実施形態は、コンピュータがプログラムを実行することによって実現することができる。また、プログラムをコンピュータに供給するための手段、例えばかかるプログラムを記録したＣＤ−ＲＯＭ等のコンピュータ読み取り可能な記録媒体又はかかるプログラムを伝送するインターネット等の伝送媒体も本発明の実施形態として適用することができる。また、上記のプログラムを記録したコンピュータ読み取り可能な記録媒体等のコンピュータプログラムプロダクトも本発明の実施形態として適用することができる。上記のプログラム、記録媒体、伝送媒体及びコンピュータプログラムプロダクトは、本発明の範疇に含まれる。記録媒体としては、例えばフレキシブルディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ−ＲＯＭ、磁気テープ、不揮発性のメモリカード、ＲＯＭ等を用いることができる。 This embodiment can be realized by a computer executing a program. Means for supplying the program to a computer, for example a computer-readable recording medium such as a CD-ROM on which the program is recorded, or a transmission medium such as the Internet that transmits the program, can also be applied as an embodiment of the present invention. A computer program product, such as a computer-readable recording medium on which the above program is recorded, can likewise be applied as an embodiment of the present invention. The above program, recording medium, transmission medium, and computer program product are included in the scope of the present invention. As the recording medium, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a magnetic tape, a nonvolatile memory card, or a ROM can be used.

以上のように、各実施形態によれば、インターネット上に公開されている膨大にあるコンテンツから、人間の得たい情報が含まれているコンテンツを探し、取得し、さらに、取得したそれぞれのコンテンツから、人間の得たい情報に関連するテキストデータやバイナリデータなどを抽出し、保存し、改めて、それらのデータを組み合わせて、整形し、出力する装置を提供する。 As described above, according to each embodiment, content including information desired by human beings is searched for and acquired from a large amount of content published on the Internet, and further, from each acquired content. The present invention provides an apparatus for extracting text data and binary data related to information that humans want to obtain, storing, re-combining, shaping, and outputting the data.
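The extract, combine, shape, and output flow summarized above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiments. It assumes that object data are limited to `<p>` text and `<img>` elements, that an image is matched through its `alt` text, that priority is ordered by the number of distinct keywords contained with ties broken by document order, and that the output is a small XML document carrying the source URL. The names `ObjectData`, `ObjectExtractor`, `extract_and_rank`, and `to_xml`, and the XML element and attribute names, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


@dataclass
class ObjectData:
    kind: str       # "text" or "image"
    content: str    # paragraph text, or the image's src attribute
    matched: str    # text the keywords are compared against
    hits: int = 0   # number of distinct keywords found
    order: int = 0  # position among the candidates in the source document


class ObjectExtractor(HTMLParser):
    """Collects <p> text and <img alt=...> as candidate object data (assumption)."""

    def __init__(self):
        super().__init__()
        self.objects = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._buf = True, []
        elif tag == "img":
            a = dict(attrs)
            # The alt text stands in for the binary image when matching keywords.
            self.objects.append(
                ObjectData("image", a.get("src") or "", a.get("alt") or ""))

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.objects.append(ObjectData("text", text, text))
            self._in_p = False


def extract_and_rank(html, keywords):
    """Keep only keyword-related object data, highest keyword count first."""
    parser = ObjectExtractor()
    parser.feed(html)
    parser.close()
    kept = []
    for order, obj in enumerate(parser.objects):
        obj.order = order
        haystack = obj.matched.lower()
        obj.hits = sum(1 for k in keywords if k.lower() in haystack)
        if obj.hits > 0:  # unrelated object data is never stored
            kept.append(obj)
    # More keywords means higher priority; ties keep document order.
    kept.sort(key=lambda o: (-o.hits, o.order))
    return kept


def to_xml(objects, source_url):
    """Shape the ranked object data into one XML document (hypothetical schema)."""
    root = ET.Element("result", {"source": source_url})
    for priority, obj in enumerate(objects, start=1):
        el = ET.SubElement(root, "object", {
            "kind": obj.kind,
            "priority": str(priority),
            "keywords": str(obj.hits),
        })
        el.text = obj.content
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = ('<p>Canon printers are sold worldwide.</p>'
            '<img src="cam.jpg" alt="Canon camera">'
            '<p>The weather is sunny today.</p>'
            '<p>A Canon camera with a new lens.</p>')
    ranked = extract_and_rank(page, ["canon", "camera"])
    print(to_xml(ranked, "http://example.com/"))
```

In this sketch, object data that contains no keyword is discarded before anything is stored, which mirrors the reduced-storage effect described for the embodiments. A fuller version would also have to handle table tags and the association of an image with its adjacent text.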

インターネット上に混在する情報から、人間が得たい情報を、より効率的に、すばやく取得でき、人間が情報を探すための作業にかかる手間を省くことができる。さらに、コンテンツ全体ではなく、コンテンツデータの中から、自分の知りたい情報だけを見ることができるので、情報を吸収する時間も短縮できる。さらに、このような処理を、今までにない、インターネットサービスの１つのモデルとして確立できる。さらに、必要のないオブジェクトデータは、格納領域にストアをしないため、少ないリソースでのシステム構成が可能となり、より低コストで、システムを実現できる。 Information that a person wants to obtain can be acquired more efficiently and quickly from information mixed on the Internet, and the labor for a person to search for information can be saved. Furthermore, since only the information that the user wants to know can be viewed from the content data, not the entire content, the time for absorbing the information can be shortened. Furthermore, such a process can be established as a model of an Internet service that has never existed before. Furthermore, since unnecessary object data is not stored in the storage area, a system configuration with fewer resources is possible, and the system can be realized at a lower cost.

なお、上記実施形態は、何れも本発明を実施するにあたっての具体化の例を示したものに過ぎず、これらによって本発明の技術的範囲が限定的に解釈されてはならないものである。すなわち、本発明はその技術思想、またはその主要な特徴から逸脱することなく、様々な形で実施することができる。 The above-described embodiments are merely examples of implementation in carrying out the present invention, and the technical scope of the present invention should not be construed in a limited manner. That is, the present invention can be implemented in various forms without departing from the technical idea or the main features thereof.

FIG. 1: 実施形態における、構造化文書整形装置の利用モデルを示す図である。 A diagram showing a usage model of the structured document shaping device in the embodiment.
FIG. 2: 実施形態における本装置全体の構成図である。 A block diagram of the entire device in the embodiment.
FIG. 3: 実施形態において、解析対象のコンテンツデータおよび、出力データの具体例を示した図である。 A diagram showing a specific example of content data to be analyzed and of output data in the embodiment.
FIG. 4: 実施形態において、解析対象のコンテンツデータおよび、出力データの具体例を示した図である。 A diagram showing a specific example of content data to be analyzed and of output data in the embodiment.
FIG. 5: 実施形態において、解析対象のコンテンツデータおよび、出力データの具体例を示した図である。 A diagram showing a specific example of content data to be analyzed and of output data in the embodiment.
FIG. 6: 実施形態における、本装置へのアクセスの他の例を示す図である。 A diagram showing another example of access to the device in the embodiment.
FIG. 7: コンピュータのハードウエア構成例を示すブロック図である。 A block diagram showing a hardware configuration example of a computer.

Explanation of symbols

１０３ PC
１０４構造化文書整形装置
１０６コンテンツ検索装置
１０８インターネット
１１４データ自動レイアウト出力装置
１１６印刷機器
２０１構造化文書整形装置内部の各処理の流れ
２０２ユーザーが入力した文字列キーワード
２０３入力部
２０４コンテンツ検索装置利用部
２０５通信部
２０６変換処理部
２０７解析処理部
２０８比較判断及び抽出処理部
２０９まとめ処理部
２１０付加情報追加処理部
２１１格納処理部
２１２データ格納領域部
２１３出力部
２１４データ自動レイアウト出力装置
３０１コンテンツデータ
３０２オブジェクトデータ
３０３オブジェクトデータ
３０４オブジェクトデータ
３０５オブジェクトデータ
３０６オブジェクトデータ
３０７オブジェクトデータ
３０８出力データ
４０１コンテンツデータ
４０２オブジェクトデータ
４０３オブジェクトデータ
４０４オブジェクトデータ
４０５オブジェクトデータ
４０６オブジェクトデータ
４０７出力データ
５０１コンテンツデータ
５０２オブジェクトデータ
５０３オブジェクトデータ
５０４オブジェクトデータ
５０５オブジェクトデータ
５０６オブジェクトデータ
５０７オブジェクトデータ
５０８コンテンツデータ表示
５０９出力データ
５１０オブジェクトデータ組み合わせ
５１１オブジェクトデータ組み合わせ
６０３ PCまたはモバイル端末
７０１バス
７０２ＣＰＵ
７０３ＲＯＭ
７０４ＲＡＭ
７０５ネットワークインタフェース
７０６入力装置
７０７出力装置
７０８外部記憶装置

103 PC
104 Structured document shaping device
106 Content search device
108 Internet
114 Automatic data layout output device
116 Printing device
201 Flow of processing inside the structured document shaping device
202 Character string keyword entered by the user
203 Input unit
204 Content search device utilization unit
205 Communication unit
206 Conversion processing unit
207 Analysis processing unit
208 Comparison, judgment, and extraction processing unit
209 Summary processing unit
210 Additional information addition processing unit
211 Storage processing unit
212 Data storage area unit
213 Output unit
214 Automatic data layout output device
301 Content data
302 Object data
303 Object data
304 Object data
305 Object data
306 Object data
307 Object data
308 Output data
401 Content data
402 Object data
403 Object data
404 Object data
405 Object data
406 Object data
407 Output data
501 Content data
502 Object data
503 Object data
504 Object data
505 Object data
506 Object data
507 Object data
508 Content data display
509 Output data
510 Object data combination
511 Object data combination
603 PC or mobile terminal
701 Bus
702 CPU
703 ROM
704 RAM
705 Network interface
706 Input device
707 Output device
708 External storage device

Claims

1. A document shaping apparatus comprising:
extraction means for extracting object data related to a keyword from among object data in a structured document; and
shaping means for shaping a structured document relating to the extracted object data.

2. The document shaping apparatus according to claim 1, wherein, when pieces of object data among the extracted object data are related to one another, the shaping means associates them into a single piece of object data and shapes the structured document.

3. The document shaping apparatus according to claim 1, wherein the shaping means shapes the structured document by describing information on the number of keywords included in each piece of the extracted object data.

4. The document shaping apparatus according to claim 1, wherein the shaping means shapes the structured document by describing information on the priority order of the object data.

5. The document shaping apparatus according to claim 1, wherein, when at least one piece of object data included in a table tag is extracted, the shaping means also extracts the object data included in the other table tags as object data related to the keyword.

6. The document shaping apparatus according to claim 1, wherein, when the object data is data relating to image data and the keyword is included in text data adjacent to that data, the extraction means determines that the image data is related to the keyword and extracts the text data as object data.

7. The document shaping apparatus according to any one of claims 1 to 6, wherein, when a keyword is included in a character string described as alternative data representing object data, the extraction means determines that the object data is related to the keyword and extracts the object data.

8. The document shaping apparatus according to claim 1, wherein, when it is determined that object data is related to a keyword, the extraction means extracts the object data in association with a unique identifier.

9. The document shaping apparatus according to claim 1, wherein, when text data and binary data serving as the object data are adjacent and related to each other, the shaping means combines the text data and the binary data into one in an XML format.

10. The document shaping apparatus according to claim 2, wherein, when the object data includes alternative data, the shaping means combines them in an XML format.

11. The document shaping apparatus according to claim 4, wherein the shaping means shapes the structured document by adding priority-order information to each piece of object data, such that the priority is high when the extracted object data includes all of a plurality of keywords and is lowered as the number of included keywords decreases.

12. The document shaping apparatus as described, wherein, when a plurality of pieces of extracted object data include the same number of keywords, the shaping means assigns priority in the order in which the object data were found in the structured document.

13. The document shaping apparatus according to claim 1, wherein the shaping means shapes the structured document by adding, as information, the URL of the structured document that includes the extracted object data.

14. A document shaping method comprising:
an extraction step of extracting object data related to a keyword from among object data in a structured document; and
a shaping step of shaping a structured document relating to the extracted object data.

15. A program for causing a computer to execute each step of the document shaping method according to claim 14.