JP7290391B2

JP7290391B2 - Information processing device and program

Info

Publication number: JP7290391B2
Application number: JP2017159662A
Authority: JP
Inventors: 聡田端; 克俊前沢
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2017-08-22
Filing date: 2017-08-22
Publication date: 2023-06-13
Anticipated expiration: 2037-08-22
Also published as: JP2019040260A

Description

The present invention relates to an information processing device and a program.

In recent years, demand for micro-content has grown, and sales of micro-content taken from books, magazines, newspapers, and the like are expanding. However, content data such as books is often so-called unstructured data and holds no information indicating what each element in the content is. It is therefore necessary to carry out micro-contentization work, that is, to extract each piece of micro-content from the content data. In many cases this work is done by hand, which makes it costly.

For example, Patent Document 1 discloses a document image processing apparatus and the like that converts a document image into micro-content. The apparatus divides the document image into predetermined regions and assigns tags and attribute values to the data in each region, thereby generating document data described in a markup language.

Japanese Unexamined Patent Application Publication No. 2002-41497 (JP-A-2002-41497)

However, the invention of Patent Document 1 requires a designer to design in advance the rules for assigning tags and attribute values, and is therefore not necessarily efficient.

In one aspect, an object is to provide an information processing device and the like that can make content management more efficient.

In one aspect, an information processing device comprises: an acquisition unit that acquires teacher information associating a sample of unstructured data consisting of one or more elements with tag information consisting of meta information that defines each element of each sample; a feature extraction unit that extracts features of each element of the sample based on the layout features of each element with respect to the sample indicated by the acquired teacher information and on the features of the data within each element; a setting unit that, based on the extracted features of each element and the tag information indicated by the teacher information, sets a rule for identifying the tag information according to the features of an element; a content acquisition unit that acquires content that is unstructured data; an extraction unit that extracts each element from the acquired content; an assigning unit that assigns the tag information to the elements by referring to the rule; an accompanying information acquisition unit that acquires, from the content, accompanying information including information on the price of the content as a whole; a calculation unit that calculates the price of each element from the price of the content as a whole; a generation unit that structures the content based on the tag information assigned to each element and generates structured data in which the accompanying information is stored in association with each element; and an output unit that refers to the generated structured data and outputs each element and the price of each element.

In one aspect, the sample and the content are document data consisting of text or images, the feature extraction unit extracts information on the format or layout of the elements, and the setting unit sets the rule indicating the correspondence between the format or layout of an element and the tag information.

In one aspect, the acquisition unit acquires a plurality of pieces of teacher information, and the setting unit learns the correspondence between element features and tag information from each piece of teacher information, thereby generating a classifier that identifies the tag information corresponding to an element.

In one aspect, the information processing device includes a reception unit that accepts, for an element output by the output unit, a request to output the content corresponding to that element, and when the request is accepted, the output unit outputs the content.

In one aspect, the information processing device includes a collection unit that collects Web page data via a network, a page extraction unit that extracts, from the collected Web pages, a Web page whose content matches an element, and a notification unit that reports the extracted Web page.

In one aspect, a program causes a computer to execute processing of: acquiring teacher information associating a sample of unstructured data consisting of one or more elements with tag information consisting of meta information that defines each element of each sample; extracting features of each element of the sample based on the layout features of each element with respect to the sample indicated by the acquired teacher information and on the features of the data within each element; setting, based on the extracted features of each element and the tag information indicated by the teacher information, a rule for identifying the tag information according to the features of an element; acquiring content that is unstructured data; extracting each element from the acquired content; assigning the tag information to the elements by referring to the rule; acquiring, from the content, accompanying information including information on the price of the content as a whole; calculating the price of each element from the price of the content as a whole; structuring the content based on the tag information assigned to each element and generating structured data in which the accompanying information is stored in association with each element; and referring to the generated structured data and outputting each element and the price of each element.

In one aspect, content management can be made more efficient.

FIG. 1 is an explanatory diagram showing an overview of the information processing system.
FIG. 2 is a block diagram showing a configuration example of the server.
FIG. 3 is an explanatory diagram showing an example of the record layout of the structured table.
FIG. 4 is an explanatory diagram of the structuring rule setting process.
FIG. 5 is an explanatory diagram of the structuring process.
FIG. 6 is a flowchart showing an example of the processing procedure executed by the server.
FIG. 7 is a schematic diagram showing a configuration example of the information processing system according to Embodiment 2.
FIG. 8 is an explanatory diagram showing an overview of Embodiment 2.
FIG. 9 is an explanatory diagram of the process of calling up original data according to Embodiment 2.
FIG. 10 is a flowchart showing an example of the processing procedure executed by the information processing system according to Embodiment 2.
FIG. 11 is an explanatory diagram showing an overview of Embodiment 3.
FIG. 12 is a flowchart showing an example of the processing procedure executed by the server according to Embodiment 3.
FIG. 13 is a functional block diagram showing the operation of the server of the embodiments described above.

Hereinafter, the present invention will be described in detail with reference to the drawings showing its embodiments.
(Embodiment 1)
FIG. 1 is an explanatory diagram showing an overview of the information processing system. This embodiment is described using, as an example, an information processing system that creates a data archive of micro-content. The information processing system includes an information processing device 1 and a terminal 2. The information processing device 1 and the terminal 2 are communicatively connected via a network N such as the Internet.

The information processing device 1 is a device that performs various kinds of information processing and sends and receives information; it is, for example, a server device, a personal computer, or a multifunction terminal. In this embodiment the information processing device 1 is assumed to be a server device and is referred to below as the server 1 for brevity. The server 1 acquires digital content that is unstructured data, for example document data such as books, newspapers, and magazines, extracts micro-content from the digital content, and generates structured data.

The terminal 2 is a client terminal that communicates with the server 1 and is operated by an administrator who carries out the micro-contentization work. The server 1 acquires digital content from the terminal 2 and creates a data archive in response to requests from the terminal 2.

FIG. 2 is a block diagram showing a configuration example of the server 1. The server 1 includes a control unit 11, a main storage unit 12, a communication unit 13, and an auxiliary storage unit 14.
The control unit 11 has one or more arithmetic processing units such as a CPU (Central Processing Unit) or an MPU (Micro-Processing Unit), and performs the various information processing and control processing of the server 1 by reading and executing a program P stored in the auxiliary storage unit 14. The main storage unit 12 is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), a flash memory, or the like, and temporarily stores the data the control unit 11 needs to perform arithmetic processing. The communication unit 13 includes processing circuitry for communication and sends and receives information to and from the terminal 2 and other devices.

The auxiliary storage unit 14 is a large-capacity memory, a hard disk, or the like, and stores the program P and other data the control unit 11 needs to execute processing. The auxiliary storage unit 14 also stores a structured table 141 and a content DB 142. The structured table 141 defines the structuring rules for converting digital content that is unstructured data into structured data. The content DB 142 is a database that stores the structured data obtained by structuring digital content.

The auxiliary storage unit 14 may be an external storage device connected to the server 1. The server 1 may also be a multi-server system consisting of a plurality of computers, or a virtual machine constructed virtually by software.

FIG. 3 is an explanatory diagram showing an example of the record layout of the structured table 141. The structured table 141 includes an element name column and a rule column. The element name column stores the element name (tag information, described later) of each element making up a document page, such as "title", "subtitle", and "body". The rule column stores, in association with each element name, information that characterizes the corresponding element. For example, the rule column stores data such as the font, character size, and layout (coordinate values) of each element.
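As a purely illustrative aid, the record layout described for FIG. 3 can be pictured as a mapping from element name to rule. The following is a minimal sketch under stated assumptions, not the disclosed implementation: the `Rule` class, its field names, and every sample value are hypothetical and were chosen only to mirror the font, character size, and coordinate data mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Rule:
    """Features characterizing one kind of element (hypothetical fields)."""
    font: str
    font_size: float                          # character size in points
    bbox: Tuple[float, float, float, float]   # (x0, y0, x1, y1) on the page


# Structured table: element name (tag information) -> rule.
# All values are illustrative and do not come from the patent.
structured_table: Dict[str, Rule] = {
    "title": Rule("Gothic", 24.0, (50, 40, 550, 90)),
    "subtitle": Rule("Gothic", 16.0, (50, 100, 550, 130)),
    "body": Rule("Mincho", 10.5, (50, 150, 550, 780)),
}
```

Keeping the element name as the key reflects the two-column layout (element name column, rule column); a real table could of course hold several rules per element name.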

FIG. 4 is an explanatory diagram of the structuring rule setting process. It illustrates how the server 1 extracts, from a sample of unstructured data (specifically, a sample document consisting of text and images), the basic elements making up that sample document and learns their features.
The server 1 acquires a training sample document from the terminal 2 and sets (learns) structuring rules based on that sample. The sample document is unstructured document data, for example a PDF (Portable Document Format, registered trademark) file. The server 1 sets the structuring rules for structuring unstructured data based on a single sample document.

For example, the server 1 acquires teacher information in which a sample document, which is unstructured data, is associated with the correct tag information for each element contained in the sample. The elements of unstructured data are pieces of data obtained by dividing the original data into predetermined regions; for example, as shown enclosed in rectangular frames in FIG. 4, they are the basic elements making up a document, such as its title, subtitle, body text, and figures. The tag information is meta information that defines each element, namely the element name or attribute value with which each element is tagged in an XML (Extensible Markup Language) file. In this embodiment, the server 1 acquires, as the teacher information, information on the element name of each element contained in the sample. For example, as shown in FIG. 4, it acquires the element name "title" for the element corresponding to the document title and "subtitle" for the element corresponding to the subtitle. In this way, the server 1 acquires from the terminal 2 teacher information that holds the correct element name of each element for a single sample document.

The server 1 extracts the individual elements, such as the title, subtitle, body text, and figures, from the sample document. The server 1 then extracts the features of each extracted element. Specifically, it extracts features such as the format of the text in each element and the layout of each element. For example, the server 1 performs character recognition on the text inside the rectangular regions shown in FIG. 4 and determines the font, character size, and so on of the text. Based on the position and extent of each rectangular region, the server 1 also determines the coordinate values of the area each element occupies within the document page.

The server 1 associates the features of each element extracted above with the element name (tag information) of that element indicated by the teacher information, and stores them in the structured table 141. In this way, the server 1 sets structuring rules that identify which element name an element corresponds to according to its tendencies (features), such as format and layout.
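The association step described above can be sketched as follows. This is a minimal illustration that assumes the features of each element have already been extracted into a dictionary; the function name `set_rules`, the feature keys, and the sample values are hypothetical and do not appear in the source.

```python
from typing import Any, Dict, List

Features = Dict[str, Any]


def set_rules(sample_elements: List[Features],
              correct_tags: List[str]) -> Dict[str, Features]:
    """Associate each element's extracted features with its correct element name."""
    if len(sample_elements) != len(correct_tags):
        raise ValueError("each element needs exactly one correct tag")
    table: Dict[str, Features] = {}
    for features, tag in zip(sample_elements, correct_tags):
        table[tag] = features          # element name -> characterizing features
    return table


# Features extracted from a single sample document (illustrative values).
sample_elements = [
    {"font": "Gothic", "size": 24.0, "bbox": (50, 40, 550, 90)},
    {"font": "Mincho", "size": 10.5, "bbox": (50, 150, 550, 780)},
]
correct_tags = ["title", "body"]       # teacher information: correct element names
structured_table = set_rules(sample_elements, correct_tags)
```

In this sketch one sample yields one rule per element name, which matches the single-sample case described in the text.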

Although format and layout were given above as the information characterizing each element, this embodiment is not limited to them. For example, so that tables contained in a document can be identified, the server 1 may extract the ruled lines forming a table as a feature and learn it.

Also, to simplify the explanation, the rules above were set by learning the features of each element from a single sample document, but this embodiment is not limited to this. The server 1 may perform machine learning that learns the correspondence between element features and tag information from a plurality of pieces of teacher information, and generate model data (a classifier) that identifies tag information from element features. That is, the server 1 extracts the feature values of each element from each of a plurality of training samples and compares them with the correct tag information for each element. The server 1 performs this comparison for all the training samples and learns from them, for example, what parameter values the features of an element corresponding to a title take. Through this processing, the server 1 generates a classifier that identifies tag information from the feature values of an element and stores it in the auxiliary storage unit 14. Machine learning allows the server 1 to identify each element more accurately.
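The source does not specify which learning algorithm is used. As one possible illustration only, the sketch below swaps in a nearest-centroid classifier: it averages the feature vectors of each tag over several training samples and assigns a new element the tag of the closest average. The two-dimensional feature vector (character size, vertical position) and all names and values are assumptions made for this example.

```python
from collections import defaultdict
from math import dist
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]   # (feature vector, correct tag)


def train(samples: List[Sample]) -> Dict[str, List[float]]:
    """Average the feature vectors of each tag to obtain one centroid per tag."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for features, tag in samples:
        grouped[tag].append(features)
    return {
        tag: [sum(column) / len(column) for column in zip(*vectors)]
        for tag, vectors in grouped.items()
    }


def classify(model: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Return the tag whose centroid is closest to the given feature vector."""
    return min(model, key=lambda tag: dist(model[tag], features))


# Hypothetical features: (character size in points, top y-coordinate).
training = [
    ((24.0, 40.0), "title"),
    ((22.0, 50.0), "title"),
    ((10.5, 300.0), "body"),
    ((10.0, 500.0), "body"),
]
model = train(training)
print(classify(model, (23.0, 45.0)))   # title
```

In practice the features would need scaling so that no single dimension dominates the distance, and any supervised classifier could take the place of the centroid rule.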

FIG. 5 is an explanatory diagram of the structuring process. Based on the structuring rules set above, the server 1 converts digital content that is unstructured data into structured data. Specifically, the server 1 extracts the elements, such as the title, subtitle, body text, and figures, from the target document and analyzes the format, layout, and so on of each element. The server 1 then refers to the structured table 141 and identifies the tag information corresponding to the format, layout, and so on of each element. For example, when it extracts the portion of text corresponding to the title of a document, the server 1 recognizes that the text is a title based on its font, character size, layout, and so on. The server 1 processes every element of the digital content in the same way and identifies each one.
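The lookup against the structured table can be pictured with the following minimal sketch. The matching criterion (same font and a character size within a tolerance) is an assumption made for illustration; the source only states that format and layout are compared. The function name `identify_tag`, the `size_tolerance` parameter, and the table contents are hypothetical.

```python
from typing import Any, Dict, Optional

Features = Dict[str, Any]


def identify_tag(element: Features, table: Dict[str, Features],
                 size_tolerance: float = 1.0) -> Optional[str]:
    """Return the first element name whose rule matches the element, else None."""
    for name, rule in table.items():
        same_font = rule["font"] == element["font"]
        close_size = abs(rule["size"] - element["size"]) <= size_tolerance
        if same_font and close_size:
            return name
    return None


# Illustrative rules, as they might have been set from a sample document.
structured_table = {
    "title": {"font": "Gothic", "size": 24.0},
    "body": {"font": "Mincho", "size": 10.5},
}
tag = identify_tag({"font": "Gothic", "size": 23.5}, structured_table)
```

Returning `None` for an element that matches no rule leaves room for a fallback such as manual tagging, which the source does not discuss.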

The server 1 assigns tag information (an element name) to each element and generates structured data in which the elements are arranged hierarchically based on the tag information. For example, the server 1 generates a text file in XML format. When it extracts the text element of the document title, the server 1 assigns the element name "title" to that element and stores it in the text file. In the same way, the server 1 assigns element names to the subtitle, body text, and other elements of the document and stores them in the file. When it extracts an image (figure) from the document, the server 1 stores the extracted image in an image folder in association with the file name of the text file. As shown in FIG. 5, the server 1 arranges the elements hierarchically in the text file by storing the other elements linked to one element (for example, the page number). The server 1 thereby generates structured data in which the digital content (document), which is unstructured data, has been structured. The server 1 stores the generated structured data in the content DB 142.
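The hierarchical XML output described above can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: the `page` element with a `number` attribute is an assumed way of linking elements to a page number, and the function name and sample text are hypothetical. Image storage is not covered.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def build_page(page_number: int, elements: List[Tuple[str, str]]) -> ET.Element:
    """Nest tagged elements under one page element to form a hierarchy."""
    page = ET.Element("page", {"number": str(page_number)})
    for tag, text in elements:
        child = ET.SubElement(page, tag)   # element name assigned as the XML tag
        child.text = text
    return page


page = build_page(1, [
    ("title", "Sample Title"),
    ("subtitle", "Sample Subtitle"),
    ("body", "Body text."),
])
xml_text = ET.tostring(page, encoding="unicode")
print(xml_text)
# <page number="1"><title>Sample Title</title><subtitle>Sample Subtitle</subtitle><body>Body text.</body></page>
```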

In addition to identifying the element name of each element and storing it in the structured data, the server 1 may store accompanying information attached to the digital content in the structured data in association with each element. The accompanying information is, for example, the author name, price, or publisher of the content. For example, in the same way as the rules for element names, the server 1 sets in advance, based on the teacher information, rules for extracting text elements such as the author name, price, and publisher from the cover or similar part of a sample document. When the server 1 acquires the cover image (not shown) of a document that is digital content, it extracts these items of information from the cover. When storing each element extracted from the digital content in the file, the server 1 stores the accompanying information in association with that element. For example, the server 1 writes the information as attribute values inside the tag of each element. This allows the server 1 to give each element a detailed meaning.
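Writing accompanying information as attribute values in each element's tag can be sketched as follows, again with `xml.etree.ElementTree`. The attribute names `author`, `price`, and `publisher`, the sample values, and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def attach_accompanying_info(root: ET.Element, info: Dict[str, str]) -> None:
    """Write each item of accompanying information as an attribute of every child element."""
    for element in root:                   # direct children of the root only
        for key, value in info.items():
            element.set(key, value)


root = ET.Element("page")
ET.SubElement(root, "title").text = "Sample Title"
ET.SubElement(root, "body").text = "Body text."

# Illustrative accompanying information, e.g. as read from a cover.
attach_accompanying_info(root, {"author": "A. Author", "price": "1000", "publisher": "Example Press"})
title = root.find("title")
```

After the call, `title.get("author")` returns the stored author name, so each piece of micro-content carries its own copy of the information.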

Although the server 1 was described above as extracting accompanying information such as the author name and price automatically from the content data, it may instead accept manual input via the terminal 2. For example, when the administrator operates the terminal 2 to transfer content data to the server 1, the administrator enters information such as the author name and price and transfers it together with the content. In other words, the server 1 only needs to be able to acquire the information accompanying the content; it may acquire the information automatically from the content data or through manual input.

The server 1 processes each item of digital content transferred from the terminal 2 in the same way and converts each item, which is unstructured data, into structured data. In this way, the server 1 gives a meaning to each element extracted from the digital content and creates a digital archive of micro-content. Because the structuring rules are set (learned) in advance from the training data, the server 1 can automatically identify what each element contained in the digital content is and generate structured data. This makes the micro-contentization work more efficient.

FIG. 6 is a flowchart showing an example of the processing procedure executed by the server 1. The processing executed by the server 1 is described below with reference to FIG. 6.
The control unit 11 of the server 1 acquires teacher information including a sample of unstructured data and the correct tag information for each element contained in the sample (step S11). The tag information is meta information that defines each element, for example an element name or attribute value in an XML file. For example, the control unit 11 acquires training data in which the element name of each element of the unstructured data is known. The control unit 11 extracts the features of each element contained in the sample (step S12). For example, the control unit 11 extracts information on the format and layout of each element contained in the sample document.

Based on the extracted features of each element and the correct tag information for each element indicated by the teacher information, the control unit 11 sets structuring rules that identify the tag information to be assigned according to the features of the elements contained in unstructured data (step S13). Specifically, the control unit 11 associates the features of each element extracted in step S12, such as format and layout, with the element name of each element indicated by the teacher information, and stores them in the structured table 141.

The control unit 11 acquires digital content that is unstructured data from the terminal 2 (step S14). The digital content is unstructured document data, for example a PDF file.

The control unit 11 extracts the elements contained in the digital content acquired in step S14 (step S15). The control unit 11 then refers to the structuring rules set in step S13 and assigns tag information to each extracted element (step S16). For example, the control unit 11 assigns an element name to each extracted element according to its features, such as format and layout. The control unit 11 also assigns, for example, accompanying information such as the author, price, and publisher extracted from the cover image of the digital content (document) as attribute values of each element.

Based on the tag information assigned to each element, the control unit 11 generates structured data in which the digital content is structured (step S17). For example, the control unit 11 generates a file in XML format. The control unit 11 stores the elements hierarchically according to the element names assigned in step S16 and generates the structured data. The control unit 11 also, for example, writes the accompanying information of the content into the tag information as attribute values of each element and stores it in the structured data. The control unit 11 stores the generated structured data in the content DB 142 (step S18) and ends the series of processes.
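Steps S11 to S18 can be tied together in a compact, purely illustrative sketch. It assumes that feature extraction has already reduced each element to a (font, character size, text) tuple, that rules match on exact font and size, and that a dictionary stands in for the content DB 142; all function names, the `document` root element, and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Element = Tuple[str, float, str]      # (font, character size, text)
RuleKey = Tuple[str, float]           # (font, character size)


def learn_rules(sample: List[Element], correct_tags: List[str]) -> Dict[RuleKey, str]:
    """Steps S11 to S13: map each sample element's features to its correct tag."""
    return {(font, size): tag for (font, size, _), tag in zip(sample, correct_tags)}


def structure(content: List[Element], rules: Dict[RuleKey, str]) -> str:
    """Steps S15 to S17: tag each element and serialize the result as XML."""
    root = ET.Element("document")
    for font, size, text in content:
        tag = rules.get((font, size), "unknown")   # S16: look up the tag information
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")


content_db: Dict[str, str] = {}       # stand-in for the content DB 142

rules = learn_rules([("Gothic", 24.0, "A"), ("Mincho", 10.5, "B")], ["title", "body"])
content = [("Gothic", 24.0, "New Title"), ("Mincho", 10.5, "New body text.")]   # S14
content_db["doc-001"] = structure(content, rules)                               # S18
print(content_db["doc-001"])
# <document><title>New Title</title><body>New body text.</body></document>
```

Exact matching on floating-point sizes is used only to keep the example short; a tolerance or a learned classifier would be more robust.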

Although the content to be analyzed was assumed above to be a document, it only needs to be unstructured data and may be, for example, audio data.

Also, although an XML file is generated in the description above, a file in another format such as HTML or SGML may of course be generated instead.

As described above, according to Embodiment 1, the server 1 extracts the features of each element from the training sample data and sets structuring rules that associate those features with tag information. As a result, even when the server 1 acquires digital content whose structure is unknown, it can refer to the structuring rules already set, automatically identify each element of the content, and assign tag information. Because the server 1 sets the rules automatically from the training data, the administrator does not need to work out the rules by trial and error. This reduces the workload of micro-contentization and makes content management more efficient.

Further, according to Embodiment 1, the server 1 identifies the attributes of each element based on the format or layout of each element of the document, and can therefore give each element appropriate attributes.

また、本実施の形態１によれば、複数の教師用データそれぞれから各要素の特徴及びタグ情報の対応関係を抽出する機械学習を行うことで、精度を高めることができる。 Further, according to Embodiment 1, accuracy can be improved by performing machine learning for extracting the correspondence between the feature of each element and the tag information from each of a plurality of teacher data.

また、本実施の形態１によれば、識別した各要素の属性に基づきデジタルコンテンツを構造化することで、マイクロコンテンツの利用を容易にすることができる。 Further, according to the first embodiment, by structuring the digital content based on the attribute of each identified element, it is possible to facilitate the use of the micro content.

（実施の形態２） (Embodiment 2)
実施の形態１では、書籍等のデジタルコンテンツから各要素を抽出し、マイクロコンテンツのデジタルアーカイブを作成する形態について述べた。本実施の形態では、作成したデジタルアーカイブを用い、マイクロコンテンツの販売を行う形態について述べる。なお、実施の形態１と重複する内容については同一の符号を付して説明を省略する。 The first embodiment described a form in which each element is extracted from digital content such as books to create a digital archive of microcontent. This embodiment describes a form in which the created digital archive is used to sell microcontent. Content that overlaps with the first embodiment is given the same reference numerals, and its description is omitted.

図７は、実施の形態２に係る情報処理システムの構成例を示す模式図である。本実施の形態に係る情報処理システムは、販売管理サーバ３を含む。販売管理サーバ３は、マイクロコンテンツの販売を行うＥＣサイトの管理を行うサーバ装置であり、サーバ１が生成したコンテンツＤＢ１４２のデータを参照して、各マイクロコンテンツを表示するＥＣサイト画面の生成及び出力、マイクロコンテンツの購入申し込みの受け付け等を行う。 FIG. 7 is a schematic diagram showing a configuration example of an information processing system according to the second embodiment. The information processing system according to this embodiment includes a sales management server 3. The sales management server 3 is a server device that manages an EC site selling microcontent. Referring to the data in the content DB 142 generated by the server 1, it generates and outputs EC site screens that display each microcontent, accepts purchase applications for microcontent, and so on.

図８は、実施の形態２の概要を示す説明図である。図８の左側には、実施の形態１で説明したように、サーバ１がデジタルコンテンツを構造化データに変換してコンテンツＤＢ１４２に記憶する様子を概念的に図示している。本実施の形態でサーバ１は、当該構造化データを利用して、各要素、すなわちマイクロコンテンツを販売管理サーバ３に出力する。販売管理サーバ３は、図８右側に示すように、各マイクロコンテンツのデータをＥＣサイト上に出力する。 FIG. 8 is an explanatory diagram showing an overview of the second embodiment. The left side of FIG. 8 conceptually illustrates how the server 1 converts digital content into structured data and stores the structured data in the content DB 142, as described in the first embodiment. In this embodiment, the server 1 uses the structured data to output each element, that is, micro content to the sales management server 3 . The sales management server 3 outputs the data of each micro content to the EC site as shown on the right side of FIG.

具体的には、販売管理サーバ３は、元データであるデジタルコンテンツから抽出された各要素をＷｅｂページの素材として利用し、各要素を再配置したＷｅｂ画面に生成して、ＥＣサイトの利用者のクライアント端末に出力する。例えば図８に示すように、販売管理サーバ３は、元データから抽出した画像、画像のキャプション、タイトル等の要素を再配置し、Ｗｅｂ画面上に表示させる。 Specifically, the sales management server 3 uses each element extracted from the digital content, which is the original data, as material for a Web page, generates a Web screen in which the elements are rearranged, and outputs it to the client terminal of the EC site user. For example, as shown in FIG. 8, the sales management server 3 rearranges elements extracted from the original data, such as images, image captions, and titles, and displays them on the Web screen.

また、サーバ１は各要素（マイクロコンテンツ）を販売管理サーバ３に出力するだけでなく、各要素に対応付けられた付随情報、すなわち作者名、価格等の情報を併せて出力する。なお、サーバ１は元データであるデジタルコンテンツ全体での価格しか取得しておらず、個々のマイクロコンテンツの価格は取得していないが、各マイクロコンテンツの価格Ｐは、例えば以下の式（１）により算出する。 The server 1 not only outputs each element (microcontent) to the sales management server 3, but also outputs the accompanying information associated with each element, that is, information such as the author name and price. The server 1 acquires only the price of the entire digital content, which is the original data, and not the price of each individual microcontent. The price P of each microcontent is therefore calculated by, for example, the following formula (1).

Ｐ＝α（Ａ／Ｎ）＋β …（１） P=α(A/N)+β (1)

Ａはデジタルコンテンツ全体の価格、Ｎはデジタルコンテンツに含まれる要素の総数、α及びβは価格の調整パラメータである。α及びβは、例えば各要素のデータ量等に応じて決定される。サーバ１は、式（１）に基づき、コンテンツ全体の価格から各要素の価格を算出する。具体的には、コンテンツ全体の価格を要素数で除算し、該当要素のデータ量等に応じて価格を調整することで、マイクロコンテンツ単位の価格を算出する。 A is the price of the entire digital content, N is the total number of elements included in the digital content, and α and β are price adjustment parameters. α and β are determined according to, for example, the amount of data of each element. The server 1 calculates the price of each element from the price of the entire content based on Equation (1). Specifically, by dividing the price of the entire content by the number of elements and adjusting the price according to the amount of data of the corresponding element, the price of each micro content is calculated.
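Formula (1) can be written as a short Python function. The function name, the default parameter values, and the example figures below are assumptions for illustration; the description does not specify how α and β are derived from the amount of data, so they are simply passed in by the caller here.

```python
def micro_content_price(total_price, num_elements, alpha=1.0, beta=0.0):
    """Price of one microcontent per formula (1): P = alpha * (A / N) + beta.

    total_price  -- A, the price of the entire digital content
    num_elements -- N, the total number of elements in the content
    alpha, beta  -- adjustment parameters (how they are chosen is left open)
    """
    if num_elements <= 0:
        raise ValueError("num_elements must be positive")
    return alpha * (total_price / num_elements) + beta


# Hypothetical figures: a 1200-yen book with 10 elements, and an element
# that is weighted up (alpha = 1.5) with a fixed surcharge (beta = 20).
price = micro_content_price(1200, 10, alpha=1.5, beta=20)
```

With α = 1 and β = 0 the function reduces to an even split of the total price across the elements, which is the baseline that the adjustment parameters then shift per element.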

販売管理サーバ３は、上記で算定した価格のほか、コンテンツの作者名等の付随情報をサーバ１から取得し、Ｗｅｂ画面上に出力する。販売管理サーバ３は、ＥＣサイトの利用者のクライアント端末を介して、Ｗｅｂ画面上に表示した各マイクロコンテンツの購入申し込みを受け付ける。販売管理サーバ３は、販売した各マイクロコンテンツについて、上記で算定した価格に基づき請求料金を定め、利用者に請求する。 In addition to the price calculated above, the sales management server 3 acquires accompanying information such as the author name of the content from the server 1 and outputs it on the Web screen. The sales management server 3 accepts a purchase application for each micro-content displayed on the Web screen via the client terminal of the user of the EC site. The sales management server 3 determines a billing fee based on the price calculated above for each microcontent sold, and bills the user.

図９は、実施の形態２に係る元データの呼出処理に関する説明図である。販売管理サーバ３は、書籍、新聞、雑誌等の文書から抽出したマイクロコンテンツの販売を行うだけでなく、例えばマイクロコンテンツの抽出元である文書、すなわちデジタルコンテンツ自体の販売等を併せて行う。例えば販売管理サーバ３は、図９左側に示す画面においてマイクロコンテンツ（画像）への指定入力を受け付けた場合、図９右側に示す画面に遷移し、指定されたマイクロコンテンツの元データに関する情報を出力する。 FIG. 9 is an explanatory diagram of the original data call processing according to the second embodiment. The sales management server 3 not only sells microcontent extracted from documents such as books, newspapers, and magazines, but also sells, for example, the documents from which the microcontent was extracted, that is, the digital content itself. For example, when the sales management server 3 accepts a designation input for a microcontent (image) on the screen shown on the left side of FIG. 9, it transitions to the screen shown on the right side of FIG. 9 and outputs information about the original data of the designated microcontent.

具体的には、クライアント端末から元データの呼出要求（出力要求）を受け付けた場合、販売管理サーバ３はサーバ１への問い合わせを行う。サーバ１は問い合わせを受け、元データであるデジタルコンテンツの情報をコンテンツＤＢ１４２から読み出し、販売管理サーバ３に出力する。販売管理サーバ３は、元データを取得し、当該元データの情報を示すＷｅｂ画面を生成してクライアント端末に出力する。例えば図９に示すように、販売管理サーバ３は、元データの表紙、書誌情報（付随情報）、収録されているマイクロコンテンツの情報等を出力する。例えば販売管理サーバ３は、当該画面を介して、元データであるデジタルコンテンツ全体での購入申し込みを受け付ける。 Specifically, when receiving a call request (output request) for original data from a client terminal, the sales management server 3 makes an inquiry to the server 1 . The server 1 receives the inquiry, reads the information of the digital content, which is the original data, from the content DB 142 and outputs it to the sales management server 3 . The sales management server 3 acquires the original data, generates a web screen showing the information of the original data, and outputs it to the client terminal. For example, as shown in FIG. 9, the sales management server 3 outputs the cover page of the original data, bibliographic information (associated information), recorded micro-content information, and the like. For example, the sales management server 3 accepts a purchase application for the entire digital content, which is the original data, via the screen.

図１０は、実施の形態２に係る情報処理システムが実行する処理手順の一例を示すフローチャートである。図１０に基づき、サーバ１及び販売管理サーバ３が実行する処理内容について説明する。 FIG. 10 is a flowchart showing an example of a processing procedure executed by the information processing system according to the second embodiment. The processing executed by the server 1 and the sales management server 3 is described with reference to FIG. 10.
サーバ１の制御部１１は、コンテンツＤＢ１４２から、デジタルコンテンツの構造化データを読み出す（ステップＳ２０１）。具体的には、制御部１１は、デジタルコンテンツから抽出した各要素（マイクロコンテンツ）のデータと、要素の属性値として格納されている付随情報、すなわちコンテンツの作者、コンテンツ全体での価格等の情報を読み出す。制御部１１は、読み出したコンテンツ全体での価格から、個々の要素の価格を算出する（ステップＳ２０２）。具体的には、制御部１１は式（１）に従い、コンテンツ全体の価格からマイクロコンテンツ単位の価格を算出する。制御部１１は、各要素と、各要素に対応するデジタルコンテンツの付随情報とを販売管理サーバ３に出力する（ステップＳ２０３）。例えば制御部１１は、各マイクロコンテンツのデータのほか、作者、マイクロコンテンツの価格等の情報を出力する。 The control unit 11 of the server 1 reads the structured data of the digital content from the content DB 142 (step S201). Specifically, the control unit 11 reads the data of each element (microcontent) extracted from the digital content, together with the accompanying information stored as attribute values of the elements, that is, information such as the author of the content and the price of the content as a whole. The control unit 11 calculates the price of each individual element from the price of the entire content that it has read (step S202). Specifically, the control unit 11 calculates the price of each microcontent from the price of the entire content according to formula (1). The control unit 11 outputs each element and the accompanying information of the digital content corresponding to each element to the sales management server 3 (step S203). For example, in addition to the data of each microcontent, the control unit 11 outputs information such as the author and the price of the microcontent.

販売管理サーバ３は、サーバ１から要素のデータを取得し、ＥＣサイトに係るＷｅｂ画面を生成して出力する（ステップＳ２０４）。例えば販売管理サーバ３は、上述の如く、各マイクロコンテンツと、各マイクロコンテンツの付随情報とを表示するＷｅｂ画面を生成して出力する。販売管理サーバ３は、クライアント端末から、要素の元データであるデジタルコンテンツの出力要求を受け付けたか否かを判定する（ステップＳ２０５）。出力要求を受け付けていないと判定した場合（Ｓ２０５：ＮＯ）、販売管理サーバ３は一連の処理を終了する。出力要求を受け付けたと判定した場合（Ｓ２０５：ＹＥＳ）、販売管理サーバ３は、出力要求をサーバ１に転送する（ステップＳ２０６）。 The sales management server 3 acquires element data from the server 1, generates and outputs a Web screen related to the EC site (step S204). For example, the sales management server 3 generates and outputs a Web screen displaying each microcontent and accompanying information of each microcontent, as described above. The sales management server 3 determines whether or not an output request for the digital content, which is the original data of the elements, has been received from the client terminal (step S205). When determining that the output request has not been received (S205: NO), the sales management server 3 terminates the series of processes. When determining that the output request has been received (S205: YES), the sales management server 3 transfers the output request to the server 1 (step S206).

サーバ１の制御部１１は、元データの出力要求を受け付ける（ステップＳ２０７）。出力要求を受け付けた場合、制御部１１は、元データであるデジタルコンテンツを販売管理サーバ３に出力する（ステップＳ２０８）。販売管理サーバ３は、サーバ１から元データを取得し、元データをＷｅｂ画面上に表示させ（ステップＳ２０９）、一連の処理を終了する。 The control unit 11 of the server 1 accepts the output request for the original data (step S207). When receiving the output request, the control unit 11 outputs the digital content, which is the original data, to the sales management server 3 (step S208). The sales management server 3 acquires the original data from the server 1, displays the original data on the web screen (step S209), and ends the series of processes.

なお、上記ではＥＣサイトを一例に説明を行ったが、サーバ１は構造化データを利用して文書内の各要素を出力可能であればよく、本実施の形態の適用対象はＥＣサイトに限定されない。 Although the above description uses an EC site as an example, it suffices that the server 1 can output each element in a document using the structured data; the application of this embodiment is not limited to EC sites.

以上より、本実施の形態２によれば、構造化データを参照して各要素を出力することで、マイクロコンテンツの実際的な利用が可能となる。 As described above, according to the second embodiment, by referring to the structured data and outputting each element, it is possible to practically use the micro-contents.

また、本実施の形態２によれば、各要素に元データの付随情報を対応付けておくことで、マイクロコンテンツ利用の利便性を高めることができる。 Further, according to the second embodiment, by associating accompanying information of original data with each element, it is possible to enhance the convenience of using micro-contents.

また、本実施の形態２によれば、コンテンツ全体での価格から各要素の価格を自動算出することで、マイクロコンテンツ単位の適切な価格を算出することができる。また、販売者が各マイクロコンテンツの価格を個別に定める必要がなく、価格算定の煩わしさを解消することができる。 Moreover, according to the second embodiment, by automatically calculating the price of each element from the price of the entire content, it is possible to calculate an appropriate price for each micro-content. In addition, it is not necessary for the seller to set the price of each microcontent individually, and the troublesome price calculation can be eliminated.

また、本実施の形態２によれば、マイクロコンテンツを誘因としてデジタルコンテンツ全体の利用を促進することができる。 Moreover, according to the second embodiment, it is possible to promote the use of the entire digital content by using the micro content as an incentive.

（実施の形態３） (Embodiment 3)
本実施の形態では、著作物であるコンテンツの不正使用をチェックするためのクローリング監視を行う形態について説明する。 This embodiment describes a form of crawling monitoring for checking unauthorized use of copyrighted content.
図１１は、実施の形態３の概要を示す説明図である。サーバ１は、ネットワークＮを介してＷｅｂサイトの情報を収集するクローリング処理を行い、各サイトのＷｅｂページにおいて、コンテンツＤＢ１４２に記憶されているデジタルコンテンツが不正に使用されていないかどうかを監視する処理を行う。 FIG. 11 is an explanatory diagram showing an overview of the third embodiment. The server 1 performs crawling processing that collects website information via the network N, and monitors whether the digital content stored in the content DB 142 is being used without authorization on the Web pages of each site.

例えばサーバ１は、定期的にインターネット上の各Ｗｅｂサイトにアクセスし、各サイトのＷｅｂページのデータを収集しておく。そしてサーバ１は、収集した各Ｗｅｂページのうち、デジタルコンテンツから抽出した各要素、すなわちマイクロコンテンツと一致するコンテンツが掲載されたＷｅｂページがあるかどうかを判定する。一致するコンテンツが掲載されているＷｅｂページがあると判定した場合、サーバ１は、該当ページを管理者に報知する。具体的には、サーバ１は、Ｗｅｂページのアドレス情報と、当該ページ内の該当箇所とを報知する。 For example, the server 1 periodically accesses each website on the Internet and collects the data of the Web pages of each site. The server 1 then determines whether any of the collected Web pages contains content that matches an element extracted from the digital content, that is, a microcontent. If it determines that there is a Web page on which matching content is posted, the server 1 notifies the administrator of that page. Specifically, the server 1 reports the address information of the Web page and the location of the matching content within the page.

図１２は、実施の形態３に係るサーバ１が実行する処理手順の一例を示すフローチャートである。図１２に基づき、本実施の形態においてサーバ１が実行する処理内容について説明する。 FIG. 12 is a flowchart showing an example of a processing procedure executed by the server 1 according to the third embodiment. The processing executed by the server 1 in this embodiment is described with reference to FIG. 12.
サーバ１の制御部１１は、ネットワークＮを介して各Ｗｅｂサイトにアクセスし、Ｗｅｂページのデータを収集する（ステップＳ３０１）。制御部１１は、収集したＷｅｂページから、コンテンツＤＢ１４２に記憶されている要素と一致するコンテンツが掲載されたＷｅｂページがあるか否かを判定する（ステップＳ３０２）。該当するＷｅｂページがないと判定した場合（Ｓ３０２：ＮＯ）、制御部１１は一連の処理を終了する。 The control unit 11 of the server 1 accesses each website via the network N and collects Web page data (step S301). The control unit 11 determines whether the collected Web pages include a Web page containing content that matches an element stored in the content DB 142 (step S302). If it determines that there is no such Web page (S302: NO), the control unit 11 ends the series of processes.

該当するＷｅｂページがあると判定した場合（Ｓ３０２：ＹＥＳ）、制御部１１は、当該Ｗｅｂページを抽出する（ステップＳ３０３）。具体的には、制御部１１は、当該Ｗｅｂページのアドレス情報等を抽出すると共に、当該Ｗｅｂページにおいて一致するコンテンツが掲載されている該当箇所を抽出する。制御部１１は、抽出したＷｅｂページを管理者に報知し（ステップＳ３０４）、一連の処理を終了する。 If it is determined that there is a relevant web page (S302: YES), the control unit 11 extracts the relevant web page (step S303). Specifically, the control unit 11 extracts the address information and the like of the web page, and also extracts the corresponding portion where the matching content is posted on the web page. The control unit 11 notifies the administrator of the extracted web page (step S304), and ends the series of processes.
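Steps S302 and S303 can be sketched in Python as follows. The description does not specify how a match is decided, so this sketch assumes text elements and a simple whitespace- and case-insensitive substring comparison; the function names and the dictionary shapes are hypothetical, and page collection (step S301) is represented by an already-collected dictionary of page texts rather than real network access.

```python
def normalize(text):
    """Collapse runs of whitespace and lowercase the text before comparing."""
    return " ".join(text.split()).lower()


def find_matching_pages(pages, elements):
    """Return (url, element_id, offset) for every element found on a page.

    pages    -- dict mapping a page URL to its collected text (step S301)
    elements -- dict mapping an element id to the element's text
    offset   -- position of the match within the normalized page text
    """
    hits = []
    for url, page_text in pages.items():
        haystack = normalize(page_text)
        for element_id, element_text in elements.items():
            needle = normalize(element_text)
            if not needle:
                continue  # skip empty elements; they would match everywhere
            position = haystack.find(needle)
            if position >= 0:
                # Step S303: keep the address and the matching location.
                hits.append((url, element_id, position))
    return hits


pages = {
    "https://example.com/a": "Intro.  The quick brown fox jumps.",
    "https://example.com/b": "Nothing relevant here.",
}
elements = {"e1": "the quick  brown fox"}
hits = find_matching_pages(pages, elements)
```

Each returned tuple carries what step S304 needs for the notification: the page address and where on the page the matching content appears. Matching images or near-duplicate text would need a different comparison than the exact substring test assumed here.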

以上より、本実施の形態３によれば、著作物の不正使用をマイクロコンテンツ単位で監視することができる。 As described above, according to the third embodiment, unauthorized use of a copyrighted work can be monitored for each micro-content.

（実施の形態４） (Embodiment 4)
図１３は、上述した形態のサーバ１の動作を示す機能ブロック図である。制御部１１がプログラムＰを実行することにより、サーバ１は以下のように動作する。取得部１３１は、非構造化データのサンプルと、該サンプルに含まれる各要素を定義付けるタグ情報とを含む教師情報を取得する。特徴抽出部１３２は、前記サンプルの前記各要素の特徴を抽出する。設定部１３３は、抽出した前記各要素の特徴と、前記教師情報が示す前記タグ情報とに基づき、前記要素の特徴に応じて前記タグ情報を識別するルールを設定する。コンテンツ取得部１３４は、非構造化データであるコンテンツを取得する。抽出部１３５は、取得した前記コンテンツから前記各要素を抽出する。付与部１３６は、前記ルールを参照して、前記要素に前記タグ情報を付与する。 FIG. 13 is a functional block diagram showing the operation of the server 1 in the embodiments described above. When the control unit 11 executes the program P, the server 1 operates as follows. The acquisition unit 131 acquires teacher information including a sample of unstructured data and tag information defining each element included in the sample. The feature extraction unit 132 extracts the features of each element of the sample. The setting unit 133 sets a rule for identifying the tag information according to the features of an element, based on the extracted features of each element and the tag information indicated by the teacher information. The content acquisition unit 134 acquires content that is unstructured data. The extraction unit 135 extracts each element from the acquired content. The granting unit 136 refers to the rule and grants the tag information to the elements.
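The flow through the feature extraction unit 132, the setting unit 133, and the granting unit 136 can be sketched in Python as below. This is a deliberately simple stand-in, not the learning method of the embodiments: it assumes each element is a dictionary with hypothetical `font_size` and `bold` fields, and it sets the rule by majority vote over the teacher information rather than by any particular machine-learning model.

```python
from collections import Counter, defaultdict


def extract_features(element):
    """Feature extraction (unit 132): hypothetical format features."""
    return (element["font_size"], element["bold"])


def set_rules(teacher_information):
    """Rule setting (unit 133): map each feature tuple to its most common tag.

    teacher_information is a list of (element, tag) pairs.
    """
    votes = defaultdict(Counter)
    for element, tag in teacher_information:
        votes[extract_features(element)][tag] += 1
    return {
        features: counts.most_common(1)[0][0]
        for features, counts in votes.items()
    }


def grant_tags(rules, elements, default="unknown"):
    """Tag granting (unit 136): look up each element's features in the rules."""
    return [rules.get(extract_features(e), default) for e in elements]


teacher_information = [
    ({"font_size": 24, "bold": True}, "title"),
    ({"font_size": 10, "bold": False}, "body"),
    ({"font_size": 10, "bold": False}, "body"),
    ({"font_size": 10, "bold": False}, "caption"),
]
rules = set_rules(teacher_information)

new_elements = [
    {"font_size": 24, "bold": True},
    {"font_size": 10, "bold": False},
    {"font_size": 8, "bold": False},
]
tags = grant_tags(rules, new_elements)
```

An exact-match lookup table cannot generalize to feature combinations it has never seen, which is why the last element falls back to the default tag; a trained classifier would replace the table in a more realistic setting.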

本実施の形態４は以上の如きであり、その他は実施の形態１から３と同様であるので、対応する部分には同一の符号を付してその詳細な説明を省略する。 The fourth embodiment is as described above. The rest is the same as in the first to third embodiments, so corresponding parts are given the same reference numerals and their detailed description is omitted.

今回開示された実施の形態はすべての点で例示であって、制限的なものではないと考えられるべきである。本発明の範囲は、上記した意味ではなく、特許請求の範囲によって示され、特許請求の範囲と均等の意味及び範囲内でのすべての変更が含まれることが意図される。 The embodiments disclosed this time are illustrative in all respects and should not be considered restrictive. The scope of the present invention is indicated by the scope of the claims rather than the meaning described above, and is intended to include all modifications within the scope and meaning equivalent to the scope of the claims.

REFERENCE SIGNS LIST
１サーバ（情報処理装置） 1 server (information processing device)
１１制御部 11 control unit
１２主記憶部 12 main storage unit
１３通信部 13 communication unit
１４補助記憶部 14 auxiliary storage unit
Ｐプログラム P program
１４１構造化テーブル 141 structured table
１４２コンテンツＤＢ 142 content DB
２端末 2 terminal
３販売管理サーバ 3 sales management server

Claims

1. An information processing device comprising:
an acquisition unit that acquires teacher information associating a sample of unstructured data consisting of one or more elements with tag information consisting of meta information that defines each element of the sample;
a feature extraction unit that, for the sample indicated by the acquired teacher information, extracts features of each element of the sample based on layout features of the elements and features of the data in each element;
a setting unit that sets a rule for identifying the tag information according to the features of the elements, based on the extracted features of the elements and the tag information indicated by the teacher information;
a content acquisition unit that acquires content that is unstructured data;
an extraction unit that extracts each of the elements from the acquired content;
a granting unit that refers to the rule and grants the tag information to the elements;
an accompanying information acquisition unit that acquires, from the content, accompanying information including price information for the entire content;
a calculation unit that calculates the price of each of the elements from the price of the entire content;
a generation unit that structures the content based on the tag information granted to each of the elements and generates structured data that stores the accompanying information in association with each of the elements; and
an output unit that refers to the generated structured data and outputs each element and the price of each element.

2. The information processing device according to claim 1, wherein
the sample and the content are document data consisting of text or images,
the feature extraction unit extracts information on the format or layout of the elements, and
the setting unit sets the rule indicating the correspondence between the format or layout of the elements and the tag information.

3. The information processing device according to claim 1 or 2, wherein
the acquisition unit acquires a plurality of pieces of the teacher information, and
the setting unit learns the correspondence between the features of the elements and the tag information from each piece of the teacher information to generate an identifier that identifies the tag information corresponding to an element.

4. The information processing device according to any one of claims 1 to 3, further comprising
a receiving unit that receives an output request for the content corresponding to the element output by the output unit,
wherein the output unit outputs the content when the output request is received.

5. The information processing device according to any one of claims 1 to 4, further comprising:
a collection unit that collects data related to web pages via a network;
a page extraction unit that extracts, from the collected web pages, a web page having content that matches the element; and
a notification unit that reports the extracted web page.

6. A program that causes a computer to execute processing comprising:
acquiring teacher information associating a sample of unstructured data consisting of one or more elements with tag information consisting of meta information that defines each element of the sample;
extracting, for the sample indicated by the acquired teacher information, features of each element of the sample based on layout features of the elements and features of the data in each element;
setting a rule for identifying the tag information according to the features of the elements, based on the extracted features of the elements and the tag information indicated by the teacher information;
acquiring content that is unstructured data;
extracting each of the elements from the acquired content;
referring to the rule and granting the tag information to the elements;
acquiring, from the content, accompanying information including price information for the entire content;
calculating the price of each of the elements from the price of the entire content;
structuring the content based on the tag information granted to each of the elements, and generating structured data in which the accompanying information is stored in association with each of the elements; and
referring to the generated structured data and outputting each element and the price of each element.