JP2006107155A

JP2006107155A - Device and method for document structural processing, and program for making computer execute same method

Info

Publication number: JP2006107155A
Application number: JP2004293314A
Authority: JP
Inventors: Masaru Tanaka; 大田中
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2004-10-06
Filing date: 2004-10-06
Publication date: 2006-04-20

Abstract

PROBLEM TO BE SOLVED: To provide a document structuring processing method that lets a display device capable of understanding structured documents, such as electronic paper, change the layout of document data on the display device side and display the image in the state best suited to its display screen. SOLUTION: A document structuring processing unit 220 is provided in a printer driver 215 that handles intermediate document data generated in the process of imaging document data including character codes. The unit determines the structure of the document data, taking as the minimum unit a character string composed of a plurality of characters arranged in one direction, and generates a structured document indicating that structure based on structured information indicating the determined structure. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a document structuring processing apparatus, a document structuring processing method, and a program for causing a computer to execute the method.

At present, PCs are widely used as word processors and as a means of communication such as e-mail. Although PCs have become steadily smaller and lighter in recent years, battery life and weight still make them impractical to carry every day. For this reason, documents and other data created on a PC are in many cases carried around printed on a medium such as paper. To that end, the PC transfers the created data to a printer, which prints it on a paper medium.

Data is transferred from the PC to the printer using a storage medium such as a memory card or an input/output interface such as USB. To make the many data formats used on a PC work with printers, the OS that controls application software on the PC provides a print function to the application software. Consequently, once a printer driver is installed, the PC side can pass the data to be printed to the OS according to fixed rules regardless of the type of printer.

Likewise, the printer side can print the data regardless of the application software on the PC side. A printer driver can also be written so that it enables facsimile communication by transferring the data to a facsimile apparatus.
Such a printer driver receives the application software's data from the OS, converts it into image data, and transfers it to the printer side. At this time, data composed of character strings is passed to the printer driver line by line. As prior art for such printer drivers, PostScript (registered trademark) is cited as Non-Patent Document 1 and ESC/Page (registered trademark) as Non-Patent Document 2.
Non-Patent Document 1: Internet <URL: http://www.adobe.co.jp/print/postscript/main.html>
Non-Patent Document 2: Internet <URL: http://edrj.i-love-epson.co.jp/>

Meanwhile, thin display media are now available for use in place of paper and the like. Such a display medium is referred to as electronic paper in this specification. Electronic paper of the memory type can retain data once displayed without being supplied with power, and its data can be rewritten repeatedly. Because of these advantages, electronic paper is being considered as a portable medium for displaying data created on a PC, in the same way as paper.

Electronic paper has a display screen and a control unit that controls the image displayed on the display screen. Some control units can understand structured documents such as XHTML (eXtensible HyperText Markup Language) and XML (eXtensible Markup Language). If the data displayed on the display screen is a character string, such a control unit can change the layout of the character string while taking into account its structure, such as lines and paragraphs.

However, the prior art assumes that data is printed on a paper medium and, as described above, transfers the data to the printer as an image. As a result, only information on the size and shape of the characters constituting the data is transferred to the printer side, and information such as the lines and paragraphs of the document is lost.
With such prior art, the layout of document data cannot be changed even though the electronic paper is able to understand structured documents. Therefore, if the prior art is applied to a configuration that transfers character string data from a PC to electronic paper, the only way to match the size of the image created on the PC side to the size of the display screen of the electronic paper is to enlarge or reduce the entire character string as an image. Moreover, when reducing the image is undesirable, the display screen of the electronic paper must be scrolled to view the whole image, which is a cumbersome operation.

The present invention has been made in view of the above points. Its object is to provide a document structuring processing apparatus, a document structuring processing method, and a program for causing a computer to execute the method, with which a display device capable of understanding structured documents, such as electronic paper, can change the layout of document data on the display device side and display the image in the state best suited to its display screen.

To solve the above problems, the document structuring processing apparatus of the present invention is a document structuring processing apparatus that handles intermediate document data generated in the process of imaging document data including character codes, and comprises: structure determining means for determining the structure of the document data, taking as the minimum unit a character string composed of a plurality of characters arranged in one direction in the document data; and structured document generating means for generating a structured document indicating the structure of the document data, based on structured information indicating the structure determined by the structure determining means.

According to this invention, the structure of the document data can be determined in the process of imaging document data including character codes, and a document indicating the structure obtained by the determination can be created. Through this processing, the structure of the document data can be sent to and made known to another device. Therefore, the invention can provide a document structuring processing apparatus that enables another device, provided it can understand structured documents, to change the layout of the received document data and display the image in the state best suited to its own display screen.

In the document structuring processing apparatus of the present invention, the structure determining means includes paragraph determining means for determining, based on the character strings, the structure of the paragraphs constituting the document represented by the document data.
According to this invention, document data can be structured more simply and at higher speed than with prior art such as OCR, which extracts and interprets characters one at a time.

In the document structuring processing apparatus of the present invention, the paragraph determining means treats the characters of one line of the document represented by the document data as a character string, detects indentation of a line from the position of the leading character of the character string, and determines that the range from a character string in which indentation is detected to the character string immediately before the next character string in which indentation is detected is one paragraph.
According to this invention, the beginning of a paragraph can be detected relatively simply and accurately.
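The indentation rule above can be illustrated with a minimal Python sketch. The `Line` class, the `split_by_indent` function, and the `indent_threshold` parameter are hypothetical names introduced only for this illustration; the source does not specify how far a line must be offset to count as indented.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One character string (one printed line). Hypothetical structure."""
    text: str
    x: float  # x coordinate of the leading character


def split_by_indent(lines, indent_threshold=5.0):
    """Group lines into paragraphs; an indented line opens a new paragraph.

    A line counts as indented when its leading character sits at least
    `indent_threshold` units to the right of the leftmost line
    (an assumed criterion, not taken from the source).
    """
    if not lines:
        return []
    left_edge = min(line.x for line in lines)
    paragraphs = []
    for line in lines:
        indented = (line.x - left_edge) >= indent_threshold
        if indented or not paragraphs:
            paragraphs.append([line])      # start of a new paragraph
        else:
            paragraphs[-1].append(line)    # continuation of the current one
    return paragraphs
```

Each paragraph then runs from one indented line up to the line just before the next indented line, as the text describes.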

In the document structuring processing apparatus of the present invention, the paragraph determining means treats the characters of one line of the document represented by the document data as a character string, and determines as one paragraph the range that starts at the character string of the first line, or at a character string whose spacing from the immediately preceding character string is wider than its spacing from the immediately following character string, and ends at a character string whose spacing from the immediately following character string is wider than its spacing from the immediately preceding character string.
According to this invention, the beginning of a paragraph can be detected relatively simply and accurately.
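The line-spacing rule can be sketched in Python as follows. The function name `split_by_spacing` and its input (a list of the vertical positions of consecutive lines) are assumptions made for this illustration. A line opens a new paragraph when the gap before it is wider than the gap after it, or when the previous line closed a paragraph because the gap after it was wider than the gap before it.

```python
def split_by_spacing(tops):
    """Return paragraphs as lists of line indices.

    `tops` holds the vertical position of each line, in reading order
    (a hypothetical input format).
    """
    n = len(tops)
    if n == 0:
        return []
    # gaps[i] is the spacing between line i and line i + 1.
    gaps = [tops[i + 1] - tops[i] for i in range(n - 1)]
    paragraphs = [[0]]  # the first line always opens a paragraph
    for i in range(1, n):
        gap_before = gaps[i - 1]
        # Line i starts a paragraph: wider gap before it than after it.
        starts_here = i < n - 1 and gap_before > gaps[i]
        # Line i - 1 ended a paragraph: wider gap after it than before it.
        previous_ended = i >= 2 and gap_before > gaps[i - 2]
        if starts_here or previous_ended:
            paragraphs.append([i])
        else:
            paragraphs[-1].append(i)
    return paragraphs
```

For example, lines at positions 0, 10, 20, 40, 50, 60 are split between the third and fourth lines, where the spacing widens. This sketch compares raw gaps; a practical implementation would probably allow some tolerance.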

In the document structuring processing apparatus of the present invention, the structure determining means detects a plurality of margin areas that adjoin the paragraph from mutually different directions, and determines the structure relating to the layout of the document data based on the positional relationship and the relative difference in size among the detected margin areas.
According to this invention, it can be determined simply and accurately whether the document data is shifted to the left, to the right, or to the center, or upward or downward. The structure relating to the shift of the document data can therefore be included in the structured document.
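As a rough illustration of comparing margin areas, the following Python sketch classifies horizontal alignment from the widths of the left and right margins adjoining a region. The function name and the `tolerance` parameter are assumptions; the source does not state how large a difference between margins must be.

```python
def horizontal_alignment(left_margin, right_margin, tolerance=0.1):
    """Classify a region as 'left', 'right', or 'center' aligned.

    Margins that differ by no more than `tolerance` times their sum are
    treated as equal, which is read as centering (an assumed criterion).
    """
    total = left_margin + right_margin
    if abs(left_margin - right_margin) <= tolerance * total:
        return "center"
    # The narrower margin is on the side the region is shifted toward.
    return "left" if left_margin < right_margin else "right"
```

The same comparison could be applied to the top and bottom margins to detect a vertical shift.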

In the document structuring processing apparatus of the present invention, when the document data includes a drawing, the structure determining means detects a plurality of margin areas that adjoin the display area of the drawing from mutually different directions, and determines the structure relating to the layout of the drawing based on the positional relationship and the relative difference in size among the detected margin areas.
According to this invention, it can be determined simply and accurately whether the display area of a drawing included in the document data is shifted to the left, to the right, or to the center, or upward or downward. The structure relating to the shift of the display area of the drawing can therefore be included in the structured document.

In the document structuring processing apparatus of the present invention, the structure determining means determines the attribute of the paragraph based on the size of the characters included in the character string.
According to this invention, it can be determined relatively simply and accurately whether a paragraph is something other than body text, such as a heading or ruby. More detailed attributes of the paragraph can therefore be included in the structured document.
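One way to picture the size-based determination is the Python sketch below. It assumes that the most common character size is the body size and that headings are noticeably larger and ruby noticeably smaller; the ratios and the function name are illustrative assumptions, not values from the source.

```python
from collections import Counter


def classify_by_size(sizes, heading_ratio=1.2, ruby_ratio=0.6):
    """Label each paragraph 'heading', 'ruby', or 'body' from its character size.

    `sizes` lists one character size per paragraph (hypothetical input).
    """
    if not sizes:
        return []
    # Assume the most frequent size belongs to the body text.
    body_size = Counter(sizes).most_common(1)[0][0]
    labels = []
    for size in sizes:
        if size >= body_size * heading_ratio:
            labels.append("heading")
        elif size <= body_size * ruby_ratio:
            labels.append("ruby")
        else:
            labels.append("body")
    return labels
```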

In the document structuring processing apparatus of the present invention, the structure determining means detects a keyword included in the character string and determines the structure relating to the layout of the document data based on the meaning of the keyword.
According to this invention, the attributes of a character string can be determined in more detail, and special character strings can be structured appropriately.
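The keyword-based determination might look like the following Python sketch. The keyword table is purely hypothetical: this passage does not list any keywords, so the entries (common closing words in Japanese business letters) and their layouts are assumptions chosen only to make the example concrete.

```python
# Hypothetical keyword table; the source passage does not specify its contents.
KEYWORD_LAYOUT = {
    "敬具": "right",   # letter closing, conventionally right-aligned
    "記": "center",    # marker before an itemized section, conventionally centered
    "以上": "right",   # end marker, conventionally right-aligned
}


def layout_from_keyword(text, table=KEYWORD_LAYOUT):
    """Return the layout implied by a keyword line, or None if there is none.

    Only a line consisting solely of the keyword is matched, so that the
    same word appearing inside a sentence is ignored.
    """
    return table.get(text.strip())
```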

The document structuring processing apparatus of the present invention is provided in a printer driver that receives document data including character codes and converts it into an image.
According to this invention, the intermediate document obtained in the process of imaging the document data can be handled at an appropriate stage, and the structured document can be created smoothly.
The document structuring processing method of the present invention is a document structuring processing method that handles intermediate document data generated in the process of imaging document data including character codes, and includes: a structure determining step of determining the structure of the document data, taking as the minimum unit a character string composed of a plurality of characters arranged in one direction in the document data; and a structured document generating step of generating a structured document indicating the structure of the document data, based on structured information indicating the structure determined in the structure determining step.

According to this invention, the structure of the document data can be determined in the process of imaging document data including character codes, and a document indicating the structure obtained by the determination can be created. Through this processing, the structure of the document data can be sent to and made known to another device. Therefore, the invention can provide a document structuring processing method that enables another device, provided it can understand structured documents, to change the layout of the received document data and display the image in the state best suited to its own display screen.

The program of the present invention for causing a computer to execute the document structuring processing method is a program for causing a computer to execute a document structuring processing method that handles intermediate document data generated in the process of imaging document data including character codes, and includes: a structure determining step of determining the structure of the document data, taking as the minimum unit a character string composed of a plurality of characters arranged in one direction in the document data; and a structured document generating step of generating a structured document indicating the structure of the document data, based on structured information indicating the structure determined in the structure determining step.

According to this invention, the structure of the document data can be determined in the process of imaging document data including character codes, and a document indicating the structure obtained by the determination can be created. Through this processing, the structure of the document data can be sent to and made known to another device. Therefore, the invention can provide a program for causing a computer to execute a document structuring processing method that enables another device, provided it can understand structured documents, to change the layout of the received document data and display the image in the state best suited to its own display screen.

Embodiments of the document structuring processing apparatus, the document structuring processing method, and the program for causing a computer to execute the method according to the present invention are described below with reference to the drawings. In this specification, structuring a document means converting a document composed of a plurality of characters into a description format in which a machine can recognize not only the character type (character code, font, size) but also the layout, such as line break positions, headings, paragraphs, and figure positions. A document described in such a format is called a structured document.

Common description formats for structured documents include HTML (HyperText Markup Language), XML (eXtensible Markup Language), SGML (Standard Generalized Markup Language), TeX, POD (Plain Old Documentation), and reStructuredText.
FIG. 1 is a block diagram of the hardware configuration of a document structuring processing apparatus according to an embodiment of the present invention. The document structuring processing apparatus shown in FIG. 1 includes a PC (Personal Computer) 1, a printer 103 connected to the PC 1, and electronic paper 101. The electronic paper 101 in this embodiment is a portable information display device in which a thin display and a control unit for the display are integrated. The printer 103 and the electronic paper 101 are connected to the external input/output I/O 115 of the PC 1. A liquid crystal display, for example, can be used as the display of the electronic paper.

The PC 1 includes a CPU 109 that performs overall control, a RAM (Random Access Memory) 105 used by the CPU 109 for control, an HDD (Hard Disk Drive) 107 driven under the control of the CPU 109, a display device 111 such as a display screen, and a display controller 113 that controls the image displayed on the display device 111. The PC 1 further has an input device 121 for entering operator instructions, to which a mouse 117 and a keyboard 119 are connected. The operator operates the mouse 117 and the keyboard 119 to instruct the PC 1 what processing to perform.

FIG. 2 is a functional block diagram of the document structuring processing apparatus shown in FIG. 1. FIG. 3 is a diagram showing the hierarchical structure of the software shown in FIG. 2. Of the components shown in FIGS. 2 and 3, those identical to components already illustrated are given the same reference numerals, and their description is partly omitted.
The document structuring processing apparatus of this embodiment is operated by software 2. The software 2 includes an OS (Operating System) 200 and an application software group 201. The OS 200 provides a print service 211 and runs the installed printer driver 215.

The application software group 201 includes a plurality of application software programs, other than the OS 200, that run on the PC 1.
The print service 211 accepts a print request from the application software group 201 and passes the document data to be printed (hereinafter, print data) to the printer driver 215. The print data is document data that includes character codes, and it also includes data such as font attributes (typeface, size) and coordinates indicating where the characters are drawn (hereinafter, attribute information). The print data is described using, for example, PostScript (registered trademark) or ESC/Page (registered trademark).

The print service 211 also has a character string output function 213. With the character string output function 213, print data can be passed to the printer driver 215 in units of character strings, each consisting of characters arranged in a row. In this embodiment, the printer driver 215 processes the print data it receives in units of character strings and outputs the result to the electronic paper 101.

The printer driver 215 of this embodiment includes a document structuring processing unit 220. The printer driver 215 by its nature receives print data and generates data that renders the print data as an image (for example, bitmap data). Within the printer driver 215, the document structuring processing unit 220 functions as a document structuring processing apparatus that handles the intermediate document data generated in the course of imaging.

The document structuring processing unit 220 processes the print data using the character string as the minimum unit and determines the structure of the print data. The structure of the print data is determined by a paragraph determination unit 219, a layout structure determination unit 221, and a keyword monitoring unit 223 provided in the document structuring processing unit 220. The processing performed by each of these components is described in detail later.

The document structuring processing unit 220 also generates a structured document indicating the structure of the print data, based on the structured information obtained by the determination. In this embodiment, the structured document is generated by a tagging processing unit 217 provided in the layout structure determination unit 221. The tagging processing unit 217 generates the structured document by adding tags indicating the structured information to the print data.
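Tagging can be pictured with the following minimal Python sketch, which wraps already-determined paragraphs in XHTML-style tags. The input format (a list of `(kind, align, lines)` tuples), the choice of `h1` and `p` elements, and the inline `text-align` style are assumptions made for illustration; the source states only that tags indicating the structured information are added to the print data.

```python
from html import escape


def to_xhtml(paragraphs):
    """Build an XHTML body fragment from (kind, align, lines) tuples.

    `kind` is 'heading' or 'body'; `align` is 'left', 'center', or 'right'.
    """
    parts = []
    for kind, align, lines in paragraphs:
        tag = "h1" if kind == "heading" else "p"
        # Lines of one paragraph are joined so the receiving device can
        # re-wrap them; a space is used here, which suits Latin text only.
        text = escape(" ".join(lines))
        parts.append(f'<{tag} style="text-align: {align};">{text}</{tag}>')
    return "<body>\n" + "\n".join(parts) + "\n</body>"
```

Because the original line breaks inside a paragraph are dropped, a display device that reads the tags is free to reflow each paragraph to its own screen width.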

The structured information in this embodiment is information indicating the paragraphs of the document data and the layout structure of the document with the paragraph as the reference. The structured information of this embodiment is described in a common description format such as XHTML (eXtensible HTML).
FIG. 4 illustrates print data in units of character strings. In the example shown in FIG. 4, the characters of one line of the document printed from the print data form one character string. The illustrated print data indicates, for the rectangular areas 401, 402, and 403 in which the first, second, and third lines of the document are printed, the coordinates of points p1, p2, and p3 (labeled "character string coordinates" in the figure) and the coordinates of points p'1, p'2, and p'3 (labeled "character string size" in the figure).

The interface (not shown) between the print service 211 and the printer driver 215 generally sends the print data to the printer driver 215 after rasterizing all characters and images into a bitmap. Alternatively, character data may be sent to the printer driver 215 separately from the images.
The OS 200 also includes a display control unit 203 that controls the display device 111, an input device control unit 205 that controls input devices such as the keyboard 119 and the mouse 117, an external input/output control unit 207 that controls output devices such as the electronic paper 101 and the printer 103, and a file management function 209 that manages files of data created on the PC 1 or input to the PC 1 from outside. Documents structured by the printer driver 215 are managed by the file management function 209.

As shown in FIG. 3, the OS 200 provides the display control unit 203, the external input/output control unit 207, the file management function 209, and the printer driver 215. Although not shown in FIG. 3, the OS 200 also provides the input device control unit 205 and the print service 211 shown in FIG. 2. The functions provided by the OS 200 are shared by the plurality of application software programs included in the application software group 201.

The application software programs have a screen drawing routine that draws an image on the display screen of the display device 111 or the electronic paper 101, and a print output routine that prints an image on a paper medium; the two may be one and the same. In many cases, the screen drawing routine is invoked first in response to a call from the OS 200 and generates a bitmap representing the character string according to information such as the character code, character size, and typeface. The generated bitmap is displayed on the display screen of the display device 111, allowing the operator to check the data generated by the application software.

When the operator checks the displayed data and then requests printing, the print output routine is invoked. The print output routine executes printing in accordance with the resolution of the printer 103 and the paper size.
When the operator checks the displayed data and then requests output to the electronic paper 101, the character strings that make up the print data are structured by the printer driver 215 and output to the electronic paper 101 via the external input/output control unit 207.

FIG. 5 is a flowchart outlining the document structuring process performed by the document structuring processing apparatus described above. The process shown in the flowchart of FIG. 5 is performed in the printer driver 215 shown in FIG. 1. The document structuring processing unit 220 of this embodiment first acquires the character strings (represented by character codes) of the print data received from the print service 211, together with the coordinates and size illustrated in FIG. 4 and the attributes of each character string (S501). Here, the attributes include, in addition to character size and typeface, layout information such as left alignment of the document and the placement of figures, and information such as ruby and headings.

Next, in the document structuring processing unit 220, the paragraph determination unit 219 determines the paragraphs of the document composed of the character strings, based on the information obtained in step S501 (S502). The layout structure determination unit 221 then determines the layout structure, paragraph by paragraph, of the document whose paragraphs were determined in step S502 (S503). Further, the keyword monitoring unit 223 of the document structuring processing unit 220 detects keywords contained in the character strings of each paragraph and determines the content of the character strings (S504).

Next, the document structuring processing unit 220 structures the document by attaching information (tags) indicating the determination results to the document whose paragraphs, layout structure, and so on have been determined by the above processing (S505). The tagging is performed by the tagging processing unit 217 shown in FIG. 2.
After the above processing, the document structuring processing unit 220 outputs the structured document to the electronic paper 101 via the external input/output control unit 207 (S506). The electronic paper 101 interprets the structured document based on the tags and displays the document on a display (not shown). The document displayed on the electronic paper 101 side carries information on the paragraphs and layout of the document.
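
The role of the tags in steps S505 and S506 can be illustrated with a minimal Python sketch. The tag syntax, the labels `heading` and `body`, and the function names `tag_document` and `render` are all hypothetical; this section does not specify the tag format actually used. The sketch only shows why a display device that knows the paragraph boundaries can re-wrap text to a different line width without merging paragraphs.

```python
import re
import textwrap

# Hypothetical tag syntax: <kind>text</kind>. The actual format is not given in this section.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.S)

def tag_document(paragraphs):
    """Driver side (S505): attach a tag naming the determination result to each paragraph.

    paragraphs is a list of (kind, text) pairs; kind is an assumed label such as
    'heading' or 'body'. The text is assumed not to contain '<' or '>'.
    """
    return "\n".join(f"<{kind}>{text}</{kind}>" for kind, text in paragraphs)

def render(structured, width):
    """Display side: interpret the tags and re-wrap each paragraph to the given width."""
    blocks = []
    for kind, text in TAG_RE.findall(structured):
        if kind == "heading":
            blocks.append(text)                              # headings are left unwrapped
        else:
            blocks.append(textwrap.fill(text, width=width))  # body text is re-wrapped
    return "\n\n".join(blocks)

doc = tag_document([("heading", "Title"), ("body", "aaa bbb ccc ddd")])
print(render(doc, 8))    # narrow display: the body is wrapped onto two lines
print(render(doc, 20))   # wide display: the body fits on one line
```

In both renderings the heading and the body remain separate blocks, which corresponds to the statement that the paragraphs do not collapse when the number of characters per line is changed.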

For this reason, the operator can change the number of characters per line of the displayed document, the number of lines and columns per page, and so on. According to the present embodiment, the paragraphs and layout of the document are prevented from collapsing even when the document format is changed in this way. Each of the document structuring processes shown in FIG. 5 is described in more detail below.
(1) Paragraph determination
The paragraph determination method performed by the document structuring processing apparatus of the present embodiment is described below. FIG. 6 is a flowchart showing the paragraph determination process of the present embodiment. FIGS. 7 and 8 are schematic diagrams showing paragraph determination methods: FIG. 7 shows a method of determining paragraphs based on line indentation, and FIG. 8 shows a method of determining paragraphs based on the distance between lines (line spacing).

As shown in the flowchart of FIG. 6, the paragraph determination unit 219 shown in FIG. 2 acquires, from the print data received from the print service 211, the font size of the characters to be printed (S601). It then determines whether an indent exists at the head of each line composed of a character string (S602). If an indent is detected (S602: Yes), the character string in which the indent was detected is determined to be the start line of a paragraph (S605).

FIG. 7 shows a document composed of a plurality of character strings 6. Although not illustrated in detail, each character string 6 consists of a plurality of characters arranged in a row. In the present embodiment, one line of the document is treated as one character string 6. The paragraph determination unit 219 detects an indent by checking whether a blank of one or more character widths exists at the head of each illustrated character string 6. The character width is derived from the font size acquired in step S601.

If no indent is detected at the head of the line in step S602 (S602: No), the paragraph determination unit 219 acquires the spacing between character strings as the line spacing. The line spacing can be obtained, for example, from the coordinates of each character string 6 in the print data, as shown in FIG. 4. The unit then determines whether the acquired line spacing is wider than the previously acquired line spacing (S603). If the acquired line spacing is wider than the previously acquired one (referred to as the normal line spacing) (S603: Yes), the character string that is separated from the immediately preceding line by more than the normal line spacing is determined to be the start line of the next paragraph (S605).

As in FIG. 7, FIG. 8 shows a document composed of a plurality of character strings 6. The paragraph determination unit 219 detects the spacing from the coordinates of the immediately preceding character string and the coordinates of the character string currently being processed. In the illustrated example, the line spacing of the character strings 6 on the first through third lines is a constant value a, whereas the line spacing between the character string 6 on the third line and the character string 6 on the fourth line is a value b, which is longer than a. In this case, the paragraph determination unit 219 determines that the line spacing between the character string 6 on the third line and the character string 6 on the fourth line is longer than normal.

The printer driver 215 receives the print data from the print service 211 sequentially, line by line, and executes the above processing. It then determines whether the indent detection or line spacing acquisition has been completed for the character string on the last line of the document (S604). If the processing has reached the last line (S604: Yes), the paragraph determination unit 219 ends the paragraph determination process.

If the processing has not reached the last line (S604: No), the font size of the next line is acquired and processing continues with the character string on the next line.
Through the above processing, in the example shown in FIG. 7, an indent 601a is detected at the head of the character string 6 on the first line, and an indent 602a is detected at the head of the character string 6 on the fourth line. The paragraph determination unit 219 determines that the character string 6 on the first line is the start line of a paragraph and that the character string 6a on the fourth line is the start line of the next paragraph. The paragraph determination unit 219 therefore determines that the lines from the first line of the document through the line immediately before the fourth line (the third line) form one paragraph 601.

Further, in the example shown in FIG. 7, an indent 603a is detected at the head of the character string 6 on the seventh line. The paragraph determination unit 219 therefore determines that the character string 6 on the seventh line is the start line of a paragraph and that the lines from the fourth line of the document through the line immediately before the seventh line (the sixth line) form one paragraph 602.
Likewise, in the example shown in FIG. 8, a character string 6 that is separated from the character string 6 on the immediately preceding line by more than the normal line spacing is determined to be the start line of a paragraph. The paragraph determination unit 219 therefore determines that the lines from the first line of the document through the line immediately before the fourth line (the third line) form one paragraph 701. Furthermore, if the line spacing between the character string 6 on the sixth line of the document and the character string on the following line (not shown) is longer than the normal value a, the paragraph determination unit 219 determines that the character string 6 on the seventh line is the start line of a paragraph and that the lines from the fourth line through the line immediately before the seventh line (the sixth line) form one paragraph 702.
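
The paragraph determination of FIG. 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the paragraph determination unit 219: the `Line` record, the `left_edge` parameter, and the tolerance `tol` are hypothetical, the font size is assumed to equal the width of one character, and the first line spacing measured is assumed to be the normal line spacing.

```python
from dataclasses import dataclass

@dataclass
class Line:
    """One character string (one line of the document); field names are assumptions."""
    text: str
    x: float          # left coordinate of the character string
    y: float          # vertical coordinate, increasing down the page
    font_size: float  # used here as the width of one character (S601)

def split_paragraphs(lines, left_edge=0.0, tol=0.1):
    """Group lines into paragraphs using indentation (S602) and line spacing (S603)."""
    paragraphs = []
    normal_gap = None  # the "normal line spacing" seen so far
    prev = None
    for line in lines:
        # S602: a blank of at least one character width at the head of the line
        indented = (line.x - left_edge) >= line.font_size
        wider = False
        if not indented and prev is not None:
            gap = line.y - prev.y
            if normal_gap is None:
                normal_gap = gap                  # first spacing is taken as normal
            elif gap > normal_gap * (1 + tol):    # S603: wider than normal
                wider = True
        if prev is None or indented or wider:     # S605: start line of a paragraph
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
        prev = line
    return paragraphs

# A layout like FIG. 7: lines 1, 4 and 7 are indented by one character width.
fig7 = [Line("t", 10.0 if i in (0, 3, 6) else 0.0, 12.0 * i, 10.0) for i in range(7)]
print([len(p) for p in split_paragraphs(fig7)])

# A layout like FIG. 8: wider spacing before lines 4 and 7.
fig8 = [Line("t", 0.0, float(y), 10.0) for y in (0, 12, 24, 44, 56, 68, 88)]
print([len(p) for p in split_paragraphs(fig8)])
```

Both sample layouts yield a three-line paragraph, a second three-line paragraph, and the start of a third, matching paragraphs 601 and 602 in FIG. 7 and paragraphs 701 and 702 in FIG. 8.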

After the above processing, the result of the paragraph determination is passed from the paragraph determination unit 219 to the tagging processing unit 217 as structuring information. Based on this structuring information, tags indicating the paragraphs are later attached to the print data.
(2) Determination of layout structure
・Determination of layout structure based on margins
Next, the process of determining the layout structure of a document based on margins is described.

FIG. 9 is a flowchart for explaining a method of determining the layout structure of a document based on margins. FIGS. 10 to 12 are diagrams for explaining the layout structure determination method shown in the flowchart of FIG. 9.
As shown in FIG. 9, the layout structure determination unit 221 receives the document sent from the paragraph determination unit 219 and detects the blank areas (margins) at both ends of the character strings. The margins are detected for each paragraph determined by the paragraph determination unit 219 (S901). In the present embodiment, the size of the margin to the left of the character strings in one paragraph (left margin) is first compared with the size of the margin to the right (right margin) (S902).

FIG. 10 is a diagram for explaining the comparison performed in step S902. In the present embodiment, the margin sizes are compared by taking the lengths a, b, c, d from the left edge of the paragraph 1000 to the left edge of the paper P and the lengths a′, b′, c′, d′ from the right edge of the paragraph 1000 to the right edge of the paper P, and comparing each pair of corresponding lengths. Comparing corresponding lengths means, for example, comparing the length d from the left end of a specific character string in the paragraph (for example, the character string 60) to the left edge of the paper P with the length d′ from the right end of the same character string 60 to the right edge of the paper P.

That is, the layout structure determination unit 221 determines whether the lengths a, b, c, d from the left end of each character string 6 in the paragraph 1000 to the left edge of the paper P are substantially equal to 0 and the lengths a′, b′, c′, d′ from the right end of each character string 6 to the right edge of the paper P are greater than 0. If the lengths a, b, c, d are substantially equal to 0 and the lengths a′, b′, c′, d′ are greater than 0 (S902: Yes), the paragraph 1000 is determined to have a left-aligned layout structure (S908).

On the other hand, if the determination in step S902 is "No" (S902: No), the unit determines whether the lengths a, b, c, d are greater than 0 and the lengths a′, b′, c′, d′ are substantially equal to 0 (S903). If the lengths a, b, c, d are greater than 0 and the lengths a′ to d′ are substantially equal to 0 (S903: Yes), the paragraph 1000 is determined to have a right-aligned layout structure (S909).

If the determination in step S903 is "No" (S903: No), the layout structure determination unit 221 determines whether the lengths a, b, c, d and the lengths a′, b′, c′, d′ are all substantially equal to 0 (S904). If they are all substantially equal to 0 (S904: Yes), the layout structure determination unit 221 determines that the paragraph 1000 has a normal layout structure, that is, one with no margin at either the left or the right end (S907).

Further, if the determination in step S904 is "No" (S904: No), the layout structure determination unit 221 determines whether the length a and the length a′, the length b and the length b′, the length c and the length c′, and the length d and the length d′ are each substantially equal (S905). If each of these pairs is substantially equal (S905: Yes), the paragraph 1000 is determined to have a centered layout structure (S906). FIG. 11 shows the relationship between the lengths a, b, c, d and the lengths a′, b′, c′, d′ when the character strings 6 are centered.

If the determination in step S905 is "No" (S905: No), the layout structure determination unit 221 determines that the paragraph 1000 has the normal layout structure (S907).
The determination of the layout structure in the present embodiment is not limited to the configuration described above. In the determination described above, the layout structure is determined by comparing the left margin and the right margin of a paragraph. However, the vertical layout of a paragraph can also be determined by comparing the margin above the paragraph (top margin) with the margin below the paragraph (bottom margin).
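
The decision sequence of steps S902 to S909 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `classify_alignment`, the returned labels, and the tolerance `eps` that stands in for "substantially equal" are hypothetical, and "greater than 0" is read as "greater than the tolerance" so that the tests stay mutually consistent.

```python
def classify_alignment(margins, eps=1.0):
    """Classify the horizontal layout of one paragraph.

    margins holds one (left, right) pair per line, e.g. (a, a'), (b, b'), ...
    eps is an assumed tolerance for "substantially equal".
    """
    lefts = [left for left, _ in margins]
    rights = [right for _, right in margins]

    def near_zero(value):
        return abs(value) <= eps

    # S902: left margins about 0, right margins greater than 0 -> left-aligned (S908)
    if all(near_zero(v) for v in lefts) and all(v > eps for v in rights):
        return "left"
    # S903: left margins greater than 0, right margins about 0 -> right-aligned (S909)
    if all(v > eps for v in lefts) and all(near_zero(v) for v in rights):
        return "right"
    # S904: every margin about 0 -> normal layout (S907)
    if all(near_zero(v) for v in lefts + rights):
        return "normal"
    # S905: each left margin about equal to its right margin -> centered (S906)
    if all(abs(left - right) <= eps for left, right in margins):
        return "center"
    # S905: No -> normal layout (S907)
    return "normal"

print(classify_alignment([(0, 30), (0, 12), (0, 45)]))  # left
print(classify_alignment([(20, 20), (35, 35.5)]))       # center
```

Passing (top, bottom) pairs instead of (left, right) pairs gives the vertical reading of the same flowchart, with "left" and "right" read as top and bottom.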

FIG. 12 is a diagram for explaining a method of determining the vertical layout of the paragraph 1000. In the example shown in FIG. 12, the sizes of the top and bottom margins of the paragraph 1000 are compared by taking the lengths a, b, c, d from the top of the paragraph 1000 to the top edge of the paper P and the lengths a′, b′, c′, d′ from the bottom of the paragraph 1000 to the bottom edge of the paper P, and comparing each pair of corresponding lengths. If, for example, the lengths a, b, c, d are all substantially the same and each is shorter than its counterpart among the lengths a′, b′, c′, d′ (for example, the length a′ is the counterpart of the length a), the layout structure determination unit 221 determines that the paragraph 1000 is top-aligned.

In the present embodiment, this processing is indicated in parentheses in the flowchart of FIG. 9. That is, reading "left" in the figure as the parenthesized "top" and "right" as the parenthesized "bottom" turns the flowchart of FIG. 9 into a flowchart for determining the vertical layout of a paragraph.
Further, with the layout structure determination method described above, which compares the left and right margins of a paragraph, it is possible to determine that a so-called indent, in which the document is indented in units such as paragraphs, has been set. FIG. 13 shows a state in which the document is indented separately for each of the paragraphs 1301, 1302, and 1303.

The layout structure determination unit 221 of the present embodiment can also apply the method described above to determine the position of a figure in the document and structure the document accordingly. FIGS. 14 and 15 illustrate such processing.
The figure 141 and the figure 151 in the documents shown in FIGS. 14 and 15 are both handled in the same way as characters whose size differs from that of the other characters. Therefore, in the processing shown in FIG. 14, the layout structure determination unit 221 compares the lengths a, b, c, d, e, f indicating the left margin of the paragraph 140, figure included, with the lengths a′, b′, c′, d′, e′, f′ indicating the right margin, pair by corresponding pair.

As a result, in the example shown in FIG. 14, the lengths a, b, c, d, e are each substantially equal to the lengths a′, b′, c′, d′, e′, but for the lengths f and f′, f < f′ holds. In this case, the layout structure determination unit 221 determines that the paragraph 140 including the figure 141 has a left-aligned layout structure. In the layout structure shown in FIG. 14, character strings 6 are also present around the figure 141 in the paragraph 140. In the present embodiment, this state is also referred to as text wrapping around the figure.

In the processing shown in FIG. 15, the length a indicating the left margin of the paragraph 150, figure included, is substantially equal to the length a′ indicating the right margin. In this case, the layout structure determination unit 221 determines that the paragraph 150 including the figure 151 has a layout structure aligned at both ends. In the layout structure shown in FIG. 15, no character string 6 is present around the figure 151 in the paragraph 150. In the present embodiment, this state is also referred to as having no text wrapping.

・Determination of layout structure based on character size
Next, the process of determining the layout structure of a document based on character size is described.
FIG. 16 is a flowchart for explaining a method of determining the layout structure of a document based on character size. FIG. 17 is a diagram for explaining the layout structure determination method shown in the flowchart of FIG. 16.

As shown in the flowchart of FIG. 16, the layout structure determination unit 221 acquires the character size of the character strings included in a determined paragraph (S161). The character size is included in the print data passed to the document structuring processing unit 220 and can be obtained easily.
Next, the layout structure determination unit 221 extracts the sizes of the characters in the character strings of the paragraph based on the print data. From the result, it determines whether characters of 9 to 12 points, which are generally regarded as appearing most frequently in documents, appear in the paragraph at a frequency higher than a predetermined value (S162). If the appearance frequency of 9 to 12 point characters in the paragraph is lower than the predetermined value (S162: No), it determines whether characters of 9 points or smaller exist between character strings (between lines) (S163).

If characters of 9 points or smaller exist between lines (S163: Yes), the layout structure determination unit 221 determines that these characters are ruby (S166). If no characters of 9 points or smaller exist between lines (S163: No), it determines whether characters of 10 points or larger are present as bold, underlined, or centered characters (emphasized characters) (S164). If emphasized characters of 10 points or larger are present (S164: Yes), the paragraph is determined to be a heading (S167).

If there are no emphasized characters of 10 points or larger in step S164 (S164: No), the characters in the character string are determined to be characters constituting the body of the document (S165). The paragraph is also determined to be document body when it is determined in step S162 that the character size of the paragraph is 9 to 12 points (S162: Yes) (S165).
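
The decision sequence of FIG. 16 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Char` record, its field names, the returned labels, and the default `threshold` of 0.8 are hypothetical, since the text only refers to "a predetermined value". Note that 9 point characters satisfy both the 9 to 12 point test and the 9 points or smaller test, as in the description.

```python
from dataclasses import dataclass

@dataclass
class Char:
    """One character of a paragraph; field names are assumptions."""
    size: float                  # character size in points (S161)
    emphasized: bool = False     # bold, underlined, or centered
    between_lines: bool = False  # positioned between two character strings

def classify_by_size(chars, threshold=0.8):
    """Return (label, ruby_chars) for one non-empty paragraph."""
    if not chars:
        raise ValueError("paragraph has no characters")
    body_ratio = sum(1 for c in chars if 9 <= c.size <= 12) / len(chars)
    if body_ratio > threshold:                    # S162: Yes
        return "body", []                         # S165
    ruby = [c for c in chars if c.size <= 9 and c.between_lines]
    if ruby:                                      # S163: Yes
        return "ruby", ruby                       # S166: these characters are ruby
    if any(c.size >= 10 and c.emphasized for c in chars):  # S164: Yes
        return "heading", []                      # S167
    return "body", []                             # S164: No -> S165

heading = [Char(14, emphasized=True)] * 5
annotated = [Char(10.5)] * 6 + [Char(5, between_lines=True)] * 4
print(classify_by_size(heading)[0])     # heading
print(classify_by_size(annotated)[0])   # ruby
```

The "ruby" label applies to the small characters returned in the second element, not to the paragraph as a whole, which follows step S166.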

FIG. 17 is a diagram for explaining this processing and shows a document containing ruby and emphasized characters. The document shown in FIG. 17 has been determined by the layout structure determination unit 221 to contain three paragraphs: paragraph 174a, paragraph 174b, and paragraph 174c. The paragraph 174a contains one line of text, and the character string consists entirely of emphasized characters 172. This paragraph is therefore determined to be a heading.

The paragraph 174b is composed of characters 173 of 9 to 12 points and characters 171 of 9 points or smaller. In such a paragraph, the appearance frequency of the characters 173 is lower than in a paragraph composed entirely of the characters 173. The layout structure determination unit 221 can therefore detect that the appearance frequency of the characters 173 in the paragraph 174b is lower than the predetermined value.
In the paragraph 174b, the characters 171, which are smaller than 9 points, are present between the character strings composed of the characters 173. The layout structure determination unit 221 determines that the character string composed of the characters 171 is ruby. Similarly, in the subsequent paragraph 174c, a character string composed of the characters 171 is present between character strings composed of the characters 173. The layout structure determination unit 221 determines that this character string is also ruby.

After the above processing, the determination result of the layout structure determination unit 221 is passed to the tagging processing unit 217 as structuring information. Based on this determination result, the print data is later structured by attaching tags indicating information such as ruby, heading, and document body.
(3) Determination of layout structure based on keywords
Next, the determination of layout structure based on keywords in the present embodiment is described. This determination is performed by the keyword monitoring unit 223 of the document structuring processing unit 220. The keyword monitoring unit 223 detects keywords contained in the character strings and determines the layout-related structure of the print data based on the meaning of each keyword.

In the present embodiment, a "・" at the head of a character string, "http://", "mailto:", "@", and consecutive numbers "1, 2, 3, ..." are treated as keywords. The keyword monitoring unit 223 includes a storage unit (not shown) that stores these keywords and checks each character string against the stored keywords. If a character string contains any of the keywords, the layout structure of the character string, or of the paragraph made up of the character strings, is determined according to that keyword.

FIG. 18 is a flowchart showing the process of determining the layout structure based on keywords. The keyword monitoring unit 223 takes in the print data paragraph by paragraph and determines whether each paragraph is a heading paragraph (S181). This determination is made based on the determination result of the layout structure determination unit 221 described above. If the paragraph is determined not to be a heading (S181: No), it is determined whether the heads of the character strings corresponding to the lines of the paragraph are aligned (S182).

If the heads of the character strings are aligned in step S182 (S182: Yes), it is determined whether each character string begins with "・" (S186). If it does (S186: Yes), the paragraph made up of these lines of character strings is determined to be a list paragraph (S189).
FIG. 19 is a diagram for explaining the above processing and shows a document displayed from print data. This document has four paragraphs: paragraph 191, paragraph 192, paragraph 193, and paragraph 194. The keyword monitoring unit 223 determines the layout structure in the order of paragraph 191, paragraph 192, paragraph 193, and paragraph 194.

The paragraph 191 consists of a character string composed of characters larger than 9 to 12 points and has already been determined by the layout structure determination unit 221 to be a heading paragraph. In the paragraph 193, the heads of the character strings are aligned and each begins with "・". Such a paragraph is determined to be a paragraph indicating list items.
If the keyword monitoring unit 223 determines in step S186 that the character strings constituting the paragraph do not begin with "・" (S186: No), it determines whether the heads of the character strings contain consecutive numbers (S187). If the heads of the character strings contain consecutively increasing numbers (S187: Yes), the paragraph is determined to be an item paragraph of a numbered list (S190). FIG. 20 is a diagram showing an item paragraph 2001 of a numbered list.

If it is determined in step S181 that the paragraph is a heading paragraph (S181: Yes), the keyword monitoring unit 223 further determines whether consecutively increasing numbers are included at the head of the paragraph (S191). When they are (S191: Yes), it determines that the paragraph carries a number indicating a chapter or section (S192). FIG. 21 is a diagram showing paragraphs 2101 and 2102, which carry numbers indicating chapters and sections.
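The branches S181, S182, S186, S187, and S191 can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Line` type, its `x` field, the alignment tolerance `tol`, and the returned labels are hypothetical names that do not appear in the specification, and the heading branch is simplified to check only that the heading begins with a number rather than comparing it with earlier headings.

```python
import re
from dataclasses import dataclass

BULLET = "・"  # list marker named in the text (S186)

@dataclass
class Line:
    x: float   # left edge of the character string (assumed field)
    text: str

def leading_number(text):
    """Return the integer at the head of text, or None if there is none."""
    m = re.match(r"\s*(\d+)", text)
    return int(m.group(1)) if m else None

def increasing(numbers):
    """True when every line has a number and each is one more than the last."""
    if not numbers or any(n is None for n in numbers):
        return False
    return all(b == a + 1 for a, b in zip(numbers, numbers[1:]))

def classify(lines, is_heading, tol=0.5):
    """Classify one paragraph along the S181/S182/S186/S187/S191 branches."""
    if is_heading:                                        # S181: Yes
        if leading_number(lines[0].text) is not None:     # S191 (simplified)
            return "numbered-heading"                     # S192
        return "heading"
    aligned = len(lines) > 1 and all(
        abs(ln.x - lines[0].x) <= tol for ln in lines)
    if aligned:                                           # S182: Yes
        if all(ln.text.startswith(BULLET) for ln in lines):        # S186: Yes
            return "list"                                 # S189
        if increasing([leading_number(ln.text) for ln in lines]):  # S187: Yes
            return "numbered-list"                        # S190
    return "other"                                        # go on to S183
```

A paragraph that falls through to `"other"` would then be examined for URLs and mail addresses.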

After the above processing, if it is determined that the paragraph is none of a paragraph indicating list items, an item paragraph of a numbered list, or a numbered paragraph indicating a chapter or section, the keyword monitoring unit 223 determines whether the paragraph contains a character string made up of ASCII characters beginning with the keyword "http://" (S183). If the paragraph contains such a character string (S183: Yes), the character string is determined to be a URL (S184).

Further, when it is determined in step S183 that there is no character string made up of ASCII characters beginning with "http://" (S183: No), it is determined whether the paragraph contains a character string made up of ASCII characters beginning with "mailto:", or an ASCII character string with "@" in the middle (S185). If the paragraph is determined to contain such a character string (S185: Yes), the character string is determined to be a mail address (S188).
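As a rough sketch of steps S183 to S188, the keyword checks could be written with regular expressions. The patterns, the function name `find_keyword`, and the sample address are assumptions made for illustration; the specification only says that the strings consist of ASCII characters beginning with "http://" or "mailto:", or surrounding "@".

```python
import re

# "http://" followed by printable ASCII characters (S183).
URL_RE = re.compile(r"http://[\x21-\x7e]+")
# "mailto:" followed by ASCII, or ASCII runs on both sides of "@" (S185).
MAIL_RE = re.compile(
    r"mailto:[\x21-\x7e]+|[\x21-\x3f\x41-\x7e]+@[\x21-\x3f\x41-\x7e]+")

def find_keyword(paragraph):
    """Return ("url", s) or ("mail", s) for the first match, else None."""
    m = URL_RE.search(paragraph)
    if m:
        return ("url", m.group(0))    # S184
    m = MAIL_RE.search(paragraph)
    if m:
        return ("mail", m.group(0))   # S188
    return None
```

Because the character classes cover only printable ASCII, a match stops at the first space or non-ASCII character, which is what lets a URL embedded in Japanese text be isolated.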

FIG. 22 is a diagram showing a paragraph 2201 consisting of a character string that includes a URL. Once structured, a character string that includes a URL or mail address provides a link to that URL or e-mail address on the electronic paper 101, which is the output destination. That is, when the character string indicating the URL is clicked on the electronic paper 101 side, it is recognized as something that displays the page indicated by the URL or creates an e-mail addressed to the specified address.
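One way to express such a link in a structured document is to wrap the detected string in an anchor element. The sketch below is illustrative only: `linkify` and its pattern are hypothetical, and a bare mail address without the "mailto:" prefix is not handled.

```python
import re
from html import escape

LINK_RE = re.compile(r"http://[\x21-\x7e]+|mailto:[\x21-\x7e]+")

def linkify(text):
    """Wrap each detected URL or mailto: string in an <a> element."""
    out, pos = [], 0
    for m in LINK_RE.finditer(text):
        out.append(escape(text[pos:m.start()]))       # text before the link
        target = escape(m.group(0), quote=True)
        out.append(f'<a href="{target}">{target}</a>')
        pos = m.end()
    out.append(escape(text[pos:]))                    # text after the last link
    return "".join(out)
```

The receiving device can then treat the `href` value as the click target.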

After the above processing, the keyword monitoring unit 223 passes the determination results to the tagging processing unit 217 as structured information. The print data is structured according to the keyword determination results so that meanings carried by the character strings, such as list items and chapter numbers, are not lost. This structuring makes it possible, for example, to prohibit line breaks within character strings that contain items, chapter numbers, and the like.
(4) Structuring (Tagging) Processing
The tagging processing unit 217 structures the print data based on the determination results of the paragraph determination unit 219, the layout structure determination unit 221, and the keyword monitoring unit 223. Print data structured according to the present embodiment (a structured document) is illustrated below. In this example, the structured document is written in XHTML (eXtensible HTML) format. However, the present embodiment is not limited to XHTML, and the structured document may be written in any format.

<?xml version="1.0" encoding="Shift_JIS"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<body>
<h2>About structured documents</h2>
The following are well-known formats for structured documents.
<ul>
<li>SGML</li>
<li>HTML</li>
<li>XML</li>
<li>XHTML</li>
</ul>
SGML stands for Standard Generalized Markup Language and was
defined as a system-independent method of describing information.
It allows rigorous description, but its language specification is difficult.
It has had a great influence on XML, for example through DTDs.
HTML stands for Hyper Text Markup Language and is
used to describe so-called home pages.
It is the driving force behind the explosive spread of the WWW (World Wide Web).
It provides text, images, and hyperlinks.
Its specifications are developed at the W3C. The URL of the W3C is
<a href="http://www.w3.org/">http://www.w3.org/</a>.
XML stands for Extensible Markup Language.
HTML allowed only predefined tags to be used, but
XML lets users define tags themselves.
It is used in every field, from electronic commerce to
data transfer headers.
XHTML stands for Extensible Hyper Text Markup Language and
can be described as HTML adapted to the XML format. In XML, tags can be
freely defined, but tags and their presentation are independent. For this reason,
XSL-FO and the like are used in the printing field, and XHTML was proposed
as a simpler alternative.
</body>
</html>
The structured document described above is transmitted to the electronic paper 101. FIG. 23 shows the document that is displayed on the electronic paper 101 side based on this structured document.
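The tagging step can be pictured as a mapping from determination results to markup. The following is a minimal sketch, not the tagging processing unit 217 itself: the function `to_xhtml`, the `(kind, lines)` input shape, and the kind labels `"heading"`, `"list"`, and `"text"` are all assumptions chosen for illustration, and only the `<body>` part is produced.

```python
from html import escape

def to_xhtml(paragraphs):
    """Render (kind, lines) pairs as the body of a structured document.

    kind is one of "heading", "list", or "text" (hypothetical labels).
    """
    out = ["<body>"]
    for kind, lines in paragraphs:
        if kind == "heading":
            out.append(f"<h2>{escape(' '.join(lines))}</h2>")
        elif kind == "list":
            out.append("<ul>")
            for line in lines:
                # Drop the leading bullet; the <li> element carries that meaning.
                out.append(f"<li>{escape(line.lstrip('・'))}</li>")
            out.append("</ul>")
        else:
            out.extend(escape(line) for line in lines)
    out.append("</body>")
    return "\n".join(out)
```

Escaping the text keeps characters such as "<" in the print data from being read as markup by the receiving device.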

The structuring processing apparatus and structuring processing method of the present embodiment described above can create a structured document from print data. By sending the structured document to a device that can interpret structured documents, such as electronic paper, the device can be made to understand the structure of the print data. The device can therefore re-lay out the print data according to its own display screen and the needs of the operator.
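To illustrate what re-layout on the receiving device might look like, the sketch below rewraps body text to a given display width while leaving headings and list items unbroken. It assumes space-separated text, a width measured in characters, and hypothetical `(kind, text)` pairs; the specification does not describe the device-side algorithm.

```python
import textwrap

def relayout(paragraphs, width):
    """Rewrap "text" paragraphs to width; keep other kinds on one line."""
    out = []
    for kind, text in paragraphs:
        if kind == "text":
            out.extend(textwrap.wrap(text, width=width))
        else:
            out.append(text)  # headings and list items are not broken
    return out
```

Because the structure is known, the same document can be wrapped differently for each screen without splitting a heading or list item across lines.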

In this way, the present embodiment can output print data to another device and have an image displayed in an optimal state on the display screen of that device.
Note that a program for causing a computer to execute the document structuring processing method shown in the flowcharts of the present embodiment is provided as a file in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, a floppy (registered trademark) disk (FD), or a DVD. Alternatively, the program may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network.

FIG. 1 is a block diagram of the hardware configuration of a document structuring processing apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of the document structuring processing apparatus shown in FIG. 1.
FIG. 3 is a diagram showing the hierarchical structure of the software shown in FIG. 2.
FIG. 4 is a diagram illustrating print data in units of character strings according to an embodiment of the present invention.
FIG. 5 is a flowchart outlining the document structuring processing performed by the document structuring processing apparatus according to an embodiment of the present invention.
FIG. 6 is a flowchart showing paragraph determination processing according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining a method of determining paragraphs based on line indentation in the paragraph determination processing according to an embodiment of the present invention.
FIG. 8 is a diagram for explaining a method of determining paragraphs based on line spacing in the paragraph determination processing according to an embodiment of the present invention.
FIG. 9 is a flowchart for explaining a method of determining the layout structure of a document based on margins according to an embodiment of the present invention.
FIG. 10 is a diagram for explaining the margin comparison method in the flowchart shown in FIG. 9.
FIG. 11 is a diagram for explaining the state of the margins under center alignment in the flowchart shown in FIG. 9.
FIG. 12 is a diagram for explaining a method of determining the vertical layout in the flowchart shown in FIG. 9.
FIG. 13 is a diagram for explaining document indentation according to an embodiment of the present invention.
FIG. 14 is a diagram for determining the position of a figure in a document according to an embodiment of the present invention.
FIG. 15 is another diagram for determining the position of a figure in a document according to an embodiment of the present invention.
FIG. 16 is a flowchart for explaining a method of determining the layout structure of a document based on character size according to an embodiment of the present invention.
FIG. 17 is a diagram for explaining the method of determining the layout structure of a document shown in the flowchart of FIG. 16.
FIG. 18 is a flowchart showing layout structure determination processing based on keywords according to an embodiment of the present invention.
FIG. 19 is a diagram for explaining the list paragraph determination method in the processing shown in the flowchart of FIG. 18.
FIG. 20 is a diagram showing an item paragraph of a numbered list according to an embodiment of the present invention.
FIG. 21 is a diagram showing paragraphs that carry numbers indicating chapters and sections according to an embodiment of the present invention.
FIG. 22 is a diagram showing a paragraph consisting of a character string that includes a URL according to an embodiment of the present invention.
FIG. 23 is a document displayed based on the structured document obtained in an embodiment of the present invention.

Explanation of symbols

2 software, 6 character string, 101 electronic paper, 103 printer,
111 display device, 113 display controller, 117 mouse, 119 keyboard,
121 input device, 200 software, 201 application software group,
203 display control unit, 205 input device control unit, 207 external input/output control unit,
209 file management function, 211 print service, 213 character string output function,
215 printer driver, 217 tagging processing unit, 219 paragraph determination unit,
220 document structuring processing unit, 221 layout structure determination unit, 223 keyword monitoring unit

Claims

1. A document structuring apparatus that handles intermediate document data generated in the process of imaging document data including character codes, the apparatus comprising:
structure determination means for determining a structure of the document data using, as a minimum unit, a character string composed of a plurality of characters arranged in one direction in the document data; and
structured document generating means for generating a structured document indicating the structure of the document data based on structured information indicating the structure determined by the structure determination means.

2. The document structuring apparatus according to claim 1, wherein the structure determination means includes paragraph determination means for determining the structure of paragraphs constituting the document indicated by the document data based on the character strings.

3. The document structuring apparatus according to claim 1, wherein the paragraph determination means treats one line of characters of the document represented by the document data as a character string, detects indentation of a line from the position of the head of the character string, and determines that the character strings from a character string in which indentation is detected through the character string immediately before the next character string in which indentation is detected constitute one paragraph.

4. The document structuring apparatus according to claim 1, wherein the paragraph determination means treats one line of characters of the document represented by the document data as a character string, and determines that the character strings from the character string of the first line, or a character string whose interval from the immediately preceding character string is wider than its interval to the immediately following character string, through a character string whose interval to the immediately following character string is wider than its interval from the immediately preceding character string constitute one paragraph.

5. The document structuring apparatus according to claim 2, wherein the structure determination means detects a plurality of blank areas in contact with the paragraph from different directions, and determines a structure related to the layout of the document data based on the positional relationship and relative size difference between the detected blank areas.

6. The document structuring apparatus according to claim 2, wherein, when the document data includes a drawing, the structure determination means detects a plurality of blank areas in contact with the display area of the drawing from different directions, and determines a structure related to the layout of the drawing based on the positional relationship and relative size difference between the detected blank areas.

7. The document structuring apparatus according to claim 2, wherein the structure determination means determines the attribute of the paragraph based on the size of the characters included in the character string.

8. The document structuring apparatus according to claim 1, wherein the structure determination means detects a keyword included in the character string and determines a structure related to the layout of the document data based on the meaning of the keyword.

9. The document structuring apparatus according to claim 1, wherein the document structuring apparatus is provided in a printer driver that receives document data including character codes and converts it into an image.

10. A document structuring method for handling intermediate document data generated in the process of imaging document data including character codes, the method comprising:
a structure determination step of determining a structure of the document data using, as a minimum unit, a character string composed of a plurality of characters arranged in one direction in the document data; and
a structured document generating step of generating a structured document indicating the structure of the document data based on structured information indicating the structure determined in the structure determination step.

11. A program for causing a computer to execute a document structuring method that handles intermediate document data generated in the process of imaging document data including character codes, the method comprising:
a structure determination step of determining a structure of the document data using, as a minimum unit, a character string composed of a plurality of characters arranged in one direction in the document data; and
a structured document generating step of generating a structured document indicating the structure of the document data based on structured information indicating the structure determined in the structure determination step.