JP7480560B2

JP7480560B2 - Text extraction device and program

Info

Publication number: JP7480560B2
Application number: JP2020063663A
Authority: JP
Inventors: 篤史西田; 荘介下山
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2024-05-10
Anticipated expiration: 2040-03-31
Also published as: JP2021163159A

Description

The present invention relates to a text extraction device and a program.

Conventionally, in order to edit text contained in PDF (Portable Document Format) data of books, magazines, and the like, a user designates a part of the displayed PDF data that is to be used for editing, and the text contained in the designated area is extracted.
PDF data is data in a file format that preserves the state of a document as it would appear when actually printed on paper, and includes various kinds of meta-information.

FIG. 1 is a diagram for explaining a part of the meta-information of PDF data.
The PDF data 91 shown in FIG. 1 is made up of text 92, text lines 93, and text boxes 94 (text objects), which have a hierarchical structure.
The text 92 is a one-character region. In addition to character data (the character itself) such as "お", the text 92 has, as meta-information, the font, the font size, the position information of the one-character region, and the size of the region.
A text line 93 is a one-line region made up of a plurality of texts 92. The text line 93 has, as meta-information, the position information of the one-line region and the size of the region.
A text box 94 is a rectangular region that groups together a plurality of text lines 93. The text box 94 has, as meta-information, the position information of the rectangular region and the size of the region.
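The hierarchy described for FIG. 1 can be sketched as a small data model. This is only an illustrative sketch: the class names, field names, and the font name are assumptions chosen for the example, and the actual internal representation of PDF data is not specified in this description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rect:
    """Position (top-left corner x, y) and size (w, h) of a region."""
    x: float
    y: float
    w: float
    h: float


@dataclass
class Text:
    """One-character region (text 92): the character plus its meta-information."""
    char: str
    font: str
    font_size: float
    region: Rect


@dataclass
class TextLine:
    """One-line region (text line 93) made up of several Text items."""
    region: Rect
    texts: List[Text] = field(default_factory=list)


@dataclass
class TextBox:
    """Rectangular region (text box 94) grouping several TextLine items."""
    region: Rect
    lines: List[TextLine] = field(default_factory=list)

    def characters(self) -> str:
        """Concatenate the characters of every line, in stored order."""
        return "".join(t.char for line in self.lines for t in line.texts)


# Build a one-line text box holding two characters (hypothetical values).
first = Text("お", "ExampleFont", 10.0, Rect(0, 0, 10, 10))
second = Text("は", "ExampleFont", 10.0, Rect(10, 0, 10, 10))
box = TextBox(Rect(0, 0, 20, 10), [TextLine(Rect(0, 0, 20, 10), [first, second])])
print(box.characters())
```

Each level carries its own position and size, which is what the later overlap determination and area-ratio calculation rely on.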

PDF data has the feature that information can be embedded in it. For example, a designer or other person creating PDF data may embed, in advance and as text, information such as which document is to be placed at which position in the page data, and how. An information embedding device that embeds text into a document file has been disclosed (for example, Patent Document 1).

JP 2015-12444 A

As described above, characters embedded for editing purposes are invisible characters that cannot be seen when the PDF data is displayed. When such invisible characters are embedded, the PDF data contains both a text box made up of the embedded characters and a text box for the displayed text.
A case where characters are extracted from PDF data and output by an existing method will now be described using FIG. 2 as an example.
For example, in the case of the PDF data 71 shown in FIG. 2(A), a designated area 72, which indicates the range that the user wishes to output as editable characters, includes a text box 71a and a text box 71b. Sentence 75 is the data resulting from outputting the characters included in the designated area 72. Sentence 75 is obtained by outputting the character data included in the text boxes 71a and 71b in the designated area 72 based on the position information of each piece of character data. When the character data of the text boxes do not overlap, as in this case, the output sentence 75 can be understood as coherent text.
On the other hand, in the case of the PDF data 81 shown in FIG. 2(B), a designated area 82 includes a text box 81a and a part of a text box 81b. Here, the text box 81b contains embedded invisible characters. Sentence 85 is the data resulting from outputting the characters included in the designated area 82. Sentence 85 is obtained by outputting the character data included in the text box 81a and in the part of the text box 81b in the designated area 82 based on the position information of each piece of character data. When the character data of the text boxes overlap, as in this case, the output sentence 85 cannot be understood as coherent text.
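The problem with the existing method can be reproduced with a short sketch. It assumes horizontal text and a simple ordering rule (top to bottom, then left to right); the `Char` type, the `naive_extract` function, and the sample strings are hypothetical and are not taken from FIG. 2.

```python
from typing import List, NamedTuple


class Char(NamedTuple):
    char: str
    x: float  # left edge of the one-character region
    y: float  # top edge of the one-character region


def naive_extract(chars: List[Char]) -> str:
    """Order characters by position only, ignoring which text box they belong to."""
    return "".join(c.char for c in sorted(chars, key=lambda c: (c.y, c.x)))


# Visible text and an invisible embedded note placed on the same line.
visible = [Char(ch, x=i * 10.0, y=0.0) for i, ch in enumerate("HELLO")]
overlapping_note = [Char(ch, x=i * 10.0 + 5.0, y=0.0) for i, ch in enumerate("note")]

# The same note placed on a separate line, so nothing overlaps.
separate_note = [Char(ch, x=i * 10.0, y=20.0) for i, ch in enumerate("note")]

print(naive_extract(visible + overlapping_note))  # characters of both boxes interleave
print(naive_extract(visible + separate_note))     # each box stays readable
```

When the two text boxes share the same line, position-based ordering interleaves their characters, which corresponds to the unintelligible sentence 85; when they do not overlap, the result stays readable, as with sentence 75.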

Accordingly, an object of the present invention is to provide a text extraction device and a program designed to output text that can be understood as coherent text.

The present invention solves the above problem by the following means.
A first invention is a text extraction device that extracts character strings, according to the positions of embedded characters, from page data having a plurality of text objects, the device comprising: a character extraction means that extracts characters in a processing target area designated in the displayed page data; an object identification means that identifies the text objects including the characters extracted by the character extraction means; a determination means that determines whether or not there is overlap between characters in the plurality of text objects identified from the processing target area by the object identification means; and an output object determination means that determines output content according to the determination result of the determination means.
A second invention is the text extraction device according to the first invention, wherein, when the determination means determines that there is an overlap, the output object determination means calculates the area ratio of each of the identified text objects to the processing target area, and determines the output content according to the calculated area ratios.
A third invention is the text extraction device according to the first invention, further comprising a recognized character acquisition means that acquires the characters included in the processing target area by optical character recognition, wherein, when the determination means determines that there is an overlap, the output object determination means calculates, for the text included in each of the text objects identified for the processing target area, a degree of match with the characters acquired by the recognized character acquisition means, and determines the output content according to the calculated degree of match.
A fourth invention is the text extraction device according to the first invention, wherein, when the determination means determines that there is no overlap, the output object determination means determines the text contained in each of the plurality of text objects as the output content.
A fifth invention is the text extraction device according to any one of the first to fourth inventions, further comprising a text output means that arranges the output content determined by the output object determination means based on meta-information corresponding to the text objects and outputs it to an editing screen.
A sixth invention is the text extraction device according to any one of the first to fifth inventions, further comprising an area receiving means that receives the designation of a designated area including text in the page data, wherein the character extraction means extracts the characters in the processing target area, with the designated area being the processing target area.
A seventh invention is the text extraction device according to any one of the first to sixth inventions, wherein the page data is data in PDF format.
An eighth invention is a program for causing a computer to function as the text extraction device according to any one of the first to seventh inventions.
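The third invention selects the output by comparing each text object's text with characters obtained by optical character recognition. The sketch below illustrates one possible reading of that comparison. It assumes the OCR result is already available as a string, and it uses the similarity ratio of Python's `difflib.SequenceMatcher` as the degree of match; the description does not specify a metric, so this choice, the function names, and the sample strings are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict


def match_degree(ocr_text: str, box_text: str) -> float:
    """Similarity in [0, 1] between the OCR result and a text object's text."""
    return SequenceMatcher(None, ocr_text, box_text).ratio()


def choose_by_ocr(ocr_text: str, box_texts: Dict[str, str]) -> str:
    """Return the name of the text object whose text best matches the OCR result."""
    return max(box_texts, key=lambda name: match_degree(ocr_text, box_texts[name]))


# OCR sees only the visible text (here with one misread character, "0" for "O").
ocr_result = "HELL0"
candidates = {"visible": "HELLO", "embedded": "note"}
print(choose_by_ocr(ocr_result, candidates))
```

Because invisible embedded characters do not appear in the rendered page, the OCR result is expected to resemble the visible text object more closely, which is why a match score can separate the two.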

According to the present invention, it is possible to provide a text extraction device and a program designed to output text that can be understood as coherent text.

FIG. 1 is a diagram for explaining a part of the meta-information of PDF data.
FIG. 2 is a diagram showing an example of outputting characters from PDF data by an existing method.
FIG. 3 is a functional block diagram of the text output device according to the present embodiment.
FIG. 4 is a flowchart showing the text output process of the text output device according to the present embodiment.
FIG. 5 is a flowchart showing the overlap process of the text output device according to the present embodiment.
FIG. 6 is a diagram for explaining the overlap process of the text output device according to the present embodiment.
FIG. 7 is a diagram for explaining the overlap process of the text output device according to the present embodiment.
FIG. 8 is a diagram showing a processing example of the text output device according to the present embodiment.
FIG. 9 is a diagram showing a processing example of the text output device according to the present embodiment.
FIG. 10 is a diagram showing a processing example of the text output device according to the present embodiment.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings. Note that this is merely an example, and the technical scope of the present invention is not limited to it.
(Embodiment)
FIG. 3 is a functional block diagram of the text output device 1 according to the present embodiment.

<Text output device 1>
The text output device 1 (text extraction device) is a device for outputting text (character strings) of PDF data (page data) in an editable manner. The text output device 1 is used, for example, by a user who wants to edit text included in PDF data. The text output device 1 is, for example, a personal computer (PC). The text output device 1 may also be a tablet terminal or a portable terminal having computer functions such as a smartphone. The text output device 1 may also be a server or the like.

As shown in FIG. 3, the text output device 1 includes a control unit 10, a storage unit 30, an input unit 36, a display unit 37, and a communication interface unit 39.
The control unit 10 is a central processing unit (CPU) that controls the entire text output device 1. The control unit 10 reads and executes, as appropriate, an operating system (OS) and application programs stored in the storage unit 30, thereby cooperating with the above-mentioned hardware to execute various functions.
The control unit 10 includes a page data output unit 11, an area receiving unit 12 (area receiving means), a character extraction unit 13 (character extraction means), an object identification unit 14 (object identification means), an overlap determination unit 15 (determination means), an output object determination unit 16 (embedded character object identification means), and a text output unit 17 (text output means).

The page data output unit 11 outputs the PDF data to be edited to the display unit 37. For example, when the user selects a desired document file in PDF format, the control unit 10 accepts the selected document file and displays the PDF data on the display unit 37. The PDF data is print-image data, for example, data for one page. The PDF data may be, for example, page layout data. The PDF data also includes text. The PDF data may consist of text only, or may partly include explanatory illustrations and the like.

The area receiving unit 12 receives the designation of a designated area (processing target area), which is a partial area of the PDF data. For example, the user may use the input unit 36, such as a mouse, to drag from the upper left toward the lower right across the PDF data displayed on the display unit 37 so as to enclose the text to be edited, and the area receiving unit 12 may accept the resulting rectangular designated area.

The character extraction unit 13 extracts the character data included in the designated area received by the area receiving unit 12, based on meta-information obtained by analyzing the PDF data.
The object identification unit 14 identifies the text boxes (text objects) that include the characters extracted by the character extraction unit 13. The object identification unit 14 can identify the text boxes based on meta-information obtained by analyzing the PDF data.
The overlap determination unit 15 determines whether or not there is overlap between characters in the plurality of text boxes identified by the object identification unit 14. The overlap determination unit 15 obtains the position information and size of each one-character region from the meta-information of the characters that make up the extracted character data. The overlap determination unit 15 can then determine whether or not characters overlap by using the obtained position information and size of each one-character region.
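A minimal sketch of the overlap determination follows, assuming each one-character region is an axis-aligned rectangle given by its position and size. The names `CharRegion`, `regions_overlap`, and `boxes_have_character_overlap` are hypothetical, and treating regions that merely touch along an edge as non-overlapping is an assumption not stated in this description.

```python
from itertools import product
from typing import List, NamedTuple


class CharRegion(NamedTuple):
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def regions_overlap(a: CharRegion, b: CharRegion) -> bool:
    """True when the two rectangles share a region of positive area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def boxes_have_character_overlap(box_a: List[CharRegion],
                                 box_b: List[CharRegion]) -> bool:
    """Compare every character of one text box with every character of the other."""
    return any(regions_overlap(a, b) for a, b in product(box_a, box_b))


line_a = [CharRegion(0, 0, 10, 10), CharRegion(10, 0, 10, 10)]
shifted = [CharRegion(5, 5, 10, 10)]    # partly covers the characters of line_a
below = [CharRegion(0, 10, 10, 10)]     # directly underneath, edges only touch

print(boxes_have_character_overlap(line_a, shifted))
print(boxes_have_character_overlap(line_a, below))
```

The pairwise comparison is quadratic in the number of characters, which is usually acceptable for a single designated area on one page.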

The output object determination unit 16 determines the output content according to the determination result of the overlap determination unit 15.
More specifically, when the overlap determination unit 15 determines that there is an overlap, the output object determination unit 16 calculates the area ratio, with respect to the designated area, of each of the text boxes identified by the object identification unit 14, and determines the output content according to the calculated area ratios. Here, the output object determination unit 16 may determine the text included in the text box with the higher calculated area ratio as the output content. The control unit 10 may then judge the text included in the text box with the lower calculated area ratio to be embedded characters. Embedded characters are normally invisible: they do not appear in the displayed PDF data or in a printout of the PDF data. Alternatively, the output object determination unit 16 may determine both the text included in the text box with the higher calculated area ratio and the text included in the text box with the lower calculated area ratio as the output content.

Furthermore, when the overlap determination unit 15 determines that there is no overlap, the output object determination unit 16 determines the text contained in each of the text boxes identified by the object identification unit 14 as the output content.
Here, the output content determined by the output object determination unit 16 may include not only the text (character strings) but also the display mode of the text. As the display mode, the output object determination unit 16 may, for example, make the character color, character size, and font of the embedded characters different from those of the other characters. As another display mode, the output object determination unit 16 may, for example, append the embedded characters, underlined or in parentheses, after the characters other than the embedded characters.
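One of the display modes mentioned above, appending the embedded characters in parentheses after the other characters, could look like the following sketch. The function name `compose_output` and the exact spacing and bracket characters are assumptions made for illustration.

```python
def compose_output(visible_text: str, embedded_text: str = "") -> str:
    """Append embedded characters in parentheses after the visible text, if any."""
    if not embedded_text:
        return visible_text
    return f"{visible_text} ({embedded_text})"


print(compose_output("HELLO", "note"))
print(compose_output("HELLO"))
```

Keeping the embedded characters visually separate lets an editor see the designer's note without it being mistaken for part of the body text.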

The text output unit 17 outputs an editing screen on which the output content determined by the output object determination unit 16 is arranged based on the meta-information of each text and text box. In doing so, the text output unit 17 may arrange the embedded characters on the editing screen according to the display mode of the text. The text output unit 17 may also store the output text in the text storage unit 33.

The storage unit 30 is a storage area, such as a hard disk or a semiconductor memory device, for storing the programs, data, and the like that the control unit 10 needs in order to execute various processes.
The storage unit 30 includes a program storage unit 31, a document file storage unit 32, and a text storage unit 33.
The program storage unit 31 is a storage area for storing various programs. The program storage unit 31 stores a program 31a.
The program 31a is a program for carrying out the various functions executed by the control unit 10 of the text output device 1.

The document file storage unit 32 is a storage area for storing various document files. The document files stored in the document file storage unit 32 may be, for example, files containing the contents of magazines, books, and the like.
The document file storage unit 32 may be included in, for example, a document server (not shown) communicatively connected to the text output device 1 .
The text storage unit 33 is a storage area for storing the text output by the text output unit 17. When the stored text is subsequently edited, the text storage unit 33 may further store the edited text.

The input unit 36 is, for example, an input device such as a keyboard or a mouse.
The display unit 37 is, for example, a display device such as an LCD (Liquid Crystal Display).
The communication interface unit 39 is an interface for communicating with other devices via a communication network.

<Processing of Text Output Device 1>
Next, the processing of the text output device 1 will be described.
FIG. 4 is a flowchart showing the text output process of the text output device 1 according to the present embodiment.
FIG. 5 is a flowchart showing the overlap process of the text output device 1 according to the present embodiment.
FIGS. 6 and 7 are diagrams for explaining the overlap process of the text output device 1 according to the present embodiment.

In step S (hereinafter referred to as "S") 11 of FIG. 4, the control unit 10 (page data output unit 11) outputs the PDF data to the display unit 37. For example, when the user selects one of the document files stored in the document file storage unit 32, the control unit 10 may output the PDF data of the selected document file.
In S12, the control unit 10 (area receiving unit 12) receives, through a user operation on the PDF data displayed on the display unit 37, a designation of a designated area including a text that the user wishes to edit.
In S13, the control unit 10 (character extraction unit 13) extracts character data included in the specified area based on meta information of the PDF data.
In S14, the control unit 10 (object identification unit 14) identifies text boxes included in the specified area based on meta information of the PDF data.

In S15, the control unit 10 (overlap determination unit 15) determines whether or not there is overlap between characters in the identified text boxes.
In S16, the control unit 10 (overlap determination unit 15) determines whether or not there is overlap between characters. If there is overlap (S16: YES), the control unit 10 proceeds to S17. On the other hand, if there is no overlap (S16: NO), the control unit 10 proceeds to S18.
In S17, the control unit 10 performs overlap processing.

Here, the overlap process will be described with reference to FIG. 5.
In S21 of FIG. 5, the control unit 10 (output object determination unit 16) calculates, for each text box including overlapping characters, the degree of overlap as an area ratio with respect to the designated area.
An example of a method for calculating the degree of overlap is shown in FIG. 6.
FIG. 6 shows a method for calculating the degree of overlap between a designated area 41 and a text box 42.

The designated area 41 shown in FIG. 6(A) is a rectangular area of length w_1 in the X-axis direction and length h_1 in the Y-axis direction. The text box 42 is a rectangular area of length w_2 in the X-axis direction and length h_2 in the Y-axis direction. The overlapping portion can then be represented as a rectangular area of length w_3 in the X-axis direction and length h_3 in the Y-axis direction.
As one method of calculating the area ratio, let the degree of overlap of the text box 42 with respect to the designated area 41 be IoU (Intersection over Union); the degree of overlap IoU can then be expressed by the formula shown in FIG. 6(B). In other words, the larger the overlapping portion between the designated area 41 and the text box 42, that is, the rectangular area of length w_3 in the X-axis direction and length h_3 in the Y-axis direction, the larger the degree of overlap IoU becomes.
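The formula of FIG. 6(B) is not reproduced in this text. The sketch below therefore assumes the standard definition of Intersection over Union, in which the overlapping area w_3 × h_3 is divided by the area of the union, w_1 × h_1 + w_2 × h_2 − w_3 × h_3. The `Rect` type and the `iou` function name are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: float  # left edge
    y: float  # top edge
    w: float  # length in the X-axis direction
    h: float  # length in the Y-axis direction


def iou(region: Rect, box: Rect) -> float:
    """Standard Intersection over Union of two axis-aligned rectangles."""
    # Size of the overlapping portion (w_3, h_3); non-positive means no overlap.
    w3 = min(region.x + region.w, box.x + box.w) - max(region.x, box.x)
    h3 = min(region.y + region.h, box.y + box.h) - max(region.y, box.y)
    if w3 <= 0 or h3 <= 0:
        return 0.0
    intersection = w3 * h3
    union = region.w * region.h + box.w * box.h - intersection
    return intersection / union


designated_area = Rect(0, 0, 10, 10)
print(iou(designated_area, Rect(0, 0, 10, 5)))    # box covers the upper half
print(iou(designated_area, Rect(20, 20, 5, 5)))   # box lies outside the area
```

Dividing by the union penalizes a text box that extends far beyond the designated area, so a large box of embedded characters that only grazes the selection receives a low score.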

FIG. 7(A) shows a specific example of calculating the degree of overlap IoU for each text box.
FIG. 7(A) shows a designated area 51, a text box 51a, and a text box 51b. In this example, the text box 51a and the text box 51b overlap within the designated area 51. Therefore, the control unit 10 calculates the degree of overlap IoU for each text box.
First, as shown in FIG. 7(B), the control unit 10 calculates the degree of overlap IoU between the designated area 51 and the text box 51a. By applying the formula in FIG. 6(B), the control unit 10 can calculate the degree of overlap IoU in this case as 0.9.
Next, as shown in FIG. 7(C), the control unit 10 calculates the degree of overlap IoU between the designated area 51 and the text box 51b. By applying the formula in FIG. 6(B), the control unit 10 can calculate the degree of overlap IoU in this case as 0.1.

In S22 of FIG. 5, the control unit 10 (output object determination unit 16) identifies the character data of the text box with the lower calculated area ratio as embedded characters. In the example shown in FIG. 7 above, the control unit 10 identifies the character data included in the text box 51b as embedded characters.
In S23, the control unit 10 (output object determination unit 16) determines the text of the text box with the higher calculated area ratio as the output content. After that, the control unit 10 moves the process to S19 in FIG. 4.
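Steps S21 to S23 can be sketched as a selection over already calculated degrees of overlap. The function name `overlap_process` and the dictionary representation are assumptions; the values 0.9 and 0.1 are those given for the text boxes 51a and 51b in the FIG. 7 example. The description does not state how ties or more than two text boxes are handled, so this sketch simply takes the highest value and treats all others as embedded characters.

```python
from typing import Dict, List, Tuple


def overlap_process(degrees: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return (text box chosen as output content, text boxes treated as embedded)."""
    # S23: the text box with the highest degree of overlap becomes the output.
    output_box = max(degrees, key=lambda name: degrees[name])
    # S22: every other text box is treated as holding embedded characters.
    embedded = [name for name in degrees if name != output_box]
    return output_box, embedded


print(overlap_process({"51a": 0.9, "51b": 0.1}))
```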

On the other hand, in S18 of FIG. 4, the control unit 10 (output object determination unit 16) determines the text included in each of the text boxes as the output content.
In S19, the control unit 10 (text output unit 17) arranges and outputs the determined output contents on the editing screen based on the meta information of each text and text box. The control unit 10 (text output unit 17) also stores the characters output to the editing screen in the text storage unit 33. Thereafter, the control unit 10 ends this process.
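The flow from S13 to S19 can be summarized in one self-contained sketch. Several points are assumptions made for illustration: a character counts as included in the designated area when its region intersects the area, the degree of overlap is the standard IoU, the overlap process is applied to all text boxes found in the area rather than only to those with overlapping characters, and the output is returned as a list of strings instead of being arranged on an editing screen. All names are hypothetical.

```python
from itertools import combinations, product
from typing import List, NamedTuple


class Rect(NamedTuple):
    x: float
    y: float
    w: float
    h: float


class Char(NamedTuple):
    char: str
    region: Rect


class TextBox(NamedTuple):
    region: Rect
    chars: List[Char]


def intersects(a: Rect, b: Rect) -> bool:
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def iou(a: Rect, b: Rect) -> float:
    w = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    h = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / (a.w * a.h + b.w * b.h - w * h)


def extract(boxes: List[TextBox], area: Rect) -> List[str]:
    # S13/S14: collect the characters in the designated area, grouped by text box.
    hits = []
    for box in boxes:
        inside = [c for c in box.chars if intersects(c.region, area)]
        if inside:
            hits.append((box, inside))
    # S15/S16: does any character of one box overlap a character of another box?
    overlap = any(
        intersects(c1.region, c2.region)
        for (_, cs1), (_, cs2) in combinations(hits, 2)
        for c1, c2 in product(cs1, cs2)
    )
    if overlap:
        # S17 (S21 to S23): keep only the box with the highest degree of overlap.
        hits = [max(hits, key=lambda hit: iou(area, hit[0].region))]
    # S18/S19: output the text of each remaining box.
    return ["".join(c.char for c in cs) for _, cs in hits]


def row(text: str, x0: float, y: float, size: float = 10.0) -> List[Char]:
    """Lay out a string as one horizontal line of square character regions."""
    return [Char(ch, Rect(x0 + i * size, y, size, size)) for i, ch in enumerate(text)]


visible = TextBox(Rect(0, 0, 50, 10), row("HELLO", 0, 0))
embedded_over = TextBox(Rect(5, 0, 200, 10), row("note to the designer", 5, 0))
embedded_apart = TextBox(Rect(0, 20, 40, 10), row("note", 0, 20))

print(extract([visible, embedded_over], Rect(0, 0, 50, 10)))
print(extract([visible, embedded_apart], Rect(0, 0, 50, 30)))
```

In the first call the embedded characters overlap the visible ones, so only the text box with the higher degree of overlap is output; in the second call nothing overlaps, so the text of both text boxes is output.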

Next, examples of text output using the text output device 1 will be described.
FIGS. 8 to 10 are diagrams showing processing examples of the text output device 1 according to the present embodiment.
The PDF data 61 shown in FIG. 8 includes a text box 61a and a text box 61b. The text box 61b is a text box of invisible embedded characters and, in practice, contains characters placed behind the visible surface. For this reason, FIG. 8 uses a layered structure to show clearly that the characters are embedded. The same applies to FIGS. 9 and 10.

When the user designates a designated area 62 in the PDF data 61, the control unit 10 extracts the character data contained in the designated area 62 and identifies the text boxes (S13 and S14 in FIG. 4). Here, since the text box 61a and the text box 61b do not overlap, there is no overlap between characters in the two text boxes. The control unit 10 therefore determines that there is no overlap (NO in S16 in FIG. 4), determines the character data of both the text box 61a and the text box 61b contained in the designated area 62 as the output content, and arranges and outputs it on the editing screen 63 based on the meta-information (S18 and S19 in FIG. 4).

Thus, in the example shown in FIG. 8, the text boxes do not overlap each other, so the control unit 10 determines that there is no overlap between characters in the text boxes and outputs the characters, including the embedded characters, to the editing screen without performing the processing that determines whether the character data contained in a text box are embedded characters (the overlap process). This is because the text does not merge with other text and become incomprehensible.

Next, the PDF data 64 shown in FIG. 9 includes a text box 64a and a text box 64b. The text box 64b is a text box of invisible embedded characters.
When the user designates a designated area 65 in the PDF data 64, the control unit 10 extracts the character data included in the designated area 65 and identifies the text boxes (S13 and S14 in FIG. 4). Here, since the text box 64a and the text box 64b overlap, the control unit 10 next determines whether or not there is an overlap between characters based on the meta information of each character in the text box 64a and the text box 64b (S15 in FIG. 4). The control unit 10 then determines that there is no overlap between the characters (NO in S16 in FIG. 4), determines the character data of both the text box 64a and the text box 64b included in the designated area 65 as the output content, and arranges and outputs it on the editing screen 66 based on the meta information (S18 and S19 in FIG. 4).

Thus, in the example shown in FIG. 9, the editing screen 66 displays all of the characters in the text box 64a together with those characters of the text box 64b that are included in the designated area 65. In other words, all of the characters included in the designated area 65 are output on the editing screen 66. Embedded characters are output in this example as well. Because the characters do not overlap one another, outputting them does not merge one sentence into another and make the result incomprehensible.
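The character-level determination used when text boxes overlap (S15) can be sketched by comparing the one-character regions of the two text boxes pairwise. This is a minimal Python sketch, assuming each character carries its position and size as meta information. The `CharRegion` fields, the function names, and the coordinates are hypothetical, and the pairwise comparison is one straightforward way to perform the check, not necessarily the one used in the embodiment.

```python
from itertools import product
from typing import NamedTuple, Sequence


class CharRegion(NamedTuple):
    """One-character region with its character (hypothetical field names)."""
    char: str
    x0: float
    y0: float
    x1: float
    y1: float


def chars_overlap(a: CharRegion, b: CharRegion) -> bool:
    """Return True when two one-character regions share positive area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def any_char_overlap(box_a: Sequence[CharRegion],
                     box_b: Sequence[CharRegion]) -> bool:
    """Return True when any character of box_a overlaps any character of box_b."""
    return any(chars_overlap(a, b) for a, b in product(box_a, box_b))


# Illustrative data: the characters of the two text boxes sit on different rows.
box_64a = [CharRegion("A", 0, 0, 10, 10), CharRegion("B", 10, 0, 20, 10)]
box_64b = [CharRegion("x", 0, 12, 10, 22)]
print(any_char_overlap(box_64a, box_64b))  # False
```

With a result of False, both text boxes would be output as in FIG. 9. A result of True would lead to the overlap processing.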

Next, the PDF data 67 shown in FIG. 10 includes a text box 67a and a text box 67b. The text box 67b is a text box of invisible embedded characters.
When the user designates a designated area 68 in the PDF data 67, the control unit 10 extracts the character data included in the designated area 68 and identifies the text boxes (S13 and S14 in FIG. 4). Here, since the text box 67a and the text box 67b overlap, the control unit 10 next determines whether or not there is an overlap between characters based on the meta information of each character in the text box 67a and the text box 67b (S15 in FIG. 4). The control unit 10 determines that there is an overlap between the characters (YES in S16 in FIG. 4) and performs the overlap processing (FIG. 5). The control unit 10 identifies the characters of the text box 67b, which has the lower calculated area ratio, as embedded characters (S22 in FIG. 5), and determines the text of the text box 67a, which has the higher calculated area ratio, as the output content (S23 in FIG. 5). The control unit 10 then arranges the determined output content on the editing screen 69 based on the meta information and outputs it (S19 in FIG. 4).

In this way, in the example shown in FIG. 10, only the characters of the text box 67a included in the designated area 68 are displayed on the editing screen 69, and the characters of the text box 67b are not output. The text output device 1 can therefore prevent the characters of the text box 67b from being merged into the output, and can output text that is understandable as a sentence.
Note that not outputting the characters of the text box 67b, as in FIG. 10, is merely an example, and the characters of the text box 67b may also be output. In that case, however, they need to be displayed in a display mode different from that of the characters of the text box 67a so that they can be recognized as a different sentence.

As described above, the text output device 1 of the present embodiment provides the following effects.
(1) The device extracts the character data included in the designated area of the PDF data, identifies the text boxes, determines whether or not there is overlap between characters in the plurality of text boxes, and determines the output content according to the determination result.
Therefore, the text output on the editing screen reflects the result of determining whether characters overlap. For example, when characters overlap, the overlapping sentences can be output separately from one another, so that the output is understandable as text.

(2) When it is determined that there is overlap between characters in the plurality of text boxes, the area ratio of each text box to the designated area is calculated, and the output content is determined according to the calculated area ratio.
Therefore, the output content can be determined using the degree of overlap obtained from the position information of the designated area and of each text box. If the user designates the area so that it contains the text to be output, the area ratio of that text becomes high, and the output content matches the user's intention.
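The area-ratio selection described in (2) can be sketched as follows. The description does not give the exact formula, so this minimal Python sketch assumes that the area ratio is the area of the part of a text box lying inside the designated area divided by the area of the whole text box. That definition, the `Rect` layout, the function names, and the coordinates are all assumptions made for illustration.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(r: Rect) -> float:
    """Area of the rectangle, or 0 when it is empty or inverted."""
    return max(0.0, r.x1 - r.x0) * max(0.0, r.y1 - r.y0)


def intersection(a: Rect, b: Rect) -> Rect:
    """Common part of two rectangles (may be empty or inverted)."""
    return Rect(max(a.x0, b.x0), max(a.y0, b.y0),
                min(a.x1, b.x1), min(a.y1, b.y1))


def area_ratio(box: Rect, region: Rect) -> float:
    """Assumed definition: share of the text box covered by the designated area."""
    box_area = area(box)
    return area(intersection(box, region)) / box_area if box_area else 0.0


# Illustrative coordinates only.
designated_area = Rect(0, 0, 100, 50)
text_boxes = {"67a": Rect(10, 10, 90, 40), "67b": Rect(50, 20, 250, 40)}

ratios = {name: area_ratio(box, designated_area) for name, box in text_boxes.items()}
chosen = max(ratios, key=ratios.get)
print(chosen)  # 67a
```

Under this assumed definition, a text box that the user has enclosed completely scores 1.0, while an embedded text box that extends well beyond the designated area scores lower and is treated as the embedded one.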

(3) When it is determined that there is no overlap between characters in the plurality of text boxes, the sentences contained in each of the plurality of text boxes included in the designated area are determined as the output content.
Therefore, the text contained in the designated area becomes the output content, which matches the user's intention.

(4) An editing screen is output in which the output content is arranged based on the meta information corresponding to the text boxes.
Therefore, the text can be output on the editing screen in the same way as in the PDF data, including the position and size of each character.
(5) Designation of a designated area containing text in the PDF data is accepted, and the character data included in the accepted designated area is extracted based on the meta information.
Therefore, the text in the area designated by the user can be made the output target.
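The extraction in (5), together with the identification of the text boxes that own the extracted characters, can be sketched as below. This is a minimal Python sketch under stated assumptions: the `Char` and `TextBox` classes, the box names, and the coordinates are hypothetical, and a character is treated as included in the designated area when the center of its one-character region lies inside it, a criterion the description does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


@dataclass
class Char:
    text: str
    bbox: Rect  # position and size taken from the character's meta information


@dataclass
class TextBox:
    name: str
    chars: List[Char]


def center_inside(bbox: Rect, region: Rect) -> bool:
    """Assumed inclusion rule: the center of the character lies in the region."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]


def extract(boxes: List[TextBox], region: Rect) -> List[Tuple[str, str]]:
    """Return (text box name, extracted characters) for each box hit by the region."""
    result = []
    for box in boxes:
        picked = "".join(c.text for c in box.chars if center_inside(c.bbox, region))
        if picked:
            result.append((box.name, picked))
    return result


# Illustrative data: four characters in a row, and one character far below.
row = TextBox("box_a", [Char(ch, (i * 10, 0, i * 10 + 10, 10))
                        for i, ch in enumerate("abcd")])
far = TextBox("box_b", [Char("z", (0, 100, 10, 110))])
print(extract([row, far], (0, 0, 22, 10)))  # [('box_a', 'ab')]
```

Only text boxes that contribute at least one character are returned, which corresponds to identifying the text boxes from the extracted characters.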

Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. The effects described in the embodiment are merely a list of the most favorable effects resulting from the present invention, and the effects of the present invention are not limited to them. The embodiment described above and the modifications described below may be used in appropriate combinations, but detailed explanations of such combinations are omitted.

(Modifications)
(1) In the present embodiment, an example has been described in which the text included in the designated area designated by the user is output, but the present invention is not limited to this. For example, the entire page data of the PDF data displayed on the display unit may be used as the processing target area, and its text may be output.

(2) In the present embodiment, an example has been described in which the output content is determined using the degree of overlap of the text boxes, but the present invention is not limited to this. The characters contained in the designated area of the PDF data may be acquired using OCR (optical character recognition), and the output content may be determined based on the degree of match between the acquired characters and the text contained in each text box. This method takes advantage of the fact that characters acquired by OCR do not include invisible characters such as embedded characters.
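The OCR-based variant in (2) can be sketched by scoring each text box against the recognized characters. This is a minimal Python sketch under stated assumptions: the OCR result is passed in as a plain string (no OCR engine is called here), the sample strings are invented, and `difflib.SequenceMatcher` from the Python standard library stands in for the degree of match, which the description does not define.

```python
from difflib import SequenceMatcher
from typing import Dict


def match_degree(ocr_text: str, box_text: str) -> float:
    """Similarity in [0, 1]; a stand-in for the undefined degree of match."""
    return SequenceMatcher(None, ocr_text, box_text).ratio()


def choose_output(ocr_text: str, box_texts: Dict[str, str]) -> str:
    """Return the name of the text box whose text best matches the OCR result."""
    return max(box_texts, key=lambda name: match_degree(ocr_text, box_texts[name]))


# Illustrative data: OCR only sees the visible text, not the embedded one.
ocr_result = "Hello world"
box_texts = {"visible": "Hello world", "embedded": "p. 12 footer"}
print(choose_output(ocr_result, box_texts))  # visible
```

Because the embedded characters are not rendered, they do not appear in the OCR result, so the text box holding them scores low and is not selected.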

(3) In the present embodiment, an example has been described in which, when there is overlap between characters in the plurality of text boxes, the text contained in the text box with the higher calculated area ratio is determined as the output content, but the present invention is not limited to this. For example, both texts may be output, with the text contained in the text box with the lower calculated area ratio displayed in a display mode different from that of the text contained in the text box with the higher calculated area ratio.

(4) In the present embodiment, an example has been described in which, when there is no overlap between characters in the plurality of text boxes, the text contained in the text boxes is determined as the output content. In that case, the text of each text box may be displayed in a different display mode, so that each group of text can be recognized at a glance.

(5) In the present embodiment, PDF data has been described as an example, but the present invention is not limited to this. Any other page data can be used in the same manner as long as it has a plurality of text boxes and has meta information for each sentence and for each character constituting the sentence.

(6) In the present embodiment, the text output device has been described as a device including an input unit and a display unit, but the present invention is not limited to this. For example, the input unit and the display unit may be provided in a user terminal, and the text output device need not include them. In that case, the processing may be performed with the user terminal communicably connected to the text output device.

REFERENCE SIGNS LIST
1 Text output device
10 Control unit
11 Page data output unit
12 Area reception unit
13 Character extraction unit
14 Object identification unit
15 Overlap determination unit
16 Output object determination unit
17 Text output unit
30 Memory unit
31a Program
32 Document file memory unit
33 Text memory unit
36 Input unit
37 Display unit
39 Communication interface unit
41, 51, 62, 65, 68 Designated area
42, 51a, 51b, 61a, 61b, 64a, 64b, 67a, 67b Text box
61, 64, 67 PDF data
63, 66, 69 Editing screen

Claims

1. A text extraction device that extracts character strings from page data having a plurality of text objects according to positions of embedded characters, comprising:
character extraction means for extracting characters in a specified processing target area from the displayed page data;
object identification means for identifying the text objects that include the characters extracted by the character extraction means;
determination means for determining whether or not there is an overlap between characters in the plurality of text objects identified from the processing target area by the object identification means;
recognized character acquisition means for acquiring the characters included in the processing target area by optical character recognition; and
output object determination means for, when the determination means determines that there is an overlap, calculating a degree of coincidence between the characters of the sentence included in each of the plurality of text objects identified from the processing target area and the characters acquired by the recognized character acquisition means, and determining, based on the calculated degree of coincidence, the sentence included in the text object to be output from among the plurality of text objects.

2. The text extraction device according to claim 1,
wherein the output object determination means, when the determination means determines that there is no overlap, determines the sentences included in each of the plurality of text objects as the sentences to be output.

3. The text extraction device according to claim 1 or 2,
wherein the output object determination means further determines a display mode of the sentence to be output.

4. The text extraction device according to any one of claims 1 to 3,
further comprising text output means for arranging the sentence to be output determined by the output object determination means based on meta information corresponding to the text object and outputting it to an editing screen.

5. The text extraction device according to any one of claims 1 to 4,
further comprising area receiving means for receiving designation of a designated area including text in the page data,
wherein the character extraction means extracts the characters using the designated area as the processing target area.

6. The text extraction device according to any one of claims 1 to 5,
wherein the page data is data in PDF format.

7. A program for causing a computer to function as the text extraction device according to any one of claims 1 to 6.