JP2012190313A

JP2012190313A - Image processing device and program

Info

Publication number: JP2012190313A
Application number: JP2011053975A
Authority: JP
Inventors: Shintaro Adachi; 真太郎安達; Hiroyoshi Kamijo; 裕義上條; Katsuya Koyanagi; 勝也小柳; Kazuhiro Otani; 和宏大谷; Minoru Sodeura; 稔袖浦; Shigeru Okada; 茂岡田; Shinzui Cho; 臻瑞張
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2011-03-11
Filing date: 2011-03-11
Publication date: 2012-10-04
Anticipated expiration: 2031-03-11
Also published as: JP5721052B2

Abstract

PROBLEM TO BE SOLVED: To provide an image processing device capable of generating a feature character string that reflects the content of a document.

SOLUTION: An appearance frequency order candidate extraction section 32 extracts candidates for the constituent character strings that make up the feature character string, based on the order of appearance frequency in the document of the character strings extracted by a character string extraction section 310. An arrangement order candidate extraction section 34 extracts candidates for the constituent character strings based on the arrangement in the document of the character strings extracted by the character string extraction section 310. Based on character string order information from the appearance frequency order candidate extraction section 32 and arrangement order information from the arrangement order candidate extraction section 34, a feature character string generation section 36 selects two or more character strings from the candidates included in at least one of the character string order information and the arrangement order information, and concatenates them to generate the feature character string.

Description

The present invention relates to an image processing apparatus and a program.

Patent Document 1 discloses an information processing apparatus that generates a file name by inserting a predetermined delimiter between the character strings of the input index information.
Patent Document 2 discloses an image reading apparatus that reads a document image, recognizes characters in the read image, and, from the recognition result, uses a character string with a high appearance frequency as the file name for the document.
Patent Document 3 discloses a system that performs inference using knowledge consisting of rules for each document format, taking as input the layout information, font size information, and appearance frequency information of a target document.

Japanese Patent Laid-Open No. 10-289137
JP 2006-211261 A
JP 2006-309347 A

An object of the present invention is to provide an image processing apparatus capable of generating a characteristic character string that reflects the content of a document.

The invention according to claim 1 is an image processing apparatus comprising: character string extraction means for extracting a plurality of character strings from read information obtained by reading means that reads a document; first extraction means for extracting, from the character strings extracted by the character string extraction means, one or more first candidates for character strings constituting a characteristic character string for the document, based on the appearance frequency of the character strings in the document; second extraction means for extracting, from the character strings extracted by the character string extraction means, one or more second candidates for character strings constituting the characteristic character string, based on the arrangement of the character strings in the document; and characteristic character string generation means for selecting two or more character strings from at least one of the first candidates extracted by the first extraction means and the second candidates extracted by the second extraction means, and concatenating them to generate the characteristic character string.

The invention according to claim 2 is the image processing apparatus according to claim 1, wherein the characteristic character string generation means generates the characteristic character string by selecting two or more character strings whose meanings differ from one another, or two or more character strings that do not contain words having the same meaning as one another.

The invention according to claim 3 is the image processing apparatus according to claim 1, wherein the characteristic character string generation means determines the order in which the two or more selected character strings are concatenated, based on the attributes of each of the selected character strings.

The invention according to claim 4 is the image processing apparatus according to claim 1, wherein the characteristic character string generation means selects and concatenates two or more character strings such that the number of characters in the characteristic character string is within a predetermined number.

The invention according to claim 5 is the image processing apparatus according to claim 1, wherein the first extraction means weights the extracted first candidates such that a character string composed of a plurality of words is weighted more heavily than a character string composed of a single word, and the characteristic character string generation means preferentially selects first candidates to which the first extraction means has given a larger weight.

The invention according to claim 6 is the image processing apparatus according to claim 1, wherein the characteristic character string generation means determines the order in which the selected character strings are concatenated, based on the type of the document.

The invention according to claim 7 is the image processing apparatus according to claim 1, wherein, when neither the first candidates nor the second candidates include a character string relating to the type of the document, the characteristic character string generation means generates the characteristic character string so that it includes a character string relating to the type of the document.

The invention according to claim 8 is an image processing program that causes a computer to execute: a character string extraction step of extracting a plurality of character strings from read information obtained by reading means that reads a document; a first extraction step of extracting, from the character strings extracted in the character string extraction step, one or more first candidates for character strings constituting a characteristic character string for the document, based on the appearance frequency of the character strings in the document; a second extraction step of extracting, from the character strings extracted in the character string extraction step, one or more second candidates for character strings constituting the characteristic character string, based on the arrangement of the character strings in the document; and a characteristic character string generation step of selecting two or more character strings from at least one of the first candidates extracted in the first extraction step and the second candidates extracted in the second extraction step, and concatenating them to generate the characteristic character string.

According to the invention of claim 1, an image processing apparatus capable of generating a characteristic character string that reflects the content of a document can be provided.

According to the invention of claim 2, in addition to the effect of the invention of claim 1, generation of a characteristic character string in which words of the same meaning are duplicated can be avoided.

According to the invention of claim 3, in addition to the effect of the invention of claim 1, a better-looking characteristic character string can be generated than when this configuration is absent.

According to the invention of claim 4, in addition to the effect of the invention of claim 1, a better-looking characteristic character string can be generated than when this configuration is absent.

According to the invention of claim 5, in addition to the effect of the invention of claim 1, compound words are more readily included in the characteristic character string than when this configuration is absent.

According to the invention of claim 6, in addition to the effect of the invention of claim 1, a characteristic character string that reflects the content of the document can be generated.

According to the invention of claim 7, in addition to the effect of the invention of claim 1, a characteristic character string that reflects the content of the document can be generated.

According to the invention of claim 8, an image processing program capable of generating a characteristic character string that reflects the content of a document can be provided.

A diagram illustrating the hardware configuration of the image processing apparatus according to the present embodiment.
A diagram showing the processing program that runs on the image processing apparatus shown in FIG. 1.
A diagram showing the configuration of the appearance frequency order candidate extraction unit shown in FIG. 2.
A diagram showing the configuration of the arrangement order candidate extraction unit shown in FIG. 2.
A diagram showing the configuration of the characteristic character string generation unit shown in FIG. 2.
A diagram illustrating the classification reference information.
A flowchart showing the processing of the processing program.
A flowchart showing the processing of the processing program.
A diagram showing an example of a document to be processed by the image processing apparatus according to the present embodiment.

FIG. 1 is a diagram illustrating the hardware configuration of an image processing apparatus 2 according to the present embodiment.
As illustrated in FIG. 1, the image processing apparatus 2 comprises a control device 21, which includes a calculation unit 212 such as a CPU and a storage unit 214 such as a memory; a communication device 22; a recording device 24; a user interface device (UI device) 25; a printing device 26; and an image reading device 27.

The UI device 25 includes a display device, such as an LCD (Liquid Crystal Display) or a CRT (Cathode Ray Tube) display, and a keyboard, touch panel, or the like.
The printing device 26 is, for example, a printer, and prints character data, image data, or the like on a recording medium such as paper.
The image reading device 27 is, for example, a scanner; it reads an image or the like from a recording medium such as a document and converts it into read information in, for example, bitmap format.
In other words, the image processing apparatus 2 has the hardware components of a computer capable of information processing and of communication with other image processing apparatuses or terminals.
In the drawings that follow, substantially identical components and processes are given the same reference numerals.
Although the image processing apparatus 2 in this embodiment includes the printing device 26 and the image reading device 27, the image processing apparatus may instead be, for example, a PC without a printing device or an image reading device; in that case, it may be connected to an image reading device via a LAN (Local Area Network) or the like.

FIG. 2 is a diagram showing the configuration of a processing program 3 that runs on the image processing apparatus 2 shown in FIG. 1.
As shown in FIG. 2, the processing program 3 comprises a document reading information receiving unit 302, an automatic generation necessity designation unit 304, a character number setting unit 306, an arrangement analysis unit 308, a character string extraction unit 310, a language determination unit 312, a document classification unit 314, a classification reference storage unit 316, an appearance frequency order candidate extraction unit 32, an arrangement order candidate extraction unit 34, and a characteristic character string generation unit 36.
The processing program 3 is supplied to the image processing apparatus 2 via, for example, a storage medium 240 (FIG. 1), loaded into the storage unit 214, and executed on an OS (not shown) installed on the image processing apparatus 2, making concrete use of the hardware resources of the image processing apparatus 2.
Although the processing program 3 is implemented in software in this embodiment, all or part of it may be implemented in hardware such as an FPGA (Field Programmable Gate Array).

FIG. 3 is a diagram showing the configuration of the appearance frequency order candidate extraction unit 32 shown in FIG. 2.
As shown in FIG. 3, the appearance frequency order candidate extraction unit 32 comprises a frequency calculation unit 322, a composite character string determination unit 324, a character string arrangement determination unit 326, a character string rank determination unit 328, and a rank reference storage unit 330.
FIG. 4 is a diagram showing the configuration of the arrangement order candidate extraction unit 34 shown in FIG. 2.
As shown in FIG. 4, the arrangement order candidate extraction unit 34 comprises a character string position determination unit 342, a character string scale determination unit 344, an arrangement order candidate determination unit 326, and a scoring reference storage unit 348.

FIG. 5 is a diagram showing the configuration of the characteristic character string generation unit 36 shown in FIG. 2.
As shown in FIG. 5, the characteristic character string generation unit 36 comprises an arrangement order candidate storage unit 360, an appearance frequency order candidate storage unit 362, a document type character string storage unit 364, an arrangement order candidate division unit 366, an arrangement order candidate selection unit 368, an appearance frequency order candidate selection unit 370, a constituent character string determination unit 372, a synonym dictionary database (DB) 374, a character string attribute determination unit 376, and a character string concatenation unit 382.

In the processing program 3 (FIG. 2), the document reading information receiving unit 302 receives the read information (document reading information) obtained from the image reading device 27 and stores it so that it can be supplied to the arrangement analysis unit 308 and the character string extraction unit 310 for processing.
The automatic generation necessity designation unit 304 designates whether the characteristic character string for the document corresponding to the document reading information received by the document reading information receiving unit 302 is to be generated automatically by the image processing apparatus 2 or created by the user, for example by operating the UI device 25.
Specifically, the user operates the UI device 25 to designate whether the image processing apparatus 2 generates the characteristic character string automatically or the user creates it, and the automatic generation necessity designation unit 304 accepts the information generated by that operation (automatic generation necessity information).
The automatic generation necessity designation unit 304 then outputs the accepted automatic generation necessity information to the characteristic character string generation unit 36.
Here, a “characteristic character string” is a character string by which a person identifies a document; for example, when the document is converted into electronic data (an electronic file) or the like, it is the name of that electronic data or of the path folder (directory) or the like in which that electronic data is stored.

The character number setting unit 306 sets the number of characters (length) of the characteristic character string.
Specifically, for example, the user operates the UI device 25 to specify the number of characters to be set for the characteristic character string (the set number of characters), and the character number setting unit 306 accepts the information generated by that operation (character number information).
The character number setting unit 306 then outputs information indicating the set number of characters corresponding to that character number information (set character number information) to the characteristic character string generation unit 36.

The arrangement analysis unit 308 analyzes the document reading information, classifies the content of the document into characters, tables, natural images such as photographs, CG (Computer Graphics), paintings, and the like (object classification), and associates position information with each object.
The arrangement analysis unit 308 then outputs information indicating the analysis result (arrangement information) to the appearance frequency order candidate extraction unit 32, the arrangement order candidate extraction unit 34, and the document classification unit 314.
Here, the arrangement information indicates which objects (characters, tables, natural images such as photographs, CG, paintings, and the like) are included in the document corresponding to the document reading information, at which positions, and at what scale.
This arrangement information includes, for example, position information indicating the position of each object and scale information indicating the scale (dimensions, area, or the like) of each object.
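As an illustration only, the arrangement information described above could be held in a structure like the following Python sketch. The names `ObjectKind`, `LayoutObject`, `objects_of_kind`, and the fields `x`, `y`, `width`, `height` are assumptions made for this example and do not appear in the specification. Absolute coordinates are assumed here, although the specification also allows relative positions and relative scales.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ObjectKind(Enum):
    """Object classes named in the description (labels are illustrative)."""
    TEXT = "text"
    TABLE = "table"
    NATURAL_IMAGE = "natural_image"  # e.g. a photograph
    CG = "cg"
    PAINTING = "painting"


@dataclass(frozen=True)
class LayoutObject:
    """One classified object with its position and scale information."""
    kind: ObjectKind
    x: float       # position information: left edge (absolute, assumed)
    y: float       # position information: top edge (absolute, assumed)
    width: float   # scale information
    height: float  # scale information

    @property
    def area(self) -> float:
        """Occupied area, one possible measure of scale."""
        return self.width * self.height


def objects_of_kind(layout: List[LayoutObject], kind: ObjectKind) -> List[LayoutObject]:
    """Return the objects in the layout that belong to the given class."""
    return [obj for obj in layout if obj.kind is kind]
```

A list of such objects would play the role of the arrangement information passed to the downstream units.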

The position information may indicate an absolute position, such as position coordinates, or a position relative to other character strings or the like.
Similarly, the scale information may indicate the absolute scale of the object, such as its font or occupied area; its scale relative to other objects; or its difference from the average object scale.
The classification by the arrangement analysis unit 308 described above is performed by, for example, detection of the various lines, frame lines, ruled lines, or color information contained in the document, edge detection, and pattern matching. It is not, however, limited to these techniques.

The character string extraction unit 310 analyzes the document reading information using, for example, an OCR (Optical Character Recognition) function, and extracts the character strings contained in the document, for example by morphological analysis, in a form in which each character string has a definite meaning on its own.
Here, character recognition means identifying a character by matching the image data of the character obtained by reading against patterns stored in advance, and generating character data (a character string).
Morphological analysis means, for example, dividing a sentence into morphemes (the smallest meaningful units of language) on the basis of pre-stored information on grammatical rules and a dictionary in which words are registered, and determining the part of speech of each morpheme.
In this morphological analysis, the language of each character string is also determined (for example, whether the character string is Japanese, English, or another language).
The character string extraction unit 310 then outputs each extracted character string to the appearance frequency order candidate extraction unit 32, the arrangement order candidate extraction unit 34, and the document classification unit 314.

The language determination unit 312 analyzes the character strings extracted by the character string extraction unit 310 and determines which language the document is written in.
Specifically, for example, the language determination unit 312 determines the language that accounts for the largest share of the character strings in the document (that is, the language that appears most often in the document) to be the language of the document, and outputs information indicating the determination result (language information) to the characteristic character string generation unit 36.
Although in this embodiment the language determination unit 312 determines the language of the document by analyzing the character strings extracted by the character string extraction unit 310, the language may instead be determined by the user operating the UI device 25 to enter it manually or select it from a list.
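The majority rule described for the language determination unit 312 can be sketched as follows. This is a minimal illustration that assumes each extracted character string already carries a language label from the morphological analysis; the function name `document_language` and the label values are hypothetical. The specification does not say how ties are resolved; in this sketch the language seen first wins.

```python
from collections import Counter
from typing import Iterable, Optional


def document_language(string_languages: Iterable[str]) -> Optional[str]:
    """Return the language label attached to the most character strings.

    Returns None when no character strings were extracted. On a tie,
    the label encountered first is returned (an assumption of this sketch).
    """
    counts = Counter(string_languages)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For example, labels of `["ja", "ja", "en"]` would yield `"ja"` as the language of the document.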

The document classification unit 314 determines the type of the document based on the arrangement information from the arrangement analysis unit 308 and the information from the character string extraction unit 310, in accordance with the information on classification criteria (classification reference information) stored in the classification reference storage unit 316.
The document classification unit 314 then generates information indicating the determination result (document type information) and outputs it to the characteristic character string generation unit 36.
The classification reference storage unit 316 stores the classification reference information illustrated in FIG. 6.

FIG. 6 is a diagram illustrating the classification reference information.
The classification reference information is information (a table) indicating the relationship between document types and the conditions used to determine the document type; it shows the points awarded to each document type when each condition is met. The conditions and document types are not limited to those illustrated in FIG. 6.
The document classification unit 314 (FIG. 2) evaluates each condition based on the information from the arrangement analysis unit 308 and the information from the character string extraction unit 310 and, when a condition is met, awards the predetermined points to each document type.
The document classification unit 314 then determines the document type with the highest total score to be the type of the document to which the document reading information relates.
As an example, consider a case in which the character string “application form” (申請書) is present at the top center of the document, its scale (font size or the like) is at or above a predetermined scale, and a table is also present.
In this case, in the example shown in FIG. 6, the total score is 35 points for the type “application form”, 5 points for the type “approval request form” (稟議書), and 5 points for the type “design drawing” (設計図).
The document classification unit 314 therefore determines “application form”, which has the highest score, to be the type of the document.
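The point-based classification above can be sketched in Python as follows. FIG. 6 is not reproduced in this text, so the condition names and the per-condition points in `CRITERIA` are hypothetical values chosen only so that the totals match the worked example (35, 5, and 5 points); they are not the actual contents of the table.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical excerpt of classification reference information:
# condition -> points awarded to each document type when the condition is met.
CRITERIA: Dict[str, Dict[str, int]] = {
    "application_string_at_top_center": {"application form": 20},
    "string_scale_at_least_threshold": {"application form": 10},
    "table_present": {
        "application form": 5,
        "approval request form": 5,
        "design drawing": 5,
    },
}


def classify(matched: Iterable[str],
             criteria: Dict[str, Dict[str, int]] = CRITERIA) -> Tuple[str, Dict[str, int]]:
    """Sum the points of every matched condition and pick the top-scoring type."""
    # Start every known document type at zero points.
    totals = {doc_type: 0 for row in criteria.values() for doc_type in row}
    for condition in matched:
        for doc_type, points in criteria.get(condition, {}).items():
            totals[doc_type] += points
    best = max(totals, key=totals.get)
    return best, totals
```

Passing all three condition names reproduces the example: "application form" is selected with 35 points against 5 points for each of the other two types.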

The information indicating positions such as “upper” and “center” in the example of FIG. 6 may be expressed as absolute position information such as position coordinates, or may indicate a position relative to other character strings or the like.
If the automatic generation necessity information accepted by the automatic generation necessity designation unit 304 does not indicate that the image processing apparatus 2 is to generate the characteristic character string automatically (that is, when the user selects it), the document classification unit 314 may be configured not to perform its processing.
Furthermore, although the document classification unit 314 determines the document type in this embodiment, the user may specify the document type instead.

The appearance frequency order candidate extraction unit 32 extracts candidates for the character strings that constitute the characteristic character string (constituent character strings), based on the order of the appearance frequency in the document of the character strings extracted by the character string extraction unit 310. The constituent character string candidates extracted by the appearance frequency order candidate extraction unit 32 are referred to as appearance frequency order candidates.
In the appearance frequency order candidate extraction unit 32, the frequency calculation unit 322 (FIG. 3) calculates the number of appearances (appearance frequency) of each character string extracted by the character string extraction unit 310, and outputs each character string together with its appearance frequency to the character string rank determination unit 328.
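As a rough sketch of what the frequency calculation unit 322 computes, the following Python function counts how often each extracted character string occurs and pairs each string with its count. The function name `appearance_frequencies` is an assumption for this example; the specification does not prescribe a data format.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def appearance_frequencies(strings: Iterable[str]) -> List[Tuple[str, int]]:
    """Pair each distinct character string with its number of appearances,
    ordered from most to least frequent."""
    return Counter(strings).most_common()
```

The resulting list of (character string, frequency) pairs corresponds to the information passed on for ranking.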

The composite character string determination unit 324 determines, for example by morphological analysis, whether each character string extracted by the character string extraction unit 310 is a composite character string.
When a character string is determined to be a composite character string, the composite character string determination unit 324 outputs information indicating this (composite character string information) to the character string rank determination unit 328.
Here, a “composite character string” is a character string composed of a plurality of words.
For example, the character string “market scale” (市場規模) contains the two words “market” (市場) and “scale” (規模), so it is determined to be a composite character string.

The character string arrangement determination unit 326 determines, based on the arrangement information, whether each character string extracted by the character string extraction unit 310 is contained in a predetermined object.
When a character string is determined to be contained in a specific object, the character string arrangement determination unit 326 outputs information indicating this (specific arrangement character string information) to the character string rank determination unit 328.

The character string rank determination unit 328 determines the rank of each character string based on the information from the frequency calculation unit 322 and the composite character string information from the composite character string determination unit 324, in accordance with information on the ranking criteria (ranking criterion information) stored in the ranking criterion storage unit 330.
The character string rank determination unit 328 then generates information indicating the determination result (character string rank information) and outputs it to the characteristic character string generation unit 36.
The ranking criterion information stored in the ranking criterion storage unit 330 indicates, for example, a criterion under which a character string receives a higher score the more frequently it appears.
The ranking criterion information may also indicate that the score given to a character string is increased when that character string is the subject of composite character string information.

Furthermore, the ranking criterion information may indicate that the score given to a character string is reduced when that character string is the subject of specific arrangement character string information.
For example, when the character string arrangement determination unit 326 determines that a character string is contained in a table in the document, the ranking criterion information may indicate that the score given to that character string is 0.
Alternatively, in the same case, the ranking criterion information may indicate that the number of occurrences of the character string determined to be inside tables is subtracted from the appearance frequency of that character string calculated by the frequency calculation unit 322.
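
The two table-related adjustments described above can be sketched as follows. This is only an illustration: the function names and the idea of passing the in-table occurrence count as a separate argument are assumptions, not part of the described apparatus.

```python
def table_adjusted_count(total_count: int, count_in_tables: int) -> int:
    """Subtract occurrences found inside tables from the raw appearance count."""
    if count_in_tables < 0 or count_in_tables > total_count:
        raise ValueError("count_in_tables must be between 0 and total_count")
    return total_count - count_in_tables


def table_adjusted_score(score: int, count_in_tables: int) -> int:
    """Alternative rule: a string determined to be inside a table scores 0."""
    return 0 if count_in_tables > 0 else score
```

Under the first rule, a string that appears 10 times in total, 4 of them inside tables, would be ranked as if it appeared 6 times.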

The processing of the character string rank determination unit 328 and the ranking criterion information are described below using a specific example.
Suppose, for example, that the ranking criterion information gives 10 points to the character string with the highest appearance frequency, 8 points to the second, 4 points to the third, and 3 points to the fourth, and further multiplies the score by 5 when the character string is a composite character string.
Suppose also that the calculation result of the frequency calculation unit 322 is:
“scale”: 10 occurrences, “market”: 8 occurrences, “market scale”: 4 occurrences, “scale expansion”: 3 occurrences.

In this case, the rank and score of each character string by appearance frequency are:
1st: “scale” (10 points), 2nd: “market” (8 points), 3rd: “market scale” (4 points), 4th: “scale expansion” (3 points).
Because the composite character string determination unit 324 has determined that “market scale” and “scale expansion” are composite character strings, the character string rank determination unit 328 multiplies their scores by 5.
The character string rank determination unit 328 therefore determines the final ranking as:
1st: “market scale” (20 points), 2nd: “scale expansion” (15 points), 3rd: “scale” (10 points), 4th: “market” (8 points).
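
The worked example above can be reproduced with a short sketch. The point table and the multiplier come from the example; the function name, the data structures, and the choice to give 0 points below fourth place are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Points by frequency rank (1st to 4th place), as in the worked example.
RANK_POINTS = [10, 8, 4, 3]
# Multiplier applied to composite character strings in the worked example.
COMPOSITE_MULTIPLIER = 5


def rank_by_frequency(counts: Dict[str, int],
                      composites: Set[str]) -> List[Tuple[str, int]]:
    """Return (string, score) pairs sorted from highest to lowest score."""
    # Order the strings by appearance count, highest first.
    by_count = sorted(counts, key=lambda s: counts[s], reverse=True)
    scored = []
    for place, text in enumerate(by_count):
        # Assumption: strings ranked below the point table receive 0 points.
        points = RANK_POINTS[place] if place < len(RANK_POINTS) else 0
        if text in composites:
            points *= COMPOSITE_MULTIPLIER
        scored.append((text, points))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


counts = {"scale": 10, "market": 8, "market scale": 4, "scale expansion": 3}
ranking = rank_by_frequency(counts, composites={"market scale", "scale expansion"})
```

Here `ranking` places “market scale” (20 points) first and “market” (8 points) last, matching the final ranking in the example.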

The arrangement order candidate extraction unit 34 (FIG. 2) extracts candidates for the constituent character strings of the characteristic character string, based on the arrangement, in the document, of the character strings extracted by the character string extraction unit 310. The constituent character string candidates extracted by the arrangement order candidate extraction unit 34 are referred to as arrangement order candidates.
In the arrangement order candidate extraction unit 34, the character string position determination unit 342 (FIG. 4) determines the position of each character string extracted by the character string extraction unit 310 based on the arrangement information from the arrangement analysis unit 308.
The character string position determination unit 342 then associates each character string with its position information and outputs the result to the arrangement order candidate determination unit 326.

The character string size determination unit 344 determines the size of each character string extracted by the character string extraction unit 310 based on the arrangement information from the arrangement analysis unit 308.
The character string size determination unit 344 then associates each character string with its size information and outputs the result to the arrangement order candidate determination unit 326.

The arrangement order candidate determination unit 326 determines the arrangement-based rank of each character string based on the information from the character string position determination unit 342 and the information from the character string size determination unit 344, in accordance with information on the ranking criteria (scoring criterion information) stored in the scoring criterion storage unit 348.
The arrangement order candidate determination unit 326 then generates information indicating the determination result (arrangement rank information) and outputs it to the characteristic character string generation unit 36.
The scoring criterion information stored in the scoring criterion storage unit 348 indicates, for example, that a character string is given a higher score when its position in the document is relatively high or relatively central.
The scoring criterion information also indicates, for example, that a character string is given a higher score when its size in the document is relatively large, such as when it is set in a large font.

The processing of the arrangement order candidate determination unit 326 and the scoring criterion information are described below using a specific example.
Suppose, for example, that the scoring criterion information gives 10 points to a character string located above a predetermined position in the document and 5 points to a character string located closer to the horizontal center than a predetermined position.
Suppose also that the scoring criterion information gives 10 points to a character string whose size is at least 5 times the average size of the character strings in the document, and 8 points to a character string whose size is at least 2 times but less than 5 times the average.
Suppose further that the character string “estimate” (見積書) is located above the predetermined position and closer to the horizontal center than the predetermined position, and that its font size is 5 times the average font size.

On the other hand, suppose that the character string “market” is located below the predetermined position but closer to the horizontal center than the predetermined position, and that its font size is 3 times the average font size.
In this case, “estimate” receives 10 + 5 + 10 = 25 points, and “market” receives 0 + 5 + 8 = 13 points.
The arrangement order candidate determination unit 326 therefore determines the ranking as:
1st: “estimate” (25 points), 2nd: “market” (13 points).
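
The scoring in this example can be sketched as below. The point values and thresholds follow the example; the `PlacedString` type, its field names, and the reduction of position to two boolean flags are assumptions made to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class PlacedString:
    text: str
    above_threshold: bool        # located above the predetermined position
    horizontally_centered: bool  # closer to the horizontal center than the predetermined position
    size_ratio: float            # font size divided by the average font size


def arrangement_score(s: PlacedString) -> int:
    """Score one string using the point values of the worked example."""
    score = 0
    if s.above_threshold:
        score += 10
    if s.horizontally_centered:
        score += 5
    if s.size_ratio >= 5:
        score += 10
    elif s.size_ratio >= 2:
        score += 8
    return score


strings = [
    PlacedString("market", False, True, 3.0),
    PlacedString("estimate", True, True, 5.0),
]
# Rank from highest to lowest arrangement score.
scores = sorted(((s.text, arrangement_score(s)) for s in strings),
                key=lambda pair: pair[1], reverse=True)
```

With these inputs, “estimate” scores 10 + 5 + 10 = 25 and “market” scores 0 + 5 + 8 = 13, as in the example.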

The characteristic character string generation unit 36 selects two or more character strings from the candidates contained in at least one of the character string rank information and the arrangement rank information, based on the character string rank information from the appearance frequency order candidate extraction unit 32 and the arrangement rank information from the arrangement order candidate extraction unit 34, and concatenates them to generate the characteristic character string.
In the characteristic character string generation unit 36, the arrangement order candidate storage unit 360 (FIG. 5) stores the arrangement rank information from the arrangement order candidate extraction unit 34.
The appearance frequency order candidate storage unit 362 stores the character string rank information from the appearance frequency order candidate extraction unit 32.
The document type character string storage unit 364 stores a character string (document type character string) corresponding to the document type contained in the document type information from the document classification unit 314.

When the character strings in the arrangement rank information stored in the arrangement order candidate storage unit 360 include one that is longer than the number of characters set by the character number setting unit 306, the arrangement order candidate dividing unit 366 divides that character string, for example by morphological analysis, into parts that each carry a definite meaning on their own.
The arrangement order candidate selection unit 368 selects the character strings contained in the arrangement rank information (arrangement order candidates) in descending order of rank and outputs them to the constituent character string determination unit 372.
The appearance frequency order candidate selection unit 370 selects the character strings contained in the character string rank information (appearance frequency order candidates) in descending order of rank and outputs them to the constituent character string determination unit 372.

Note that when the arrangement order candidate selection unit 368 receives, from the automatic generation necessity designation unit 304, automatic generation necessity information indicating that the user will create the characteristic character string by operating the UI device 25, it may transmit a list of the arrangement order candidates contained in the arrangement rank information, sorted from the highest rank, to the UI device 25, such as a display device.
Similarly, when the appearance frequency order candidate selection unit 370 receives such automatic generation necessity information from the automatic generation necessity designation unit 304, it may transmit a list of the appearance frequency order candidates contained in the character string rank information, sorted from the highest rank, to the UI device 25.
The UI device 25 displays the list of arrangement order candidates and the list of appearance frequency order candidates, each sorted from the highest rank.
In this case, the user operates the UI device 25 to select the arrangement order candidates and appearance frequency order candidates that constitute the characteristic character string.

The constituent character string determination unit 372 compares the arrangement order candidate selected by the arrangement order candidate selection unit 368 with the appearance frequency order candidate selected by the appearance frequency order candidate selection unit 370, and determines whether each is suitable as a constituent character string of the characteristic character string for the document corresponding to the document reading information. The specific processing is described later.
The synonym dictionary DB 374 stores a synonym dictionary containing, for example, a list of combinations of character strings that are synonyms of one another.
The character string attribute determination unit 376 determines the attribute of a character string, for example by morphological analysis.
Here, the attribute of a character string may distinguish, for example, the part of speech, such as noun, verb, or adjective. When the character string is a noun, it may distinguish between common nouns and proper nouns. When the character string is a proper noun, it may further distinguish among personal names, names denoting specific non-human entities such as corporate names, place names, and so on.
When the attribute is a place name, the attribute may further distinguish between country names, region names, and so on.

Based on the synonym dictionary stored in the synonym dictionary DB 374, the constituent character string determination unit 372 determines whether the arrangement order candidate selected by the arrangement order candidate selection unit 368 and the appearance frequency order candidate selected by the appearance frequency order candidate selection unit 370 are synonyms of each other (determination 1-1).
Also based on the synonym dictionary, the constituent character string determination unit 372 determines whether the arrangement order candidate and the appearance frequency order candidate both contain character strings that are synonyms (determination 1-2).
Furthermore, the constituent character string determination unit 372 may determine whether the arrangement order candidate and the appearance frequency order candidate are identical character strings, whether they contain the same character string, and whether either one contains the other (determination 1-3).
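
Determinations 1-1 to 1-3 can be sketched as three predicates. This is an illustration under stated assumptions: the synonym dictionary is modeled as a list of sets, and whitespace splitting stands in for the morphological analysis that would segment real Japanese text. All function names are hypothetical.

```python
from typing import FrozenSet, List

SynonymGroups = List[FrozenSet[str]]


def words(text: str) -> List[str]:
    """Stand-in for morphological analysis (assumption: whitespace-separated words)."""
    return text.split()


def determination_1_1(a: str, b: str, groups: SynonymGroups) -> bool:
    """The two candidates are synonyms of each other."""
    return a != b and any(a in g and b in g for g in groups)


def determination_1_2(a: str, b: str, groups: SynonymGroups) -> bool:
    """Each candidate contains a word that is a synonym of a word in the other."""
    return any(x != y and x in g and y in g
               for g in groups for x in words(a) for y in words(b))


def determination_1_3(a: str, b: str) -> bool:
    """Identical strings, a shared word, or one string containing the other."""
    return a == b or bool(set(words(a)) & set(words(b))) or a in b or b in a


def overlaps(a: str, b: str, groups: SynonymGroups) -> bool:
    """True when at least one of determinations 1-1 to 1-3 is affirmative."""
    return (determination_1_1(a, b, groups)
            or determination_1_2(a, b, groups)
            or determination_1_3(a, b))


groups = [frozenset({"estimate", "quotation"})]
```

For instance, `overlaps("market scale", "scale expansion", groups)` is affirmative because both strings contain “scale”, whereas “estimate” and “market” share nothing and would both be kept.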

When at least one of determinations 1-1 to 1-3 is affirmative, the constituent character string determination unit 372 retains the arrangement order candidate as a constituent character string candidate and discards the appearance frequency order candidate.
The constituent character string determination unit 372 then controls the appearance frequency order candidate selection unit 370 so that it selects a new appearance frequency order candidate.
In this case, the appearance frequency order candidate selection unit 370 selects, from the character string rank information stored in the appearance frequency order candidate storage unit 362, the highest-ranked appearance frequency order candidate that has not yet been selected, and outputs it to the constituent character string determination unit 372.
The constituent character string determination unit 372 then performs determinations 1-1 to 1-3 on the retained arrangement order candidate and the newly selected appearance frequency order candidate in the same manner as above.

The constituent character string determination unit 372 outputs the arrangement order candidate and the appearance frequency order candidate to the character string attribute determination unit 376.
The character string attribute determination unit 376 determines whether the attribute of the arrangement order candidate and the attribute of the appearance frequency order candidate are the same, and outputs information indicating the result to the constituent character string determination unit 372.
Based on this information from the character string attribute determination unit 376, the constituent character string determination unit 372 determines whether the attribute of the arrangement order candidate and the attribute of the appearance frequency order candidate are the same (determination 2).

When determination 2 is affirmative, the constituent character string determination unit 372 retains the arrangement order candidate as a constituent character string candidate and discards the appearance frequency order candidate.
The constituent character string determination unit 372 then controls the appearance frequency order candidate selection unit 370 so that it selects a new appearance frequency order candidate.
In this case, the appearance frequency order candidate selection unit 370 selects, from the character string rank information stored in the appearance frequency order candidate storage unit 362, the highest-ranked appearance frequency order candidate that has not yet been selected, and outputs it to the constituent character string determination unit 372.
The constituent character string determination unit 372 then performs determination 2 on the retained arrangement order candidate and the newly selected appearance frequency order candidate in the same manner as above.
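
The discard-and-reselect loop described above can be sketched as follows. The two predicates are passed in as callables; the simple substring check and the attribute table used in the example are placeholders chosen for illustration, not the determinations performed by units 372 and 376.

```python
from typing import Callable, Iterable, Optional

Pairwise = Callable[[str, str], bool]


def next_constituent(retained: str,
                     frequency_candidates: Iterable[str],
                     overlaps: Pairwise,
                     same_attribute: Pairwise) -> Optional[str]:
    """Return the highest-ranked frequency candidate that passes every check.

    `frequency_candidates` must already be sorted from highest to lowest rank.
    """
    for candidate in frequency_candidates:
        if overlaps(retained, candidate) or same_attribute(retained, candidate):
            continue  # discard this candidate and try the next-ranked one
        return candidate
    return None  # every candidate was discarded


# Hypothetical attribute table standing in for the attribute determination.
ATTRIBUTES = {
    "address change application": "common noun",
    "address": "common noun",
    "form": "common noun",
    "Yokohama": "place name",
}


def contains_either_way(a: str, b: str) -> bool:
    return a in b or b in a


def has_same_attribute(a: str, b: str) -> bool:
    return ATTRIBUTES[a] == ATTRIBUTES[b]


chosen = next_constituent("address change application",
                          ["address", "form", "Yokohama"],
                          contains_either_way, has_same_attribute)
```

In this run, “address” is discarded because the retained candidate contains it, “form” is discarded because it shares the attribute “common noun”, and “Yokohama” is selected.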

When all of determinations 1-1 to 1-3 and determination 2 are negative, the constituent character string determination unit 372 outputs the arrangement order candidate and the appearance frequency order candidate to the character string concatenation unit 382 as constituent character strings.
Note that the constituent character string determination unit 372 may determine whether the arrangement order candidate and the appearance frequency order candidate contain the document type character string stored in the document type character string storage unit 364.
In this case, when either the arrangement order candidate or the appearance frequency order candidate is determined to contain the document type character string, the constituent character string determination unit 372 may output the candidate containing it to the character string concatenation unit 382 as a constituent character string, regardless of determinations 1-1 to 1-3 and determination 2.
When neither the arrangement order candidate nor the appearance frequency order candidate is determined to contain the document type character string, the constituent character string determination unit 372 may output the document type character string to the character string concatenation unit 382 as a constituent character string, in addition to the arrangement order candidate and the appearance frequency order candidate.

Furthermore, the constituent character string determination unit 372 may decide on an arrangement order candidate or appearance frequency order candidate with a specific attribute as a constituent character string, depending on the document type character string.
For example, when the document type character string is “application form” (申請書), an arrangement order candidate or appearance frequency order candidate whose attribute is “personal name” may be output to the character string concatenation unit 382 as a constituent character string, regardless of determinations 1-1 to 1-3 and determination 2.
Furthermore, the constituent character string determination unit 372 may change the criteria for deciding constituent character strings as appropriate, based on the language information from the language determination unit 312.

Note that when no arrangement order candidate is extracted by the arrangement order candidate extraction unit 34, the constituent character string determination unit 372 controls the appearance frequency order candidate selection unit 370 so that it selects another appearance frequency order candidate.
In this case, the appearance frequency order candidate selection unit 370 selects, from the character string rank information stored in the appearance frequency order candidate storage unit 362, the highest-ranked appearance frequency order candidate that has not yet been selected, and outputs it to the constituent character string determination unit 372.
The constituent character string determination unit 372 may then perform determinations 1-1 to 1-3 and determination 2 on the original appearance frequency order candidate and the newly selected appearance frequency order candidate in the same manner as above.
Although the description above has the constituent character string determination unit 372 retain the arrangement order candidate depending on the results of determinations 1-1 to 1-3 and determination 2, the appearance frequency order candidate may instead be retained and a new arrangement order candidate selected.

The character string concatenation unit 382 first decides the number of constituent character strings that make up the characteristic character string.
Specifically, the character string concatenation unit 382 receives a plurality of constituent character strings from the constituent character string determination unit 372 and calculates the total number of characters in them (total character count).
The character string concatenation unit 382 also receives the set character number information from the character number setting unit 306.
The character string concatenation unit 382 then determines whether the total character count of the constituent character strings is within the set number of characters indicated by the set character number information.
When the total character count is within the set number of characters, the character string concatenation unit 382 controls the constituent character string determination unit 372 and the appearance frequency order candidate selection unit 370 so that a further appearance frequency order candidate is selected.

In this case, the appearance frequency order candidate selection unit 370 selects, from the character string rank information stored in the appearance frequency order candidate storage unit 362, the highest-ranked appearance frequency order candidate that has not yet been selected, and outputs it to the constituent character string determination unit 372.
The constituent character string determination unit 372 then performs determinations 1-1 to 1-3 and determination 2 on the constituent character strings already output to the character string concatenation unit 382 and the newly selected appearance frequency order candidate. When it decides that the newly selected appearance frequency order candidate is a constituent character string, it outputs that candidate to the character string concatenation unit 382.

On the other hand, when the total character count of the constituent character strings is not within the set number of characters, the character string concatenation unit 382 discards the constituent character string most recently received from the constituent character string determination unit 372.
The above processing decides the number of constituent character strings that make up the characteristic character string.
This processing is described below using a specific example.
Suppose, for example, that the set number of characters is 20, that “住所変更申請書” (address change application form; 7 characters) is selected as the arrangement order candidate, and that the following are selected in order as appearance frequency order candidates: #1 “横浜市西区” (Nishi-ku, Yokohama-shi; 5 characters), #2 “転居日” (moving date; 3 characters), #3 “世帯主” (head of household; 3 characters), and #4 “同居者” (cohabitant; 3 characters).
In this case, the arrangement order candidate and appearance frequency order candidates #1 to #3 total 18 characters, whereas the arrangement order candidate and appearance frequency order candidates #1 to #4 total 21 characters.

Therefore, in this case, the character string concatenation unit 382 discards the constituent character string it most recently received from the constituent character string determination unit 372 (appearance frequency order candidate #4, “同居者” (cohabitant)), and determines the arrangement order candidate and appearance frequency order candidates #1 to #3 as the constituent character strings to be concatenated.
The character string concatenation unit 382 thus determines that the characteristic character string is made up of four constituent character strings (the arrangement order candidate and appearance frequency order candidates #1 to #3).
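
The character-count check in this example can be sketched as follows. This is a minimal illustration only: the function name `select_within_limit` is hypothetical, and determinations 1-1 to 1-3 and determination 2 are omitted so that only the limit check is shown.

```python
def select_within_limit(arrangement, frequency_candidates, limit):
    """Accept candidates in rank order until the set number of characters is exceeded.

    Hypothetical helper; it models only the character-count comparison.
    """
    accepted = [arrangement]
    total = len(arrangement)
    for candidate in frequency_candidates:
        if total + len(candidate) > limit:
            # The most recently received candidate is discarded and selection stops.
            break
        accepted.append(candidate)
        total += len(candidate)
    return accepted, total


arrangement = "住所変更申請書"                      # 7 characters
frequency = ["横浜市西区", "転居日", "世帯主", "同居者"]  # 5, 3, 3, 3 characters
accepted, total = select_within_limit(arrangement, frequency, limit=20)
print(accepted, total)
```

With a limit of 20, the fourth appearance frequency order candidate would bring the total to 21, so the sketch keeps four strings totalling 18 characters, matching the example above.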

When the characteristic character string is generated, a delimiter such as “-” (hyphen) or “_” (underscore) may be inserted between the constituent character strings.
In this case, when comparing the total number of characters with the set number of characters, the character string concatenation unit 382 may add the number of inserted delimiters to the total number of characters.
In the above embodiment, when the total number of characters is within the set number of characters, the character string concatenation unit 382 controls the constituent character string determination unit 372 and the appearance frequency order candidate selection unit 370 so that another appearance frequency order candidate is selected. Alternatively, the character string concatenation unit 382 may control the constituent character string determination unit 372 and the arrangement order candidate selection unit 368 so that another arrangement order candidate is selected.
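
As a rough sketch of how counting delimiters changes the comparison, the following assumes one single-character delimiter between each pair of adjacent constituent character strings. The function name `total_with_delimiters` is hypothetical.

```python
def total_with_delimiters(parts, delimiter="_"):
    """Total length of the joined string, counting the inserted delimiters."""
    if not parts:
        return 0
    return sum(len(p) for p in parts) + len(delimiter) * (len(parts) - 1)


three = ["住所変更申請書", "横浜市西区", "転居日"]
four = three + ["世帯主"]

limit = 20
print(total_with_delimiters(three) <= limit)  # 15 characters + 2 delimiters = 17
print(total_with_delimiters(four) <= limit)   # 18 characters + 3 delimiters = 21
```

Under this assumption, a set of strings that fits within the limit without delimiters (18 characters) no longer fits once the three delimiters are counted.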

Next, the character string concatenation unit 382 determines the order in which the constituent character strings are concatenated.
The character string concatenation unit 382 places the constituent character string that is an arrangement order candidate at the head of the characteristic character string, and concatenates after it the constituent character strings that are appearance frequency order candidates.
When there are a plurality of appearance frequency order candidates, the character string concatenation unit 382 concatenates them so that candidates with a higher appearance frequency come earlier.
When no arrangement order candidate has been determined as a constituent character string, the character string concatenation unit 382 likewise concatenates the appearance frequency order candidates so that those with a higher appearance frequency come earlier.
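
The default ordering rule can be sketched as below. The `Constituent` record, its field names, and the frequency counts are assumptions made for illustration and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Constituent:
    text: str
    source: str        # "arrangement" or "frequency" (hypothetical labels)
    frequency: int = 0  # appearance count; only meaningful for "frequency"


def concatenation_order(constituents):
    """Arrangement order candidates first, then frequency candidates, most frequent first."""
    arrangement = [c for c in constituents if c.source == "arrangement"]
    frequency = sorted(
        (c for c in constituents if c.source == "frequency"),
        key=lambda c: c.frequency,
        reverse=True,
    )
    return arrangement + frequency


items = [
    Constituent("転居日", "frequency", 4),      # counts are made-up values
    Constituent("住所変更申請書", "arrangement"),
    Constituent("横浜市西区", "frequency", 9),
]
print("".join(c.text for c in concatenation_order(items)))
```

If no arrangement order candidate is present, the same function simply returns the appearance frequency order candidates sorted by descending frequency.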

Through the above processing, the character string concatenation unit 382 concatenates the constituent character strings and generates the characteristic character string.
The character string concatenation unit 382 also transmits the generated characteristic character string to the UI device 25, which displays it.
When the character string concatenation unit 382 receives a document classification character string as a constituent character string, it may place that document classification character string at the head of the characteristic character string.
Likewise, when the character string concatenation unit 382 receives, as a constituent character string, an arrangement order candidate or appearance frequency order candidate that contains a document type character string, it may place that candidate at the head of the characteristic character string.

The character string concatenation unit 382 may also determine the concatenation order of the constituent character strings based on their attributes, regardless of whether each constituent character string is an arrangement order candidate or an appearance frequency order candidate.
For example, when there are a constituent character string with the attribute “place name” and a constituent character string with the attribute “person name”, the “place name” string may be concatenated before the “person name” string.
The character string concatenation unit 382 may also determine the concatenation order according to the language of the document, regardless of whether each constituent character string is an arrangement order candidate or an appearance frequency order candidate.
For example, when there are a constituent character string with the attribute “country name” and a constituent character string with the attribute “region name”, the “country name” string may be concatenated before the “region name” string if the language information from the language determination unit 312 indicates “Japanese”, and the “region name” string may be concatenated before the “country name” string if the language information indicates “English”.
The character string concatenation unit 382 may also determine the concatenation order of the constituent character strings based on the document type character string, regardless of whether each constituent character string is an arrangement order candidate or an appearance frequency order candidate.
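
The language-dependent variant could look something like the following sketch. The table `ATTRIBUTE_ORDER`, the language codes, the attribute labels, and the sample strings are all hypothetical. Only the country/region rule itself comes from the description above.

```python
# Hypothetical priority table: attributes listed earlier are concatenated earlier.
ATTRIBUTE_ORDER = {
    "ja": ["country", "region"],  # Japanese: country name before region name
    "en": ["region", "country"],  # English: region name before country name
}


def order_by_attribute(constituents, language):
    """Sort (text, attribute) pairs by the attribute priority for the given language.

    Attributes not listed for the language keep their relative order at the end,
    because sorted() is stable.
    """
    priority = {attr: i for i, attr in enumerate(ATTRIBUTE_ORDER[language])}
    return sorted(constituents, key=lambda c: priority.get(c[1], len(priority)))


pairs = [("Kanagawa", "region"), ("Japan", "country")]
print([text for text, _ in order_by_attribute(pairs, "ja")])
print([text for text, _ in order_by_attribute(pairs, "en")])
```

A table-driven rule like this keeps the ordering policy separate from the concatenation step, so further attributes or languages could be added without changing the sorting code.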

FIGS. 7A and 7B are flowcharts showing the processing (S10) of the processing program 3.
In step 102 (S102), the document reading information receiving unit 302 receives document reading information obtained by reading a document.
In step 104 (S104), based on the document reading information, the layout analysis unit 308 generates layout information and the character string extraction unit 310 extracts character strings.

In step 106 (S106), the appearance frequency order candidate extraction unit 32 extracts appearance frequency order candidates based on the order of appearance frequency in the document.
In step 108 (S108), the arrangement order candidate extraction unit 34 extracts arrangement order candidates based on the arrangement of the character strings in the document.
In step 110 (S110), the automatic generation necessity designation unit 304 determines whether the image processing apparatus 2 is to generate the characteristic character string automatically. If it determines that automatic generation has been set, the process proceeds to S120; otherwise (that is, if the setting is for the user to create the characteristic character string, for example by operating the UI device 25), the process proceeds to S112.
In step 112 (S112), the UI device 25 displays the list of appearance frequency order candidates extracted by the appearance frequency order candidate extraction unit 32 and the arrangement order candidates extracted by the arrangement order candidate extraction unit 34.
In step 114 (S114), the user selects appearance frequency order candidates and arrangement order candidates, whereby the characteristic character string is selected, and the process ends.

In step 120 (S120), the document classification unit 314 determines the type of the document according to the classification criteria information.
In step 122 (S122), the characteristic character string generation unit 36 determines whether the number of extracted arrangement order candidates is non-zero. If it is not 0, the process proceeds to S126; if it is 0, the process proceeds to S124.
In step 124 (S124), the characteristic character string generation unit 36 determines whether the number of extracted appearance frequency order candidates is non-zero. If it is not 0, the process proceeds to S130; if it is 0, the process ends.

In step 126 (S126), the arrangement order candidate selection unit 368 of the characteristic character string generation unit 36 selects arrangement order candidates in descending order of rank.
In step 128 (S128), the characteristic character string generation unit 36 determines whether the number of extracted appearance frequency order candidates is non-zero. If it is not 0, the process proceeds to S130; if it is 0, the process ends.
In step 130 (S130), the appearance frequency order candidate selection unit 370 of the characteristic character string generation unit 36 selects appearance frequency order candidates in descending order of rank.

In step 142 (S142), the constituent character string determination unit 372 of the characteristic character string generation unit 36 determines whether the selected arrangement order candidate and appearance frequency order candidate are synonyms of each other. If it determines that they are synonyms, the process proceeds to S146; if it determines that they are not, the process proceeds to S144.
In step 144 (S144), the constituent character string determination unit 372 of the characteristic character string generation unit 36 determines whether the attribute of the arrangement order candidate and the attribute of the appearance frequency order candidate are the same. If it determines that the attributes are the same, the process proceeds to S146; if it determines that they are not, the process proceeds to S148.
In step 146 (S146), the characteristic character string generation unit 36 discards the appearance frequency order candidate being processed.

In step 148 (S148), the character string concatenation unit 382 of the characteristic character string generation unit 36 determines whether the total number of characters of the constituent character strings is within the set number of characters. If it is, the process proceeds to S130; otherwise, the process proceeds to S150.
In step 150 (S150), the character string concatenation unit 382 of the characteristic character string generation unit 36 discards the constituent character string it most recently received from the constituent character string determination unit 372.
In step 152 (S152), the character string concatenation unit 382 of the characteristic character string generation unit 36 determines the order in which the constituent character strings are concatenated, concatenates them, and ends the process.

The processing of the image processing apparatus 2 according to the present embodiment is described below with a specific example.
FIG. 8 is a diagram illustrating an example of a document to be processed by the image processing apparatus 2 according to the present embodiment.
In the document illustrated in FIG. 8, the character string “申請書” (application form) is written at the top center in a font larger than that of the other character strings. Below it on the right, the character string “○○○市市長殿” (To the Mayor of ○○○ City) is written in a font smaller than that of “申請書” but larger than that of the body text.
The arrangement order candidate extraction unit 34 therefore extracts, in order, arrangement order candidate #1 “申請書” (application form) and arrangement order candidate #2 “○○○市市長殿” (To the Mayor of ○○○ City).
The appearance frequency order candidate extraction unit 32 extracts, in descending order of appearance frequency, appearance frequency order candidate #1 “グラウンド利用” (ground use), candidate #2 “申請者” (applicant), candidate #3 “富士太郎” (Fuji Taro), candidate #4 “○○○市” (○○○ City), candidate #5 “申請書” (application form), and candidate #6 “市長殿” (To the Mayor).

The characteristic character string generation unit 36 first selects arrangement order candidate #1 “申請書” (application form) and appearance frequency order candidate #1 “グラウンド利用” (ground use).
The characteristic character string generation unit 36 determines that arrangement order candidate #1 “申請書” and appearance frequency order candidate #1 “グラウンド利用” are not synonyms of each other, do not contain the same synonym, and have different attributes.
The characteristic character string generation unit 36 therefore determines arrangement order candidate #1 “申請書” and appearance frequency order candidate #1 “グラウンド利用” as constituent character strings.

When the set number of characters is 15, arrangement order candidate #1 “申請書” (application form, 3 characters) and appearance frequency order candidate #1 “グラウンド利用” (ground use, 7 characters) total 10 characters, so the characteristic character string generation unit 36 further selects appearance frequency order candidate #2 “申請者” (applicant).
Because arrangement order candidate #1 “申請書” and appearance frequency order candidate #2 “申請者” both contain the character string “申請” (application), the characteristic character string generation unit 36 determines that they are synonyms of each other.
Appearance frequency order candidate #2 “申請者” is therefore discarded.

Next, the characteristic character string generation unit 36 selects appearance frequency order candidate #3 “富士太郎” (Fuji Taro).
The characteristic character string generation unit 36 determines that appearance frequency order candidate #3 “富士太郎” and the already determined arrangement order candidate #1 “申請書” (application form) and appearance frequency order candidate #1 “グラウンド利用” (ground use) are not synonyms of each other, do not contain the same synonym, and have different attributes.
The characteristic character string generation unit 36 therefore newly determines appearance frequency order candidate #3 “富士太郎” as a constituent character string.

Arrangement order candidate #1 “申請書” (application form), appearance frequency order candidate #1 “グラウンド利用” (ground use), and appearance frequency order candidate #3 “富士太郎” (Fuji Taro) total 14 characters, so the characteristic character string generation unit 36 further selects appearance frequency order candidate #4 “○○○市” (○○○ City).
The characteristic character string generation unit 36 determines that appearance frequency order candidate #4 “○○○市” and the already determined arrangement order candidate #1 “申請書”, appearance frequency order candidate #1 “グラウンド利用”, and appearance frequency order candidate #3 “富士太郎” are not synonyms of each other, do not contain the same synonym, and have different attributes.
However, adding appearance frequency order candidate #4 “○○○市” brings the total to 18 characters, which exceeds the set number of characters.
Appearance frequency order candidate #4 “○○○市” is therefore discarded, and arrangement order candidate #1 “申請書”, appearance frequency order candidate #1 “グラウンド利用”, and appearance frequency order candidate #3 “富士太郎” are determined as the constituent character strings.

Next, the characteristic character string generation unit 36 concatenates these constituent character strings with arrangement order candidate #1 “申請書” (application form) at the head, followed by appearance frequency order candidate #1 “グラウンド利用” (ground use) and then appearance frequency order candidate #3 “富士太郎” (Fuji Taro).
The characteristic character string generation unit 36 thus generates the characteristic character string “申請書グラウンド利用富士太郎”.
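
The walk-through above can be reproduced with a short sketch. Two simplifications are assumed here and are not part of the embodiment. First, the synonym determination is replaced by a hypothetical `shares_substring` check, which treats two strings as synonymous when they share any two-character substring; this mirrors the “申請” case but stands in for the synonym dictionary DB 374. Second, the attribute comparison of S144 is omitted.

```python
def shares_substring(a, b, n=2):
    """Hypothetical stand-in for the synonym check: do a and b share an n-character substring?"""
    grams = {a[i:i + n] for i in range(len(a) - n + 1)}
    return any(b[i:i + n] in grams for i in range(len(b) - n + 1))


def generate_characteristic_string(arrangement, frequency_candidates, limit):
    accepted = [arrangement]
    for candidate in frequency_candidates:       # S130: next highest-ranked candidate
        # S142 / S146: discard a candidate judged synonymous with an accepted string.
        if any(shares_substring(candidate, s) for s in accepted):
            continue
        # S148 / S150: discard the candidate and stop once the limit would be exceeded.
        if sum(len(s) for s in accepted) + len(candidate) > limit:
            break
        accepted.append(candidate)
    # S152: arrangement order candidate first, then frequency candidates in rank order.
    return "".join(accepted)


frequency = ["グラウンド利用", "申請者", "富士太郎", "○○○市", "申請書", "市長殿"]
result = generate_characteristic_string("申請書", frequency, limit=15)
print(result)
```

Under these assumptions the sketch discards “申請者” as a synonym of “申請書”, stops at “○○○市” because the total would reach 18 characters, and yields the same string as the example.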

2 ... image processing apparatus,
3 ... processing program,
302 ... document reading information receiving unit,
304 ... automatic generation necessity designation unit,
306 ... number-of-characters setting unit,
308 ... layout analysis unit,
310 ... character string extraction unit,
312 ... language determination unit,
314 ... document classification unit,
316 ... classification criteria storage unit,
32 ... appearance frequency order candidate extraction unit,
322 ... frequency calculation unit,
324 ... composite character string determination unit,
326 ... character string arrangement determination unit,
328 ... character string rank determination unit,
330 ... ranking reference storage unit,
34 ... arrangement order candidate extraction unit,
342 ... character string position determination unit,
344 ... character string scale determination unit,
326 ... arrangement order candidate determination unit,
348 ... scoring standard storage unit,
36 ... characteristic character string generation unit,
360 ... arrangement order candidate storage unit,
362 ... appearance frequency order candidate storage unit,
364 ... document type character string storage unit,
366 ... arrangement order candidate division unit,
368 ... arrangement order candidate selection unit,
370 ... appearance frequency order candidate selection unit,
372 ... constituent character string determination unit,
374 ... synonym dictionary DB,
376 ... character string attribute determination unit,
382 ... character string concatenation unit

Claims

An image processing apparatus comprising:
character string extraction means for extracting a plurality of character strings from read information obtained by reading means that reads a document;
first extraction means for extracting, from the character strings extracted by the character string extraction means, one or more first candidates for character strings constituting a characteristic character string relating to the document, based on the appearance frequency of the character strings in the document;
second extraction means for extracting, from the character strings extracted by the character string extraction means, one or more second candidates for character strings constituting the characteristic character string, based on the arrangement of the character strings in the document; and
a characteristic character string generation unit that selects two or more character strings from at least one of the first candidates extracted by the first extraction means and the second candidates extracted by the second extraction means, and generates the characteristic character string.

The image processing apparatus according to claim 1, wherein the characteristic character string generation unit selects two or more character strings having mutually different meanings, or two or more character strings that do not contain words of the same meaning, and generates the characteristic character string.

The image processing apparatus according to claim 1, wherein the characteristic character string generation unit determines an order in which the two or more selected character strings are connected based on attributes of the two or more selected character strings.

The image processing apparatus according to claim 1, wherein the characteristic character string generation unit selects and connects two or more character strings so that the number of characters of the characteristic character string is within a predetermined number.

The image processing apparatus according to claim 1, wherein the first extraction means weights the extracted first candidates so that a character string composed of a plurality of words receives a larger weight than a character string composed of a single word, and
the characteristic character string generation unit preferentially selects first candidates given a larger weight by the first extraction means.

The image processing apparatus according to claim 1, wherein the characteristic character string generation unit determines an order of connecting the selected character strings based on a type of document.

The image processing apparatus according to claim 1, wherein, when neither the first candidates nor the second candidates include a character string relating to the type of the document, the characteristic character string generation unit generates the characteristic character string so that it includes a character string relating to the type of the document.

An image processing program for causing a computer to execute:
a character string extraction step of extracting a plurality of character strings from read information obtained by reading means that reads a document;
a first extraction step of extracting, from the character strings extracted in the character string extraction step, one or more first candidates for character strings constituting a characteristic character string relating to the document, based on the appearance frequency of the character strings in the document;
a second extraction step of extracting, from the character strings extracted in the character string extraction step, one or more second candidates for character strings constituting the characteristic character string, based on the arrangement of the character strings in the document; and
a characteristic character string generation step of selecting two or more character strings from at least one of the first candidates extracted in the first extraction step and the second candidates extracted in the second extraction step, and generating the characteristic character string.