JP7129206B2

JP7129206B2 - System, Aggregation Method and Program

Info

Publication number: JP7129206B2
Application number: JP2018091348A
Authority: JP
Inventors: 弘志松田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2018-05-10
Filing date: 2018-05-10
Publication date: 2022-09-01
Anticipated expiration: 2038-05-10
Also published as: JP2019195959A

Description

The present invention relates to a system, an aggregation method, and a program.

In printing systems, output management systems and device management systems that tally the number of printed sheets and generate reports have long been widely deployed in order to make printing costs visible and to reduce them. This is because, for medium-sized and larger companies and offices that deploy office equipment such as multiple printers and multifunction peripherals (MFPs), understanding printing costs is one of the key concerns in expense management.

For example, reporting functions are known that tally the number of printed sheets in several ways, such as by department, by user, and by application. Such a reporting function makes it possible to judge whether the number of sheets printed by each department or user is appropriate in light of the work involved. Departments and users that print many sheets despite doing similar work are encouraged to reduce their printing.
In addition, by identifying the applications from which many sheets are printed, it is possible to infer which areas of work involve heavy printing and to consider measures such as improving the work or promoting digitization.

Separately, techniques that apply machine learning to classify electronic documents are known.
Patent Literature 1 discloses a technique for clustering text documents and labeling the resulting clusters.
Patent Literature 2 discloses a technique that performs clustering using multiple features contained in print data as feature amounts, and then finds clusters and the print settings specific to each cluster.

In the field of machine learning, two techniques are widely known for classifying data such as electronic documents and print data: clustering and classification.
Classification is a technique that learns with the target groups specified in advance. It is usually configured so that multiple sets of sample data (training data) and their classification results (correct labels) are prepared and learning is performed beforehand. When the types of electronic documents to be classified are concretely determined, for example when identifying a specific form format, documents can be classified correctly with high probability.
Clustering, on the other hand, divides a target data set into groups with similar features. Clustering has the advantages that training data need not be prepared in advance, that flexible classification suited to each individual environment is possible, and that cumbersome operations such as prior learning are unnecessary. However, what each resulting cluster means must be determined from the data classified into it.

Patent Literature 1: JP 2009-151390 A
Patent Literature 2: JP 2015-202667 A

With conventional techniques, even if the number of sheets printed per application is known, it is difficult to identify the type of work that incurs the printing cost. For example, documents for the same electronic document viewer application commonly differ in content: some are presentation materials and others are forms. Conversely, the same type of form may be prepared separately for a spreadsheet application and for an electronic viewer application.
Thus, an application is not necessarily tied to the type of work that incurs printing costs. Aggregation per application, as done conventionally, therefore does not provide sufficient information for reducing printing costs.

The present invention is characterized by comprising: extraction means for extracting a feature amount of print data; classification means for classifying the print data into one of a plurality of clusters based on the feature amount, which represents features of the print data as a multidimensional vector; and tallying means for tallying the total number of printed sheets of the print data belonging to each of the plurality of clusters.

Based on a feature amount that represents the features of print data as a multidimensional vector, print data can be classified into clusters according to its content, and the total number of printed sheets can be determined for each cluster.

FIG. 1 is a configuration diagram of the printing system.
FIG. 2 is a hardware configuration diagram of the printer.
FIG. 3 is a hardware configuration diagram of the server.
FIG. 4 is a software module configuration diagram of the printer.
FIG. 5 is a software module configuration diagram of the server.
FIG. 6 is a basic flowchart of print data clustering in the server.
FIG. 7 is a basic flowchart of printing.
FIG. 8 is a list of print settings specified by a job ticket.
FIG. 9 is a detailed flowchart of print processing.
FIG. 10 is a list of PDL drawing commands.
FIG. 11 is a detailed flowchart of feature extraction.
FIG. 12 is a configuration example of a feature amount expressed as a multidimensional vector.
FIG. 13 is a detailed flowchart of clustering of printed data.
FIG. 14 is an example of a tally report.
FIG. 15 is a software module configuration diagram of the printer (Embodiment 2).
FIG. 16 is a basic flowchart of printer processing (Embodiment 2).
FIG. 17 is a detailed flowchart of text-based feature amount extraction.

The best mode for carrying out the present invention will be described below with reference to the drawings.

<System configuration diagram>
FIG. 1 is a system configuration diagram showing a configuration example of the printing system according to this embodiment.
A printer A 100, a printer B 101, a printer C 102, a server 103, and a client PC 104 are connected to a network 105 and can communicate with one another.
The server 103 serves as the information processing apparatus in this embodiment.
In this embodiment, three printers (printer A 100, printer B 101, and printer C 102) are connected, but the number of printers is not limited to three, and more printers may be connected.
An image processing apparatus with a print function, such as a multifunction peripheral, may be connected instead of a printer. Likewise, a plurality of client PCs may be connected.

<Printer hardware configuration diagram>
FIG. 2 is a hardware block diagram showing the hardware configuration of the printer A 100 of this embodiment.
Only the printer A 100 is described here, but the printer B 101 and the printer C 102 have the same hardware configuration as the printer A 100.

The controller unit 200 controls input and output of image signals and device information.
The CPU 201 reads a program stored in the ROM 203 or the HDD 204 into the RAM 202 and executes it. The CPU 201 also centrally controls the devices connected to the system bus 205.
The RAM 202 is the main memory of the CPU 201 and is used as a work area for the control program that controls the printer.
The ROM 203 stores a boot program that is executed at power-on.
The HDD 204 stores the operating system and the main body of the printer control program. The HDD 204 is also used to hold large data such as bitmap images and print data, either temporarily or over the long term.

The network 206 connects to a local area network (LAN) 213 and handles input and output of print data and device information.
The operation unit I/F 207 is the interface to the operation unit 214 and outputs to the operation unit 214 the bitmap data to be displayed on it. It also conveys information entered by the user of the printer A 100 on the operation unit 214 to the CPU 201.
The operation unit 214 has a liquid crystal panel and a sound source as output devices, and a touch panel and hard keys as input devices.

The controller unit 200 is connected to the printer engine 215 via the device I/F 208. Based on instructions from the CPU 201, the device I/F 208 sends image signals, issues device operation instructions, and receives device information.
The printer engine 215 is an output device that outputs image signals from the controller unit 200 onto a medium, and may use either an electrophotographic or an inkjet method.

The RIP (Raster Image Processor) 209 is dedicated hardware that renders a display list into a bitmap image. The RIP 209 processes the display list generated in the RAM 202 by the CPU 201 at high speed and in parallel with execution by the CPU 201.
The printer image processing unit 210 performs image correction, halftoning, and the like on image data for print output.
The image compression/decompression unit 211 performs JPEG compression and decompression on multilevel image data, and JBIG, MMR, or MH compression and decompression on binary image data.
The image rotation unit 212 rotates image data.

<Server hardware configuration diagram>
FIG. 3 is a hardware block diagram showing the hardware configuration of the server 103 of this embodiment.
The server 103 includes a controller unit 300, an operation unit 310, and a display 311.

The controller unit 300 includes a CPU 301, a RAM 302, a ROM 303, an HDD 304, a network 306, an operation unit I/F 307, a display I/F 308, and the like.
The CPU 301 reads a program stored in the ROM 303 or the HDD 304 into the RAM 302 and executes it. The CPU 301 also centrally controls the devices connected to the system bus 305.

The RAM 302 is the main memory of the CPU 301 and is used as a work area for various programs.
The ROM 303 stores a boot program that is executed at power-on.
The HDD 304 stores the operating system and application programs. The HDD 304 is also used to hold large amounts of data, either temporarily or over the long term.

The operation unit I/F 307 is the interface to the operation unit 310, which consists of input devices such as a mouse and a keyboard, and notifies the CPU 301 of information entered through the operation unit 310.
The display I/F 308 outputs to the display 311 the image data to be displayed on it.
The network 306 connects to a local area network (LAN) 309 and handles communication with external devices such as client PCs and printers.

<Printer software module configuration diagram>
FIG. 4 is a software module configuration diagram illustrating the software configuration of the printer A 100 of this embodiment.
Only the printer A 100 is described here, but the printer B 101 and the printer C 102 have the same software configuration as the printer A 100.
Each software module shown in FIG. 4 is stored as a program in the HDD 204, loaded into the RAM 202, and executed by the CPU 201. More specifically, each software module is loaded into the RAM 202 by the OS (operating system) running on the CPU 201, is granted execution rights on a per-thread basis, and is executed.

The data reception unit 402 receives print data transmitted from the server 103. The received data is held in the job data management unit 409 via the job control unit 401.
The job control unit 401 is responsible for overall job control from data reception to printing.
The PDL interpreter 403 interprets PDL data, that is, print data written in a page description language (PDL), and generates a display list as intermediate data. The generated display list is held in the job data management unit 409 via the job control unit 401.
The renderer 404 is a module that generates a bitmap image from a display list. Much of its processing is performed by the dedicated hardware RIP 209. The generated bitmap image is held in the job data management unit 409 via the job control unit 401.

The print driver 406 issues print instructions and sends bitmap images to the printer engine via the device I/F 208. In doing so, the print driver 406 also has the printer image processing unit 210 perform image correction.
The user interface 405 is a module that controls the operation unit 214 via the operation unit I/F 207. It generates the data to be displayed on the liquid crystal panel of the operation unit 214 and updates the display according to input from the touch panel. If the input from the touch panel is an instruction to execute a job, it passes the instruction to the job control unit 401.

The feature amount extraction unit 407 is a module that analyzes PDL data and extracts a feature amount. The extracted feature amount is held in the job data management unit 409 via the job control unit 401. Details of the feature amount are described later.
The data transmission unit 408 is a module that, after printing is complete, transmits the feature amount data held in the job data management unit 409 to the server.
The job data management unit 409 is a database that holds and manages print data, display lists, bitmap images, and feature amount data, either temporarily or over the long term.

<Server software module configuration diagram>
FIG. 5 is a software module configuration diagram showing the software configuration of the server 103 of this embodiment.
Each software module shown in FIG. 5 is stored as a program in the HDD 304, loaded into the RAM 302, and executed by the CPU 301. More specifically, each software module is loaded into the RAM 302 by the OS (operating system) running on the CPU 301, is granted execution rights on a per-thread basis, and is executed.

The feature amount data reception unit 505 receives feature amount data transmitted from a printer. The received data is held in the data management unit 506 via the control unit 501.
The control unit 501 is responsible for the series of processes from receiving feature amount data through clustering to report generation.
The clustering execution unit 502 is a module that performs data clustering of print data using the feature amount data.
The report generation unit 503 is a module that generates a tally report of the print data clusters and the number of printed sheets. Requests for report generation are accepted through the Web server 504, and the generated report is returned to the requester by the Web server 504.

<Basic flowchart of print data clustering on the server>
FIG. 6 is a basic flowchart of print data clustering in the information processing apparatus (server 103).
The clustering is described here as being executed by the software modules in FIG. 5. This flowchart is realized when the programs in the software modules of FIG. 5, stored in the HDD 304, are read into the RAM 302 and executed by the CPU 301.

First, in S601, the control unit 501 waits until a server event occurs.
There are two kinds of server event: a notification that feature amount data (described later) has been received, and a request to cluster print data and generate a tally report of the number of printed sheets.

When a server event occurs, the process proceeds to S602, and the control unit 501 determines whether the server event is a notification that feature amount data has been received.
If Yes in S602, the process proceeds to S603, and the feature amount data reception unit 505 receives the feature amount data and stores it in the data management unit 506 via the control unit 501.
A feature amount represents the features of print data as a multidimensional vector and is extracted during print preprocessing. In addition to the feature amount, the feature amount data contains, as attached information, the number of printed sheets, the number of color-printed sheets, the number of monochrome-printed sheets, and a representative image of the print data from which the feature amount was derived. Further details of the feature amount data and how it is extracted are described later.
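The contents of one feature amount data record can be pictured as a small data structure. The following is a minimal sketch in Python; the class name, field names, and sanity checks are hypothetical, since the text lists only what the record contains and not how it is encoded.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FeatureData:
    """One record a printer sends to the server for a print job.

    Field names are hypothetical; the description only lists the contents.
    """
    feature: Tuple[float, ...]    # multidimensional feature vector of the print data
    printed_sheets: int           # number of printed sheets
    color_sheets: int             # number of color-printed sheets
    mono_sheets: int              # number of monochrome-printed sheets
    representative_image: bytes   # representative image, e.g. encoded thumbnail bytes

    def __post_init__(self) -> None:
        # Basic sanity checks (an assumption; validation is not described in the text).
        if not self.feature:
            raise ValueError("feature vector must not be empty")
        if min(self.printed_sheets, self.color_sheets, self.mono_sheets) < 0:
            raise ValueError("sheet counts must be non-negative")


# Example record with made-up values.
record = FeatureData((0.2, 0.0, 0.8), printed_sheets=12, color_sheets=4,
                     mono_sheets=8, representative_image=b"")
```

Keeping the sheet counts next to the vector means the server can tally printed sheets per cluster without ever seeing the original print data.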

When the processing of S603 ends, the process returns to S601, and the control unit 501 again waits for a server event.
If No in S602, the process proceeds to S604, and the control unit 501 determines whether the server event is a request to cluster print data and generate a tally report of the number of printed sheets.
A tally report generation request arises from a user operation on a Web page provided by the Web server 504 and is notified to the control unit 501 as an event.

If Yes in S604, the process proceeds to S605, and the clustering execution unit 502 performs data clustering of the print data using the feature amounts in order to classify the printed data into a plurality of clusters.
The target is all feature amount data stored in the data management unit 506, but the data may be filtered, for example by the period in which the feature amount data was received or by the printer that sent it.
Data clustering divides the data into clusters, which are groups of data with similar feature amounts; its detailed flow is described later with reference to FIG. 13.
As a result of data clustering, a plurality of clusters is determined, together with the cluster to which each item of feature amount data, and hence the print data it was derived from, belongs. Each cluster and the list of feature amount data belonging to it are stored in the data management unit 506 via the control unit 501.
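To make the input and output of S605 concrete, the sketch below assigns each feature vector to a cluster with plain k-means. This is only a stand-in: the clustering algorithm is not named in this part of the description (the detailed flow is deferred to FIG. 13), and the function names, the fixed cluster count `k`, and the deterministic initialization are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points: List[Vector], k: int, iterations: int = 20) -> List[int]:
    """Return the cluster index of each point (plain k-means, illustrative only)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic initialization: use the first k points as initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Two obvious groups of two-dimensional feature vectors (made-up values).
labels = kmeans_labels([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

The returned list plays the role of the stored result: one cluster index per item of feature amount data, from which the per-cluster membership lists can be built.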

Next, in S606, the report generation unit 503 tallies the number of printed sheets for each cluster.
For the print data belonging to each cluster, the report generation unit 503 calculates the total number of printed sheets, the total for color printing, and the total for monochrome printing.
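The tally in S606 amounts to a grouped sum. A minimal sketch, assuming each job is given as a hypothetical `(printed, color, mono)` tuple alongside the cluster index assigned to it:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tally_by_cluster(labels: Iterable[int],
                     jobs: Iterable[Tuple[int, int, int]]) -> Dict[int, Dict[str, int]]:
    """Sum printed, color, and monochrome sheet counts for each cluster index."""
    totals = defaultdict(lambda: {"printed": 0, "color": 0, "mono": 0})
    for label, (printed, color, mono) in zip(labels, jobs):
        totals[label]["printed"] += printed
        totals[label]["color"] += color
        totals[label]["mono"] += mono
    return dict(totals)


# Three jobs: the first two belong to cluster 0, the third to cluster 1.
totals = tally_by_cluster([0, 0, 1], [(10, 4, 6), (5, 5, 0), (3, 0, 3)])
```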

Next, in S607, the report generation unit 503 generates, as an electronic document, a report of the print data clustering and the tally of printed sheets.
The report generation unit 503 lists in the tally report the total number of printed sheets, the total for color printing, and the total for monochrome printing for each cluster. As the image that best reflects the characteristics of each cluster, the report generation unit 503 also displays the representative image of the print data closest to the centroid of the cluster.
FIG. 14 shows an example of a tally report. In this example, four clusters are assigned to categories 1 to 4, and a representative image and the number of printed sheets are shown for each.
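Choosing the print data closest to the cluster centroid can be sketched as follows. The function name is hypothetical, and squared Euclidean distance is an assumption; the distance measure is not specified in this part of the text.

```python
from typing import Sequence


def representative_index(vectors: Sequence[Sequence[float]]) -> int:
    """Index of the member whose feature vector is closest to the cluster centroid."""
    if not vectors:
        raise ValueError("cluster must contain at least one vector")
    # The centroid is the component-wise mean of the member vectors.
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    return min(
        range(len(vectors)),
        key=lambda i: sum((x - c) ** 2 for x, c in zip(vectors[i], centroid)),
    )


# Centroid is (2, 2); the second vector (1, 1) is the nearest member.
chosen = representative_index([(0, 0), (1, 1), (5, 5)])
```

The representative image shown in the report would then be the one stored with the chosen item of feature amount data.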

Next, in S608, the Web server 504 returns the report of the print data clustering and the tally of printed sheets to the requesting user.
The process then returns to S601, and the control unit 501 again waits for a server event.
If No in S604, the process likewise returns to S601, and the control unit 501 again waits for a server event.
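The control flow of S601 to S608 is an event loop with two recognized event kinds. The sketch below is one hedged way to express it; the event names, the `make_report` and `reply` callbacks, and the `None` sentinel used to stop the loop are assumptions added so that the example can terminate, and they do not appear in the description.

```python
import queue
from typing import Any, Callable, List


def run_server(events: "queue.Queue",
               make_report: Callable[[List[Any]], Any],
               reply: Callable[[Any, Any], None]) -> List[Any]:
    """Dispatch server events until a None sentinel is received."""
    store: List[Any] = []                 # stands in for the data management unit
    while True:
        event = events.get()              # S601: wait for a server event
        if event is None:                 # sentinel to stop (not part of the flowchart)
            break
        kind, payload = event
        if kind == "feature_data":        # S602: feature amount data received?
            store.append(payload)         # S603: store the received data
        elif kind == "report_request":    # S604: report generation requested?
            report = make_report(list(store))   # S605 to S607: cluster, tally, build report
            reply(payload, report)        # S608: return the report to the requester
        # Any other event: fall through and wait again (S604 No).
    return store


# Usage with trivial stand-ins: the "report" is just the number of stored records.
q = queue.Queue()
for e in [("feature_data", "a"), ("other", None), ("feature_data", "b"),
          ("report_request", "alice"), None]:
    q.put(e)
sent = []
stored = run_server(q, make_report=len, reply=lambda who, rep: sent.append((who, rep)))
```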

In this flowchart, data clustering is executed when a user request arrives via the Web page, but it may instead be executed at night or during idle time when there are no user requests.

<Basic printing flowchart>
FIG. 7 is a basic flowchart of printing in the printer.
The processing is described here as being executed by the software modules in FIG. 4. This flowchart is realized when the programs in the software modules of FIG. 4, stored in the HDD 204, are read into the RAM 202 and executed by the CPU 201.

First, in S701, the data reception unit 402 receives print data and stores it in the job data management unit 409 via the job control unit 401.
Print data is generated by an application and a printer driver on the host PC and sent to the printer. The print data consists of PDL (page description language) data and a job ticket.
The PDL data describes the content to be drawn on the page. The job ticket describes print settings such as the number of copies, duplex printing, and color or monochrome printing.

Next, in S702, the PDL interpreter 403 interprets the print data, the feature amount extraction unit 407 extracts the feature amount, and the renderer 404 performs RIP processing. The renderer 404 also generates a thumbnail image of the first page of the print data as the representative image.
The extracted feature amount and the representative image are stored in the job data management unit 409. Interpretation of the print data, feature amount extraction, and RIP processing are executed concurrently; the details are described later with reference to FIG. 9.

Next, in S703, the job control unit 401 prints the print data.
In S704, the job control unit 401 checks the number of sheets printed for the print data and stores it in the job data management unit 409.
The number of printed sheets is associated with the feature amount of the print data and managed in the job data management unit 409. It is determined by referring to the value managed by a printed-sheet counter, described later.
Then, in S705, the data transmission unit 408 transmits to the server the feature amount data, which includes the feature amount of the print data, the number of printed sheets, and the representative image.

<List of print settings specified by a job ticket>
FIG. 8 is a list of the print settings that can be specified in the job ticket, which together with the PDL data makes up the print data.
There are five print settings: "duplex", "staple", "color mode", "paper size", and "page aggregation".

"Duplex" selects whether consecutive pages are printed only on the front side of each sheet or alternately on the front and back sides; the attribute values 1: one-sided and 2: two-sided can be specified.
"Staple" staples multiple output sheets together; the attribute values 1: none, 2: single, and 3: double can be specified.
"Color mode" selects whether print data is output in color or converted to monochrome before output; the attribute values 1: automatic, 2: color, and 3: monochrome can be specified. With the attribute value 1: automatic, the PDL interpreter 403 automatically determines color or monochrome from the content of the print data.

"Paper size" specifies the output paper size; the attribute values 1: A4, 2: A3, and 3: B4 can be specified.
"Page aggregation" reduces multiple consecutive pages of the print data and lays them out on the specified paper for printing. The attribute values 1: 1in1, 2: 2in1, 3: 4in1, and 4: 8in1 can be specified. For example, with 2in1, two consecutive pages are laid out on one side of a single output sheet.

<Detailed flowchart of print processing>
FIG. 9 is a detailed flowchart of print processing, elaborating the processing of S702 and S703 in the basic printing flowchart of FIG. 7.
In this flowchart, three threads (thread A, thread B, and thread C) run in parallel (multithreading).
The operating system time-slices the threads and allocates execution rights to each. Because the time slices are sufficiently small, the three threads can be regarded as running in parallel.
Multithreading by an operating system is a widely known technique, so a detailed description is omitted. Thread A, thread B, and thread C are resident threads created by the operating system when the print data classification apparatus starts up.

In thread A, first, in S901, the job control unit 401 resets the printed sheet counter to 0 and extracts the job ticket, which describes the print settings, from the print data.
Next, in S902, the job control unit 401 determines whether the print data belongs to a job for which page aggregation printing is specified.
Here, page aggregation printing refers to a printing method, such as N-up printing, in which multiple pages are arranged on a single sheet.
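By way of illustration only, the job ticket extracted in S901 and the determination of S902 could be modeled as in the following Python sketch. The attribute values follow FIG. 8; the class names, field names, and default values are assumptions introduced here and do not appear in the embodiment.

```python
from dataclasses import dataclass
from enum import IntEnum


class Duplex(IntEnum):
    ONE_SIDED = 1
    TWO_SIDED = 2


class Staple(IntEnum):
    NONE = 1
    SINGLE = 2
    DOUBLE = 3


class ColorMode(IntEnum):
    AUTO = 1        # the PDL interpreter decides from the print data
    COLOR = 2
    MONOCHROME = 3


class PaperSize(IntEnum):
    A4 = 1
    A3 = 2
    B4 = 3


class PageAggregation(IntEnum):
    ONE_IN_ONE = 1
    TWO_IN_ONE = 2
    FOUR_IN_ONE = 3
    EIGHT_IN_ONE = 4

    @property
    def pages_per_side(self) -> int:
        # Attribute values 1..4 correspond to 1, 2, 4 and 8 pages per side.
        return 2 ** (self.value - 1)


@dataclass(frozen=True)
class JobTicket:
    # The defaults below are assumptions of this sketch, not part of FIG. 8.
    duplex: Duplex = Duplex.ONE_SIDED
    staple: Staple = Staple.NONE
    color_mode: ColorMode = ColorMode.AUTO
    paper_size: PaperSize = PaperSize.A4
    page_aggregation: PageAggregation = PageAggregation.ONE_IN_ONE

    @property
    def uses_page_aggregation(self) -> bool:
        # Corresponds to the Yes/No determination of S902.
        return self.page_aggregation != PageAggregation.ONE_IN_ONE
```

For example, `JobTicket(page_aggregation=PageAggregation.TWO_IN_ONE).uses_page_aggregation` evaluates to `True`, which corresponds to the Yes branch of S902.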

If Yes in S902, the process proceeds to S904.
In S904, the PDL interpreter 403 calculates the logical pages to be laid out on a single sheet and their placement positions.
Next, in S905, the PDL interpreter 403 processes the PDL data of the pages assigned to the sheet according to the color mode setting and generates a display list for the sheet.
At this time, the PDL interpreter 403 adjusts the drawing position of each page within the sheet according to the calculated placement positions. The display list is intermediate data representing the drawing information of a sheet.
The process then proceeds to S906.

If No in S902, the process proceeds to S903.
In S903, the PDL interpreter 403 processes the PDL data according to the color mode setting and generates a display list for one page.
The process then proceeds to S906.

In S906, the feature amount extraction unit 407 processes the PDL data and extracts a feature amount.
Details of the feature amount extraction processing are described later with reference to FIG. 11 (and FIG. 17).

Next, in S907, the PDL interpreter 403 notifies the renderer 404 that generation of the display list is complete and requests rendering.
Then, in S908, the PDL interpreter 403 determines whether all pages have been processed.
If Yes, the processing of this flow ends.
If No, the process returns to S902 and continues with the remaining pages.

The flow of thread B is executed by the renderer 404.
First, in S909, the renderer 404 waits for generation of a display list for one sheet to complete.
When completion of display list generation is notified in S907 of thread A, thread B proceeds from S909 to S910.

In S910, the renderer 404 performs rendering and generates a bitmap image.
The generated bitmap image has 8-bit gradation for each of the CMYK colors.
The renderer 404 also generates a reduced-resolution thumbnail image for the first sheet only. Whether a sheet is the first one is determined from the printed sheet counter described above.

Next, in S911, the renderer 404 saves the bitmap image in the job data management unit 409 and requests the print driver 406 to print it.
The print request is made by sending a rendering completion notification to the print driver 406. Because the print driver 406 executes its processing in synchronization with the engine, the print request is handled in thread C, a thread separate from thread A.

The print driver 406, having received the request in S911 of thread B, transmits a print start request command to the engine and transfers the bitmap image in S913 of thread C.
At the same time, the print driver 406 transmits operation instruction commands for the output paper size, duplex, and staple settings. Before transferring the bitmap image, it applies image processing to the bitmap image.
Next, in S914, the print driver 406 increments the printed sheet counter, except when outputting the back side in duplex printing.
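By way of illustration only, the hand-off between threads A, B, and C can be pictured as a producer/consumer pipeline. The following Python sketch uses queues and an end-of-job sentinel, both of which are assumptions of this sketch; the embodiment only states that completion notifications are sent between the modules. The function names are likewise hypothetical, and the threads here are created per job rather than being resident.

```python
import queue
import threading

STOP = object()  # end-of-job sentinel (an assumption of this sketch)


def interpreter(pages, to_renderer):
    # Thread A: build one display list per sheet and notify the renderer (S907).
    for page in pages:
        to_renderer.put(f"display-list({page})")
    to_renderer.put(STOP)


def renderer(to_renderer, to_driver):
    # Thread B: wait for a display list (S909), render it (S910),
    # then hand the bitmap to the print driver (S911).
    while (item := to_renderer.get()) is not STOP:
        to_driver.put(f"bitmap({item})")
    to_driver.put(STOP)


def print_driver(to_driver, printed):
    # Thread C: transfer each bitmap to the engine (S913); here it is only recorded.
    while (item := to_driver.get()) is not STOP:
        printed.append(item)


def run_job(pages):
    to_renderer, to_driver, printed = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=interpreter, args=(pages, to_renderer)),
        threading.Thread(target=renderer, args=(to_renderer, to_driver)),
        threading.Thread(target=print_driver, args=(to_driver, printed)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return printed
```

Because each queue has a single producer and a single consumer, the sheets reach the driver in the order in which the interpreter produced them.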

When the processing of S911 in thread B ends, the process proceeds to S912, where the renderer 404 determines whether rendering of all pages is complete.
If Yes, the flow of FIG. 9 ends.
If No, the process returns to S909 and repeats.
Note that the number of logical pages in the print data does not necessarily match the number of output sheets. In N-up printing, the number of output sheets is smaller than the number of logical pages. The same applies to duplex printing.
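By way of illustration only, the relationship between logical pages and output sheets can be written as a short calculation. The following Python sketch assumes a single copy and no inserted blank pages; the function name and parameters are hypothetical.

```python
import math


def output_sheet_count(logical_pages: int, pages_per_side: int = 1,
                       duplex: bool = False) -> int:
    """Return the number of output sheets for one copy of a job."""
    if logical_pages < 0 or pages_per_side < 1:
        raise ValueError("logical_pages must be >= 0 and pages_per_side >= 1")
    # N-up: several logical pages share one side of a sheet.
    sides = math.ceil(logical_pages / pages_per_side)
    # Duplex: two sides share one sheet.
    return math.ceil(sides / 2) if duplex else sides
```

For a 10-page job, this gives 10 sheets with default settings, 5 sheets with 2in1, and 3 sheets with 2in1 combined with duplex printing.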

<List of PDL Drawing Commands>
FIG. 10 is a list of the PDL drawing commands used to describe print data in the print data classification apparatus.
There are four drawing commands: DrawPath, DrawFillPath, DrawText, and DrawImage.

DrawPath is a command for stroking the line segments that connect a sequence of points given as a coordinate array; the line color and line width are specified as additional parameters. Coordinates use a pixel coordinate system whose origin is the upper-left corner.
DrawFillPath is likewise a command for filling the area enclosed by a sequence of points given as a coordinate array; the fill color is specified as an additional parameter.
DrawPath and DrawFillPath can also draw curves by connecting many short line segments.

DrawText is a command for drawing a character string in a specified font; the character size, color, and drawing position are specified as additional parameters.
DrawImage is a command for drawing a sequence of pixel data in bitmap format; the size (width and height) and drawing position are specified as additional parameters.

<Detailed Flowchart of Feature Amount Extraction>
FIG. 11 is a detailed flowchart of feature amount extraction and shows the processing of S906 in FIG. 9 in detail.

In S1101, the feature amount extraction unit 407 parses the PDL data and extracts a drawing command.
Next, in S1102, the feature amount extraction unit 407 rounds off the drawing coordinate values of the drawing command to obtain approximate coordinates.
Even when a similar template is used, modifications or inadvertent changes to the template can shift positions by an amount too small to notice visually. Rounding is therefore applied to make the feature tolerant of such changes in coordinate values.

In S1103, the feature amount extraction unit 407 determines whether the drawing command is DrawPath or DrawFillPath.
If Yes, the process proceeds to S1105; if No, the process proceeds to S1104.

If the drawing command is either DrawPath or DrawFillPath, in S1105 the feature amount extraction unit 407 combines the drawing command, the approximate coordinates, and the number of points into a drawing identifier.
A drawing identifier classifies a DrawPath or DrawFillPath drawing by its position and number of points and treats it as a feature. That is, different drawing identifiers are treated as different features.
Because the drawing identifier is defined only by the approximate coordinates and the number of points, the same drawing identifier may be assigned to different drawings. Such collisions are unlikely, however, so at the document or page level print data can be distinguished adequately by using drawing identifiers as features and their set as the feature amount.
In this embodiment, a feature amount is treated as a set of features.

If the drawing command is neither DrawPath nor DrawFillPath, in S1104 the feature amount extraction unit 407 determines whether the drawing command is DrawText.
If Yes, the process proceeds to S1106; if No, the process proceeds to S1107.
If the drawing command is DrawText, in S1106 the feature amount extraction unit 407 combines the drawing command, the approximate coordinates, and the character string length into a drawing identifier.

If the drawing command is not DrawText, in S1107 the feature amount extraction unit 407 combines the drawing command, the approximate coordinates, the width, and the height into a drawing identifier.
Note that S1107 is executed only when the drawing command is none of DrawPath, DrawFillPath, and DrawText, that is, only when it is DrawImage.

In S1108, the feature amount extraction unit 407 increments and stores the count for the drawing identifier.
The count is used as a value representing the strength of the feature. The presence of multiple identical drawing identifiers in the PDL data usually means that the same drawing appears on multiple pages. Documents based on a template often use the same header and footer on every page, so counting drawing identifiers takes the characteristics of templates into account. Alternatively, features may be weighted by the size of the drawing area.

Next, in S1109, the feature amount extraction unit 407 determines whether parsing of the PDL data is complete.
If Yes, the process proceeds to S1110.
If No, the process returns to S1101 and the feature amount extraction processing is repeated.

When parsing of the PDL data is complete, in S1110 the feature amount extraction unit 407 generates a multidimensional vector in which each drawing identifier is one dimension and the count of that drawing identifier is the value of the dimension.
This multidimensional vector is the feature amount of the print data.
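By way of illustration only, the flow of S1101 to S1110 could be sketched in Python as follows. Several details are assumptions of this sketch and are not specified in the embodiment: each parsed drawing command is represented as a dictionary with hypothetical keys, the position of a path is taken from its `x` and `y` entries, coordinates are rounded half up to a grid of 10, and the identifier is an underscore-joined string.

```python
import math
from collections import Counter


def approx(value, grid=10):
    # S1102: round half up to a multiple of `grid` (the grid size is an assumption).
    return int(math.floor(value / grid + 0.5)) * grid


def drawing_identifier(cmd):
    # `cmd` is a dict such as {"op": "DrawText", "x": 231, "y": 252, "text": "..."}.
    op, x, y = cmd["op"], approx(cmd["x"]), approx(cmd["y"])
    if op in ("DrawPath", "DrawFillPath"):        # S1103 -> S1105
        return f"{op}_{x}_{y}_{len(cmd['points'])}"
    if op == "DrawText":                          # S1104 -> S1106
        return f"{op}_{x}_{y}_{len(cmd['text'])}"
    # S1107: the only remaining command is DrawImage.
    return f"{op}_{x}_{y}_{cmd['width']}_{cmd['height']}"


def extract_features(commands):
    # S1108 to S1110: count identical identifiers. The resulting mapping is a
    # sparse form of the multidimensional vector (one key per dimension).
    return Counter(drawing_identifier(c) for c in commands)
```

Two DrawPath commands whose start points differ by a pixel or two, for example (101, 79) and (99, 81), map to the same identifier and therefore raise the count of a single dimension.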

<Configuration Example of a Feature Amount as a Multidimensional Vector>
FIG. 12 shows a configuration example of the feature amount of a certain item of print data, expressed as a multidimensional vector.
For example, the drawing identifier "DrawPath_100_80_2" of dimension number 1 means that the DrawPath command produces two path drawings with 2 points at the approximate coordinates (100, 80).
The drawing identifier "DrawText_230_250_8" of dimension number 4 means that the DrawText command produces four text drawings with a character string length of 8 at the approximate coordinates (230, 250).
The drawing identifier "DrawImage_850_1200_2400_1200" of dimension number 5 means that the DrawImage command produces one image drawing with width 2400 and height 1200 at the approximate coordinates (850, 1200).

This print data as a whole contains 135 drawings distinguished by drawing identifiers, so its feature amount has 135 dimensions.
Although FIG. 12 uses character strings to represent drawing identifiers, any representation may be used.

<Detailed Flowchart of Clustering of Printed Data>
FIG. 13 is a detailed flowchart of the clustering processing for printed data and shows the processing of S605 in FIG. 6 in detail.

In S1301, the clustering execution unit 502 determines the feature space from the feature amounts of the items of printed data.
Items of printed data may or may not share drawing identifiers. The feature amount comparisons required for clustering must be performed in a common feature space.

Therefore, a feature space that encompasses the feature amount identifiers of all print data is first synthesized. Because items of printed data often have different feature amount identifiers, this processing makes the number of dimensions of the feature space far larger than that of any single item of printed data. The number of dimensions is then reduced by deleting the feature amount identifiers that exist only in a particular item of print data. This is because a feature amount identifier that exists only in a particular item of printed data cannot serve as an identifier that characterizes a group.

Next, in S1302, the clustering execution unit 502 maps the feature amount of each item of printed data to a feature point in the feature space.
Here, the clustering execution unit 502 deletes the feature amount identifiers that exist only in a particular item of printed data. Printed data from which an extremely large number of feature amount identifiers has been deleted may be excluded from clustering, because such a large number of deletions means that the data has almost no drawings similar to those of other print data.
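By way of illustration only, the construction of the common feature space (S1301) and the mapping to feature points (S1302) could be sketched in Python as follows. The function names and the `min_documents` parameter are hypothetical; with the default value of 2, an identifier that occurs in only one item of printed data is dropped.

```python
from collections import Counter


def build_feature_space(feature_sets, min_documents=2):
    # Count in how many items of printed data each identifier occurs.
    doc_freq = Counter()
    for features in feature_sets:
        doc_freq.update(set(features))
    # S1301: keep only identifiers shared by at least `min_documents` items.
    return sorted(k for k, n in doc_freq.items() if n >= min_documents)


def to_feature_point(features, space):
    # S1302: map one item's counts onto the shared axes; a missing
    # identifier contributes 0 on its axis.
    return [features.get(k, 0) for k in space]
```

With the feature amounts `{"a": 2, "b": 1}`, `{"a": 1, "c": 4}`, and `{"b": 3, "d": 1}`, the identifiers `c` and `d` occur in only one item each, so the feature space is reduced to the two axes `a` and `b`.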

Next, in S1303, the clustering execution unit 502 determines the number of clusters and the representative point of each cluster by canopy clustering.
As described later, this embodiment employs the known K-means method as the clustering method.
The K-means method requires the number of clusters and the initial cluster centroids to be determined in advance. Canopy clustering is used as preprocessing to determine the number of clusters and the initial cluster centroids appropriately. Canopy clustering is also a known technique, so its description is omitted.
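By way of illustration only, one common formulation of canopy clustering is sketched below in Python. The embodiment does not specify thresholds or the choice of representative point; the loose threshold `t1`, the tight threshold `t2`, the use of Euclidean distance, and the use of the canopy center as the representative point are all assumptions of this sketch.

```python
import math


def canopy_clustering(points, t1, t2):
    """Return a list of (center, members) canopies; requires t1 >= t2 > 0."""
    if not t1 >= t2 > 0:
        raise ValueError("expected t1 >= t2 > 0")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Points within the loose threshold t1 join this canopy.
        members = [center] + [p for p in remaining if math.dist(center, p) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold t2 can no longer start a canopy.
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return canopies
```

The number of canopies can then be used as the number of clusters, and the canopy centers as the initial centroids handed to the K-means method.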

Next, in S1304, the clustering execution unit 502 sets the representative point of each cluster as its initial centroid.
In the subsequent processing from S1305 to S1309, the clustering execution unit 502 executes clustering by the K-means method.

In S1305, the clustering execution unit 502 calculates the distance between the feature point of each item of printed data and the centroid of each cluster.
The known Euclidean distance is used as the distance measure. Note that distance expresses the similarity of feature amounts: the distance between similar feature amounts (feature points) is small, and the distance between dissimilar feature amounts (feature points) is large.
Next, in S1306, the clustering execution unit 502 assigns each item of printed data to the cluster whose centroid is closest to its feature point.
In S1307, the clustering execution unit 502 determines whether the assignment of feature points to clusters has changed.
If Yes, the process proceeds to S1308; if No, the process proceeds to S1309.

If the assignment of feature points to clusters has changed, in S1308 the clustering execution unit 502 recalculates the centroid of each cluster.
The centroid is the mean of the feature points assigned to the cluster and can be regarded as the feature point that represents the characteristics of the cluster.
When the processing of S1308 ends, the process returns to S1305 and continues. The processing of S1308, S1305, and S1306 is repeated until the assignment of feature points to clusters no longer changes, that is, until the assignment converges.

Once the centroid of each cluster has been determined, in S1309 the clustering execution unit 502 stores the feature amount of the centroid of each cluster and the printed data belonging to each cluster.
Then, in S1310, the clustering execution unit 502 finds and stores the item of printed data closest to the centroid of each cluster.
The image of the printed data closest to the centroid of a cluster is used as the representative image of that cluster in S607 of FIG. 6.
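By way of illustration only, the loop of S1305 to S1308 and the selection of the representative item in S1310 could be sketched in Python as follows. The function names, the iteration cap `max_iter`, and the handling of an empty cluster (its centroid is left unchanged) are assumptions of this sketch.

```python
import math


def k_means(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # S1305, S1306: assign each feature point to the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:   # S1307: no change, so converged
            break
        assignment = new_assignment
        # S1308: recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


def representatives(points, centroids, assignment):
    # S1310: index of the member closest to each centroid (None if empty).
    reps = []
    for k, c in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == k]
        reps.append(min(members, key=lambda i: math.dist(points[i], c))
                    if members else None)
    return reps
```

The indices returned by `representatives` identify the items of printed data whose images would serve as the cluster representative images.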

Here, the printed data belonging to a cluster is the printed data assigned to that cluster because the cluster's centroid is the one closest to its feature point. In the K-means method, every feature point (feature amount) is assigned to one of the clusters. Consequently, even an isolated point far from the centroid of every cluster is assigned to one of them.
Such isolated points act as noise. To remove this noise, a method may be adopted in which only the feature points (feature amounts) within a certain distance of a cluster's centroid are treated as printed data belonging to that cluster. In that case, printed data that belongs to no cluster can be handled collectively as "other" data, which avoids losing information.
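By way of illustration only, the noise removal described above could be sketched in Python as follows. The function name and the single `radius` threshold shared by all clusters are assumptions of this sketch; the embodiment only refers to "a certain distance".

```python
import math


def split_by_radius(points, centroids, assignment, radius):
    """Keep points within `radius` of their centroid; collect the rest as others."""
    clusters = {k: [] for k in range(len(centroids))}
    others = []
    for i, (p, k) in enumerate(zip(points, assignment)):
        if math.dist(p, centroids[k]) <= radius:
            clusters[k].append(i)
        else:
            others.append(i)   # isolated point: reported under "other" data
    return clusters, others
```

Keeping the rejected indices in `others` rather than discarding them means that their printed sheet counts can still be included in the totals.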

Because clustering evaluates similarity by distance in the feature space, print data whose feature points (feature amounts) share many features end up in the same cluster. That is, among the many features contained in print data, print data with many matching features can be identified as belonging to the same group.

As the data clustering method, methods other than the K-means method of this embodiment, such as hierarchical clustering or a support vector machine, may also be used. Even in that case, similarity is evaluated by the distance between feature amounts.

As described above, this embodiment makes it possible to grasp printing costs based on the content of the print data rather than on the application that generated it.
In general, pattern recognition has to be built in beforehand and cannot take into account patterns that arise afterward. The patterns to be recognized as particular formats must therefore be decided and programmed in advance, and whenever a new pattern appears, programming is required again.

In contrast, this embodiment performs data clustering using multiple features contained in the print data as the feature amount, which eliminates the need for programming in advance to recognize particular patterns.
Learning from training data is also unnecessary, which makes the method highly convenient. Furthermore, because similarity is judged on a feature amount made up of multiple features, the method does not depend on any particular feature and can recognize a wide variety of formats. In other words, the clustering of this embodiment is a highly flexible classification method for the wide variety of print data that may be input.

Embodiment 1 described a system that performs distributed processing with a server and clients (printers).
Embodiment 2, in contrast, describes a system in which a single printer performs clustering and report generation.

Embodiment 1 also described a method of constructing the feature amount of print data from the drawing command type, position information, size information, and the like. Such a feature amount is suited to extracting data with highly similar placement and size of drawing objects. In other words, it is well able to distinguish print data that uses similar template formats.
Embodiment 2, in contrast, describes a method that uses the content of the text (printed character strings) as the feature amount of print data. Using text makes it possible to extract groups of print data whose content is semantically similar.
Note that the server-client system configuration and the feature amount extraction method do not depend on each other. That is, the feature amount extraction method of Embodiment 1 can also be adopted in the single-printer configuration.

<Software Module Configuration Diagram of the Printer>
FIG. 15 is a software module configuration diagram showing the software configuration of the printer in Embodiment 2.
In Embodiment 2, the printer also serves as the server, so several modules are added to the software module configuration of FIG. 4. Specifically, a clustering execution unit 410, a report generation unit 411, and a web server 412 are added.
The functions of these modules are equivalent to those of the server's clustering execution unit 502, report generation unit 503, and web server 504 in FIG. 5, respectively.

In Embodiment 2, however, data such as feature amounts no longer needs to be exchanged between a server and the printer, so the printer does not need the data transmission unit 408.
In addition, because the processing capability of a single printer is lower than that of a server, the single-printer configuration limits the aggregation of printed sheet counts to the printer itself. That is, printing on other printers on the network is not included in the aggregation. The rest of the configuration is the same as the software module configuration of FIG. 4, so its description is omitted.

<Basic Flowchart of Printer Processing>
FIG. 16 is a flowchart showing the basic processing of the printer in Embodiment 2.
First, in S1601, the job control unit 401 waits until a printer event occurs.
There are two kinds of printer events: a print data reception notification, and a request to cluster the print data and generate an aggregate report of printed sheet counts.

When a printer event occurs, the process proceeds to S1602, where the job control unit 401 determines whether the printer event is a print data reception notification.
If Yes in S1602, the process proceeds to S1603 and the print data is received.
If No in S1602, the process proceeds to S1608.

When the printer event is a print data reception notification, the processing from S1603 to S1606 is the same as the processing from S701 to S704 in FIG. 7, so its description is omitted.
Then, in S1607, the job control unit 401 saves the feature amount of the print data, the number of printed sheets, and the representative image in the job data management unit 409.
When the processing of S1607 ends, the process returns to S1601.

If No in S1602, that is, if the printer event is a request to cluster the print data and generate an aggregate report of printed sheet counts, the process proceeds to S1608.
Then, in S1608, the clustering execution unit 502 executes clustering of the printed data.

The processing from S1609 to S1612 is the same as the processing from S605 to S608 in FIG. 6, so its description is omitted.
When the processing of S1612 ends, the process returns to S1601.

<Detailed Flowchart of Text-Based Feature Amount Extraction>
FIG. 17 is a detailed flowchart of text-based feature amount extraction and, like FIG. 11, shows the processing of S906 in FIG. 9 in detail.

In S1701, the feature amount extraction unit 407 parses the PDL data and extracts a drawing command.
Next, in S1702, the feature amount extraction unit 407 determines whether the drawing command is DrawText.
If Yes, the process proceeds to S1703; if No, the process proceeds to S1706.
In S1703, the feature amount extraction unit 407 extracts the character string information from the drawing command.
As described with reference to FIG. 10, DrawText takes the character string to be drawn as a parameter.

In S1704, the feature amount extraction unit 407 performs morphological analysis on the character string information and extracts only the nouns as words.
Morphological analysis is processing that splits a document into words and determines their parts of speech and the like using a dictionary. A known technique is used for the morphological analysis here.
Next, in S1705, the feature amount extraction unit 407 stores the words and the cumulative number of occurrences of each word as features.

Next, in S1706, the feature amount extraction unit 407 determines whether parsing of all the PDL data has been completed.
If Yes, the process proceeds to S1707.
If No, the process returns to S1701 and the feature amount extraction processing is repeated.
When parsing of all the PDL data has been completed, in S1707 the feature amount extraction unit 407 generates a multidimensional vector in which each word corresponds to one dimension and the number of occurrences of that word is the value of that dimension.
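The flow of S1701 to S1707 can be sketched roughly as follows. This is only an illustrative sketch, not the implementation described in the specification: the representation of a drawing command as a `(name, params)` pair, the `"text"` key, and the `extract_nouns` helper are all assumptions introduced here. In particular, `extract_nouns` merely splits on whitespace and does not identify parts of speech; an actual system would substitute a known morphological analyzer that keeps only nouns.

```python
from collections import Counter


def extract_nouns(text):
    """Hypothetical stand-in for morphological analysis (S1704).

    Assumption: every whitespace-separated token is treated as a word.
    A real system would use a morphological analyzer and keep only nouns.
    """
    return [w.strip(".,").lower() for w in text.split() if w.strip(".,")]


def extract_text_features(commands, tokenizer=extract_nouns):
    """Accumulate word counts over all DrawText commands."""
    counts = Counter()
    for name, params in commands:        # S1701: take the next drawing command
        if name != "DrawText":           # S1702: not DrawText, go on to S1706
            continue
        text = params["text"]            # S1703: character string parameter
        counts.update(tokenizer(text))   # S1704, S1705: words and cumulative counts
    return counts                        # the loop ends when parsing is complete (S1706)


def to_vector(counts, vocabulary):
    """S1707: one dimension per word, value = number of occurrences."""
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical parsed commands for one print job.
commands = [
    ("DrawText", {"text": "invoice total"}),
    ("DrawLine", {"points": [(0, 0), (1, 1)]}),
    ("DrawText", {"text": "invoice date"}),
]
counts = extract_text_features(commands)
vocabulary = sorted(counts)
vector = to_vector(counts, vocabulary)
```

Passing the tokenizer as a parameter keeps the counting logic independent of the language-specific analyzer, which is the part most likely to differ between deployments.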

For processing other than feature amount extraction, the same method as in the first embodiment may be adopted. In the second embodiment, however, using the cosine distance rather than the Euclidean distance as the distance between feature amounts gives more desirable results.
Still more desirable results are obtained when the feature weights are adjusted by the known TF/IDF method as preprocessing for clustering (S1301 and S1302 in FIG. 9).
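To illustrate why the cosine distance suits word-count vectors, the following sketch compares it with the Euclidean distance and applies one common TF/IDF variant (raw count multiplied by log(N/df)). The specification names only "the known TF/IDF method", so the exact weighting formula, the function names, and the sample vectors here are assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; depends on direction, not on vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty document is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def tf_idf(vectors):
    """Weight each count by log(N / df), one common TF/IDF variant (assumed here)."""
    n = len(vectors)
    dims = len(vectors[0])
    df = [sum(1 for v in vectors if v[d] > 0) for d in range(dims)]
    idf = [math.log(n / df[d]) if df[d] else 0.0 for d in range(dims)]
    return [[v[d] * idf[d] for d in range(dims)] for v in vectors]


# A short and a long document with the same word proportions.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]
other_doc = [0, 1, 5]

d_cos = cosine_distance(short_doc, long_doc)     # close to 0: same direction
d_euc = euclidean_distance(short_doc, long_doc)  # large: lengths differ
weighted = tf_idf([short_doc, long_doc, other_doc])
```

In this example the second word appears in every document, so its weight becomes zero after TF/IDF weighting and it no longer influences the distance between documents.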

As described above, when text (printed character strings) is used as the feature amount of print data, print data that uses many of the same words is extracted into the same cluster. In other words, even if the formats differ, print data whose source documents are close in semantic content is gathered together. Examining the correlation with print settings by focusing on semantic characteristics, and not only on geometric characteristics such as the placement of drawing objects, therefore enables a more multifaceted analysis.

Furthermore, since geometric characteristics and semantic characteristics can be regarded as independent, they can also be handled at the same time. In this case, a given piece of print data belongs simultaneously to a cluster based on geometric characteristics and to a cluster based on semantic characteristics.
In the first embodiment, a representative image of each cluster is displayed in the report. In the second embodiment, words characteristic of a cluster could instead be displayed as meta information representing the cluster's features.
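The specification does not state how the characteristic words of a cluster would be chosen. One possible approach, shown here purely as an assumption, is to take the words with the largest values in the cluster centroid. The function names, the `k` parameter, and the sample data are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of the feature vectors in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def characteristic_words(vectors, vocabulary, k=3):
    """Return up to k words with the largest centroid weight (assumed criterion)."""
    center = centroid(vectors)
    # Sort by descending weight; break ties alphabetically for stable output.
    ranked = sorted(zip(vocabulary, center), key=lambda pair: (-pair[1], pair[0]))
    return [word for word, weight in ranked[:k] if weight > 0]


# Hypothetical cluster of two print jobs over a four-word vocabulary.
vocabulary = ["date", "invoice", "total", "memo"]
cluster_vectors = [
    [1, 2, 1, 0],
    [0, 3, 1, 0],
]
top_words = characteristic_words(cluster_vectors, vocabulary, k=2)
```

If the vectors have been TF/IDF weighted beforehand, words common to every cluster are suppressed and the selected words tend to be more distinctive.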

(Other Embodiments)
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.
The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device.
The present invention is not limited to the above-described embodiments. Various modifications (including organic combinations of the embodiments) are possible based on the spirit of the present invention, and these are not excluded from the scope of the present invention. In other words, all configurations that combine the above-described embodiments and their modifications are included in the present invention.

100 Printer A
103 Server
401 Job control unit
402 Data reception unit
403 PDL interpreter
404 Renderer
405 User interface
406 Print driver
407 Feature amount extraction unit
408 Data transmission unit
409 Job data management unit
501 Control unit
502 Clustering execution unit
503 Report generation unit
504 Web server
505 Feature amount data reception unit
506 Data management unit

Claims

1. A system comprising:
extraction means for extracting a feature amount of print data;
classifying means for classifying the print data into one of a plurality of clusters based on the feature amount, which represents the features of the print data as a multidimensional vector; and
totaling means for totaling the number of printed sheets of the print data belonging to each of the plurality of clusters.

2. The system of claim 1, wherein the print data comprises PDL data described in a page description language (PDL) and a job ticket.

3. The system according to claim 1, wherein the feature amount is represented by a multidimensional vector in which each feature amount identifier corresponds to one dimension.

4. The system of claim 3, wherein the feature amount identifier is a drawing identifier derived from print data.

5. The system of claim 4, wherein the drawing identifier is defined from arithmetic coordinates and point sequences of print data described in a page description language.

6. The system of claim 3, wherein the feature amount identifiers are words extracted from print data.

7. The system according to any one of claims 3 to 6, wherein, from among the feature amount identifiers, any feature amount identifier that exists only in specific print data is deleted.

8. The system according to any one of claims 3 to 7, wherein the classifying means classifies the print data into the cluster whose centroid is closest to the feature amount.

9. The system according to any one of claims 1 to 8, further comprising generating means for generating a report of the number of printed sheets totaled by the totaling means.

10. The system according to claim 9, wherein said generating means generates a representative image from print data classified into each cluster.

11. The system according to claim 10, wherein the representative image is the print data closest to the center of gravity of the cluster among the print data classified into each cluster.

12. A system according to any preceding claim, further comprising printing means for printing based on said print data.

13. An aggregation method comprising:
an extraction step of extracting a feature amount of print data;
a classification step of classifying the print data into one of a plurality of clusters based on the feature amount, which represents the features of the print data as a multidimensional vector; and
a totaling step of totaling the number of printed sheets of the print data belonging to each of the plurality of clusters.

14. A program for causing a computer to execute the aggregation method according to claim 13.