JP2006330995A

JP2006330995A - Document processor

Info

Publication number: JP2006330995A
Application number: JP2005152398A
Authority: JP
Inventors: Michihiro Tamune; 道弘田宗; Masatoshi Tagawa; 昌俊田川; Kiyoshi Tashiro; 潔田代; Hiroshi Masuichi; 博増市; Tsuguaki Ryu; 紹明劉; Atsushi Ito; 篤伊藤; Naoko Sato; 直子佐藤
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2005-05-25
Filing date: 2005-05-25
Publication date: 2006-12-07

Abstract

PROBLEM TO BE SOLVED: To provide a technology for classifying documents into categories and assigning a proper name to each category. SOLUTION: In the document processor 20, a control part 21, upon receipt of document data, analyzes the document data to extract character string data. For each of a plurality of predetermined words, the control part determines whether the word appears in the extracted character string data, and generates a characteristic vector based on the determination result. Based on the generated characteristic vectors, the document data are classified by the SVM algorithm into the number of categories designated by a user. The appearance frequencies of words are calculated for each category, and words having a high appearance frequency are displayed on a display part 23. If a displayed word is registered in a relevant word table stored in a nonvolatile storage part 25b, the words stored in association with it are displayed alongside it. The user selects a category name from the displayed words, and the control part 21 determines the selected word as the category name. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a technique for storing electronic documents, and more particularly to a technique for classifying and storing electronic documents based on their contents.

Paper documents are used to transmit and record information, but keeping them in a cabinet or the like requires storage space. As the number of paper documents grows, the storage space must grow with it, so storing information on paper makes poor use of space. For this reason, paper documents are read and digitized by a scanner or the like, and the digitized image data is filed per paper document and stored on a hard disk or the like.

When storing filed documents, it is convenient to classify them, for example into directories, according to document type. As such a technique, Non-Patent Documents 1 and 2 disclose classifying documents according to a support vector machine algorithm (hereinafter, SVM algorithm) using a feature quantity (hereinafter, feature vector) that represents the characteristics of the document to be classified.

Patent Document 1 discloses a technique of generating feature vectors from training-data documents, learning a classification method with the SVM algorithm, and classifying the target documents based on the learning result.
Non-Patent Document 1: "Text classification: a 'trade fair' of learning theory", Information Processing, vol142, no. 1, 2001
Non-Patent Document 2: Takashi Kawauchi, Akifumi Makinouchi, "Area Image Classification Using Morphological Operations", IEICE Data Engineering Workshop (DEWS) 2005 Online Proceedings, 1C-i5 (2005)
Patent Document 1: JP 2001-22727 A

However, with the techniques of Non-Patent Documents 1 and 2, documents that express the same content are recognized as expressing different meanings if the words they contain differ. Moreover, Patent Document 1 discloses only a technique for classifying documents; it does not disclose a technique for determining the name of the category into which documents are classified. Consequently, the location of a document cannot be identified from a category name.

The present invention has been made in view of these circumstances, and its object is to provide a technique for classifying documents into categories and assigning an appropriate name to each category.

To solve the above problem, the present invention provides a document processing apparatus comprising: document data input means for acquiring document data obtained by digitizing documents; category number specifying means for receiving the number of categories into which the document data is to be classified; character string data generating means for analyzing the document data acquired by the document data input means and generating character string data representing character strings; feature vector generating means for detecting features of the character string data and generating a feature vector for each document; classifying means for classifying each piece of document data acquired by the document data input means, based on the feature vector of each document, so that the number of categories equals the number received by the category number specifying means; calculating means for calculating the frequency of occurrence of the words constituting the character string data from the group of documents classified into each category by the classifying means; and category name determining means for determining the name of each category based on the words with a high frequency of occurrence calculated by the calculating means.

According to this document processing apparatus, a feature vector is generated for each piece of document data from the features of the character string data constituting it, so the document data can be classified into the specified number of categories using those feature vectors. In addition, for each category the frequency of occurrence of the words constituting the character string data is calculated, and the category name is determined based on the frequently occurring words, so an optimal name can be assigned to each category.

The present invention also provides a document processing apparatus comprising: document data input means for acquiring document data obtained by digitizing documents; category number specifying means for receiving the number of categories into which the document data is to be classified; character string data generating means for analyzing the document data acquired by the document data input means and generating character string data representing character strings; feature vector generating means for detecting features of the character string data and generating a feature vector for each document; classifying means for classifying each piece of document data acquired by the document data input means, based on the feature vector of each document, so that the number of categories equals the number received by the category number specifying means; calculating means for calculating the frequency of occurrence of the words constituting the character string data from the group of documents classified into each category by the classifying means; display means for displaying the words with a high frequency of occurrence calculated by the calculating means; selected word receiving means for receiving a selection from among the words displayed by the display means; and category name determining means for determining the name of each category based on the word received by the selected word receiving means.

According to this document processing apparatus, a feature vector is generated for each piece of document data from the features of the character string data constituting it, so the document data can be classified into the specified number of categories using those feature vectors. In addition, the frequency of occurrence of the words constituting the character string data is calculated for each category and the frequently occurring words are displayed, which lets the user select an appropriate word as the category name. Because the word selected by the user is determined as the category name, each category can be given the name the user wants.

In a preferred aspect of the present invention, the apparatus may have storage means for storing predetermined words in association with their related words, and the category name determining means may refer to the storage means to detect the related words corresponding to the frequently occurring words calculated by the calculating means, and determine each category name based on both the detected related words and the frequently occurring words.

According to this document processing apparatus, when the category name is to be a different word related to the most frequent word in the category rather than the most frequent word itself, those words are stored in association with each other, so the desired word can be determined as the category name by referring to the stored contents.

In another preferred aspect of the present invention, the apparatus may have storage means for storing predetermined words in association with their related words; the display means may refer to the storage means to detect the related words corresponding to the frequently occurring words calculated by the calculating means, and display both the detected related words and the frequently occurring words; the selected word receiving means may receive a selection from among the words and related words displayed by the display means; and the category name determining means may determine each category name based on the word or related word received by the selected word receiving means.

According to this document processing apparatus, predetermined words are stored in association with their related words, so the most frequent word in a category can be displayed together with the words stored in association with it. The user can therefore choose the category name from among the most frequent word in the category and the words associated with it. In other words, the user is given a choice of category names.

In yet another preferred aspect, the feature vector generating means may determine, for each of a plurality of predetermined words, its frequency of occurrence in the character string data of each document, and generate the feature vector of each document based on the determination result.

First, the term "document" as used in this embodiment is defined. Documents can be organized in several ways, and the boundary between one document and the next depends on how they are organized. For example, when the contents of one document fit on a single sheet, as shown in FIG. 1(a), the sheet boundary is the document boundary. When one document spans several sheets, as shown in FIG. 1(b), a semantic break is the document boundary. When several documents, including drawings, are laid out on a single sheet, as shown in FIG. 1(c), a drawing portion, for example, serves as the document boundary. This embodiment describes the case of FIG. 1(a), in which each document consists of a single sheet.

FIG. 2 is a block diagram showing the configuration of a document processing system 1, which is one embodiment of the document processing apparatus according to the present invention. The image reading device 10 in FIG. 2 is a scanner equipped with an automatic paper feeding mechanism such as an ADF (Auto Document Feeder). It optically reads, page by page, the paper documents that the user has set in the ADF, and transfers image data corresponding to the read images to the document processing apparatus 20 via the communication line 12. The communication line 12 may be a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, or the like; in this embodiment, a LAN is used.

The document processing apparatus 20 converts the image data sent from the image reading device 10 into files, classifies them, and stores them.
FIG. 3 is a block diagram showing the hardware configuration of the document processing apparatus 20. The control unit 21 is, for example, a CPU (Central Processing Unit), and centrally controls each unit of the document processing apparatus 20 by executing various software. The communication IF unit 22 is connected to the image reading device 10 via the communication line 12; it receives the image data transmitted from the image reading device 10 over the communication line 12 and transfers it to the control unit 21.

The display unit 23 is, for example, a liquid crystal display and its drive circuit, and displays images corresponding to the data transferred from the control unit 21. The operation unit 24 is, for example, a keyboard with a plurality of operators (not shown), and outputs data corresponding to the operation of those operators to the control unit 21.

The storage unit 25 includes a volatile storage unit 25a and a nonvolatile storage unit 25b. The volatile storage unit 25a is, for example, a RAM (Random Access Memory), and is used as a work area when the control unit 21 operates according to various software.
The nonvolatile storage unit 25b is, for example, a hard disk, and stores and accumulates the image data in a separate storage area (for example, a folder) for each document type.

The nonvolatile storage unit 25b also stores OS software for causing the control unit 21 to run an OS (Operating System), and document classification software for causing the control unit 21 to implement the functions specific to the document processing apparatus 20 of this embodiment. By executing the document classification software, the control unit 21 extracts character string data from the image data representing each document transferred from the image reading device 10, and applies morphological analysis or the like to the extracted character string data to extract the words that make up the character string. It then generates a feature vector for each document from the extracted words. A feature vector has one element for each of a plurality of predetermined words (for example, several hundred thousand predetermined words); each element is 1 if the corresponding word appears in the extracted character string data and 0 if it does not. For example, if the predetermined words are ("愛" (love), "逆転" (comeback), "国会" (National Diet), "ホームラン" (home run)) and the character string is "最終回に逆転満塁ホームランが飛び出した" ("a come-from-behind grand slam home run was hit in the final inning"), the feature vector (0, 1, 0, 1) is generated.
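The 0/1 feature vector described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `feature_vector` is hypothetical, and a plain substring test stands in for the morphological analysis that the embodiment uses to extract words.

```python
def feature_vector(text, vocabulary):
    """Return a tuple with 1 where the vocabulary word occurs in text, else 0.

    Assumption: a substring test replaces real morphological analysis.
    """
    return tuple(1 if word in text else 0 for word in vocabulary)


# The example from the description.
vocabulary = ("愛", "逆転", "国会", "ホームラン")
text = "最終回に逆転満塁ホームランが飛び出した"

print(feature_vector(text, vocabulary))  # (0, 1, 0, 1)
```

With a vocabulary of several hundred thousand words, most elements would be 0, so a practical implementation would probably store only the indices of the words that appear.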

Using these feature vectors, the image data transferred from the image reading device 10 is classified into a predetermined number of categories. In this embodiment, the SVM algorithm, which has high classification accuracy (see Non-Patent Documents 1 and 2), is used as the algorithm for this classification.

The nonvolatile storage unit 25b also stores the related word table shown in FIG. 4. As shown in FIG. 4, the related word table consists of a "word" field and a "related word" field, and stores in association with each other words that are written differently but that the user uses with the same meaning. For example, "連絡書" (contact form), "連確" (a shortened form), and "連絡確認書" (contact confirmation form) are used with the same meaning. When the control unit 21 performs character recognition on image data representing these words, their character forms differ, so the control unit 21 recognizes them as different words. Through the related word table, however, the control unit 21 recognizes that these words are related. In this embodiment, the related word table is assumed to have been entered by the user in advance.
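One way to picture the related word table is as a mapping from each entry in the "word" field to the list in its "related word" field. The sketch below is illustrative only: the names `RELATED_WORD_TABLE` and `are_related` are hypothetical, and the single row mirrors the example in the text rather than the full contents of FIG. 4.

```python
# Hypothetical in-memory form of the related word table:
# "word" field -> contents of the "related word" field.
RELATED_WORD_TABLE = {
    "連絡書": ["連確", "連絡確認書"],
}


def are_related(a, b, table=RELATED_WORD_TABLE):
    """Return True if a and b are the same word or share a table row."""
    if a == b:
        return True
    for word, related in table.items():
        group = {word, *related}
        if a in group and b in group:
            return True
    return False


print(are_related("連確", "連絡確認書"))  # True
print(are_related("連絡書", "申請"))      # False
```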

Next, the operation of the control unit 21 is described. In the following, it is assumed that the control unit 21 of the document processing apparatus 20 is running under the OS software and is waiting for the user to perform some input operation.

When the user operates the operation unit 24 to instruct execution of the document classification software, the control unit 21 reads the document classification software from the nonvolatile storage unit 25b and executes it. The operations that the control unit 21 performs according to the document classification software are described below with reference to the drawings. The following description assumes that the user wants to classify 100 documents into five categories.

FIG. 5 is a flowchart showing the operations that the control unit 21 performs according to the document classification software. First, the control unit 21 displays the category number input screen 30 shown in FIG. 6 on the display unit 23 and has the user specify the number of categories (step SA10). As shown in FIG. 6, the category number input screen 30 consists of a category number input box 31 and an "Enter" button 32. The user viewing the category number input screen 30 operates the operation unit 24, enters the category number "5" in the category number input box 31, and presses the "Enter" button 32 to fix the desired number of categories. The control unit 21 receives category number data representing this operation from the operation unit 24, determines from that data that the number of categories (n) specified by the user is 5, and sets n = 5.

Next, the control unit 21 writes the category number data transferred from the operation unit 24 to the volatile storage unit 25a and waits for image data to arrive from the image reading device 10. When the user sets paper documents in the ADF of the image reading device 10 and performs a predetermined operation, the image reading device 10 reads the images of the paper documents one sheet at a time, and the image data corresponding to each image is transferred in sequence from the image reading device 10 to the document processing apparatus 20 via the communication line 12. Since, as stated above, the user has set 100 documents in the ADF of the image reading device 10, the image data of 100 documents is transferred in sequence to the document processing apparatus 20.

When the control unit 21 receives the image data sent from the image reading device 10 via the communication IF unit 22 (step SA12), it reads the character strings in each piece of image data document by document (sheet by sheet), for example by OCR (Optical Character Recognition), and extracts character string data (step SA14). At this point, the control unit 21 uses NLP (Neuro Linguistic Programming) to check the read character string data and, if it judges the data to be inappropriate as Japanese, corrects it to an appropriate character string. For example, when OCR is applied to a document containing the word "連絡書" (contact form), font misalignment or the like may cause "連絡書" to be extracted as the character string data "連結書". In such a case, the control unit 21 uses NLP to correct the extracted "連結書" to "連絡書".
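The description does not say how the correction step decides that a string is inappropriate or how it picks the replacement. Purely as an illustration, the sketch below swaps in a simple stand-in: an OCR result that is not in a dictionary of known words is replaced by the closest dictionary entry, using Python's standard `difflib`. The dictionary `KNOWN_WORDS`, the function `correct_word`, and the similarity cutoff are all assumptions.

```python
import difflib

# Hypothetical dictionary of words considered valid.
KNOWN_WORDS = ["連絡書", "申請", "総務部"]


def correct_word(word, known_words=KNOWN_WORDS, cutoff=0.6):
    """Return word unchanged if known, else the closest known word.

    Stand-in for the NLP-based check in the description: if no known
    word is similar enough, the OCR result is left as it is.
    """
    if word in known_words:
        return word
    matches = difflib.get_close_matches(word, known_words, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("連結書"))  # 連絡書
```

A character-similarity heuristic like this can also miscorrect valid words that are missing from the dictionary, which is one reason a language-aware check is preferable in practice.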

Based on the character string data extracted through this processing, the control unit 21 generates the feature vectors described above (step SA16). In this embodiment, 100 feature vectors are generated, one for each of the 100 documents transferred from the image reading device 10.

The control unit 21 then uses the feature vectors generated in step SA16 to classify the 100 pieces of image data according to the predetermined algorithm (step SA18). Since the number of categories specified by the user in step SA10 is n = 5, the control unit 21 classifies the 100 pieces of image data into five groups using the SVM algorithm. For convenience, the five categories are referred to below as category 1 to category 5.

Next, the control unit 21 sets a counter (not shown) to m = 0 (step SA20). It then calculates the frequency of occurrence of words in the character string data of the documents classified into category 1 (step SA22). Suppose the result, in descending order of frequency, is "連絡書" (contact form), "申請" (application), and "総務部" (general affairs department). Based on these words, the control unit 21 displays the category name determination screen 40 shown in FIG. 7 (step SA24). In this embodiment, the category name determination screen 40 displays the three most frequent words in descending order of frequency, that is, "連絡書", "申請", and "総務部". For each word to be displayed on the display unit 23, the control unit 21 also looks up the "word" field of the related word table shown in FIG. 4 and, if the word is found, displays alongside it the words stored in the corresponding "related word" field. In FIG. 7, "連確" and "連絡確認書", which are stored in association with the most frequent word "連絡書", are displayed alongside "連絡書".
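Steps SA22 and SA24 can be sketched as follows: count the words across the documents of one category, take the three most frequent, and attach any related words found in the table. The function name `name_candidates`, the pre-tokenized document format, and the sample data are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical related word table ("word" -> "related words").
related_word_table = {"連絡書": ["連確", "連絡確認書"]}


def name_candidates(category_docs, related_table, top_n=3):
    """Return [(word, related_words), ...] for the top_n most frequent words.

    category_docs: documents of one category, each already split into words.
    """
    counts = Counter(word for doc in category_docs for word in doc)
    return [
        (word, list(related_table.get(word, [])))
        for word, _ in counts.most_common(top_n)
    ]


# Made-up word lists for three documents classified into category 1.
category_1 = [
    ["連絡書", "申請", "総務部"],
    ["連絡書", "申請"],
    ["連絡書"],
]

for word, related in name_candidates(category_1, related_word_table):
    print(word, related)
# 連絡書 ['連確', '連絡確認書']
# 申請 []
# 総務部 []
```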

On the category name determination screen 40, the user selects a word by operating the operation unit 24 to move the cursor. Because the control unit 21 displays several frequently occurring words in this way, the user can see which words appear often in the character strings of each category and can therefore decide on an appropriate category name. Furthermore, because the user can see several candidate category names, the user has a wider choice of names. In this embodiment, the currently selected word is indicated by highlighting or the like, and on the initial display of the category name determination screen 40 the first word is highlighted. FIG. 7 shows the case where "連絡書" (contact form) is highlighted.

When the user presses the "Enter" button 41 on the category name determination screen 40 (step SA26; YES), the control unit 21 takes the word highlighted at that moment as the category name. When the user presses the "Modify" button 42 instead (step SA26; NO), the control unit 21 displays the category name input screen 50 shown in FIG. 8 on the display unit 23. As shown in FIG. 8, the category name input screen 50 has a category name input box 51. When the user operates the operation unit 24 and enters a category name in the category name input box 51 (step SA30), the control unit 21 takes the entered word as the category name.

Next, the control unit 21 increments the counter so that m = m + 1 (step SA28), and determines whether m has reached n, the number of categories. If the counter value is less than n (step SA32; NO), the names of categories 2 to 5 have not yet been determined, so the processing from step SA22 onward is repeated. If the counter value has reached n (step SA32; YES), the control unit 21 ends this routine.
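The naming loop of steps SA20 to SA32 can be sketched as follows. The two callbacks are hypothetical: `propose(m)` stands for the frequency calculation and candidate display for one category (SA22, SA24), and `confirm(candidates)` stands for the user's choice or manual entry (SA26, SA30). The sketch assumes n is at least 1, as in the flowchart, where the test comes after the first pass.

```python
def name_categories(n, propose, confirm):
    """Name n categories one by one, following the counter logic of FIG. 5."""
    names = []
    m = 0                                  # SA20: reset the counter
    while True:
        candidates = propose(m)            # SA22, SA24: frequent words for this category
        names.append(confirm(candidates))  # SA26 / SA30: user picks or types a name
        m += 1                             # SA28: advance the counter
        if m >= n:                         # SA32: all categories named?
            return names


# Toy callbacks: propose one made-up word per category, accept the first candidate.
names = name_categories(
    5,
    propose=lambda m: [f"word{m + 1}"],
    confirm=lambda candidates: candidates[0],
)
print(names)  # ['word1', 'word2', 'word3', 'word4', 'word5']
```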

As described above, in this embodiment the image data is classified into categories based on the feature vectors, and the user can decide each category name based on the words that occur frequently in that category. Consistent category names can therefore be set even when different users operate the apparatus. In addition, the related word table stores in association with each other words that are written differently but used with the same meaning. Thus, for example, if on the category name determination screen 40 shown in FIG. 7 the word the user wants as the category name is not the most frequent word "連絡書" (contact form) but its shortened form "連確", the control unit 21 displays "連確" in association with "連絡書" so that it can be selected, which saves the user the trouble of typing "連確" again.

In this embodiment, when a category name is determined, the control unit 21 displays the frequent words on the category name determination screen 40 shown in FIG. 7 and lets the user select one. Alternatively, a table listing category name candidates may be stored in the storage unit 25 as an initial set (hereinafter, the initial set table); the initial set table is then searched using the frequent words, and a matching word is automatically assigned as the category name. If no matching word exists in the initial set table, the category name determination screen 40 shown in FIG. 7 is displayed and the user is asked to select a name. In that case, the control unit 21 may also be given a learning function so that it stores the category name selected by the user in association with the high-frequency words.
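One way to picture this automatic assignment with a fallback to the user is the sketch below. It rests on several assumptions that are not in the source: the initial set table is modeled as a plain set of candidate names, the frequent words are assumed to be ordered from most to least frequent, and the "learning function" is modeled as a dictionary from the chosen name to the frequent words that led to it. The names `auto_category_name`, `decide_name`, `learned`, and `ask_user` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set


def auto_category_name(frequent_words: List[str], initial_set: Set[str]) -> Optional[str]:
    """Return the first frequent word that is registered in the initial set table."""
    for word in frequent_words:  # assumed ordered from most to least frequent
        if word in initial_set:
            return word
    return None


def decide_name(
    frequent_words: List[str],
    initial_set: Set[str],
    learned: Dict[str, List[str]],
    ask_user: Callable[[List[str]], str],
) -> str:
    """Assign a name automatically when possible; otherwise ask the user and learn."""
    name = auto_category_name(frequent_words, initial_set)
    if name is None:
        # No match in the initial set table: fall back to the selection screen.
        name = ask_user(frequent_words)
        # Assumed form of the learning function: remember which frequent words
        # accompanied the name the user chose.
        learned[name] = list(frequent_words)
    return name
```

For instance, with `initial_set = {"memo"}`, the frequent words `["total", "tax"]` match nothing, so `ask_user` is called and its answer is recorded in `learned`.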

The initial set table described above may be provided separately from the related word table shown in FIG. 4, or the two may be combined into a single table. When they are combined, the "related word" field of the related word table stores the related word if one exists for the word stored in the "word" field, and is left blank otherwise. When category names are assigned automatically, using an initial set table that stores the related words for each word means that the initial set does not have to be reconfigured even when document names differ from company to company. In other words, even if a document called 「連絡書」 at company A is called 「連確」 at company B, the related word table removes the need to prepare one initial set for company A and another for company B. This reduces the cost of reconfiguring the initial set.
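A combined table of this kind might look like the following sketch. The row layout (a "word" field and a "related word" field that is empty when there is no related word) follows the description above; everything else is assumed for illustration, including the name `find_entry`, the `("invoice", "")` row, and the choice to return the whole matching row rather than decide which of the two words becomes the category name.

```python
from typing import List, Optional, Tuple

# Combined initial set / related word table: (word, related word).
# The related-word field is "" when no related word exists.
MERGED_TABLE: List[Tuple[str, str]] = [
    ("連絡書", "連確"),   # example pair taken from the description
    ("invoice", ""),      # hypothetical row with no related word
]


def find_entry(frequent_word: str, table: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Find the row whose word or related word equals the frequent word."""
    for word, related in table:
        if frequent_word == word or (related and frequent_word == related):
            return (word, related)
    return None


# Documents from company A ("連絡書") and company B ("連確") hit the same row,
# so one table serves both companies.
print(find_entry("連絡書", MERGED_TABLE) == find_entry("連確", MERGED_TABLE))  # → True
```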

In this embodiment the elements of the feature vector are words, but the layout of the document may be used as the elements instead. That is, layout analysis is performed on the image data, character strings are recognized, and a feature vector is generated from the position or the length of each recognized character string.
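A layout-based feature vector could be built along the lines below. The source only says that the position or length of recognized character strings is used; the concrete encoding here is an assumption: each recognized string is a hypothetical `TextBlock` with page-relative coordinates, the blocks are read top to bottom, and a fixed number of them contribute `(x, y, length)` triples, with zero padding so that every document yields a vector of the same size.

```python
from typing import List, NamedTuple


class TextBlock(NamedTuple):
    """A recognized character string with its position (assumed normalized to 0..1)."""
    x: float
    y: float
    text: str


def layout_feature_vector(blocks: List[TextBlock], max_blocks: int = 3) -> List[float]:
    """Encode the first max_blocks strings (top to bottom) as x, y, length triples."""
    ordered = sorted(blocks, key=lambda b: (b.y, b.x))[:max_blocks]
    vector: List[float] = []
    for block in ordered:
        vector.extend([block.x, block.y, float(len(block.text))])
    # Pad with zeros so documents with fewer strings still give a fixed-size vector.
    vector.extend([0.0] * (3 * (max_blocks - len(ordered))))
    return vector


blocks = [TextBlock(0.1, 0.5, "Total"), TextBlock(0.3, 0.05, "Invoice")]
print(layout_feature_vector(blocks))
# → [0.3, 0.05, 7.0, 0.1, 0.5, 5.0, 0.0, 0.0, 0.0]
```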

Furthermore, in this embodiment the 100 documents used by the user were handled with one sheet of paper as the unit of document separation, but documents having the structures shown in FIGS. 1(b) and 1(c), for example, may also be used. In that case, the control unit 21 may delimit the document units by recognizing semantic divisions through the layout analysis described above, semantic analysis, symbol-based analysis, or the like, or by recognizing the positions of paragraphs or drawings. Alternatively, the user may enter a specific character or symbol through the operation unit 24 as the criterion for document separation; the control unit 21 then detects that character or symbol in the document and recognizes each detected location as a document boundary. In short, a document of any structure may be used as long as its boundaries are indicated. The control unit 21 then identifies the range of one document for each boundary and performs the operation shown in FIG. 5.
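Separation by a user-specified character or symbol can be illustrated as follows. The sketch assumes that the scanned material arrives as a list of page texts and that a page containing the marker starts a new document; the source does not fix either point, and the names `split_documents` and `marker` are hypothetical.

```python
from typing import List


def split_documents(pages: List[str], marker: str) -> List[List[str]]:
    """Group pages into documents, starting a new document at each marker page."""
    if not marker:
        raise ValueError("marker must be a non-empty string")
    documents: List[List[str]] = []
    for page in pages:
        # A page containing the marker opens a new document; the very first
        # page always opens one, even without a marker.
        if marker in page or not documents:
            documents.append([])
        documents[-1].append(page)
    return documents


pages = ["#DOC a", "b", "#DOC c", "d", "e"]
print(split_documents(pages, "#DOC"))
# → [['#DOC a', 'b'], ['#DOC c', 'd', 'e']]
```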

In this embodiment the frequency of the words in each category is calculated after the documents have been classified into categories, but the word frequency may instead be calculated when the feature vectors are generated. That is, in generating a feature vector, the embodiment checks only whether each of the predetermined words is present in the character string data; the frequency of each word may be calculated at the same time as its presence is checked.
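The difference between the two kinds of feature vector can be shown in a few lines. This is a sketch under the assumption that words are found by plain substring matching (Japanese text has no spaces between words, and the source does not say how words are located); the function names are hypothetical.

```python
from typing import List


def presence_vector(text: str, vocabulary: List[str]) -> List[int]:
    """1 if the predetermined word appears in the text, 0 otherwise."""
    return [1 if word in text else 0 for word in vocabulary]


def frequency_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Number of non-overlapping occurrences of each predetermined word."""
    return [text.count(word) for word in vocabulary]


text = "memo about the memo and the invoice"
vocabulary = ["memo", "invoice", "receipt"]
print(presence_vector(text, vocabulary))   # → [1, 1, 0]
print(frequency_vector(text, vocabulary))  # → [2, 1, 0]
```

Computing the counts at this stage means the per-category frequencies can later be obtained by summing the vectors of the documents in each category, rather than scanning the text again.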

In this embodiment the image reading apparatus 10 and the document processing apparatus 20 are configured as separate pieces of hardware, but the two may be built as a single piece of hardware. In that case, the communication line 12 is an internal bus that connects the image reading apparatus 10 and the document processing apparatus 20 within that hardware.

In this embodiment, the software that causes the control unit 21 to realize the functions specific to the document processing apparatus 20 according to the present invention is stored in advance in the nonvolatile storage unit 25b. However, the software may instead be recorded on a computer-readable recording medium such as a CD-ROM (Compact Disc Read-Only Memory) or a DVD (Digital Versatile Disc). In this way, a general-purpose computer can be made to function as the document processing apparatus according to the present invention.

FIG. 1 is a diagram illustrating formats of document structure.
FIG. 2 is a block diagram showing the overall configuration of a document processing system according to an embodiment of the present invention.
FIG. 3 is a block diagram showing an example of the hardware configuration of the document processing apparatus.
FIG. 4 is a diagram illustrating the related word table.
FIG. 5 is a flowchart showing the operation performed by the control unit of the document processing apparatus.
FIG. 6 is a diagram illustrating the category number input screen displayed by the display unit.
FIG. 7 is a diagram illustrating the category name determination screen displayed by the display unit.
FIG. 8 is a diagram illustrating the category name input screen displayed by the display unit.

Explanation of symbols

1 ... Document search system, 10 ... Image reading apparatus, 12 ... Communication line, 20 ... Document processing apparatus, 21 ... Control unit, 22 ... Communication IF unit, 23 ... Display unit, 24 ... Operation unit, 25 ... Storage unit, 25a ... Volatile storage unit, 25b ... Nonvolatile storage unit, 26 ... Bus, 30 ... Category number input screen, 40 ... Category name determination screen, 50 ... Category name input screen

Claims

A document processing apparatus comprising:
document data input means for acquiring document data obtained by digitizing a document;
category number specifying means for receiving the number of categories into which the document data is to be classified;
character string data generating means for analyzing the document data acquired by the document data input means and generating character string data representing character strings;
feature vector generation means for detecting features of the character string data and generating a feature vector for each document;
classifying means for classifying the document data acquired by the document data input means into the number of categories received by the category number specifying means, based on the feature vector of each document;
calculating means for calculating, from the group of documents classified into each category by the classifying means, the frequency of occurrence of the words constituting the character string data; and
category name determining means for determining the name of each category based on the words with a high frequency of occurrence calculated by the calculating means.

A document processing apparatus comprising:
document data input means for acquiring document data obtained by digitizing a document;
category number specifying means for receiving the number of categories into which the document data is to be classified;
character string data generating means for analyzing the document data acquired by the document data input means and generating character string data representing character strings;
feature vector generation means for detecting features of the character string data and generating a feature vector for each document;
classifying means for classifying the document data acquired by the document data input means into the number of categories received by the category number specifying means, based on the feature vector of each document;
calculating means for calculating, from the group of documents classified into each category by the classifying means, the frequency of occurrence of the words constituting the character string data;
display means for displaying the words with a high frequency of occurrence calculated by the calculating means;
selected word receiving means for receiving a selection of a word displayed by the display means; and
category name determining means for determining the name of each category based on the word received by the selected word receiving means.

The document processing apparatus according to claim 1, further comprising storage means for storing a predetermined word and a related word related to it in association with each other,
wherein the category name determining means refers to the storage means to detect a related word corresponding to a high-frequency word calculated by the calculating means, and determines each category name based on both the detected related word and the high-frequency word.

The document processing apparatus according to claim 2, further comprising storage means for storing a predetermined word and a related word related to it in association with each other,
wherein the display means refers to the storage means to detect a related word corresponding to a high-frequency word calculated by the calculating means, and displays both the detected related word and the high-frequency word,
the selected word receiving means receives a selection from among the word and the related word displayed by the display means, and
the category name determining means determines each category name based on the word or related word received by the selected word receiving means.

The document processing apparatus according to claim 1, wherein the feature vector generation means determines, for each of a plurality of predetermined words, its frequency of occurrence in the character string data of each document, and generates the feature vector for each document based on the determination result.