JP2006164028A - Window display device and its method

Info

Publication number: JP2006164028A
Application number: JP2004356762A
Authority: JP
Inventors: Shuichi Morisawa; 秀一森澤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2004-12-09
Filing date: 2004-12-09
Publication date: 2006-06-22

Abstract

PROBLEM TO BE SOLVED: Conventional systems can display the list of documents corresponding to a category when the category is selected, but cannot perform the inverse operation: when a document is selected, they cannot trace from the document side and display all the categories to which the document belongs.

SOLUTION: The device is provided with a document classification result file that records the correspondence between the categories of all category systems and the documents belonging to them. The classification result file is searched in the same manner whether a category or a document is selected.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a method of presenting two mutually related groups of objects in a window system, and in particular to a graphical user interface that applies this method to the relationship between a category group and a document group in an automatic document classification apparatus that classifies input documents into given categories.

There have been apparatuses capable of classifying documents into categories and searching them (for example, Patent Document 1).
Patent Document 1: Japanese Patent Laid-Open No. 10-027125

Consider an automatic document classification system that sorts input documents into a group of categories defined by the user. To display the current classification status of a given document set, the display window is, for example, split into left and right halves. The left half lists the categories of the category group, any one of which can be selected with a mouse, keyboard, or the like, and the right half lists all documents classified into the category selected on the left.

Because documents are multifaceted and category boundaries are ambiguous, a document does not necessarily belong to exactly one category. If the automatic classification system may fail to narrow a document down to a single category and instead assign it to several, the correspondence between the category group and the document group becomes many-to-many. That is, fixing one category generally determines several documents belonging to it, and conversely fixing one document generally determines several categories to which it belongs.

Even in a system that does not allow membership in multiple categories and is designed so that every document belongs to exactly one category, a single document still corresponds to multiple categories if multiple category groups are defined.

Here, "multiple category groups" means that multiple sets of categories are prepared within one system and the same document group is sorted and held in multiple ways. This is useful, for example, when several users share one system and each wants to classify with a different category system, or when a single user wants several classification systems based on different viewpoints to coexist.

In such a situation, a conventional system can display all documents belonging to a given category, but has no means of simultaneously displaying all categories to which a given document belongs.

When multiple category systems are defined, the user can grasp the classification status in the category system currently in focus. To see how the document currently in focus is classified in a different category system, however, the only option is to switch focus to that other category system, whereupon the classification status in the original category system is no longer visible. In other words, to find out which category a document in a certain category is assigned to in a category system based on another viewpoint, the user conventionally had to search for it among the documents belonging to each category of that other system. This is unavoidable as long as the display is a hierarchical tree that starts from categories rather than a document-centered display. Conversely, a display that starts from individual documents makes it easy to see which category each document is assigned to in a given category system, but is inconvenient for grasping trends across the category system as a whole.

Even within a single category system, when a document is classified into multiple categories, it was not possible to select the document in the right window and have all the categories to which it belongs displayed in the left window.

The present invention solves the above problems. A conventional document classification system first selects one category in one category system and then displays the list of documents belonging to it. In the present invention, by contrast, all corresponding categories can be referred to simultaneously starting from a document, regardless of category system.

That is, when a document displayed on the document display means is selected, all categories to which it belongs are displayed correspondingly in the category display means. Even in a system that has multiple category systems, or in which a document may belong to more than one category, all categories to which the document belongs can thus be presented exhaustively and simultaneously.

Displaying all categories to which a document belongs across multiple category systems in this way is effective when the user wants to consult how other users have classified the document, or to grasp the classification status comprehensively from different viewpoints.

For example, during initial training of the system, even the creator of the categories may be unsure which category a given document should go into. Other users may already have assigned categories to the same document under different category systems. Knowing the categories assigned by other users can then serve as a reference for classification within one's own category system.

To solve the above problems, the present invention provides a window-type display device for a system having two object sets whose objects correspond many-to-many, in which any number of objects of each set can be displayed in two windows or in the two halves of one split window. With one window called window 1 and the other window 2, selecting an object in window 1 displays all corresponding objects in window 2, and conversely selecting an object in window 2 displays all corresponding objects in window 1.

In particular, the invention provides a display method for an automatic document classification system that sorts input documents into a group of categories defined by the user. To display the current classification status of a given document set, the window-type display device can display the user-defined categories in window 1 and the documents classified into the corresponding category in window 2, and when a document displayed in window 2 is selected, all categories corresponding to it are highlighted.

Conventional automatic document classification systems could also display the list of documents corresponding to a category when that category was selected. When a document was selected as the reverse operation, however, they did not search a classification result file such as that of FIG. 11, so it was impossible to trace from the document side and display all categories to which the document belongs.

According to the present invention, however, a document classification result file is provided that records the correspondence between categories and their member documents across all category systems, and the classification result file is searched in the same manner whichever of a category or a document is selected. It is therefore possible not only to refer to all corresponding documents starting from a category, but also to perform the reverse operation, that is, to view all corresponding categories starting from a document.

As described in the embodiment, displaying all categories to which a document belongs across multiple category systems is effective when the user wants to consult how other users have classified the document, or to grasp the classification status comprehensively from different viewpoints.

For example, during initial training of the system, even the creator of the categories may be unsure which category a given document should go into. Other users may already have assigned categories to the same document under different category systems. Knowing the categories assigned by other users can then serve as a reference for classification within one's own category system, and similar effects can be expected.

(Example 1)
An embodiment of the present invention will be described with reference to flowcharts and other drawings.

FIG. 1 shows the overall configuration of this system.

FIG. 2 shows in detail the configuration of the display unit in FIG. 1.

FIG. 3 is a flowchart of the learning process in this system. The learning process analyzes learning documents whose classification destinations have been specified in advance by the user, and creates the dictionaries needed for classification. First, each learning document is morphologically analyzed and segmented into words. Common nouns, proper nouns, sa-hen nouns (verbal nouns), and unknown words (words not in the morpheme dictionary) are then picked up; only words whose total frequency of occurrence across all learning documents is high enough to be reliable are kept, and the rest are discarded. So that only words that appear disproportionately in particular categories become effective word candidates, a quantity called the degree of localization is computed for each remaining word. First, let $P_{ij}$ be the proportion of the learning documents belonging to category $C_j$ that contain word $W_i$, that is, the number of documents that belong to $C_j$ and contain $W_i$ divided by the number of documents belonging to $C_j$. For every word, these values are normalized so that their sum over all categories is 1: with the categories denoted $C_1, C_2, \ldots, C_m$, $\sum_{j=1}^{m} P_{ij} = 1$. The degree of localization $E(W_i)$ of word $W_i$ is defined as $E(W_i) = 1 + \sum_{j=1}^{m} P_{ij} \log_m P_{ij}$. The N words with the largest degrees of localization are selected as effective word candidates (process 301 in the flowchart of FIG. 3).
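The localization computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `degree_of_localization`, the representation of a document as a set of words, and the handling of the degenerate cases (fewer than two categories, or a word that never occurs) are all assumptions introduced here.

```python
import math

def degree_of_localization(word, docs_by_category):
    """Return E(W_i) = 1 + sum_j P_ij * log_m(P_ij) for `word`.

    docs_by_category maps a category name to a list of documents,
    each document being a set of words (an assumed representation).
    """
    m = len(docs_by_category)
    if m < 2:
        return 0.0  # assumption: log base m is undefined for m < 2
    # Raw ratio per category: documents containing the word / documents in the category.
    ratios = []
    for docs in docs_by_category.values():
        containing = sum(1 for doc in docs if word in doc)
        ratios.append(containing / len(docs) if docs else 0.0)
    total = sum(ratios)
    if total == 0:
        return 0.0  # assumption: a word that never occurs gets 0
    # Normalize so that the P_ij sum to 1 over all categories.
    p = [r / total for r in ratios]
    # Terms with P_ij = 0 are skipped (0 * log 0 is taken as 0).
    return 1.0 + sum(pj * math.log(pj, m) for pj in p if pj > 0)

corpus = {
    "politics": [{"election", "budget"}, {"election"}],
    "economy": [{"budget"}, {"market"}],
}
```

With this definition a word concentrated in a single category scores 1, and a word spread evenly over all categories scores 0, which matches the stated intent of keeping words that are biased toward particular categories.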

Next, to obtain distances between words, a vector called a semantic vector is defined for each effective word. The axes of the vector are the effective words themselves, and each component is the co-occurrence probability between the effective word in question and the effective word serving as the axis. Here, the co-occurrence probability of word $W_i$ with respect to word $W_j$ is defined as the ratio of the number of documents containing both $W_i$ and $W_j$ to the number of documents containing $W_i$. FIG. 5 shows the co-occurrence probabilities; according to it, the vector representation of the word "trading company", for example, is
(1.0 0.2 0.1 0.6 0.8 0.0 0.0 0.3 ...)
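As a rough illustration of the co-occurrence definition above, the following sketch builds the semantic vector of one word. The function name `semantic_vector` and the set-of-words document representation are assumptions made for this example only.

```python
def semantic_vector(word, effective_words, documents):
    """Co-occurrence vector of `word` over the axes `effective_words`.

    Each component is (documents containing both word and axis)
    divided by (documents containing word).
    documents is a list of sets of words (an assumed representation).
    """
    with_word = [doc for doc in documents if word in doc]
    if not with_word:
        # Assumption: a word absent from every document gets a zero vector.
        return [0.0] * len(effective_words)
    return [
        sum(1 for doc in with_word if axis in doc) / len(with_word)
        for axis in effective_words
    ]

documents = [{"trading", "export"}, {"trading"}, {"export"}]
axes = ["trading", "export"]
```

Note that the component on a word's own axis is always 1.0, consistent with the leading 1.0 in the example vector above, and that the measure is not symmetric in general because the denominator depends on which word is taken as the base.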

An effective word dictionary is then created from this information (process 302 in the flowchart of FIG. 3). FIG. 6 shows an example of the effective word dictionary, namely the entry whose headword is "trading company". The membership count indicates, for each category, how many documents containing the word belong to it, and the occurrence count is the total number of learning documents containing the word.

Once the effective word dictionary has been created, a vector representation of each learning document can be defined by computing the weighted average of the semantic vectors of the effective words it contains. A representative vector for each category is then determined by averaging the vectors of all learning documents belonging to that category (process 303 in the flowchart of FIG. 3).
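The two averaging steps can be sketched as below. The text does not say how the weighted average is normalized; dividing by the sum of the weights is an assumption of this sketch, as are the function names and the default weight of 1.0 for words without an explicit weight.

```python
def weighted_average(vectors, weights):
    """Component-wise weighted average (normalized by the sum of weights)."""
    if not vectors:
        raise ValueError("no vectors to average")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    dim = len(vectors[0])
    return [
        sum(w * v[k] for v, w in zip(vectors, weights)) / total
        for k in range(dim)
    ]

def document_vector(words, semantic_vectors, word_weights):
    """Average the semantic vectors of the effective words found in a document."""
    present = [w for w in words if w in semantic_vectors]
    return weighted_average(
        [semantic_vectors[w] for w in present],
        [word_weights.get(w, 1.0) for w in present],
    )

def category_representative(document_vectors):
    """Plain (unweighted) average of the document vectors in one category."""
    return weighted_average(document_vectors, [1.0] * len(document_vectors))

vectors = {"election": [1.0, 0.0], "market": [0.0, 1.0]}
```

Words that are not in the effective word dictionary are simply ignored here, which mirrors the step of picking up only effective words from a document.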

The weight of an effective word is defined in view of two points: how effective the word itself is for the act of classification, and how important a position the word occupies in each document. The first viewpoint expresses the degree of belonging to each category: the more strongly an effective word characterizes a particular category, the heavier its weight, and the degree of localization described above is used for this. The second viewpoint evaluates how the effective word is used in the target document and how it relates to the content of the document. Evaluation items are prepared in advance that focus on the position where the effective word appears and on its linguistic role, such as its case role and modification type, and the weight given when an effective word satisfies the condition of each evaluation item is determined by the iterative learning shown in processes 304 to 310 of the flowchart of FIG. 3. At the start of learning, all weights are initialized to 1.

For the weights based on the second viewpoint, the value for each evaluation item is stored in a dictionary. At classification time this dictionary is consulted together with the effective word dictionary, which records the weight based on the first viewpoint, that is, the degree of localization of each effective word, and the total weight of each effective word is computed. FIG. 7 shows an example of a weight dictionary storing weights whose evaluation items are the positions where effective words appear, and FIG. 8 shows an example of a weight dictionary storing weights whose evaluation items are linguistic roles of effective words such as case role and modification type.

Next, the weight learning mentioned above is described. Each learning document is classified with reference to the effective word dictionary and category representative vectors created so far and to the weight dictionaries recording the current weights; the weight for each evaluation item is then finely adjusted and classification is attempted again. This process is repeated to determine the final weights (processes 304 to 305 in the flowchart of FIG. 3). FIG. 4 is a flowchart explaining the classification process of 305.

First, as at the beginning of the learning process, the document to be classified is morphologically analyzed again, and the effective words it contains are picked up by consulting the effective word dictionary (process 401 in the flowchart of FIG. 4). Note that FIG. 4 describes the processing for classifying documents other than learning documents after weight learning has finished. When classification is executed within the weight learning process described here, there is of course no need to run morphological analysis again; the result of the morphological analysis performed at the beginning of the learning process may be saved and used as the result of this step.

Next, the weight of each picked-up effective word is computed (process 402 in the flowchart of FIG. 4). The weight based on the first viewpoint uses the degree of localization in the effective word dictionary, the weight based on the second viewpoint is obtained by consulting the two weight dictionaries of FIGS. 7 and 8, and these values are combined to give the total weight.

Next, the semantic vector is obtained from the effective word dictionary (process 403 in the flowchart of FIG. 4). Once this has been done for all effective words picked up from the document to be classified, the semantic vectors of the effective words are averaged with the weights computed in 402 to obtain the document vector of the document to be classified (process 404 in the flowchart of FIG. 4).

The distance between the representative vector of each category obtained in process 303 of FIG. 3 and the document vector of the document to be classified is then computed (process 405 in the flowchart of FIG. 4). As is common practice, the cosine of the angle between the two vectors, obtained by computing their inner product, is used as the distance between them. The category at the closest distance (with a cosine measure, the category with the largest cosine value) is determined to be the classification destination category of the document (process 406 in the flowchart of FIG. 4).
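Processes 405 and 406 can be illustrated with the short sketch below. It is only a sketch under stated assumptions: the names `cosine` and `classify` are introduced here, zero-length vectors are given a cosine of 0.0 by convention, and "closest" is interpreted as "largest cosine value".

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors, computed from the inner product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

def classify(doc_vec, representatives):
    """Return the category whose representative vector is closest to doc_vec.

    representatives maps category name -> representative vector.
    """
    return max(representatives, key=lambda c: cosine(doc_vec, representatives[c]))

representatives = {"politics": [1.0, 0.0], "economy": [0.0, 1.0]}
```

For instance, `classify([0.9, 0.1], representatives)` selects "politics", because that representative vector forms the smaller angle with the document vector.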

The above is the classification processing performed at 305 in the flowchart of FIG. 3. Returning to process 306 of FIG. 3, the description of weight learning continues.

Since the correct category into which the user wants each learning document classified is determined in advance, it is compared with the classification destination category obtained in process 305. If the two categories match, the weights are deemed not to need tuning; no further processing is done for that learning document, and processing moves on to the next learning document in loop 304. If they do not match, the classification system chose the wrong destination, and the weights in the weight dictionaries are adjusted as follows. First, for each effective word contained in the learning document, the semantic vector obtained in 403 of the classification process is referenced, its distances to the correct category and to the classification destination category are computed, and it is determined which of the two representative vectors the semantic vector is closer to (process 308 in the flowchart of FIG. 3). If it is closer to the classification destination category, then, in order to bring the document vector of the document closer to the correct category, the two weight dictionaries are modified so that the weights of the evaluation items that the effective word satisfies are decreased by a small amount (process 309 in the flowchart of FIG. 3). Conversely, if it is closer to the correct category, the evaluation items that the effective word satisfies are regarded as having contributed to correct classification, and the weight dictionaries are modified so that their weights are increased by a small amount (process 310 in the flowchart of FIG. 3). This is done for all effective words picked up from the learning document (loop 307 in the flowchart of FIG. 3) to adjust the weights.
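The adjustment rule of processes 307 to 310 for one misclassified learning document might look like the following sketch. Several details are assumptions made here because the text leaves them open: the step size of 0.01 standing in for the "small amount", leaving weights unchanged on an exact tie, merging the two weight dictionaries into a single mapping, and all function and parameter names.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def adjust_weights(weights, word_items, word_vectors,
                   correct_rep, predicted_rep, step=0.01):
    """Adjust evaluation-item weights after a misclassification.

    weights:      evaluation item -> current weight (modified in place)
    word_items:   effective word -> evaluation items that the word satisfies
    word_vectors: effective word -> semantic vector
    """
    for word, items in word_items.items():
        to_correct = cosine(word_vectors[word], correct_rep)
        to_predicted = cosine(word_vectors[word], predicted_rep)
        if to_correct == to_predicted:
            continue  # assumption: ties leave the weights unchanged
        # Closer to the wrong (predicted) category: decrease (process 309).
        # Closer to the correct category: increase (process 310).
        delta = step if to_correct > to_predicted else -step
        for item in items:
            weights[item] = weights.get(item, 1.0) + delta
    return weights

weights = {"in_title": 1.0, "in_body": 1.0}
word_items = {"election": ["in_title"], "market": ["in_body"]}
word_vectors = {"election": [1.0, 0.0], "market": [0.0, 1.0]}
adjusted = adjust_weights(weights, word_items, word_vectors,
                          correct_rep=[1.0, 0.0], predicted_rep=[0.0, 1.0])
```

In this toy run "election" points toward the correct category, so the weight of its evaluation item rises slightly, while "market" points toward the wrongly predicted category, so the weight of its item falls.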

The processing described above is repeated with every learning document treated as a classification target (loop 304 in the flowchart of FIG. 3), and the whole series of learning documents is run through several times so that the weights become optimal, finally completing the weight dictionaries. The learning process is thereby complete.

General documents other than the learning documents are classified with reference to the dictionaries created above, and the procedure is exactly the same as the classification step described within the weight learning process. The only difference is whether the document to be classified is a learning document with a given correct category or a general document whose correct category is unknown.

In process 406 of the flowchart of FIG. 4, a single classification destination category is determined and output for the document to be classified. However, there may be several categories whose representative vectors happen to be at the same distance from the document vector. Some systems may also be required to output multiple classification destination categories. The latter requirement can be met, for example, by treating a category as a correct answer whenever the distance between its representative vector and the document vector of the document to be classified exceeds a predetermined threshold.
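Both situations, tied categories and threshold-based multiple output, can be sketched as follows. The threshold value, the tolerance used to detect ties, and the function names are illustrative assumptions; the text only states that a predetermined threshold is used.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def categories_above_threshold(doc_vec, representatives, threshold):
    """All categories whose cosine value with doc_vec exceeds the threshold."""
    return sorted(
        c for c, rep in representatives.items()
        if cosine(doc_vec, rep) > threshold
    )

def best_categories(doc_vec, representatives, tol=1e-9):
    """All categories tied (within tol) for the largest cosine value."""
    scores = {c: cosine(doc_vec, rep) for c, rep in representatives.items()}
    top = max(scores.values())
    return sorted(c for c, s in scores.items() if top - s <= tol)

representatives = {
    "politics": [1.0, 0.0],
    "economy": [0.0, 1.0],
    "incidents": [1.0, 1.0],
}
```

With a threshold of 0.7, the document vector `[1.0, 0.9]` is assigned to both "incidents" and "politics" but not to "economy", so a single document ends up with more than one classification destination.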

After general documents have been classified in this way, the user judges whether the category determined by the system is correct and, if necessary, specifies the category he or she considers correct so that the system can learn from it. The classification results must therefore be displayed in an easily understandable way. The display method proposed in the present invention is described next.

FIG. 9 is an example in which the classification categories defined on the automatic document classification system by the system administrator, end users of the system, or the like are displayed on the left side of the display window of the present invention. FIG. 10 shows a category definition file, an example of a system file that stores the categories and category systems defined by the system administrator or others. Each entry is a triple of values: the category system ID, the number of categories defined in that category system, and the list of category IDs defined in that category system.

One classification system may be used by several end users, and each user may want to classify a given document group under a separate category system, so multiple category systems generally exist on the system. Here the category systems are distinguished by names such as category set A and category set B. The figure shows that category set A defines eight categories, "politics", "economy", "judiciary", "education", "medical care", "literature", "academia", and "incidents", and that category set B defines six categories, "people", "things", "entertainment", "culture", "current affairs", and "other".
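The category definition file can be pictured as a list of triples, as in the sketch below. The in-memory layout, the use of English labels as category IDs, and the helper name `categories_in` are assumptions for illustration; the actual file format of FIG. 10 is not reproduced here.

```python
# Each entry: (category system ID, number of categories, list of category IDs).
category_definitions = [
    ("A", 8, ["politics", "economy", "judiciary", "education",
              "medical care", "literature", "academia", "incidents"]),
    ("B", 6, ["people", "things", "entertainment", "culture",
              "current affairs", "other"]),
]

def categories_in(system_id, definitions):
    """Return the category IDs defined in one category system."""
    for sid, count, category_ids in definitions:
        if sid == system_id:
            # Sanity check: the stored count should agree with the list.
            if count != len(category_ids):
                raise ValueError(f"count mismatch in category system {sid}")
            return category_ids
    raise KeyError(system_id)
```

Storing the count alongside the list is redundant in memory, but it follows the three-value entry described above and allows a simple consistency check when the file is read.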

FIG. 11 is an example of the classification result file. For the document group on which this system has performed automatic classification, this file records which category of which category system each document was classified into, and it is updated every time classification is executed on a document. Each entry is a triple with three attributes: a "document ID attribute" that uniquely identifies a document stored in the system, a "category system attribute" giving the ID of the category system concerned, and a "category attribute" giving the ID of the category to which the document belongs within that category system.

When any category in FIG. 9 is selected with a mouse, keyboard, or the like, the display control device consults the classification result file of FIG. 11, searches for entries whose category attribute equals the selected category, obtains the document IDs of those entries, and displays the document titles.

FIG. 12 shows an example of the screen at that time: the display when the "politics" category of category set A is selected from the category list shown in FIG. 9. The right side of the window displays the list of documents classified into the category selected on the left side; in this example, the titles of all documents belonging to the "politics" category are displayed.

Next, when any document is selected with the mouse or keyboard from the list of document titles displayed in the right window of FIG. 12, the display control device refers to the classification result file of FIG. 11 again, searches for all entries whose document ID attribute equals the selected document, and obtains the pair of category system attribute and category attribute values from each such entry. It then highlights, in the category list displayed in the left window, only the categories that correspond to the obtained pairs.
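The reverse lookup, from a selected document to its categories, can be sketched in the same way. The sample entries, the document IDs, and the function names are hypothetical and only illustrate the search described above.

```python
# Hypothetical (document ID, category system, category) entries.
ENTRIES = [
    ("D001", "A", "politics"),
    ("D001", "A", "economy"),
    ("D001", "B", "current affairs"),
    ("D002", "A", "politics"),
]


def categories_for_document(entries, doc_id):
    """Collect every (category system, category) pair recorded for one document."""
    return [(system, cat) for d, system, cat in entries if d == doc_id]


def rows_to_highlight(category_list, pairs):
    """Return the rows of the displayed category list that match the pairs."""
    wanted = set(pairs)
    return [row for row in category_list if row in wanted]


category_list = [("A", "politics"), ("A", "economy"), ("A", "education"),
                 ("B", "entertainment"), ("B", "current affairs")]
pairs = categories_for_document(ENTRIES, "D001")
print(rows_to_highlight(category_list, pairs))
```

Both directions read the same entries, which is the point of keeping a single classification result file that covers all category systems.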

Conventional automatic document classification systems could also display the list of documents corresponding to a selected category. When a document was selected as the reverse operation, however, they did not search a classification result file such as that of FIG. 11, so it was impossible to start from the document and display all the categories to which it belongs. The present invention provides a document classification result file that records the correspondence between categories and their documents across all category systems, and searches that file in the same way whether a category or a document is selected. As a result, the user can not only view all documents corresponding to a category but also perform the operation in the opposite direction, that is, view all categories corresponding to a document.

FIG. 13 shows an example of the screen at this point. When the document titled “The Future of Structural Reform” is selected from the document list, the categories into which that document is classified are highlighted in the left window. The document is classified into two categories, “politics” and “economy”, in category set A, and into one category, “current affairs”, in category set B, so three categories are highlighted in total. However, unless the category currently in focus, the “politics” category of category set A, is displayed differently from the other two, it becomes unclear which category the documents listed in the right window belong to. The “politics” category is therefore distinguished from the other matching categories, for example by changing its display color or underlining it.
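The display rule in this example, three matching categories of which the one currently in focus is marked differently, can be sketched as a small style assignment. The style names `"focused"`, `"highlighted"`, and `"normal"` and the function name are assumptions made for illustration; the source mentions only options such as a different display color or an underline.

```python
def category_styles(category_list, matched, focused):
    """Assign a display style to each (category system, category) row.

    matched: pairs to which the selected document belongs.
    focused: the pair whose documents are listed in the right window.
    """
    matched = set(matched)
    styles = {}
    for row in category_list:
        if row == focused:
            styles[row] = "focused"       # e.g. different color or underline
        elif row in matched:
            styles[row] = "highlighted"
        else:
            styles[row] = "normal"
    return styles


category_list = [("A", "politics"), ("A", "economy"), ("A", "education"),
                 ("B", "current affairs"), ("B", "other")]
matched = [("A", "politics"), ("A", "economy"), ("B", "current affairs")]
styles = category_styles(category_list, matched, focused=("A", "politics"))
```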

Note that this operation method remains compatible with the conventional function for displaying document contents. To display the full text of a selected document, the user double-clicks the document or clicks the right mouse button on it to start a viewer. FIGS. 14 and 15 show the screens displayed by this operation, ending with the contents of the document.

(Example 2)
In Example 1, a single classification result file is searched in both cases: when a category is selected and the list of corresponding documents is displayed, and when, as the reverse operation, a document is selected and all the categories to which it belongs are displayed. In FIG. 11, each entry of the classification result file is keyed by the document ID attribute, so finding the categories that correspond to a selected document is fast, whereas retrieving all the documents that correspond to a selected category is a comparatively time-consuming process.

A second classification result file such as that of FIG. 16 is therefore created separately and is updated together with the classification result file of FIG. 11 every time classification is performed on a document. FIG. 16 has the same contents as FIG. 11, with the entries rearranged so that they are keyed by the category attribute.
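The idea of keeping two views of the same classification results, one keyed by document and one keyed by category, and updating both at the same time, can be sketched as follows. This sketch swaps in in-memory hash maps for the two files; the class and method names are hypothetical, and replacing a document's earlier assignments on reclassification is an assumption.

```python
from collections import defaultdict


class ClassificationIndex:
    """Two synchronized views of the same (document, category) relation."""

    def __init__(self):
        self.by_document = defaultdict(set)  # document-keyed view
        self.by_category = defaultdict(set)  # category-keyed view

    def record(self, doc_id, assignments):
        """Store a classification result, updating both views together."""
        # Drop the document's earlier assignments from both views.
        for pair in self.by_document.pop(doc_id, set()):
            self.by_category[pair].discard(doc_id)
        for pair in assignments:
            self.by_document[doc_id].add(pair)
            self.by_category[pair].add(doc_id)

    def documents_in(self, category_system, category):
        """Category selected: return the corresponding document IDs."""
        return sorted(self.by_category.get((category_system, category), set()))

    def categories_of(self, doc_id):
        """Document selected: return the corresponding category pairs."""
        return sorted(self.by_document.get(doc_id, set()))
```

With both views available, each direction of lookup becomes a direct access by key instead of a scan over all entries, at the cost of storing the relation twice and keeping the two copies consistent.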

(Other examples)
The present invention described above can be embodied as, for example, a system, an apparatus, a method, a program, or a storage medium. Specifically, it may be applied to a system composed of a plurality of devices, or to an apparatus consisting of a single device.

The present invention also covers the case where it is achieved by supplying a software program that implements the functions of the embodiments described above (in the embodiments, a program corresponding to the flowcharts shown in the figures) to a system or apparatus, directly or remotely, and having a computer of that system or apparatus read out and execute the supplied program code.

Accordingly, the program code itself that is installed on a computer in order to implement the functional processing of the present invention on that computer also implements the present invention. In other words, the present invention also includes the computer program itself for implementing its functional processing.

In that case, as long as it has the functions of the program, it may take the form of object code, a program executed by an interpreter, script data supplied to an OS, or the like.

Recording media for supplying the program include, for example, a floppy (registered trademark) disk, a hard disk, an optical disk, a magneto-optical disk, an MO, a CD-ROM, a CD-R, a CD-RW, a magnetic tape, a nonvolatile memory card, a ROM, and a DVD (DVD-ROM, DVD-R).

As another method of supplying the program, a browser on a computer can be used to connect to a website on the Internet, and the computer program of the present invention itself, or a compressed file that includes an automatic installation function, can be downloaded from that website to a recording medium such as a hard disk. The program can also be supplied by dividing the program code that constitutes it into a plurality of files and downloading each file from a different website. In other words, a server that allows a plurality of users to download the program files for implementing the functional processing of the present invention on a computer is also included in the present invention.

The program of the present invention can also be encrypted, stored on a storage medium such as a CD-ROM, and distributed to users. Users who satisfy predetermined conditions are allowed to download key information for decryption from a website via the Internet, and by using that key information they can execute the encrypted program and install it on a computer.

The functions of the embodiments described above are implemented when the computer executes the program it has read out. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiments can also be implemented by that processing.

Furthermore, after the program read out from the recording medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on that board or in that unit may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiments described above are also implemented by that processing.

FIG. 1 is a diagram showing the overall configuration of the system according to the present invention.
FIG. 2 is a diagram showing the configuration of the display unit according to the present invention.
FIG. 3 is a flowchart showing the flow of the learning process according to the present invention.
FIG. 4 is a flowchart showing the flow of the classification process according to the present invention.
FIG. 5 is a diagram showing an example of co-occurrence probabilities between words.
FIG. 6 is a diagram showing an example of the effective word dictionary.
FIG. 7 is an example of a weight dictionary that uses the positional role of effective words as the evaluation item.
FIG. 8 is an example of a weight dictionary that uses the linguistic role of effective words as the evaluation item.
FIG. 9 is an example in which a category list is displayed in the display window according to the present invention.
FIG. 10 is a diagram showing an example of the category definition file.
FIG. 11 is a diagram showing an example of the classification result file keyed by document ID.
FIG. 12 is a diagram showing an example of the screen when a category is selected in the display window.
FIG. 13 is a diagram showing an example of the screen when a document is selected in the display window.
FIG. 14 is a diagram showing an example of the screen when a document is right-clicked in the display window.
FIG. 15 is a diagram showing an example of the screen when the body of a document is displayed in the display window.
FIG. 16 is a diagram showing an example of the classification result file rearranged to be keyed by category, as used in another example.

Claims

1. A window-type display device for a system in which there are two object sets whose objects correspond to each other many-to-many and in which an arbitrary number of objects from each object set can be displayed in two windows or in one divided window, wherein, with one window designated window 1 and the other window designated window 2, when an object in window 1 is selected, all the corresponding objects are displayed in window 2, and conversely, when an object in window 2 is selected, all the corresponding objects in window 1 are displayed.

2. The window-type display device according to claim 1, being an automatic document classification device that displays the current classification status of a given document set in an automatic document classification system that distributes input documents into a group of categories defined by a user, wherein the categories defined by the user are displayed in window 1 and the corresponding documents are displayed in window 2, and when a document displayed in window 2 is selected, all the categories to which that document belongs are highlighted.

3. The automatic document classification device according to claim 2, wherein, when a document belongs to a plurality of categories, the display method is changed between the case where the document is classified into a plurality of categories within the same category system and the case where it belongs to categories in a plurality of different category systems.

4. The automatic document classification device according to claim 2, wherein, when a document belongs to a plurality of categories, the display form of the categories, for example the display color, is changed in order of how suitable each category is as a classification destination for the document.

5. The automatic document classification device according to claim 2, wherein the display of the correspondence between a document displayed in window 2 and the categories displayed in window 1 can be switched between a method that displays the categories within the category system corresponding to the selected document and a method that simultaneously displays the categories across a plurality of category systems.

6. A window-type display method for a system in which there are two object sets whose objects correspond to each other many-to-many and in which an arbitrary number of objects from each object set can be displayed in two windows or in one divided window, wherein, with one window designated window 1 and the other window designated window 2, when an object in window 1 is selected, all the corresponding objects are displayed in window 2, and conversely, when an object in window 2 is selected, all the corresponding objects in window 1 are displayed.

7. A computer program for a system in which there are two object sets whose objects correspond to each other many-to-many and in which an arbitrary number of objects from each object set can be displayed in two windows or in one divided window, the program causing a computer to operate such that, with one window designated window 1 and the other window designated window 2, when an object in window 1 is selected, all the corresponding objects are displayed in window 2, and conversely, when an object in window 2 is selected, all the corresponding objects in window 1 are displayed.

8. A computer-readable storage medium storing the computer program according to claim 7.