JP4356640B2 - Document management device

Info

Publication number: JP4356640B2
Application number: JP2005101361A
Authority: JP
Inventors: 健士西村
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2005-03-31
Filing date: 2005-03-31
Publication date: 2009-11-04
Anticipated expiration: 2025-03-31
Also published as: JP2006285390A

Description

The present invention relates to a document management apparatus, and more particularly to a document management apparatus that manages a set of documents organized into a classification hierarchy.

Representative conventional techniques for classifying documents hierarchically are described in Patent Document 1 and Patent Document 2. The invention of Patent Document 1 gives the user means for referring to and reusing classification information generated in the past. The classification information consists of classification labels, classification definitions (for example, Boolean expressions over words), and the classified documents. Structuring the classifications is basically done by the user. In Patent Document 1, category structure information is generated from the classification results and saved; when category structure information is generated later, the saved information can be referred to and modified, so that category structuring can be carried out efficiently. The invention of Patent Document 2 realizes hierarchical classification by recursively invoking a document classification apparatus based on the vector space method. Classification is automated rather than performed by the user.
JP 2003-216622 A; JP 2003-141129 A

However, the method of Patent Document 1, in which the user classifies documents hierarchically while being shown information that assists classification, achieves high accuracy but requires a great deal of labor. Moreover, to maintain accuracy, the hierarchical classification must be performed by a limited number of users who are thoroughly familiar with every field the documents belong to, so the resulting hierarchy may diverge from the hierarchy that the users as a whole have in mind. Patent Document 1 allows category structuring to be carried out efficiently by referring to and modifying category structure information generated in the past, but it cannot automatically add classifications to a given level of the hierarchy and register documents on the basis of the history of search paths that all users followed through the classification hierarchy. The fully automated hierarchical classification of Patent Document 2, on the other hand, requires no classification labor but suffers from low accuracy.
An object of the present invention is to provide an apparatus that classifies documents hierarchically with little labor and high accuracy. Here, high accuracy means being closest to the concept hierarchy averaged over all users.

The document management apparatus of the present invention manages a hierarchically classified group of documents and comprises: document storage means for storing a classification hierarchy and the documents associated with it; interface means for letting a user search for documents interactively along the classification hierarchy; history storage means for storing, as a history, the final classification the user reached when searching along the classification hierarchy through the interface means; registration candidate document storage means for storing, when the user cannot find the intended document by the search, a document that the user wishes to newly register as a registration candidate document in association with the history; and registration candidate document classification means for determining from the history which registration candidate documents fall under the same final classification and making a group of them with high mutual similarity into a new classification.
The registration candidate document classification means may classify the registration candidate documents for each final classification reached by the user to generate new classifications, and add the new classifications under that final classification only when the number of new classifications exceeds a predetermined constant.
The registration candidate document classification means may also classify the registration candidate documents for each final classification reached by the user to generate new classifications and, when the number of new classifications does not exceed a predetermined constant, add the registration candidate document that best represents the new classification as a new document.
The registration candidate document classification means may also classify the registration candidate documents for each classification above the final classification reached by the user to generate new classifications, and add the new classifications under that upper classification only when the number of new classifications exceeds a predetermined constant.
The document management method of the present invention is a method for a document management apparatus that manages a hierarchically classified group of documents, and comprises: a document storage step of storing a classification hierarchy and the documents associated with it; an interface step of letting a user search for documents interactively along the classification hierarchy; a history storage step of storing, as a history, the final classification the user reached when searching along the classification hierarchy in the interface step; a registration candidate document storage step of storing, when the user cannot find the intended document by the search, a document that the user wishes to newly register as a registration candidate document in association with the history; and a registration candidate document classification step of determining from the history which registration candidate documents fall under the same final classification and making a group of them with high mutual similarity into a new classification.
In the registration candidate document classification step, the registration candidate documents may be classified for each final classification reached by the user to generate new classifications, and the new classifications may be added under that final classification only when the number of new classifications exceeds a predetermined constant.
In the registration candidate document classification step, the registration candidate documents may also be classified for each final classification reached by the user to generate new classifications and, when the number of new classifications does not exceed a predetermined constant, the registration candidate document that best represents the new classification may be added as a new document.
In the registration candidate document classification step, the registration candidate documents may also be classified for each classification above the final classification reached by the user to generate new classifications, and the new classifications may be added under that upper classification only when the number of new classifications exceeds a predetermined constant.
The document management program of the present invention controls a document management apparatus that manages a hierarchically classified group of documents, and causes the document management apparatus to execute: a document storage step of storing a classification hierarchy and the documents associated with it; an interface step of letting a user search for documents interactively along the classification hierarchy; a history storage step of storing, as a history, the final classification the user reached when searching along the classification hierarchy in the interface step; a registration candidate document storage step of storing, when the user cannot find the intended document by the search, a document that the user wishes to newly register as a registration candidate document in association with the history; and a registration candidate document classification step of determining from the history which registration candidate documents fall under the same final classification and making a group of them with high mutual similarity into a new classification.
In the registration candidate document classification step of the program, the registration candidate documents may be classified for each final classification reached by the user to generate new classifications, and the new classifications may be added under that final classification only when the number of new classifications exceeds a predetermined constant.
In the registration candidate document classification step of the program, the registration candidate documents may also be classified for each final classification reached by the user to generate new classifications and, when the number of new classifications does not exceed a predetermined constant, the registration candidate document that best represents the new classification may be added as a new document.
In the registration candidate document classification step of the program, the registration candidate documents may also be classified for each classification above the final classification reached by the user to generate new classifications, and the new classifications may be added under that upper classification only when the number of new classifications exceeds a predetermined constant.

According to the document management apparatus of the present invention, when a user cannot find the intended document by searching along the classification hierarchy, the registration candidate document that the user wishes to newly register is stored. This yields the following effects. The first effect is that the labor of hierarchical classification is reduced. The only effort required is for each user to search for the desired document along the classification hierarchy and to register a document when it is not found. New classifications are added to the classification hierarchy automatically. The second effect is that the accuracy of the hierarchical classification is maintained, because the position of a new classification within the existing classification hierarchy is determined by all of the users concerned.

The present invention relates to the field of managing document sets organized into a classification hierarchy. This includes, for example, searching FAQs (Frequently Asked Questions) along a classification hierarchy and adding a new question and answer when the desired question and answer cannot be found.

Next, the best mode for carrying out the invention will be described in detail with reference to the drawings.

FIG. 1 is a diagram explaining the operation of the embodiment of the present invention. The document storage unit 101 stores the classification hierarchy and the documents. The interface unit 102 lets the user search for documents interactively along the classification hierarchy and outputs the search history. If no suitable document is found, the interface unit 102 has the user enter a registration candidate document and outputs that document. The history storage unit 103 stores the user search history output by the interface unit 102. The registration candidate document storage unit 104 stores the registration candidate documents output by the interface unit 102. The registration candidate document classification unit 105 classifies the registration candidate documents and adds the resulting classifications at the appropriate position in the existing classification hierarchy.

Before describing each component of the embodiment shown in FIG. 1, the concepts of the classification hierarchy and of documents are explained with reference to FIG. 9. The first level of the hierarchy consists of two classifications, classification 1 and classification 2. Classification 1 has three lower classifications: classification 11, classification 12, and classification 13. Classification 2 also has three lower classifications: classification 21, classification 22, and classification 12. Classification 13 has two lower classifications: classification 131 and classification 132. Three documents, document 11a, document 11b, and document 11c, are associated with classification 11.
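
The hierarchy described for FIG. 9 can be sketched as a small data structure. The following is a minimal illustration only, not the apparatus's actual representation; the names EDGES, build_children, and parents_of are hypothetical. It shows in particular that classification 12 sits under both classification 1 and classification 2, so the hierarchy is not a strict tree.

```python
from collections import defaultdict

# (classification, upper classification) pairs as described for FIG. 9.
# None marks a first-level classification. Names are illustrative only.
EDGES = [
    ("1", None), ("2", None),
    ("11", "1"), ("12", "1"), ("13", "1"),
    ("21", "2"), ("22", "2"), ("12", "2"),
    ("131", "13"), ("132", "13"),
]

# Documents associated with a classification (only classification 11 is described).
DOCUMENTS = {"11": ["11a", "11b", "11c"]}


def build_children(edges):
    """Map each upper classification to the list of its lower classifications."""
    children = defaultdict(list)
    for node, upper in edges:
        children[upper].append(node)
    return dict(children)


def parents_of(edges, node):
    """Return every upper classification of a node (there may be several)."""
    return sorted(upper for n, upper in edges if n == node and upper is not None)


CHILDREN = build_children(EDGES)
```

Here CHILDREN["1"] lists classifications 11, 12, and 13, while parents_of(EDGES, "12") returns both "1" and "2".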

For example, consider a user who wants to know how to manage e-mail software for a PC. The user selects classification 1, "About mail", and then classification 11, "PC mail", but cannot find a document with the intended content. The user eventually solves the problem by other means and wishes to register a document describing the solution for the benefit of later users. This document is called a registration candidate document and is stored in the registration candidate document storage unit 104. At the same time, the history information "classification 1 → classification 11 → registration candidate document" is stored in the history storage unit 103. In this way, registration candidate documents accumulate in the document management apparatus one after another.
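
The scenario above can be sketched as follows. This is a hedged illustration under assumed names (SearchSession, Store, register_candidate, and the dictionary keys are all hypothetical); it only shows the idea that the path walked by the user is saved as a history row and the new document is linked to that row.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchSession:
    """Records the classifications one user selects, in order."""
    path: List[str] = field(default_factory=list)

    def select(self, classification_id: str) -> None:
        self.path.append(classification_id)

    @property
    def final_classification(self) -> Optional[str]:
        # The last classification reached, or None if nothing was selected.
        return self.path[-1] if self.path else None


@dataclass
class Store:
    """Stand-in for the history store (103) and the candidate store (104)."""
    histories: list = field(default_factory=list)
    candidates: list = field(default_factory=list)

    def register_candidate(self, session: SearchSession, content: str) -> int:
        """Save the search path as a history row and link the candidate to it."""
        history_id = len(self.histories) + 1
        self.histories.append({
            "history_id": history_id,
            "final_classification": session.final_classification,
            "path": list(session.path),
        })
        self.candidates.append({
            "candidate_id": len(self.candidates) + 1,
            "history_id": history_id,
            "content": content,
        })
        return history_id
```

A session that selects "1" and then "11" before registering a document produces a history row whose final classification is "11" and whose path is ["1", "11"].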

The feature of the first embodiment of the present invention is that these accumulated registration candidate documents are classified and a new classification is created under classification 11. For example, as shown in FIG. 9, when a substantial number of registration candidate documents concern "mail software management" and they are clearly distinct from the other new documents, "mail software management" is added to the classification hierarchy as classification 111. The difference from the prior art is that history information, not just document content, is used to create classifications.

FIG. 10 shows a conceptual diagram of the second embodiment. In FIG. 10, unlike FIG. 9, new documents rather than new classifications are added under classification 11. Material that does not form a group large enough to be a classification is better added as a document in this way.

FIG. 11 shows a conceptual diagram of the third embodiment. In FIG. 11, unlike FIG. 9, new classifications are added at the same level as classification 11 rather than under it. Depending on how abstract a classification is, it can be appropriate to add it at the same level as an existing classification in this way.

Next, each component in FIG. 1 is described in detail. FIG. 2 shows a configuration example of the document storage unit 101. Reference numeral 201 denotes the classification hierarchy, a table consisting of a classification ID 203, an upper classification ID 204, and a classification name 205. The classification ID 203 is a numerical value that uniquely identifies a classification. The upper classification ID 204 gives the ID of the classification located above it. A classification can have more than one upper classification ID; in the figure, for example, classification 12 has both classification 1 and classification 2 as upper classifications. The classification name 205 is the name of the classification and is used when displaying it to the user.

Reference numeral 202 denotes the documents, a table consisting of a document ID 206, a classification ID 207, a document name 208, and a document content 209. The document ID 206 is a numerical value that uniquely identifies a document. The classification ID 207 is the ID of the classification to which the document belongs. The document name 208 is the name of the document and is used when displaying it to the user. The document content 209 is the character string that forms the body of the document and is used when displaying it to the user.
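
The two tables of FIG. 2 might be modeled as below. This is only a sketch: the field names, the numeric document IDs, and any row values other than the names "About mail", "PC mail", and "Types of mail software" mentioned in the text are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ClassificationRow:
    classification_id: int        # classification ID 203
    upper_ids: Tuple[int, ...]    # upper classification ID 204 (may hold several)
    name: str                     # classification name 205


@dataclass(frozen=True)
class DocumentRow:
    document_id: int              # document ID 206
    classification_id: int        # classification ID 207
    name: str                     # document name 208
    content: str                  # document content 209


# Hypothetical sample rows; names marked "(placeholder)" are not from the source.
CLASSIFICATIONS = [
    ClassificationRow(1, (), "About mail"),
    ClassificationRow(2, (), "(placeholder)"),
    ClassificationRow(11, (1,), "PC mail"),
    ClassificationRow(12, (1, 2), "(placeholder)"),
]
DOCUMENTS = [
    DocumentRow(1, 11, "Types of mail software", "(placeholder body)"),
]


def lower_classifications(rows: List[ClassificationRow], upper_id: int):
    """Rows whose upper classification IDs include upper_id."""
    return [r for r in rows if upper_id in r.upper_ids]


def documents_in(rows: List[DocumentRow], classification_id: int):
    """Documents directly associated with a classification."""
    return [d for d in rows if d.classification_id == classification_id]
```

Keeping the upper IDs as a tuple is one simple way to let classification 12 appear under both classification 1 and classification 2.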

FIG. 3 shows a configuration example of the interface unit 102. Reference numeral 301 shows an example of screen transitions. Screen 304 displays classification 1 and its lower classifications, classification 11, classification 12, and classification 13. When "PC mail", which is classification 11, is selected on screen 304 with a pointing device such as a mouse, the names of the documents belonging to classification 11 are displayed as on screen 305. When "Types of mail software", which is document 11a, is selected on screen 305 with a pointing device such as a mouse, the name and content of document 11a are displayed as on screen 306.

The screen control unit 302 controls these screen transitions and sends the IDs of the classifications and documents selected by the user to the history acquisition unit 303. If the document the user wants does not exist, the user selects the New button 305a on screen 305 with a pointing device such as a mouse. The display then changes to a screen for entering a registration candidate document, from which the user can register the document. The screen control unit 302 sends the registration candidate document to the registration candidate document acquisition unit 307.

FIG. 4 shows a configuration example of the history storage unit 103. The history storage unit 103 is a table consisting of a history ID 402, a final classification 403, a history 404, and an arrival time 405. The history ID 402 is a numerical value that uniquely identifies a history. The final classification 403 is the classification ID of the classification the user finally reached. The history 404 is the path of classifications leading to the final classification 403. As in the third and fifth rows of table 103, for example, the path can differ even when the final classification is the same. The arrival time 405 is the time at which the user reached the final classification 403.

FIG. 5 shows a configuration example of the registration candidate document storage unit 104. The registration candidate document storage unit 104 is a table consisting of a registration candidate document ID 502, a history ID 503, and a document content 504. The registration candidate document ID 502 is a numerical value that uniquely identifies a registration candidate document. The history ID 503 is the history ID of the path the user followed before newly writing the registration candidate document. The document content 504 is the character string that forms the body of the registration candidate document.

FIG. 6 shows the processing flow of the registration candidate document classification unit 105. In step 601, the registration candidate documents are acquired from the registration candidate document storage unit 104. In step 602, the documents acquired in step 601 are grouped by final classification 403. Because each registration candidate document carries a history ID 503, its final classification 403 is determined by matching that ID against the history ID 402 in the history storage unit 103.
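
Steps 601 and 602 amount to a join between the two tables followed by a grouping. The sketch below assumes dictionary rows with hypothetical key names and made-up sample data; it is one way to express the lookup, not the implementation described in the patent.

```python
from collections import defaultdict

# Hypothetical rows of the history table (FIG. 4); arrival time omitted.
HISTORIES = [
    {"history_id": 1, "final_classification": 11, "path": [1, 11]},
    {"history_id": 2, "final_classification": 12, "path": [1, 12]},
    {"history_id": 3, "final_classification": 12, "path": [2, 12]},
]

# Hypothetical rows of the registration candidate table (FIG. 5).
CANDIDATES = [
    {"candidate_id": 1, "history_id": 1, "content": "..."},
    {"candidate_id": 2, "history_id": 3, "content": "..."},
    {"candidate_id": 3, "history_id": 2, "content": "..."},
]


def group_by_final_classification(candidates, histories):
    """Step 602 sketch: look up each candidate's history and group by final classification."""
    final_by_history = {h["history_id"]: h["final_classification"] for h in histories}
    groups = defaultdict(list)
    for cand in candidates:
        final = final_by_history.get(cand["history_id"])
        if final is not None:  # skip candidates whose history row is missing
            groups[final].append(cand["candidate_id"])
    return dict(groups)
```

With the sample rows, candidates 2 and 3 land in the same group even though their paths to classification 12 differ.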

The subsequent processing determines, for each final classification 403, whether new classifications should be added. Step 603 determines whether this processing has been carried out for all final classifications. In step 604, the registration candidate documents are classified. Any known method may be used. For example, a vector can be built from the frequencies of the words appearing in each document, the distance between documents computed with the cosine measure of the vectors, and documents lying within a predetermined distance of one another treated as a cluster.
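
The example method named for step 604 can be sketched as follows. Several choices here are assumptions rather than part of the source: whitespace tokenization (Japanese text would need a morphological analyzer), distance defined as one minus cosine similarity, and clusters formed as connected groups of documents linked by a distance at or below the threshold.

```python
import math
from collections import Counter


def term_vector(text):
    """Word-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two frequency vectors (1.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def cluster(texts, max_distance):
    """Group document indices whose pairwise distance is <= max_distance (single link)."""
    vectors = [term_vector(t) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine_distance(vectors[i], vectors[j]) <= max_distance:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical candidate documents under one final classification.
TEXTS = [
    "backup mail software settings",
    "restore mail software settings",
    "printer paper jam",
]
```

The first two sample texts share three of four words, giving a distance of 0.25, so a threshold of 0.5 places them in one cluster and leaves the third text on its own.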

In step 605, the properties of the document sets produced in step 604 are examined to decide whether to adopt them as new classifications. Possible criteria include the number of clusters and the minimum number of documents belonging to a cluster. These criteria are defined in advance for the apparatus as a whole. For example, a small number of clusters means a small number of classifications, so there is no need to go to the trouble of creating new classifications.

Likewise, when the number of documents belonging to a cluster is small, adding such clusters as classifications risks increasing the number of classifications more than necessary. If step 605 determines that new classifications should be added, step 606 adds the new classifications and documents to the classification hierarchy 201 and the documents 202 of the document storage unit 101. After the addition, the corresponding documents are deleted from the registration candidate document storage unit 104.
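
Steps 605 and 606 might look like the sketch below. The source names the two criteria but does not say how they are combined, so requiring both (more clusters than a constant, and every cluster at least a minimum size) is an assumption. The function names, the dictionary layout, and the ID scheme that derives 111, 112, ... from 11 are likewise hypothetical.

```python
def should_add_classifications(clusters, cluster_count_threshold, min_cluster_size):
    """Step 605 sketch: accept only if there are enough clusters and none is too small."""
    if not clusters or len(clusters) <= cluster_count_threshold:
        return False
    return min(len(c) for c in clusters) >= min_cluster_size


def add_classifications(final_id, clusters, hierarchy, documents, candidates):
    """Step 606 sketch: add one new classification per cluster under final_id.

    hierarchy:  dict of classification ID -> list of lower classification IDs
    documents:  dict of classification ID -> list of document contents
    candidates: dict of candidate ID -> content; moved entries are deleted
    """
    for n, members in enumerate(clusters, start=1):
        new_id = f"{final_id}{n}"  # hypothetical ID scheme, e.g. 11 -> 111
        hierarchy.setdefault(final_id, []).append(new_id)
        documents[new_id] = [candidates.pop(cid) for cid in members]
```

Removing each candidate with pop as it is moved mirrors the deletion from the registration candidate document storage unit after the addition.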

FIG. 7 shows the processing flow of the registration candidate document classification unit 105 in the second embodiment. It differs from FIG. 6 of the first embodiment in steps 705 and 706. In step 705, rather than creating a new classification under the final classification 403, it is determined whether a new document should be registered. For example, in FIG. 9, the conceptual diagram of classification addition in the first embodiment, classifications 111, 112, and 113 are new classifications added by the apparatus of the first embodiment.

In contrast, in FIG. 10, the conceptual diagram of document addition in the second embodiment, the representative documents of classifications 111, 112, and 113 are registered as documents 11d, 11e, and 11f, respectively. For groups that do not meet the criterion for adding a new classification in step 605 of FIG. 6, a second criterion on the number of clusters and the number of documents per cluster is defined in advance, and groups that satisfy it are registered as representative documents in step 705 of FIG. 7.

A known technique may be used to select the representative document, for example choosing the document closest to the centroid of the cluster. It is also conceivable to provide editing means with which a person can compose a new document by hand, taking the contents of neighboring documents into account.
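
Choosing the document closest to the cluster centroid can be sketched as below. As assumptions for illustration, documents are word-frequency vectors built by whitespace tokenization, the centroid is the per-word mean of those vectors, and closeness is measured by cosine similarity; the sample texts are made up.

```python
import math
from collections import Counter


def term_vector(text):
    """Word-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Per-word mean of the frequency vectors in a cluster."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine_similarity(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return 0.0 if norm == 0 else dot / norm


def representative(texts):
    """Index of the document whose vector is most similar to the cluster centroid."""
    vectors = [term_vector(t) for t in texts]
    center = centroid(vectors)
    return max(range(len(texts)), key=lambda i: cosine_similarity(vectors[i], center))


# Hypothetical cluster of registration candidate documents.
TEXTS = [
    "mail software backup",
    "mail software backup restore",
    "mail software restore settings",
]
```

For the sample cluster the second text is selected, since it shares the most vocabulary with the other two.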

FIG. 8 shows the processing flow of the registration candidate document classification unit 105 in the third embodiment. FIG. 11 is a conceptual diagram showing new classifications added by this embodiment. Classification 14 and classification 15 have been added under classification 1. These two classifications were generated from the union of the registration candidate documents of classifications 11, 12, and 13, which are the final classifications. In other words, the new classifications are added alongside classifications 11, 12, and 13 rather than under them.

FIG. 8 differs from FIG. 6 of the first embodiment in steps 802, 803, and 806, and step 809 is newly added. In step 802, the registration candidate documents are grouped not by final classification as in step 602 but by classifications at any level of the hierarchy. For example, in FIG. 11, the registration candidate documents under classification 11 or classification 12 are also treated as registration candidate documents of classification 1, which lies above them. Step 809 sorts the classifications by depth so that, when new classifications are added, the judgment proceeds from the deepest classifications first.

For example, in FIG. 11, classifications 11, 12, and 13 are examined before classification 1, and classifications 131 and 132 before classification 13, to determine whether a new classification can be added beneath them. Step 803 determines whether processing has been completed for all classifications at any level, rather than for all final classifications as in step 603. Unlike step 606, which adds the new classification under a final classification, step 806 adds it under the classification selected in step 803.
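The loop formed by steps 803 and 806 might look like the sketch below. Several points are assumptions made only for illustration and are not taken from the text: documents are represented as word sets, similarity is measured with the Jaccard coefficient, groups are formed by a simple greedy pass, the predetermined constant is interpreted as a group size that must be exceeded, and a document placed at a deeper level is not reconsidered at a shallower one. All function and parameter names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two word sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_groups(docs, sim_threshold):
    """Greedy grouping: a document joins the first group whose first member
    is similar enough; otherwise it starts a new group."""
    groups = []
    for doc_id, words in docs.items():
        for group in groups:
            if jaccard(words, docs[group[0]]) >= sim_threshold:
                group.append(doc_id)
                break
        else:
            groups.append([doc_id])
    return groups


def add_new_classifications(ordered_classes, candidates_by_class, docs,
                            sim_threshold=0.5, min_count=1):
    """Visit classifications deepest first and record (parent, members)
    for every group whose size exceeds min_count."""
    placed = set()
    added = []
    for class_id in ordered_classes:  # step 803: until every classification is processed
        remaining = {d: docs[d] for d in candidates_by_class.get(class_id, [])
                     if d not in placed}
        for group in similar_groups(remaining, sim_threshold):
            if len(group) > min_count:
                added.append((class_id, group))  # step 806: add under this classification
                placed.update(group)
    return added


# Hypothetical candidate documents and their grouping by classification.
docs = {
    "d1": {"printer", "jam"},
    "d2": {"printer", "jam", "tray"},
    "d3": {"network", "timeout"},
    "d4": {"network", "timeout", "vpn"},
}
candidates_by_class = {"11": ["d1", "d2", "d3"], "12": ["d4"],
                       "1": ["d1", "d2", "d3", "d4"]}
result = add_new_classifications(["11", "12", "1"], candidates_by_class, docs)
```

In this example d1 and d2 form a group under classification 11, while d3 and d4, which reached different final classifications, only form a group once they are pooled at classification 1. This mirrors the situation in FIG. 11, where a new classification is added under the parent rather than under any single final classification.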

The configuration and operation of the embodiments described above are merely examples, and it goes without saying that they can be modified as appropriate without departing from the spirit of the present invention.

The present invention can be applied to information retrieval apparatuses that efficiently search large numbers of documents, and to programs for implementing such an information retrieval apparatus on a computer.

FIG. 1 is a block diagram showing a configuration example of an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the document storage unit 101.
FIG. 3 is a diagram showing a configuration example of the interface unit 102.
FIG. 4 is a diagram showing a configuration example of the history storage unit 103.
FIG. 5 is a diagram showing a configuration example of the registration candidate document storage unit 104.
FIG. 6 shows the processing flow of the registered document candidate classification unit 105 in the first embodiment.
FIG. 7 shows the processing flow of the registered document candidate classification unit 105 in the second embodiment.
FIG. 8 shows the processing flow of the registered document candidate classification unit 105 in the third embodiment.
FIG. 9 is a conceptual diagram of the first embodiment.
FIG. 10 is a conceptual diagram of the second embodiment.
FIG. 11 is a conceptual diagram of the third embodiment.

Explanation of symbols

101 Document storage unit
102 Interface unit
103 History storage unit
104 Registered document candidate storage unit
105 Registered document candidate classification unit
201 Classification hierarchy
202 Document
203 Document ID
204 Upper classification ID
205 Classification name
206 Document ID
207 Classification ID
208 Document name
209 Document content
301 Screen transition of interface
302 Screen control unit
303 History acquisition unit
304 Screen showing a classification and its lower classifications
305 Screen showing a classification and its corresponding documents
306 Screen showing document content
307 New document acquisition unit
402 History ID
403 Final classification
404 History
405 Arrival time
502 Registration candidate document ID
503 History ID
504 Document content

Claims

A document management apparatus for managing a hierarchically classified document group, comprising:
document storage means for storing a classification hierarchy and a document group associated with the classification hierarchy;
interface means for allowing a user to interactively search for a document along the classification hierarchy;
history storage means for storing, as a history, the final classification reached by the user when the user searches for a document along the classification hierarchy using the interface means;
registration candidate document storage means for storing, as a registration candidate document in association with the history, a document that the user wishes to newly register when the user cannot find the intended document by the search; and
registration candidate document classification means for determining, from the history, which of the registration candidate documents are under the same final classification, and for setting a group of those documents having a high degree of similarity as a new classification.

The document management apparatus according to claim 1, wherein the registration candidate document classification means classifies the registration candidate documents for each final classification reached by the user to generate new classifications, and adds the new classifications under the final classification only when the number of the new classifications exceeds a predetermined constant value.

The document management apparatus according to claim 1, wherein the registration candidate document classification means classifies the registration candidate documents for each final classification reached by the user to generate new classifications, and, when the number of the new classifications does not exceed a predetermined constant value, adds the registration candidate document that best represents the new classification as a new document.

The document management apparatus according to claim 1, wherein the registration candidate document classification means classifies the registration candidate documents for each classification higher than the final classification reached by the user to generate new classifications, and adds the new classifications under the higher classification only when the number of the new classifications exceeds a predetermined constant value.

A document management method in a document management apparatus for managing a hierarchically classified document group, comprising:
a document storage step of storing a classification hierarchy and a document group associated with the classification hierarchy;
an interface step of allowing a user to interactively search for a document along the classification hierarchy;
a history storage step of storing, as a history, the final classification reached by the user when the user searches for a document along the classification hierarchy in the interface step;
a registration candidate document storage step of storing, as a registration candidate document in association with the history, a document that the user wishes to newly register when the user cannot find the intended document by the search; and
a registration candidate document classification step of determining, from the history, which of the registration candidate documents are under the same final classification, and setting a group of those documents having a high degree of similarity as a new classification.

The document management method according to claim 5, wherein, in the registration candidate document classification step, the registration candidate documents are classified for each final classification reached by the user to generate new classifications, and the new classifications are added under the final classification only when the number of the new classifications exceeds a predetermined constant value.

The document management method according to claim 5, wherein, in the registration candidate document classification step, the registration candidate documents are classified for each final classification reached by the user to generate new classifications, and, when the number of the new classifications does not exceed a predetermined constant value, the registration candidate document that best represents the new classification is added as a new document.

The document management method according to claim 5, wherein, in the registration candidate document classification step, the registration candidate documents are classified for each classification higher than the final classification reached by the user to generate new classifications, and the new classifications are added under the higher classification only when the number of the new classifications exceeds a predetermined constant value.

A document management program for controlling a document management apparatus that manages a hierarchically classified document group, the program causing the document management apparatus to execute:
a document storage step of storing a classification hierarchy and a document group associated with the classification hierarchy;
an interface step of allowing a user to interactively search for a document along the classification hierarchy;
a history storage step of storing, as a history, the final classification reached by the user when the user searches for a document along the classification hierarchy in the interface step;
a registration candidate document storage step of storing, as a registration candidate document in association with the history, a document that the user wishes to newly register when the user cannot find the intended document by the search; and
a registration candidate document classification step of determining, from the history, which of the registration candidate documents are under the same final classification, and setting a group of those documents having a high degree of similarity as a new classification.

The document management program according to claim 9, wherein, in the registration candidate document classification step, the registration candidate documents are classified for each final classification reached by the user to generate new classifications, and the new classifications are added under the final classification only when the number of the new classifications exceeds a predetermined constant value.

The document management program according to claim 9, wherein, in the registration candidate document classification step, the registration candidate documents are classified for each final classification reached by the user to generate new classifications, and, when the number of the new classifications does not exceed a predetermined constant value, the registration candidate document that best represents the new classification is added as a new document.

The document management program according to claim 9, wherein, in the registration candidate document classification step, the registration candidate documents are classified for each classification higher than the final classification reached by the user to generate new classifications, and the new classifications are added under the higher classification only when the number of the new classifications exceeds a predetermined constant value.