JP2005141476A

JP2005141476A - Document management device, program and recording medium

Info

Publication number: JP2005141476A
Application number: JP2003376942A
Authority: JP
Inventors: Atsuyuki Goto; 淳之後藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-11-06
Filing date: 2003-11-06
Publication date: 2005-06-02

Abstract

PROBLEM TO BE SOLVED: To provide a document management device that efficiently registers a large quantity of documents belonging to a plurality of categories, without requiring documents on a plurality of related topics to be distinguished beforehand from documents containing a plurality of unrelated topics, and that displays only the parts belonging to the requested category when displaying a retrieval result.

SOLUTION: When assigning categories to a document and registering it in a database, the management device segments the document to determine whether the plurality of categories results from a collection of unrelated topics or from a collection of related topics. If the document forms a single segment even though a plurality of categories has been assigned to it, the device determines that its content covers a plurality of related topics and assigns the plurality of categories to the document. If the document is divided into two or more segments, the device determines that it contains a plurality of unrelated topics and, instead of assigning the categories to the document as a whole, re-assigns categories to each divided segment.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a document management apparatus, a program, and a recording medium, and more specifically to a technique for classifying document data into a plurality of categories.

In recent years, large volumes of electronic documents have come to be exchanged over computer networks such as the Internet. As a result, the amount of information a person acquires becomes so large that it is difficult to extract the characteristics of each individual item. A technique for classifying and organizing the acquired information is therefore needed.

Typical methods for automatically classifying document data fall into two groups. In the first, the categories are known in advance, as in the decimal classification of books, and each new document is assigned to the category judged most appropriate. In the second, the categories are unknown, and similar documents are gathered from the document set to create classification categories, to which the documents are then assigned. These techniques make it possible to classify and organize large numbers of documents.

When the categories are known, as in the former approach, the categories, sample documents belonging to each category, and characteristic words are given in advance. The importance of each word is calculated from the sample documents, and a feature vector is generated for each category from pairs of a word and its importance to that category. A feature vector is generated in the same way for each document to be classified, from the importance of each word to that document.

Next, a distance between the feature vector of a category and the feature vector of a document is defined, and that value is used to assign each document to a similar category. If the distance is very large, that is, if the document is judged not to resemble any category, the document is not assigned to any category.
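The distance-based assignment just described can be sketched as follows. This is a minimal illustration, not the method of any cited document: the source only says that "a distance" is defined, so the Euclidean distance, the `max_distance` cutoff, and the sample vectors are all assumptions.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as {word: importance} dicts."""
    words = set(u) | set(v)
    return math.sqrt(sum((u.get(w, 0.0) - v.get(w, 0.0)) ** 2 for w in words))

def assign_category(doc_vec, category_vecs, max_distance):
    """Return the nearest category, or None when every category is too far away."""
    best_name, best_dist = None, float("inf")
    for name, cat_vec in category_vecs.items():
        d = euclidean(doc_vec, cat_vec)
        if d < best_dist:
            best_name, best_dist = name, d
    # A document that resembles no category is left unassigned.
    return best_name if best_dist <= max_distance else None

# Hypothetical category vectors and documents, for illustration only.
categories = {
    "sports": {"match": 1.0, "team": 0.8},
    "economy": {"market": 1.0, "stock": 0.9},
}
near = assign_category({"match": 0.9, "team": 0.7}, categories, max_distance=0.5)
far = assign_category({"recipe": 1.0}, categories, max_distance=0.5)
print(near, far)
```

The cutoff is what implements the "not assigned to any category" branch; without it every document would be forced into its nearest category.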

In the technique of Patent Document 1, which uses this method, the document data, the categories for classification, and the features of those categories are input; feature vectors are calculated from the input categories and features and stored; and a classification criterion is input and stored. Using the input categories, their features, and the classification criterion, intermediate categories are newly created, and similar input categories are assigned to the same intermediate category. The input document data is then classified using the intermediate categories obtained in this way together with the input categories.
This avoids the problem of conventional methods, in which the calculation time grows in proportion to the total number of categories and the number of words that appear, so that large volumes of document data and other information can be classified into a plurality of categories efficiently and in a short time.

A document that covers related topics, for example a document about “sports and its economic ripple effect”, is classified into both the “sports” and “economy” categories by the classification methods described above. A newspaper page on which various topics are mixed, for example a document that combines an article about “sports” and an article about “economy”, ends up classified into the same categories.

As a technique for classifying the latter kind of document, such as a newspaper page on which various topics are mixed, Patent Document 2 takes as input the result of semantically structuring a document, to which class information (a concept of context) and segment information (divided and hierarchized on the basis of the class) have been added. For each attribute in the set of attributes that makes up the structuring result, it attaches the unique ID of the document, the class information, and the segment information resulting from the class-based division, and stores them in a database. This enables a flexible attribute-designated search that is not tied to a fixed viewpoint and that follows the context of each individual searcher.

Patent Document 3 describes a technique for dividing a document in which a plurality of articles are mixed into segments by meaning. That technique divides an electronic document into paragraphs, calculates the degree of association between paragraphs from keywords extracted from them, and creates a square matrix whose dimensions are the paragraphs. The degrees of association are entered into the components on one side of the diagonal of this matrix. Within that one-sided region, the sum of the degrees of association inside the triangular region bounded by an arbitrary row (or column), an arbitrary column (or row), and the diagonal is obtained, and the document is divided on the basis of that sum.
Patent Document 1: JP 2000-112971 A
Patent Document 2: JP 2001-195426 A
Patent Document 3: JP 2000-235574 A

As described above, when a document on a plurality of related topics and a document containing a plurality of unrelated topics are classified by a general method, the same categories are assigned to both, and it may become impossible to tell which kind of document is which.
In the example above, a document about “sports and its economic ripple effect” is assigned the two categories “sports” and “economy”. A document that combines an article about “sports” and an article about “economy” is likewise assigned the two categories “sports” and “economy”.

This problem can be solved by separating a document that combines a plurality of articles into its articles (segments) before categories are assigned, and then assigning a category to each article as in Patent Document 2.
However, when a large number of documents are registered in a document database, checking the type of each document one by one, dividing it into segments, and assigning a category to each segment before registration takes so much labor and cost that it is impractical.

The present invention has been made in view of the above circumstances. Its object is to provide a document management apparatus, a program, and a recording medium that can efficiently register in a database a large number of documents belonging to a plurality of categories, without distinguishing documents on a plurality of related topics from documents containing a plurality of unrelated topics, and that, when displaying search results, display only the relevant segment when the search matches the category of a divided segment.

To solve the above problem, the invention of claim 1 is a document management apparatus having a classification unit that classifies document content into predetermined categories and a registration unit that registers the document content in a document database in association with the categories assigned by the classification unit, characterized in that: the apparatus has a document segmentation unit that divides the document content into segments each consisting of an individual topic; when the classification unit classifies the document content into a plurality of categories, the document segmentation unit divides the document content into segments; when the number of segments is one, the plurality of categories assigned by the classification unit are assigned to the document content; and when the number of segments is two or more, the classification unit reclassifies each segment and assigns a category to each segment, after which the result is registered in the document database.

The invention of claim 2 is the document management apparatus according to claim 1, characterized in that the registration unit: when one category is assigned to the document content, sets a single-category flag and that category in a document attribute record in association with the document; when a plurality of categories are assigned and the document content is not divided into a plurality of segments, sets a multiple-category flag and the plurality of categories in a document attribute record in association with the document; when the document content is divided into a plurality of segments, sets a multiple-category flag in a document attribute record in association with the document, and sets the category assigned to each segment and the position of that segment in the document in a segment attribute record in association with the document; and registers these document attribute records and segment attribute records in the document database.

The invention of claim 3 is the document management apparatus according to claim 2, characterized in that it has a search unit that searches the document database for a designated category and displays the retrieved document content, and in that, when a category matching the designated category exists in the segment attribute table, the search unit displays only the document content of the matching segment.

The invention of claim 4 is a program for causing a computer to execute the functions of the document management apparatus according to claim 1, 2, or 3.
The invention of claim 5 is a computer-readable recording medium on which the program according to claim 4 is recorded.

According to the present invention, when categories are assigned to a document and the document is registered in a database, the document is divided into segments by topic and each segment is categorized. A large number of documents belonging to a plurality of categories can therefore be registered efficiently, without distinguishing documents on a plurality of related topics from documents containing a plurality of unrelated topics.
In addition, when a search result is displayed and the document has been divided into segments, only the part belonging to the requested category is displayed, so that documents composed of a plurality of topics can be displayed and managed accurately.

Preferred embodiments of the document management apparatus according to the present invention are described below with reference to the drawings.
In this embodiment, the classifier is described as being built with a support vector machine (SVM), but classification may instead be performed by other methods, such as the Bayes method or discriminant analysis using the Fisher discriminant.

An SVM is a classification method that sorts objects represented as multidimensional vectors into two classes by dividing the feature space into two subspaces with a hyperplane. The discriminant function is given by equation (1), and equation (2) determines whether an object belongs to the class.

g(x) = w · x + b    (1)
f(x) = sgn(g(x))    (2)

Here, “·” denotes the inner product of the vector w and the object x, w is the weight vector that defines the normal of the hyperplane, and b is a scalar representing a threshold.
The function sgn(y) takes the value +1 when its scalar argument y is positive and the value −1 when y is 0 or less.

For document classification, the discriminant function above is applied as follows.
The n categories C = (C_1, C_2, …, C_n) and the d feature words T = (t_1, t_2, …, t_d) used for categorization are determined in advance. Each element v_i of the feature vector V, which represents the features of the document data, is obtained from quantities such as the frequency with which the corresponding feature word t_i appears in the document and the number of documents in which t_i appears. For category C_i, the element w_ij of the weight vector is calculated by combining the probability that a document belongs to C_i when the feature word t_j appears in it with the probability that a document belongs to C_i when t_j does not appear in it.
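As one possible reading of how the elements v_i could be computed, the sketch below multiplies the term frequency by a smoothed inverse document frequency. The source names only the two inputs (frequency in the document, number of documents containing the word), so this particular weighting, the function name `feature_vector`, and the sample counts are assumptions.

```python
import math

def feature_vector(doc_tokens, feature_words, doc_freq, n_docs):
    """Build V = (v_1, ..., v_d), one element per feature word t_i.

    doc_freq maps a feature word to the number of documents containing it.
    The tf * idf weighting is an assumed choice, not taken from the source.
    """
    vec = []
    for t in feature_words:
        tf = doc_tokens.count(t)  # frequency of t_i in this document
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(t, 0))) + 1.0
        vec.append(tf * idf)
    return vec

# Hypothetical feature words and corpus statistics.
T = ["match", "market"]
V = feature_vector(["match", "team", "match"], T, {"match": 9, "market": 4}, n_docs=9)
print(V)
```

A word that occurs in every document gets the smallest weight per occurrence, which keeps very common words from dominating the vector.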

Next, a plurality of training documents belonging to a category C_i are prepared and their feature vectors V are calculated. The weight vector W_i and the threshold b_i are then calculated so that f(V) is +1 for these documents and −1 for training documents belonging to other categories. This is done for every category, so that a weight vector W_i and a threshold b_i are obtained for each.
In what follows, the classifier for which f(V) is +1 for category C_i is written f_i(V).

Accordingly, to classify an input document into the n categories, the feature vector X of the input document is calculated and the n classifiers
f_1(X), f_2(X), …, f_n(X)
are evaluated.
If f_i(X) > 0, the document is judged to belong to category C_i; if f_i(X) ≤ 0, it is judged not to belong to C_i. Repeating this judgment n times, once for the classifier of each category, determines the categories of the given document.
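The n-classifier decision can be sketched directly from equations (1) and (2). The weight vectors and thresholds below are hand-picked, hypothetical values standing in for trained parameters; the function names are illustrative.

```python
def g(w, x, b):
    """Equation (1): g(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sgn(y):
    """+1 when y is positive, -1 when y is 0 or less."""
    return 1 if y > 0 else -1

def classify(x, params):
    """Evaluate f_i(x) = sgn(g_i(x)) for every category and keep those that give +1."""
    return [name for name, (w, b) in params.items() if sgn(g(w, x, b)) == 1]

# Hypothetical (W_i, b_i) pairs, one per category; real values would come from training.
params = {
    "politics": ([1.0, 0.0, 0.0], -0.5),
    "economy": ([0.0, 1.0, 0.0], -0.5),
    "sports": ([0.0, 0.0, 1.0], -0.5),
}
both = classify([2.0, 1.0, 0.0], params)
none = classify([0.0, 0.0, 0.0], params)
print(both, none)
```

Because each classifier is evaluated independently, a document can receive zero, one, or several categories, which is the situation the rest of the description builds on.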

Next, a document management apparatus that classifies and searches documents using the classifiers described above is explained.
FIG. 1 is a block diagram showing the functional configuration of the document management apparatus of the present invention. The apparatus consists of a document database 10, a classification unit 20, a document segmentation unit 30, a registration unit 40, a search unit 50, and a learning unit 60.

The document DB 10 consists of classification parameters, a document attribute table, a segment attribute table, and a plurality of documents, and is managed by a relational database.
Each document has at least columns for a document ID (identifier), a document name and bibliographic items (author name, creation date, publisher, and so on), and the document content.
The classification parameters consist of a weight vector W_i and a threshold b_i for each category C_i, corresponding to the feature words T.

The document attribute table has at least columns for the document ID (identifier), document name, category name, and multi-category flag.
The multi-category flag has the value 0 when the document belongs to a single category and the value 1 when it belongs to a plurality of categories.
For example, as shown in FIG. 2, documents Doc1 and Doc5 are classified into the single category “Economy”, Doc2 into the single category “Politics”, and Doc4 into the single category “Sports”.

Documents Doc3 and Doc6 belong to a plurality of categories. Of these, Doc6 is a document whose content relates to a plurality of categories (“Politics” and “Economy”). In this case, the multi-category flag is set to 1, and one record per category is listed in the same form as a single-category record.

Doc3, by contrast, is a document that bundles together documents on separate topics (“Politics” and “Economy”). In this case, the multi-category flag is set to 1, the document is divided into a segment for each category, and the category assigned to each segment is recorded in the segment attribute table.

The segment attribute table has at least columns for the document ID (identifier), category name, segment start position, and segment end position (see FIG. 2).
That is, the substring running from the segment start position to the segment end position within the character string of the document identified by the document ID is treated as a segment, and the category name assigned to that segment is recorded.

The classification unit 20 reads the classification parameters of the document DB 10 (the feature word set T and the weight vector W and threshold b for each category) into memory and classifies the actual data of the document to be classified using the classifiers expressed by equations (1) and (2).
A feature vector is calculated from the frequency with which each feature word appears in the document to be classified and from the proportion of documents in which that feature word appears, and the n classifiers are applied to it.
If m of the classifiers f_i yield a positive value, m categories are assigned to the document.

If m = 1, the category identified is taken as the category of the document. If m > 1, the actual data of the document to be classified is passed to the document segmentation unit 30, and classification processing continues there.
The document segmentation unit 30 segments the document it receives using a known technique (see, for example, JP 2000-235574 A) and determines whether the document consists of a plurality of topics. Each segment is represented by its start position and end position in the document.

If segmentation divides the document into a plurality of segments, the classification unit 20 is used to assign a category to each segment. In this case, it is judged that a plurality of documents that should originally have been processed separately were concatenated and input as one.
If segmentation does not divide the document into a plurality of segments, the document is taken to consist of a plurality of related topics (a plurality of categories).

After classification as described above, data is generated for each of the following three cases.

Case 1: single segment, single category.
Case 2: multiple segments, multiple categories.
Case 3: single segment, multiple categories.

The data consists of the following items:

(document name, category name, multi-category flag, segment start position, segment end position)

For example, in Case 1 the following 5-tuple is generated.
(document name, category name, 0, 0, 0)

In Case 2, the first row indicates that the document identified by the document name consists of a plurality of categories, and the rows from the second onward are repeated once per segment.
(document name, 0, 1, 0, 0)
(document name, category name 1, 0, start position of segment 1, end position of segment 1)
(document name, category name 2, 0, start position of segment 2, end position of segment 2)
...

In Case 3, the first row indicates that the document identified by the document name consists of a plurality of categories, and the rows from the second onward are repeated once per category.
(document name, 0, 1, 0, 0)
(document name, category name 1, 0, 0, 0)
(document name, category name 2, 0, 0, 0)
(document name, category name 3, 0, 0, 0)
...
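The three record layouts can be generated by one small helper. This is a sketch under assumptions: the function name `build_records`, its argument shapes, and the sample offsets are hypothetical, while the tuple layout follows the 5-tuples listed for Cases 1 to 3.

```python
def build_records(doc_name, categories, segments=()):
    """Return 5-tuples (doc name, category, multi-category flag, seg start, seg end).

    categories: categories assigned to the whole document.
    segments:   (category, start, end) triples; two or more means the document was split.
    """
    if len(segments) >= 2:
        # Case 2: header row, then one row per segment.
        header = [(doc_name, 0, 1, 0, 0)]
        return header + [(doc_name, cat, 0, start, end) for cat, start, end in segments]
    if len(categories) == 1:
        # Case 1: a single row for the single category.
        return [(doc_name, categories[0], 0, 0, 0)]
    # Case 3: header row, then one row per category, with no segment positions.
    header = [(doc_name, 0, 1, 0, 0)]
    return header + [(doc_name, cat, 0, 0, 0) for cat in categories]

case1 = build_records("Doc1", ["Economy"])
case2 = build_records("Doc3", ["Politics", "Economy"],
                      [("Politics", 0, 120), ("Economy", 120, 300)])
case3 = build_records("Doc6", ["Politics", "Economy"])
for rows in (case1, case2, case3):
    print(rows)
```

In Cases 2 and 3 the header row is the only one carrying the flag value 1; the detail rows below it use 0, as in the listings above.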

FIG. 3 is a flowchart showing the processing procedure of the classification unit 20 and the document segmentation unit 30.
The classification parameters of the document DB 10 are read, and the feature vector X of the document to be classified is calculated over the feature words (step S10).
The weight vector W and threshold b of each category and the feature vector X of the document are applied to each classifier f_i(X), i = 1 to n, and the value of each f_i(X) is calculated (step S11).
If the number of categories for which f_i(X) is positive is 1 (NO in step S12), the document is treated as single-category, that category is assigned to it, the Case 1 5-tuple (document name, category name, 0, 0, 0) is generated (step S19), and classification of this document ends.

If, on the other hand, the number of categories is 2 or more (YES in step S12), the document is segmented by a known technique (for example, JP 2000-235574 A). Each segment is represented by its start position and end position in the document (step S13).
If the number of segments is 1 (NO in step S14), the document is treated as single-segment, multiple-category, and the Case 3 5-tuples
(document name, 0, 1, 0, 0)
(document name, category name 1, 0, 0, 0)
(document name, category name 2, 0, 0, 0)
(document name, category name 3, 0, 0, 0)
...
are generated (step S19), and classification of this document ends.

If the number of segments is 2 or more (YES in step S14), the 5-tuple for the first row of Case 2, (document name, 0, 1, 0, 0), is generated, and the feature vector S is calculated for one of the segments (step S15).

The weight vector W and threshold b of each category and the feature vector S of the segment are applied to each classifier f_i(S), i = 1 to n, and the value of each f_i(S) is calculated (step S16).
Each category for which f_i(S) is positive is assigned as a category of this segment, and the Case 2 5-tuple (document name, category name, 0, start position of the segment, end position of the segment) is generated (step S17).
Categories are assigned to the remaining segments in the same way, and once every segment has been handled, classification of this document ends (steps S15 to S18).
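The branching of FIG. 3 can be summarized in a short control-flow sketch. The classifier and the segmenter are passed in as callables because the source relies on an SVM and on the method of JP 2000-235574 A for them; the keyword classifier and blank-line segmenter below are deliberately trivial, hypothetical stand-ins, and the returned dictionary shape is an assumption.

```python
def process_document(text, classify, segment):
    """classify(text) -> list of categories; segment(text) -> list of (start, end)."""
    categories = classify(text)                      # S10, S11
    if len(categories) < 2:                          # S12: NO
        # The source covers exactly one category here; zero is handled alike by assumption.
        return {"case": 1, "categories": categories, "segments": []}
    spans = segment(text)                            # S13
    if len(spans) < 2:                               # S14: NO
        return {"case": 3, "categories": categories, "segments": []}
    segments = []
    for start, end in spans:                         # S15 to S18, once per segment
        segments.append((start, end, classify(text[start:end])))
    return {"case": 2, "categories": [], "segments": segments}

def keyword_classify(text):
    """Stand-in classifier: a category applies when its keyword occurs in the text."""
    rules = {"politics": "election", "economy": "market"}
    return [cat for cat, word in rules.items() if word in text]

def blank_line_segment(text):
    """Stand-in segmenter: every blank line starts a new segment."""
    spans, pos = [], 0
    for part in text.split("\n\n"):
        spans.append((pos, pos + len(part)))
        pos += len(part) + 2
    return spans

bundled = process_document("The election result.\n\nThe market rallied.",
                           keyword_classify, blank_line_segment)
related = process_document("The election moved the market.",
                           keyword_classify, blank_line_segment)
single = process_document("The market rallied.", keyword_classify, blank_line_segment)
print(bundled["case"], related["case"], single["case"])
```

Segmentation runs only for documents that already received two or more categories, so single-category documents never pay its cost.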

The registration unit 40 receives the document data to be classified, assigns it an identifier (document ID) that identifies the document data, and registers it in the document attribute table and the segment attribute table of the document DB 10, for example with SQL statements such as the following.

insert into document_attribute_table
values (doc_id, doc_name, category_name, multi_category_flag);
insert into segment_attribute_table
values (doc_id, category_name, seg_start, seg_end);
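A registration step of this kind might look as follows when run against an in-memory SQLite database. The English table names, the column types, and the helper names `register_document` and `register_segment` are assumptions made for the sketch; the column order follows the insert statements above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table document_attribute_table (
    doc_id integer, doc_name text, category_name text, multi_category_flag integer);
create table segment_attribute_table (
    doc_id integer, category_name text, seg_start integer, seg_end integer);
""")

def register_document(conn, doc_id, doc_name, category_name, multi_category_flag):
    """Insert one row into the document attribute table (parameterized, not string-built)."""
    conn.execute("insert into document_attribute_table values (?, ?, ?, ?)",
                 (doc_id, doc_name, category_name, multi_category_flag))

def register_segment(conn, doc_id, category_name, seg_start, seg_end):
    """Insert one row into the segment attribute table."""
    conn.execute("insert into segment_attribute_table values (?, ?, ?, ?)",
                 (doc_id, category_name, seg_start, seg_end))

# Hypothetical document split into two segments; the offsets are invented.
register_document(conn, 3, "Doc3", None, 1)
register_segment(conn, 3, "Politics", 0, 120)
register_segment(conn, 3, "Economy", 120, 300)
conn.commit()

n_docs = conn.execute("select count(*) from document_attribute_table").fetchone()[0]
n_segs = conn.execute("select count(*) from segment_attribute_table").fetchone()[0]
print(n_docs, n_segs)
```

Placeholders (`?`) keep document names and category names from being interpreted as SQL.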

The search unit 50 searches the document DB 10 in response to a search request from the user, for example an SQL query, and displays the document content of the search results on a display device or the like.

(1) Searching for documents by category name:
For example, the following SQL statement can be used.
select doc_name from document_attribute_table where category_name = 'Politics'

(2) Searching for documents that belong to a plurality of categories, together with those categories:
For example, the following SQL statement can be used.
select doc_name, category_name from document_attribute_table
where multi_category_flag = 1

(3) Searching for documents that belong to a plurality of categories, together with their constituent segments and the categories of those segments:
For example, the following SQL statement can be used.
select t1.doc_name, t2.category_name, t2.seg_start,
t2.seg_end
from document_attribute_table t1, segment_attribute_table t2
where t1.multi_category_flag = 1 and t1.doc_id = t2.doc_id

(4) When searching for documents that consist of a single segment and belong to multiple categories:
For example, the following SQL statement can be used.
select document_name, category_name from document_attribute_table
where document_attribute_table.multi_category_flag = 1 and not exists
(select * from segment_attribute_table
where document_attribute_table.document_id = segment_attribute_table.document_id)
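The four searches can be exercised against a small in-memory SQLite database, as in the sketch below. The table layout, the English column names, and the sample rows are assumptions made for illustration. The subquery of search (4) is written against the segment attribute table, since that is the table its condition refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table document_attribute_table (
    document_id integer, document_name text,
    category_name text, multi_category_flag integer);
create table segment_attribute_table (
    document_id integer, category_name text,
    segment_start_position integer, segment_end_position integer);

-- Document 1: one category. Document 2: two categories, one segment.
-- Document 3: split into two segments (no document-level category; assumption).
insert into document_attribute_table values (1, 'budget.txt', 'Politics', 0);
insert into document_attribute_table values (2, 'trade.txt', 'Politics', 1);
insert into document_attribute_table values (2, 'trade.txt', 'Economy', 1);
insert into document_attribute_table values (3, 'digest.txt', null, 1);
insert into segment_attribute_table values (3, 'Politics', 0, 120);
insert into segment_attribute_table values (3, 'Sports', 120, 300);
""")

QUERIES = {
    # (1) documents by category name
    "by_category": """
        select document_name from document_attribute_table
        where category_name = 'Politics'""",
    # (2) multi-category documents and their categories
    "multi_category": """
        select document_name, category_name from document_attribute_table
        where multi_category_flag = 1""",
    # (3) multi-category documents with their segments and segment categories
    "multi_category_segments": """
        select t1.document_name, t2.category_name,
               t2.segment_start_position, t2.segment_end_position
        from document_attribute_table t1, segment_attribute_table t2
        where t1.multi_category_flag = 1 and t1.document_id = t2.document_id""",
    # (4) multi-category documents that have no segment rows
    "single_segment_multi_category": """
        select document_name, category_name from document_attribute_table
        where document_attribute_table.multi_category_flag = 1 and not exists
          (select * from segment_attribute_table
           where document_attribute_table.document_id
                 = segment_attribute_table.document_id)""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
for name, rows in results.items():
    print(name, rows)
```

With this sample data, search (4) returns only the rows of `trade.txt`, because `digest.txt` has matching rows in the segment attribute table and is therefore excluded by `not exists`.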

When a retrieved document is matched as a whole, its content is displayed in the conventional manner.
However, when a retrieved document has been divided into segments, the portions outside the matching segment are masked and only the segment to which the specified category was assigned is displayed. As a result, the searcher does not have to look at unintended data.

For example, when a GUI system is available, everything other than the matching segment is masked in gray so that it cannot be seen, as shown in FIG. 4.
When no GUI system is available, the characters in segments other than the matching segment are replaced with symbols such as periods or middle dots, as shown in FIG. 5.
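The text-only masking for the case without a GUI can be sketched as follows. The function name `mask_outside_segment` is hypothetical, the segment end position is assumed to be exclusive, and leaving whitespace unmasked (so that line structure stays readable) is a choice made for this example rather than something stated above.

```python
def mask_outside_segment(text: str, start: int, end: int, mask_char: str = ".") -> str:
    """Replace every non-whitespace character outside [start, end) with mask_char."""
    masked = []
    for i, ch in enumerate(text):
        if start <= i < end or ch.isspace():
            masked.append(ch)         # keep the matching segment and whitespace
        else:
            masked.append(mask_char)  # hide everything else
    return "".join(masked)


print(mask_outside_segment("abc def ghi", 4, 7))
```

Passing `mask_char="・"` would give the middle-dot variant mentioned in the text.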

As described above, the learning unit 60 uses document data already classified by category as training document data to generate the classification parameters (a weight vector and a threshold for each category) of the n classifiers defined in Expressions (1) and (2), and stores them as the classification parameters in the document DB 10.

With the document management apparatus configured as described above, when a category is assigned to a document and the document is registered in the database, the document is divided into segments by topic and each segment is categorized. A large number of documents belonging to a plurality of categories can therefore be registered efficiently, without distinguishing between documents whose multiple topics are related and documents that contain multiple unrelated topics.
In addition, when a search result has been divided into segments, only the part belonging to the specified category is displayed, so documents composed of a plurality of topics can be displayed and managed accurately.

Furthermore, the present invention is not limited to the embodiment described above. Needless to say, the object of the present invention is also achieved by implementing each function of the document management apparatus of the above embodiment as a program, writing the programs in advance to a recording medium such as a CD-ROM, loading the medium into a medium drive such as a CD-ROM drive of a computer, and installing and executing the programs.
In this case, the program read from the recording medium itself realizes the embodiment described above, and the program and the recording medium on which it is recorded also constitute the present invention.

The recording medium may be a semiconductor medium (for example, a ROM or a nonvolatile memory card), an optical medium (for example, a DVD, MO, MD, or CD-R), or a magnetic medium (for example, a magnetic tape or a flexible disk).
Alternatively, a program stored in a storage device may be supplied directly from a server computer via a communication network such as the Internet. In this case, the storage device of the server computer is also included in the recording medium of the present invention.

The present invention covers not only the case where the embodiment described above is realized by executing the loaded program, but also the case where an operating system or the like performs part or all of the actual processing based on the instructions of the program and the embodiment is realized by that processing.

Accordingly, by distributing a program that executes the functions of the embodiment described above, or a recording medium on which the program is recorded, installing the program in an internal or external storage device of a computer, and executing the installed program, the functions of the embodiment are realized, which improves cost, portability, and versatility.

FIG. 1 is a block diagram showing the functional configuration of the document management apparatus of the present invention.
FIG. 2 is a diagram showing an example of the data structure of the document attribute table and the segment attribute table of the document database.
FIG. 3 is a flowchart showing the procedure for categorizing a document to be classified.
FIG. 4 is a diagram showing an example of a display screen for document content divided into segments when a GUI system is provided.
FIG. 5 is a diagram showing an example of a display screen for document content divided into segments when a GUI system is not provided.

Explanation of symbols

10: document database (DB); 20: classification unit; 30: document segmentation unit; 40: registration unit; 50: search unit; 60: learning unit.

Claims

A document management apparatus having a classification unit that classifies document content into predetermined categories and a registration unit that associates the categories assigned by the classification unit with the document content and registers them in a document database, the apparatus comprising a document segmentation unit that divides document content into segments by topic, wherein: when the classification unit has assigned a plurality of categories to the document content, the document segmentation unit divides the document content into segments; when the number of segments is one, the plurality of categories assigned by the classification unit are assigned to the document content; and when the number of segments is two or more, the classification unit reclassifies each segment, and the category assigned to each segment is registered in the document database.

2. The document management apparatus according to claim 1, wherein: when a single category is assigned to the document content, the registration unit sets a single-category flag and the category in a document attribute record in association with the document; when a plurality of categories are assigned and the document content is not divided into a plurality of segments, the registration unit sets a multiple-category flag and the plurality of categories in the document attribute record in association with the document; when the document content is divided into a plurality of segments, the registration unit sets a multiple-category flag in the document attribute record in association with the document, and sets the category assigned to each segment and the position of the segment in the document in a segment attribute record in association with the document; and the document attribute record and the segment attribute record are registered in the document database.

The document management apparatus according to claim 2, further comprising a search unit that searches the document database by a specified category and displays the retrieved document content, wherein, when a category matching the specified category exists in the segment attribute table, the search unit displays only the document content of the matching segment.

A program for causing a computer to execute the function of the document management apparatus according to claim 1, 2 or 3.

A computer-readable recording medium on which the program according to claim 4 is recorded.